-
Notifications
You must be signed in to change notification settings - Fork 886
JAVA-3118: Add support for vector data type in Schema Builder, QueryBuilder #1931
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
i can't get this to compile
am i doing something wrong ? |
Is there a separate ticket for vector similarity functions ? |
@@ -146,6 +146,8 @@ default Select orderBy(@NonNull String columnName, @NonNull ClusteringOrder orde | |||
return orderBy(CqlIdentifier.fromCql(columnName), order); | |||
} | |||
|
|||
@NonNull | |||
Select orderBy(@NonNull Ann ann); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would consider adding here the direction (ASC, DESC) parameter. Currently we do not support DESC vector ordering, but this may be available in future and CQL syntax allows it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @SiyaoIsHiding! This is basically what I was expecting to see with this change. We can have a conversation about the comments about the API but otherwise there's just a few things to clean up here.
core/src/main/java/com/datastax/oss/driver/internal/core/type/DefaultVectorType.java
Outdated
Show resolved
Hide resolved
|
||
public static Ann annOf(@NonNull String cqlIdentifier, @NonNull CqlVector<Number> vector) { | ||
return new DefaultAnn(CqlIdentifier.fromCql(cqlIdentifier), vector); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These will need to be updated when the PR for JAVA-3143 is merged; the CqlVector constraint won't apply once that's in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested out on C* 5.0.1 and it says ANN only supports float.
cqlsh:default_keyspace> insert INTO vt (key, v ) VALUES ( 1, ['a','b']) ;
cqlsh:default_keyspace> select * from vt order by v ann of ['a', 'c'];
InvalidRequest: Error from server: code=2200 [Invalid query] message="ANN ordering is only supported on float vector indexes"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch! Yes, you're correct; Apache Cassandra 5.0.x supports vectors of any subtype as a type but the ANN index used there only supports floats.
@@ -146,6 +146,8 @@ default Select orderBy(@NonNull String columnName, @NonNull ClusteringOrder orde | |||
return orderBy(CqlIdentifier.fromCql(columnName), order); | |||
} | |||
|
|||
@NonNull | |||
Select orderBy(@NonNull Ann ann); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm missing something but it seems more natural to support something like the following:
Select orderByAnnOf(CqlIdentifier columnId, CqlVector ann);
Select orderByAnnOf(String columnName, CqlVector ann);
Advantage is that with this approach you don't even need to introduce an Ann type... which kinda seems right as that type isn't really doing much for you here.
You could also perhaps add a notion of type checking the specified column to make sure it's a vector type (and to make sure it matches the type of the input CqlVector).
To the point made by @lukasz-antoniak above we could add directionality here (and throw warnings if the user tries to use a DESC order before there's server-side support for it) but I'm not sure it's worth it. There's no mention of ordering in the relevant Cassandra docs so my intuition says to just leave it out for now and add it when it becomes more of a thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also agree to save DESC ordering for later because:
- I find it weird to add a feature that does not work yet and we cannot even test
- Neither Bret nor I found relevant doc saying they may support DESC of vector search. @lukasz-antoniak did you find any? I find it hard to imagine in what cases we want to find the approximate farthest neighbor...?
- If we want to add DESC later, we just need to add another function overload
Select orderByAnnOf(String columnName, CqlVector ann, ClusteringOrder order);
. I assume this is not hard.
query-builder/src/test/java/com/datastax/oss/driver/api/querybuilder/schema/AlterTableTest.java
Outdated
Show resolved
Hide resolved
...uilder/src/test/java/com/datastax/oss/driver/api/querybuilder/delete/DeleteSelectorTest.java
Outdated
Show resolved
Hide resolved
...builder/src/test/java/com/datastax/oss/driver/api/querybuilder/insert/RegularInsertTest.java
Outdated
Show resolved
Hide resolved
query-builder/src/test/java/com/datastax/oss/driver/api/querybuilder/schema/AlterTypeTest.java
Outdated
Show resolved
Hide resolved
...y-builder/src/test/java/com/datastax/oss/driver/api/querybuilder/schema/CreateTableTest.java
Outdated
Show resolved
Hide resolved
query-builder/src/test/java/com/datastax/oss/driver/api/querybuilder/schema/CreateTypeTest.java
Outdated
Show resolved
Hide resolved
One other thing worth mentioning: the Cassandra impl also supports a way to get "the similarity calculation of the best scoring node closest to the query data as part of the results". Take a look at the similarity_dot_product() function (and the other choices as well) in the relevant Cassandra docs. The query builder should have support for those as well. |
The revapi thing is fixed and the vector similarity function is already supported by the existing Lines 235 to 274 in 19148d5
In terms of the spring-ai downstream, as we won't actually break any API, is there anything we should test or how? |
Good call on the similarity_* functions @SiyaoIsHiding! |
|
||
/** Adds the ORDER BY ... ANN OF ... clause */ | ||
@NonNull | ||
Select orderByAnnOf(@NonNull CqlIdentifier columnId, @NonNull CqlVector<? extends Number> ann); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm changing my answer on this: we should remove the type bound here and just make this a CqlVector<?>
. My rationale goes as follows.
The fact that ANN comparisons only support float vectors actually isn't a constraint on the underlying Java type used here. In theory any Java type whose codec will generate a serialized value that can be understood as a float when the server receives the message would work here. That means we could pass in a CqlVector of floats, decimals or doubles and still have the server handle that without issue. I'll also note that there's no common supertype for Float, BigDecimal or Double (the corresponding Java types based on the docs) other than Number... and we can't use that here because it includes types other than these three. So there's no meaningful supertype we can use for a type bound at this point.
A second point: the query builder is built around the notion of generating CQL based on whatever the user passes in; there's almost no checking whether types correlate to things the user specified. So, for example, OngoingValues doesn't have any kind of type bounds around it's various value() methods. I accept that this isn't exactly the same as the float constraint we're discussing here... but it is an indication that the query builder largely doesn't concern itself with type checking with the expectation that the server will handle that when it receives the query.
Finally, I'll note that avoiding the type bound here at least exposes the idea (at an API level) of using external/custom types which serialize to CQL float values by way of their own codecs. Scala users, for example, might want to support using Spire numerics for their CQL queries. This actually won't work for now since we constrain the codec registry used to the (immutable) default which doesn't include any support for Spire types... but that's an implementation detail. By avoiding type bounds at an API level we've left that door open for such a change without having to make a (potentially breaking) API change and without sacrificing our design principles elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting back to this conversation because the current implementation will result in a compilation error when we extend vector support to arbitrary subtype.
public Select orderByAnnOf(@NonNull CqlIdentifier columnId, @NonNull CqlVector<?> ann) {
return withAnn(new Ann(columnId, ann));
}
The argument we get is a CqlVector<?>
, but the Ann
constructor wants a CqlVector<? extends Number>
. Turns out compilation error.
We either:
- change Ann constructor to
CqlVector<?>
, or - change
orderByAnnOf
API toCqlVector<? extends Number>
.
I think option 2 makes more sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@absurdfarce bumping this discussion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SiyaoIsHiding and I discussed in a meeting today. Both approaches have arguments for them but in the end we decided to go with CqlVector<?> to get this moving forward and merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're almost there @SiyaoIsHiding! Let's sort out the type bounds on Select.orderByAnnOf()
and then I think this guy is (finally) ready to be merged!
I assume @lukasz-antoniak comment here about adding |
I'm not 💯 sure what the accessors on DefaultSelect are intended to do but I suppose it does make sense to keep it consistent and add an accessor for the Ann object. Either way I'm satisfied with where this stands now... 👍 from me! |
Raised by another user in separate communication: we should add some updates to the documentation reflecting these new features. Specifically we should update relevant areas of the query_builder section of the manual. |
@SiyaoIsHiding I'm okay with the idea behind these changes but we do need to fix a few things before this is merge-ready. Specifically:
Once these fixes are in (and we get a second +1) I think we're okay to move forward. |
Testing this with spring-projects/spring-ai#1817 I'm getting from this code:
this failure:
|
@michaelsembwever Can you see the debug log from this line? logger.debug("Executing {}", createTable.asCql()); I wonder what assertThat(SchemaBuilder.createTable("ks", "mytb")
.ifNotExists()
.withPartitionKey("k", DataTypes.INT)
.withClusteringColumn("v", DataTypes.TEXT)
.withColumn("embedding", DataTypes.vectorOf(DataTypes.FLOAT, 3)))
.hasCql("CREATE TABLE IF NOT EXISTS ks.mytb (k int,v text,embedding vector<float, 3>,PRIMARY KEY(k,v))"); And this will pass. I ran the cql statement in cqlsh, it turns out no problem, either. |
103b859
to
d0d3461
Compare
Solved merge conflict and updated doc. I just realized our CI broke. Can any one of you committers click the button to run tests for this PR please? Here: https://ci-cassandra.apache.org/blue/organizations/jenkins/cassandra-java-driver/detail/PR-1931/8/pipeline |
done: https://ci-cassandra.apache.org/job/cassandra-java-driver/job/PR-1931/9/ |
Why am I getting |
Are you sure you are using my PR version? That full class name looks like an outdated 4.x branch install before this commit: e843786 |
And it seems like our CI isn't configured with credentials. It met GitHub quota limit. |
Thank you for testing the downstream too |
is there outstanding items or more work to do here ? |
@michaelsembwever Short version is that this is basically done; we've essentially been waiting on a second +1 to move this one forward. Only wrinkle there is that @SiyaoIsHiding did change a couple things recently in relation to work she's been doing on the PR for JAVA-3143... but those changes should be relatively small. |
No outstanding item here, this PR is pending for reviews. |
DS Jenkins looks good. There was one test failure which looks to be a test issue (ccm timeout on startup) so I'm calling this good. @SiyaoIsHiding can you squash this down to a single commit so we can get this in? |
…uilder patch by Jane He; reviewed by Mick Semb Wever and Bret McGuire for JAVA-3118 reference: apache#1931
fa75d69
to
399582f
Compare
Thanks! Squashed. @absurdfarce |
* CASSANDRA-19635: Run integration tests with C* 5.x patch by Lukasz Antoniak; reviewed by Andy Tolbert, and Bret McGuire for CASSANDRA-19635 * CASSANDRA-19635: Configure Jenkins to run integration tests with C* 5.x patch by Lukasz Antoniak; reviewed by Bret McGuire for CASSANDRA-19635 * update badge URL to org.apache.cassandra/java-driver-core * Limit calls to Conversions.resolveExecutionProfile Those repeated calls account for a non-negligible portion of my application CPU (0.6%) and can definitly be a final field so that it gets resolved only once per CqlRequestHandler. patch by Benoit Tellier; reviewed by Andy Tolbert, and Bret McGuire reference: apache#1623 * autolink JIRA tickets in commit messages patch by Stefan Miklosovic; reviewed by Michael Semb Wever for CASSANDRA-19854 * Don't return empty routing key when partition key is unbound DefaultBoundStatement#getRoutingKey has logic to infer the routing key when no one has explicitly called setRoutingKey or otherwise set the routing key on the statement. It however doesn't check for cases where nothing has been bound yet on the statement. This causes more problems if the user decides to get a BoundStatementBuilder from the PreparedStatement, set some fields on it, and then copy it by constructing new BoundStatementBuilder objects with the BoundStatement as a parameter, since the empty ByteBuffer gets copied to all bound statements, resulting in all requests being targeted to the same Cassandra node in a token-aware load balancing policy. patch by Ammar Khaku; reviewed by Andy Tolbert, and Bret McGuire reference: apache#1620 * JAVA-3167: CompletableFutures.allSuccessful() may return never completed future patch by Lukasz Antoniak; reviewed by Andy Tolbert, and Bret McGuire for JAVA-3167 * ninja-fix Various test fixes * Run integration tests with DSE 6.9.0 patch by Lukasz Antoniak; reviewed by Bret McGuire reference: apache#1955 * JAVA-3117: Call CcmCustomRule#after if CcmCustomRule#before fails to allow subsequent tests to run patch by Henry Hughes; reviewed by Alexandre Dutra and Andy Tolbert for JAVA-3117 * JAVA-3149: Support request cancellation in request throttler patch by Lukasz Antoniak; reviewed by Andy Tolbert and Chris Lohfink for JAVA-3149 * Fix C* 3.0 tests failing on Jenkins patch by Lukasz Antoniak; reviewed by Bret McGuire reference: apache#1939 * Reduce lock held duration in ConcurrencyLimitingRequestThrottler It might take some (small) time for callback handling when the throttler request proceeds to submission. Before this change, the throttler proceed request will happen while holding the lock, preventing other tasks from proceeding when there is spare capacity and even preventing tasks from enqueuing until the callback completes. By tracking the expected outcome, we can perform the callback outside of the lock. This means that request registration and submission can proceed even when a long callback is being processed. patch by Jason Koch; Reviewed by Andy Tolbert and Chris Lohfink for CASSANDRA-19922 * Annotate BatchStatement, Statement, SimpleStatement methods with CheckReturnValue Since the driver's default implementation is for BatchStatement and SimpleStatement methods to be immutable, we should annotate those methods with @CheckReturnValue. Statement#setNowInSeconds implementations are immutable so annotate that too. patch by Ammar Khaku; reviewed by Andy Tolbert and Bret McGuire reference: apache#1607 * Remove "beta" support for Java17 from docs patch by Bret McGuire; reviewed by Andy Tolbert and Alexandre Dutra reference: apache#1962 * Fix uncaught exception during graceful channel shutdown after exceeding max orphan ids patch by Christian Aistleitner; reviewed by Andy Tolbert, and Bret McGuire for apache#1938 * Build a public CI for Apache Cassandra Java Driver patch by Siyao (Jane) He; reviewed by Mick Semb Wever for CASSANDRA-19832 * CASSANDRA-19932: Allow to define extensions while creating table patch by Lukasz Antoniak; reviewed by Bret McGuire and Chris Lohfink * Fix DefaultSslEngineFactory missing null check on close patch by Abe Ratnofsky; reviewed by Andy Tolbert and Chris Lohfink for CASSANDRA-20001 * Query builder support for NOT CQL syntax patch by Bret McGuire; reviewed by Bret McGuire and Andy Tolbert for CASSANDRA-19930 * Fix CustomCcmRule to drop `CURRENT` flag no matter what If super.after() throws an Exception `CURRENT` flag is never dropped which leads next tests to fail with IllegalStateException("Attempting to use a Ccm rule while another is in use. This is disallowed") Patch by Dmitry Kropachev; reviewed by Andy Tolbert and Bret McGuire for JAVA-3117 * JAVA-3051: Memory leak patch by Jane He; reviewed by Alexandre Dutra and Bret McGuire for JAVA-3051 * Automate latest Cassandra versions when running CI patch by Siyao (Jane) He; reviewed by Mick Semb Wever for CASSJAVA-25 * Refactor integration tests to support multiple C* distributions. Test with DataStax HCD 1.0.0 patch by Lukasz Antoniak; reviewed by Bret McGuire reference: apache#1958 * Fix TableMetadata.describe() when containing a vector column patch by Stefan Miklosovic; reviewed by Bret McGuire for CASSJAVA-2 * Move Apache Cassandra 5.x off of beta1 and remove some older Apache Cassandra versions. patch by Bret McGuire; reviewed by Bret McGuire for CASSJAVA-54 * Update link to Jira to be CASSJAVA Updating the link to Jira. Previously we had a component in the CASSANDRA Jira project but now we have a project for each driver - in the case of Java, it's CASSJAVA. Added CASSJAVA to .asf.yaml patch by Jeremy Hanna; reviewed by Bret McGuire for CASSJAVA-61 * Move DataStax shaded Guava module into Java driver patch by Lukasz Antoniak; reviewed by Alexandre Dutra and Bret McGuire for CASSJAVA-52 * JAVA-3057 Allow decoding a UDT that has more fields than expected patch by Ammar Khaku; reviewed by Andy Tolbert and Bret McGuire reference: apache#1635 * CASSJAVA-55 Remove setting "Host" header for metadata requests. With some sysprops enabled this will actually be respected which completely borks Astra routing. patch by Bret McGuire; reviewed by Alexandre Dutra and Bret McGuire for CASSJAVA-55 * JAVA-3118: Add support for vector data type in Schema Builder, QueryBuilder patch by Jane He; reviewed by Mick Semb Wever and Bret McGuire for JAVA-3118 reference: apache#1931 * Upgrade Guava to 33.3.1-jre patch by Lukasz Antoniak; reviewed by Alexandre Dutra and Bret McGuire for CASSJAVA-53 * Do not always cleanup Guava shaded module before packaging * Revert "Do not always cleanup Guava shaded module before packaging" This reverts commit 5be52ec. * Conditionally compile shaded Guava module * JAVA-3143: Extend driver vector support to arbitrary subtypes and fix handling of variable length types (OSS C* 5.0) patch by Jane He; reviewed by Bret McGuire and João Reis reference: apache#1952 * JAVA-3168 Copy node info for contact points on initial node refresh only from first match by endpoint patch by Alex Sasnouskikh; reviewed by Andy Tolbert and Alexandre Dura for JAVA-3168 * JAVA-3055: Prevent PreparedStatement cache to be polluted if a request is cancelled. There was a critical issue when the external code cancels a request, indeed the cached CompletableFuture will then always throw a CancellationException. This may happens, for example, when used by reactive like Mono.zip or Mono.firstWithValue. patch by Luc Boutier; reviewed by Alexandre Dutra and Bret McGuire reference: apache#1757 * Expose a decorator for CqlPrepareAsyncProcessor cache rather than the ability to specify an arbitrary cache from scratch. Also bringing tests from apache#2003 forward with a few minor changes due to this implementation patch by Bret McGuire; reviewed by Bret McGuire and Andy Tolbert reference: apache#2008 * ninja-fix Using shaded Guava classes for import in order to make OSGi class paths happy. Major hat tip to Dmitry Konstantinov for the find here! * Changelog updates for 4.19.0 * [maven-release-plugin] prepare release 4.19.0 * [maven-release-plugin] prepare for next development iteration * Specify maven-clean-plugin version Sets the version to 3.4.1 in parent pom. Having it unspecified causes the following warning: ``` [WARNING] Some problems were encountered while building the effective model for com.scylladb:java-driver-guava-shaded:jar:4.19.0.0-SNAPSHOT [WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-clean-plugin is missing. @ line 97, column 15 [WARNING] [WARNING] It is highly recommended to fix these problems because they threaten the stability of your build. [WARNING] [WARNING] For this reason, future Maven versions might no longer support building such malformed projects. [WARNING] ``` * Install guava-shaded before running core's compile in CI Without this module available "Build" and "Unit tests" job fail with "package <x> does not exist" or "cannot find symbol" pointing to `[...].shaded.guava.[...]` packages. * Remove exception catch in `prepared_stmt_metadata_update_loopholes_test` Since incorporating "JAVA-3057 Allow decoding a UDT that has more fields than expected" the asuumptions of removed check are no longer valid. * Switch shaded guava's groupId in osgi-tests Switches from `org.apache.cassandra` to `com.scylladb` in `BundleOptions#commonBundles()`. * Typo: increase line's loglevel in CcmBridge --------- Co-authored-by: Lukasz Antoniak <[email protected]> Co-authored-by: Brad Schoening <[email protected]> Co-authored-by: Benoit Tellier <[email protected]> Co-authored-by: Stefan Miklosovic <[email protected]> Co-authored-by: Ammar Khaku <[email protected]> Co-authored-by: absurdfarce <[email protected]> Co-authored-by: Henry Hughes <[email protected]> Co-authored-by: Jason Koch <[email protected]> Co-authored-by: Christian Aistleitner <[email protected]> Co-authored-by: janehe <[email protected]> Co-authored-by: Abe Ratnofsky <[email protected]> Co-authored-by: absurdfarce <[email protected]> Co-authored-by: Dmitry Kropachev <[email protected]> Co-authored-by: janehe <[email protected]> Co-authored-by: Jeremy Hanna <[email protected]> Co-authored-by: SiyaoIsHiding <[email protected]> Co-authored-by: Alex Sasnouskikh <[email protected]> Co-authored-by: Luc Boutier <[email protected]>
Currently, the
SchemaBuilder
works with vector like this:Or
Please let me know if you want something like
.withColumn("v", DataTypes.vector(DataTypes.FLOAT, 3))
.