-
-
Notifications
You must be signed in to change notification settings - Fork 596
Add support for VECTOR column type #9758
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…upport VECTOR types.
5c996ba to
54ef702
Compare
54ef702 to
2638558
Compare
|
@nicktobey DOLT
|
|
@coffeegoddd DOLT
|
macneale4
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few small changes requested. There is plenty here I am unfamiliar with, but the existing tests pass which are a big plus. If you can provide anything which educates me about index builds, that would be great.
| if lengthStr, ok := params[vectorTypeParam_Length]; ok { | ||
| length, err = strconv.ParseInt(lengthStr, 10, 64) | ||
| if err != nil { | ||
| return nil, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will provide a pretty cryptic error, I assume. This is really a runtime failure that should never happen. I've been adding context with:
fmt.Errorf("runtime error: vector length unparsable: %w", err)
This way we can search for string and find the code that generated the error quickly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
| var length int64 | ||
| var err error | ||
| if lengthStr, ok := params[vectorTypeParam_Length]; ok { | ||
| length, err = strconv.ParseInt(lengthStr, 10, 64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We support 64 bits of resolution here? Isn't 32 bits way beyond reasonable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whole function was unused. I removed it.
| case types.NullKind: | ||
| _ = reader.ReadKind() | ||
| return nil, nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know much about this area of the code, but shouldn't we be doing something like:
case types.NullKind:
_ = reader.SkipValue(f)
return types.NullValue, nil
}f is the NomsBinFormat
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't even know if this is reachable. It looks like it's only used for converting Noms values, which we probably never do anymore? I just copied in the implementation used for binary types, but I wonder if we can just delete all this instead.
| level = 255 - level // we currently store the level as 255 - the actual level for sorting purposes. | ||
| depth := int(maxLevel - level) | ||
|
|
||
| // hashPath is a list of concatenated hashes, representing the sequence of closest vectors at each level of the tree. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment is for a var being removed.
|
@nicktobey DOLT
|
|
@coffeegoddd DOLT
|
macneale4
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!!
|
@nicktobey DOLT
|
This PR implements the changes required for Dolt to support VECTOR column types (implemented in GMS in dolthub/go-mysql-server#3162)
Actually managing storage read/writes was the easy part and pretty much worked out of the box.
The most extensive changes are to the algorithm for creating vector indexes, which previously assumed that every vector was represented as a 20-byte Hash and represented paths through the tree as multiple 20-byte hashes concatenated to each other. Since VECTOR columns are stored using adaptive encoding, they may be a hash but may also be a variable-length byte buffer. Thus, the index builder needed to be smarter and represent these paths as a proper tuple. Doing it this way also allowed me to clean up the index building code in a way that is hopefully both more readable and eliminates some unnecessary memory copies. I added some clarifying comments, but it could potentially benefit from even more comments.
The other big change was to the test suites. The vector index tests had some hardcoded assumptions about the representation of vectors that needed to be fixed, so I used this as an opportunity to clean that up to.