Skip to content

Conversation

@nicktobey
Copy link
Contributor

@nicktobey nicktobey commented Sep 2, 2025

This PR implements the changes required for Dolt to support VECTOR column types (implemented in GMS in dolthub/go-mysql-server#3162)

Actually managing storage read/writes was the easy part and pretty much worked out of the box.

The most extensive changes are to the algorithm for creating vector indexes, which previously assumed that every vector was represented as a 20-byte Hash and represented paths through the tree as multiple 20-byte hashes concatenated to each other. Since VECTOR columns are stored using adaptive encoding, they may be a hash but may also be a variable-length byte buffer. Thus, the index builder needed to be smarter and represent these paths as a proper tuple. Doing it this way also allowed me to clean up the index building code in a way that is hopefully both more readable and eliminates some unnecessary memory copies. I added some clarifying comments, but it could potentially benefit from even more comments.

The other big change was to the test suites. The vector index tests had some hardcoded assumptions about the representation of vectors that needed to be fixed, so I used this as an opportunity to clean that up to.

@nicktobey nicktobey force-pushed the nicktobey/vector3 branch 3 times, most recently from 5c996ba to 54ef702 Compare September 2, 2025 10:00
@coffeegoddd
Copy link
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
2638558 ok 5937471
version total_tests
2638558 5937471
correctness_percentage
100.0

@coffeegoddd
Copy link
Contributor

@coffeegoddd DOLT

comparing_percentages
100.000000 to 100.000000
version result total
0a84064 ok 5937471
version total_tests
0a84064 5937471
correctness_percentage
100.0

@macneale4 macneale4 self-requested a review September 2, 2025 19:56
Copy link
Contributor

@macneale4 macneale4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small changes requested. There is plenty here I am unfamiliar with, but the existing tests pass which are a big plus. If you can provide anything which educates me about index builds, that would be great.

if lengthStr, ok := params[vectorTypeParam_Length]; ok {
length, err = strconv.ParseInt(lengthStr, 10, 64)
if err != nil {
return nil, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will provide a pretty cryptic error, I assume. This is really a runtime failure that should never happen. I've been adding context with:

fmt.Errorf("runtime error: vector length unparsable: %w", err)

This way we can search for string and find the code that generated the error quickly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

var length int64
var err error
if lengthStr, ok := params[vectorTypeParam_Length]; ok {
length, err = strconv.ParseInt(lengthStr, 10, 64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We support 64 bits of resolution here? Isn't 32 bits way beyond reasonable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whole function was unused. I removed it.

Comment on lines +86 to +89
case types.NullKind:
_ = reader.ReadKind()
return nil, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know much about this area of the code, but shouldn't we be doing something like:

	case types.NullKind:
		_ = reader.SkipValue(f)
		return types.NullValue, nil
	}

f is the NomsBinFormat

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't even know if this is reachable. It looks like it's only used for converting Noms values, which we probably never do anymore? I just copied in the implementation used for binary types, but I wonder if we can just delete all this instead.

level = 255 - level // we currently store the level as 255 - the actual level for sorting purposes.
depth := int(maxLevel - level)

// hashPath is a list of concatenated hashes, representing the sequence of closest vectors at each level of the tree.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment is for a var being removed.

@coffeegoddd
Copy link
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
6f09282 ok 5937471
version total_tests
6f09282 5937471
correctness_percentage
100.0

@coffeegoddd
Copy link
Contributor

@coffeegoddd DOLT

comparing_percentages
100.000000 to 100.000000
version result total
1aa6047 ok 5937471
version total_tests
1aa6047 5937471
correctness_percentage
100.0

Copy link
Contributor

@macneale4 macneale4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!!

@coffeegoddd
Copy link
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
35d8800 ok 5937471
version total_tests
35d8800 5937471
correctness_percentage
100.0

@nicktobey nicktobey merged commit 5c2a2d9 into main Sep 3, 2025
23 of 24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants