Skip to content

Conversation

@dnnspaul
Copy link

Hey @philippgille, thanks for this great package!

I've had a hard time finding out that there is no possibility to kinda merge 2 existing collections. I appreciate your focus on staying as a simple package (as I've read in other Issues and Pull requests), so I avoided to extend the Import.. functions with enableMerge attributes - but instead implemented the simplest approach I could come up with: getting all existing documents of a collection. This way the end-user (or -developer?) is at least able to fetch all documents and import them into another collection on their own way.

I'm interested in your feedback and leave behind some happy greetings from Hamburg

Copy link
Owner

@philippgille philippgille left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Dennis 👋, Thanks for contributing!

I think the method is useful and makes sense 👍.

But there's a DB.ListCollections(), so for consistency I'd prefer to name the new method Collection.ListDocuments().

And can you please move it between the Collection.GetByID() and Collection.Delete()?

Thanks!

@dnnspaul
Copy link
Author

Hi @philippgille, your wished changes make absolute sense and are applied now. ✌️

Copy link
Owner

@philippgille philippgille left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello, sorry for the long delay!

First of all, thanks for implementing the requested changes! 🙇‍♂️

I had another more thorough look and found some more things that can be improved.

Due to my delayed review I'd understand if improving the PR doesn't fit your schedule anymore, so just let me know if you prefer me to make those changes on my own.

Comment on lines +321 to +322
// The returned documents are a copy of the original documents, so they can be safely
// modified without affecting the collection.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The slice is new and can be modified without affecting the internal c.documents, but the documents themselves are not full copies. When you do x := y, then y's simple fields (int, string) are entirely new, but maps and slices are only_shallow_ copies.

Demonstration: https://go.dev/play/p/0OccI4ibtS2

See the above GetByID where the Metadata and Embedding fields are cloned separately to create an entirely new document.

So here we have two options:

  • Change the Godoc to clarify that the documents are shallow copies, and only the slice is new. This still allows the receiver to work with the slice, like iterating over it and reading the documents, without concurrency issues during regular operations. For example chromem-go can still add new documents to its c.Documents map, or delete them, and it doesn't affect the returned slice. Here's an example in chromem-go where something similar is done:

    chromem-go/db.go

    Lines 517 to 522 in 8311eb0

    // The returned map is a copy of the internal map, so it's safe to directly modify
    // the map itself. Direct modifications of the map won't reflect on the DB's map.
    // To do that use the DB's methods like [DB.CreateCollection] and [DB.DeleteCollection].
    // The map is not an entirely deep clone, so the collections themselves are still
    // the original ones. Any methods on the collections like Add() for adding documents
    // will be reflected on the DB's collections and are concurrency-safe.
  • Or create a deep copy of documents. This can either be done by calling the GetByID for each document, or by copying the code from that method. The former leads to less code, but one extra operation per document (the c.Documents lookup).

ids := []string{"1", "2", "3", "4"}
metadatas := []map[string]string{{"foo": "bar"}, {"a": "b"}, {"foo": "bar"}, {"e": "f"}}
contents := []string{"hello world", "hallo welt", "bonjour le monde", "hola mundo"}
c.Add(context.Background(), ids, nil, metadatas, contents)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the returned error should be checked

Comment on lines +507 to +511
for _, doc := range docs {
if doc.Content == "hello world" {
break
}
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the test doesn't assert whether the content was found or not. You can introduce a new variable found := false before the loop, set it found = true just before the break, and after the loop assert that its value is true.

@philippgille
Copy link
Owner

There was a later PR from another contributor, which I think supersedes this: #118

Can you check if that enables you to do the merge of collections?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants