@danielaskdd (Collaborator) commented Oct 25, 2025

Fix: Resolved chunk storage inconsistency caused by implicit node creation during document indexing and by incomplete cleanup on deletion

Problem

This PR addresses two critical data consistency bugs related to storage synchronization:

  1. Implicit Node Creation Inconsistency:
    • Problem: When processing relationships in _merge_edges_then_upsert and _rebuild_single_relationship, if source or target nodes did not exist, "implicit nodes" were created. These nodes were correctly added to the main knowledge_graph_inst but were not propagated to the entity_vdb (vector DB) or the entity_chunks_storage (chunk tracking).
    • Impact: This led to data inconsistency across the three storage layers. The entity_chunks_storage could have fewer nodes than the entity_vdb, and tracking information for these implicit nodes was missing, leading to potential query failures or incomplete results.
  2. Incomplete Data Cleanup on Document Deletion:
    • Problem: When deleting documents via adelete_by_doc_id(), the system properly removed entities and relationships from the main graph storage and vector databases. However, it failed to clean up the corresponding entries in the chunk tracking storages (entity_chunks and relation_chunks).
    • Impact: This resulted in orphaned chunk tracking data accumulating over time, creating an inconsistent state between the graph/vector layers and the chunk tracking layer. This could also lead to incorrect chunk reference counts and potential memory leaks in long-running systems.

Solution

This PR implements the following changes to ensure data consistency across all storage layers:

1. Synchronized Implicit Node Creation:

  • Updated _merge_edges_then_upsert to accept the entity_chunks_storage parameter.
  • Updated _rebuild_single_relationship to accept both entities_vdb and entity_chunks_storage parameters.
  • Core Fix: When implicit or missing nodes are created in these functions, they are now written to all three storage layers simultaneously:
    • Knowledge graph storage (existing behavior)
    • Entity chunks storage (new behavior)
    • Entity vector database (new behavior, synchronized with chunk storage)
  • Updated the call sites (rebuild_knowledge_from_chunks and merge_nodes_and_edges) to pass the new storage parameters down.
  • Added node identifier sorting in both functions for consistent processing order.
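
The synchronized creation described above can be sketched roughly as follows. This is a minimal illustration, not the actual LightRAG code: `DictStore`, `ensure_edge_endpoints`, and the record fields are hypothetical stand-ins, and the real storage classes expose a richer API. It shows the two ideas from the list: endpoints are sorted before processing, and a missing node is written to every storage layer that the caller passed in.

```python
import asyncio
from typing import Optional


class DictStore:
    """Toy in-memory storage (hypothetical stand-in, not LightRAG's storage API)."""

    def __init__(self) -> None:
        self.data: dict[str, dict] = {}

    async def upsert(self, records: dict[str, dict]) -> None:
        self.data.update(records)


async def ensure_edge_endpoints(
    src: str,
    tgt: str,
    chunk_id: str,
    graph: DictStore,
    entity_vdb: Optional[DictStore] = None,
    entity_chunks: Optional[DictStore] = None,
) -> list[str]:
    """Create missing endpoint nodes in every storage layer that was supplied."""
    created = []
    # Sort the endpoints so they are always processed in the same order.
    for name in sorted([src, tgt]):
        if name in graph.data:
            continue
        # Existing behavior: the implicit node goes into the graph.
        await graph.upsert({name: {"entity_id": name, "source_id": chunk_id}})
        # New behavior: the same node is mirrored into the other two layers.
        # Both are optional (None by default) so older call sites keep working.
        if entity_chunks is not None:
            await entity_chunks.upsert({name: {"chunk_ids": [chunk_id]}})
        if entity_vdb is not None:
            await entity_vdb.upsert({name: {"entity_name": name, "content": name}})
        created.append(name)
    return created


async def demo() -> tuple[list[str], DictStore, DictStore, DictStore]:
    graph, vdb, chunks = DictStore(), DictStore(), DictStore()
    # "Alice" already exists in the graph; "Bob" is only referenced by the edge.
    await graph.upsert({"Alice": {"entity_id": "Alice", "source_id": "chunk-0"}})
    created = await ensure_edge_endpoints("Bob", "Alice", "chunk-1", graph, vdb, chunks)
    return created, graph, vdb, chunks


created, graph, vdb, chunks = asyncio.run(demo())
```

In this sketch only "Bob" is created, and afterwards it is present in all three stores, which is the invariant the fix is meant to restore.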

2. Implemented Proper Deletion Cleanup:

  • Added logic to the deletion pipeline to properly clean up the chunk tracking storage.
  • Entity Cleanup: When entities are removed, their corresponding entries in entity_chunks are now also deleted.
  • Relationship Cleanup: When relationships are removed, their corresponding entries in relation_chunks are now also deleted using properly formatted storage keys.
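
A rough sketch of the cleanup step, under stated assumptions: `DictStore`, `relation_key`, `cleanup_chunk_tracking`, and the `<SEP>` separator are illustrative names chosen for this example and are not taken from the PR. The point it illustrates is that the relation key must be built the same way at write time and at delete time, otherwise the delete silently misses the entry.

```python
import asyncio

SEP = "<SEP>"  # assumed key separator, for illustration only


class DictStore:
    """Toy in-memory storage (hypothetical stand-in for the chunk tracking stores)."""

    def __init__(self, data: dict[str, dict]) -> None:
        self.data = dict(data)

    async def delete(self, keys: list[str]) -> None:
        for key in keys:
            self.data.pop(key, None)  # ignore keys that are already gone


def relation_key(src: str, tgt: str) -> str:
    """Build an order-independent key so (a, b) and (b, a) map to the same entry."""
    return SEP.join(sorted([src, tgt]))


async def cleanup_chunk_tracking(
    removed_entities: list[str],
    removed_relations: list[tuple[str, str]],
    entity_chunks: DictStore,
    relation_chunks: DictStore,
) -> None:
    """Drop chunk tracking entries for entities and relations that were deleted."""
    if removed_entities:
        await entity_chunks.delete(list(removed_entities))
    if removed_relations:
        keys = [relation_key(src, tgt) for src, tgt in removed_relations]
        await relation_chunks.delete(keys)


entity_chunks = DictStore({
    "Alice": {"chunk_ids": ["chunk-0"]},
    "Bob": {"chunk_ids": ["chunk-1"]},
    "Carol": {"chunk_ids": ["chunk-0"]},
})
relation_chunks = DictStore({
    relation_key("Alice", "Bob"): {"chunk_ids": ["chunk-1"]},
    relation_key("Alice", "Carol"): {"chunk_ids": ["chunk-0"]},
})

# Deleting a document removed entity "Bob" and the edge between "Bob" and "Alice".
asyncio.run(
    cleanup_chunk_tracking(["Bob"], [("Bob", "Alice")], entity_chunks, relation_chunks)
)
```

Because the key is sorted, passing the edge as ("Bob", "Alice") still removes the entry stored under the Alice/Bob key.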

Backward Compatibility

✅ Fully backward compatible: the new parameters are optional and default to None, so existing callers need no changes.

…xity

• Replace Cypher with native SQL queries
• Fix O(N²) to O(E) performance issue
• Add error handling for parse failures
• Use direct table access pattern
• Eliminate Cartesian product joins
• Sort src/tgt for consistent ordering
• Create missing nodes before edges
• Update entity chunks storage
• Pass entity_vdb to rebuild function
• Ensure entities exist in all storages
@danielaskdd

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Keep them coming!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

• Delete from entity_chunks storage
• Delete from relation_chunks storage
@danielaskdd

@codex review again

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. 🎉

@danielaskdd danielaskdd merged commit 11f1f36 into HKUDS:main Oct 25, 2025
1 check passed
@danielaskdd danielaskdd deleted the sort-edge branch October 25, 2025 18:10