@danielaskdd danielaskdd commented Oct 11, 2025

Preserve ordering in get_by_ids methods across all storage implementations

🎯 Problem

The get_by_ids function in certain storage implementations returns results in an order that does not match the input IDs list, causing a misalignment between retrieved text blocks and their corresponding IDs. This issue affects the correctness of data returned by the aquery_data function and the /aquery_data API endpoint.

📝 Changes

Modified the get_by_ids implementations in the 9 storage backend files listed below to preserve input order and to return None for each missing ID:

Modified Files:

  • lightrag/kg/deprecated/chroma_impl.py
  • lightrag/kg/json_doc_status_impl.py
  • lightrag/kg/milvus_impl.py
  • lightrag/kg/mongo_impl.py
  • lightrag/kg/nano_vector_db_impl.py
  • lightrag/kg/postgres_impl.py
  • lightrag/kg/qdrant_impl.py
  • lightrag/kg/redis_impl.py
  • lightrag/kg/faiss_impl.py

Implementation Pattern:

All implementations now follow a consistent 3-step pattern:

# 1. Fetch data from storage
results = await storage.find({"_id": {"$in": ids}})

# 2. Build lookup map
result_map: dict[str, dict[str, Any]] = {}
for result in results:
    result_map[str(result["_id"])] = result

# 3. Preserve input order with None for missing IDs
ordered_results: list[dict[str, Any] | None] = []
for id_value in ids:
    ordered_results.append(result_map.get(str(id_value)))

return ordered_results
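
The pattern can be exercised end to end without a real backend. The following is a minimal sketch, assuming a hypothetical FakeStorage class whose find method deliberately returns matches out of order; neither FakeStorage nor the get_by_ids_ordered helper comes from the LightRAG codebase.

```python
from __future__ import annotations

import asyncio
from typing import Any


class FakeStorage:
    """Hypothetical in-memory stand-in for a storage backend (not LightRAG code)."""

    def __init__(self, records: list[dict[str, Any]]) -> None:
        self._records = records

    async def find(self, ids: list[str]) -> list[dict[str, Any]]:
        # Return matches in reverse storage order to mimic a backend
        # that does not honour the order of the requested IDs.
        wanted = {str(i) for i in ids}
        return [r for r in reversed(self._records) if str(r["_id"]) in wanted]


async def get_by_ids_ordered(
    storage: FakeStorage, ids: list[str]
) -> list[dict[str, Any] | None]:
    # 1. Fetch data from storage (order not guaranteed).
    results = await storage.find(ids)
    # 2. Build a lookup map keyed by the stringified ID.
    result_map = {str(r["_id"]): r for r in results}
    # 3. Walk the input IDs in order; dict.get yields None for missing IDs.
    return [result_map.get(str(i)) for i in ids]


def demo() -> list[dict[str, Any] | None]:
    storage = FakeStorage(
        [{"_id": "a", "content": "alpha"}, {"_id": "c", "content": "gamma"}]
    )
    # "b" is not stored, so its slot in the output should be None.
    return asyncio.run(get_by_ids_ordered(storage, ["c", "b", "a"]))
```

Calling demo() requests the IDs "c", "b", "a" and gets back a three-element list in that same order, with None in the middle slot for the missing "b".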

⚠️ Breaking Changes

API Contract Change

Before:

get_by_ids([1, 2, 3]) → [{id:1}, {id:3}]  # Missing ID omitted
len(result) may be < len(ids)

After:

get_by_ids([1, 2, 3]) → [{id:1}, None, {id:3}]  # None for missing IDs
len(result) == len(ids) always
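
The length guarantee matters most when callers pair results with IDs by position. A small sketch with made-up data (the helper name pair_ids is hypothetical) of how the old contract can misalign pairs while the new one cannot:

```python
from __future__ import annotations

from typing import Any


def pair_ids(
    ids: list[str], results: list[dict[str, Any] | None]
) -> list[tuple[str, dict[str, Any] | None]]:
    # Positional pairing: only correct when results[i] belongs to ids[i].
    return list(zip(ids, results))


ids = ["1", "2", "3"]
old_results = [{"id": "1"}, {"id": "3"}]        # old contract: missing ID omitted
new_results = [{"id": "1"}, None, {"id": "3"}]  # new contract: None placeholder

# Old contract: ID "2" is paired with the record for "3", and ID "3" is dropped.
old_pairs = pair_ids(ids, old_results)

# New contract: every ID is paired with its own record or with None.
new_pairs = pair_ids(ids, new_results)
```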

Impact on Consumers

✅ Compatible:

  • Code using index-based iteration: for i, result in enumerate(results)
  • Code checking individual results: if results[i]: process(results[i])

❌ Requires Updates:

  • Code assuming all results are non-None: for r in results: r['field']
  • Code assuming len(results) == len(found_items)
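
Callers in the second group usually need only a None guard. A hedged sketch of one way to adapt such code, assuming records are dicts that may carry a "content" field; the function name collect_contents is hypothetical and not part of LightRAG:

```python
from __future__ import annotations

from typing import Any


def collect_contents(
    ids: list[str], results: list[dict[str, Any] | None]
) -> tuple[dict[str, str], list[str]]:
    """Split an ordered get_by_ids result into found contents and missing IDs."""
    found: dict[str, str] = {}
    missing: list[str] = []
    for id_value, record in zip(ids, results):
        if record is None:
            # New contract: a None slot means this ID was not in storage.
            missing.append(id_value)
            continue
        if "content" in record:
            found[id_value] = record["content"]
    return found, missing
```

Because results now has the same length as ids, the missing IDs can be reported directly instead of being inferred from a length mismatch.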

Existing Code Compatibility

All 4 existing call sites in lightrag/operate.py already have proper None checks:

  1. _get_cached_extraction_results (line 1304): if chunk_data and isinstance(chunk_data, dict)
  2. _get_cached_extraction_results (line 1317): if cache_entry is not None
  3. _find_related_text_unit_from_entities (line 3959): if chunk_data is not None and "content" in chunk_data
  4. _find_related_text_unit_from_relations (line 4173): if chunk_data is not None and "content" in chunk_data

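Conclusion: This change is backward compatible with the existing codebase. The breaking change affects only other callers that assume every result is non-None or that the result length equals the number of found items.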

🎯 Benefits

  1. Predictable Order: Results match input ids order exactly
  2. 1:1 Correspondence: Easy to map results back to requests
  3. Consistent Behavior: All storage backends behave identically
  4. Missing ID Handling: Explicit None values for missing IDs instead of silent omission
  5. Type Safety: The explicit dict[str, Any] | None element type makes possible None values visible to IDEs and type checkers

@danielaskdd danielaskdd merged commit 8239783 into HKUDS:main Oct 11, 2025
1 check passed