@danielaskdd
Collaborator

Add Entity Identifier Length Truncation to Prevent Storage Failures

Problem

LLM extraction occasionally generates extremely long entity names that can cause failures in downstream storage systems:

  • Database limitations: PostgreSQL and other databases limit index key length (about 2704 bytes for a PostgreSQL B-tree index entry with the default 8 kB page size)
  • Vector database performance: Excessively long strings impact embedding storage and retrieval
  • Token efficiency: Long entity identifiers waste valuable context window space
  • Graph operations: Long keys degrade lookup and comparison performance

Without truncation, a single malformed LLM extraction could crash the entire ingestion pipeline or create unqueryable knowledge graphs.

Solution

This PR implements defensive truncation of entity identifiers at the extraction boundary.

Changes

lightrag/constants.py

  • Added DEFAULT_ENTITY_NAME_MAX_LENGTH = 256 constant to define the maximum allowed length

lightrag/operate.py

  • Introduced _truncate_entity_identifier() helper function that:
    • Returns identifiers unchanged if within limit
    • Truncates to 256 characters if exceeded
    • Logs warnings with full context (chunk_key, role, expected/actual lengths)
  • Modified _process_extraction_result() to apply truncation to:
    • Entity names (entity_name)
    • Relationship source entities (src_id)
    • Relationship target entities (tgt_id)

Key Design Decisions

Why 256 characters?

  • Comfortably below most database key limits
  • Sufficient for legitimate entity names across all languages
  • Conservative enough to prevent issues while rarely impacting real data
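A quick back-of-the-envelope check of the first point, assuming identifiers are stored as UTF-8 and the limit counts Unicode code points (as Python's len() does). The 2704-byte figure is the one quoted in the Problem section; the variable names are illustrative.

```python
MAX_CHARS = 256                # limit introduced by this PR
UTF8_MAX_BYTES_PER_CHAR = 4    # UTF-8 encodes any code point in at most 4 bytes
POSTGRES_BTREE_LIMIT = 2704    # figure quoted in the Problem section

# Worst case: every character needs the full 4 bytes.
worst_case_bytes = MAX_CHARS * UTF8_MAX_BYTES_PER_CHAR

# Confirm with an actual 4-byte code point repeated MAX_CHARS times.
sample = "\U0001F600" * MAX_CHARS
encoded_len = len(sample.encode("utf-8"))

print(worst_case_bytes, encoded_len, worst_case_bytes < POSTGRES_BTREE_LIMIT)
```

Even in the worst case the truncated identifier occupies 1024 bytes, which leaves headroom below the quoted index limit. Composite keys or other encodings would need their own check.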

Why truncate at extraction time?

  • Ensures consistency throughout the entire pipeline
  • Prevents cascading failures in graph/vector operations
  • Centralizes the defensive logic at a single boundary

Why simple prefix truncation?

  • Preserves the most semantically important part (beginning) of the identifier
  • Minimal computational overhead
  • Predictable and debuggable behavior

Example Warning Log

chunk-abc123: Entity name exceeded 256 characters (len: %d, preview: 'abcde...')

Backward Compatibility

Fully backward compatible

  • Existing entities are unaffected
  • No database migrations required
  • No API changes

Potential Considerations

  1. Collision risk: If two entities differ only after character 256, they will merge (extremely rare in practice)
  2. Information loss: Full entity names beyond 256 characters are not preserved
  3. Not configurable: The 256-character limit is currently hardcoded

These edge cases are acceptable trade-offs for production stability. Future enhancements could make the limit configurable or add hash-based collision detection if needed.
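To make the collision risk concrete, and to show one shape a hash-based enhancement might take, here is a hypothetical sketch. It is not part of this PR: truncate_with_hash, the "#" separator, and the 8-character digest are all assumptions chosen for illustration.

```python
import hashlib


def truncate_with_hash(identifier: str, limit: int = 256, digest_chars: int = 8) -> str:
    """Hypothetical variant: truncate, but end with a short hash of the full name."""
    if len(identifier) <= limit:
        return identifier
    # Hash the complete identifier so names sharing a long prefix stay distinct.
    digest = hashlib.sha256(identifier.encode("utf-8")).hexdigest()[:digest_chars]
    # Reserve room for the separator and digest so the result is exactly `limit` long.
    keep = limit - digest_chars - 1
    return f"{identifier[:keep]}#{digest}"


# Two names that differ only after character 256.
name_a = "x" * 256 + "A"
name_b = "x" * 256 + "B"

# Plain prefix truncation (the behavior in this PR) maps both to the same key.
prefix_collides = name_a[:256] == name_b[:256]

# The hash-suffixed variant keeps them apart while respecting the length limit.
hashed_a = truncate_with_hash(name_a)
hashed_b = truncate_with_hash(name_b)
```

The trade-off is that a hash suffix makes truncated identifiers less readable and changes the stored key format, which is one reason simple prefix truncation may be preferable as a first step.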

@danielaskdd danielaskdd merged commit cf2174b into HKUDS:main Oct 22, 2025
1 check passed
@danielaskdd danielaskdd deleted the entity-name-len branch October 22, 2025 07:47