Refact: Add Entity Identifier Length Truncation to Prevent Storage Failures #2245

danielaskdd · 2025-10-22T06:05:11Z

Add Entity Identifier Length Truncation to Prevent Storage Failures

Problem

LLM extraction occasionally generates extremely long entity names that can cause failures in downstream storage systems:

- Database limitations: PostgreSQL B-tree index entries are capped at about 2704 bytes, and other databases impose their own index key length limits
- Vector database performance: Excessively long strings impact embedding storage and retrieval
- Token efficiency: Long entity identifiers waste valuable context window space
- Graph operations: Long keys degrade lookup and comparison performance

Without truncation, a single malformed LLM extraction could crash the entire ingestion pipeline or create unqueryable knowledge graphs.

Solution

This PR implements defensive truncation of entity identifiers at the extraction boundary:

Changes

lightrag/constants.py

Added DEFAULT_ENTITY_NAME_MAX_LENGTH = 256 constant to define the maximum allowed length

lightrag/operate.py

Introduced _truncate_entity_identifier() helper function that:
- Returns identifiers unchanged if within limit
- Truncates to 256 characters if exceeded
- Logs warnings with full context (chunk_key, role, expected/actual lengths)
Modified _process_extraction_result() to apply truncation to:
- Entity names (entity_name)
- Relationship source entities (src_id)
- Relationship target entities (tgt_id)

Key Design Decisions

Why 256 characters?

- Comfortably below most database key limits
- Sufficient for legitimate entity names across all languages
- Conservative enough to prevent issues while rarely impacting real data
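A quick worked check of the "comfortably below" claim, assuming identifiers are stored as UTF-8 and the limit counts Python characters (code points). The 2704-byte figure is the one quoted in the Problem section; the helper name utf8_size is hypothetical. Real index entries may also carry other columns, so this only bounds the identifier itself.

```python
DEFAULT_ENTITY_NAME_MAX_LENGTH = 256
POSTGRES_BTREE_LIMIT_BYTES = 2704  # figure quoted in the Problem section

# UTF-8 uses at most 4 bytes per Unicode code point.
worst_case_bytes = DEFAULT_ENTITY_NAME_MAX_LENGTH * 4  # 1024


def utf8_size(identifier: str) -> int:
    """Size of the identifier in bytes once encoded as UTF-8."""
    return len(identifier.encode("utf-8"))


# ASCII (1 byte each), CJK (3 bytes each), emoji (4 bytes each).
samples = ["A" * 256, "中" * 256, "😀" * 256]
for sample in samples:
    assert utf8_size(sample) <= worst_case_bytes < POSTGRES_BTREE_LIMIT_BYTES
```

Even the worst case of 256 four-byte characters occupies 1024 bytes, well under half of the quoted limit.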

Why truncate at extraction time?

- Ensures consistency throughout the entire pipeline
- Prevents cascading failures in graph/vector operations
- Centralizes the defensive logic at a single boundary

Why simple prefix truncation?

- Preserves the most semantically important part (beginning) of the identifier
- Minimal computational overhead
- Predictable and debuggable behavior

Example Warning Log

chunk-abc123: Entity name exceeded 256 characters (len: %d, preview: 'abcde...')

Backward Compatibility

✅ Fully backward compatible

- Existing entities are unaffected
- No database migrations required
- No API changes

Potential Considerations

- Collision risk: If two entities differ only after character 256, they will merge (extremely rare in practice)
- Information loss: Full entity names beyond 256 characters are not preserved
- Not configurable: The 256-character limit is currently hardcoded

These edge cases are acceptable trade-offs for production stability. Future enhancements could make the limit configurable or add hash-based collision detection if needed.
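One possible shape for the hash-based follow-up, sketched here only to illustrate the idea and not part of this PR: instead of detecting collisions after the fact, the truncated identifier could end with a short digest of the full name, so that two names sharing the same first 256 characters still map to different keys. The function name truncate_with_hash_suffix, the "#" separator, and the 8-character digest length are all assumptions.

```python
import hashlib


def truncate_with_hash_suffix(
    identifier: str, limit: int = 256, digest_chars: int = 8
) -> str:
    """Truncate to `limit` characters, ending with a digest of the full identifier."""
    if limit <= digest_chars + 1:
        raise ValueError("limit must leave room for the separator and digest")
    if len(identifier) <= limit:
        return identifier

    # The digest is computed over the full, untruncated identifier.
    digest = hashlib.sha256(identifier.encode("utf-8")).hexdigest()[:digest_chars]
    keep = limit - digest_chars - 1  # one character reserved for the separator
    return f"{identifier[:keep]}#{digest}"


# Two names that differ only after character 256.
name_a = "x" * 256 + "a"
name_b = "x" * 256 + "b"

# Plain prefix truncation would merge them ...
assert name_a[:256] == name_b[:256]
# ... while the digest suffix keeps them apart within the same length budget.
key_a = truncate_with_hash_suffix(name_a)
key_b = truncate_with_hash_suffix(name_b)
```

The trade-off is that the stored key no longer reads as a clean prefix of the original name, which gives up some of the debuggability argued for above.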

danielaskdd added 2 commits October 22, 2025 14:02

Add entity name length truncation with configurable limit

904b1f4

Fix linting

c92ab83

danielaskdd merged commit cf2174b into HKUDS:main Oct 22, 2025
1 check passed

danielaskdd deleted the entity-name-len branch October 22, 2025 07:47
