Replies: 11 comments
-
Hi Rod, which questions should we ask? By dependencies, do you mean only Maven dependencies, or also infrastructure dependencies such as a separate database deployment or Docker for laptop demos? Are there other pain points to consider? Is Neo4j overkill for what the project currently uses it for? Here are a few questions that come to mind:
Q1: "What does 'complicates dependencies' specifically mean?"
Q2: "Do we need to migrate existing Neo4j data?"
Q3: "What level of 'own' are you envisioning?"
Q4: "Is vector similarity search still required?"
Q5: "What chunking strategies do you have in mind?"
Q6: "Should chunking strategies be runtime-pluggable?"
Q7: "What's the expected data scale?"
Q8: "Single-node sufficient or need distribution?"
Q9: "How important is Spring AI VectorStore compatibility?"
Q10: "Text similarity still needed?"
Q11: Do we correctly assume that we want:
But we need to confirm:
Q12: Would you prefer a minimal 500-line brute-force solution that works today, or a 'proper' embedded solution with Lucene/SQLite that scales better? Is the pain point not the Neo4j technology itself, but the operational overhead of an external DB and/or Docker on laptops?
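To make Q5 and Q6 concrete, here is a minimal sketch of what "runtime-pluggable" chunking could look like. Everything in it is an assumption for discussion: the `ChunkingStrategy` interface, the two strategies, and the `ChunkingRegistry` lookup are hypothetical names, not existing project code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical strategy interface: splits a document into chunks. */
interface ChunkingStrategy {
    List<String> chunk(String text);
}

/** Fixed-size character windows with a configurable overlap. */
final class FixedSizeChunking implements ChunkingStrategy {
    private final int size;
    private final int overlap;

    FixedSizeChunking(int size, int overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("need size > 0 and 0 <= overlap < size");
        }
        this.size = size;
        this.overlap = overlap;
    }

    @Override
    public List<String> chunk(String text) {
        List<String> chunks = new ArrayList<>();
        int step = size - overlap; // each window starts this far after the previous one
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + size, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last window reached the end of the text
            }
        }
        return chunks;
    }
}

/** Splits on blank lines and drops empty paragraphs. */
final class ParagraphChunking implements ChunkingStrategy {
    @Override
    public List<String> chunk(String text) {
        List<String> chunks = new ArrayList<>();
        for (String paragraph : text.split("\\R\\s*\\R")) {
            String trimmed = paragraph.trim();
            if (!trimmed.isEmpty()) {
                chunks.add(trimmed);
            }
        }
        return chunks;
    }
}

/** Hypothetical registry: picks a strategy by name at runtime (e.g. from config). */
final class ChunkingRegistry {
    private static final Map<String, ChunkingStrategy> STRATEGIES = Map.of(
            "fixed", new FixedSizeChunking(512, 64),
            "paragraph", new ParagraphChunking());

    static ChunkingStrategy byName(String name) {
        ChunkingStrategy strategy = STRATEGIES.get(name);
        if (strategy == null) {
            throw new IllegalArgumentException("unknown chunking strategy: " + name);
        }
        return strategy;
    }
}
```

If strategies only need to be fixed at build time, the registry can go and the interface alone is enough; the window size of 512 and overlap of 64 are placeholder values.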
-
Hi Rod, here is a slightly different angle: an exploration of the problem space before diving too far into the solution space.
Current State Understanding
We see you're using:
Key Questions
Our instinct, based on your Spring philosophy: start with Option A (works today, zero deps), then evolve to B if/when needed. But we want to confirm.
-
Here is a proposed analysis of Neo4j's capabilities versus what the project actually uses (it helps me understand the requirements). Please let me know how I can improve my understanding.
What Neo4j is actually used for at this time:
Vector Search (2 simple queries):
chunk_vector_search.cypher: Basic vector similarity with threshold
Simple Entity Operations (3 queries):
create_entity.cypher: Create entity with single relationship [:HAS_ENTITY]
Graph Relationships:
Only ONE relationship type: (chunk)-[:HAS_ENTITY]->(entity)
Neo4j features we are not using:
❌ Complex multi-hop graph traversals
What we are using:
✅ Vector similarity search (key feature)
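If the only relationship really is (chunk)-[:HAS_ENTITY]->(entity) and no multi-hop traversal is needed, it could be held in two plain maps. This is only a sketch under that assumption; `EntityLinks` and its method names are hypothetical and not taken from the project.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for the single HAS_ENTITY relationship. */
final class EntityLinks {
    // Forward and reverse indexes so both directions are a single lookup.
    private final Map<String, Set<String>> entitiesByChunk = new HashMap<>();
    private final Map<String, Set<String>> chunksByEntity = new HashMap<>();

    /** Records (chunk)-[:HAS_ENTITY]->(entity); adding the same link twice is a no-op. */
    void link(String chunkId, String entityId) {
        entitiesByChunk.computeIfAbsent(chunkId, k -> new LinkedHashSet<>()).add(entityId);
        chunksByEntity.computeIfAbsent(entityId, k -> new LinkedHashSet<>()).add(chunkId);
    }

    /** Entities attached to a chunk, or an empty set if the chunk is unknown. */
    Set<String> entitiesOf(String chunkId) {
        return Collections.unmodifiableSet(
                entitiesByChunk.getOrDefault(chunkId, Collections.emptySet()));
    }

    /** Chunks that reference an entity, or an empty set if the entity is unknown. */
    Set<String> chunksMentioning(String entityId) {
        return Collections.unmodifiableSet(
                chunksByEntity.getOrDefault(entityId, Collections.emptySet()));
    }
}
```

This covers one-hop lookups only; if multi-hop traversals turn out to be needed later, a real graph store becomes the better fit again.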
-
Neo4j uses cosine similarity for vectors. Since OpenAI embeddings are normalized to unit length (worth verifying for any other embedding provider we use), we can use the dot product, with proper vectorization, for a speedup with identical ranking results. Do we see any reason to preserve exact cosine scores rather than just the ranking order?
Q: If our application truly only needs ranking (top-K results) and doesn't use similarity scores for thresholds, filtering, or display, then dot-product ranking is the clear winner. But if we have any score-dependent logic downstream, the interpretability of cosine similarity might outweigh the performance gains. What's our primary use case: pure ranking, or do we need the actual similarity values?
OK, it seems that we use similarity scores for:
We see similarity scores are used for threshold filtering (WHERE score > 0.8), confidence cutoffs, and optimization loops. While the dot product is faster, it matches cosine similarity only for unit-length vectors; on unnormalized vectors it yields different absolute values, and the ranking can differ too. Options:
Given the extensive threshold usage, do we lean toward Option 1 (keep cosine) for compatibility?
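To illustrate the cosine versus dot product point, here is a small sketch. The `Similarity` class is hypothetical, not project code. It shows that once vectors are normalized to unit length, the dot product equals the cosine score, so existing thresholds such as 0.8 keep their meaning, whereas the raw dot product of unnormalized vectors does not.

```java
/** Hypothetical helper comparing cosine similarity with the plain dot product. */
final class Similarity {

    static double dot(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (double) a[i] * b[i];
        }
        return sum;
    }

    /** Cosine similarity: dot product divided by the product of the two norms. */
    static double cosine(float[] a, float[] b) {
        return dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));
    }

    /** Returns a unit-length copy of v; done once at insert time, not per query. */
    static float[] normalize(float[] v) {
        double norm = Math.sqrt(dot(v, v));
        if (norm == 0.0) {
            throw new IllegalArgumentException("cannot normalize a zero vector");
        }
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = (float) (v[i] / norm);
        }
        return out;
    }
}
```

For example, with a = (3, 4) and b = (4, 3), `cosine(a, b)` is 24/25 = 0.96 and the raw `dot(a, b)` is 24, while `dot(normalize(a), normalize(b))` is again 0.96 up to float rounding. Normalizing once at insert time would therefore let us keep cosine-compatible thresholds and still score with a plain dot product.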
-
Hi Rod, here are some ideas about alternative vector-database-style solutions we might use instead of Neo4j. We've analyzed the current Neo4j usage and think we understand the core issue: Neo4j requires Docker or an external server, which complicates local development and testing. We probably need vector similarity search with threshold filtering, but in a truly embeddable solution.
Current Requirements We See:
Embedded Vector Database Options:
Option A: Pure Kotlin/Java Solution ⭐
Phase 1: In-Memory + File Persistence (Week 1-2)
Phase 2: Memory-Mapped Files (If/when needed)
Phase 3: Distributed (Future)
Option B: H2 Database + Custom Functions
Phase 1: H2 Embedded + Java Functions
Phase 2: Optimize with Indexes
Phase 3: Migrate to Distributed SQL
Option C: Apache Lucene
Phase 1: Lucene Core + Custom Codec
Phase 2: Lucene Vector Module (When released)
Phase 3: Elasticsearch/OpenSearch
Option D: DuckDB Embedded
Phase 1: DuckDB + Vector Extension
Phase 2-3: DuckDB Clustering
Questions:
We can have a working prototype of Option A in 2-3 days that replaces the current Neo4j functionality. Shall we proceed?
Key benefit: any option above means developers can just run mvn test without Docker, achieving your "laptop-friendly" goal.
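As a rough idea of what the in-memory part of Option A, Phase 1 might look like, here is a brute-force sketch: a linear scan with cosine scoring, a score threshold, and top-K selection. `InMemoryVectorStore` and `Hit` are hypothetical names, file persistence is left out, and this is not the proposed final design.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical brute-force store: linear scan, cosine score, threshold, top-K. */
final class InMemoryVectorStore {

    /** One search result: the stored id and its cosine score against the query. */
    static final class Hit {
        final String id;
        final double score;

        Hit(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    private final Map<String, float[]> vectors = new LinkedHashMap<>();

    /** Stores a defensive copy so later changes to the caller's array have no effect. */
    void put(String id, float[] vector) {
        vectors.put(id, vector.clone());
    }

    /** Returns at most topK hits with score strictly above threshold, best first. */
    List<Hit> search(float[] query, int topK, double threshold) {
        List<Hit> hits = new ArrayList<>();
        for (Map.Entry<String, float[]> entry : vectors.entrySet()) {
            double score = cosine(query, entry.getValue());
            if (score > threshold) {
                hits.add(new Hit(entry.getKey(), score));
            }
        }
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(hits.subList(0, Math.min(topK, hits.size())));
    }

    private static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // treat a zero vector as similar to nothing
        }
        return dot / Math.sqrt(normA * normB);
    }
}
```

Each query costs O(n · d) for n stored vectors of dimension d, so whether this is fast enough depends on the expected data scale, which is one of the open questions above.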
-
Hi Rod, to gather some early feedback from you, here's what we'd build for the pure Kotlin embedded vector store to replace Neo4j:
What We're Reasonably Confident About ✅
What We're Reasonably Sure About 🤔
What We're Uncertain About ❓
Migration Path
// Week 1: Feature parity
// Week 2: Performance optimization
// Future: If needed
Performance Guarantees
Risks & Mitigations
Risk: Brute force too slow at scale
Risk: File corruption
Risk: Memory pressure
What We Need From You
Bottom line: We can have Neo4j completely replaced with 700 lines of Kotlin by end of week. No Docker, no external dependencies, just mvn test and it works. Shall we proceed?
-
The 2 usages of Neo4j:
Architecture Flow:
✅ Ready to Proceed - Implementation Plan:
Phase 1: Core Replacement
Phase 2: Integration
Key Decision Points I'll Document:
-
Should we use two different storage strategies, one for chunks and one for vectors? Should we use a true vector database for the latter, if it doesn't bring in any dependencies?
-
The idea of an embedded vector store is interesting. What I envisioned at this point was removing the Spring AI Neo vector store dependency and writing our own Cypher support for Neo querying. So, keeping Neo but through simpler, more direct means. Ultimately the
-
This is now removed. I'm going to convert this to a discussion to capture the ideas about other vector management.
-
This complicates dependencies. We should instead fully manage our own indexing and querying, possibly with multiple chunking strategies.