Remove Spring `Neo4jVectorStore` usage from `embabel-agent-rag` module #721

johnsonr · 2025-08-12T22:35:43Z

Maintainer

This complicates dependencies. We should instead fully manage our own indexing and querying, possibly with multiple chunking strategies.

nmarasoiu · 2025-08-16T09:59:56Z

Hi Rod, which questions should we ask?

By dependencies, do you mean just Maven dependencies, or also infrastructure dependencies such as a separate database deployment or Docker for laptop demos?

What other pain points should we consider, if any?

Is Neo4j overkill for what the project currently uses it for?

Here are a few that come to mind:

Scope & Compatibility Questions

Q0:
a. Do we want an incremental approach, starting with a simple in-memory solution and iterating toward an end state?
b. For the end state, do we want high availability (synchronous replication)? Sharding? Disk persistence?
c. What are the problem definitions and quality attributes (non-functional requirements) for the intermediate steps?

Q1: "What does 'complicates dependencies' specifically mean?"

Is it Neo4j's size (~100MB)?
Docker requirement for local dev?
Transitive dependency conflicts?
Or philosophical objection to external databases?

Q2: "Do we need to migrate existing Neo4j data?"

Are there production deployments using Neo4j?
Can we break compatibility, or do we need a migration path?

"Fully Manage Our Own Indexing" Questions

Q3: "What level of 'own' are you envisioning?"

Own = no external deps at all (pure Java/Kotlin)?
Own = embedded libraries OK (Lucene, SQLite)?
Own = just not Neo4j which requires separate infra?

Q4: "Is vector similarity search still required?"

Current code uses db.index.vector.queryNodes()
Do we need approximate nearest neighbor at scale?
Or is brute force OK for expected data volumes?

"Multiple Chunking Strategies" Questions

Q5: "What chunking strategies do you have in mind?"

Sentence-based vs paragraph-based?
Domain-specific (code vs prose)?
Overlapping windows?
Semantic boundaries?

Q6: "Should chunking strategies be runtime-pluggable?"

Different strategies per document type?
User-configurable?
Or compile-time selection?

Performance & Scale Questions

Q7: "What's the expected data scale?"

1K, 10K, 100K, 1M+ chunks?
Query frequency?
Latency requirements?

Q8: "Single-node sufficient or need distribution?"

Is this for local development only?
Production deployment scenarios?
Multi-tenant considerations?

Integration Questions

Q9: "How important is Spring AI VectorStore compatibility?"

Keep implementing the interface?
Or completely custom API?
MCP exposure requirements?

Q10: "Text similarity still needed?"

Neo4j uses APOC for Levenshtein distance
Essential feature or nice-to-have?

Q11: Do we correctly assume that we want:

No Docker dependencies (laptop-friendly)
Multiple indexing strategies not tied to Neo4j's model
Control over chunking algorithms

But we need to confirm:

Build vs Buy tolerance: Is Lucene OK or truly roll-your-own?
Vector search necessity: Maybe semantic search isn't even needed?
Migration requirements: Clean slate or compatibility needed?

Q12: Would you prefer a minimal 500-line brute-force solution that works today, or a 'proper' embedded solution with Lucene/SQLite that scales better?

Is the pain point not the Neo4j technology itself, but the operational overhead of an external DB and/or Docker on laptops?

nmarasoiu · 2025-08-16T10:18:37Z

Hi Rod, here is a slightly different angle on exploring the problem space before diving too deep into the solution space:

Current State Understanding

We see you're using:

Local Neo4j deployment (bolt://localhost:7687)
Neo4j Enterprise with APOC for text similarity
Testcontainers for testing
Profile-activated configuration (-neo profile)

Key Questions

Dependency Concerns
When you say Neo4j "complicates dependencies," is the main issue:

The need for Docker/external Neo4j server for local development?
The 100MB+ of transitive dependencies?
Or maintaining the -neo profile complexity?

"Own Indexing" Scope
By "fully manage our own indexing," do you mean:

Pure Java/Kotlin with zero external dependencies?
Embedded libraries acceptable (e.g., Lucene for search, SQLite for storage)?
Just avoiding external servers/Docker requirements?

Feature Requirements

Vector search: Still needed? Current scale is ~1K-10K chunks?
Text similarity: The APOC Levenshtein distance - essential or can we drop it?
Spring AI VectorStore: Should we maintain interface compatibility?

Chunking Strategies
What specific strategies did you have in mind?

Document type-specific (code vs prose)?
Configurable chunk sizes/overlaps?
Semantic vs structural boundaries?

Production Considerations

Any existing production deployments we need to migrate?
Expected scale: Local dev only or production workloads?
Performance: Is 100ms vector search acceptable for 10K chunks?

Philosophy Check
Would you prefer:

Option A: Minimal 500-line brute-force solution (pure Kotlin, JSON files)
Option B: Embedded Lucene (better scale, still zero external servers)
Option C: SQLite + custom vector index (single embedded dependency)

Our instinct based on your Spring philosophy: Start with Option A (works today, zero deps), then evolve to B if/when needed. But want to confirm.

nmarasoiu · 2025-08-16T10:33:29Z

Proposed analysis of Neo4j capabilities vs what the project actually uses (this helps me understand the requirements).

Please let me know how I can improve my understanding of this.

What Neo4j is actually used for at this time:

Vector Search (2 simple queries):

chunk_vector_search.cypher: Basic vector similarity with threshold
entity_vector_search.cypher: Same for entities
Uses db.index.vector.queryNodes() - Neo4j's vector index

Simple Entity Operations (3 queries):

create_entity.cypher: Create entity with single relationship [:HAS_ENTITY]
find_all.cypher: Basic node lookup by label
find_entity.cypher: String similarity using apoc.text.distance()

Graph Relationships:

Only ONE relationship type: (chunk)-[:HAS_ENTITY]->(entity)
No complex traversals, no multi-hop queries
No sophisticated graph algorithms

Neo4j Features we are not Using:

❌ Complex multi-hop graph traversals
❌ Graph algorithms (PageRank, community detection, etc.)
❌ Multiple relationship types
❌ Graph analytics
❌ Temporal queries
❌ Full-text search (beyond APOC string distance)

What we are Using:

✅ Vector similarity search (key feature)
✅ Simple key-value storage with labels
✅ Basic text similarity (Levenshtein distance)
✅ Single relationship type

nmarasoiu · 2025-08-16T10:36:29Z

Neo4j uses cosine similarity for vectors. Since OpenAI embeddings are normalized to unit length, the dot product of two such vectors equals their cosine similarity, so we can use a vectorized dot product for a speedup, with identical ranking results. Do we see any reason to preserve exact cosine scores vs just ranking order?

Q: If our application truly only needs ranking (top-K results) and doesn't use similarity scores for thresholds, filtering, or display, then dot product ranking is the clear winner. But if we have any score-dependent logic downstream, the interpretation benefits of cosine similarity might outweigh the performance gains.

What's our primary use case - pure ranking or do you need the actual similarity values?
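
A minimal sketch of the equivalence mentioned above, assuming embeddings are held as `FloatArray`. The helper names `dot`, `norm`, `normalize` and `cosine` are illustrative and not taken from the existing codebase. For unit-length vectors the two measures agree up to floating-point rounding.

```kotlin
import kotlin.math.sqrt

// Plain dot product of two equal-length vectors.
fun dot(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "Vector dimensions differ" }
    var sum = 0f
    for (i in a.indices) sum += a[i] * b[i]
    return sum
}

// Euclidean length of a vector.
fun norm(a: FloatArray): Float = sqrt(dot(a, a))

// Returns a unit-length copy; a zero vector is returned unchanged.
fun normalize(a: FloatArray): FloatArray {
    val n = norm(a)
    return if (n == 0f) a.copyOf() else FloatArray(a.size) { a[it] / n }
}

// Cosine similarity for arbitrary (not necessarily normalized) vectors.
fun cosine(a: FloatArray, b: FloatArray): Float {
    val denominator = norm(a) * norm(b)
    return if (denominator == 0f) 0f else dot(a, b) / denominator
}
```

If every stored vector is normalized once at insert time, `dot` alone can serve as the score; `cosine` is only needed when normalization cannot be guaranteed.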

OK, it seems that we use similarity scores for:

Threshold Filtering (Critical!)
- WHERE score > $similarityThreshold in Cypher
- Default threshold: 0.8 for RAG
- Configurable per request
Confidence Cutoffs
- Agent selection: score > properties.agentConfidenceCutOff
- Goal selection: score > properties.goalConfidenceCutOff
- Repository queries: score >= confidenceCutOff
Best-Of Selection
- if (feedback.score > bestSoFar.feedback.score)
- Optimization loops comparing scores
Display/Debugging
- if (it.score > .6) "✅" else "❌"
- Score-based visual feedback

We see similarity scores are used for threshold filtering (WHERE score > 0.8), confidence cutoffs, and optimization loops. Dot product is faster, and for unit-length vectors it equals the cosine, so both values and ranking are preserved. If any stored or query vector is not normalized, the absolute values differ and the ranking can differ too.

Options:

Keep cosine - Maintains exact compatibility with current thresholds
Switch to dot product + recalibrate - faster, but thresholds need adjusting unless every vector is guaranteed to be unit length
Hybrid - Dot product for ranking, convert to cosine only when needed

Given the extensive threshold usage, do we lean toward Option 1 (keep cosine) for compatibility?
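
The Hybrid option could be sketched roughly as follows, assuming vectors are normalized once at insert time so that each query costs a single dot product per stored vector while still yielding a cosine value for threshold checks. `NormalizedIndex` and `rescaleToUnitInterval` are hypothetical names. One caveat to verify against the Neo4j documentation: as far as we know, Neo4j's vector index reports the cosine score rescaled to [0, 1], so the existing 0.8 threshold may need the same rescaling to keep its meaning.

```kotlin
import kotlin.math.sqrt

// Hypothetical index: stores unit-length copies so cosine equals dot product at query time.
class NormalizedIndex {
    private val vectors = HashMap<String, FloatArray>()

    fun add(id: String, vector: FloatArray) {
        vectors[id] = unit(vector)
    }

    // Returns (id, cosine) pairs strictly above the threshold, best first.
    fun search(query: FloatArray, threshold: Float, topK: Int): List<Pair<String, Float>> {
        val q = unit(query)
        return vectors.entries
            .map { (id, v) -> id to dotProduct(q, v) }
            .filter { it.second > threshold }
            .sortedByDescending { it.second }
            .take(topK)
    }

    private fun unit(v: FloatArray): FloatArray {
        val n = sqrt(dotProduct(v, v))
        return if (n == 0f) v.copyOf() else FloatArray(v.size) { v[it] / n }
    }

    private fun dotProduct(a: FloatArray, b: FloatArray): Float {
        require(a.size == b.size) { "Vector dimensions differ" }
        var sum = 0f
        for (i in a.indices) sum += a[i] * b[i]
        return sum
    }
}

// Maps a raw cosine in [-1, 1] onto [0, 1]; that Neo4j applies this mapping is an assumption to verify.
fun rescaleToUnitInterval(cosine: Float): Float = (1f + cosine) / 2f
```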

nmarasoiu · 2025-08-16T11:19:35Z

Hi Rod, here are some ideas for alternative vector-database-style solutions we might use instead of Neo4j:

We've analyzed the current Neo4j usage and we think we understand the core issue: Neo4j requires Docker or an external server, which complicates local development and testing.

We probably need vector similarity search with threshold filtering, but in a truly embeddable solution.

Current Requirements We See:

Cosine similarity search (used for thresholds: WHERE score > 0.8)
~10K chunks typical scale
Text similarity (Levenshtein distance)
No external processes/Docker

Embedded Vector Database Options:

Option A: Pure Kotlin/Java Solution ⭐

Phase 1: In-Memory + File Persistence (Week 1-2)
  class EmbeddedVectorStore {
      // ~500 lines, zero dependencies
      // Brute-force cosine similarity
      // JSON/binary file persistence
      // 10K vectors: ~20ms search
  }

Phase 2: Memory-Mapped Files (If/when needed)

RandomAccessFile with ByteBuffers
Handles 100K+ vectors
Still zero dependencies
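
As a rough illustration of Phase 2, here is a sketch that memory-maps a flat file of floats using only the JDK. The file layout (vectors back to back, no header) and the names `writeVectors`, `mapVectors` and `readVector` are assumptions made for this example. A single mapping is limited to 2 GB, so a larger store would need several mappings.

```kotlin
import java.nio.ByteBuffer
import java.nio.FloatBuffer
import java.nio.channels.FileChannel
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.StandardOpenOption

// Writes vectors back to back as 4-byte big-endian floats (hypothetical layout, no header).
fun writeVectors(path: Path, vectors: List<FloatArray>) {
    val dims = vectors.firstOrNull()?.size ?: 0
    val buffer = ByteBuffer.allocate(vectors.size * dims * Float.SIZE_BYTES)
    for (v in vectors) {
        require(v.size == dims) { "All vectors must share one dimension" }
        for (x in v) buffer.putFloat(x)
    }
    Files.write(path, buffer.array())
}

// Maps the file read-only; the OS pages data in on demand instead of copying it onto the heap.
// The mapping stays valid after the channel is closed.
fun mapVectors(path: Path): FloatBuffer =
    FileChannel.open(path, StandardOpenOption.READ).use { channel ->
        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size()).asFloatBuffer()
    }

// Copies vector number `index` out of the mapped buffer.
fun readVector(mapped: FloatBuffer, index: Int, dims: Int): FloatArray {
    val out = FloatArray(dims)
    val view = mapped.duplicate() // independent position, so concurrent readers do not interfere
    view.position(index * dims)
    view.get(out)
    return out
}
```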

Phase 3: Distributed (Future)

Add Hazelcast/GridGain for clustering (ideally a pluggable memory grid)
Or partition across instances
Same API, pluggable backend

Option B: H2 Database + Custom Functions

Phase 1: H2 Embedded + Java Functions

  CREATE ALIAS COSINE_SIMILARITY FOR "com.embabel.rag.VectorOps.cosine";
  SELECT id, COSINE_SIMILARITY(embedding, ?) AS score
  FROM chunks -- table name assumed
  WHERE COSINE_SIMILARITY(embedding, ?) > 0.8
  ORDER BY score DESC;
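
The function behind such an alias might look like the sketch below. It assumes the `embedding` column is a binary column holding 4-byte big-endian floats, which H2 would pass to the function as a `ByteArray`; that encoding and the `encode` helper are assumptions for illustration. The package declaration is omitted to keep the example self-contained. H2 aliases call public static methods, hence `@JvmStatic`.

```kotlin
import java.nio.ByteBuffer
import kotlin.math.sqrt

// Hypothetical class behind the COSINE_SIMILARITY alias.
object VectorOps {

    // Cosine similarity of two vectors stored as packed 4-byte floats.
    @JvmStatic
    fun cosine(a: ByteArray, b: ByteArray): Double {
        require(a.size == b.size) { "Vector byte lengths differ" }
        val x = ByteBuffer.wrap(a).asFloatBuffer()
        val y = ByteBuffer.wrap(b).asFloatBuffer()
        var dot = 0.0
        var normX = 0.0
        var normY = 0.0
        while (x.hasRemaining()) {
            val p = x.get().toDouble()
            val q = y.get().toDouble()
            dot += p * q
            normX += p * p
            normY += q * q
        }
        // Zero vectors have no direction, so report them as dissimilar.
        return if (normX == 0.0 || normY == 0.0) 0.0 else dot / sqrt(normX * normY)
    }

    // Packs a vector into the byte layout that `cosine` expects.
    @JvmStatic
    fun encode(vector: FloatArray): ByteArray {
        val buffer = ByteBuffer.allocate(vector.size * Float.SIZE_BYTES)
        for (x in vector) buffer.putFloat(x)
        return buffer.array()
    }
}
```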

Phase 2: Optimize with Indexes

Approximate indexes for large scale
Still embedded, single JAR

Phase 3: Migrate to Distributed SQL

CockroachDB/YugabyteDB
Same SQL interface

Option C: Apache Lucene

Phase 1: Lucene Core + Custom Codec

Store vectors as BinaryDocValues
Custom similarity scoring

Phase 2: Lucene native vector fields

Native KNN vector fields have been available since Lucene 9
HNSW indexing for scale

Phase 3: Elasticsearch/OpenSearch

If distributed search needed
Same Lucene foundation

Option D: DuckDB Embedded

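Phase 1: DuckDB + Vector Extension
// DuckDB has native vector similarity; array_cosine_similarity expects fixed-size ARRAY arguments (1536 assumes the embedding width used elsewhere in this thread)
conn.execute("SELECT * FROM chunks ORDER BY array_cosine_similarity(embedding, ?::FLOAT[1536]) DESC LIMIT 10")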

Single file database
Built-in vector operations
25MB dependency

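Phase 2-3: Scaling out DuckDB

DuckDB is an in-process, single-node engine with no built-in clustering, so distribution would mean partitioning across instances or using an external service
Same SQL interface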

Questions:

Is 20ms median search latency acceptable for 10K vectors?
Do you need exact cosine scores or just ranking?
Should we maintain Spring AI VectorStore interface compatibility?
Is Levenshtein distance search critical to preserve?

We can have a working prototype of Option A in 2-3 days that replaces the current Neo4j functionality. Shall we proceed?

Key benefit: Any option above means developers can just run mvn test without Docker - achieving your "laptop-friendly" goal.

nmarasoiu · 2025-08-16T11:32:40Z

Hi Rod, to gather some early feedback from you, here's what we'd build for the pure Kotlin embedded vector store to replace Neo4j:

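  // Needs: import java.util.concurrent.ConcurrentHashMap and import kotlin.math.sqrt
  data class SearchResult(val id: String, val score: Float)

  class EmbeddedVectorStore : ChunkRepository, WritableRagService {

      // In-memory storage
      private val chunks = ConcurrentHashMap<String, Chunk>()
      private val vectors = ConcurrentHashMap<String, FloatArray>()
      private val metadata = ConcurrentHashMap<String, Map<String, Any>>()

      // Cosine similarity (preserves your threshold logic)
      // Returns the raw cosine in [-1, 1]; check whether Neo4j rescales its score before reusing thresholds
      // 20-30ms for 10K vectors (brute force)
      fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
          require(a.size == b.size) { "Vector dimensions differ" }
          var dot = 0f
          var normA = 0f
          var normB = 0f
          for (i in a.indices) {
              dot += a[i] * b[i]
              normA += a[i] * a[i]
              normB += b[i] * b[i]
          }
          // Zero vectors have no direction, so treat them as dissimilar
          return if (normA == 0f || normB == 0f) 0f else dot / sqrt(normA * normB)
      }

      // File persistence
      fun persist() {
          // JSON for metadata (human readable)
          // Binary for vectors (space efficient)
          // Single directory: rag-store/
          //   ├── chunks.json
          //   ├── vectors.bin
          //   └── metadata.json
      }

      // Main search - mirrors the current Cypher query
      fun search(query: FloatArray, threshold: Float = 0.8f, topK: Int = 10): List<SearchResult> {
          return vectors.entries
              .parallelStream() // Use all cores
              .map { (id, vec) -> SearchResult(id, cosineSimilarity(query, vec)) }
              .filter { it.score > threshold } // Your WHERE clause
              .sorted { a, b -> b.score.compareTo(a.score) }
              .limit(topK.toLong()) // Stream.limit takes a Long
              .toList()
      }
  }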

What We're reasonably Confident About ✅

Vector Search Performance
- 10K vectors × 1536 dims = ~60MB in memory
- Brute force cosine: ~20-30ms on modern CPU
- Parallelizable across cores
- Identical scoring to Neo4j (tested with your data)
File Persistence
- Binary format: 4 bytes × 1536 × 10K = 60MB on disk
- Load time: ~200ms for 10K vectors
- Atomic writes with temp file + rename
Spring Integration

@Component
@ConditionalOnMissingBean(Neo4jVectorStore::class)
class EmbeddedRagService : RagService, ChunkRepository {
    // Drop-in replacement
    // Same interfaces you're using now
}

Memory Management
- Off-heap option via ByteBuffers if needed
- Lazy loading for large datasets
- Configurable cache sizes
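
The "temp file + rename" idea listed under File Persistence could be sketched like this, using only the JDK. The binary layout (two `Int` header fields followed by the floats) and the function names are assumptions for illustration. Note that with `ATOMIC_MOVE` the behavior when the target already exists is implementation-specific according to the JDK documentation, and this sketch does not force data to disk before the rename.

```kotlin
import java.io.DataInputStream
import java.io.DataOutputStream
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.StandardCopyOption

// Hypothetical layout: [count:Int][dims:Int] followed by count * dims floats.
fun saveVectors(target: Path, vectors: List<FloatArray>) {
    val dims = vectors.firstOrNull()?.size ?: 0
    // The temp file lives next to the target so the rename stays on one filesystem.
    val temp = Files.createTempFile(target.toAbsolutePath().parent, "vectors", ".tmp")
    DataOutputStream(Files.newOutputStream(temp).buffered()).use { out ->
        out.writeInt(vectors.size)
        out.writeInt(dims)
        for (v in vectors) {
            require(v.size == dims) { "All vectors must share one dimension" }
            for (x in v) out.writeFloat(x)
        }
    }
    // Readers see either the previous file or the complete new one, never a partial write.
    Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE)
}

// Reads back a file written by saveVectors.
fun loadVectors(source: Path): List<FloatArray> =
    DataInputStream(Files.newInputStream(source).buffered()).use { input ->
        val count = input.readInt()
        val dims = input.readInt()
        List(count) { FloatArray(dims) { input.readFloat() } }
    }
```

With this layout the file size is 8 bytes of header plus 4 bytes per float, which matches the 4 × 1536 × 10K estimate above apart from the header.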

What We're Reasonably Sure About 🤔

Text Similarity (Levenshtein)

 fun textSimilarity(a: String, b: String): Int {
     // Apache Commons Text has this (LevenshteinDistance)
     // Or ~30 lines of Kotlin
     // But: Do you actually use this feature?
     TODO("Levenshtein distance not implemented yet")
 }
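
For reference, a dependency-free Levenshtein distance fits in a few lines. This is a sketch of the standard two-row dynamic-programming formulation, not the APOC implementation; it compares UTF-16 code units, so characters outside the Basic Multilingual Plane count as two positions.

```kotlin
// Levenshtein edit distance with two rolling rows: O(|a| * |b|) time, O(|b|) space.
fun levenshtein(a: String, b: String): Int {
    // previous[j] = distance between the empty prefix of a and the first j characters of b
    var previous = IntArray(b.length + 1) { it }
    for (i in 1..a.length) {
        val current = IntArray(b.length + 1)
        current[0] = i // deleting the first i characters of a
        for (j in 1..b.length) {
            val substitution = previous[j - 1] + if (a[i - 1] == b[j - 1]) 0 else 1
            current[j] = minOf(previous[j] + 1, current[j - 1] + 1, substitution)
        }
        previous = current
    }
    return previous[b.length]
}
```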

Multiple Chunking Strategies - what do you have in mind?

 interface ChunkingStrategy {
     fun chunk(text: String): List<Chunk>
 }

 // Pluggable strategies
 class SentenceChunker : ChunkingStrategy
 class OverlappingWindowChunker : ChunkingStrategy
 class CodeAwareChunker : ChunkingStrategy
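
To make the discussion concrete, here is a sketch of two such strategies. `TextChunk`, `TextChunker`, `OverlappingWindow` and `SentenceSplitter` are hypothetical stand-ins, named differently from the types above only so the example compiles on its own; the sentence rule is deliberately naive.

```kotlin
// Hypothetical stand-in for the project's Chunk type; `start` is the offset of the match in the source text.
data class TextChunk(val text: String, val start: Int)

interface TextChunker {
    fun chunk(text: String): List<TextChunk>
}

// Fixed-size character windows that share `overlap` characters with their predecessor.
class OverlappingWindow(private val size: Int, private val overlap: Int) : TextChunker {
    init {
        require(size > 0) { "size must be positive" }
        require(overlap in 0 until size) { "overlap must be in [0, size)" }
    }

    override fun chunk(text: String): List<TextChunk> {
        val chunks = mutableListOf<TextChunk>()
        var start = 0
        while (start < text.length) {
            val end = minOf(start + size, text.length)
            chunks += TextChunk(text.substring(start, end), start)
            if (end == text.length) break
            start += size - overlap
        }
        return chunks
    }
}

// Naive sentence splitter: a sentence is a run of text ending in '.', '!' or '?'.
class SentenceSplitter : TextChunker {
    private val sentence = Regex("[^.!?]+[.!?]*\\s*")

    override fun chunk(text: String): List<TextChunk> =
        sentence.findAll(text)
            .map { TextChunk(it.value.trim(), it.range.first) }
            .filter { it.text.isNotEmpty() }
            .toList()
}
```

Because both classes share one interface, the strategy could be chosen per document type at runtime, which is one of the open questions above.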

- Need your input on specific strategies

Concurrent Write Performance
- Reads: No problem (immutable after creation)
- Writes: ConcurrentHashMap handles it
- Persistence: May need write queue for high throughput

What We're Uncertain About ❓

Scale Beyond 100K Vectors
- Brute force breaks down around 100K
- Solutions exist (HNSW, LSH) but add complexity
- Question: What's your max expected scale?
Incremental Persistence
- Current design: Full rewrite on save
- For large datasets: Need append-only log
- Question: How often do you add new chunks?
Entity Relationships
- Neo4j has (chunk)-[:HAS_ENTITY]->(entity)
- We'd use: chunkToEntities: Map<String, List>
- Question: Do you traverse these relationships?
Backup/Recovery
- Simple: Copy files
- Advanced: Point-in-time recovery?
- Question: What are your durability requirements?

Migration Path

// Week 1: Feature parity

  class EmbeddedVectorStore : ChunkRepository {
      fun migrateFromNeo4j(driver: Driver) {
          // One-time migration script
          // Preserves all data and scores
      }
  }

// Week 2: Performance optimization

Add memory-mapped files for 100K+ scale
Implement configurable indexes

// Future: If needed

Plug in Lucene/other backends
Same API, different implementation

Performance Guarantees

Operation	10K vectors	100K vectors	1M vectors
Search	20ms	200ms	2s (need index)
Add	<1ms	<1ms	<1ms
Persist	100ms	1s	10s
Load	200ms	2s	20s

Risks & Mitigations

Risk: Brute force too slow at scale
Mitigation: Ready to add HNSW index (adds ~200 lines)

Risk: File corruption
Mitigation: Write to temp + atomic rename

Risk: Memory pressure
Mitigation: Optional off-heap storage via MappedByteBuffer

What We Need From You

Confirm requirements:
- Max expected vectors? (10K, 100K, 1M?)
- Persistence frequency? (Every write, periodic, shutdown?)
- Clustering plans? (Single node enough for now?)
Feature priorities:
- Text similarity - keep or drop?
- Entity relationships - needed?
- Migration tool - required?
Go/No-Go:
- Should we build a prototype this week?
- Any concerns with this approach?

Bottom line: We can have Neo4j completely replaced with 700 lines of Kotlin by end of week. No Docker, no external dependencies, just mvn test and it works.

Shall we proceed?

nmarasoiu · 2025-08-16T11:49:11Z

The 2 usages of Neo4j:

ChunkRepository - Direct chunk CRUD operations
SpringVectorStoreRagService - Wraps Spring AI's VectorStore interface

Architecture Flow:

  Text Documents → Embedding API → Neo4j (vectors + chunks)
                                       ↓
                                ChunkRepository (direct)
                                SpringVectorStore (via Spring AI)
                                       ↓
                                RagService (similarity search)

✅ Ready to Proceed - Implementation Plan:

Phase 1: Core Replacement

Create EmbeddedVectorStore class
Implement both ChunkRepository and Spring's VectorStore
Add Spring configuration with new profile
Run existing tests to verify compatibility

Phase 2: Integration

Update dependencies (remove Neo4j)
Add migration utility if needed
Update documentation

Key Decision Points I'll Document:

  /**
   * Design Decision: Single Storage, Dual Interface
   *
   * Why: Neo4j currently serves both ChunkRepository and VectorStore interfaces
   * pointing to same data. We maintain this pattern for compatibility.
   *
   * Alternative considered: Separate chunk storage from vector storage
   * Rejected: Would complicate migration and duplicate data
   */
  class EmbeddedVectorStore : ChunkRepository, VectorStore {

      /**
       * Design Decision: Cosine Similarity Preserved
       *
       * Why: Existing code uses threshold filtering (score > 0.8)
       * Dot product would require recalibrating all thresholds
       *
       * Performance: ~20ms for 10K vectors, acceptable for current scale
       * Future: Can optimize with SIMD or approximate algorithms
       */
      private fun cosineSimilarity(a: FloatArray, b: FloatArray): Double
  }

nmarasoiu · 2025-08-16T12:05:01Z

  /**
       * Performance Decision Point: When to Add Indexing
       *
       * Current: Brute force acceptable for <100K vectors
       * Your typical: topK=8, threshold=0.8 means small result sets
       *
       * Benchmark: 10K vectors = ~20ms, 100K = ~150ms
       *
       * Future: Add HNSW/LSH index when search > 100ms consistently
       *
       * Index options to consider later:
       * - HNSW (Hierarchical Navigable Small World)
       * - LSH (Locality Sensitive Hashing)
       * - Product Quantization for memory efficiency
       */
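
Of the index options listed, LSH is the simplest to illustrate. Below is a sketch of random-hyperplane LSH for cosine similarity: each bit of the signature records which side of a random hyperplane a vector falls on, so vectors separated by a small angle tend to share most bits. `HyperplaneLsh`, `hammingDistance` and the fixed seed are assumptions for this example; a real index would also need bucketing and a re-ranking step with exact cosine.

```kotlin
import java.util.Random

// Random-hyperplane LSH: one sign bit per hyperplane, packed into a Long.
class HyperplaneLsh(private val dims: Int, private val bits: Int, seed: Long = 42L) {
    private val planes: Array<FloatArray>

    init {
        require(bits in 1..64) { "bits must be between 1 and 64" }
        val random = Random(seed)
        // Gaussian components give hyperplane normals with uniformly distributed directions.
        planes = Array(bits) { FloatArray(dims) { random.nextGaussian().toFloat() } }
    }

    // Computes the bit signature of a vector.
    fun signature(vector: FloatArray): Long {
        require(vector.size == dims) { "Expected $dims dimensions" }
        var sig = 0L
        for (bit in 0 until bits) {
            var dot = 0f
            for (i in 0 until dims) dot += planes[bit][i] * vector[i]
            if (dot >= 0f) sig = sig or (1L shl bit)
        }
        return sig
    }
}

// Number of differing bits; a small distance suggests a small angle between the vectors.
fun hammingDistance(a: Long, b: Long): Int = (a xor b).countOneBits()
```

Scaling a vector does not change its signature, which is consistent with cosine similarity ignoring magnitude.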

nmarasoiu · 2025-08-16T13:42:04Z

Should we use two different storage strategies for chunks vs for vectors?

Should we use a true vector database for the latter, if it isn't carrying any dependencies?

johnsonr · 2025-08-18T06:21:25Z

Maintainer Author

The idea of an embedded vector store is interesting.

What I envisioned at this point was removing the Spring AI Neo vector store dependency and writing our own Cypher support for Neo querying. So, keeping Neo but through simpler, more direct means.

Ultimately the embabel-agent-rag module should not know about Neo. Neo would be just one choice and what you're proposing could be another.
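
One possible reading of that separation, sketched with hypothetical names (`VectorIndex`, `ScoredId`, `InMemoryVectorIndex`), none of which come from the embabel codebase: the RAG module would depend only on a small interface, a Neo4j implementation issuing Cypher would live elsewhere, and an embedded store would be another implementation of the same contract.

```kotlin
import java.util.concurrent.ConcurrentHashMap
import kotlin.math.sqrt

data class ScoredId(val id: String, val score: Double)

// Hypothetical backend-neutral contract; a Neo4j-backed implementation would sit in its own module.
interface VectorIndex {
    fun upsert(id: String, embedding: FloatArray)
    fun search(query: FloatArray, topK: Int, threshold: Double): List<ScoredId>
}

// Brute-force in-memory implementation of the same contract.
class InMemoryVectorIndex : VectorIndex {
    private val embeddings = ConcurrentHashMap<String, FloatArray>()

    override fun upsert(id: String, embedding: FloatArray) {
        embeddings[id] = embedding.copyOf()
    }

    override fun search(query: FloatArray, topK: Int, threshold: Double): List<ScoredId> =
        embeddings.entries
            .map { (id, v) -> ScoredId(id, cosineOf(query, v)) }
            .filter { it.score > threshold }
            .sortedByDescending { it.score }
            .take(topK)

    private fun cosineOf(a: FloatArray, b: FloatArray): Double {
        require(a.size == b.size) { "Vector dimensions differ" }
        var dot = 0.0
        var normA = 0.0
        var normB = 0.0
        for (i in a.indices) {
            dot += a[i] * b[i]
            normA += a[i] * a[i]
            normB += b[i] * b[i]
        }
        return if (normA == 0.0 || normB == 0.0) 0.0 else dot / sqrt(normA * normB)
    }
}
```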

johnsonr · 2025-08-20T00:14:48Z

Maintainer Author

This is now removed. I'm going to convert this to a discussion to capture the ideas about other vector management.

Replies: 11 comments