A minimal pipeline for semantic search over documents using a BERT-based model (ColBERT via PyLate).

saim-x/ModernColBERT-Search

ModernColBERT Semantic Search

A minimal pipeline for semantic search over documents using a BERT-based model (ColBERT via PyLate). This repo includes demo documents in documents/ for immediate experimentation.

About ColBERT

This project implements the ColBERT (Contextualized Late Interaction over BERT) approach for efficient and effective passage/document search, as described in:

ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
Omar Khattab, Matei Zaharia
[arXiv:2004.12832]

ColBERT introduces a late interaction architecture that independently encodes queries and documents using BERT, then models their fine-grained similarity with a fast interaction step. This enables highly expressive retrieval with the efficiency of pre-computed document representations.
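As a rough illustration of the interaction step, the ColBERT paper scores a query against a document with a "MaxSim" operator: each query token embedding is matched to its most similar document token embedding, and those maxima are summed. The sketch below uses plain Python lists and hand-written 2-D unit vectors in place of real encoder outputs; the function names and the toy data are illustrative assumptions, not code from this repository.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Late-interaction score: for each query token, take its best-matching
    document token, then sum those maxima over all query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy per-token embeddings (hypothetical, already unit length).
query = [(1.0, 0.0), (0.0, 1.0)]
doc_a = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]  # contains both query directions
doc_b = [(0.6, 0.8)]                          # only partially aligned

scores = {"doc_a": maxsim_score(query, doc_a), "doc_b": maxsim_score(query, doc_b)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # doc_a ranks above doc_b
```

Because document token embeddings do not depend on the query, they can be computed once at indexing time; only the cheap max-and-sum step runs per query.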

This implementation uses the Reason-ModernColBERT model via PyLate. Reason-ModernColBERT is a late interaction model trained for reasoning-intensive retrieval.

What It Does

  • Indexes documents (PDFs auto-converted to text, then indexed with ColBERT)
  • Semantic search: Query your document collection with natural language and get relevant results (with filenames and scores)
  • Logs all search queries and results to search_log.txt for future reference and analysis
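The logging behavior described above could look roughly like the sketch below. The README does not specify the format of search_log.txt, so this assumes one JSON object per line; `log_search` and its fields are hypothetical names chosen for illustration. The demo writes to a temporary directory rather than the real log.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def log_search(query, results, log_path="search_log.txt"):
    """Append one query and its ranked (filename, score) results to the log.

    Assumed format: one JSON object per line (not confirmed by the repo).
    """
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "results": [{"filename": name, "score": score} for name, score in results],
    }
    # Append mode keeps earlier searches intact.
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


# Demo against a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    demo_log = Path(tmp) / "search_log.txt"
    log_search("what is late interaction?", [("paper.txt", 12.5)], demo_log)
    log_search("pdf conversion", [("notes.txt", 8.0), ("paper.txt", 3.25)], demo_log)
    logged = [json.loads(line) for line in demo_log.read_text(encoding="utf-8").splitlines()]

print(len(logged), logged[0]["query"])
```

One entry per line keeps the log appendable and easy to parse later for analysis.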

Quickstart

  1. Install dependencies:
    pip install -r requirements.txt
  2. Run the full pipeline (convert, index, and search):
    python run_all.py
    • This will convert PDFs in documents/ to text, index them, and launch the search interface.
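For orientation, a runner that chains the three stages might be structured like the sketch below. It is an assumption about how run_all.py could work, not a copy of it: the script names come from the project structure, while `build_commands`, `run_pipeline`, and the injectable `run` parameter are hypothetical.

```python
import subprocess
import sys

# Stage order taken from the README: convert, then index, then search.
PIPELINE = ["convert.py", "index_documents.py", "query_documents.py"]


def build_commands(scripts=PIPELINE, python=sys.executable):
    """Return one argv list per stage, using the current interpreter by default."""
    return [[python, script] for script in scripts]


def run_pipeline(commands, run=subprocess.run):
    """Run each stage in order.

    With subprocess.run, check=True raises CalledProcessError on a non-zero
    exit code, so later stages do not run after a failed one.
    """
    for cmd in commands:
        run(cmd, check=True)


# Dry run: record the commands instead of executing them.
planned = []
run_pipeline(build_commands(python="python"), run=lambda cmd, check: planned.append(cmd))
print(planned)
```

Calling `run_pipeline(build_commands())` with the default `run` would execute the real scripts from the repository root.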

Project Structure

  • documents/ — Example PDFs (included for demo)
  • convert.py — Converts PDFs to text (runs automatically)
  • index_documents.py — Indexes text files for search (runs automatically)
  • query_documents.py — Search interface (runs automatically)
  • run_all.py — Runs the full pipeline in order
  • pylate_index/, text_files/, doc_id_map.json — Generated/ignored files
  • search_log.txt — Log of all search queries and results
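The README does not describe the contents of doc_id_map.json. Since search results are reported with filenames, one plausible layout is a mapping from index document IDs to source filenames. The sketch below assumes string IDs assigned in sorted filename order; `build_doc_id_map` is a hypothetical helper, and the demo uses a temporary directory with placeholder files.

```python
import json
import tempfile
from pathlib import Path


def build_doc_id_map(text_dir, map_path):
    """Assign a string ID to every .txt file and save the ID -> filename map.

    Assumed layout: {"0": "first.txt", "1": "second.txt", ...}
    """
    files = sorted(p.name for p in Path(text_dir).glob("*.txt"))
    doc_id_map = {str(i): name for i, name in enumerate(files)}
    Path(map_path).write_text(json.dumps(doc_id_map, indent=2), encoding="utf-8")
    return doc_id_map


# Demo with placeholder text files.
with tempfile.TemporaryDirectory() as tmp:
    text_dir = Path(tmp) / "text_files"
    text_dir.mkdir()
    for name in ("b_report.txt", "a_paper.txt"):
        (text_dir / name).write_text("placeholder", encoding="utf-8")
    map_file = Path(tmp) / "doc_id_map.json"
    demo_map = build_doc_id_map(text_dir, map_file)
    reloaded = json.loads(map_file.read_text(encoding="utf-8"))

print(demo_map)
```

At query time, a lookup such as `doc_id_map[result_id]` would turn an ID returned by the index back into a filename for display.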

Notes

  • You can add your own PDFs to documents/.
  • The text files and the index are generated automatically and are ignored by git.
  • The search interface will show filenames for easy reference.
  • All search activity is logged to search_log.txt for later analysis.
