A minimal pipeline for semantic search over documents using a BERT-based model (ColBERT via PyLate). This repo includes demo documents in documents/ for immediate experimentation.
This project implements the ColBERT (Contextualized Late Interaction over BERT) approach for efficient and effective passage/document search, as described in:
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
Omar Khattab, Matei Zaharia
[arXiv:2004.12832]
ColBERT introduces a late interaction architecture that independently encodes queries and documents using BERT, then models their fine-grained similarity with a fast interaction step. This enables highly expressive retrieval with the efficiency of pre-computed document representations.
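The late interaction step can be illustrated with a minimal sketch of the MaxSim scoring described in the ColBERT paper: for each query token embedding, take the maximum similarity over all document token embeddings, then sum those maxima. The function name `maxsim_score` and the toy 2-dimensional vectors are illustrative assumptions; real models use learned embeddings of much higher dimension, and PyLate's internal implementation may differ.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best dot-product match among document tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings (hypothetical values, chosen to be unit length).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # has an exact match for each query token
doc_b = [[0.6, 0.8]]                           # only a partial match for each query token

scores = {"doc_a": maxsim_score(query, doc_a), "doc_b": maxsim_score(query, doc_b)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because the document token embeddings do not depend on the query, they can be computed once at indexing time; only the cheap max-and-sum step runs per query.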
This implementation uses the Reason-ModernColBERT model via PyLate, a late interaction model trained for reasoning-intensive retrieval.
- Indexes documents (PDFs auto-converted to text, then indexed with ColBERT)
- Semantic search: Query your document collection with natural language and get relevant results (with filenames and scores)
- Logs all search queries and results to search_log.txt for future reference and analysis
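As a rough illustration of the logging feature, the sketch below appends one query and its results to a log file. The function name `log_search`, the line layout, and the example filenames and scores are assumptions made for illustration; the actual format written by query_documents.py may differ.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def log_search(log_path, query, results):
    """Append a query and its (filename, score) results to a plain-text log."""
    timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    lines = [f"[{timestamp}] query: {query}"]
    for rank, (filename, score) in enumerate(results, start=1):
        lines.append(f"  {rank}. {filename}\t{score:.4f}")
    # Append mode keeps earlier searches in the log.
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")


# Usage with hypothetical results, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "search_log.txt"
    log_search(log_file, "late interaction", [("paper.pdf", 12.3456), ("notes.pdf", 9.5)])
    logged = log_file.read_text(encoding="utf-8")
print(logged)
```

Opening the file in append mode means repeated searches accumulate in one file, which suits later analysis.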
- Install dependencies:
pip install -r requirements.txt
- Run the full pipeline (convert, index, and search):
python run_all.py
- This will convert PDFs in documents/ to text, index them, and launch the search interface.
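The order of the pipeline steps can be sketched as follows. This is not the contents of run_all.py; it is a minimal illustration that assumes each stage is a standalone script invoked with the current Python interpreter and that a failing stage should stop the run. The `run_pipeline` helper and its `runner` parameter are hypothetical.

```python
import subprocess
import sys

# Script names come from this repository; the order is convert, index, search.
PIPELINE = ["convert.py", "index_documents.py", "query_documents.py"]


def run_pipeline(scripts=PIPELINE, runner=subprocess.run):
    """Run each script in order, stopping at the first failure."""
    completed = []
    for script in scripts:
        # With subprocess.run, check=True raises CalledProcessError on a non-zero exit code.
        runner([sys.executable, script], check=True)
        completed.append(script)
    return completed


# Dry run with a stand-in runner that records script names instead of executing them.
recorded = []
run_pipeline(runner=lambda cmd, check: recorded.append(cmd[1]))
print(recorded)
```

Calling `run_pipeline()` with no arguments would invoke the real scripts from the current working directory.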
- documents/: Example PDFs (included for demo)
- convert.py: Converts PDFs to text (runs automatically)
- index_documents.py: Indexes text files for search (runs automatically)
- query_documents.py: Search interface (runs automatically)
- run_all.py: Runs the full pipeline in order
- pylate_index/, text_files/, doc_id_map.json: Generated/ignored files
- search_log.txt: Log of all search queries and results
- You can add your own PDFs to documents/.
- Text files and index are generated automatically and ignored by git.
- The search interface will show filenames for easy reference.
- All search activity is logged to search_log.txt for later analysis.