ModernColBERT Semantic Search

A minimal pipeline for semantic search over documents using a BERT-based late interaction model (ColBERT, via PyLate). The repo includes demo documents in documents/ so you can experiment immediately.

About ColBERT

This project implements the ColBERT (Contextualized Late Interaction over BERT) approach for efficient and effective passage/document search, as described in:

ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
Omar Khattab, Matei Zaharia
[arXiv:2004.12832]

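ColBERT introduces a late interaction architecture: queries and documents are encoded independently with BERT into per-token embeddings, and relevance is scored by a cheap interaction step that takes, for each query token, its maximum similarity over the document's tokens and sums these maxima (MaxSim). Because documents are encoded independently of the query, their representations can be pre-computed offline, so retrieval stays efficient while remaining highly expressive.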
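The late interaction step can be illustrated with a small, dependency-free sketch. It is not the PyLate implementation: it uses plain Python lists and cosine similarity on toy 2-dimensional vectors, and the names `cosine` and `maxsim_score` are made up for this example. For each query token it takes the best-matching document token and sums those maxima.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def maxsim_score(query_vecs, doc_vecs):
    """Late interaction score: for every query token embedding, take the
    maximum similarity to any document token embedding, then sum the maxima.
    Assumes doc_vecs is non-empty."""
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)


# Toy per-token embeddings (real models use far higher dimensions).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # matches both query tokens
doc_b = [[1.0, 0.0], [-1.0, 0.0]]             # matches only the first one
```

Here `maxsim_score(query, doc_a)` is higher than `maxsim_score(query, doc_b)`, because every query token finds a close match in `doc_a`. Since the document vectors do not depend on the query, they can be computed once and stored in an index.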

This implementation uses Reason-ModernColBERT, a state-of-the-art late interaction model trained for reasoning-intensive retrieval, loaded through PyLate.

What It Does

- Indexes documents (PDFs are auto-converted to text, then indexed with ColBERT)
- Semantic search: query your document collection in natural language and get relevant results (with filenames and scores)
- Logs all search queries and results to search_log.txt for future reference and analysis
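The logging step could look roughly like the sketch below. The actual format of search_log.txt is not documented here, so the JSON-lines layout, the field names, and the helper name `log_search` are all assumptions made for illustration.

```python
import json
from datetime import datetime, timezone


def log_search(log_path, query, results):
    """Append one search to the log as a single JSON line.

    `results` is a list of (filename, score) pairs. The entry layout is an
    assumption for this sketch, not the repo's actual log format.
    """
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "results": [{"file": name, "score": score} for name, score in results],
    }
    # Append mode keeps earlier searches, so the log accumulates over time.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

A call such as `log_search("search_log.txt", "what is late interaction?", [("paper.txt", 17.3)])` would append one line. One JSON object per line keeps the file easy to parse later for analysis.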

Quickstart

Install dependencies:
```
pip install -r requirements.txt
```
Run the full pipeline (convert, index, and search):
```
python run_all.py
```
- This will convert PDFs in documents/ to text, index them, and launch the search interface.
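For orientation, a runner that executes the three stages in order might be sketched as follows. This is not the contents of run_all.py: the script names are taken from the Project Structure list, while `build_commands` and `run_pipeline` are hypothetical helpers.

```python
import subprocess
import sys

# Stage order follows the description above: convert, then index, then search.
PIPELINE = ["convert.py", "index_documents.py", "query_documents.py"]


def build_commands(python=sys.executable, scripts=PIPELINE):
    """Return one command (as an argument list) per pipeline stage."""
    return [[python, script] for script in scripts]


def run_pipeline(commands):
    """Run each command in order; check=True raises CalledProcessError
    on the first failing stage, so later stages are skipped."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline(build_commands())` from the repository root would run the stages with the current interpreter. Stopping at the first failure avoids indexing or searching stale output when conversion breaks.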

Project Structure

- documents/: Example PDFs (included for demo)
- convert.py: Converts PDFs to text (runs automatically)
- index_documents.py: Indexes text files for search (runs automatically)
- query_documents.py: Search interface (runs automatically)
- run_all.py: Runs the full pipeline in order
- pylate_index/, text_files/, doc_id_map.json: Generated files (ignored by git)
- search_log.txt: Log of all search queries and results
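The role of doc_id_map.json can be pictured with the sketch below, which maps index document IDs back to filenames so results can be shown by name. The real file's layout is not documented here. The string IDs, the `{"id": ..., "score": ...}` hit shape, and all function names are assumptions for illustration.

```python
import json
from pathlib import Path


def build_doc_id_map(text_dir):
    """Assign a string ID to every .txt file in text_dir, in sorted order."""
    files = sorted(p.name for p in Path(text_dir).glob("*.txt"))
    return {str(i): name for i, name in enumerate(files)}


def save_doc_id_map(id_map, path):
    """Write the ID-to-filename map as JSON."""
    Path(path).write_text(json.dumps(id_map, indent=2), encoding="utf-8")


def load_doc_id_map(path):
    """Read the ID-to-filename map back from JSON."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def attach_filenames(hits, id_map):
    """Turn hits shaped like {"id": ..., "score": ...} (an assumed shape)
    into (filename, score) pairs; unknown IDs are marked explicitly."""
    return [(id_map.get(str(h["id"]), "<unknown>"), h["score"]) for h in hits]
```

Keeping the map in a separate JSON file means the index only has to store opaque IDs, and the search interface can translate them to filenames at display time.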

Notes

- You can add your own PDFs to documents/.
- Text files and the index are generated automatically and ignored by git.
- The search interface shows filenames for easy reference.
- All search activity is logged to search_log.txt for later analysis.
