A toolkit for creating and validating Greek OCR training datasets for Tesseract fine-tuning. It provides tools to extract line-image/text pairs from Greek PDFs and to validate and correct them through an interactive viewer.
This toolkit consists of two main components:
- Dataset Builder (tools/build_dataset.py): Extracts line-image/text pairs from Greek PDFs
- Dataset Viewer (tools/viewer_app.py): Interactive web app for validating and correcting pairs
- Extracts line-level image/text pairs from Greek PDFs using Docling Parse v2
- Creates train/test splits (default: 10,000 train, 1,000 test pairs)
- High-performance processing with multiprocessing support
- Robust caching mechanism for resumable processing
- Outputs TIFF images with corresponding ground truth text files
- Maintains provenance tracking in index.csv
- Web-based interface for browsing line-image/text pairs
- Edit and correct ground truth text
- Mark problematic pairs as faulty
- Keyboard navigation (arrow keys)
- Greek character palette for easy input of special characters
- Progress tracking with pairs_status.parquet
- Preserves all edits and annotations
- Python 3.8 or higher
- Linux, macOS, or Windows
- At least 8GB RAM recommended for processing large PDFs
See requirements.txt for the full list. Key dependencies:
- docling-parse: For PDF parsing (requires special installation)
- opencv-python: For image processing
- PyMuPDF: For PDF rendering
- streamlit: For the viewer web interface
- Clone or extract this repository:
cd /mnt/data/Greek-Tesseract
- Create a virtual environment (recommended):
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Install docling-parse (special installation required):
# Docling-parse requires specific installation steps
# Follow the official docling-parse documentation for your platform
pip install docling-parse
Edit config.json to set your paths and preferences:
{
  "dataset_builder": {
    "pdf_dir": "/path/to/your/pdfs",
    "csv_path": "/path/to/selected_documents.csv",
    "output_dir": "./greek_dataset",
    "cache_dir": "./cache",
    "train_lines": 10000,
    "test_lines": 1000,
    "workers": 7,
    "dpi": 300,
    "seed": 42
  },
  "viewer": {
    "data_dir": "./datasets",
    "default_image_width": 900,
    "default_font_size": 18
  }
}
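Before a long run it can help to confirm that the file parses and that the numeric settings really are numbers. The following is a minimal sketch, not part of the toolkit: `load_builder_config` is a hypothetical helper, and the integer checks are an assumption based on the example values above.

```python
import json
from pathlib import Path

# Keys taken from the config.json example; treating them as integers is an assumption.
BUILDER_INT_KEYS = ("train_lines", "test_lines", "workers", "dpi", "seed")


def load_builder_config(path):
    """Return the "dataset_builder" section, failing early on obvious mistakes."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    builder = config["dataset_builder"]
    for key in BUILDER_INT_KEYS:
        if not isinstance(builder.get(key), int):
            raise ValueError(f"dataset_builder.{key} must be an integer")
    return builder
```

Calling `load_builder_config("config.json")` returns the builder settings as a dictionary, or raises if a value such as `"workers"` was accidentally quoted as a string.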
- Prepare a CSV file listing your PDFs with a filename column
- Update config.json with your paths
- Run the dataset builder:
python tools/build_dataset.py \
--csv /path/to/selected_documents.csv \
--pdf-dir /path/to/pdfs \
--out-dir ./greek_dataset \
--cache-dir ./cache \
--train-lines 10000 \
--test-lines 1000 \
--workers 7 \
--verbose
The tool will:
- Process PDFs in parallel
- Extract line-level text and images
- Save as TIFF/text pairs in train/ and test/ folders
- Create index.csv with metadata
- Resume from where it left off if interrupted
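If you do not yet have the CSV that the builder reads, one way to produce it is to list the PDF directory. This is a sketch under assumptions: `write_pdf_listing` is a hypothetical helper, and it assumes the builder only needs a `filename` column holding bare file names relative to the PDF directory.

```python
import csv
from pathlib import Path


def write_pdf_listing(pdf_dir, csv_path):
    """Write a CSV with a single `filename` column, one row per PDF in pdf_dir."""
    names = sorted(
        p.name for p in Path(pdf_dir).iterdir() if p.suffix.lower() == ".pdf"
    )
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename"])  # header row expected by the builder
        writer.writerows([name] for name in names)
    return len(names)
```

The function returns the number of PDFs listed, which is a quick check that the directory path was correct.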
- Update config.json with your dataset path
- Launch the viewer:
streamlit run tools/viewer_app.py
- Open http://localhost:8501 in your browser
- Select your dataset folder
- Navigate through pairs using:
  - Previous/Next buttons
  - Arrow keys (← →)
  - "Jump to Next Un-viewed" button
- Edit ground truth text as needed
- Mark faulty pairs with the checkbox
- All changes are saved automatically
- ←/→: Navigate previous/next
- ↓: Toggle faulty checkbox
- Esc: Exit text area
- Ctrl+←/Ctrl+→: Navigate while in text area
greek_dataset/
├── train/
│ ├── 000001.tif
│ ├── 000001.gt.txt
│ └── ...
├── test/
│ ├── 000001.tif
│ ├── 000001.gt.txt
│ └── ...
└── index.csv
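Since every .tif in this layout is expected to have a matching .gt.txt, a quick consistency check can catch interrupted or partially copied datasets. The sketch below is illustrative only; `find_unpaired` is a hypothetical helper that relies on nothing beyond the folder layout shown above.

```python
from pathlib import Path


def find_unpaired(dataset_dir, splits=("train", "test")):
    """Return the paths of image or text files whose counterpart is missing."""
    missing = []
    for split in splits:
        folder = Path(dataset_dir) / split
        # Strip the known suffixes to get the shared stem, e.g. "000001".
        images = {p.name[: -len(".tif")] for p in folder.glob("*.tif")}
        texts = {p.name[: -len(".gt.txt")] for p in folder.glob("*.gt.txt")}
        missing += [folder / f"{stem}.tif" for stem in sorted(texts - images)]
        missing += [folder / f"{stem}.gt.txt" for stem in sorted(images - texts)]
    return missing
```

An empty list from `find_unpaired("./greek_dataset")` means every line image has ground truth text and vice versa.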
- pairs_status.parquet: Tracks viewed/faulty status and corrections
- .tess_viewer_config.json: User preferences (in home directory)
- Prepare PDFs: Collect Greek PDFs and create a CSV listing them
- Build Dataset: Run the dataset builder to extract line pairs
- Initial Review: Use the viewer to browse through extracted pairs
- Correction: Fix any OCR errors in the ground truth text
- Quality Control: Mark problematic pairs as faulty
- Export: The corrected dataset is ready for Tesseract training
- Uses Docling Parse v2 for accurate PDF text extraction
- Implements O(N) baseline bucketing for line grouping
- Caches intermediate results for efficiency
- Handles Greek text encoding properly
- Preserves exact text layout and special characters
- Built with Streamlit for easy deployment
- Preserves all state between sessions
- Supports concurrent multi-user editing
- Character palette includes Greek letters and symbols
- Responsive design adapts to different screen sizes
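To illustrate the O(N) baseline bucketing mentioned for the dataset builder, here is a minimal sketch of the general idea. It is not the builder's actual code: the `(text, x, baseline_y)` tuple shape, the `tolerance` parameter, and the function name are all assumptions made for illustration.

```python
from collections import defaultdict


def group_by_baseline(words, tolerance=2.0):
    """Group (text, x, baseline_y) tuples into lines with one pass over the words.

    Each word lands in the bucket round(baseline_y / tolerance), so grouping
    costs O(N) dictionary insertions instead of pairwise comparisons.
    """
    buckets = defaultdict(list)
    for text, x, baseline_y in words:
        buckets[round(baseline_y / tolerance)].append((x, text))
    lines = []
    for key in sorted(buckets):  # top to bottom, assuming y grows downward
        ordered = sorted(buckets[key])  # left to right within the line
        lines.append(" ".join(text for _, text in ordered))
    return lines
```

The grouping pass is linear; only the final ordering of lines and of words within each line involves sorting. A known weakness of fixed buckets is that two words whose baselines straddle a bucket boundary can be split into separate lines, so a real implementation may need to merge neighboring buckets.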
- Docling-parse installation fails
  - Ensure you have the required system libraries
  - Check the official docling documentation
- No pairs found in PDF
  - Verify the PDF contains extractable text (not just images)
  - Check if the PDF is corrupted
- Viewer doesn't save changes
  - Ensure write permissions in the dataset directory
  - Check that pairs_status.parquet isn't locked
- Greek characters display incorrectly
  - Ensure UTF-8 encoding throughout
  - Verify your terminal/browser supports Unicode
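When Greek characters look wrong, it helps to rule out the files themselves before blaming the terminal or browser. The following sketch checks a single ground truth file; `check_ground_truth` is a hypothetical helper, and treating "contains at least one Greek letter" as a sanity check is an assumption about what a valid line looks like.

```python
import unicodedata
from pathlib import Path


def check_ground_truth(path):
    """Return a list of problems found in one .gt.txt file (empty if none)."""
    raw = Path(path).read_bytes()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc}"]
    problems = []
    # Unicode names of Greek letters (monotonic and polytonic) contain "GREEK".
    if not any("GREEK" in unicodedata.name(ch, "") for ch in text):
        problems.append("no Greek letters found")
    return problems
```

If the file passes this check but still displays incorrectly, the problem is more likely in the font or encoding settings of the program showing it.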
When modifying these tools:
- Preserve the core extraction logic in the dataset builder
- Maintain the viewer's state management system
- Test with Greek PDFs before committing changes
- Document any new configuration options
This toolkit is provided as-is for research and educational purposes.
- Uses Docling Parse v2 for PDF processing
- Built on Streamlit for the web interface
- Designed for Greek Tesseract OCR improvement