Greek Tesseract Finetuning Dataset Tools

A toolkit for creating and validating Greek OCR training datasets for Tesseract finetuning. It extracts line-image/text pairs from Greek PDFs and lets you validate and correct them in an interactive viewer.

Overview

This toolkit consists of two main components:

Dataset Builder (tools/build_dataset.py) - Extracts line-image/text pairs from Greek PDFs
Dataset Viewer (tools/viewer_app.py) - Interactive web app for validating and correcting pairs

Features

Dataset Builder

Extracts line-level image/text pairs from Greek PDFs using Docling Parse v2
Creates train/test splits (default: 10,000 train, 1,000 test pairs)
Parallel processing across multiple worker processes (set with --workers)
Caches intermediate results so that an interrupted run can resume
Outputs TIFF images with corresponding ground truth text files
Maintains provenance tracking in index.csv
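
As an illustration of how a reproducible train/test split can be drawn from a pool of extracted pairs, here is a minimal sketch. It assumes the split is made per line with a seeded shuffle. The function name split_pairs and the per-line strategy are assumptions for illustration, not the builder's confirmed behavior. The defaults mirror the train_lines, test_lines, and seed values in config.json.

```python
import random


def split_pairs(pair_ids, train_lines=10000, test_lines=1000, seed=42):
    """Shuffle deterministically, then take disjoint train and test slices.

    Hypothetical helper: the real builder may split differently
    (for example, per document rather than per line).
    """
    rng = random.Random(seed)      # private RNG so the global state is untouched
    shuffled = list(pair_ids)
    rng.shuffle(shuffled)
    train = shuffled[:train_lines]
    test = shuffled[train_lines:train_lines + test_lines]
    return train, test
```

Running it twice with the same seed gives the same split, which makes a rebuilt dataset comparable with an earlier one.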

Dataset Viewer

Web-based interface for browsing line-image/text pairs
Edit and correct ground truth text
Mark problematic pairs as faulty
Keyboard navigation (arrow keys)
Greek character palette for easy input of special characters
Progress tracking with pairs_status.parquet
Preserves all edits and annotations

Requirements

System Requirements

Python 3.8 or higher
Linux, macOS, or Windows
At least 8GB RAM recommended for processing large PDFs

Python Dependencies

See requirements.txt for the full list. Key dependencies:

docling-parse - For PDF parsing (requires special installation)
opencv-python - For image processing
PyMuPDF - For PDF rendering
streamlit - For the viewer web interface

Installation

Clone or extract this repository, then change into its directory (the path below is an example):

cd /mnt/data/Greek-Tesseract

Create a virtual environment (recommended):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Install docling-parse (special installation required):

# Docling-parse requires specific installation steps
# Follow the official docling-parse documentation for your platform
pip install docling-parse

Configuration

Edit config.json to set your paths and preferences:

{
  "dataset_builder": {
    "pdf_dir": "/path/to/your/pdfs",
    "csv_path": "/path/to/selected_documents.csv",
    "output_dir": "./greek_dataset",
    "cache_dir": "./cache",
    "train_lines": 10000,
    "test_lines": 1000,
    "workers": 7,
    "dpi": 300,
    "seed": 42
  },
  "viewer": {
    "data_dir": "./datasets",
    "default_image_width": 900,
    "default_font_size": 18
  }
}
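
Before a long run it can help to confirm that config.json parses and contains every key shown above. The following is a minimal sketch. The helper name load_config and the strict "all keys required" rule are assumptions, since the tools may accept defaults for some settings.

```python
import json
from pathlib import Path

# Keys copied from the sample config.json above.
REQUIRED_KEYS = {
    "dataset_builder": [
        "pdf_dir", "csv_path", "output_dir", "cache_dir",
        "train_lines", "test_lines", "workers", "dpi", "seed",
    ],
    "viewer": ["data_dir", "default_image_width", "default_font_size"],
}


def load_config(path="config.json"):
    """Load the config and raise KeyError if any expected key is missing.

    Hypothetical helper, not part of the toolkit.
    """
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [
        f"{section}.{key}"
        for section, keys in REQUIRED_KEYS.items()
        for key in keys
        if key not in config.get(section, {})
    ]
    if missing:
        raise KeyError("Missing config keys: " + ", ".join(missing))
    return config
```

Calling load_config() from the repository root either returns the parsed dictionary or names the missing entries, such as viewer.data_dir.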

Usage

Building a Dataset

Prepare a CSV file listing your PDFs with a filename column
Update config.json with your paths
Run the dataset builder:

python tools/build_dataset.py \
    --csv /path/to/selected_documents.csv \
    --pdf-dir /path/to/pdfs \
    --out-dir ./greek_dataset \
    --cache-dir ./cache \
    --train-lines 10000 \
    --test-lines 1000 \
    --workers 7 \
    --verbose

The tool will:

Process PDFs in parallel
Extract line-level text and images
Save as TIFF/text pairs in train/ and test/ folders
Create index.csv with metadata
Resume from where it left off if interrupted
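
The CSV from the first step only needs a filename column. One way to generate it from a folder of PDFs is sketched below. The helper name write_pdf_list is hypothetical, and the sketch assumes the builder expects bare file names that it resolves against --pdf-dir. If your copy expects full paths or extra columns, adjust accordingly.

```python
import csv
from pathlib import Path


def write_pdf_list(pdf_dir, csv_path):
    """Write a CSV with a single `filename` column, one row per PDF in pdf_dir."""
    names = sorted(
        p.name
        for p in Path(pdf_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    # newline="" lets the csv module control line endings on every platform.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filename"])
        writer.writerows([name] for name in names)
    return names
```

For example, write_pdf_list("/path/to/pdfs", "/path/to/selected_documents.csv") produces a file that can be passed to --csv.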

Viewing and Correcting the Dataset

Update config.json with your dataset path
Launch the viewer:

streamlit run tools/viewer_app.py

Open http://localhost:8501 in your browser
Select your dataset folder
Navigate through pairs using:
- Previous/Next buttons
- Arrow keys (← →)
- "Jump to Next Un-viewed" button
Edit ground truth text as needed
Mark faulty pairs with the checkbox
All changes are saved automatically

Keyboard Shortcuts (Viewer)

← / → - Navigate previous/next
↓ - Toggle faulty checkbox
Esc - Exit text area
Ctrl+← / Ctrl+→ - Navigate while in text area

Output Structure

Dataset Builder Output

greek_dataset/
├── train/
│   ├── 000001.tif
│   ├── 000001.gt.txt
│   └── ...
├── test/
│   ├── 000001.tif
│   ├── 000001.gt.txt
│   └── ...
└── index.csv
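
Given this layout, a quick sanity check is to confirm that every image has a ground truth file and vice versa. The sketch below relies only on the file naming shown in the tree. The function name find_unpaired is an assumption and is not part of the toolkit.

```python
from pathlib import Path


def find_unpaired(split_dir):
    """Return (ids with an image but no text, ids with text but no image)."""
    split = Path(split_dir)
    # Strip the full suffix so "000001.gt.txt" and "000001.tif" share the id "000001".
    image_ids = {p.name[: -len(".tif")] for p in split.glob("*.tif")}
    text_ids = {p.name[: -len(".gt.txt")] for p in split.glob("*.gt.txt")}
    return sorted(image_ids - text_ids), sorted(text_ids - image_ids)
```

Calling find_unpaired("greek_dataset/train") and find_unpaired("greek_dataset/test") should return two empty lists each for a complete dataset.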

Viewer Output

pairs_status.parquet - Tracks viewed/faulty status and corrections
.tess_viewer_config.json - User preferences (in home directory)

Workflow Example

Prepare PDFs: Collect Greek PDFs and create a CSV listing them
Build Dataset: Run the dataset builder to extract line pairs
Initial Review: Use the viewer to browse through extracted pairs
Correction: Fix any OCR errors in the ground truth text
Quality Control: Mark problematic pairs as faulty
Export: The corrected dataset is ready for Tesseract training

Technical Details

Dataset Builder

Uses Docling Parse v2 for accurate PDF text extraction
Implements O(N) baseline bucketing for line grouping
Caches intermediate results for efficiency
Handles Greek text encoding properly
Preserves exact text layout and special characters
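
To illustrate the idea behind baseline bucketing, the sketch below groups words into lines by quantizing each word's baseline coordinate into fixed-height buckets. It is not the builder's actual code. The input tuple shape, the bucket_height parameter, and the assumption that y grows downward are all illustrative choices.

```python
from collections import defaultdict


def group_into_lines(words, bucket_height=4.0):
    """Group (text, x, baseline_y) tuples into text lines.

    Each word costs one dictionary insert, so the grouping pass is linear
    in the number of words. The sorts below only restore reading order.
    """
    buckets = defaultdict(list)
    for text, x, baseline_y in words:
        # Words whose baselines fall in the same bucket share a line.
        key = round(baseline_y / bucket_height)
        buckets[key].append((x, text))

    lines = []
    for key in sorted(buckets):                 # top to bottom (y grows downward)
        ordered = sorted(buckets[key])          # left to right within the line
        lines.append(" ".join(text for _, text in ordered))
    return lines
```

A known weakness of fixed buckets is that two words on the same line can land in adjacent buckets when their baselines straddle a bucket boundary, so a real implementation would likely merge neighbouring buckets or choose bucket_height from the font size.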

Viewer App

Built with Streamlit for easy deployment
Preserves all state between sessions
Supports concurrent multi-user editing
Character palette includes Greek letters and symbols
Responsive design adapts to different screen sizes

Troubleshooting

Common Issues

Docling-parse installation fails
- Ensure you have the required system libraries
- Check the official docling documentation
No pairs found in PDF
- Verify the PDF contains extractable text (not just images)
- Check if the PDF is corrupted
Viewer doesn't save changes
- Ensure write permissions in the dataset directory
- Check that pairs_status.parquet isn't locked
Greek characters display incorrectly
- Ensure UTF-8 encoding throughout
- Verify your terminal/browser supports Unicode

Contributing

When modifying these tools:

Preserve the core extraction logic in the dataset builder
Maintain the viewer's state management system
Test with Greek PDFs before committing changes
Document any new configuration options

License

This toolkit is provided as-is for research and educational purposes.

Acknowledgments

Uses Docling Parse v2 for PDF processing
Built on Streamlit for the web interface
Designed for Greek Tesseract OCR improvement
