eellak/Greek-Tesseract

Greek Tesseract Finetuning Dataset Tools

A comprehensive toolkit for creating and validating Greek OCR training datasets for Tesseract finetuning. This project provides tools to extract line-image pairs from Greek PDFs and validate/correct them through an interactive viewer.

Overview

This toolkit consists of two main components:

  1. Dataset Builder (tools/build_dataset.py) - Extracts line-image/text pairs from Greek PDFs
  2. Dataset Viewer (tools/viewer_app.py) - Interactive web app for validating and correcting pairs

Features

Dataset Builder

  • Extracts line-level image/text pairs from Greek PDFs using Docling Parse v2
  • Creates train/test splits (default: 10,000 train, 1,000 test pairs)
  • High-performance processing with multiprocessing support
  • Robust caching mechanism for resumable processing
  • Outputs TIFF images with corresponding ground truth text files
  • Maintains provenance tracking in index.csv
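The train/test split with a fixed seed can be sketched as follows. This is a minimal illustration, not the builder's actual code: the function name `split_pairs` and its input (a list of pair identifiers) are hypothetical, and only the defaults (10,000 train, 1,000 test, seed 42) come from this README.

```python
import random


def split_pairs(pair_ids, train_lines=10000, test_lines=1000, seed=42):
    """Hypothetical helper: shuffle pair ids reproducibly, then slice.

    Sorting first makes the result independent of the input order,
    so the same seed always yields the same split.
    """
    ids = sorted(pair_ids)
    random.Random(seed).shuffle(ids)
    train = ids[:train_lines]
    test = ids[train_lines:train_lines + test_lines]
    return train, test
```

Because the slices do not overlap, no pair can land in both splits; if fewer pairs are available than requested, the test split is simply shorter.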

Dataset Viewer

  • Web-based interface for browsing line-image/text pairs
  • Edit and correct ground truth text
  • Mark problematic pairs as faulty
  • Keyboard navigation (arrow keys)
  • Greek character palette for easy input of special characters
  • Progress tracking with pairs_status.parquet
  • Preserves all edits and annotations

Requirements

System Requirements

  • Python 3.8 or higher
  • Linux, macOS, or Windows
  • At least 8GB RAM recommended for processing large PDFs

Python Dependencies

See requirements.txt for full list. Key dependencies:

  • docling-parse - For PDF parsing (requires special installation)
  • opencv-python - For image processing
  • PyMuPDF - For PDF rendering
  • streamlit - For the viewer web interface

Installation

  1. Clone or extract this repository, then change into its directory (adjust the path to your own checkout):
cd /mnt/data/Greek-Tesseract
  2. Create a virtual environment (recommended):
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Install docling-parse (special installation required):
# Docling-parse requires specific installation steps
# Follow the official docling-parse documentation for your platform
pip install docling-parse

Configuration

Edit config.json to set your paths and preferences:

{
  "dataset_builder": {
    "pdf_dir": "/path/to/your/pdfs",
    "csv_path": "/path/to/selected_documents.csv",
    "output_dir": "./greek_dataset",
    "cache_dir": "./cache",
    "train_lines": 10000,
    "test_lines": 1000,
    "workers": 7,
    "dpi": 300,
    "seed": 42
  },
  "viewer": {
    "data_dir": "./datasets",
    "default_image_width": 900,
    "default_font_size": 18
  }
}
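Before a long run it can help to check that config.json is complete. The sketch below is one way to do that; `load_builder_config` is a hypothetical helper, not part of the toolkit, and the key list is copied from the example above.

```python
import json

# Keys taken from the "dataset_builder" section of the example config.
REQUIRED_BUILDER_KEYS = {
    "pdf_dir", "csv_path", "output_dir", "cache_dir",
    "train_lines", "test_lines", "workers", "dpi", "seed",
}


def load_builder_config(path="config.json"):
    """Hypothetical helper: return the dataset_builder section after basic checks."""
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)

    builder = config.get("dataset_builder", {})
    missing = REQUIRED_BUILDER_KEYS - builder.keys()
    if missing:
        raise KeyError(f"missing dataset_builder keys: {sorted(missing)}")

    # Counts and resolution must be positive integers.
    for key in ("train_lines", "test_lines", "workers", "dpi"):
        value = builder[key]
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{key} must be a positive integer, got {value!r}")
    return builder
```

Calling `load_builder_config("config.json")` returns the builder settings as a dict, or raises with the name of the offending key.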

Usage

Building a Dataset

  1. Prepare a CSV file listing your PDFs with a filename column
  2. Update config.json with your paths
  3. Run the dataset builder:
python tools/build_dataset.py \
    --csv /path/to/selected_documents.csv \
    --pdf-dir /path/to/pdfs \
    --out-dir ./greek_dataset \
    --cache-dir ./cache \
    --train-lines 10000 \
    --test-lines 1000 \
    --workers 7 \
    --verbose

The tool will:

  • Process PDFs in parallel
  • Extract line-level text and images
  • Save as TIFF/text pairs in train/ and test/ folders
  • Create index.csv with metadata
  • Resume from where it left off if interrupted
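Step 1 above asks for a CSV with a filename column. A minimal sketch for generating one from a folder of PDFs follows. `write_pdf_csv` is a hypothetical helper, and it assumes the builder expects bare file names relative to --pdf-dir, which this README does not state explicitly.

```python
import csv
from pathlib import Path


def write_pdf_csv(pdf_dir, csv_path):
    """Hypothetical helper: list the PDFs in pdf_dir in a one-column CSV.

    Assumption: the 'filename' column holds bare file names, not full paths.
    """
    pdf_names = sorted(
        p.name
        for p in Path(pdf_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    # utf-8 keeps Greek file names intact; newline="" avoids blank rows on Windows.
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["filename"])
        writer.writerows([name] for name in pdf_names)
    return len(pdf_names)
```

The return value is the number of PDFs listed, which is a quick check that the folder was read as expected.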

Viewing and Correcting the Dataset

  1. Update the viewer section of config.json (data_dir) with your dataset path
  2. Launch the viewer:
streamlit run tools/viewer_app.py
  3. Open http://localhost:8501 in your browser
  4. Select your dataset folder
  5. Navigate through pairs using:
    • Previous/Next buttons
    • Arrow keys (← →)
    • "Jump to Next Un-viewed" button
  6. Edit ground truth text as needed
  7. Mark faulty pairs with the checkbox
  8. All changes are saved automatically

Keyboard Shortcuts (Viewer)

  • ← / → - Navigate previous/next
  • - Toggle faulty checkbox
  • Esc - Exit text area
  • Ctrl+← / Ctrl+→ - Navigate while in text area

Output Structure

Dataset Builder Output

greek_dataset/
├── train/
│   ├── 000001.tif
│   ├── 000001.gt.txt
│   └── ...
├── test/
│   ├── 000001.tif
│   ├── 000001.gt.txt
│   └── ...
└── index.csv
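Given this layout, every .tif should have a matching .gt.txt with the same stem. The sketch below checks one split folder for orphans; `find_unpaired` is a hypothetical helper, not a tool shipped in this repository.

```python
from pathlib import Path


def find_unpaired(split_dir):
    """Hypothetical helper: report stems that lack their image or text partner.

    Returns (images_without_text, texts_without_image), both sorted.
    """
    split = Path(split_dir)
    # Strip the full suffix by length, since ".gt.txt" is a double extension.
    images = {p.name[: -len(".tif")] for p in split.glob("*.tif")}
    texts = {p.name[: -len(".gt.txt")] for p in split.glob("*.gt.txt")}
    return sorted(images - texts), sorted(texts - images)
```

Running it on greek_dataset/train and greek_dataset/test before training gives two lists that should both be empty.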

Viewer Output

  • pairs_status.parquet - Tracks viewed/faulty status and corrections
  • .tess_viewer_config.json - User preferences (in home directory)

Workflow Example

  1. Prepare PDFs: Collect Greek PDFs and create a CSV listing them
  2. Build Dataset: Run the dataset builder to extract line pairs
  3. Initial Review: Use the viewer to browse through extracted pairs
  4. Correction: Fix any extraction errors in the ground truth text
  5. Quality Control: Mark problematic pairs as faulty
  6. Export: The corrected dataset is ready for Tesseract training

Technical Details

Dataset Builder

  • Uses Docling Parse v2 for accurate PDF text extraction
  • Implements O(N) baseline bucketing for line grouping
  • Caches intermediate results for efficiency
  • Handles Greek text encoding properly
  • Preserves exact text layout and special characters
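The idea behind baseline bucketing can be illustrated as below. This is a generic sketch of the technique, not the builder's implementation: the input format (tuples of text, x position, and baseline y), the `tolerance` parameter, and the assumption that y grows downward are all hypothetical.

```python
from collections import defaultdict


def group_by_baseline(words, tolerance=2.0):
    """Group words into lines by quantizing their baseline y coordinate.

    words: iterable of (text, x, baseline_y) tuples (hypothetical format).
    Each word is assigned to a bucket with one dict lookup, so the
    grouping pass is linear in the number of words.
    """
    buckets = defaultdict(list)
    for text, x, baseline_y in words:
        key = round(baseline_y / tolerance)
        buckets[key].append((x, text))

    # Sorting is only needed to restore reading order:
    # buckets top to bottom, words left to right.
    lines = []
    for key in sorted(buckets):
        ordered = sorted(buckets[key])
        lines.append(" ".join(text for _, text in ordered))
    return lines
```

One known limitation of plain quantization is that two words on the same line can fall on either side of a bucket boundary; a real implementation would need to handle that case, for example by merging adjacent buckets.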

Viewer App

  • Built with Streamlit for easy deployment
  • Preserves all state between sessions
  • Supports concurrent multi-user editing
  • Character palette includes Greek letters and symbols
  • Responsive design adapts to different screen sizes

Troubleshooting

Common Issues

  1. Docling-parse installation fails

    • Ensure you have the required system libraries
    • Check the official docling documentation
  2. No pairs found in PDF

    • Verify the PDF contains extractable text (not just images)
    • Check if the PDF is corrupted
  3. Viewer doesn't save changes

    • Ensure write permissions in the dataset directory
    • Check that pairs_status.parquet isn't locked
  4. Greek characters display incorrectly

    • Ensure UTF-8 encoding throughout
    • Verify your terminal/browser supports Unicode
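For the encoding issue in item 4, a quick check is to decode a ground truth file strictly as UTF-8 and compare it with its NFC-normalized form. The helper name `read_gt_text` is hypothetical, and whether the toolkit itself normalizes to NFC is an assumption; the sketch only shows how to detect a mismatch.

```python
import unicodedata


def read_gt_text(path):
    """Hypothetical helper: read a .gt.txt file as strict UTF-8.

    Returns (nfc_text, changed), where changed is True if normalization
    altered the text, e.g. a letter stored with a separate combining accent.
    Raises UnicodeDecodeError if the file is not valid UTF-8.
    """
    with open(path, "rb") as fh:
        raw = fh.read().decode("utf-8").rstrip("\n")
    normalized = unicodedata.normalize("NFC", raw)
    return normalized, normalized != raw
```

A UnicodeDecodeError here usually points to a file saved in a legacy Greek encoding, while `changed` being True points to decomposed accents that may render or compare differently from precomposed ones.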

Contributing

When modifying these tools:

  • Preserve the core extraction logic in the dataset builder
  • Maintain the viewer's state management system
  • Test with Greek PDFs before committing changes
  • Document any new configuration options

License

This toolkit is provided as-is for research and educational purposes.

Acknowledgments

  • Uses Docling Parse v2 for PDF processing
  • Built on Streamlit for the web interface
  • Designed for Greek Tesseract OCR improvement
