A proof of concept demonstrating the use of the spacy-layout library to convert PDFs into AI-ready structured data. Built on spaCy and Docling, the tool extracts text, layout information, tables, and other structured elements from PDF documents.
Note from the maintainer: I've been trying to find the best way to OCR and process the large collection of PDFs I've scanned and amassed over the years. There are many great tools on the market, and there's always a learning curve to getting them stood up and deployed. I hope this repository helps you understand the tools as much as it's helped me!
- Features
- Installation
- Usage
- Examples
- Validation
- Architecture
- Dependencies
- Security Considerations
- Deployment
- Future Enhancements
- Contributing
- License
- Acknowledgments
- PDF Processing: Converts PDFs to structured spaCy Doc objects
- Layout Analysis: Extracts page layouts, bounding boxes, and section types
- Table Extraction: Automatically detects and extracts tables as pandas DataFrames
- Text Segmentation: Identifies text spans with labels (e.g., title, section_header, text, table)
- Markdown Output: Generates clean markdown representations
- JSON Export: Saves processed data in structured JSON format
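For context, these capabilities come from the Doc extensions that spacy-layout registers. Here is a minimal sketch of direct library usage (attribute names follow the spacy-layout documentation; the sample path is hypothetical):

```python
# Sketch of the underlying spacy-layout API; the sample path is hypothetical.
import spacy
from spacy_layout import spaCyLayout

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)

doc = layout("samples/pdfs/example.pdf")   # returns a spaCy Doc with layout data

print(doc._.layout)                        # document-level layout (pages, sizes)
for span in doc.spans["layout"]:
    print(span.label_, span._.layout)      # e.g. "section_header" plus bounding box

for table in doc._.tables:
    print(table._.data.head())             # table contents as a pandas DataFrame
```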
- Python 3.10 or higher
- Virtual environment (recommended)
- Clone or download this repository
- Create and activate a virtual environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
Development dependencies (for testing and code quality):
- `pytest` - Test framework with coverage reporting
- `pytest-cov` - Coverage plugin for pytest
- `black` - Code formatting
- `flake8` - Linting and style checking
- `bandit` - Security vulnerability scanning
The main dependency is spacy-layout, which automatically installs:
- spaCy
- Docling (for PDF parsing)
- pandas (for table data)
- And other required libraries
Run the main processor script:
```bash
python src/pdf_processor.py
```

This will:
- Download sample PDFs with complex layouts
- Process them using spacy-layout
- Save results to `data/processed_results.json`
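Once the script finishes, the saved JSON can be inspected directly, for example (assuming the default output path and that the file holds a list of records as produced by `save_results`):

```python
import json

# Load the results written by the processor script (default path assumed)
with open("data/processed_results.json", encoding="utf-8") as f:
    results = json.load(f)

for record in results:
    print(record["filename"], "-", len(record.get("spans", [])), "spans")
```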
You can also use the processor directly from Python:

```python
from src.pdf_processor import PDFProcessor
processor = PDFProcessor()
# Process a PDF file
result = processor.process_pdf("path/to/your/document.pdf")
# Access extracted data
print(result["text"]) # Full text content
print(result["layout"]) # Page layout information
print(result["tables"]) # Extracted tables as DataFrames
print(result["markdown"]) # Markdown representation
# Save results
processor.save_results([result], "output.json")
```

The processed output includes:
- filename: Original PDF filename
- text: Complete extracted text
- layout: Page dimensions and metadata
- spans: Text segments with:
  - label (e.g., "text", "title", "section_header", "table")
  - text content
  - character positions
  - bounding box coordinates
  - page number
- tables: Extracted tabular data as pandas DataFrames
- markdown: Clean markdown representation
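Put together, a single record has roughly this shape (values and the exact keys inside spans are illustrative; check the generated JSON for the precise field names):

```python
# Illustrative shape of one processed record; values are made up and the
# keys inside "spans" are examples, not a guaranteed schema.
result = {
    "filename": "f1040.pdf",
    "text": "Form 1040 U.S. Individual Income Tax Return ...",
    "layout": {"pages": [{"page_no": 1, "width": 612.0, "height": 792.0}]},
    "spans": [
        {
            "label": "section_header",
            "text": "Filing Status",
            "start": 120,
            "end": 133,
            "bounding_box": {"x": 36.0, "y": 180.5, "width": 150.0, "height": 12.0},
            "page": 1,
        },
    ],
    "tables": [],       # pandas DataFrames in memory; serialized rows in the JSON output
    "markdown": "Form 1040 ...",
}
```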
The IRS Form 1040 demonstrates complex form layouts with multiple sections, checkboxes, and structured fields. The tool extracts:
- Header information and form metadata
- Section labels and field descriptions
- Layout coordinates for precise positioning
- Text spans categorized by content type
The W3C table example shows table extraction capabilities:
```json
{
"tables": [
{
"data": [
{
"Disability Category": "Blind",
"Participants": "5",
"Ballots Completed": "1",
"Results.Accuracy": "34.5%, n=1"
}
// ... more rows
]
}
]
}
```

The POC has been tested with:
- Complex forms (IRS tax forms)
- Documents with tables (accessibility examples)
- Multi-page documents
- Various fonts and layouts
Run tests with coverage and test analytics reporting:
```bash
# Run tests with coverage and JUnit XML output
pytest --cov --cov-branch --cov-report=xml --junitxml=junit.xml -o junit_family=legacy
# Run tests (using configuration from pyproject.toml)
pytest
# Generate HTML coverage report
pytest --cov-report=html
# Run specific test file
pytest tests/test_pdf_processor.py
```

This project uses Codecov for comprehensive code quality monitoring:
- Coverage Requirement: 80% minimum
- Branch Coverage: Enabled
- Reports: HTML, XML, and terminal output
- CI Integration: Automatic upload on every build
- Test Performance: Track test run times and identify slow tests
- Failure Analysis: Monitor failure rates and flaky test detection
- CI Insights: Failed tests visible in PR comments and dashboard
- JUnit XML: Standard test results format for CI/CD integration
This project implements comprehensive continuous integration and quality assurance:
- Multi-Python Support: Tests run on Python 3.10, 3.11, 3.12, and 3.13
- Code Quality Checks:
  - Black: Code formatting validation
  - Flake8: Linting and style enforcement
  - Bandit: Security vulnerability scanning
- Test Analytics: Performance monitoring and failure analysis via Codecov
- Coverage Reports: Automatic upload of coverage and test analytics data
- PR Comments: Failed tests and coverage changes visible in pull requests
- Dashboard: Comprehensive test health and performance insights
- Branch Protection: 80% coverage requirement enforced
- Coverage Threshold: Minimum 80% code coverage required
- Branch Coverage: Enabled for comprehensive coverage analysis
- Test Results: JUnit XML output for CI/CD integration
- Security Scanning: Automated vulnerability detection
- Automated Vulnerability Scanning: GitHub CodeQL performs semantic code analysis
- Scheduled Analysis: Weekly security scans on the main branch
- Custom Configuration: Excludes test files and focuses on production code
- Language Support: Python-specific security queries and vulnerability detection
- Security Alerts: Automatic alerts for discovered vulnerabilities
```text
src/
├── pdf_processor.py          # Main processing logic
tests/
├── test_pdf_processor.py     # Unit tests
samples/
├── pdfs/                     # Downloaded sample PDFs
├── processed_results.json    # Processing output
docs/                         # Documentation
```
Key dependencies (see requirements.txt):
- spacy-layout: Core PDF processing
- requests: PDF downloading
- pandas: Data manipulation
- spacy: NLP framework
- docling: Document parsing
- PDFs are processed locally; no data is sent to external services
- Downloaded PDFs are cached in the `data/pdfs/` directory
- No sensitive data handling in this POC
- Set up virtual environment
- Install dependencies
- Run processor on your PDFs
- For large-scale processing, use `spaCyLayout.pipe()` for batch processing (see the sketch below)
- Consider serialization with `DocBin` for caching processed documents
- Monitor memory usage for large PDFs
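A minimal sketch of that pattern, combining `spaCyLayout.pipe()` with spaCy's `DocBin` (directory and file names are illustrative):

```python
from pathlib import Path

import spacy
from spacy.tokens import DocBin
from spacy_layout import spaCyLayout

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)

# Batch-process every PDF in a directory (illustrative location)
pdf_paths = [str(p) for p in sorted(Path("data/pdfs").glob("*.pdf"))]
doc_bin = DocBin(store_user_data=True)  # keep the custom layout attributes

for doc in layout.pipe(pdf_paths):
    doc_bin.add(doc)

# Cache processed documents so the PDFs don't need to be re-parsed
doc_bin.to_disk("data/processed_docs.spacy")

# Later: reload the cached Docs instead of reprocessing
cached = list(DocBin().from_disk("data/processed_docs.spacy").get_docs(nlp.vocab))
```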
- Integration with LLM pipelines for content analysis
- Support for additional document formats
- Web interface for PDF upload and processing
- Advanced table structure recognition
- Custom layout parsers for domain-specific documents
We welcome contributions! Please see our Contributing Guide for detailed information on:
- Development setup and workflow
- Code standards and best practices
- Testing requirements
- Pull request process
- Fork the repository
- Create a feature branch: `git checkout -b feature/your-feature-name`
- Make your changes following our contributing guidelines
- Run tests: `pytest --cov --cov-branch --cov-report=xml --junitxml=junit.xml -o junit_family=legacy`
- Submit a pull request using our PR template
- Python 3.10+
- All tests pass
- Code coverage ≥80%
- No linting errors
- Documentation updated
This project is licensed under the GNU General Public License v3.0 (GPL-3.0). See the LICENSE file for details.
Note: This project depends on third-party libraries like spacy-layout (MIT License). When using or redistributing, please comply with their respective licenses. For spacy-layout's terms, refer to its repository or installed package.
- spaCy for the NLP framework
- Docling for document parsing
- spaCy Layout for layout analysis