A proof of concept demonstrating the use of the spacy-layout library to convert PDFs into AI-ready structured data. Built on spaCy and Docling, the tool extracts text, layout information, tables, and other structured elements from PDF documents.
Note from the maintainer: I've been trying to find the best way to OCR and process the large collection of PDFs I've scanned and amassed over the years. There are many great tools on the market, and there's always a learning curve to getting them stood up and deployed. I hope this repository helps you understand the tools as much as it's helped me!
- Features
- Installation
- Usage
- Examples
- Validation
- Architecture
- Dependencies
- Security Considerations
- Deployment
- Future Enhancements
- Contributing
- License
- Acknowledgments
- PDF Processing: Converts PDFs to structured spaCy Doc objects
- Layout Analysis: Extracts page layouts, bounding boxes, and section types
- Table Extraction: Automatically detects and extracts tables as pandas DataFrames
- Text Segmentation: Identifies text spans with labels (e.g., title, section_header, text, table)
- Markdown Output: Generates clean markdown representations
- JSON Export: Saves processed data in structured JSON format
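For context, these capabilities come from the Doc extensions that spacy-layout registers. Here is a minimal sketch of direct library usage (attribute names follow the spacy-layout documentation; the sample path is hypothetical):

```python
# Sketch of the underlying spacy-layout API; the sample path is hypothetical.
import spacy
from spacy_layout import spaCyLayout

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)

doc = layout("samples/pdfs/example.pdf")   # returns a spaCy Doc with layout data

print(doc._.layout)                        # document-level layout (pages, sizes)
for span in doc.spans["layout"]:
    print(span.label_, span._.layout)      # e.g. "section_header" plus bounding box

for table in doc._.tables:
    print(table._.data.head())             # table contents as a pandas DataFrame
```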
- Python 3.10 or higher
- Virtual environment (recommended)
- Clone or download this repository
- Create and activate a virtual environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
Development dependencies (for testing and code quality):
- `pytest` - Test framework with coverage reporting
- `pytest-cov` - Coverage plugin for pytest
- `black` - Code formatting
- `flake8` - Linting and style checking
- `bandit` - Security vulnerability scanning
The main dependency is spacy-layout, which automatically installs:
- spaCy
- Docling (for PDF parsing)
- pandas (for table data)
- And other required libraries
Run the main processor script:
```bash
python src/pdf_processor.py
```

This will:
- Download sample PDFs with complex layouts
- Process them using spacy-layout
- Save results to `data/processed_results.json`
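Once the script finishes, the saved JSON can be inspected directly, for example (assuming the default output path and that the file holds a list of records as produced by `save_results`):

```python
import json

# Load the results written by the processor script (default path assumed)
with open("data/processed_results.json", encoding="utf-8") as f:
    results = json.load(f)

for record in results:
    print(record["filename"], "-", len(record.get("spans", [])), "spans")
```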
You can also use the processor directly from Python:

```python
from src.pdf_processor import PDFProcessor
processor = PDFProcessor()
# Process a PDF file
result = processor.process_pdf("path/to/your/document.pdf")
# Access extracted data
print(result["text"]) # Full text content
print(result["layout"]) # Page layout information
print(result["tables"]) # Extracted tables as DataFrames
print(result["markdown"]) # Markdown representation
# Save results
processor.save_results([result], "output.json")
```

The processed output includes:
- filename: Original PDF filename
- text: Complete extracted text
- layout: Page dimensions and metadata
- spans: Text segments with:
  - label (e.g., "text", "title", "section_header", "table")
  - text content
  - character positions
  - bounding box coordinates
  - page number
- tables: Extracted tabular data as pandas DataFrames
- markdown: Clean markdown representation
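Put together, a single record has roughly this shape (values and the exact keys inside spans are illustrative; check the generated JSON for the precise field names):

```python
# Illustrative shape of one processed record; values are made up and the
# keys inside "spans" are examples, not a guaranteed schema.
result = {
    "filename": "f1040.pdf",
    "text": "Form 1040 U.S. Individual Income Tax Return ...",
    "layout": {"pages": [{"page_no": 1, "width": 612.0, "height": 792.0}]},
    "spans": [
        {
            "label": "section_header",
            "text": "Filing Status",
            "start": 120,
            "end": 133,
            "bounding_box": {"x": 36.0, "y": 180.5, "width": 150.0, "height": 12.0},
            "page": 1,
        },
    ],
    "tables": [],       # pandas DataFrames in memory; serialized rows in the JSON output
    "markdown": "Form 1040 ...",
}
```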
The IRS Form 1040 demonstrates complex form layouts with multiple sections, checkboxes, and structured fields. The tool extracts:
- Header information and form metadata
- Section labels and field descriptions
- Layout coordinates for precise positioning
- Text spans categorized by content type
The W3C table example shows table extraction capabilities:
```json
{
"tables": [
{
"data": [
{
"Disability Category": "Blind",
"Participants": "5",
"Ballots Completed": "1",
"Results.Accuracy": "34.5%, n=1"
}
// ... more rows
]
}
]
}
```

The POC has been tested with:
- Complex forms (IRS tax forms)
- Documents with tables (accessibility examples)
- Multi-page documents
- Various fonts and layouts
Run tests with coverage and test analytics reporting:
```bash
# Run tests with coverage and JUnit XML output
pytest --cov --cov-branch --cov-report=xml --junitxml=junit.xml -o junit_family=legacy
# Run tests (using configuration from pyproject.toml)
pytest
# Generate HTML coverage report
pytest --cov-report=html
# Run specific test file
pytest tests/test_pdf_processor.py
```

This project uses Codecov for comprehensive code quality monitoring:
- Coverage Requirement: 80% minimum
- Branch Coverage: Enabled
- Reports: HTML, XML, and terminal output
- CI Integration: Automatic upload on every build
- Test Performance: Track test run times and identify slow tests
- Failure Analysis: Monitor failure rates and flaky test detection
- CI Insights: Failed tests visible in PR comments and dashboard
- JUnit XML: Standard test results format for CI/CD integration
This project implements comprehensive continuous integration and quality assurance:
- Multi-Python Support: Tests run on Python 3.10, 3.11, 3.12, and 3.13
- Code Quality Checks:
  - Black: Code formatting validation
  - Flake8: Linting and style enforcement
  - Bandit: Security vulnerability scanning
- Test Analytics: Performance monitoring and failure analysis via Codecov
- Coverage Reports: Automatic upload of coverage and test analytics data
- PR Comments: Failed tests and coverage changes visible in pull requests
- Dashboard: Comprehensive test health and performance insights
- Branch Protection: 80% coverage requirement enforced
- Coverage Threshold: Minimum 80% code coverage required
- Branch Coverage: Enabled for comprehensive coverage analysis
- Test Results: JUnit XML output for CI/CD integration
- Security Scanning: Automated vulnerability detection
- Automated Vulnerability Scanning: GitHub CodeQL performs semantic code analysis
- Scheduled Analysis: Weekly security scans on the main branch
- Custom Configuration: Excludes test files and focuses on production code
- Language Support: Python-specific security queries and vulnerability detection
- Security Alerts: Automatic alerts for discovered vulnerabilities
```text
src/
├── pdf_processor.py          # Main processing logic
tests/
├── test_pdf_processor.py     # Unit tests
samples/
├── pdfs/                     # Downloaded sample PDFs
├── processed_results.json    # Processing output
docs/                         # Documentation
```
Key dependencies (see requirements.txt):
- spacy-layout: Core PDF processing
- requests: PDF downloading
- pandas: Data manipulation
- spacy: NLP framework
- docling: Document parsing
- PDFs are processed locally; no data is sent to external services
- Downloaded PDFs are cached in the `data/pdfs/` directory
- No sensitive data handling in this POC
- Set up virtual environment
- Install dependencies
- Run processor on your PDFs
- For large-scale processing, use `spaCyLayout.pipe()` for batch processing (see the sketch below)
- Consider serialization with `DocBin` for caching processed documents
- Monitor memory usage for large PDFs
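A minimal sketch of that pattern, combining `spaCyLayout.pipe()` with spaCy's `DocBin` (directory and file names are illustrative):

```python
from pathlib import Path

import spacy
from spacy.tokens import DocBin
from spacy_layout import spaCyLayout

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)

# Batch-process every PDF in a directory (illustrative location)
pdf_paths = [str(p) for p in sorted(Path("data/pdfs").glob("*.pdf"))]
doc_bin = DocBin(store_user_data=True)  # keep the custom layout attributes

for doc in layout.pipe(pdf_paths):
    doc_bin.add(doc)

# Cache processed documents so the PDFs don't need to be re-parsed
doc_bin.to_disk("data/processed_docs.spacy")

# Later: reload the cached Docs instead of reprocessing
cached = list(DocBin().from_disk("data/processed_docs.spacy").get_docs(nlp.vocab))
```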
- Integration with LLM pipelines for content analysis
- Support for additional document formats
- Web interface for PDF upload and processing
- Advanced table structure recognition
- Custom layout parsers for domain-specific documents
We welcome contributions! Please see our Contributing Guide for detailed information on:
- Development setup and workflow
- Code standards and best practices
- Testing requirements
- Pull request process
- Fork the repository
- Create a feature branch: `git checkout -b feature/your-feature-name`
- Make your changes following our contributing guidelines
- Run tests: `pytest --cov --cov-branch --cov-report=xml --junitxml=junit.xml -o junit_family=legacy`
- Submit a pull request using our PR template
- Python 3.10+
- All tests pass
- Code coverage ≥80%
- No linting errors
- Documentation updated
This project is licensed under the GNU General Public License v3.0 (GPL-3.0). See the LICENSE file for details.
Note: This project depends on third-party libraries like spacy-layout (MIT License). When using or redistributing, please comply with their respective licenses. For spacy-layout's terms, refer to its repository or installed package.
- spaCy for the NLP framework
- Docling for document parsing
- spaCy Layout for layout analysis