A Vision-Language-Model pipeline that intelligently identifies and extracts clauses from scanned PDFs, handwritten contracts, and image-based documents, outputting structured JSON data.
- Multimodal Processing: Handles scanned PDFs, images, and handwritten documents
- Clause Detection: Advanced OCR + Vision-Language Models for precise clause identification
- Structured Output: Exports extracted data as JSON, XML, or CSV formats
- Legal Template Recognition: Pre-trained on common contract types (NDAs, employment, leases)
- Batch Processing: Handle multiple documents simultaneously
- Confidence Scoring: Quality assessment for each extracted clause
- Human-in-the-Loop: Review interface for verification and corrections
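As an illustration of how the per-clause confidence scores might feed the human-in-the-loop review step, here is a small sketch. The `Clause` class and `route_for_review` helper are hypothetical stand-ins, not part of the library's actual API:

```python
from dataclasses import dataclass

@dataclass
class Clause:
    # Hypothetical stand-in for an extracted clause record.
    type: str
    text: str
    confidence: float

def route_for_review(clauses, threshold=0.85):
    """Split clauses into auto-accepted and human-review buckets."""
    accepted = [c for c in clauses if c.confidence >= threshold]
    review = [c for c in clauses if c.confidence < threshold]
    return accepted, review

clauses = [
    Clause("termination", "The Company may terminate...", 0.94),
    Clause("compensation", "Employee shall receive...", 0.62),
]
accepted, review = route_for_review(clauses)
```

The threshold would typically be tuned per contract type rather than fixed globally.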
```bash
# Install dependencies
pip install -r requirements.txt

# Process a single contract
python extract.py --file contract.pdf --output extracted_data.json

# Batch process multiple files
python batch_extract.py --input-dir ./contracts --output-dir ./results

# Enable debug logging
python extract.py --file contract.pdf --log-level debug

# Start web interface for interactive processing
streamlit run web_app.py

# Check CLI version
python extract.py --version
python batch_extract.py --version
```
Create a virtual environment and install both runtime and development dependencies:
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -r requirements-dev.txt
pip install -e .
```
Run linting, security checks and the tests to verify your setup:
```bash
ruff check .
bandit -r src -q
pytest -q
```
These same checks run automatically on every pull request via GitHub Actions.
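The workflow file itself is not shown here; a GitHub Actions job running these same three checks could look roughly like the following (the file name, action versions, and Python version are assumptions):

```yaml
# .github/workflows/ci.yml (illustrative sketch, not the repo's actual workflow)
name: CI
on: [pull_request]
jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt -r requirements-dev.txt
      - run: ruff check .
      - run: bandit -r src -q
      - run: pytest -q
```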
- PDF Documents: Native and scanned PDFs
- Image Files: PNG, JPEG, TIFF, BMP
- Handwritten Documents: Cursive and print handwriting
- Multi-page Contracts: Automatic page sequencing
- Low-quality Scans: Advanced preprocessing and enhancement
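The preprocessing applied to low-quality scans is not spelled out here; as a toy, stdlib-only sketch of one common enhancement technique (min-max contrast stretching over grayscale pixel values), assuming pixels as nested lists rather than real image arrays:

```python
def contrast_stretch(pixels):
    """Linearly rescale grayscale values to the full 0-255 range.

    `pixels` is a row-major list of lists of ints; a real pipeline
    would operate on image arrays (e.g. via Pillow or OpenCV).
    """
    flat = [p for row in pixels for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # flat image: nothing to stretch
        return [[0 for _ in row] for row in pixels]
    # Map lo -> 0 and hi -> 255, interpolating linearly in between.
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in pixels]

faded = [[100, 110], [120, 130]]
stretched = contrast_stretch(faded)  # low-contrast scan spread to full range
```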
- Non-Disclosure Agreements (NDAs)
- Employment Contracts
- Lease Agreements
- Service Agreements
- Purchase Orders
- Partnership Agreements
- Licensing Agreements
```
Document Input → Preprocessing → OCR Engine → VLM Analysis → Clause Extraction → JSON Output
                      ↓              ↓             ↓                ↓                ↓
                Image Enhance   Text Extract  Semantic Parse  Structure Map      Validate
```
The application supports flexible configuration via YAML files and environment variables, following the Twelve-Factor App methodology.
Create a `config.yml` file in your project directory:
```yaml
# Multimodal Contract Extractor Configuration
ocr:
  cache_size_limit: 100
  context_window_size: 100

extraction:
  base_confidence_score: 0.75
  length_bonus_divisor: 1000
  max_confidence_cap: 0.95
  file_size_threshold_mb: 10
  streaming_chunk_size: 5

security:
  max_file_size_mb: 100
  request_id_length_limit: 64

health:
  check_timeout_seconds: 5

document:
  default_streaming_chunk_size: 10
```
See `config.example.yml` for a complete example with detailed documentation.
Override any configuration setting using environment variables with the format `MCE_<SECTION>_<SETTING>`:
```bash
# OCR settings
export MCE_OCR_CACHE_SIZE_LIMIT=200
export MCE_OCR_CONTEXT_WINDOW_SIZE=150

# Extraction settings
export MCE_EXTRACTION_BASE_CONFIDENCE_SCORE=0.8
export MCE_EXTRACTION_MAX_CONFIDENCE_CAP=0.98

# Security settings
export MCE_SECURITY_MAX_FILE_SIZE_MB=150

# Health check settings
export MCE_HEALTH_CHECK_TIMEOUT_SECONDS=10
```
Environment variables take precedence over file settings.
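This precedence can be implemented by consulting the environment after reading the file; a simplified sketch of the idea (the real loader's internals may differ, and `apply_env_overrides` is a hypothetical name):

```python
import os

def apply_env_overrides(config, prefix="MCE"):
    """Override nested config values from MCE_<SECTION>_<SETTING> variables.

    Assumes a two-level dict like {"ocr": {"cache_size_limit": 100}};
    the real loader may support deeper nesting and richer type coercion.
    """
    for section, settings in config.items():
        for key, current in settings.items():
            env_name = f"{prefix}_{section}_{key}".upper()
            raw = os.environ.get(env_name)
            if raw is not None:
                # Coerce to the existing value's type (int, float, or str).
                settings[key] = type(current)(raw)
    return config

os.environ["MCE_OCR_CACHE_SIZE_LIMIT"] = "200"
cfg = apply_env_overrides({"ocr": {"cache_size_limit": 100}})
```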
```python
from multimodal_contract_extractor import load_config, get_config

# Load configuration from file and environment
config = load_config(config_path='config.yml')

# Get current configuration (loads defaults if not configured)
config = get_config()

# Access configuration values
print(f"Cache limit: {config.ocr.cache_size_limit}")
print(f"Max file size: {config.security.max_file_size_mb}MB")
```
The application implements comprehensive security measures for file handling and processing:
- Automatic Cleanup: Temporary files are automatically cleaned up using context managers
- Restricted Permissions: Temporary files are created with owner-only access (`0o600`)
- Path Sanitization: File extensions are sanitized to prevent path-related security issues
- Size Limits: Configurable file size limits prevent denial-of-service attacks
- Exception Safety: Files are cleaned up even when processing fails
```python
# Secure file processing with automatic cleanup
from web_app import TempFileManager

with TempFileManager(uploaded_file) as tmp_path:
    # File is automatically cleaned up when exiting this block
    result = process_document(tmp_path)
```
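The real `TempFileManager` lives in `web_app`; as a minimal sketch of how such a context manager can guarantee owner-only permissions and cleanup even when processing fails (the constructor signature here is an assumption):

```python
import os
import tempfile

class TempFileSketch:
    """Illustrative stand-in for web_app.TempFileManager."""

    def __init__(self, data: bytes, suffix: str = ".pdf"):
        self.data = data
        self.suffix = suffix
        self.path = None

    def __enter__(self):
        # mkstemp keeps the path usable after the descriptor is closed.
        fd, self.path = tempfile.mkstemp(suffix=self.suffix)
        os.chmod(self.path, 0o600)  # owner-only access
        with os.fdopen(fd, "wb") as fh:
            fh.write(self.data)
        return self.path

    def __exit__(self, exc_type, exc, tb):
        # Runs even if the body raised, so the temp file never leaks.
        if self.path and os.path.exists(self.path):
            os.remove(self.path)
        return False  # never swallow exceptions

with TempFileSketch(b"%PDF-1.4 ...") as tmp:
    existed = os.path.exists(tmp)  # True inside the block
```

After the `with` block exits, the file is gone regardless of whether the body succeeded.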
```python
from multimodal_contract_extractor import load_document, detect_clauses

document = load_document("nda.pdf")
clauses = detect_clauses(document)
for clause in clauses:
    print(clause.type, clause.text)
```
```bash
# Process a directory of files
python batch_extract.py --input-dir ./contracts --output-dir ./extracted
```
```python
from multimodal_contract_extractor import load_document, detect_clauses

custom = {"renewal_terms": ["renewal", "extend", "continuation"]}
doc = load_document("service_agreement.pdf")
clauses = detect_clauses(doc, keywords=custom)
```
```json
{
  "document_info": {
    "filename": "employment_contract.pdf",
    "pages": 5,
    "processing_time": 23.4,
    "overall_confidence": 0.89,
    "document_type": "employment_agreement"
  },
  "parties": [
    {
      "role": "employer",
      "name": "TechCorp Inc.",
      "address": "123 Silicon Valley, CA 94025"
    },
    {
      "role": "employee",
      "name": "John Doe",
      "address": "456 Residential St, CA 94025"
    }
  ],
  "clauses": [
    {
      "id": "clause_001",
      "type": "termination",
      "title": "Termination for Cause",
      "text": "The Company may terminate this agreement immediately upon written notice if Employee...",
      "page": 3,
      "coordinates": [50, 300, 550, 450],
      "confidence": 0.94,
      "key_terms": ["immediate termination", "written notice", "cause"]
    },
    {
      "id": "clause_002",
      "type": "compensation",
      "title": "Base Salary",
      "text": "Employee shall receive an annual salary of $85,000, payable in bi-weekly installments...",
      "page": 2,
      "coordinates": [50, 150, 550, 220],
      "confidence": 0.97,
      "key_terms": ["$85,000", "bi-weekly", "annual salary"]
    }
  ],
  "metadata": {
    "extraction_timestamp": "2024-01-15T10:30:00Z",
    "model_version": "v2.1.0",
    "processing_method": "multimodal_vlm"
  }
}
```
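Once written, this output can be post-processed with the standard library alone; for example, pulling out high-confidence clause titles (the `sample` string below is a trimmed copy of the structure above):

```python
import json

sample = """
{
  "clauses": [
    {"id": "clause_001", "type": "termination", "title": "Termination for Cause", "confidence": 0.94},
    {"id": "clause_002", "type": "compensation", "title": "Base Salary", "confidence": 0.97}
  ]
}
"""

result = json.loads(sample)
# Keep only clauses scoring at or above 0.95 confidence.
high_conf = [c["title"] for c in result["clauses"] if c["confidence"] >= 0.95]
```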
```bash
# Train on domain-specific contracts
python train.py --dataset legal_contracts_dataset --epochs 10

# Fine-tune for specific contract types
python fine_tune.py --contract-type "real_estate" --examples ./real_estate_samples
```
- Confidence Scoring: ML-based confidence assessment
- Cross-validation: Multiple model consensus
- Human Review: Built-in review interface
- Error Detection: Automatic inconsistency flagging
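"Multiple model consensus" is not specified further in this document; one common realization is to keep only clauses that a minimum number of models agree on, averaging their confidences. A hypothetical sketch that skips the span-alignment a real system would need:

```python
from collections import defaultdict

def consensus(model_outputs, min_votes=2):
    """Keep clause types detected by at least `min_votes` models,
    with their confidence scores averaged.

    `model_outputs` is a list (one entry per model) of
    {clause_type: confidence} dicts; real outputs would also
    require aligning text spans, which this sketch omits.
    """
    votes = defaultdict(list)
    for output in model_outputs:
        for clause_type, conf in output.items():
            votes[clause_type].append(conf)
    return {
        t: sum(confs) / len(confs)
        for t, confs in votes.items()
        if len(confs) >= min_votes
    }

merged = consensus([
    {"termination": 0.90, "compensation": 0.80},
    {"termination": 0.94},
    {"indemnity": 0.70},
])
# Only "termination" reaches two votes; its scores are averaged.
```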
```
# REST API
POST /api/extract
Content-Type: multipart/form-data

# GraphQL API
mutation {
  extractContract(file: $file) {
    clauses {
      type
      text
      confidence
    }
  }
}

# Webhook Integration
POST /webhooks/document-processed
{
  "document_id": "doc_123",
  "status": "completed",
  "clauses_extracted": 15
}
```
Install optional GPU dependencies if you want CUDA acceleration:
```bash
# Install with GPU support
pip install -r requirements-gpu.txt

# Run with CUDA acceleration
python extract.py --gpu --batch-size 8
```
```yaml
# docker-compose.yml
version: '3.8'
services:
  contract-extractor:
    image: contract-extractor:latest
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - AZURE_VISION_KEY=${AZURE_VISION_KEY}
    volumes:
      - ./contracts:/app/input
      - ./results:/app/output
```
- GDPR Compliance: Data protection and privacy controls
- Audit Trails: Complete processing history
- Role-based Access: User permission management
- SLA Monitoring: Performance and uptime tracking
- Custom Deployment: On-premises or private cloud options
For very large PDFs, use `stream_document` to load pages in chunks and reduce memory usage.
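The chunking behavior can be pictured with a toy generator; the real `stream_document` API may differ, this only illustrates yielding pages in fixed-size batches so that one chunk is materialized at a time:

```python
def stream_pages(pages, chunk_size=10):
    """Yield successive chunks of pages instead of loading them all at once."""
    for start in range(0, len(pages), chunk_size):
        # Only this slice of pages lives in memory per iteration.
        yield pages[start:start + chunk_size]

# A 25-page document streamed in chunks of 10 pages.
chunks = list(stream_pages(list(range(25)), chunk_size=10))
```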
| Document Type | Avg Processing Time | Accuracy | Confidence |
|---|---|---|---|
| Native PDF | 5.2s | 96.3% | 0.94 |
| Scanned PDF | 12.8s | 91.7% | 0.88 |
| Handwritten | 18.4s | 87.2% | 0.82 |
| Low Quality | 25.1s | 83.9% | 0.78 |
We welcome contributions in these areas:
- Support for additional document formats
- New contract type templates
- OCR engine integrations
- Performance optimizations
- Multilingual support
See CONTRIBUTING.md for development guidelines.
- Data Privacy: Processes documents locally by default
- No Data Retention: Documents are not stored unless explicitly configured
- Audit Logging: Complete processing audit trails
- Compliance Standards: SOC 2, GDPR, HIPAA ready
This repository includes an automated hygiene bot that ensures it meets GitHub community standards and security best practices. See HYGIENE_BOT.md for details.
The bot runs weekly via GitHub Actions and can also be run manually:
```bash
# Set your GitHub token
export GITHUB_TOKEN=your_token_here

# Run hygiene check
./run_hygiene.sh
```
MIT License - see LICENSE file for details.
This tool is for document processing assistance only. All extracted information should be reviewed by qualified legal professionals before use in any legal context.