A Vision-Language-Model pipeline that intelligently identifies and extracts clauses from scanned PDFs, handwritten contracts, and image-based documents, outputting structured JSON data.
- Multimodal Processing: Handles scanned PDFs, images, and handwritten documents
- Clause Detection: Advanced OCR + Vision-Language Models for precise clause identification
- Structured Output: Exports extracted data as JSON, XML, or CSV formats
- Legal Template Recognition: Pre-trained on common contract types (NDAs, employment, leases)
- Batch Processing: Handle multiple documents simultaneously
- Confidence Scoring: Quality assessment for each extracted clause
- Human-in-the-Loop: Review interface for verification and corrections
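As an illustration of how the per-clause confidence scores might feed the human-in-the-loop review step, here is a small sketch. The `Clause` class and `route_for_review` helper are hypothetical stand-ins, not part of the library's actual API:

```python
from dataclasses import dataclass

@dataclass
class Clause:
    # Hypothetical stand-in for an extracted clause record.
    type: str
    text: str
    confidence: float

def route_for_review(clauses, threshold=0.85):
    """Split clauses into auto-accepted and human-review buckets."""
    accepted = [c for c in clauses if c.confidence >= threshold]
    review = [c for c in clauses if c.confidence < threshold]
    return accepted, review

clauses = [
    Clause("termination", "The Company may terminate...", 0.94),
    Clause("compensation", "Employee shall receive...", 0.62),
]
accepted, review = route_for_review(clauses)
```

The threshold would typically be tuned per contract type rather than fixed globally.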
```bash
# Install dependencies
pip install -r requirements.txt

# Process a single contract
python extract.py --file contract.pdf --output extracted_data.json

# Batch process multiple files
python batch_extract.py --input-dir ./contracts --output-dir ./results

# Enable debug logging
python extract.py --file contract.pdf --log-level debug

# Start web interface for interactive processing
streamlit run web_app.py

# Check CLI version
python extract.py --version
python batch_extract.py --version
```
Create a virtual environment and install both runtime and development dependencies:
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -r requirements-dev.txt
pip install -e .
```
Run linting, security checks and the tests to verify your setup:
```bash
ruff check .
bandit -r src -q
pytest -q
```
These same checks run automatically on every pull request via GitHub Actions.
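The workflow file itself is not shown here; a GitHub Actions job running these same three checks could look roughly like the following (the file name, action versions, and Python version are assumptions):

```yaml
# .github/workflows/ci.yml (illustrative sketch, not the repo's actual workflow)
name: CI
on: [pull_request]
jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt -r requirements-dev.txt
      - run: ruff check .
      - run: bandit -r src -q
      - run: pytest -q
```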
- PDF Documents: Native and scanned PDFs
- Image Files: PNG, JPEG, TIFF, BMP
- Handwritten Documents: Cursive and print handwriting
- Multi-page Contracts: Automatic page sequencing
- Low-quality Scans: Advanced preprocessing and enhancement
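The preprocessing applied to low-quality scans is not spelled out here; as a toy, stdlib-only sketch of one common enhancement technique (min-max contrast stretching over grayscale pixel values), assuming pixels as nested lists rather than real image arrays:

```python
def contrast_stretch(pixels):
    """Linearly rescale grayscale values to the full 0-255 range.

    `pixels` is a row-major list of lists of ints; a real pipeline
    would operate on image arrays (e.g. via Pillow or OpenCV).
    """
    flat = [p for row in pixels for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # flat image: nothing to stretch
        return [[0 for _ in row] for row in pixels]
    # Map lo -> 0 and hi -> 255, interpolating linearly in between.
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in pixels]

faded = [[100, 110], [120, 130]]
stretched = contrast_stretch(faded)  # low-contrast scan spread to full range
```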
- Non-Disclosure Agreements (NDAs)
- Employment Contracts
- Lease Agreements
- Service Agreements
- Purchase Orders
- Partnership Agreements
- Licensing Agreements
```
Document Input → Preprocessing → OCR Engine → VLM Analysis → Clause Extraction → JSON Output
                      ↓              ↓             ↓                ↓                ↓
                Image Enhance   Text Extract  Semantic Parse  Structure Map      Validate
```
The application supports flexible configuration via YAML files and environment variables, following the Twelve-Factor App methodology.
Create a `config.yml` file in your project directory:
```yaml
# Multimodal Contract Extractor Configuration
ocr:
  cache_size_limit: 100
  context_window_size: 100

extraction:
  base_confidence_score: 0.75
  length_bonus_divisor: 1000
  max_confidence_cap: 0.95
  file_size_threshold_mb: 10
  streaming_chunk_size: 5

security:
  max_file_size_mb: 100
  request_id_length_limit: 64

health:
  check_timeout_seconds: 5

document:
  default_streaming_chunk_size: 10
```
See `config.example.yml` for a complete example with detailed documentation.
Override any configuration setting using environment variables with the format `MCE_<SECTION>_<SETTING>`:
```bash
# OCR settings
export MCE_OCR_CACHE_SIZE_LIMIT=200
export MCE_OCR_CONTEXT_WINDOW_SIZE=150

# Extraction settings
export MCE_EXTRACTION_BASE_CONFIDENCE_SCORE=0.8
export MCE_EXTRACTION_MAX_CONFIDENCE_CAP=0.98

# Security settings
export MCE_SECURITY_MAX_FILE_SIZE_MB=150

# Health check settings
export MCE_HEALTH_CHECK_TIMEOUT_SECONDS=10
```
Environment variables take precedence over file settings.
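This precedence can be implemented by consulting the environment after reading the file; a simplified sketch of the idea (the real loader's internals may differ, and `apply_env_overrides` is a hypothetical name):

```python
import os

def apply_env_overrides(config, prefix="MCE"):
    """Override nested config values from MCE_<SECTION>_<SETTING> variables.

    Assumes a two-level dict like {"ocr": {"cache_size_limit": 100}};
    the real loader may support deeper nesting and richer type coercion.
    """
    for section, settings in config.items():
        for key, current in settings.items():
            env_name = f"{prefix}_{section}_{key}".upper()
            raw = os.environ.get(env_name)
            if raw is not None:
                # Coerce to the existing value's type (int, float, or str).
                settings[key] = type(current)(raw)
    return config

os.environ["MCE_OCR_CACHE_SIZE_LIMIT"] = "200"
cfg = apply_env_overrides({"ocr": {"cache_size_limit": 100}})
```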
```python
from multimodal_contract_extractor import load_config, get_config

# Load configuration from file and environment
config = load_config(config_path='config.yml')

# Get current configuration (loads defaults if not configured)
config = get_config()

# Access configuration values
print(f"Cache limit: {config.ocr.cache_size_limit}")
print(f"Max file size: {config.security.max_file_size_mb}MB")
```
The application implements comprehensive security measures for file handling and processing:
- Automatic Cleanup: Temporary files are automatically cleaned up using context managers
- Restricted Permissions: Temporary files are created with owner-only access (`0o600`)
- Path Sanitization: File extensions are sanitized to prevent path-related security issues
- Size Limits: Configurable file size limits prevent denial-of-service attacks
- Exception Safety: Files are cleaned up even when processing fails
```python
# Secure file processing with automatic cleanup
from web_app import TempFileManager

with TempFileManager(uploaded_file) as tmp_path:
    # File is automatically cleaned up when exiting this block
    result = process_document(tmp_path)
```
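The real `TempFileManager` lives in `web_app`; as a minimal sketch of how such a context manager can guarantee owner-only permissions and cleanup even when processing fails (the constructor signature here is an assumption):

```python
import os
import tempfile

class TempFileSketch:
    """Illustrative stand-in for web_app.TempFileManager."""

    def __init__(self, data: bytes, suffix: str = ".pdf"):
        self.data = data
        self.suffix = suffix
        self.path = None

    def __enter__(self):
        # mkstemp keeps the path usable after the descriptor is closed.
        fd, self.path = tempfile.mkstemp(suffix=self.suffix)
        os.chmod(self.path, 0o600)  # owner-only access
        with os.fdopen(fd, "wb") as fh:
            fh.write(self.data)
        return self.path

    def __exit__(self, exc_type, exc, tb):
        # Runs even if the body raised, so the temp file never leaks.
        if self.path and os.path.exists(self.path):
            os.remove(self.path)
        return False  # never swallow exceptions

with TempFileSketch(b"%PDF-1.4 ...") as tmp:
    existed = os.path.exists(tmp)  # True inside the block
```

After the `with` block exits, the file is gone regardless of whether the body succeeded.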
```python
from multimodal_contract_extractor import load_document, detect_clauses

document = load_document("nda.pdf")
clauses = detect_clauses(document)
for clause in clauses:
    print(clause.type, clause.text)
```
```bash
# Process a directory of files
python batch_extract.py --input-dir ./contracts --output-dir ./extracted
```
```python
from multimodal_contract_extractor import load_document, detect_clauses

custom = {"renewal_terms": ["renewal", "extend", "continuation"]}
doc = load_document("service_agreement.pdf")
clauses = detect_clauses(doc, keywords=custom)
```
```json
{
  "document_info": {
    "filename": "employment_contract.pdf",
    "pages": 5,
    "processing_time": 23.4,
    "overall_confidence": 0.89,
    "document_type": "employment_agreement"
  },
  "parties": [
    {
      "role": "employer",
      "name": "TechCorp Inc.",
      "address": "123 Silicon Valley, CA 94025"
    },
    {
      "role": "employee",
      "name": "John Doe",
      "address": "456 Residential St, CA 94025"
    }
  ],
  "clauses": [
    {
      "id": "clause_001",
      "type": "termination",
      "title": "Termination for Cause",
      "text": "The Company may terminate this agreement immediately upon written notice if Employee...",
      "page": 3,
      "coordinates": [50, 300, 550, 450],
      "confidence": 0.94,
      "key_terms": ["immediate termination", "written notice", "cause"]
    },
    {
      "id": "clause_002",
      "type": "compensation",
      "title": "Base Salary",
      "text": "Employee shall receive an annual salary of $85,000, payable in bi-weekly installments...",
      "page": 2,
      "coordinates": [50, 150, 550, 220],
      "confidence": 0.97,
      "key_terms": ["$85,000", "bi-weekly", "annual salary"]
    }
  ],
  "metadata": {
    "extraction_timestamp": "2024-01-15T10:30:00Z",
    "model_version": "v2.1.0",
    "processing_method": "multimodal_vlm"
  }
}
```
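Once written, this output can be post-processed with the standard library alone; for example, pulling out high-confidence clause titles (the `sample` string below is a trimmed copy of the structure above):

```python
import json

sample = """
{
  "clauses": [
    {"id": "clause_001", "type": "termination", "title": "Termination for Cause", "confidence": 0.94},
    {"id": "clause_002", "type": "compensation", "title": "Base Salary", "confidence": 0.97}
  ]
}
"""

result = json.loads(sample)
# Keep only clauses scoring at or above 0.95 confidence.
high_conf = [c["title"] for c in result["clauses"] if c["confidence"] >= 0.95]
```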
```bash
# Train on domain-specific contracts
python train.py --dataset legal_contracts_dataset --epochs 10

# Fine-tune for specific contract types
python fine_tune.py --contract-type "real_estate" --examples ./real_estate_samples
```
- Confidence Scoring: ML-based confidence assessment
- Cross-validation: Multiple model consensus
- Human Review: Built-in review interface
- Error Detection: Automatic inconsistency flagging
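"Multiple model consensus" is not specified further in this document; one common realization is to keep only clauses that a minimum number of models agree on, averaging their confidences. A hypothetical sketch that skips the span-alignment a real system would need:

```python
from collections import defaultdict

def consensus(model_outputs, min_votes=2):
    """Keep clause types detected by at least `min_votes` models,
    with their confidence scores averaged.

    `model_outputs` is a list (one entry per model) of
    {clause_type: confidence} dicts; real outputs would also
    require aligning text spans, which this sketch omits.
    """
    votes = defaultdict(list)
    for output in model_outputs:
        for clause_type, conf in output.items():
            votes[clause_type].append(conf)
    return {
        t: sum(confs) / len(confs)
        for t, confs in votes.items()
        if len(confs) >= min_votes
    }

merged = consensus([
    {"termination": 0.90, "compensation": 0.80},
    {"termination": 0.94},
    {"indemnity": 0.70},
])
# Only "termination" reaches two votes; its scores are averaged.
```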
```
# REST API
POST /api/extract
Content-Type: multipart/form-data

# GraphQL API
mutation {
  extractContract(file: $file) {
    clauses {
      type
      text
      confidence
    }
  }
}

# Webhook Integration
POST /webhooks/document-processed
{
  "document_id": "doc_123",
  "status": "completed",
  "clauses_extracted": 15
}
```
Install optional GPU dependencies if you want CUDA acceleration:
```bash
# Install with GPU support
pip install -r requirements-gpu.txt

# Run with CUDA acceleration
python extract.py --gpu --batch-size 8
```
```yaml
# docker-compose.yml
version: '3.8'
services:
  contract-extractor:
    image: contract-extractor:latest
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - AZURE_VISION_KEY=${AZURE_VISION_KEY}
    volumes:
      - ./contracts:/app/input
      - ./results:/app/output
```
- GDPR Compliance: Data protection and privacy controls
- Audit Trails: Complete processing history
- Role-based Access: User permission management
- SLA Monitoring: Performance and uptime tracking
- Custom Deployment: On-premises or private cloud options
For very large PDFs, use `stream_document` to load pages in chunks and reduce memory usage.
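The chunking behavior can be pictured with a toy generator; the real `stream_document` API may differ, this only illustrates yielding pages in fixed-size batches so that one chunk is materialized at a time:

```python
def stream_pages(pages, chunk_size=10):
    """Yield successive chunks of pages instead of loading them all at once."""
    for start in range(0, len(pages), chunk_size):
        # Only this slice of pages lives in memory per iteration.
        yield pages[start:start + chunk_size]

# A 25-page document streamed in chunks of 10 pages.
chunks = list(stream_pages(list(range(25)), chunk_size=10))
```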
| Document Type | Avg Processing Time | Accuracy | Confidence |
|---|---|---|---|
| Native PDF | 5.2s | 96.3% | 0.94 |
| Scanned PDF | 12.8s | 91.7% | 0.88 |
| Handwritten | 18.4s | 87.2% | 0.82 |
| Low Quality | 25.1s | 83.9% | 0.78 |
We welcome contributions in these areas:
- Support for additional document formats
- New contract type templates
- OCR engine integrations
- Performance optimizations
- Multilingual support
See CONTRIBUTING.md for development guidelines.
- Data Privacy: Processes documents locally by default
- No Data Retention: Documents are not stored unless explicitly configured
- Audit Logging: Complete processing audit trails
- Compliance Standards: SOC 2, GDPR, HIPAA ready
This repository includes an automated hygiene bot that ensures it meets GitHub community standards and security best practices. See HYGIENE_BOT.md for details.
The bot runs weekly via GitHub Actions and can also be run manually:
```bash
# Set your GitHub token
export GITHUB_TOKEN=your_token_here

# Run hygiene check
./run_hygiene.sh
```
MIT License - see LICENSE file for details.
This tool is for document processing assistance only. All extracted information should be reviewed by qualified legal professionals before use in any legal context.