danieleschmidt/multimodal-contract-extractor


Multimodal-Contract-Extractor

A Vision-Language Model (VLM) pipeline that identifies and extracts clauses from scanned PDFs, handwritten contracts, and image-based documents, and outputs structured JSON data.

Features

  • Multimodal Processing: Handles scanned PDFs, images, and handwritten documents
  • Clause Detection: Combines OCR with Vision-Language Models to identify clauses
  • Structured Output: Exports extracted data as JSON, XML, or CSV formats
  • Legal Template Recognition: Pre-trained on common contract types (NDAs, employment, leases)
  • Batch Processing: Handle multiple documents simultaneously
  • Confidence Scoring: Quality assessment for each extracted clause
  • Human-in-the-Loop: Review interface for verification and corrections

Quick Start

# Install dependencies
pip install -r requirements.txt

# Process a single contract
python extract.py --file contract.pdf --output extracted_data.json

# Batch process multiple files
python batch_extract.py --input-dir ./contracts --output-dir ./results

# Enable debug logging
python extract.py --file contract.pdf --log-level debug

# Start web interface for interactive processing
streamlit run web_app.py

# Check CLI version
python extract.py --version
python batch_extract.py --version

Development

Create a virtual environment and install both runtime and development dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -r requirements-dev.txt
pip install -e .

Run linting, security checks and the tests to verify your setup:

ruff check .
bandit -r src -q
pytest -q

These same checks run automatically on every pull request via GitHub Actions.

Supported Document Types

Input Formats

  • PDF Documents: Native and scanned PDFs
  • Image Files: PNG, JPEG, TIFF, BMP
  • Handwritten Documents: Cursive and print handwriting
  • Multi-page Contracts: Automatic page sequencing
  • Low-quality Scans: Advanced preprocessing and enhancement

Contract Types

  • Non-Disclosure Agreements (NDAs)
  • Employment Contracts
  • Lease Agreements
  • Service Agreements
  • Purchase Orders
  • Partnership Agreements
  • Licensing Agreements

Architecture

Document Input → Preprocessing → OCR Engine → VLM Analysis → Clause Extraction → JSON Output
                      ↓              ↓           ↓              ↓               ↓
                Image Enhance   Text Extract  Semantic Parse  Structure Map   Validate
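
The stage chain above can be sketched as a list of functions that each take and return a shared state object. This is only an illustration of the flow: `PipelineState`, the stage functions, and `run_pipeline` are hypothetical names, not the package's actual API, and each stage body is a placeholder.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PipelineState:
    """Intermediate results passed from stage to stage (hypothetical structure)."""
    source: str
    text: str = ""
    clauses: List[Dict[str, str]] = field(default_factory=list)
    trace: List[str] = field(default_factory=list)


Stage = Callable[[PipelineState], PipelineState]


def preprocess(state: PipelineState) -> PipelineState:
    # Placeholder for image enhancement (deskewing, denoising, and so on).
    state.trace.append("preprocessing")
    return state


def run_ocr(state: PipelineState) -> PipelineState:
    # Placeholder for text extraction; a real stage would call an OCR engine.
    state.text = "Either party may terminate this agreement."
    state.trace.append("ocr")
    return state


def analyze_with_vlm(state: PipelineState) -> PipelineState:
    # Placeholder for semantic parsing by a vision-language model.
    state.trace.append("vlm_analysis")
    return state


def extract_clauses(state: PipelineState) -> PipelineState:
    # Placeholder: map the recognized text onto a clause structure.
    state.clauses = [{"type": "unclassified", "text": state.text}]
    state.trace.append("clause_extraction")
    return state


def validate(state: PipelineState) -> PipelineState:
    # Drop clauses that ended up without any text.
    state.clauses = [c for c in state.clauses if c["text"].strip()]
    state.trace.append("validate")
    return state


def run_pipeline(source: str, stages: List[Stage]) -> PipelineState:
    """Run every stage in order, threading the state through."""
    state = PipelineState(source=source)
    for stage in stages:
        state = stage(state)
    return state


STAGES = [preprocess, run_ocr, analyze_with_vlm, extract_clauses, validate]
result = run_pipeline("contract.pdf", STAGES)
print(result.trace)
```

Keeping the stages in a plain list makes it easy to swap one out (for example, a different OCR engine) without touching the others.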

Configuration

The application supports flexible configuration via YAML files and environment variables, following the Twelve-Factor App methodology.

Configuration File

Create a config.yml file in your project directory:

# Multimodal Contract Extractor Configuration
ocr:
  cache_size_limit: 100
  context_window_size: 100

extraction:
  base_confidence_score: 0.75
  length_bonus_divisor: 1000
  max_confidence_cap: 0.95
  file_size_threshold_mb: 10
  streaming_chunk_size: 5

security:
  max_file_size_mb: 100
  request_id_length_limit: 64

health:
  check_timeout_seconds: 5

document:
  default_streaming_chunk_size: 10

See config.example.yml for a complete example with detailed documentation.
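
The names of the `extraction` settings suggest a confidence score built from a base value plus a length-based bonus, limited by a cap. The sketch below shows one way such a rule could work. The formula is an assumption inferred from the setting names only, and `score_clause` is a hypothetical function; check the source code for the actual calculation.

```python
def score_clause(
    text: str,
    base_confidence_score: float = 0.75,
    length_bonus_divisor: int = 1000,
    max_confidence_cap: float = 0.95,
) -> float:
    """Assumed rule: base score + (clause length / divisor), never above the cap."""
    length_bonus = len(text) / length_bonus_divisor
    return min(base_confidence_score + length_bonus, max_confidence_cap)


short_clause = "x" * 100   # bonus of 0.1 under the assumed rule
long_clause = "x" * 500    # bonus of 0.5, so the cap applies

print(round(score_clause(short_clause), 2))
print(score_clause(long_clause))
```

Under this reading, raising `length_bonus_divisor` makes clause length matter less, and `max_confidence_cap` keeps long clauses from reaching a score of 1.0.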

Environment Variables

Override any configuration setting using environment variables with the format MCE_<SECTION>_<SETTING>:

# OCR settings
export MCE_OCR_CACHE_SIZE_LIMIT=200
export MCE_OCR_CONTEXT_WINDOW_SIZE=150

# Extraction settings  
export MCE_EXTRACTION_BASE_CONFIDENCE_SCORE=0.8
export MCE_EXTRACTION_MAX_CONFIDENCE_CAP=0.98

# Security settings
export MCE_SECURITY_MAX_FILE_SIZE_MB=150

# Health check settings
export MCE_HEALTH_CHECK_TIMEOUT_SECONDS=10

Environment variables take precedence over file settings.
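
As a rough illustration of that precedence rule, the sketch below overlays `MCE_<SECTION>_<SETTING>` variables on a dictionary loaded from the file. `apply_env_overrides` and `_coerce` are hypothetical helpers written for this example; the package's own `load_config` may work differently.

```python
import os
from copy import deepcopy
from typing import Any, Dict, Mapping, Optional

PREFIX = "MCE_"


def _coerce(raw: str, current: Any) -> Any:
    """Convert the environment string to the type of the value it replaces."""
    if isinstance(current, bool):  # check bool before int: bool is a subclass of int
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    if isinstance(current, int):
        return int(raw)
    if isinstance(current, float):
        return float(raw)
    return raw


def apply_env_overrides(
    config: Dict[str, Dict[str, Any]],
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, Dict[str, Any]]:
    """Return a copy of config in which matching environment variables win."""
    environ = os.environ if environ is None else environ
    merged = deepcopy(config)
    for section, settings in merged.items():
        for key, current in settings.items():
            name = f"{PREFIX}{section}_{key}".upper()
            if name in environ:
                settings[key] = _coerce(environ[name], current)
    return merged


file_config = {
    "ocr": {"cache_size_limit": 100, "context_window_size": 100},
    "extraction": {"base_confidence_score": 0.75},
}
env = {
    "MCE_OCR_CACHE_SIZE_LIMIT": "200",
    "MCE_EXTRACTION_BASE_CONFIDENCE_SCORE": "0.8",
}
merged = apply_env_overrides(file_config, env)
print(merged)
```

Looking up each known setting by its expected variable name avoids having to guess where the section name ends in a string such as `MCE_OCR_CACHE_SIZE_LIMIT`.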

Loading Configuration

from multimodal_contract_extractor import load_config, get_config

# Load configuration from file and environment
config = load_config(config_path='config.yml')

# Alternatively, get the current configuration (loads defaults if not configured)
config = get_config()

# Access configuration values
print(f"Cache limit: {config.ocr.cache_size_limit}")
print(f"Max file size: {config.security.max_file_size_mb}MB")

Security Features

The application applies the following safeguards when handling and processing files:

Secure File Processing

  • Automatic Cleanup: Temporary files are automatically cleaned up using context managers
  • Restricted Permissions: Temporary files created with owner-only access (0o600)
  • Path Sanitization: File extensions are sanitized to prevent security issues
  • Size Limits: Configurable file size limits prevent denial-of-service attacks
  • Exception Safety: Files are cleaned up even when processing fails

Production Security

# Secure file processing with automatic cleanup
from web_app import TempFileManager

with TempFileManager(uploaded_file) as tmp_path:
    # File is automatically cleaned up when exiting this block
    result = process_document(tmp_path)
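
For readers curious how a context manager can provide these guarantees, here is a minimal sketch using only the standard library. It is not the `TempFileManager` implementation from `web_app`; `secure_temp_file` and its parameters are hypothetical, and the suffix cleaning rule is an assumption.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator

MAX_FILE_SIZE_MB = 100  # mirrors security.max_file_size_mb in the sample config


@contextmanager
def secure_temp_file(
    data: bytes, suffix: str = ".pdf", max_mb: int = MAX_FILE_SIZE_MB
) -> Iterator[str]:
    """Write data to a private temporary file and always delete it afterwards."""
    if len(data) > max_mb * 1024 * 1024:
        raise ValueError("file exceeds the configured size limit")

    # Keep only alphanumeric characters in the extension (assumed rule).
    clean_suffix = "." + "".join(ch for ch in suffix if ch.isalnum())[:10]

    # mkstemp creates the file readable and writable only by the creating user.
    fd, path = tempfile.mkstemp(suffix=clean_suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        # Runs on normal exit and when the body raises.
        if os.path.exists(path):
            os.remove(path)


with secure_temp_file(b"%PDF-1.4 example bytes", suffix=".pdf") as tmp_path:
    print(os.path.getsize(tmp_path))
```

The `finally` clause is what provides exception safety: the file is removed whether the `with` body completes or fails.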

Usage Examples

Basic Extraction

from multimodal_contract_extractor import load_document, detect_clauses

document = load_document("nda.pdf")
clauses = detect_clauses(document)
for clause in clauses:
    print(clause.type, clause.text)

Batch Processing

# Process a directory of files
python batch_extract.py --input-dir ./contracts --output-dir ./extracted
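
If you want to drive batch processing from Python instead of the CLI, the general pattern could look like the sketch below. It does not reflect how `batch_extract.py` is implemented: `find_documents`, `process_all`, and the extension list are assumptions, and the worker is a stand-in for a real extraction call.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Any, Callable, Dict, List

# Assumed extensions, based on the input formats listed in this README.
SUPPORTED = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def find_documents(input_dir: str) -> List[Path]:
    """Return supported files in input_dir, sorted for a stable order."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


def process_all(
    paths: List[Path], worker: Callable[[Path], Any], max_workers: int = 4
) -> Dict[Path, Any]:
    """Apply worker to every path using a small thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(worker, paths)))


# Demo with empty placeholder files; the worker just reports the file size.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "b.txt", "c.PNG"):
        (Path(tmp) / name).write_bytes(b"")
    documents = find_documents(tmp)
    results = process_all(documents, worker=lambda path: path.stat().st_size)
    print({path.name: size for path, size in results.items()})
```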

Custom Clause Types

from multimodal_contract_extractor import load_document, detect_clauses

custom = {"renewal_terms": ["renewal", "extend", "continuation"]}
doc = load_document("service_agreement.pdf")
clauses = detect_clauses(doc, keywords=custom)
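
To make the shape of the `keywords` mapping concrete, the sketch below shows one plausible way a clause type could be matched from its keyword list. `match_clause_types` is a hypothetical helper, and whole-word, case-insensitive matching is an assumption; `detect_clauses` may apply different rules.

```python
import re
from typing import Dict, List


def match_clause_types(text: str, keywords: Dict[str, List[str]]) -> List[str]:
    """Return the clause types whose keywords appear in text as whole words."""
    found = []
    for clause_type, words in keywords.items():
        if not words:
            continue  # an empty keyword list can never match
        pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(clause_type)
    return found


custom = {"renewal_terms": ["renewal", "extend", "continuation"]}
print(match_clause_types("The parties may extend this agreement.", custom))
print(match_clause_types("Payment is due within 30 days.", custom))
```

With whole-word matching, "extended" would not match the keyword "extend", so include every inflection you care about in the list.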

Sample Output

{
  "document_info": {
    "filename": "employment_contract.pdf",
    "pages": 5,
    "processing_time": 23.4,
    "overall_confidence": 0.89,
    "document_type": "employment_agreement"
  },
  "parties": [
    {
      "role": "employer",
      "name": "TechCorp Inc.",
      "address": "123 Silicon Valley, CA 94025"
    },
    {
      "role": "employee", 
      "name": "John Doe",
      "address": "456 Residential St, CA 94025"
    }
  ],
  "clauses": [
    {
      "id": "clause_001",
      "type": "termination",
      "title": "Termination for Cause",
      "text": "The Company may terminate this agreement immediately upon written notice if Employee...",
      "page": 3,
      "coordinates": [50, 300, 550, 450],
      "confidence": 0.94,
      "key_terms": ["immediate termination", "written notice", "cause"]
    },
    {
      "id": "clause_002", 
      "type": "compensation",
      "title": "Base Salary",
      "text": "Employee shall receive an annual salary of $85,000, payable in bi-weekly installments...",
      "page": 2,
      "coordinates": [50, 150, 550, 220],
      "confidence": 0.97,
      "key_terms": ["$85,000", "bi-weekly", "annual salary"]
    }
  ],
  "metadata": {
    "extraction_timestamp": "2024-01-15T10:30:00Z",
    "model_version": "v2.1.0",
    "processing_method": "multimodal_vlm"
  }
}
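
A typical downstream step is to flag clauses whose confidence falls below a review threshold. The sketch below works on a trimmed copy of the sample output above; `clauses_needing_review` and the 0.95 threshold are illustrative choices, not part of the package.

```python
import json
from typing import Any, Dict, List

sample_json = """
{
  "clauses": [
    {"id": "clause_001", "type": "termination", "confidence": 0.94},
    {"id": "clause_002", "type": "compensation", "confidence": 0.97}
  ]
}
"""


def clauses_needing_review(result: Dict[str, Any], threshold: float = 0.95) -> List[str]:
    """Return the ids of clauses whose confidence is below threshold."""
    return [
        clause["id"]
        for clause in result.get("clauses", [])
        if clause.get("confidence", 0.0) < threshold
    ]


result = json.loads(sample_json)
print(clauses_needing_review(result))  # ['clause_001']
```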

Advanced Features

Custom Training

# Train on domain-specific contracts
python train.py --dataset legal_contracts_dataset --epochs 10

# Fine-tune for specific contract types
python fine_tune.py --contract-type "real_estate" --examples ./real_estate_samples

Quality Assurance

  • Confidence Scoring: ML-based confidence assessment
  • Cross-validation: Multiple model consensus
  • Human Review: Built-in review interface
  • Error Detection: Automatic inconsistency flagging

Integration APIs

# REST API
POST /api/extract
Content-Type: multipart/form-data

# GraphQL API
# $file must be declared as a variable; the Upload scalar type is an assumption
mutation ($file: Upload!) {
  extractContract(file: $file) {
    clauses {
      type
      text
      confidence
    }
  }
}

# Webhook Integration
POST /webhooks/document-processed
{
  "document_id": "doc_123",
  "status": "completed",
  "clauses_extracted": 15
}
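
On the receiving side, a webhook consumer would usually validate the payload before acting on it. The sketch below parses the example body shown above; `DocumentProcessedEvent` and `parse_event` are hypothetical names, and the set of required fields is taken only from that example.

```python
import json
from dataclasses import dataclass

REQUIRED_FIELDS = {"document_id", "status", "clauses_extracted"}


@dataclass(frozen=True)
class DocumentProcessedEvent:
    document_id: str
    status: str
    clauses_extracted: int


def parse_event(body: str) -> DocumentProcessedEvent:
    """Parse a webhook body and fail loudly if a required field is missing."""
    payload = json.loads(body)
    missing = REQUIRED_FIELDS - set(payload)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return DocumentProcessedEvent(
        document_id=str(payload["document_id"]),
        status=str(payload["status"]),
        clauses_extracted=int(payload["clauses_extracted"]),
    )


event = parse_event(
    '{"document_id": "doc_123", "status": "completed", "clauses_extracted": 15}'
)
print(event)
```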

Deployment Options

Local Development

Install optional GPU dependencies if you want CUDA acceleration:

# Install with GPU support
pip install -r requirements-gpu.txt

# Run with CUDA acceleration
python extract.py --gpu --batch-size 8

Cloud Deployment

# docker-compose.yml
version: '3.8'
services:
  contract-extractor:
    image: contract-extractor:latest
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - AZURE_VISION_KEY=${AZURE_VISION_KEY}
    volumes:
      - ./contracts:/app/input
      - ./results:/app/output

Enterprise Features

  • GDPR Compliance: Data protection and privacy controls
  • Audit Trails: Complete processing history
  • Role-based Access: User permission management
  • SLA Monitoring: Performance and uptime tracking
  • Custom Deployment: On-premises or private cloud options

Performance Benchmarks

For very large PDFs, use stream_document to load pages in chunks and reduce memory usage.

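| Document Type | Avg Processing Time | Accuracy | Confidence |
|---------------|---------------------|----------|------------|
| Native PDF    | 5.2s                | 96.3%    | 0.94       |
| Scanned PDF   | 12.8s               | 91.7%    | 0.88       |
| Handwritten   | 18.4s               | 87.2%    | 0.82       |
| Low Quality   | 25.1s               | 83.9%    | 0.78       |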
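
The chunked loading idea mentioned at the top of this section can be illustrated with a small generator. This is a generic sketch, not the `stream_document` implementation: `iter_chunks` is a hypothetical helper, and the default of 10 simply mirrors `default_streaming_chunk_size` from the sample configuration.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_chunks(pages: Iterable[T], chunk_size: int = 10) -> Iterator[List[T]]:
    """Yield successive lists of at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk: List[T] = []
    for page in pages:
        chunk.append(page)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly shorter, chunk


# Page numbers stand in for page objects here.
for chunk in iter_chunks(range(1, 8), chunk_size=3):
    print(chunk)
```

Because the generator holds only one chunk at a time, peak memory depends on the chunk size and not on the total page count.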

Contributing

We welcome contributions in these areas:

  • Support for additional document formats
  • New contract type templates
  • OCR engine integrations
  • Performance optimizations
  • Multilingual support

See CONTRIBUTING.md for development guidelines.

Legal Compliance

  • Data Privacy: Processes documents locally by default
  • No Data Retention: Documents are not stored unless explicitly configured
  • Audit Logging: Complete processing audit trails
  • Compliance Standards: SOC 2, GDPR, HIPAA ready

Repository Hygiene

This repository includes an automated hygiene bot that ensures it meets GitHub community standards and security best practices. See HYGIENE_BOT.md for details.

The bot runs weekly via GitHub Actions and can also be run manually:

# Set your GitHub token
export GITHUB_TOKEN=your_token_here

# Run hygiene check
./run_hygiene.sh

License

MIT License - see LICENSE file for details.

Disclaimer

This tool is for document processing assistance only. All extracted information should be reviewed by qualified legal professionals before use in any legal context.
