@danielaskdd
Collaborator

Feat: Add PDF Decryption Support for Password-Protected Files

Summary

This PR adds support for processing password-protected PDF files in the document processing pipeline. Users can now decrypt and process encrypted PDFs by setting a password in the environment configuration.

Motivation

Previously, encrypted PDF files failed to process without any clear indication of what went wrong. This enhancement lets users work with password-protected documents, which are common in enterprise and academic environments where sensitive files are often encrypted.

Changes Made

1. Configuration Management (lightrag/api/config.py)

  • Added pdf_decrypt_password parameter to global_args
  • Reads from PDF_DECRYPT_PASSWORD environment variable
  • Defaults to None if not set
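
As a rough illustration of the configuration behavior listed above, the lookup can be sketched with the standard library alone. The function name `read_pdf_decrypt_password` is hypothetical and is not the code in lightrag/api/config.py. Treating an empty value as "not set" is also an assumption of this sketch.

```python
import os
from typing import Mapping, Optional


def read_pdf_decrypt_password(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the value of PDF_DECRYPT_PASSWORD, or None when it is unset.

    Hypothetical helper for illustration only; the project stores the value
    on global_args.pdf_decrypt_password.
    """
    value = env.get("PDF_DECRYPT_PASSWORD")
    # Assumption: an empty string is treated the same as a missing variable.
    return value if value else None
```

Passing the environment as a parameter keeps the helper easy to exercise in tests without mutating the real process environment.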

2. Document Processing (lightrag/api/routers/document_routes.py)

  • Modified pipeline_enqueue_file function to detect encrypted PDFs
  • Implemented decryption logic using PyPDF2's decrypt() method
  • Added comprehensive error handling for three scenarios:
    • No password provided: Clear error message directing users to set PDF_DECRYPT_PASSWORD
    • Incorrect password: Friendly error indicating the password is wrong
    • Decryption failure: Detailed error with exception information
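
The three error scenarios above can be sketched as a small helper. This is a minimal sketch, not the merged code in pipeline_enqueue_file. The names `ensure_decrypted` and `PdfDecryptError` are hypothetical. The reader argument is assumed to behave like PyPDF2's `PdfReader`, which exposes `is_encrypted` and `decrypt(password)`. The sketch also assumes that `decrypt()` returns a value equal to 0 when the password does not match. The message strings are the ones listed in this PR.

```python
from typing import Any, Optional


class PdfDecryptError(Exception):
    """Hypothetical exception type used only in this sketch."""


def ensure_decrypted(reader: Any, password: Optional[str]) -> None:
    """Decrypt a PyPDF2-style reader in place, or raise PdfDecryptError.

    `reader` must provide `is_encrypted` and `decrypt(password)`.
    """
    if not reader.is_encrypted:
        return  # Unencrypted PDFs pass through unchanged.

    # Scenario 1: the file is encrypted but no password was configured.
    if not password:
        raise PdfDecryptError(
            "PDF is encrypted but no password provided - "
            "Please set PDF_DECRYPT_PASSWORD environment variable"
        )

    # Scenario 3: the decryption call itself raised an exception.
    try:
        result = reader.decrypt(password)
    except Exception as exc:
        raise PdfDecryptError(
            f"PDF decryption failed - Error during PDF decryption: {exc}"
        ) from exc

    # Scenario 2: assumed convention that a result equal to 0 means
    # the password did not match.
    if result == 0:
        raise PdfDecryptError(
            "Failed to decrypt PDF - incorrect password - "
            "The provided PDF_DECRYPT_PASSWORD is incorrect for this file"
        )
```

With PyPDF2 installed, a call might look like `ensure_decrypted(PdfReader(path), global_args.pdf_decrypt_password)` before any pages are read. Accepting any reader-like object keeps the three branches testable without real encrypted files.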

3. Documentation (env.example)

  • Added PDF_DECRYPT_PASSWORD configuration example
  • Included clear comments explaining the feature

Usage

Configuration

Add to your .env file:

# PDF decryption password for protected PDF files
PDF_DECRYPT_PASSWORD=your_password_here

Behavior

  • Unencrypted PDFs: Process normally (no change in behavior)
  • Encrypted PDFs with password set: Automatically decrypt and process
  • Encrypted PDFs without password: Fail gracefully with helpful error message
  • Encrypted PDFs with wrong password: Fail with clear indication of incorrect password

Error Messages

All error messages are user-friendly and appear in English:

  • "PDF is encrypted but no password provided - Please set PDF_DECRYPT_PASSWORD environment variable"
  • "Failed to decrypt PDF - incorrect password - The provided PDF_DECRYPT_PASSWORD is incorrect for this file"
  • "PDF decryption failed - Error during PDF decryption: [details]"

Technical Notes

  • Only affects the PyPDF2 processing engine (DEFAULT mode)
  • DOCLING mode is unchanged
  • Password is accessed via global_args.pdf_decrypt_password for consistency with other configuration
  • Backward compatible - no breaking changes for existing deployments

Testing Recommendations

  1. Test with unencrypted PDF - should work as before
  2. Test with encrypted PDF without password set - should show friendly error
  3. Test with encrypted PDF and correct password - should decrypt and process successfully
  4. Test with encrypted PDF and incorrect password - should show password error

Checklist

  • Code follows project style guidelines
  • Comments are in English
  • Environment variable documented in env.example
  • Error handling is comprehensive and user-friendly
  • Backward compatible with existing functionality
  • Configuration managed through global_args pattern

• Add PDF_DECRYPT_PASSWORD env variable
• Check encryption status before reading
• Handle decrypt errors gracefully
• Log detailed error messages
• Support both encrypted/plain PDFs
@danielaskdd danielaskdd merged commit ece0398 into HKUDS:main Nov 1, 2025
1 check passed
@danielaskdd danielaskdd deleted the pdf-decryption branch November 1, 2025 07:24