Document Information Extractor

A Python application that uses Large Language Models (LLMs) to process large documents according to the system prompts bundled with the program. It extracts information and produces structured data based on the system prompt you select. The workflow is simple: configure your LLM provider, upload your documents, and select the prompt you need. The application then processes the documents and returns results shaped by the selected system prompt.

Inspired by Daniel Miessler's Fabric: https://github.com/danielmiessler/fabric

Features

Multiple Document Format Support:
- PDF documents (*.pdf)
- Excel files (*.xlsx, *.xls)
- Word documents (*.docx)
- Text files (*.txt, *.csv, *.json, *.xml, *.md)
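
As an illustration of how the formats above could be routed to different extractors, here is a minimal sketch that classifies a file by its extension. The names `EXTENSION_KINDS` and `classify_document` are hypothetical and not taken from the project; the actual dispatch logic in python.py may differ.

```python
from pathlib import Path

# Hypothetical grouping of the extensions listed above; the project's
# real dispatch table may be organized differently.
EXTENSION_KINDS = {
    ".pdf": "pdf",
    ".xlsx": "excel",
    ".xls": "excel",
    ".docx": "word",
    ".txt": "text",
    ".csv": "text",
    ".json": "text",
    ".xml": "text",
    ".md": "text",
}


def classify_document(path: str) -> str:
    """Return the document kind for a path, or raise ValueError if unsupported."""
    suffix = Path(path).suffix.lower()  # case-insensitive match on the extension
    try:
        return EXTENSION_KINDS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
```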

LLM Provider Integration:
- OpenAI (GPT-3.5, GPT-4)
- Ollama (local models)
- DeepSeek
- Support for OpenAI API-compatible services
  - Azure OpenAI
  - Mistral AI
  - Together AI
  - Anyscale
  - OpenRouter

Key Features:
- Smart text extraction from multiple file formats
- Customizable prompts for different question types
- Batch processing capabilities
- Robust error handling with retries
- Web-based interface using Streamlit
- Markdown output formatting

System Requirements

- Python 3.8 or higher
- Windows, Linux, or macOS
- Internet connection for cloud-based LLM providers
- Sufficient disk space for document processing

Installation

Clone the repository:

```
git clone https://github.com/bamit99/Document-Information-Extractor.git
cd Document-Information-Extractor
```

Create a virtual environment:

```
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Install dependencies:
```
pip install -r requirements.txt
```

Configuration

Create a .env file in the project root and configure your LLM providers:

```
# OpenAI Configuration
OPENAI_API_KEY=your_api_key_here

# For other providers, use the format:
PROVIDER_<NAME>_API_KEY=your_api_key_here
PROVIDER_<NAME>_BASE_URL=provider_api_url
PROVIDER_<NAME>_MODEL=model_name
```
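
To show how the `PROVIDER_<NAME>_*` naming scheme can be consumed, the following sketch collects the three settings for one provider from an environment mapping. It assumes the .env values have already been loaded into the process environment (for example by a loader such as python-dotenv); `load_provider_config` is a hypothetical helper, not a function from this project.

```python
import os
from typing import Dict, Mapping, Optional


def load_provider_config(
    name: str, env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Collect PROVIDER_<NAME>_API_KEY / _BASE_URL / _MODEL for one provider.

    Hypothetical helper: keys that are missing or empty are simply omitted.
    """
    env = os.environ if env is None else env
    prefix = f"PROVIDER_{name.upper()}_"
    config = {}
    for field in ("API_KEY", "BASE_URL", "MODEL"):
        value = env.get(prefix + field)
        if value:
            config[field.lower()] = value
    return config
```

Passing an explicit mapping instead of relying on `os.environ` keeps the helper easy to test without touching real credentials.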

Customize prompts in the Prompts directory:
- Each prompt template has its own directory
- system.md: Contains system instructions
- user.md: Contains the user prompt template
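
The directory layout above (one folder per template, each holding system.md and user.md) can be read with a few lines of Python. This is a sketch under that assumption only; `list_prompts` and `load_prompt` are hypothetical names, and the application's own loader may behave differently.

```python
from pathlib import Path
from typing import Dict, List


def list_prompts(prompts_dir: str) -> List[str]:
    """Return the names of template folders that contain a system.md file."""
    root = Path(prompts_dir)
    return sorted(p.name for p in root.iterdir() if (p / "system.md").is_file())


def load_prompt(prompts_dir: str, template_name: str) -> Dict[str, str]:
    """Read system.md and user.md for one template folder."""
    template_dir = Path(prompts_dir) / template_name
    return {
        "system": (template_dir / "system.md").read_text(encoding="utf-8"),
        "user": (template_dir / "user.md").read_text(encoding="utf-8"),
    }
```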

Usage

Start the application:
```
streamlit run app.py
```

Through the web interface:
- Select your preferred LLM provider
- Configure provider settings
- Upload documents for processing
- Choose prompt templates
- Process files and view results

Using the Python API:

```python
# "python" here is the project's own python.py module, not the interpreter
from python import DocumentProcessor, OpenAIProvider

# Initialize provider
provider = OpenAIProvider(api_key="your_api_key")

# Create processor
processor = DocumentProcessor(provider)

# Process files
processor.process_files("input_path", "output_folder")
```

Project Structure

- app.py: Streamlit web application
- python.py: Core processing logic and provider implementations
- Prompts/: Directory containing prompt templates
- requirements.txt: Python dependencies

Error Handling

The application implements robust error handling:

- Automatic retries for API calls with exponential backoff
- Comprehensive logging
- User-friendly error messages
- Fallback mechanisms for provider connections
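
For readers unfamiliar with retries with exponential backoff, the general pattern looks roughly like the sketch below. It is a generic illustration, not the project's implementation: `call_with_retries`, its parameters, and the default values are all assumptions.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, waiting base_delay * 2**n seconds after the n-th failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # delay doubles on each retry
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the delay schedule can be checked in tests without actually waiting.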

Contributing

Contributions are welcome! Please:

- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request

License

This project is licensed under the MIT License.

Acknowledgments

- OpenAI and other LLM providers for their APIs
- Streamlit for the web interface framework
- PyMuPDF, python-docx, and other document processing libraries
