A document-to-Markdown converter that extracts text and images from PDF files, Word documents, and standalone images, applies OCR where needed, and generates clean Markdown output.
Document Processing:
- Extract text and images from PDF and Word documents
- Process standalone images with OCR
- Use PaddleOCR for high-quality optical character recognition
- Convert document structure to proper Markdown formatting
- Smart handling of multi-column layouts
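One way to picture the multi-column handling listed above is as a reading-order sort. The sketch below is illustrative only and is not taken from the project's code: `TextBlock`, `reading_order`, and the assumption of equal-width columns are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextBlock:
    """A piece of text plus the top-left corner of its bounding box (hypothetical type)."""
    x0: float  # distance from the left edge of the page
    y0: float  # distance from the top edge of the page
    text: str


def reading_order(blocks: List[TextBlock], page_width: float, columns: int = 2) -> List[TextBlock]:
    """Order blocks column by column, top to bottom within each column.

    ASSUMPTION: columns are equally wide; real layouts often need gap detection instead.
    """
    column_width = page_width / columns

    def sort_key(block: TextBlock) -> Tuple[int, float]:
        # Clamp so a block starting at the right edge still lands in the last column.
        column = min(int(block.x0 // column_width), columns - 1)
        return (column, block.y0)

    return sorted(blocks, key=sort_key)


blocks = [
    TextBlock(320, 100, "Right column, first paragraph"),
    TextBlock(40, 300, "Left column, second paragraph"),
    TextBlock(40, 100, "Left column, first paragraph"),
]
for block in reading_order(blocks, page_width=600):
    print(block.text)
```

With the sample blocks, both left-column paragraphs are printed before the right-column one, which is the order a reader would follow on a two-column page.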
Modern User Interface:
- Clean, responsive design with Bootstrap 5
- Drag & drop file upload interface
- Real-time conversion status display
- Tabbed interface for viewing Markdown, original document, and rendered preview
- Syntax highlighting for Markdown code
- Content scrolling with custom scrollbars for long documents
- Smart paragraph detection and formatting for better readability
- Optimized layout for both desktop and mobile devices
- Dark mode support
User Experience:
- Copy Markdown to clipboard with one click
- Download converted Markdown as a file
- Print functionality for rendered Markdown
- Instant rendering of Markdown to HTML preview
- Error handling with descriptive messages
- Auto-refresh UI components
Frontend:
- HTML5, CSS3, and JavaScript (ES6+)
- Bootstrap 5 for responsive design
- Highlight.js for syntax highlighting
- Marked.js for Markdown rendering
- Custom CSS for enhanced UI components
- Animate.css for smooth transitions
Backend:
- Python 3.8+ with FastAPI
- PyPDF2 for PDF processing
- python-docx for Word document processing
- PaddleOCR for image text recognition
- Jinja2 for templating
- Asynchronous processing for improved performance
Data Processing:
- Text cleaning and formatting algorithms
- Smart paragraph detection
- Intelligent handling of document structure
- Image extraction and processing
- OCR confidence scoring and validation
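The data-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it assumes OCR output has already been reduced to `(text, confidence)` pairs, and the function names and the `0.6` threshold are hypothetical.

```python
import re
from typing import Iterable, List, Tuple


def filter_by_confidence(lines: Iterable[Tuple[str, float]], threshold: float = 0.6) -> List[str]:
    """Keep only OCR lines whose confidence is at or above the threshold."""
    return [text for text, confidence in lines if confidence >= threshold]


def clean_line(line: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", line).strip()


def merge_paragraphs(lines: Iterable[str]) -> List[str]:
    """Join consecutive non-empty lines into paragraphs; blank lines separate paragraphs."""
    paragraphs: List[str] = []
    current: List[str] = []
    for raw in lines:
        line = clean_line(raw)
        if line:
            current.append(line)
        elif current:
            # A blank line closes the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


ocr_lines = [("Scanned title", 0.98), ("n0isy fr@gment", 0.31), ("Body text", 0.91)]
print(filter_by_confidence(ocr_lines))

raw_lines = ["First line", "of a paragraph.", "", "  Second   paragraph. "]
print(merge_paragraphs(raw_lines))
```

Treating blank lines as paragraph boundaries is the simplest possible rule. A production cleaner would likely also consider line length, indentation, and end-of-line hyphenation.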
Project Structure:
document-to-markdown/
│
├── app/
│   ├── core/
│   │   ├── document_extractor/  # Extract content from documents
│   │   │   ├── pdf_extractor.py  # Extract text and images from PDF
│   │   │   ├── docx_extractor.py  # Extract text and images from Word
│   │   │   └── image_handler.py  # Process standalone images
│   │   │
│   │   ├── ocr/
│   │   │   ├── paddle_ocr.py  # PaddleOCR implementation
│   │   │   └── ocr_processor.py  # OCR processing workflow
│   │   │
│   │   ├── text_processor/  # Text processing module
│   │   │   ├── text_cleaner.py  # Text cleaning
│   │   │   └── text_merger.py  # Merge document and OCR text
│   │   │
│   │   └── markdown_converter/  # Markdown conversion module
│   │       ├── md_formatter.py  # Text to Markdown conversion
│   │       ├── image_formatter.py  # Image processing and Markdown syntax
│   │       └── structure_parser.py  # Parse document structure
│   │
│   ├── media/  # Store processed images
│   │   └── images/
│   │
│   ├── api/  # API endpoints
│   │   ├── document_api.py  # Document processing API
│   │   └── static_files.py  # Serve static files
│   │
│   ├── models/  # Data models
│   │   └── document_models.py  # Document data models
│   │
│   └── frontend/  # Web interface
│       ├── frontend_api.py  # Frontend API
│       └── templates/  # HTML templates
│           └── index.html  # Main app page
│
├── config/  # Configuration
│   └── config.py  # App configuration
│
├── requirements.txt  # Dependencies
└── main.py  # Main application
Clone the repository:
    git clone https://github.com/yourusername/deep-doc2markdown.git
    cd deep-doc2markdown
Create a virtual environment (optional but recommended):
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
Install dependencies:
    pip install -r requirements.txt
Install PaddlePaddle:
Depending on your system, you might need specific installation instructions for PaddlePaddle. Visit the PaddlePaddle installation guide for details.
Start the application:
    python main.py
Open your browser and navigate to:
    http://127.0.0.1:8000
Use the web interface to upload your documents and convert them to Markdown.
API Endpoints:
- POST /api/upload: Upload a document for conversion
- GET /api/status/{doc_id}: Check the status of a conversion job
- GET /api/markdown/{doc_id}: Get the generated Markdown content
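A client might poll the status endpoint and then fetch the Markdown along these lines. This is a sketch under stated assumptions: the README does not document the response format, so the JSON payload, the `"status"` field, and the `"processing"` value are guesses, and all helper names are hypothetical. Uploading is left out because the expected form field name is not documented.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://127.0.0.1:8000"  # default address from the usage steps


def status_url(doc_id: str, base_url: str = BASE_URL) -> str:
    """Build the URL for GET /api/status/{doc_id}."""
    return f"{base_url}/api/status/{doc_id}"


def markdown_url(doc_id: str, base_url: str = BASE_URL) -> str:
    """Build the URL for GET /api/markdown/{doc_id}."""
    return f"{base_url}/api/markdown/{doc_id}"


def fetch_json(url: str) -> dict:
    """GET a URL and decode the body as JSON (ASSUMPTION: the API returns JSON)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def wait_for_conversion(
    doc_id: str,
    fetch: Callable[[str], dict] = fetch_json,
    interval: float = 1.0,
    max_attempts: int = 30,
) -> dict:
    """Poll the status endpoint until the job is no longer reported as processing."""
    for _ in range(max_attempts):
        payload = fetch(status_url(doc_id))
        # ASSUMPTION: the payload carries a "status" field that reads
        # "processing" while the job is still running.
        if payload.get("status") != "processing":
            return payload
        time.sleep(interval)
    raise TimeoutError(f"conversion of {doc_id} did not finish after {max_attempts} checks")
```

Once `wait_for_conversion(doc_id)` returns, `fetch_json(markdown_url(doc_id))` would retrieve the result, again assuming a JSON response. Passing `fetch` as a parameter keeps the polling logic testable without a running server.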
Requirements:
- Python 3.8+
- FastAPI
- PyPDF2 for PDF processing
- python-docx for Word document processing
- PaddleOCR for image text recognition
- Jinja2 for templating
Edit config/config.py to adjust settings:
- Change the server host and port
- Configure OCR settings
- Modify file storage paths
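To give a feel for the kinds of settings listed above, here is a hypothetical sketch of what such a configuration module could look like. None of these names or defaults come from the actual `config/config.py`; check that file for the real option names before editing.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    """Illustrative settings object; every field name and default is an assumption."""
    host: str = "127.0.0.1"                    # server host
    port: int = 8000                           # server port
    ocr_language: str = "en"                   # hypothetical OCR language setting
    ocr_min_confidence: float = 0.6            # hypothetical OCR confidence cutoff
    media_dir: Path = Path("app/media/images") # where processed images are stored


settings = Settings()
print(f"Serving on http://{settings.host}:{settings.port}")

# Overriding a value creates a new object, since the dataclass is frozen.
custom = Settings(port=9000, media_dir=Path("/tmp/doc2md-images"))
print(custom.port, custom.media_dir)
```

A frozen dataclass keeps settings read-only at runtime, which avoids accidental changes while requests are being processed.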
This project is licensed under the MIT License - see the LICENSE file for details.