PDF Text Diff

This is a Python application that compares two PDF files and finds differences in the text. The application was developed using Python 3.11.

Prerequisites

Windows

Tesseract OCR (OCR = optical character recognition)
- Download and install Tesseract
- Set environment variable
  After installation, add the Tesseract installation directory (e.g. C:\Program Files\Tesseract-OCR) to the PATH environment variable.
- Verify Tesseract installation
  Open the Windows Command Prompt and run the following command:
```
tesseract --version
```
  The output should look similar to:
```
tesseract v5.3.1.20230401
...
```
- Replace <tesseract_exe_path> with the path to the tesseract.exe executable:
```
pytesseract.pytesseract.tesseract_cmd = (
  r"<tesseract_exe_path>"  # e.g. r"C:\Program Files\Tesseract-OCR\tesseract.exe"
)
```
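The path assignment above can be made a little more forgiving. The following is a minimal sketch, not the project's actual code: the helper name `resolve_tesseract_cmd` is hypothetical, and it assumes you are happy to fall back to whatever `tesseract` is found on PATH when the configured path does not exist.

```python
import shutil
from pathlib import Path
from typing import Optional


def resolve_tesseract_cmd(configured: Optional[str] = None) -> Optional[str]:
    """Return a usable Tesseract executable path, or None if none is found.

    Hypothetical helper: prefers the explicitly configured path and
    falls back to a PATH lookup.
    """
    if configured and Path(configured).is_file():
        return configured
    # shutil.which searches the PATH environment variable.
    return shutil.which("tesseract")


cmd = resolve_tesseract_cmd(r"C:\Program Files\Tesseract-OCR\tesseract.exe")
if cmd is None:
    print("Tesseract not found; install it or fix the configured path.")
else:
    print(f"Using Tesseract at: {cmd}")
    # With pytesseract installed, the result would be assigned like this:
    # pytesseract.pytesseract.tesseract_cmd = cmd
```

If the lookup returns `None`, the PATH step above was most likely skipped or needs a new Command Prompt session to take effect.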
Poppler
- Download the latest binary from here
- Unzip the archive to a location of your choice and add its bin folder to the PATH environment variable (e.g. C:\Software\poppler-0.68.0\bin)
- Set the POPPLER_PATH variable in constants/constants.py:
```
  # Windows also needs poppler_exe
  POPPLER_PATH = Path(r"<poppler_path>")  # e.g. C:\Software\poppler-0.68.0\bin
```
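A wrong POPPLER_PATH usually only surfaces when the first PDF is converted, so it can help to check the folder up front. This is a sketch under assumptions: the function name `poppler_bin_is_valid` is hypothetical, and it assumes the application relies on Poppler's `pdftoppm` utility (as PDF-to-image wrappers such as pdf2image do), which this README does not state explicitly.

```python
from pathlib import Path


def poppler_bin_is_valid(poppler_path: Path) -> bool:
    """Return True if the folder contains the pdftoppm executable.

    Hypothetical check: looks for the Windows and the Unix file name.
    """
    return any(
        (poppler_path / name).is_file()
        for name in ("pdftoppm.exe", "pdftoppm")
    )


# Example value taken from the README; adjust to your own installation.
POPPLER_PATH = Path(r"C:\Software\poppler-0.68.0\bin")

if not poppler_bin_is_valid(POPPLER_PATH):
    print(f"pdftoppm not found in {POPPLER_PATH}; check POPPLER_PATH.")
```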

Linux

Tesseract OCR

To install Tesseract OCR on Linux, run the following commands:

```
git clone https://github.com/tesseract-ocr/tesseract.git
cd tesseract
sudo ./autogen.sh
sudo ./configure
sudo make
sudo make install
sudo ldconfig
sudo make training
sudo make training-install
```
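After the build, the installation can be verified in the same way as on Windows, by asking the binary for its version. The snippet below is a small sketch that does this from Python; the function name `tesseract_version` is hypothetical and not part of this project.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version(binary: str = "tesseract") -> Optional[str]:
    """Return the first line of `<binary> --version`, or None if unavailable."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some older releases print the version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


print(tesseract_version() or "tesseract is not on PATH")
```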

Installation

Clone the repository
Install the required packages: pip install -r requirements.txt

Usage

Add two PDF files to the input folder: template.pdf and template_changed.pdf
Run the main.py script: python main.py
The script will compare the two PDF files and output the differences in the text.
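To illustrate the comparison step, here is a minimal sketch of a line-based text diff using Python's standard `difflib` module. It is an assumption about the general technique, not the project's actual implementation in functions/; the function name `diff_text` and the sample strings are made up, and the text is assumed to have already been extracted from the two PDFs.

```python
import difflib
from typing import List


def diff_text(old: str, new: str) -> List[str]:
    """Return a unified diff of two texts, one entry per output line."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="template.pdf",
            tofile="template_changed.pdf",
            lineterm="",  # inputs have no trailing newlines
        )
    )


# Stand-ins for text extracted from the two input files.
old_text = "Total: 100 EUR\nDue: 2023-05-01"
new_text = "Total: 120 EUR\nDue: 2023-05-01"

for line in diff_text(old_text, new_text):
    print(line)
```

Lines starting with `-` appear only in the first text and lines starting with `+` only in the second. Identical inputs produce an empty list.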

Project structure

The project has the following file structure:

constants/: a package containing constants used throughout the application.
functions/: a package containing all the functions used in the application.
input/: a folder containing the input PDF files.
main.py: the main script that runs the application.
requirements.txt: a file containing the required Python packages to run the application.

Contributing

Contributions are welcome! If you find a bug or have a feature request, please open an issue on the repository.

Running Docker Compose

Starting up Docker Compose is easy. To begin, ensure you're in the pdf-text-diff folder and run the following from the Command Prompt:

```
docker compose up -d
```

To bring down the environment and remove the volume defined in compose.yaml, run the following command:

```
docker compose down -v
```

Remove all unused containers, networks, images, and volumes

```
docker system prune -af
```

License

This project is licensed under the MIT License. See the LICENSE file for more information.
