This is a Python application that compares two PDF files and finds differences in the text. The application was developed using Python 3.11.
- Tesseract OCR (OCR = optical character recognition)
  - Set environment variable
    After installation, add the Tesseract installation folder (e.g. C:\Program Files\Tesseract-OCR) to the PATH environment variable.
  - Verify Tesseract installation
    Open the Windows Command Prompt and run the following command:
        tesseract --version
    The output should look like this:
        tesseract v5.3.1.20230401 ...
  - Replace <tesseract_exe_path> with the path to the tesseract.exe executable:
        pytesseract.pytesseract.tesseract_cmd = (
            r"<tesseract_exe_path>"  # e.g. r"C:\Program Files\Tesseract-OCR\tesseract.exe"
        )
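If you would rather not hard-code the path, the lookup can be done at startup. The following is only a minimal sketch, not part of this project: the helper name `resolve_tesseract_cmd` is hypothetical, and the fallback location is simply the example install folder mentioned above.

```python
import shutil
from pathlib import Path
from typing import Iterable, Optional


def resolve_tesseract_cmd(
    # Assumption: the default fallback is the example install path from this README.
    fallbacks: Iterable[str] = (r"C:\Program Files\Tesseract-OCR\tesseract.exe",),
    name: str = "tesseract",
) -> Optional[str]:
    """Return a path to the Tesseract executable, or None if none is found."""
    # First look the executable up on PATH (covers a correctly set environment).
    found = shutil.which(name)
    if found:
        return found
    # Otherwise try the known install locations one by one.
    for candidate in fallbacks:
        if Path(candidate).is_file():
            return candidate
    return None
```

The returned value could then be assigned to `pytesseract.pytesseract.tesseract_cmd` when it is not `None`, with a clear error message otherwise.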
- Poppler
  - Download the latest binary from here
  - Unzip the folder wherever you want and add the path to the environment variables (e.g. C:\Software\poppler-0.68.0\bin)
  - Change the value of the variable POPPLER_PATH in the file constants/constants.py:
        # Windows also needs poppler_exe
        POPPLER_PATH = Path(r"<poppler_path>")  # e.g. C:\Software\poppler-0.68.0\bin
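A wrong POPPLER_PATH usually only shows up when the first PDF is converted, so it can help to check the folder up front. This is a rough sketch under assumptions: the helper `missing_poppler_tools` is hypothetical, and it assumes the application needs the `pdftoppm` and `pdfinfo` utilities that ship with Poppler.

```python
from pathlib import Path


def missing_poppler_tools(poppler_path, tools=("pdftoppm", "pdfinfo")):
    """Return the names of Poppler utilities that are not present in poppler_path."""
    folder = Path(poppler_path)
    missing = []
    for tool in tools:
        # Accept both the bare name (Linux/macOS) and the .exe variant (Windows).
        candidates = (folder / tool, folder / f"{tool}.exe")
        if not any(candidate.is_file() for candidate in candidates):
            missing.append(tool)
    return missing
```

An empty list means both utilities were found; anything else points to a wrong folder (for example the unzipped root instead of its bin subfolder).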
- Tesseract OCR
  To install Tesseract OCR on Linux, run the following commands:
      git clone https://github.com/tesseract-ocr/tesseract.git
      cd tesseract
      sudo ./autogen.sh
      sudo ./configure
      sudo make
      sudo make install
      sudo ldconfig
      sudo make training
      sudo make training-install
- Clone the repository
- Install the required packages:
      pip install -r requirements.txt
- Add two PDF files to the input folder: template.pdf and template_changed.pdf
- Run the main.py script:
      python main.py
- The script will compare the two PDF files and output the differences in the text.
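The comparison logic itself lives in the functions/ package and is not reproduced here. To illustrate the general idea only, here is a minimal sketch that diffs two already-extracted text strings with Python's standard `difflib` module; the function name `diff_text` is hypothetical and this is not necessarily how the application does it.

```python
import difflib


def diff_text(original: str, changed: str) -> list[str]:
    """Return unified-diff lines describing how `changed` differs from `original`."""
    return list(
        difflib.unified_diff(
            original.splitlines(),
            changed.splitlines(),
            # The labels are only cosmetic; they reuse the input file names.
            fromfile="template.pdf",
            tofile="template_changed.pdf",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Stand-in strings; in practice these would come from OCR or text extraction.
    for line in diff_text("Invoice\nTotal: 100", "Invoice\nTotal: 120"):
        print(line)
```

Lines prefixed with `-` appear only in the first text and lines prefixed with `+` only in the second; identical inputs produce an empty list.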
The project has the following file structure:
- constants/: a package containing constants used throughout the application.
- functions/: a package containing all the functions used in the application.
- input/: a folder containing the input PDF files.
- main.py: the main script that runs the application.
- requirements.txt: a file containing the required Python packages to run the application.
Contributions are welcome! If you find a bug or have a feature request, please open an issue on the repository.
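Starting up Docker Compose is easy. To begin, make sure you're in the pdf-text-diff folder and run the following from the Command Prompt:
    docker compose up -d
To bring down the environment and remove the volume defined in compose.yaml, run the following commands:
    docker compose down -v
    docker system prune -af
Note that docker system prune -af is not limited to this project: it removes all stopped containers, unused networks, unused images, and build cache on the machine without asking for confirmation.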
This project is licensed under the MIT License. See the LICENSE file for more information.