Commit 0e906ee

7 move logging to a database (#8)
* #7 Update environment to support database container.
* #7 Add database models, logging and refactor a bit.
* #7 Remove json combining script.
* #7 Add database utility and clean up.
* #7 Fix upsert operations.
* #7 Update documentation.
* #7 Remove cruft from previous effort.
* #7 Refactor PII scrubber.
* #7 Fix bad import. Update shell script.
* #7 Clean up lint and sort output.
* #7 Flip requirements. Make smaller CI requirements file the additional one.
* #7 Remove some old dependencies. Fix Dockerfile.
* #7 Address feedback.
* #7 Add mysql password to example environment file.
* #7 Alert users to required sql variable.
* #7 Fix typo in run script message.
* #7 Add mysqlclient to requirements.
1 parent 1bc0199 commit 0e906ee

22 files changed

+664
-582
lines changed

.github/workflows/sort_and_lint.yaml

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ jobs:
     - name: Install dependencies
       run: |
         python -m pip install --upgrade pip
-        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
+        if [ -f requirements/ci_requirements.txt ]; then pip install -r requirements/ci_requirements.txt; fi
     - name: Lint with flake8 and isort
       run: |
         touch settings.py

Dockerfile

Lines changed: 4 additions & 5 deletions
@@ -1,15 +1,14 @@
 FROM python:3.11
 COPY requirements.txt .
-COPY requirements/full.txt .
 ENV PYTHONPATH=\/workspace
 ARG USER_ID
 RUN apt-get update -y && \
-    apt-get -y install poppler-utils tesseract-ocr yq
-RUN git clone https://github.com/tesseract-ocr/tessdata.git /usr/share/tessdata
-RUN useradd -l -u ${USER_ID} -g sudo jenkins && \
+    apt-get -y install poppler-utils tesseract-ocr yq &&\
+    git clone https://github.com/tesseract-ocr/tessdata.git /usr/share/tessdata &&\
+    useradd -l -u ${USER_ID} -g sudo jenkins && \
     mkdir -m 0755 /home/jenkins && chown jenkins /home/jenkins
 USER jenkins
-RUN pip install -r full.txt -r requirements.txt --trusted-host pypi.python.org --no-cache-dir && \
+RUN pip install -r requirements.txt --trusted-host pypi.python.org --no-cache-dir &&\
     python -m spacy download en_core_web_sm && \
     python -m spacy download en_core_web_lg
 ENV PATH="/home/jenkins/.local/bin:$PATH"

README.md

Lines changed: 20 additions & 9 deletions
@@ -1,15 +1,26 @@
 # Veterinary Medical Record Transcriber (VMRT) Tesseract Utilities
+
 The Golden Retriever lifetime study has thousands of electronic medical records (EMRs) that have valuable information. The VMRT project is an attempt to automate data extraction from these EMRs. This repository contains some very simple and crude Tesseract scripts to help evaluate our dataset. The unstructured text extracted from the EMRs may or may not be valuable, but understanding the quantity of low confidence records is very useful.

-Goals
-* Build dataset to understand composition of EMRs
-  * Kind of files
-  * Enrollment status
-* Determine confidence scores for ORC extraction from PDFs
-* Evaluate extracted text to determine Tesseract fit for project
+Goals:
+
+- Build dataset to understand composition of EMRs
+- Homogenize format of PDF and text files (more to come)
+- Determine confidence scores for optical character recognition from PDFs
+- Automatically scrub personally (or dog) identifiable information (PII)
+- Perform plain text substitution on corpus
+- Extract metadata, such as subject id, study year, related visit

 # Running the scripts
+
 The scripts are easily run via the Dockerfile included in this repo.
-1. Build the container like usual. `docker build -t <container name> .` Run the scripts `docker run --rm -v <path to data>/data -v <path to code>/workspace <image name> <script name>`
-2. To produce a file map that is compatible with the Tesseract utilitiy run the file_info.py script over the data directory. Output is printed to stdout.
-3. To process the file map produced above run the image_to_text.py with the file map. An output directory with an `unstructured_text` folder is also required. Output is dumped to output folder.
+
+1. Copy the example.env file to .env and fill in the values.
+   a. The value for `SQL_CONNECTION_STRING` should be the connection string for the database container. (i.e. `mysql://user:password@vmrt-emr-process-log-mysql:3306/vmrt_emr_transcription`)
+2. The easiest way to spin up the docker images is to run the `run.sh` script within the repository root directory. You can also build the containers using the docker compose file. `docker build -t <container name> .`
+3. Set up your DB by running `python /workspace/scripts/database_setup.py install` within the container.
+4. Get ready for the transcription process by running `python scripts/create_transcription_process.py /data`
+5. Use the `transcribe_pdfs.py` script to transcribe the files needed.
+   - `python /workspace/scripts/transcribe_pdfs.py /workspace/output`
+6. Use the `pii_scrubber.py` script to remove PII from the text.
+   - `python /workspace/scripts/scrubbers/pii_scrubber.py`
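
Note on step 1a of the new README: the sketch below shows one way to sanity-check the shape of a `SQL_CONNECTION_STRING` value before starting the containers. It uses only Python's standard `urllib.parse`. The helper name `describe_connection_string` is hypothetical and is not part of this commit; the sample value is the placeholder string from the README, not real credentials.

```python
from urllib.parse import urlsplit


def describe_connection_string(connection_string: str) -> dict:
    """Split a mysql:// style URL into its parts (hypothetical helper)."""
    parts = urlsplit(connection_string)
    # An empty or malformed value has no scheme or host to connect to.
    if not parts.scheme or not parts.hostname:
        raise ValueError(
            "Expected something like mysql://user:password@host:3306/database"
        )
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


# Placeholder value copied from the README example.
example = "mysql://user:password@vmrt-emr-process-log-mysql:3306/vmrt_emr_transcription"
info = describe_connection_string(example)
print(info["host"], info["port"], info["database"])
```

The host part should match the `container_name` of the database service in docker-compose.yml, since that is the name the workspace container resolves on the compose network.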

docker-compose.yml

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+services:
+  vmrt-emr-process-log-mysql:
+    env_file:
+      - path: ./.env
+        required: true
+    container_name: vmrt-emr-process-log-mysql
+    image: mysql:8.0
+    command: --default-authentication-plugin=mysql_native_password
+    restart: always
+    environment:
+      MYSQL_ROOT_PASSWORD: $SQL_PASSWORD
+  vmrt-emr-workspace:
+    container_name: vmrt-emr-workspace
+    build: .
+    command: tail -f /dev/null
+    volumes:
+      - ./:/workspace
+      - emr-source:/data
+
+volumes:
+  emr-source:
+    driver: local
+    driver_opts:
+      o: bind
+      type: none
+      device: "$HOME/MAF\ Dropbox/GRLS/Operations/ENROLLED\ DOGS"

example.env

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+SQL_CONNECTION_STRING=''
+SQL_PASSWORD=''
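
Both values in example.env ship empty, and run.sh refuses to continue when `SQL_PASSWORD` is unset. A rough Python equivalent of that check is sketched below, assuming a simple `KEY=value` file with optional quotes. The function names `parse_env_text` and `missing_variables` are made up for this illustration; the repository lists `python-dotenv`, which may be the better choice in real code.

```python
# Variable names taken from example.env in this commit.
REQUIRED = ("SQL_CONNECTION_STRING", "SQL_PASSWORD")


def parse_env_text(text: str) -> dict:
    """Parse KEY=value lines, skipping blanks and # comments (simplified)."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_variables(values: dict, required=REQUIRED) -> list:
    """Return the required names that are absent or empty."""
    return [name for name in required if not values.get(name)]


# Hypothetical file contents: the password is set, the connection string is not.
sample = "SQL_CONNECTION_STRING=''\nSQL_PASSWORD='secret'\n"
print(missing_variables(parse_env_text(sample)))
```

With the sample text above, only `SQL_CONNECTION_STRING` is reported as missing.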

pyproject.toml

Lines changed: 4 additions & 2 deletions
@@ -6,7 +6,9 @@ build-backend = "setuptools.build_meta"
 name = "vmrt_tesseract_utilities"
 version = "1.0"
 description = "A utility script for extracting text and scrubbing PII from EMR data."
-packages=["vmrt_tesseract_utilities"]
 authors = [
     {name = "Morris Animal Foundation", email = "[email protected]"},
-]
+]
+
+[tool.setuptools]
+packages=["vmrt_tesseract_utilities"]

requirements.txt

Lines changed: 98 additions & 0 deletions
@@ -1,5 +1,103 @@
+annotated-types==0.6.0
+azure-core==1.32.0
+beautifulsoup4==4.12.3
+blis==1.1.0
+catalogue==2.0.10
+certifi==2024.12.14
+cffi==1.16.0
+charset-normalizer==3.4.0
+click==8.1.8
+cloudpathlib==0.20.0
+colorclass==2.2.2
+compressed-rtf==1.0.6
+confection==0.1.5
+cryptography==43.0.0
+cymem==2.0.10
+dnspython==2.6.1
+easygui==0.98.3
+ebcdic==1.1.1
+email_validator==2.1.1
+en_core_web_lg @ https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl#sha256=293e9547a655b25499198ab15a525b05b9407a75f10255e405e8c3854329ab63
+en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl#sha256=1932429db727d4bff3deed6b34cfc05df17794f4a52eeb26cf8928f7c1a0fb85
+extract-msg==0.48.7
+filelock==3.16.1
 flake8==7.1.1
+fsspec==2024.12.0
+greenlet==3.1.1
+huggingface-hub==0.27.0
+idna==3.7
 isort==5.13.2
+Jinja2==3.1.5
+langcodes==3.5.0
+language_data==1.3.0
+lark==1.1.9
+marisa-trie==1.2.1
+markdown-it-py==3.0.0
+MarkupSafe==3.0.2
 mccabe==0.7.0
+mdurl==0.1.2
+mpmath==1.3.0
+msoffcrypto-tool==5.4.1
+murmurhash==1.0.11
+mysqlclient==2.2.6
+networkx==3.4.2
+numpy==2.0.2
+olefile==0.47
+oletools==0.60.2
+packaging==24.0
+pandas==2.2.2
+pcodedmp==1.2.6
+pdf2image==1.17.0
+phonenumbers==8.13.52
+pillow==10.3.0
+preshed==3.0.9
+presidio_analyzer==2.2.355
+presidio_anonymizer==2.2.355
 pycodestyle==2.12.1
+pycparser==2.22
+pycryptodome==3.21.0
+pydantic==2.7.1
+pydantic_core==2.18.2
 pyflakes==3.2.0
+Pygments==2.18.0
+PyMySQL==1.1.1
+pyparsing==3.1.2
+pytesseract==0.3.10
+python-dateutil==2.9.0.post0
+python-dotenv==1.0.1
+pytz==2024.1
+PyYAML==6.0.2
+red-black-tree-mod==1.20
+regex==2024.11.6
+requests==2.32.3
+requests-file==2.1.0
+rich==13.9.4
+RTFDE==0.1.2
+safetensors==0.4.5
+shellingham==1.5.4
+six==1.16.0
+smart-open==7.1.0
+soupsieve==2.5
+spacy==3.8.2
+spacy-huggingface-pipelines==0.0.4
+spacy-legacy==3.0.12
+spacy-loggers==1.0.5
+SQLAlchemy==2.0.36
+srsly==2.5.0
+sympy==1.13.1
+tesseract==0.1.3
+tesserocr==2.7.0
+thinc==8.3.3
+tldextract==5.1.3
+tokenizers==0.21.0
+torch==2.5.1
+tqdm==4.67.1
+transformers==4.47.1
+typer==0.15.1
+typing_extensions==4.11.0
+tzdata==2024.1
+tzlocal==5.2
+urllib3==2.3.0
+wasabi==1.1.3
+weasel==0.4.1
+wrapt==1.17.0

requirements/ci_requirements.txt

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+flake8==7.1.1
+isort==5.13.2
+mccabe==0.7.0
+pycodestyle==2.12.1
+pyflakes==3.2.0

requirements/full.txt

Lines changed: 0 additions & 46 deletions
This file was deleted.

run.sh

Lines changed: 41 additions & 10 deletions
@@ -1,16 +1,47 @@
 #!/usr/bin/env bash

-# Should provide the directory where this script lives in most cases.
-SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+# Exit on Error.
+set -e

-# The name of our image from the Gitlab Container Registry.
-IMAGE_NAME="registry.gitlab.com/morrisanimalfoundation/grls:vmrt-tesseract-utilities"
+# Read our .env file.
+export $(grep -v '^#' .env | xargs)

-# Build the image with our special build args.
-# These matter more on Jenkins, but need to be placeheld anyway.
-docker image build -t $IMAGE_NAME --cache-from $IMAGE_NAME --cache-to type=inline --build-arg USER_ID=$(id -u ${USER}) .
+if [[ -z $SQL_PASSWORD ]]; then
+  echo "Error: Please set the SQL_PASSWORD variable in your .env file."
+  exit 1
+fi


-# Run the container in a disposable manner.
-# Add a volume to the current working dir.
-docker run --rm -it -v $HOME/MAF\ Dropbox/GRLS/Operations/ENROLLED\ DOGS:/data -v $SCRIPT_DIR:/workspace -v $HOME/.ssh:/home/jenkins/.ssh $IMAGE_NAME bash
+# Build the Docker images with the current user's ID.
+docker compose build --build-arg USER_ID=$(id -u ${USER})
+
+# Start the containers in detached mode.
+docker compose up -d
+
+# Wait for the database container to start (with a timeout).
+TIMEOUT=30
+COUNTER=0
+until $(docker exec -i vmrt-emr-process-log-mysql mysql -uroot -p$SQL_PASSWORD -e "DROP DATABASE IF EXISTS vmrt_emr_transcription; CREATE DATABASE vmrt_emr_transcription;") || [[ $COUNTER -eq $TIMEOUT ]]; do
+  echo "Waiting for database container to start... ($COUNTER/$TIMEOUT)"
+  sleep 1
+  COUNTER=$((COUNTER+1))
+done
+
+if [[ $COUNTER -eq $TIMEOUT ]]; then
+  echo "Error: Timeout waiting for database container."
+  exit 1
+fi
+
+echo "Database initialized successfully."
+
+# Execute the Python script.
+if ! docker exec -t vmrt-emr-workspace python ./scripts/database_setup.py install; then
+  echo "Error: Failed to execute Python script."
+  exit 1
+fi
+
+# Provide an interactive Bash shell within the container.
+docker exec -it vmrt-emr-workspace bash
+
+# Stop and remove the containers.
+docker compose down
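
The wait loop in run.sh retries a command once per second until it succeeds or a counter reaches the timeout. The same pattern is sketched in Python below, with the readiness probe passed in as a callable. The name `wait_until` and its parameters are assumptions for this illustration, and the fake probe stands in for the real `docker exec ... mysql` call. One difference is deliberate: this version returns the result of the probe itself, so a success on the final attempt is not reported as a timeout, which the counter comparison in the shell script appears to allow.

```python
import time


def wait_until(check, max_attempts=30, interval_seconds=1.0, sleep=time.sleep):
    """Call check() until it returns True or max_attempts is used up.

    Returns True on success and False if every attempt failed.
    """
    attempts = 0
    while attempts < max_attempts:
        if check():
            return True
        attempts += 1
        print(f"Waiting for database container to start... ({attempts}/{max_attempts})")
        sleep(interval_seconds)
    return False


# Fake probe for demonstration: reports "ready" on the third call.
calls = {"count": 0}


def fake_database_ready():
    calls["count"] += 1
    return calls["count"] >= 3


# The sleep function is replaced with a no-op so the demo finishes instantly.
ready = wait_until(fake_database_ready, max_attempts=30, sleep=lambda _seconds: None)
print(ready, calls["count"])
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays; in actual use the default `time.sleep` would apply.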
