Commit dd463fa

#3 - visit date extraction (#9)
* 3 - initial work
* 3 - clean up a few things
* 3 - add pytests
* 3 - address feedback on the pull request
* 3 - double to single quotes for consistency
1 parent 0e906ee commit dd463fa

15 files changed

+943
-66
lines changed

.github/workflows/run_pytests.yaml

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+# This workflow will install Python dependencies and run pytests
+# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python
+
+name: Run Pytests
+
+on:
+  push:
+    branches: [ main ]
+  pull_request:
+    branches: [ main ]
+
+jobs:
+  build:
+
+    runs-on: ubuntu-latest
+
+    steps:
+    - uses: actions/checkout@v3
+    - name: Set up Python
+      uses: actions/setup-python@v3
+      with:
+        python-version: '3.11'
+
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install -U pip setuptools wheel
+        if [ -f requirements/pytest_requirements.txt ]; then pip install -r requirements/pytest_requirements.txt; fi
+
+    - name: Run tests
+      run: |
+        pytest | tee output.txt
+        cat output.txt >> $GITHUB_STEP_SUMMARY

Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ COPY requirements.txt .
 ENV PYTHONPATH=\/workspace
 ARG USER_ID
 RUN apt-get update -y && \
-    apt-get -y install poppler-utils tesseract-ocr yq &&\
+    apt-get -y install poppler-utils tesseract-ocr yq build-essential libopenblas-dev libomp-dev &&\
     git clone https://github.com/tesseract-ocr/tessdata.git /usr/share/tessdata &&\
     useradd -l -u ${USER_ID} -g sudo jenkins && \
     mkdir -m 0755 /home/jenkins && chown jenkins /home/jenkins

README.md

Lines changed: 9 additions & 7 deletions
@@ -16,11 +16,13 @@ Goals:
 The scripts are easily run via the Dockerfile included in this repo.
 
 1. Copy the example.env file to .env and fill in the values.
-    a. The value for `SQL_CONNECTION_STRING` should be the connection string for the database container. (i.e. `mysql://user:password@vmrt-emr-process-log-mysql:3306/vmrt_emr_transcription`)
-2. The easiest way to spin up the docker images is to run the `run.sh` script within the repository root directory. You can also build the containers using the docker compose file. `docker build -t <container name> .`
-3. Set up your DB by running `python /workspace/scripts/database_setup.py install` within the container.
-4. Get ready for the transcription process by running `python scripts/create_transcription_process.py /data`
-5. Use the `transcribe_pdfs.py` script to transcribe the files needed.
+    a. The value for `SQL_CONNECTION_STRING` should be the connection string for the database container. (i.e. `mysql://user:password@vmrt-emr-process-log-mysql:3306/vmrt_emr_transcription`)
+2. The easiest way to spin up the docker images is to run the `run.sh` script within the repository root directory.
+3. Set up your DB by running `python /workspace/scripts/database_setup.py install` within the container.
+4. Get ready for the transcription process by running `python scripts/create_transcription_process.py /data`
+5. Use the `transcribe_pdfs.py` script to transcribe the files needed.
     - `python /workspace/scripts/transcribe_pdfs.py /workspace/output`
-6. Use the `pii_scrubber.py` script to remove PII from the text.
-    - `python /workspace/scripts/scrubbers/pii_scrubber.py`
+6. Use the `pii_scrubber.py` script to remove PII from the text.
+    - `python /workspace/scripts/scrubbers/pii_scrubber.py /workspace/output`
+7. Use the scripts in the `scripts/metadata_miners` directory to find data in the text.
+    - `python /workspace/scripts/metadata_miners/visit_date_miner.py /workspace/output --visit_date_tsv=/path/to/vet_visits.tsv --dog_profile_tsv=/path/to/dog_profile.tsv`

example.env

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
-SQL_CONNECTION_STRING=''
+SQL_CONNECTION_STRING='mysql+pymysql://{db_user}:{db_password}@{db_host}:3306/{db_name}'
 SQL_PASSWORD=''
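
The new default value is a template with `{db_user}`-style placeholders. The sketch below shows one way such a template could be filled in; whether the repository substitutes the placeholders in code or expects them to be edited by hand is not shown in this diff, so `build_connection_string` and the sample credentials are hypothetical. The host and database names are taken from the README example. The credentials are URL-encoded because characters such as `@` or `/` in a password would otherwise break URL parsing.

```python
from urllib.parse import quote_plus, urlsplit

# Template copied from example.env in this commit.
TEMPLATE = 'mysql+pymysql://{db_user}:{db_password}@{db_host}:3306/{db_name}'


def build_connection_string(db_user: str, db_password: str, db_host: str, db_name: str) -> str:
    """Fill the template, URL-encoding the credentials (hypothetical helper)."""
    return TEMPLATE.format(
        db_user=quote_plus(db_user),
        db_password=quote_plus(db_password),
        db_host=db_host,
        db_name=db_name,
    )


# Sample values: the user and password are made up for illustration.
url = build_connection_string(
    'user', 'p@ss/word', 'vmrt-emr-process-log-mysql', 'vmrt_emr_transcription'
)
parts = urlsplit(url)
print(url)
print(parts.hostname, parts.port, parts.path)
```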

pytest.ini

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+[pytest]
+testpaths = tests/pytests
+pythonpath = .
+addopts = --doctest-modules
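
With `--doctest-modules` in `addopts`, pytest also runs `>>>` examples found in the docstrings of the Python modules it collects. A minimal sketch of what such a docstring example looks like follows; `to_iso_date` is a hypothetical helper written for illustration and is not taken from this repository.

```python
def to_iso_date(month: int, day: int, year: int) -> str:
    """Format a date as YYYY-MM-DD.

    The example below is the kind of snippet pytest would execute
    and compare against the expected output when doctests are enabled.

    >>> to_iso_date(1, 5, 2024)
    '2024-01-05'
    """
    return f'{year:04d}-{month:02d}-{day:02d}'


print(to_iso_date(1, 5, 2024))
```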

requirements.ini

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+numpy==2.0.2
+pdf2image==1.17.0
+presidio_analyzer==2.2.357
+presidio_anonymizer==2.2.357
+python-dateutil==2.9.0.post0
+python-dotenv==1.0.1
+SQLAlchemy==2.0.37
+tesserocr==2.7.1
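
The compiled `requirements.txt` in this commit lists these pins under normalized names (`presidio-analyzer`, `sqlalchemy`). The sketch below shows one way to check that every direct pin survives compilation, using the name normalization described in PEP 503. The helper names and the inline sample text are illustrative assumptions, not part of the repository.

```python
import re


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r'[-_.]+', '-', name).lower()


def parse_pins(text: str) -> dict[str, str]:
    """Map normalized package names to their pinned versions."""
    pins = {}
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments such as "# via ..."
        if '==' not in line:
            continue
        name, version = line.split('==', 1)
        pins[normalize(name.strip())] = version.strip()
    return pins


# Two pins from requirements.ini and their compiled counterparts.
direct = parse_pins('presidio_analyzer==2.2.357\nSQLAlchemy==2.0.37\n')
compiled = parse_pins(
    'presidio-analyzer==2.2.357\n'
    '    # via -r requirements.ini\n'
    'sqlalchemy==2.0.37\n'
)

# Direct pins that are absent from, or differ in, the compiled file.
missing = {name: ver for name, ver in direct.items() if compiled.get(name) != ver}
print(missing)
```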

requirements.txt

Lines changed: 125 additions & 56 deletions
@@ -1,103 +1,172 @@
+#
+# This file is autogenerated by pip-compile with Python 3.11
+# by the following command:
+#
+#    pip-compile requirements.ini
+#
 annotated-types==0.6.0
+    # via pydantic
 azure-core==1.32.0
-beautifulsoup4==4.12.3
+    # via presidio-anonymizer
 blis==1.1.0
+    # via thinc
 catalogue==2.0.10
+    # via
+    #   spacy
+    #   srsly
+    #   thinc
 certifi==2024.12.14
-cffi==1.16.0
+    # via requests
 charset-normalizer==3.4.0
+    # via requests
 click==8.1.8
+    # via typer
 cloudpathlib==0.20.0
-colorclass==2.2.2
-compressed-rtf==1.0.6
+    # via weasel
 confection==0.1.5
-cryptography==43.0.0
+    # via
+    #   thinc
+    #   weasel
 cymem==2.0.10
-dnspython==2.6.1
-easygui==0.98.3
-ebcdic==1.1.1
-email_validator==2.1.1
-en_core_web_lg @ https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl#sha256=293e9547a655b25499198ab15a525b05b9407a75f10255e405e8c3854329ab63
-en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl#sha256=1932429db727d4bff3deed6b34cfc05df17794f4a52eeb26cf8928f7c1a0fb85
-extract-msg==0.48.7
+    # via
+    #   preshed
+    #   spacy
+    #   thinc
 filelock==3.16.1
-flake8==7.1.1
-fsspec==2024.12.0
+    # via tldextract
 greenlet==3.1.1
-huggingface-hub==0.27.0
+    # via sqlalchemy
 idna==3.7
-isort==5.13.2
-Jinja2==3.1.5
+    # via
+    #   requests
+    #   tldextract
+jinja2==3.1.5
+    # via spacy
 langcodes==3.5.0
-language_data==1.3.0
-lark==1.1.9
+    # via spacy
+language-data==1.3.0
+    # via langcodes
 marisa-trie==1.2.1
+    # via language-data
 markdown-it-py==3.0.0
-MarkupSafe==3.0.2
-mccabe==0.7.0
+    # via rich
+markupsafe==3.0.2
+    # via jinja2
 mdurl==0.1.2
-mpmath==1.3.0
-msoffcrypto-tool==5.4.1
+    # via markdown-it-py
 murmurhash==1.0.11
-mysqlclient==2.2.6
-networkx==3.4.2
+    # via
+    #   preshed
+    #   spacy
+    #   thinc
 numpy==2.0.2
-olefile==0.47
-oletools==0.60.2
+    # via
+    #   -r requirements.ini
+    #   blis
+    #   spacy
+    #   thinc
 packaging==24.0
-pandas==2.2.2
-pcodedmp==1.2.6
+    # via
+    #   spacy
+    #   thinc
+    #   weasel
 pdf2image==1.17.0
+    # via -r requirements.ini
 phonenumbers==8.13.52
+    # via presidio-analyzer
 pillow==10.3.0
+    # via pdf2image
 preshed==3.0.9
-presidio_analyzer==2.2.355
-presidio_anonymizer==2.2.355
-pycodestyle==2.12.1
-pycparser==2.22
+    # via
+    #   spacy
+    #   thinc
+presidio-analyzer==2.2.357
+    # via -r requirements.ini
+presidio-anonymizer==2.2.357
+    # via -r requirements.ini
 pycryptodome==3.21.0
+    # via presidio-anonymizer
 pydantic==2.7.1
-pydantic_core==2.18.2
-pyflakes==3.2.0
-Pygments==2.18.0
-PyMySQL==1.1.1
-pyparsing==3.1.2
-pytesseract==0.3.10
+    # via
+    #   confection
+    #   spacy
+    #   thinc
+    #   weasel
+pydantic-core==2.18.2
+    # via pydantic
+pygments==2.18.0
+    # via rich
 python-dateutil==2.9.0.post0
+    # via -r requirements.ini
 python-dotenv==1.0.1
-pytz==2024.1
-PyYAML==6.0.2
-red-black-tree-mod==1.20
+    # via -r requirements.ini
+pyyaml==6.0.2
+    # via presidio-analyzer
 regex==2024.11.6
+    # via presidio-analyzer
 requests==2.32.3
+    # via
+    #   azure-core
+    #   requests-file
+    #   spacy
+    #   tldextract
+    #   weasel
 requests-file==2.1.0
+    # via tldextract
 rich==13.9.4
-RTFDE==0.1.2
-safetensors==0.4.5
+    # via typer
 shellingham==1.5.4
+    # via typer
 six==1.16.0
+    # via
+    #   azure-core
+    #   python-dateutil
 smart-open==7.1.0
-soupsieve==2.5
+    # via weasel
 spacy==3.8.2
-spacy-huggingface-pipelines==0.0.4
+    # via presidio-analyzer
 spacy-legacy==3.0.12
+    # via spacy
 spacy-loggers==1.0.5
-SQLAlchemy==2.0.36
+    # via spacy
+sqlalchemy==2.0.37
+    # via -r requirements.ini
 srsly==2.5.0
-sympy==1.13.1
-tesseract==0.1.3
-tesserocr==2.7.0
+    # via
+    #   confection
+    #   spacy
+    #   thinc
+    #   weasel
+tesserocr==2.7.1
+    # via -r requirements.ini
 thinc==8.3.3
+    # via spacy
 tldextract==5.1.3
-tokenizers==0.21.0
-torch==2.5.1
+    # via presidio-analyzer
 tqdm==4.67.1
-transformers==4.47.1
+    # via spacy
 typer==0.15.1
-typing_extensions==4.11.0
-tzdata==2024.1
-tzlocal==5.2
+    # via
+    #   spacy
+    #   weasel
+typing-extensions==4.11.0
+    # via
+    #   azure-core
+    #   pydantic
+    #   pydantic-core
+    #   sqlalchemy
+    #   typer
 urllib3==2.3.0
+    # via requests
 wasabi==1.1.3
+    # via
+    #   spacy
+    #   thinc
+    #   weasel
 weasel==0.4.1
+    # via spacy
 wrapt==1.17.0
+    # via smart-open
+
+# The following packages are considered to be unsafe in a requirements file:
+# setuptools
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+pytest==8.3.4
+pytest-mock==3.14.0
+python-dateutil==2.9.0.post0
+python-dotenv==1.0.1
+SQLAlchemy==2.0.37

scripts/database_setup.py

Lines changed: 1 addition & 1 deletion
@@ -65,7 +65,7 @@ def parse_args() -> argparse.Namespace:
     """
     parser = argparse.ArgumentParser(
         prog='Performs database utility related functions.',)
-    parser.add_argument('operation', help='The database operation to perform, install or drop.')
+    parser.add_argument('operation', help='The database operation to perform, install or drop.', choices=['install', 'drop'])
     parser.add_argument('--debug-sql', action='store_true', help='Enable SQL debugging')
     return parser.parse_args()
 
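
The effect of adding `choices` can be seen in a small standalone sketch. It mirrors the two arguments from the hunk above but is not the repository's `parse_args`; `build_parser` is a hypothetical name, and the sketch passes the descriptive text as `description` rather than `prog`. With `choices` set, argparse rejects any other value before the script's own logic runs.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the same two arguments as the hunk above (sketch)."""
    parser = argparse.ArgumentParser(
        description='Performs database utility related functions.')
    parser.add_argument(
        'operation',
        choices=['install', 'drop'],
        help='The database operation to perform, install or drop.')
    parser.add_argument('--debug-sql', action='store_true', help='Enable SQL debugging')
    return parser


parser = build_parser()

# A listed value parses normally.
args = parser.parse_args(['install'])
print(args.operation, args.debug_sql)

# An unlisted value makes argparse print a usage error and exit.
try:
    parser.parse_args(['reset'])
    exit_code = None
except SystemExit as exc:
    exit_code = exc.code
print(exit_code)
```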
