Commit 166a261

Merge pull request #10 from innerNULL/dev
Inference & Evaluation
2 parents 9b667b8 + 3246a15 commit 166a261

8 files changed: 333 additions & 93 deletions

README.md

Lines changed: 27 additions & 92 deletions
@@ -11,13 +11,13 @@ to make this as a general program for text multi-label classification task.
 
 ## Usage
 ### Python Env
-```sh
+```shell
 micromamba env create -f environment.yaml -p ./_pyenv --yes
 micromamba activate ./_pyenv
 pip install -r requirements.txt
 ```
 ### Run Tests
-```sh
+```shell
 python -m pytest ./test --cov=./src/plm_icd_multi_label_classifier --durations=0 -v
 ```
 
@@ -55,7 +55,11 @@ And the `dict.json` is for bi-directionary mapping between label names and IDs,
 ```
 {
     "label2id": {
-
+        "label_0": 0,
+        "label_1": 1,
+        "label_2": 2,
+        ...
+        "label_n": n
     },
     "id2label": {
         0: "label_0",
@@ -66,61 +70,20 @@ And the `dict.json` is for bi-directionary mapping between label names and IDs,
     }
 }
 ```
-Because each label ID is also used as an index into the one-hot vector, IDs must start from 0.
-
-
-### (MIMIC3 Dataset Preparation)
-The ETL contains the following steps:
-* Original JSON line dataset preparation.
-* Transform the JSON line file to a **limited** JSON line file, which means every `list` or `dict`
-will be transformed to a `string`.
-* Data dictionary generation.
-
-Note, the final data folder should contain 4 files: train.jsonl, dev.jsonl, test.jsonl, dict.json.
-
-#### Prepare (Specific) Original JSON Line Dataset
-The data should be in JSON line format; a MIMIC-III data ETL program is provided here:
-```sh
-python ./bin/etl/etl_mimic3_processing.py ${YOUR_MIMIC3_DATA_DIRECTORY} ${YOUR_TARGET_OUTPUT_DIRECTORY}
-```
-When you need to use this program for text multi-label classification on your customized
-dataset, you can just convert it into a JSON line file and use the **training config**
-file to specify which field is the text and which is the label.
-
-**NOTE**, since this is a multi-label classification task, the
-label field should be formatted as a CSV string, for example:
-```
-{"text": "this is a fake text.", "label": "label1,label2,label3,label4"}
-```
-
-But you can also use your specific dataset.
-
-#### Transform To Limited JSON Line Dataset
-Although JSON line files are used, `list` and `dict` values are not allowed in the JSON.
-I believe "flat" JSON can make things clear, so a tool is provided here to help
-convert `list` and `dict` values contained in JSON to `string`:
-```shell
-python ./bin/etl/etl_jsonl2limited_jsonl.py ${ORIGINAL_JSON_LINE_DATASET} ${TRANSFORMED_JSON_LINE_DATASET}
-```
-
-**NOTE, although you can put the datasets in any directory you like, you HAVE TO name your datasets
-train.jsonl, dev.jsonl and test.jsonl.**
-
-#### Data Dictionary Generation
-Generate (some) data dictionaries by scanning train, dev and test data. Run:
-```shell
-python ./bin/etl/etl_generate_data_dict.py ${TRAIN_CONFIG_JSON_FILE_PATH}
-```
+**Because each label ID is also used as an index into the one-hot vector, IDs must start from 0.**
+
+As the original paper uses MIMIC-III as its dataset, a
+[pre-built ETL](https://github.com/innerNULL/PLM-ICD-multi-label-classifier/blob/main/doc/mimic-iii-dataset-etl.md)
+is also provided to generate training data from MIMIC-III data.
 
 
 ### Training and Evaluation
-```sh
+```shell
 CUDA_VISIBLE_DEVICES=0,1,2,3 python ./train.py ${TRAIN_CONFIG_JSON_FILE_PATH}
 ```
 
-#### Training Config File
-The format should be JSON; most of the parameters are easy to understand if you are an
-MLE or researcher:
+The config file format is JSON; most of the parameters are easy to understand
+if you are an MLE/data scientist/researcher:
 * `chunk_size`: The number of token IDs in each chunk.
 * `chunk_num`: The number of chunks each text/document should have; short texts are padded first.
 * `hf_lm`: HuggingFace language model name/path, each `hf_lm` may have different `lm_hidden_dim`,
@@ -145,55 +108,27 @@ MLE or researcher:
 * `ckpt_dir`: Checkpoint directory name.
 * `log_period`: How many **batches** pass between evaluation log printouts.
 * `dump_period`: How many **steps** pass between checkpoint dumps.
+* `label_splitter`: The separator used to split the concatenated label string into a list of label names.
+* `eval.label_confidence_threshold`: The confidence threshold for each label; a label scoring higher is treated as positive during evaluation.
 
-## Examples
-### Using MIMIC-III Data to Train an ICD10 Classification Model
-#### Preparation - Get Raw MIMIC-III Data
-Suppose you put the original MIMIC-III data under `./_data/raw/mimic3/` like:
-```
-./_data/raw/mimic3/
-├── DIAGNOSES_ICD.csv
-├── NOTEEVENTS.csv
-└── PROCEDURES_ICD.csv
-
-0 directories, 3 files
-```
-#### ETL - Training Dataset Building
-This step joins the necessary tables' data together and builds the training dataset. Suppose we are
-going to put the training data under `./_data/etl/mimic3/`; by this program's rules, the directory
-should contain at least 3 files, train.jsonl, dev.jsonl and test.jsonl, like:
-```
-./_data/etl/mimic3/
-├── dev.jsonl
-├── dict.json
-├── dim_processed_base_data.jsonl
-├── test.jsonl
-└── train.jsonl
-
-0 directories, 5 files
-```
-You can run:
+### Inference
 ```shell
-python ./bin/etl/etl_mimic3_processing.py ./_data/raw/mimic3/ ./_data/etl/mimic3/
+python inf.py inf.json
 ```
+Most parameter explanations are already in `inf.json`.
 
-#### Config - Prepare Your Training Config File
-The `data_dir` in this config will be needed by the next ETL step; you can just refer to `train_mimic3_icd.json`.
-
-#### ETL - Convert Training Dataset JSONL to Limited JSONL File
-Note this step is unnecessary, since the outputs of `./bin/etl/etl_mimic3_processing.py` are
-already limited JSON line files, so even if you run the following program, you will get
-exactly the same files:
+### Evaluation
 ```shell
-python ./bin/etl/etl_jsonl2limited_jsonl.py ./_data/raw/mimic3/${INPUT_JSONL_FILE} ./_data/raw/mimic3/${OUTPUT_JSONL_FILE}
+python eval.py eval.json
 ```
+Most parameter explanations are already in `eval.json`.
 
-#### Training - Training ICD10 Classification Model with MIMIC-III Dataset
-```shell
-CUDA_VISIBLE_DEVICES=0,1,2,3 python ./train.py ./train_mimic3_icd.json
-```
 
 
+## Examples
+### Training Examples
+* [ICD10 prediction based on MIMIC-III data](https://github.com/innerNULL/PLM-ICD-multi-label-classifier/blob/main/doc/mimic-iii-train-example.md)
+
 ## Other Implementation Details
 * After `chunk_size` and `chunk_num` are defined, each text's token ID length is fixed to `chunk_size * chunk_num`;
 if the text is not long enough, it is automatically padded first.
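
A minimal sketch of the fixed-length behaviour described under "Other Implementation Details" may help here. It assumes right-side padding with a hypothetical `pad_id` of 0 and truncation of over-long inputs; the padding side, pad token and truncation policy actually used by the repository are not visible in this diff, and `to_chunks` is an illustrative name rather than a function from the codebase.

```python
from typing import List


def to_chunks(
    token_ids: List[int], chunk_size: int, chunk_num: int, pad_id: int = 0
) -> List[List[int]]:
    """Pad or truncate token IDs to chunk_size * chunk_num, then split into chunks."""
    total_len = chunk_size * chunk_num
    # Assumption: over-long inputs are truncated; short inputs are right-padded.
    fixed = token_ids[:total_len] + [pad_id] * max(0, total_len - len(token_ids))
    return [fixed[i:i + chunk_size] for i in range(0, total_len, chunk_size)]


chunks = to_chunks([5, 6, 7, 8, 9], chunk_size=4, chunk_num=2)
print(chunks)  # [[5, 6, 7, 8], [9, 0, 0, 0]]
```

Whatever the real padding side is, the invariant is the same: every document yields exactly `chunk_num` chunks of `chunk_size` token IDs.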

doc/mimic-iii-dataset-etl.md

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+## MIMIC3 Dataset ETL
+The ETL contains the following steps:
+* Original JSON line dataset preparation.
+* Transform the JSON line file to a **limited** JSON line file, which means every `list` or `dict`
+will be transformed to a `string`.
+* Data dictionary generation.
+
+Note, the final data folder should contain 4 files: train.jsonl, dev.jsonl, test.jsonl, dict.json.
+
+### Prepare (Specific) Original JSON Line Dataset
+The data should be in JSON line format; a MIMIC-III data ETL program is provided here:
+```sh
+python ./bin/etl/etl_mimic3_processing.py ${YOUR_MIMIC3_DATA_DIRECTORY} ${YOUR_TARGET_OUTPUT_DIRECTORY}
+```
+When you need to use this program for text multi-label classification on your customized
+dataset, you can just convert it into a JSON line file and use the **training config**
+file to specify which field is the text and which is the label.
+
+**NOTE**, since this is a multi-label classification task, the
+label field should be formatted as a CSV string, for example:
+```
+{"text": "this is a fake text.", "label": "label1,label2,label3,label4"}
+```
+
+But you can also use your specific dataset.
+
+### Transform To Limited JSON Line Dataset
+Although JSON line files are used, `list` and `dict` values are not allowed in the JSON.
+I believe "flat" JSON can make things clear, so a tool is provided here to help
+convert `list` and `dict` values contained in JSON to `string`:
+```shell
+python ./bin/etl/etl_jsonl2limited_jsonl.py ${ORIGINAL_JSON_LINE_DATASET} ${TRANSFORMED_JSON_LINE_DATASET}
+```
+
+**NOTE, although you can put the datasets in any directory you like, you HAVE TO name your datasets
+train.jsonl, dev.jsonl and test.jsonl.**
+
+### Data Dictionary Generation
+Generate (some) data dictionaries by scanning train, dev and test data. Run:
+```shell
+python ./bin/etl/etl_generate_data_dict.py ${TRAIN_CONFIG_JSON_FILE_PATH}
+```
+
+
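
To make the data dictionary step concrete, here is a rough sketch of how a `dict.json`-style mapping could be built from label strings, with IDs starting at 0 as the README requires. It is not the code of `etl_generate_data_dict.py`, which is not part of this diff; `build_label_dict`, the comma splitter and the sorted ordering are all assumptions made for illustration.

```python
import json
from typing import Dict, Iterable, List


def build_label_dict(label_strings: Iterable[str], splitter: str = ",") -> Dict[str, Dict]:
    """Collect label names and assign contiguous IDs starting from 0."""
    names = set()
    for label_string in label_strings:
        for name in label_string.split(splitter):
            name = name.strip()
            if name:
                names.add(name)
    ordered: List[str] = sorted(names)  # sorted only to make the IDs deterministic
    label2id = {name: idx for idx, name in enumerate(ordered)}
    # JSON object keys are always strings, so the IDs are stringified here.
    id2label = {str(idx): name for name, idx in label2id.items()}
    return {"label2id": label2id, "id2label": id2label}


label_dict = build_label_dict(["label1,label2", "label2,label3", ""])
print(json.dumps(label_dict, indent=2))
```

Because the IDs come from `enumerate`, they are contiguous and zero-based, so each ID can be used directly as a one-hot index.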

doc/mimic-iii-train-example.md

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+## Using MIMIC-III Data to Train an ICD10 Classification Model
+### Preparation - Get Raw MIMIC-III Data
+Suppose you put the original MIMIC-III data under `./_data/raw/mimic3/` like:
+```
+./_data/raw/mimic3/
+├── DIAGNOSES_ICD.csv
+├── NOTEEVENTS.csv
+└── PROCEDURES_ICD.csv
+
+0 directories, 3 files
+```
+### ETL - Training Dataset Building
+This step joins the necessary tables' data together and builds the training dataset. Suppose we are
+going to put the training data under `./_data/etl/mimic3/`; by this program's rules, the directory
+should contain at least 3 files, train.jsonl, dev.jsonl and test.jsonl, like:
+```
+./_data/etl/mimic3/
+├── dev.jsonl
+├── dict.json
+├── dim_processed_base_data.jsonl
+├── test.jsonl
+└── train.jsonl
+
+0 directories, 5 files
+```
+You can run:
+```shell
+python ./bin/etl/etl_mimic3_processing.py ./_data/raw/mimic3/ ./_data/etl/mimic3/
+```
+
+### Config - Prepare Your Training Config File
+The `data_dir` in this config will be needed by the next ETL step; you can just refer to `train_mimic3_icd.json`.
+
+### ETL - Convert Training Dataset JSONL to Limited JSONL File
+Note this step is unnecessary, since the outputs of `./bin/etl/etl_mimic3_processing.py` are
+already limited JSON line files, so even if you run the following program, you will get
+exactly the same files:
+```shell
+python ./bin/etl/etl_jsonl2limited_jsonl.py ./_data/raw/mimic3/${INPUT_JSONL_FILE} ./_data/raw/mimic3/${OUTPUT_JSONL_FILE}
+```
+
+### Training - Training ICD10 Classification Model with MIMIC-III Dataset
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 python ./train.py ./train_mimic3_icd.json
+```
+
+
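
The "limited JSONL" conversion step in this example can be illustrated with a small sketch. How `etl_jsonl2limited_jsonl.py` actually stringifies values is not shown in this diff, so the choices here (joining lists with a separator, serializing dicts as JSON text) and the name `to_limited_record` are assumptions.

```python
import json
from typing import Any, Dict


def to_limited_record(record: Dict[str, Any], list_sep: str = ",") -> Dict[str, Any]:
    """Return a copy of record whose list/dict values are replaced by strings."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        if isinstance(value, list):
            # Assumption: list items are joined into one delimited string.
            flat[key] = list_sep.join(str(item) for item in value)
        elif isinstance(value, dict):
            # Assumption: nested objects are serialized as JSON text.
            flat[key] = json.dumps(value, sort_keys=True)
        else:
            flat[key] = value
    return flat


line = '{"text": "this is a fake text.", "label": ["label1", "label2"], "meta": {"id": 1}}'
limited = to_limited_record(json.loads(line))
print(json.dumps(limited))
```

A record that is already flat passes through unchanged, which is consistent with the note that re-running the conversion on the MIMIC-III ETL output yields the same files.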

eval.json

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
{
2+
"inf_results_path": "/path/to/inference/results/jsonl/file",
3+
"label_dict_path": "/path/to/dict.json/file",
4+
"gt_label_col": "label",
5+
"pred_results_col": "results",
6+
"min_confidence": 0.5,
7+
"label_splitter": "[SEP]"
8+
}
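
As a rough illustration of what `gt_label_col` and `label_splitter` control, the sketch below normalizes a ground-truth label field that may arrive either as a delimited string or as a list, following the same steps `eval.py` applies in this commit. `parse_labels` is an illustrative helper name, not a function in the repository.

```python
from typing import List, Union


def parse_labels(raw: Union[str, List[str]], label_splitter: str = "[SEP]") -> List[str]:
    """Normalize a ground-truth label field to a list of non-empty label names."""
    if isinstance(raw, str):
        raw = [part.strip(" ") for part in raw.split(label_splitter)]
    return [name for name in raw if name != ""]


print(parse_labels("label1[SEP] label2 [SEP]"))  # ['label1', 'label2']
print(parse_labels(["label3", ""]))               # ['label3']
```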

eval.py

Lines changed: 99 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,99 @@
1+
# -*- coding: utf-8 -*-
2+
# file: eval.py
3+
# date: 2025-08-05
4+
5+
6+
import pdb
7+
import os
8+
import sys
9+
import json
10+
import numpy as np
11+
from typing import Dict, List, Tuple, Optional
12+
from torch import IntTensor
13+
14+
from src.plm_icd_multi_label_classifier.metrics import metrics_func
15+
16+
17+
def verification(
18+
train_eval_results_path: Optional[str],
19+
eval_gt_one_hots: List[List[int]],
20+
eval_pred_one_hots: List[List[int]]
21+
) -> None:
22+
if train_eval_results_path is None:
23+
return
24+
train_eval_results: Dict = json.load(open(train_eval_results_path, "r"))
25+
train_gt_one_hots: List[List[int]] = train_eval_results["verbose"]["gt_one_hot"]
26+
train_pred_one_hots: List[List[int]] = train_eval_results["verbose"]["pred_one_hot"]
27+
assert(len(train_gt_one_hots) == len(train_pred_one_hots))
28+
assert(len(train_pred_one_hots) == len(eval_gt_one_hots))
29+
assert(len(eval_gt_one_hots) == len(eval_pred_one_hots))
30+
for i in range(len(train_gt_one_hots)):
31+
train_gt_one_hot: List[int] = train_gt_one_hots[i]
32+
eval_gt_one_hot: List[int] = eval_gt_one_hots[i]
33+
assert(len(train_gt_one_hot) == len(eval_gt_one_hot))
34+
for j in range(len(eval_gt_one_hot)):
35+
assert(train_gt_one_hot[j] == eval_gt_one_hot[j])
36+
return
37+
38+
39+
def main() -> None:
40+
configs: Dict = json.load(open(sys.argv[1], "r"))
41+
print(json.dumps(configs, indent=2))
42+
label_dict_path: str = configs["label_dict_path"]
43+
gt_label_col: str = configs["gt_label_col"]
44+
pred_results_col: str = configs["pred_results_col"]
45+
min_confidence: float = configs["min_confidence"]
46+
label_splitter: str = configs["label_splitter"]
47+
train_eval_results_path: Optional[str] = configs["train_eval_results_path"]
48+
49+
label_dict: Dict = json.load(open(label_dict_path, "r"))
50+
inf_results: List[Dict] = [
51+
json.loads(x)
52+
for x in open(configs["inf_results_path"], "r").read().split("\n")
53+
if x not in {""}
54+
]
55+
56+
label_dim: int = len(label_dict["id2label"])
57+
pred_one_hots: List[List[int]] = []
58+
gt_one_hots: List[List[int]] = []
59+
for sample in inf_results:
60+
pred_one_hot: np.ndarray = np.zeros(label_dim)
61+
gt_one_hot: np.ndarray = np.zeros(label_dim)
62+
63+
gt_labels: List[str] | str = sample[gt_label_col]
64+
if isinstance(gt_labels, str):
65+
gt_labels = [
66+
x.strip(" ") for x in gt_labels.split(label_splitter)
67+
]
68+
gt_labels = [x for x in gt_labels if x not in {""}]
69+
if len(gt_labels) == 0:
70+
continue
71+
pred_results: List[Tuple[str, float]] = sorted(
72+
[(k, v) for k, v in sample[pred_results_col].items()],
73+
reverse=True,
74+
key=lambda x: x[1]
75+
)
76+
for label in gt_labels:
77+
label_id: int = label_dict["label2id"][label]
78+
gt_one_hot[label_id] = 1.0
79+
for label, score in pred_results:
80+
if score < min_confidence:
81+
continue
82+
label_id: int = label_dict["label2id"][label]
83+
pred_one_hot[label_id] = 1.0
84+
if pred_one_hot.sum() == 0 and len(pred_results) > 0:
85+
top1_label: str = pred_results[0][0]
86+
top1_label_id: int = label_dict["label2id"][top1_label]
87+
pred_one_hot[top1_label_id] = 1.0
88+
89+
gt_one_hots.append(gt_one_hot.tolist())
90+
pred_one_hots.append(pred_one_hot.tolist())
91+
92+
verification(train_eval_results_path, gt_one_hots, pred_one_hots)
93+
results = metrics_func(IntTensor(pred_one_hots), IntTensor(gt_one_hots))
94+
print(json.dumps(results, indent=2))
95+
return
96+
97+
98+
if __name__ == "__main__":
99+
main()
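
The thresholding rule in `main()` can be isolated into a small, dependency-free sketch: a label is positive when its score is at least `min_confidence`, and if nothing clears the threshold the single best-scoring label is kept. `scores_to_one_hot` is a hypothetical helper written for illustration; it uses plain lists instead of NumPy arrays but is meant to follow the same logic as the loop above.

```python
from typing import Dict, List


def scores_to_one_hot(
    scores: Dict[str, float], label2id: Dict[str, int], min_confidence: float
) -> List[int]:
    """Mark labels scoring at least min_confidence; fall back to the top-1 label."""
    one_hot = [0] * len(label2id)
    for label, score in scores.items():
        if score >= min_confidence:
            one_hot[label2id[label]] = 1
    if sum(one_hot) == 0 and scores:
        # No label cleared the threshold: keep the single highest-scoring one.
        top1_label = max(scores, key=scores.get)
        one_hot[label2id[top1_label]] = 1
    return one_hot


label2id = {"label_0": 0, "label_1": 1, "label_2": 2}
print(scores_to_one_hot({"label_0": 0.9, "label_2": 0.6}, label2id, 0.5))  # [1, 0, 1]
print(scores_to_one_hot({"label_0": 0.2, "label_1": 0.4}, label2id, 0.5))  # [0, 1, 0]
```

The top-1 fallback means a sample with any predictions never evaluates as "no labels predicted", which affects precision and recall for low-confidence samples.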

inf.json

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,19 @@
1+
{
2+
"ckpt_path": "/path/to/model.pt",
3+
"label_dict_path": "/path/to/dict.json/file",
4+
"test_data_path": "/test/jsonl/file/path",
5+
"out_path": "./_inf_results.jsonl",
6+
"text_col": "text",
7+
"result_col": "results",
8+
"model": {
9+
"hf_lm": "google-bert/bert-base-uncased",
10+
"lm_hidden_dim": 768,
11+
"chunk_size": 512,
12+
"chunk_num": 2
13+
},
14+
"inf": {
15+
"min_confidence": 0.3,
16+
"top_k": 10,
17+
"label_splitter": "[SEP]"
18+
}
19+
}
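
`inf.py` is not included in the files shown here, so how it combines `inf.min_confidence` and `inf.top_k` is unknown. One plausible reading, sketched below purely as an assumption, is to keep the labels scoring at least `min_confidence`, ranked best first, and cap the result at `top_k`. `select_labels` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def select_labels(
    scores: Dict[str, float], min_confidence: float = 0.3, top_k: int = 10
) -> List[Tuple[str, float]]:
    """Keep labels scoring at least min_confidence, best first, capped at top_k."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    kept = [(label, score) for label, score in ranked if score >= min_confidence]
    return kept[:top_k]


print(select_labels({"a": 0.9, "b": 0.2, "c": 0.5}, min_confidence=0.3, top_k=1))
# [('a', 0.9)]
```

Note that the sample `inf.json` uses a lower threshold (0.3) than the sample `eval.json` (0.5), so under this reading inference would emit more candidate labels than evaluation counts as positive.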
