@@ -11,13 +11,13 @@ to make this as a general program for text multi-label classification task.
## Usage
### Python Env
- ``` sh
+ ``` shell
micromamba env create -f environment.yaml -p ./_pyenv --yes
micromamba activate ./_pyenv
pip install -r requirements.txt
```
### Run Tests
- ``` sh
+ ``` shell
python -m pytest ./test --cov=./src/plm_icd_multi_label_classifier --durations=0 -v
```
@@ -55,7 +55,11 @@ And the `dict.json` is for bi-directional mapping between label names and IDs,
```
{
    "label2id": {
-
+        "label_0": 0,
+        "label_1": 1,
+        "label_2": 2,
+        ...
+        "label_n": n
    },
    "id2label": {
        "0": "label_0",
@@ -66,61 +70,20 @@ And the `dict.json` is for bi-directional mapping between label names and IDs,
    }
}
```
- As the label ID will also be used as the index in a one-hot vector, it must start from 0.
-
-
- ### (MIMIC3 Dataset Preparation)
- The ETL contains the following steps:
- * Original JSON line dataset preparation.
- * Transform the JSON line file to a **limited** JSON line file, which means all `list` or `dict`
- will be transformed to `string`.
- * Data dictionary generation.
-
- Note, the final data folder should contain 4 files: train.jsonl, dev.jsonl, test.jsonl, dict.json.
-
- #### Prepare (Specific) Original JSON Line Dataset
- The data should be in JSON line format; a MIMIC-III data ETL program is provided here:
- ``` sh
- python ./bin/etl/etl_mimic3_processing.py ${YOUR_MIMIC3_DATA_DIRECTORY} ${YOUR_TARGET_OUTPUT_DIRECTORY}
- ```
- When you need to use this program for text multi-label classification on your customized
- dataset, you can just transform it into a JSON line file and use the **training config**
- file to specify which field is text and which is label.
-
- **NOTE**: since you are dealing with a multi-label classification task here, the
- label field should be formatted as a CSV string, for example:
- ```
- {"text": "this is a fake text.", "label": "label1,label2,label3,label4"}
- ```
-
- But you can also use your specific dataset.
-
- #### Transform To Limited JSON Line Dataset
- Although JSON line files are used, `list` and `dict` values are not allowed in the JSON here.
- I believe "flat" JSON can make things clear, so a tool is provided which can help
- to convert `list` and `dict` values contained in JSON to `string`:
- ``` shell
- python ./bin/etl/etl_jsonl2limited_jsonl.py ${ORIGINAL_JSON_LINE_DATASET} ${TRANSFORMED_JSON_LINE_DATASET}
- ```
-
- **NOTE: although you can put the dataset in any directory you like, you HAVE TO name your datasets
- train.jsonl, dev.jsonl and test.jsonl.**
-
- #### Data Dictionary Generation
- Generate (some) data dictionaries by scanning train, dev and test data. Run:
- ``` shell
- python ./bin/etl/etl_generate_data_dict.py ${TRAIN_CONFIG_JSON_FILE_PATH}
- ```
+ **As the label ID will also be used as the index in a one-hot vector, it must start from 0.**
+
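+ As a quick illustration of that indexing (a hypothetical sketch, not code from this repo;
+ the `label2id` mapping and CSV label string follow the examples above):
+ ``` python
+ # Build a multi-hot target vector from a CSV label string,
+ # using the label2id mapping from dict.json.
+ label2id = {"label_0": 0, "label_1": 1, "label_2": 2}
+
+ def to_multi_hot(label_str: str, splitter: str = ",") -> list:
+     # Label IDs double as vector indices, hence they must start at 0.
+     vec = [0.0] * len(label2id)
+     for name in label_str.split(splitter):
+         vec[label2id[name]] = 1.0
+     return vec
+
+ print(to_multi_hot("label_0,label_2"))  # [1.0, 0.0, 1.0]
+ ```
+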
+ As the original paper uses MIMIC-III as its dataset, a
+ [pre-built ETL](https://github.com/innerNULL/PLM-ICD-multi-label-classifier/blob/main/doc/mimic-iii-dataset-etl.md)
+ is also provided here to generate training data from MIMIC-III data.


### Training and Evaluation
- ``` sh
+ ``` shell
CUDA_VISIBLE_DEVICES=0,1,2,3 python ./train.py ${TRAIN_CONFIG_JSON_FILE_PATH}
```
- #### Training Config File
- The format should be JSON, most of the parameters are easy to understand if you are an
- MLE or researcher:
+ The config file format is JSON; most of the parameters are easy to understand
+ if you are an MLE/data scientist/researcher:
* `chunk_size`: The number of token IDs in each chunk.
* `chunk_num`: The number of chunks each text/document should have; short texts are padded first.
* `hf_lm`: HuggingFace language model name/path; each `hf_lm` may have a different `lm_hidden_dim`,
@@ -145,55 +108,27 @@ MLE or researcher:
* `ckpt_dir`: Checkpoint directory name.
* `log_period`: How many **batches** pass between each evaluation log printing.
* `dump_period`: How many **steps** pass between each checkpoint dump.
+ * `label_splitter`: The separator with which we split the concatenated label string into a list of label names.
+ * `eval.label_confidence_threshold`: Each label's confidence threshold; labels scoring higher will be set as positive during evaluation. A config sketch follows below.
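+
+ Putting the fields above together, here is a minimal, hypothetical config sketch (values are
+ illustrative only, and whether `eval.label_confidence_threshold` is a nested object or a flat
+ dotted key is an assumption; the repo's own example config is the source of truth):
+ ``` json
+ {
+     "chunk_size": 128,
+     "chunk_num": 8,
+     "hf_lm": "distilbert-base-uncased",
+     "lm_hidden_dim": 768,
+     "data_dir": "./_data/etl/mimic3/",
+     "ckpt_dir": "./_ckpt",
+     "log_period": 100,
+     "dump_period": 1000,
+     "label_splitter": ",",
+     "eval": {"label_confidence_threshold": 0.5}
+ }
+ ```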
- ## Examples
- ### Using MIMIC-III Data to Train an ICD10 Classification Model
- #### Preparation - Get Raw MIMIC-III Data
- Suppose you put the original MIMIC-III data under `./_data/raw/mimic3/` like:
- ```
- ./_data/raw/mimic3/
- ├── DIAGNOSES_ICD.csv
- ├── NOTEEVENTS.csv
- └── PROCEDURES_ICD.csv
-
- 0 directories, 3 files
- ```
- #### ETL - Training Dataset Building
- This is about joining the necessary tables' data together and building the training dataset. Suppose we are
- going to put the training data under `./_data/etl/mimic3/`; per this program's rules, the directory
- should contain 3 files, train.jsonl, dev.jsonl and test.jsonl, like:
- ```
- ./_data/etl/mimic3/
- ├── dev.jsonl
- ├── dict.json
- ├── dim_processed_base_data.jsonl
- ├── test.jsonl
- └── train.jsonl
-
- 0 directories, 5 files
- ```
- You can run:
+ ### Inference
``` shell
- python ./bin/etl/etl_mimic3_processing.py ./_data/raw/mimic3/ ./_data/etl/mimic3/
+ python inf.py inf.json
```
+ Most parameter explanations are already in `inf.json`.
- #### Config - Prepare Your Training Config File
- The `data_dir` in this config will be needed by the next ETL step; you can just refer to `train_mimic3_icd.json`.
-
- #### ETL - Convert Training Dataset JSONL to Limited JSONL File
- Note this step is unnecessary, since the outputs of `./bin/etl/etl_mimic3_processing.py` are
- already limited JSON line files, so even if you run the following program, you will get
- exactly the same files:
+ ### Evaluation
``` shell
- python ./bin/etl/etl_jsonl2limited_jsonl.py ./_data/raw/mimic3/${INPUT_JSONL_FILE} ./_data/raw/mimic3/${OUTPUT_JSONL_FILE}
+ python eval.py eval.json
```
+ Most parameter explanations are already in `eval.json`.
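+
+ As an illustration of how `eval.label_confidence_threshold` behaves (a hypothetical sketch,
+ not this repo's actual evaluation code):
+ ``` python
+ # A label is predicted positive when its confidence exceeds the threshold.
+ scores = {"label_0": 0.91, "label_1": 0.42, "label_2": 0.77}
+ threshold = 0.5
+ positives = [name for name, score in scores.items() if score > threshold]
+ print(positives)  # ['label_0', 'label_2']
+ ```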
- #### Training - Train an ICD10 Classification Model with the MIMIC-III Dataset
- ``` shell
- CUDA_VISIBLE_DEVICES=0,1,2,3 python ./train.py ./train_mimic3_icd.json
- ```


+ ## Examples
+ ### Training Examples
+ * [ICD10 prediction based on MIMIC-III data](https://github.com/innerNULL/PLM-ICD-multi-label-classifier/blob/main/doc/mimic-iii-train-example.md)
+
## Other Implementation Details
* After `chunk_size` and `chunk_num` are defined, each text's token ID length is fixed to `chunk_size * chunk_num`;
if a text is not long enough, it is automatically padded first. A small sketch of this behavior follows.
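
A hypothetical illustration of that fixed-length chunking (assuming a flat list of token IDs
and front-padding with a pad ID of 0; not this repo's actual code):
``` python
def to_chunks(token_ids: list, chunk_size: int, chunk_num: int, pad_id: int = 0) -> list:
    """Pad/truncate token IDs to chunk_size * chunk_num, then split into chunks."""
    total = chunk_size * chunk_num
    # Pad at the front when the text is too short, truncate when too long.
    padded = [pad_id] * max(0, total - len(token_ids)) + token_ids[:total]
    return [padded[i * chunk_size:(i + 1) * chunk_size] for i in range(chunk_num)]

print(to_chunks([101, 2023, 3793, 102], chunk_size=4, chunk_num=2))
# [[0, 0, 0, 0], [101, 2023, 3793, 102]]
```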