Merge pull request #10 from innerNULL/dev

innerNULL · web-flow · commit 166a2617f875 · 2025-08-08T08:59:16.000+08:00
Inference & Evaluation
diff --git a/README.md b/README.md
@@ -11,13 +11,13 @@ to make this as a general program for text multi-label classification task.
 
 ## Usage
 ### Python Env
-```sh
+```shell
 micromamba env create -f environment.yaml -p ./_pyenv --yes
 micromamba activate ./_pyenv
 pip install -r requirements.txt
 ```
 ### Run Tests
-```sh
+```shell
 python -m pytest ./test --cov=./src/plm_icd_multi_label_classifier --durations=0 -v
 ```
 
@@ -55,7 +55,11 @@ And the `dict.json` is for bi-directionary mapping between label names and IDs,
 ```
 {
   "label2id": {
-
+    "label_0": 0,
+    "label_1": 1, 
+    "label_2": 2, 
+    ...
+    "label_n": n
   },
   "id2label": {
     0: "label_0",
@@ -66,61 +70,20 @@ And the `dict.json` is for bi-directionary mapping between label names and IDs,
   }
 }
 ```
-As the label ID will be also used as index in one-hot vector, so must start from 0.
-
-
-### (MIMIC3 Dataset Preparation)
-The ETL contain following steps:
-* Origin JSON line dataset preparation
-* Transform JSON line file to **limited** JOSN line file, which means all `list` or `dict` 
-  will be transformed to `string`.
-* Data dictionary generation.
-
-Note, the final data folder should contains 4 files: train.jsonl, dev.jsonl, test.jsonl, dict.json.
-
-#### Prepare (Specific) Original JSON Line Dataset
-The data should be in JSON line format, here provide an MIMIC-III data ETL program:
-```sh
-python ./bin/etl/etl_mimic3_processing.py ${YOUR_MIMIC3_DATA_DIRECTORY} ${YOUR_TARGET_OUTPUT_DIRECTORY}
-```
-When you need use this program do text multi-label classification on your custimized 
-data set, you can just transfer it into a JSON line file, and using **training config** 
-file to specify which field is text and which is label. 
-
-**NOTE**, since here you are dealing a multi-label classification task, the format of 
-label field should be as a CSV string, for example:
-```
-{"text": "this is a fake text.", "label": "label1,label2,label3,label4"}
-```
-
-But you can also use your specific dataset.
-
-#### Transform To Limited JSON Line Dataset
-Although using JSON line file, here do not allow `list` and `dict` contained in JOSN. 
-I believe "flat" JSON can make things clear, so here provide a tool which can help 
-to convert `list` and `dict` contained in JSON to `string`:
-```shell
-python ./bin/etl/etl_jsonl2limited_jsonl.py ${ORIGINAL_JSON_LINE_DATASET} ${TRANSFORMED_JSON_LINE_DATASET}
-```
-
-**NOTE, alghouth you can put dataset in anly directory you like, but you HAVE TO naming you datasets 
-as train.jsonl, dev.jsonl and test.jsonl.**
-
-#### Data Dictionary Generation
-Generate (some) data dictionaries by scanning train, dev and test data. Run:
-```shell
-python ./bin/etl/etl_generate_data_dict.py ${TRAIN_CONFIG_JSON_FILE_PATH}
-```
+**Label IDs are also used as indices into the one-hot vector, so they must start from 0.**
+  
+As the original paper uses MIMIC-III as its dataset, this repository also provides a 
+[pre-built ETL](https://github.com/innerNULL/PLM-ICD-multi-label-classifier/blob/main/doc/mimic-iii-dataset-etl.md)  
+to generate training data from MIMIC-III data.
 
 
 ### Training and Evaluation
-```sh
+```shell
 CUDA_VISIBLE_DEVICES=0,1,2,3 python ./train.py ${TRAIN_CONFIG_JSON_FILE_PATH}
 ```
 
-#### Training Config File
-The format should be JSON, most of parameters are easy to understand is your are a 
-MLE or researcher:
+The config file is in JSON format. Most parameters are easy to understand 
+if you are an MLE, data scientist, or researcher:
 * `chunk_size`: Each chunks token ID number.
 * `chunk_num`: The number of chunk each text/document should have, padding first for short sentences.
 * `hf_lm`: HuggingFace language model name/path, each `hf_lm` may have different `lm_hidden_dim`, 
@@ -145,55 +108,27 @@ MLE or researcher:
 * `ckpt_dir`: Checkpoint directory name.
 * `log_period`: How many **batchs** passed before each time's evaluation log printing.
 * `dump_period`: How many **steps** passed before each time's checkpoint dumping.
+* `label_splitter`: The separator used to split a concatenated label string into a list of label names.
+* `eval.label_confidence_threshold`: Confidence threshold applied to each label; a label scoring above it is treated as positive during evaluation.
 
-## Examples 
-### Using MIMIC-III Data Training ICD10 Classification Model
-#### Preparation - Get Raw MIMIC-III Data
-Suppose you put original MIMIC-III data under `./_data/raw/mimic3/` like:
-```
-./_data/raw/mimic3/
-├── DIAGNOSES_ICD.csv
-├── NOTEEVENTS.csv
-└── PROCEDURES_ICD.csv
-
-0 directories, 3 files
-```
-#### ETL - Training Dataset Building
-This is about join necessary tables' data together and build training dataset. Suppose we are 
-going to put training data under `./_data/etl/mimic3/`, as this programed rules, the directory 
-should contain 3 files, train.jsonl, dev.jsonl and test.jsonl, like:
-```
-./_data/etl/mimic3/
-├── dev.jsonl
-├── dict.json
-├── dim_processed_base_data.jsonl
-├── test.jsonl
-└── train.jsonl
-
-0 directories, 5 files
-```
-You can run:
+### Inference
 ```shell
-python ./bin/etl/etl_mimic3_processing.py ./_data/raw/mimic3/ ./_data/etl/mimic3/ 
+python inf.py inf.json
 ```
+Most parameter explanations are already in `inf.json`.
 
-#### Config - Prepare Your Training Config File
-The `data_dir` in this config will be needed by next ETL step, can just refer to `train_mimic3_icd.json`.
-
-#### ETL - Convert Training Dataset JSONL to Limited JSONL File
-Note this step is unnecessary, since the outputs of `./bin/etl/etl_mimic3_processing.py` have 
-already been limited JSON line files, so even though you run following program, you will get 
-exactly same files:
+### Evaluation
 ```shell
-python ./bin/etl/etl_jsonl2limited_jsonl.py ./_data/raw/mimic3/${INPUT_JSONL_FILE} ./_data/raw/mimic3/${OUTPUT_JSONL_FILE}
+python eval.py eval.json
 ```
+Most parameter explanations are already in `eval.json`. 
 
-#### Training - Training ICD10 Classification Model with MIMIC-II Dataset
-```shell
-CUDA_VISIBLE_DEVICES=0,1,2,3 python ./train.py ./train_mimic3_icd.json
-```
 
 
+## Examples 
+### Training Examples
+* [ICD10 prediction based on MIMIC-III data](https://github.com/innerNULL/PLM-ICD-multi-label-classifier/blob/main/doc/mimic-iii-train-example.md)
+
 ## Other Implementation Details
 * After `chunk_size` and `chunk_num` defined, each text's token ID length are fixed to `chunk_size * chunk_num`. 
 if not long enough then automatically padding first.
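The `dict.json` layout described in the README diff above can be produced with a few lines of Python. The sketch below is illustrative and not part of this commit: `build_label_dict` is a hypothetical helper, and sorting the label names before assigning IDs is an assumption. JSON object keys are always strings, so the integer `id2label` keys are written as `"0"`, `"1"`, and so on.

```python
import json
from typing import Dict, List


def build_label_dict(label_names: List[str]) -> Dict[str, Dict]:
    """Build a two-way label mapping whose IDs start from 0.

    Hypothetical helper: sorting the names is an assumption made here
    so that the assigned IDs are reproducible across runs.
    """
    unique_names = sorted(set(label_names))
    label2id = {name: idx for idx, name in enumerate(unique_names)}
    id2label = {idx: name for name, idx in label2id.items()}
    return {"label2id": label2id, "id2label": id2label}


if __name__ == "__main__":
    label_dict = build_label_dict(["label_1", "label_0", "label_1"])
    # json.dumps converts the integer id2label keys to strings.
    print(json.dumps(label_dict, indent=2))
```

Because the IDs are contiguous and start from 0, `len(label_dict["id2label"])` gives the one-hot vector length, which is how `eval.py` derives `label_dim`.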
diff --git a/doc/mimic-iii-dataset-etl.md b/doc/mimic-iii-dataset-etl.md
@@ -0,0 +1,44 @@
+## MIMIC3 Dataset ETL
+The ETL contains the following steps:
+* Original JSON Lines dataset preparation.
+* Transforming the JSON Lines file into a **limited** JSON Lines file, in which every `list` or `dict` 
+  value is converted to a `string`.
+* Data dictionary generation.
+
+Note that the final data folder should contain 4 files: train.jsonl, dev.jsonl, test.jsonl, dict.json.
+
+### Prepare (Specific) Original JSON Line Dataset
+The data should be in JSON Lines format. A MIMIC-III data ETL program is provided:
+```sh
+python ./bin/etl/etl_mimic3_processing.py ${YOUR_MIMIC3_DATA_DIRECTORY} ${YOUR_TARGET_OUTPUT_DIRECTORY}
+```
+To use this program for text multi-label classification on your own customized 
+dataset, convert the dataset into a JSON Lines file and use the **training config** 
+file to specify which field is the text and which is the label. 
+
+**NOTE**: since this is a multi-label classification task, the label field 
+should be formatted as a CSV string, for example:
+```
+{"text": "this is a fake text.", "label": "label1,label2,label3,label4"}
+```
+
+You can also use your own dataset.
+
+### Transform To Limited JSON Line Dataset
+Although the data is stored as JSON Lines, `list` and `dict` values are not allowed inside the JSON. 
+I believe "flat" JSON keeps things clear, so a tool is provided 
+to convert `list` and `dict` values in the JSON to `string`:
+```shell
+python ./bin/etl/etl_jsonl2limited_jsonl.py ${ORIGINAL_JSON_LINE_DATASET} ${TRANSFORMED_JSON_LINE_DATASET}
+```
+
+**NOTE: although you can put the datasets in any directory you like, you HAVE TO name them 
+train.jsonl, dev.jsonl and test.jsonl.**
+
+### Data Dictionary Generation
+Generate (some) data dictionaries by scanning train, dev and test data. Run:
+```shell
+python ./bin/etl/etl_generate_data_dict.py ${TRAIN_CONFIG_JSON_FILE_PATH}
+```
+
+
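To illustrate what a "limited" JSON Lines record looks like, the sketch below flattens `list` and `dict` values to strings. It is not the code of `etl_jsonl2limited_jsonl.py`: `to_limited_record` is a hypothetical helper, and both the comma separator for lists (matching the CSV label example in the document) and the JSON encoding for dicts are assumptions.

```python
import json
from typing import Any, Dict


def to_limited_record(record: Dict[str, Any], list_sep: str = ",") -> Dict[str, Any]:
    """Return a copy of `record` with list and dict values converted to strings.

    Assumptions: lists are joined with `list_sep`, dicts are serialized
    as JSON text, and every other value is kept unchanged.
    """
    limited: Dict[str, Any] = {}
    for key, value in record.items():
        if isinstance(value, list):
            limited[key] = list_sep.join(str(item) for item in value)
        elif isinstance(value, dict):
            limited[key] = json.dumps(value, sort_keys=True)
        else:
            limited[key] = value
    return limited


if __name__ == "__main__":
    raw = {"text": "this is a fake text.", "label": ["label1", "label2", "label3"]}
    # One flattened record per output line, as in a JSON Lines file.
    print(json.dumps(to_limited_record(raw)))
```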
diff --git a/doc/mimic-iii-train-example.md b/doc/mimic-iii-train-example.md
@@ -0,0 +1,47 @@
+## Using MIMIC-III Data Training ICD10 Classification Model
+### Preparation - Get Raw MIMIC-III Data
+Suppose you put the original MIMIC-III data under `./_data/raw/mimic3/`, like:
+```
+./_data/raw/mimic3/
+├── DIAGNOSES_ICD.csv
+├── NOTEEVENTS.csv
+└── PROCEDURES_ICD.csv
+
+0 directories, 3 files
+```
+### ETL - Training Dataset Building
+This step joins the necessary tables' data together and builds the training dataset. Suppose we are 
+going to put the training data under `./_data/etl/mimic3/`. As the program requires, the directory 
+should contain train.jsonl, dev.jsonl and test.jsonl, like:
+```
+./_data/etl/mimic3/
+├── dev.jsonl
+├── dict.json
+├── dim_processed_base_data.jsonl
+├── test.jsonl
+└── train.jsonl
+
+0 directories, 5 files
+```
+You can run:
+```shell
+python ./bin/etl/etl_mimic3_processing.py ./_data/raw/mimic3/ ./_data/etl/mimic3/ 
+```
+
+### Config - Prepare Your Training Config File
+The `data_dir` in this config will be needed by the next ETL step; you can just refer to `train_mimic3_icd.json`.
+
+### ETL - Convert Training Dataset JSONL to Limited JSONL File
+Note that this step is unnecessary, since the outputs of `./bin/etl/etl_mimic3_processing.py` are 
+already limited JSON Lines files, so even if you run the following program, you will get 
+exactly the same files:
+```shell
+python ./bin/etl/etl_jsonl2limited_jsonl.py ./_data/raw/mimic3/${INPUT_JSONL_FILE} ./_data/raw/mimic3/${OUTPUT_JSONL_FILE}
+```
+
+### Training - Training ICD10 Classification Model with MIMIC-III Dataset
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 python ./train.py ./train_mimic3_icd.json
+```
+
+
diff --git a/eval.json b/eval.json
@@ -0,0 +1,8 @@
+{
+  "inf_results_path": "/path/to/inference/results/jsonl/file",
+  "label_dict_path": "/path/to/dict.json/file",
+  "gt_label_col": "label",
+  "pred_results_col": "results",
+  "min_confidence": 0.5,
+  "label_splitter": "[SEP]"
+}
diff --git a/eval.py b/eval.py
@@ -0,0 +1,99 @@
+# -*- coding: utf-8 -*-
+# file: eval.py
+# date: 2025-08-05
+
+
+import pdb
+import os
+import sys
+import json
+import numpy as np
+from typing import Dict, List, Tuple, Optional
+from torch import IntTensor
+
+from src.plm_icd_multi_label_classifier.metrics import metrics_func
+
+
+def verification(
+    train_eval_results_path: Optional[str],
+    eval_gt_one_hots: List[List[int]],
+    eval_pred_one_hots: List[List[int]]
+) -> None:
+    if train_eval_results_path is None:
+        return
+    train_eval_results: Dict = json.load(open(train_eval_results_path, "r")) 
+    train_gt_one_hots: List[List[int]] = train_eval_results["verbose"]["gt_one_hot"]
+    train_pred_one_hots: List[List[int]] = train_eval_results["verbose"]["pred_one_hot"] 
+    assert(len(train_gt_one_hots) == len(train_pred_one_hots))
+    assert(len(train_pred_one_hots) == len(eval_gt_one_hots))
+    assert(len(eval_gt_one_hots) == len(eval_pred_one_hots))
+    for i in range(len(train_gt_one_hots)):
+        train_gt_one_hot: List[int] = train_gt_one_hots[i]
+        eval_gt_one_hot: List[int] = eval_gt_one_hots[i]
+        assert(len(train_gt_one_hot) == len(eval_gt_one_hot))
+        for j in range(len(eval_gt_one_hot)):
+            assert(train_gt_one_hot[j] == eval_gt_one_hot[j])
+    return
+
+
+def main() -> None:
+    configs: Dict = json.load(open(sys.argv[1], "r"))
+    print(json.dumps(configs, indent=2))
+    label_dict_path: str = configs["label_dict_path"]
+    gt_label_col: str = configs["gt_label_col"]
+    pred_results_col: str = configs["pred_results_col"]
+    min_confidence: float = configs["min_confidence"]
+    label_splitter: str = configs["label_splitter"]
+    train_eval_results_path: Optional[str] = configs.get("train_eval_results_path")
+
+    label_dict: Dict = json.load(open(label_dict_path, "r"))
+    inf_results: List[Dict] = [
+        json.loads(x) 
+        for x in open(configs["inf_results_path"], "r").read().split("\n")
+        if x not in {""}
+    ]
+
+    label_dim: int = len(label_dict["id2label"])
+    pred_one_hots: List[List[int]] = [] 
+    gt_one_hots: List[List[int]] = []     
+    for sample in inf_results:
+        pred_one_hot: np.ndarray = np.zeros(label_dim)
+        gt_one_hot: np.ndarray = np.zeros(label_dim)
+        
+        gt_labels: List[str] | str = sample[gt_label_col]
+        if isinstance(gt_labels, str):
+            gt_labels = [
+                x.strip(" ") for x in gt_labels.split(label_splitter)
+            ]
+        gt_labels = [x for x in gt_labels if x not in {""}]
+        if len(gt_labels) == 0:
+            continue
+        pred_results: List[Tuple[str, float]] = sorted(
+            [(k, v) for k, v in sample[pred_results_col].items()],
+            reverse=True, 
+            key=lambda x: x[1]
+        )
+        for label in gt_labels:
+            label_id: int = label_dict["label2id"][label]
+            gt_one_hot[label_id] = 1.0
+        for label, score in pred_results:
+            if score < min_confidence:
+                continue
+            label_id: int = label_dict["label2id"][label]
+            pred_one_hot[label_id] = 1.0
+        if pred_one_hot.sum() == 0 and len(pred_results) > 0:
+            top1_label: str = pred_results[0][0]
+            top1_label_id: int = label_dict["label2id"][top1_label] 
+            pred_one_hot[top1_label_id] = 1.0
+
+        gt_one_hots.append(gt_one_hot.tolist())
+        pred_one_hots.append(pred_one_hot.tolist())
+    
+    verification(train_eval_results_path, gt_one_hots, pred_one_hots)
+    results = metrics_func(IntTensor(pred_one_hots), IntTensor(gt_one_hots))
+    print(json.dumps(results, indent=2))
+    return
+
+
+if __name__ == "__main__":
+    main()
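The per-sample logic of `eval.py` can be summarized as two small steps: split the ground-truth label string with `label_splitter`, then turn the prediction scores into a one-hot vector with a top-1 fallback when no score reaches `min_confidence`. The sketch below restates that logic in isolation. `parse_labels` and `scores_to_one_hot` are hypothetical names introduced here, not functions from the repository.

```python
from typing import Dict, List


def parse_labels(raw: str, label_splitter: str) -> List[str]:
    """Split a concatenated label string and drop empty entries."""
    parts = [part.strip(" ") for part in raw.split(label_splitter)]
    return [part for part in parts if part != ""]


def scores_to_one_hot(
    scores: Dict[str, float],
    label2id: Dict[str, int],
    min_confidence: float,
) -> List[int]:
    """Mark every label scoring at least `min_confidence` as positive.

    If no label passes the threshold, fall back to the single
    highest-scoring label, as eval.py does.
    """
    one_hot = [0] * len(label2id)
    for label, score in scores.items():
        if score >= min_confidence:
            one_hot[label2id[label]] = 1
    if sum(one_hot) == 0 and scores:
        top1_label = max(scores, key=scores.get)
        one_hot[label2id[top1_label]] = 1
    return one_hot


if __name__ == "__main__":
    label2id = {"label_0": 0, "label_1": 1, "label_2": 2}
    print(parse_labels("label_0[SEP]label_2", "[SEP]"))
    print(scores_to_one_hot({"label_0": 0.9, "label_1": 0.1, "label_2": 0.6}, label2id, 0.5))
    print(scores_to_one_hot({"label_1": 0.2, "label_2": 0.4}, label2id, 0.5))
```

The fallback guarantees that every evaluated sample has at least one predicted label, which raises recall-oriented metrics slightly compared with leaving the prediction vector empty.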
diff --git a/inf.json b/inf.json
@@ -0,0 +1,19 @@
+{
+  "ckpt_path": "/path/to/model.pt",
+  "label_dict_path": "/path/to/dict.json/file",
+  "test_data_path": "/test/jsonl/file/path",
+  "out_path": "./_inf_results.jsonl",
+  "text_col": "text",
+  "result_col": "results",
+  "model": {
+    "hf_lm": "google-bert/bert-base-uncased", 
+    "lm_hidden_dim": 768,
+    "chunk_size": 512,
+    "chunk_num": 2 
+  },
+  "inf": {
+    "min_confidence": 0.3,
+    "top_k": 10,
+    "label_splitter": "[SEP]"
+  }
+}
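The `inf` block of `inf.json` exposes `min_confidence` and `top_k`, but the body of `inf.py` is not shown in this diff. The sketch below is therefore only one plausible reading of those two parameters: keep the labels whose score reaches the threshold, then keep at most `top_k` of them, highest score first. `select_labels` is a hypothetical helper, and the order of the two filters is an assumption.

```python
from typing import Dict


def select_labels(
    scores: Dict[str, float],
    min_confidence: float,
    top_k: int,
) -> Dict[str, float]:
    """Keep labels scoring at least `min_confidence`, limited to the `top_k` best.

    Assumed semantics only; the actual behavior of inf.py may differ.
    """
    kept = [(label, score) for label, score in scores.items() if score >= min_confidence]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return dict(kept[:top_k])


if __name__ == "__main__":
    scores = {"label_0": 0.9, "label_1": 0.2, "label_2": 0.5}
    print(select_labels(scores, min_confidence=0.3, top_k=10))
```

A mapping from label name to score matches the shape that `eval.py` reads from the `results` column, where it iterates over `sample[pred_results_col].items()`.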
diff --git a/inf.py b/inf.py
diff --git a/train.py b/train.py