MedTok: Multimodal Medical Code Tokenizer

👀 Overview of MedTok

Foundation models trained on patient electronic health records (EHRs) require tokenizing medical data into sequences of discrete vocabulary items. Existing tokenizers treat medical codes from EHRs as isolated textual tokens. However, each medical code is defined by its textual description, its position in ontological hierarchies, and its relationships to other codes, such as disease co-occurrences and drug-treatment associations. Medical vocabularies contain more than 600,000 codes with critical information for clinical reasoning. We introduce MedTok, a multimodal medical code tokenizer that uses the text descriptions and relational context of codes. MedTok processes text using a language model encoder and encodes the relational structure with a graph encoder. It then quantizes both modalities into a unified token space, preserving modality-specific and cross-modality information. We integrate MedTok into five EHR models and evaluate it on operational and clinical tasks across in-patient and out-patient datasets, including outcome prediction, diagnosis classification, drug recommendation, and risk stratification. Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.32% on EHRShot, with the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate using MedTok tokenizer with medical QA systems. Our results demonstrate the potential of MedTok as a unified tokenizer for medical codes, improving tokenization for medical foundation models.

🚀 Installation

Clone the Github repository and setup the enviroment.

git clone https://github.com/mims-harvard/MedTok
cd MedTok

conda env create -f MedTok.yaml
conda activate MedTok

💡 How to train MedTok?

To train MedTok, please first download all_codes_mappings.parquet to 'Dataset/medicalCode/', kg.csv to 'Dataset/primeKG/' and then run:

sbatch run.sh

🛠️ How to use MedTok?

We provide two ways to use MedTok.

One is using this codebase to run inference script to get all tokens and corresponding embeddings. The embeddings are also available at mims-harvard/MedTok.

python inference.py

or

The other is accessing MedTok by mims-harvard/MedTok.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("MedTok")
tokens = tokenizer.tokenize("E11.9")
ids = tokenizer.encode("E11.9")
embed = tokenizer.embed("E11.9")

If you want to use the tokenized embedding for each medical code, please download it from mims-harvard/MedTok or code2embeddings.json.zip directly. And the downloaded embedding file could be put into 'MedTok/embedding.npy' to run EHR or QA tasks based on MedTok.

🏥MedTok for EHR

Please first download EHR datasets to 'Dataset/EHR/{EHR_dataset_name}', and then run:

cd MedTok_EHR_Tutorial
python MedTok_EHR.py

Note for applying MedTok with EHRShot, please preprocessing EHRShot dataset as the format as MIMIC-III and MIMIC-IV, which means prepare the data into several CSV files, including admission, diagnosis, procedure, medication/prescriptions.

❓MedTok for MedicalQA

To finetune LLMs with datasets we presented in our paper, please run the following command:

cd MedTok_QA_Tutorial
WORLD_SIZE=[WORLD_SIZE] \
CUDA_LAUNCH_BLOCKING=[CUDA_LAUNCH_BLOCKING] \
CUDA_VISIBLE_DEVICES=[GPU_NUMS] \
torchrun --nproc_per_node=[NODE_NUMS] --master_port 1234 fintune_llama3.py

After obtaining the pre-trained model, please do inference directly on other datasets:

cd MedTok_QA_Tutorial
torchrun --nproc_per_node=[NODE_NUMS] --master_port 1234 inference.py

If you want to apply MedTok to your own QA system or datasets, please first extract the diseases contained in each query and obtain their medical code, and then prepare the datasets to be used as training dataset to finetune LLMs.

cd MedTok_QA_Tutorial
python extract_disease.py
python map_query_id.py

Please reference the json file in 'Dataset/MedicalQA/xx.json' to prepare your data.

🤗 Try out MedTok

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mims-harvard/MedTok", trust_remote_code=True)
tokens = tokenizer("E11.9")
embed = tokenizer.embed("E11.9")

Citation

@article{su2025multimodal,
  title={Multimodal Medical Code Tokenizer},
  author={Su, Xiaorui and Messica, Shvat and Huang, Yepeng and Johnson, Ruth and Fesser, Lukas and Gao, Shanghua and Sahneh, Faryad and Zitnik, Marinka},
  journal={International Conference on Machine Learning, ICML},
  year={2025}
}

Contact

If you have any questions or suggestions, please email Xiaorui Su and Marinka Zitnik.

Name		Name	Last commit message	Last commit date
Latest commit History 83 Commits
Dataset		Dataset
MedTok		MedTok
MedTok_EHR_Tutorial		MedTok_EHR_Tutorial
MedTok_QA_Tutorial		MedTok_QA_Tutorial
LICENSE		LICENSE
MedTok.jpg		MedTok.jpg
MedTok.yaml		MedTok.yaml
README.md		README.md
inference.py		inference.py
train_MedTok.py		train_MedTok.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

MedTok: Multimodal Medical Code Tokenizer

👀 Overview of MedTok

🚀 Installation

💡 How to train MedTok?

🛠️ How to use MedTok?

🏥MedTok for EHR

❓MedTok for MedicalQA

🤗 Try out MedTok

Citation

Contact

About

Uh oh!

Releases

Packages

Languages

License

mims-harvard/MedTok

Folders and files

Latest commit

History

Repository files navigation

MedTok: Multimodal Medical Code Tokenizer

👀 Overview of MedTok

🚀 Installation

💡 How to train MedTok?

🛠️ How to use MedTok?

🏥MedTok for EHR

❓MedTok for MedicalQA

🤗 Try out MedTok

Citation

Contact

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages