School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
*Corresponding author
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2024
[Paper] [Project Page] [Video(YouTube)] [Video(bilibili)]
🔥 Details will be released. Stay tuned 🍻 👍

- [07/2024] Code and checkpoints are released.
- [02/2024] LION has been accepted by CVPR 2024.
- [11/2023] Arxiv paper released.
- [11/2023] Project page released.
This is the GitHub repository of LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge. In this work, we enhance MLLMs by integrating fine-grained spatial-aware visual knowledge and high-level semantic visual evidence, boosting their capabilities and alleviating hallucinations.
The framework of the proposed LION model:
```bash
git clone https://github.com/JiuTian-VL/JiuTian-LION.git
cd JiuTian-LION
conda create -n LION python=3.12
conda activate LION
conda install pip
pip install -r requirements.txt
```
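After installation, a quick check that PyTorch can see your GPUs (assuming `torch` is pulled in by `requirements.txt`) can save a failed training launch later:

```python
# Environment sanity check (assumes torch is installed via requirements.txt).
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU count:", torch.cuda.device_count())
    print("GPU 0:", torch.cuda.get_device_name(0))
```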
| Version | Checkpoint |
| --- | --- |
| LION-FlanT5-XL | daybreaksly/LION-FlanT5-XL |
| LION-FlanT5-XXL | daybreaksly/LION-FlanT5-XXL |
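If the checkpoints are hosted on the Hugging Face Hub under the repo ids listed above (an assumption based on the table; adjust if they are distributed differently), they can be fetched with `huggingface_hub`:

```python
# Hypothetical download sketch: assumes the checkpoints are Hugging Face Hub
# repos named exactly as in the table above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="daybreaksly/LION-FlanT5-XL",    # or "daybreaksly/LION-FlanT5-XXL"
    local_dir="checkpoints/LION-FlanT5-XL",  # hypothetical local target folder
)
print("Checkpoint downloaded to:", local_dir)
```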
- Download the pre-trained vit model eva_vit_g.
- Download the pre-trained RAM model ram_swin_large_14m.
- Download the pre-trained FlanT5 model FlanT5-XL.
- Download the pre-trained BERT model bert-base-uncased.
- Fill in the paths to these models at the corresponding locations in the config file `configs/models/lion_flant5xl.yaml` (the Hub-hosted models can be fetched as sketched below).
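The BERT and FlanT5 weights live on the Hugging Face Hub under the standard repo ids `bert-base-uncased` and `google/flan-t5-xl`; a minimal sketch for fetching those two (the EVA-ViT and RAM checkpoints should be downloaded from their original release links as listed above; the local target folders here are illustrative):

```python
# Fetch the Hub-hosted components; the local_dir values are placeholders.
from huggingface_hub import snapshot_download

bert_path = snapshot_download("bert-base-uncased", local_dir="pretrained/bert-base-uncased")
t5_path = snapshot_download("google/flan-t5-xl", local_dir="pretrained/flan-t5-xl")
print(bert_path, t5_path)
```

The resulting local paths are what go into `configs/models/lion_flant5xl.yaml`.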
We provide inference examples for image-level and region-level tasks in `playground.ipynb`.
We provide a training script and instructions for stage-4 training as an example.
- Download the dataset from Hugging Face.
- Download the images and organize them in one folder:
Please download the following datasets:
- Training images:
  - OCR-VQA
  - coco-2014
  - coco-2017
  - okvqa-2014
  - textcaps
  - vqav2-2014
  - visual_genome
After downloading, place all these folders under a single directory.
For example:
```
/path/to/data/images/
├── OCR-VQA/images
├── coco/images/train2014
├── coco_2017/train2017
├── okvqa/images/train/train2014
├── textcaps/images/train_images
├── vqav2/images/train2014
├── visual_genome/VG_100K
└── visual_genome/VG_100K_2
```
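Before launching training, it can help to confirm that every expected sub-folder exists under the unified directory. A minimal check, using the folder names from the example tree above (replace the root path with your own):

```python
# Sanity check for the unified image directory layout shown above.
from pathlib import Path

root = Path("/path/to/data/images")  # replace with your actual image root
expected = [
    "OCR-VQA/images",
    "coco/images/train2014",
    "coco_2017/train2017",
    "okvqa/images/train/train2014",
    "textcaps/images/train_images",
    "vqav2/images/train2014",
    "visual_genome/VG_100K",
    "visual_genome/VG_100K_2",
]
for rel in expected:
    path = root / rel
    print("ok     " if path.is_dir() else "MISSING", path)
```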
---
In your config file, add the unified image folder path:
```yaml
train_datasets:
  - ann_path: "/path/to/image_level_data.json"
    vis_root: "/path/to/image_folder"
    is_train: true
    sample_ratio: 1
  - ann_path: "/path/to/region_level_data.json"
    vis_root: "/path/to/image_folder"
    is_train: true
    sample_ratio: 1
```
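Before a long run, it can also be worth confirming that the annotation files referenced above are readable JSON (a small pre-flight check, assuming each file is a standard JSON list or dict of annotations):

```python
# Pre-flight check: confirm each annotation file parses as JSON.
import json

ann_paths = [
    "/path/to/image_level_data.json",
    "/path/to/region_level_data.json",
]
for ann in ann_paths:
    with open(ann) as f:
        records = json.load(f)
    print(f"{ann}: {len(records)} entries")
```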
- Configure training with `configs/lion_train_stage4.yaml` (update the model and dataset paths).
- Run multi-GPU training:
```bash
cd JiuTian-LION
bash scripts/start_train.sh
```
Or manually:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 TOKENIZERS_PARALLELISM=true \
torchrun --master_port 12345 --nproc_per_node=4 \
train.py --cfg-path configs/lion_train_stage4.yaml
```
Outputs and checkpoints are written to `outputs/lion_stage4/<timestamp>/` by default.
For image-level tasks, we focus on image captioning and Visual Question Answering (VQA). For region-level tasks, we evaluate LION on three REC datasets: RefCOCO, RefCOCO+, and RefCOCOg. The results, detailed in Tables 1 and 2, highlight LION's superior performance compared to baseline models.
We further evaluate LION on an object hallucination benchmark (POPE) and a widely used MLLM benchmark (MMBench). The results in Tables 1 and 2 show that LION performs strongly across a broad range of skills and also demonstrates strong resistance to hallucinations, particularly in the popular and adversarial settings of POPE.
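POPE frames hallucination detection as binary yes/no questions about object existence, so its headline numbers are ordinary binary-classification metrics. A minimal scoring sketch (not the official evaluation script; it assumes you have parallel lists of ground-truth labels and model answers, each normalized to "yes"/"no"):

```python
# Minimal POPE-style scoring sketch (not the official script): expects parallel
# lists of ground-truth labels and model answers, each "yes" or "no".
def pope_scores(labels, answers):
    tp = sum(l == "yes" and a == "yes" for l, a in zip(labels, answers))
    tn = sum(l == "no" and a == "no" for l, a in zip(labels, answers))
    fp = sum(l == "no" and a == "yes" for l, a in zip(labels, answers))
    fn = sum(l == "yes" and a == "no" for l, a in zip(labels, answers))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(labels),
    }

print(pope_scores(["yes", "no", "no"], ["yes", "yes", "no"]))
```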
If you find this work useful for your research, please kindly cite our paper:
```bibtex
@inproceedings{chen2024lion,
  title={LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge},
  author={Chen, Gongwei and Shen, Leyang and Shao, Rui and Deng, Xiang and Nie, Liqiang},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}
```