School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
*Corresponding author
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2024
[Paper] [Project Page] [Video(YouTube)] [Video(bilibili)]
🔥 Details will be released. Stay tuned 🍻 👍

- [07/2024] Code and checkpoints are released.
- [02/2024] LION has been accepted by CVPR 2024.
- [11/2023] Arxiv paper released.
- [11/2023] Project page released.
This is the GitHub repository of LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge. In this work, we enhance MLLMs by integrating fine-grained spatial-aware visual knowledge and high-level semantic visual evidence, boosting their capabilities and alleviating hallucinations.
The framework of the proposed LION model:
```bash
git clone https://github.com/JiuTian-VL/JiuTian-LION.git
cd JiuTian-LION
conda create -n LION python=3.12
conda activate LION
conda install pip
pip install -r requirements.txt
```
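After installation, a quick check that PyTorch can see your GPUs (assuming `torch` is pulled in by `requirements.txt`) can save a failed training launch later:

```python
# Environment sanity check (assumes torch is installed via requirements.txt).
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU count:", torch.cuda.device_count())
    print("GPU 0:", torch.cuda.get_device_name(0))
```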
| Version | Checkpoint |
| --- | --- |
| LION-FlanT5-XL | daybreaksly/LION-FlanT5-XL |
| LION-FlanT5-XXL | daybreaksly/LION-FlanT5-XXL |
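If the checkpoints are hosted on the Hugging Face Hub under the repo ids listed above (an assumption based on the table; adjust if they are distributed differently), they can be fetched with `huggingface_hub`:

```python
# Hypothetical download sketch: assumes the checkpoints are Hugging Face Hub
# repos named exactly as in the table above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="daybreaksly/LION-FlanT5-XL",    # or "daybreaksly/LION-FlanT5-XXL"
    local_dir="checkpoints/LION-FlanT5-XL",  # hypothetical local target folder
)
print("Checkpoint downloaded to:", local_dir)
```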
- Download the pre-trained vit model eva_vit_g.
- Download the pre-trained RAM model ram_swin_large_14m.
- Download the pre-trained FlanT5 model FlanT5-XL.
- Download the pre-trained BERT model bert-base-uncased.
- Fill in the paths to these models at the corresponding locations in the config file `configs/models/lion_flant5xl.yaml` (the Hub-hosted models can be fetched as sketched below).
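The BERT and FlanT5 weights live on the Hugging Face Hub under the standard repo ids `bert-base-uncased` and `google/flan-t5-xl`; a minimal sketch for fetching those two (the EVA-ViT and RAM checkpoints should be downloaded from their original release links as listed above; the local target folders here are illustrative):

```python
# Fetch the Hub-hosted components; the local_dir values are placeholders.
from huggingface_hub import snapshot_download

bert_path = snapshot_download("bert-base-uncased", local_dir="pretrained/bert-base-uncased")
t5_path = snapshot_download("google/flan-t5-xl", local_dir="pretrained/flan-t5-xl")
print(bert_path, t5_path)
```

The resulting local paths are what go into `configs/models/lion_flant5xl.yaml`.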
We provide inference examples for image-level and region-level tasks in `playground.ipynb`.
We provide a training script and instructions for stage-4 training as an example.
- Download the dataset from Hugging Face.
- Download the images and organize them in one folder:
Please download the following datasets:
- Training images:
  - OCR-VQA
  - coco-2014
  - coco-2017
  - okvqa-2014
  - textcaps
  - vqav2-2014
  - visual_genome
After downloading, place all these folders under a single directory.
For example:
```
/path/to/data/images/
├── OCR-VQA/images
├── coco/images/train2014
├── coco_2017/train2017
├── okvqa/images/train/train2014
├── textcaps/images/train_images
├── vqav2/images/train2014
├── visual_genome/VG_100K
└── visual_genome/VG_100K_2
```
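Before launching training, it can help to confirm that every expected sub-folder exists under the unified directory. A minimal check, using the folder names from the example tree above (replace the root path with your own):

```python
# Sanity check for the unified image directory layout shown above.
from pathlib import Path

root = Path("/path/to/data/images")  # replace with your actual image root
expected = [
    "OCR-VQA/images",
    "coco/images/train2014",
    "coco_2017/train2017",
    "okvqa/images/train/train2014",
    "textcaps/images/train_images",
    "vqav2/images/train2014",
    "visual_genome/VG_100K",
    "visual_genome/VG_100K_2",
]
for rel in expected:
    path = root / rel
    print("ok     " if path.is_dir() else "MISSING", path)
```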
---
In your config file, add the unified image folder path:
```yaml
train_datasets:
  - ann_path: "/path/to/image_level_data.json"
    vis_root: "/path/to/image_folder"
    is_train: true
    sample_ratio: 1
  - ann_path: "/path/to/region_level_data.json"
    vis_root: "/path/to/image_folder"
    is_train: true
    sample_ratio: 1
```
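Before a long run, it can also be worth confirming that the annotation files referenced above are readable JSON (a small pre-flight check, assuming each file is a standard JSON list or dict of annotations):

```python
# Pre-flight check: confirm each annotation file parses as JSON.
import json

ann_paths = [
    "/path/to/image_level_data.json",
    "/path/to/region_level_data.json",
]
for ann in ann_paths:
    with open(ann) as f:
        records = json.load(f)
    print(f"{ann}: {len(records)} entries")
```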
- Configure training with `configs/lion_train_stage4.yaml` (update the model and dataset paths).
- Run multi-GPU training:
```bash
cd JiuTian-LION
bash scripts/start_train.sh
```
Or manually:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 TOKENIZERS_PARALLELISM=true \
torchrun --master_port 12345 --nproc_per_node=4 \
train.py --cfg-path configs/lion_train_stage4.yaml
```
Outputs and checkpoints are written to `outputs/lion_stage4/<timestamp>/` by default.
For image-level tasks, we focus on image captioning and Visual Question Answering (VQA). For region-level tasks, we evaluate LION on three REC datasets: RefCOCO, RefCOCO+, and RefCOCOg. The results, detailed in Tables 1 and 2, highlight LION's superior performance compared to baseline models.
We further evaluate LION on an object hallucination benchmark (POPE) and a widely used MLLM benchmark (MMBench). The results in Tables 1 and 2 show that LION performs strongly across a broad range of skills and also demonstrates strong resistance to hallucinations, particularly in the popular and adversarial settings of POPE.
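POPE frames hallucination detection as binary yes/no questions about object existence, so its headline numbers are ordinary binary-classification metrics. A minimal scoring sketch (not the official evaluation script; it assumes you have parallel lists of ground-truth labels and model answers, each normalized to "yes"/"no"):

```python
# Minimal POPE-style scoring sketch (not the official script): expects parallel
# lists of ground-truth labels and model answers, each "yes" or "no".
def pope_scores(labels, answers):
    tp = sum(l == "yes" and a == "yes" for l, a in zip(labels, answers))
    tn = sum(l == "no" and a == "no" for l, a in zip(labels, answers))
    fp = sum(l == "no" and a == "yes" for l, a in zip(labels, answers))
    fn = sum(l == "yes" and a == "no" for l, a in zip(labels, answers))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(labels),
    }

print(pope_scores(["yes", "no", "no"], ["yes", "yes", "no"]))
```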
If you find this work useful for your research, please kindly cite our paper:
```bibtex
@inproceedings{chen2024lion,
  title={LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge},
  author={Chen, Gongwei and Shen, Leyang and Shao, Rui and Deng, Xiang and Nie, Liqiang},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}
```