Official repository for the paper "T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT".
- [2025.06.12] T2I-R1 achieves the best result among open-source AR-based models on TIIF-Bench! 🔥
- [2025.05.24] We release the checkpoint of T2I-R1! 🔥
- [2025.05.23] Our new work exploring different RL Strategies for T2I is released: Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO 🚀
- [2025.05.02] We release the arXiv paper and the training code. 🔥
- [2025.02.28] Our previous work for Image Generation with CoT: Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step is accepted by CVPR 2025 🎉
Chain-of-Thought (CoT) reasoning with reinforcement learning (RL) has been extensively explored in LLMs and LMMs. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this project, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model powered by RL with a bi-level CoT reasoning process.
We identify two levels of CoT that can be utilized to enhance different stages of generation:
- 🧠 Semantic-level CoT is the textual reasoning about the image to generate, introduced prior to image generation. The semantic-level CoT designs the global structure of the image, e.g., the appearance and location of each object. Optimizing the semantic-level CoT lets the model explicitly plan and reason about the prompt before the subsequent image token generation, making generation easier.
- 🎨 Token-level CoT is the intermediate patch-by-patch generation process of the image. Unlike semantic-level CoT, token-level CoT focuses on low-level details like pixel generation and maintaining visual coherence between adjacent patches. Optimizing the token-level CoT can enhance both the generation quality and the alignment between the prompt and the resulting images.
To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generation CoTs within the same training step.
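As a rough illustration only (not the actual implementation), the sketch below shows how one BiCoT-GRPO training step could be organized. The model helpers (`generate_text`, `generate_image_tokens`, `decode_tokens`, `log_prob`) and the reward callables are hypothetical names introduced here for exposition.

```python
# Conceptual sketch of one BiCoT-GRPO training step (illustration only, not the
# repository's actual code). All model helpers and reward callables below are
# hypothetical names used for exposition.

def bicot_grpo_step(model, prompt, reward_fns, group_size=8):
    """Sample a group of (semantic CoT, image) rollouts for one prompt and
    build a GRPO-style loss from group-normalized rewards."""
    rollouts, rewards = [], []
    for _ in range(group_size):
        # Semantic-level CoT: textual planning of the image (global structure,
        # object appearance/locations) produced before any image token.
        plan = model.generate_text(prompt)
        # Token-level CoT: autoregressive, patch-by-patch image-token
        # generation conditioned on both the prompt and the plan.
        image_tokens = model.generate_image_tokens(prompt, plan)
        image = model.decode_tokens(image_tokens)
        # Ensemble of generation rewards (e.g., HPS, GIT, GroundingDINO, ORM)
        # averaged into one scalar per rollout.
        reward = sum(fn(prompt, image) for fn in reward_fns) / len(reward_fns)
        rollouts.append((plan, image_tokens))
        rewards.append(reward)

    # GRPO: the advantage of each rollout is its reward normalized within the group.
    mean_r = sum(rewards) / len(rewards)
    std_r = (sum((r - mean_r) ** 2 for r in rewards) / len(rewards)) ** 0.5 + 1e-8
    advantages = [(r - mean_r) / std_r for r in rewards]

    # A single policy-gradient update covers both CoT levels, because the
    # semantic-level text tokens and the image tokens come from the same
    # autoregressive policy. (The KL-regularization term is omitted for brevity.)
    loss = 0.0
    for (plan, image_tokens), adv in zip(rollouts, advantages):
        loss = loss - adv * model.log_prob(prompt, plan, image_tokens)
    return loss / group_size
```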
TODO:
- Release ORM Checkpoint and reward code
- Release Checkpoint
Clone the repository:
```bash
git clone https://github.com/CaraJ7/T2I-R1.git
cd T2I-R1
```
Create a conda environment:
```bash
conda create -n t2i-r1 python=3.10
conda activate t2i-r1
```
Please follow the official instructions here to install both PyTorch and TorchVision dependencies.
Install additional dependencies:
```bash
cd src
pip install -r requirements.txt
```
Note that newer versions of torch, transformers, and trl may also work. Make sure to install from our repo, since we make some necessary modifications to train with ZeRO-3.
Install GroundingDINO if you want to use the object detector reward:
```bash
cd t2i-r1/src/t2i-r1/src/utils/GroundingDINO
pip install -e .
```
Install LLaVA if you want to use the ORM reward:
```bash
cd t2i-r1/src/t2i-r1/src/utils/LLaVA-NeXT
pip install -e ".[train]"
```
Please download the reward models you need for training:
```bash
cd t2i-r1
mkdir reward_weight
cd reward_weight
```
- Download the HPS checkpoint from this link:
  ```bash
  wget https://huggingface.co/xswu/HPSv2/resolve/main/HPS_v2.1_compressed.pt
  ```
- Download the GIT checkpoint from this link:
  ```bash
  huggingface-cli download microsoft/git-large-vqav2 --repo-type model --local-dir git-large-vqav2
  ```
- Download the GroundingDINO checkpoint from this link:
  ```bash
  wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
  ```
- Download the ORM checkpoint from this link:
  ```bash
  huggingface-cli download CaraJ/ORM-T2I-R1 --repo-type model --local-dir ORM-T2I-R1
  ```
Start training with:
```bash
cd t2i-r1/src
bash scripts/run_grpo.sh
```
Notes:
- Parameters:
  - `reward_funcs`: The options are `hps`, `git`, `gdino`, and `orm`. You can choose whatever combination you need for training. Make sure to substitute the correct checkpoint path and config path in `run_grpo.sh`.
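For intuition only, here is a minimal sketch of how the reward functions selected via `reward_funcs` could be composed into a single ensemble reward. The registry and the simple averaging scheme are assumptions for illustration, not the repository's actual interface.

```python
# Minimal sketch (an assumption, not the repo's actual interface): compose the
# reward functions selected via reward_funcs into one ensemble score.

def build_reward_ensemble(names, registry):
    """names: subset of ["hps", "git", "gdino", "orm"];
    registry: maps each name to a callable scoring (prompt, image) -> float."""
    fns = [registry[name] for name in names]

    def ensemble(prompt, image):
        # Average the selected rewards into a single scalar used for training.
        return sum(fn(prompt, image) for fn in fns) / len(fns)

    return ensemble

# Example usage (hypothetical): reward = build_reward_ensemble(["hps", "gdino"], registry)
```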
You can download the checkpoint from here or train the model by yourself.
```bash
cd t2i-r1/src/infer
python reason_inference.py \
    --model_path YOUR_MODEL_CKPT \
    --data_path test_data.txt
```
- Where necessary, we incorporate the corresponding repository of the reward model we use. We modify certain code to adapt it for ZeRO-3 training and delete unused folders to keep the codebase lightweight.
  - For GroundingDINO, we modify the code in `t2i-r1/src/t2i-r1/src/utils/GroundingDINO/groundingdino/models/GroundingDINO/groundingdino.py`.
  - For LLaVA (ORM), we modify the code in `t2i-r1/src/t2i-r1/src/utils/LLaVA-NeXT/llava/model/builder.py` and `t2i-r1/src/t2i-r1/src/utils/LLaVA-NeXT/llava/model/llava_arch.py`.
Explore our additional research on Autoregressive Text-to-Image Generation and CoT Reasoning
- [Image Generation CoT] Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
- [MME-CoT] MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency
- [TIIF-Bench] TIIF-Bench: How Does Your T2I Model Follow Your Instructions?
- [MathVerse] MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
- [MAVIS] MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine
- [MMSearch] MMSearch: Unveiling the Potential of Large Models as Multi-modal Search Engines
We would like to thank R1-V and Image Generation CoT, upon which our repo is built.
This project is released under the Apache License 2.0. We release our checkpoints for research purposes only. Users are free to create images with this tool, but they are expected to comply with local laws and use it responsibly. The developers do not assume any responsibility for potential misuse by users.
```bibtex
@article{jiang2025t2i,
  title={T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT},
  author={Jiang, Dongzhi and Guo, Ziyu and Zhang, Renrui and Zong, Zhuofan and Li, Hao and Zhuo, Le and Yan, Shilin and Heng, Pheng-Ann and Li, Hongsheng},
  journal={arXiv preprint arXiv:2505.00703},
  year={2025}
}
```