ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning
Feng Han1,2*, Yang Jiao1,2*, Shaoxiang Chen3, Junhao Xu1,2, Jingjing Chen1,2, Yu-Gang Jiang1,2
1 Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; 2 Shanghai Collaborative Innovation Center on Intelligent Visual Computing; 3 Minimax
We introduce ControlThinker, a novel framework that bridges the semantic gap in controllable image generation through enhanced visual reasoning. ControlThinker follows a "comprehend-then-generate" paradigm: a Multimodal Large Language Model (MLLM), enhanced via supervised and reinforcement fine-tuning, extracts latent semantics from control images and produces enriched prompts that significantly improve the visual quality and semantic coherence of generated images, without modifying the image generator. Extensive experiments across various control types confirm ControlThinker's effectiveness.
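At a high level, the pipeline first asks the reasoning MLLM to describe the latent semantics of the control image, then feeds the enriched prompt (plus the original control image) to an off-the-shelf controllable generator. The sketch below is purely illustrative; the two stub functions are hypothetical stand-ins for the MLLM inference (swift infer) and ControlAR sampling steps documented later in this README.

```python
# Conceptual sketch of the "comprehend-then-generate" paradigm (illustrative only;
# the two stubs stand in for the swift-infer and ControlAR steps documented below).

def reason_about_control_image(control_image_path: str, original_prompt: str) -> str:
    # Placeholder for the fine-tuned MLLM: it inspects the control image and
    # expands the short prompt with inferred objects, layout, and style.
    return f"{original_prompt}, with details inferred from {control_image_path}"

def controllable_generator(prompt: str, control: str) -> str:
    # Placeholder for the frozen image generator (e.g. ControlAR); it receives
    # the enriched prompt together with the original control image, unchanged.
    return f"image generated from '{prompt}' conditioned on '{control}'"

def comprehend_then_generate(control_image_path: str, original_prompt: str) -> str:
    enriched_prompt = reason_about_control_image(control_image_path, original_prompt)
    return controllable_generator(prompt=enriched_prompt, control=control_image_path)

print(comprehend_then_generate("hed_map.png", "a cozy living room"))
```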
- August 5, 2025: We have released the inference code.
- July 15, 2025: We have released our checkpoints on Hugging Face.
- June 2, 2025: We have released the ControlThinker paper.
- May 30, 2025: The code and models are coming soon.
conda create -n ControlThinker python=3.10
conda activate ControlThinker
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
cd ControlAR
pip install -r requirements.txt
pip3 install -U openmim
mim install mmengine
mim install "mmcv==2.1.0"
pip3 install "mmsegmentation>=1.0.0"
pip3 install mmdet
git clone https://github.com/open-mmlab/mmsegmentation.git
First, download the vq_ds16_t2i.pt and flan-t5-xl models from ControlAR.
Download train-00000-of-02403.parquet .. train-00020-of-02403.parquet of MultiGen-20M (~10 GB) and save the .parquet files to the same directory.
Download train-00000-of-00248.parquet .. train-00040-of-00248.parquet of COCOStuff with caption (~10 GB) and save the .parquet files to the same directory.
Download validation-00000-of-00005.parquet .. validation-00004-of-00005.parquet of MultiGen-20M (~2 GB) and save the .parquet files to the same directory.
Download validation-00000-of-00011.parquet .. validation-00010-of-00011.parquet of COCOStuff with caption (~2 GB) and save the .parquet files to the same directory.
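If you prefer fetching the shards programmatically, `huggingface_hub.snapshot_download` can filter files by name pattern. The repository ID below is a placeholder, not a confirmed dataset name; substitute the MultiGen-20M or captioned COCOStuff repo you actually download from.

```python
# Optional helper: download a range of parquet shards with huggingface_hub.
# NOTE: repo_id is a placeholder; replace it with the actual dataset repository.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="your-org/MultiGen-20M",   # placeholder, not the confirmed repo id
    repo_type="dataset",
    allow_patterns=[
        "*train-0000*-of-02403.parquet",   # train-00000 .. train-00009
        "*train-0001*-of-02403.parquet",   # train-00010 .. train-00019
        "*train-00020-of-02403.parquet",
    ],
    local_dir="/dir_of_MultiGen_parquet_files",
)
```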
For MultiGen-20M
python ControlAR/autoregressive/test/extract_files_from_multigen_data.py --data-path /dir_of_MultiGen_parquet_files --code-path /path_of_output_dir --vq-ckpt /path_of_vq_ds16_t2i --t5-path /parent_dir_of_flant5xl --split train/validation
For Captioned_COCOStuff
python ControlAR/autoregressive/test/extract_files_from_cocostuff_data.py --data-path /dir_of_COCOStuff_parquet_files --code-path /path_of_output_dir --vq-ckpt /path_of_vq_ds16_t2i --t5-path /parent_dir_of_flant5xl --split train/validation
To extract original prompts:
python ControlAR/autoregressive/train/extract_file_caption.py --data-path /dir_of_MultiGen_or_COCOStuff_parquet_files --output-path /json_of_ori_prompts --split validation --dataset multigen/cocostuff
Place the JSON file of the original prompts in the directory of the extracted data and name it captions.json.
We adopt LLaMA-Factory for SFT.
python ControlAR/condition/hed_batch.py --input_folder /path_of_ori_images --output_folder /path_of_control_images
You can change hed to lineart/canny.
For depth and segmentation, the control images are provided by the dataset.
Place the directory of control images in the directory of extracted data.
python -u generate_formatted_dataset_think.py \
--input_json /json_of_ori_prompts \
--image_dir /path_of_control_images \
--output_json /path_of_formatted_input \
--condition_type hed/depth/canny/lineart/segmentation
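The formatted input is the dataset consumed by ms-swift in the next step. The record below only illustrates a typical ms-swift-style multimodal layout (a user message referencing a control image); it is an assumption, not the confirmed schema, so inspect the file produced by the script above for the real fields.

```python
# Illustrative example of an ms-swift-style multimodal training record (JSON Lines).
# ASSUMPTION: this layout is not the exact schema emitted by
# generate_formatted_dataset_think.py; check the generated file for the real fields.
import json

example_record = {
    "messages": [
        {"role": "user",
         "content": "<image>Describe the latent semantics of this hed control image "
                    "and rewrite the prompt: 'a cozy living room'."}
    ],
    "images": ["/path_of_control_images/000001.png"],
}
print(json.dumps(example_record, indent=2))
```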
Replace the content of ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py with train/reinforcement-learning/controlthinker_trainer.py.
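In other words, overwrite ms-swift's GRPO trainer file with the ControlThinker version, for example:

```python
# Overwrite ms-swift's GRPO trainer with the ControlThinker trainer
# (paths assume the repo layout described above).
import shutil

shutil.copyfile(
    "train/reinforcement-learning/controlthinker_trainer.py",
    "ms-swift/swift/trainers/rlhf_trainer/grpo_trainer.py",
)
```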
export MASTER_PORT=29501
CUDA_VISIBLE_DEVICES=6,7 \
NPROC_PER_NODE=2 \
swift rlhf \
--rlhf_type grpo \
--model /path_of_model \
--train_type lora \
--lora_rank 64 \
--lora_alpha 128 \
--target_modules all-linear \
--torch_dtype bfloat16 \
--dataset /path_of_formatted_input \
--max_completion_length 1024 \
--num_train_epochs 1 \
--per_device_train_batch_size 6 \
--per_device_eval_batch_size 6 \
--learning_rate 1e-5 \
--gradient_accumulation_steps 1 \
--eval_steps 10000 \
--save_steps 800 \
--save_total_limit 10 \
--logging_steps 1 \
--max_length 4096 \
--output_dir /dir_of_output_checkpoints \
--max_steps 2400 \
--warmup_ratio 0.05 \
--dataloader_num_workers 4 \
--dataset_num_proc 4 \
--num_generations 12 \
--temperature 0.9 \
--deepspeed zero2 \
--report_to wandb \
--log_completions true
First, download the checkpoints from Hugging Face.
The same as the Second Stage Training
The same as the Second Stage Training
MAX_PIXELS=1003520 \
VIDEO_MAX_PIXELS=50176 \
FPS_MAX_FRAMES=12 \
swift infer \
--model /path_of_checkpoints \
--stream false \
--infer_backend pt \
--max_batch_size 26 \
--temperature 0.9 \
--val_dataset /path_of_formatted_input \
--result_path /path_of_response \
--max_new_tokens 2048
python format_output_think.py \
--input_json /path_of_response \
--output_json /path_of_final_enhanced_prompts
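format_output_think.py post-processes the raw MLLM responses into the final enhanced prompts. The snippet below is only a conceptual illustration of that kind of post-processing, assuming the responses wrap their reasoning in <think>...</think> tags before the final prompt; the actual script and delimiters may differ.

```python
# Conceptual illustration of stripping a reasoning trace from a model response.
# ASSUMPTION: responses look like "<think>...</think> final enhanced prompt";
# the real format_output_think.py may use different delimiters or fields.
import re

def extract_enhanced_prompt(response: str) -> str:
    # Drop everything inside <think>...</think> and keep the remaining text.
    without_thoughts = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    return without_thoughts.strip()

raw = "<think>The hed map shows a sofa and a window...</think> A cozy living room with a sofa by a sunlit window."
print(extract_enhanced_prompt(raw))
```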
python ./ControlAR/autoregressive/sample/extract_extent_caption_embedding.py \
--json-path /path_of_final_enhanced_prompts \
--output-path /dir_of_new_text_embeddings \
--flan-t5-path /parent_dir_of_flant5xl \
--prompt-num 1
# prompt-num is the number of images to generate for the ORM
# (1 means no ORM)
First, download the edge_base.safetensors, depth_base.safetensors, seg_cocostuff.safetensors, vq_ds16_t2i.pt, flan-t5-xl, dinov2-base, and dinov2-small models from ControlAR.
Place the dinov2-base and dinov2-small models in autoregressive/models.
For hed/canny/lineart
python -u ./ControlAR/autoregressive/test/generate_image_with_new_prompt.py --condition_type hed/canny/lineart --vq_ckpt /path_of_vq_ds16_t2i --gpt_ckpt /path_of_edge_base.safetensors --code_path /path_of_multigen_data --gpt_model GPT-XL --image_size 512 --sample_dir /dir_of_output_images --seed 0 --adapter_size base --per_proc_batch_size 16 --text_embedding_dir /dir_of_new_text_embeddings --t5_path /dir_of_flant5xl_model
For depth
python -u ./ControlAR/autoregressive/test/generate_image_with_new_prompt.py --condition_type depth --vq_ckpt /path_of_vq_ds16_t2i --gpt_ckpt /path_of_depth_base.safetensors --code_path /path_of_multigen_data --gpt_model GPT-XL --image_size 512 --sample_dir /dir_of_output_images --seed 0 --adapter_size base --per_proc_batch_size 16 --text_embedding_dir /dir_of_new_text_embeddings --t5_path /dir_of_flant5xl_model
For segmentation
python -u ./ControlAR/autoregressive/test/generate_image_with_new_prompt.py --condition_type seg --vq_ckpt /path_of_vq_ds16_t2i --gpt_ckpt /path_of_seg_cocostuff.safetensors --code_path /path_of_multigen_data --gpt_model GPT-XL --image_size 512 --sample_dir /dir_of_output_images --seed 0 --per_proc_batch_size 16 --text_embedding_dir /dir_of_new_text_embeddings --t5_path /dir_of_flant5xl_model
python ControlAR/condition/hed_batch.py --input_folder /path_of_generated_images --output_folder /path_of_control_images
You can change hed to lineart/canny/depth/seg.
python -u inference_scaling.py --ctl_gt_folder /path_of_ground_truth_control_image \
--ctl_candidates_folder /path_of_generated_control_image \
--img_candidates_folder /path_of_generated_image \
--output_folder /path_of_picked_image
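inference_scaling.py picks, for each prompt, the generated image whose re-extracted control map best matches the ground-truth control image. The sketch below illustrates that idea with a simple per-pixel similarity; the repository's actual scoring function may differ.

```python
# Conceptual best-of-N selection: score each candidate by how closely its
# re-extracted control map matches the ground-truth control image.
# ASSUMPTION: plain mean-absolute-difference; inference_scaling.py may use
# a different similarity measure.
import numpy as np
from PIL import Image

def control_distance(gt_path: str, candidate_path: str) -> float:
    gt = np.asarray(Image.open(gt_path).convert("L"), dtype=np.float32)
    cand = Image.open(candidate_path).convert("L").resize(gt.shape[::-1])
    return float(np.abs(gt - np.asarray(cand, dtype=np.float32)).mean())

def pick_best(gt_control: str, candidate_controls: list[str], candidate_images: list[str]) -> str:
    scores = [control_distance(gt_control, c) for c in candidate_controls]
    return candidate_images[int(np.argmin(scores))]
```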
Download DeepLabV3 (weight) and save it to evaluations/.
Refer to ControlAR/condition/README.md for preparing the dpt model.
Run the scripts in ControlAR/evaluations to calculate the metrics.
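For example, image-quality metrics such as FID can be computed between the generated images and the ground-truth images. The snippet below uses the clean-fid package as one convenient option; it is not necessarily the exact evaluation script shipped in ControlAR/evaluations.

```python
# Example FID computation with the clean-fid package (pip install clean-fid).
# NOTE: this is one possible way to compute FID, not the confirmed
# ControlAR/evaluations implementation.
from cleanfid import fid

score = fid.compute_fid("/dir_of_output_images", "/dir_of_ground_truth_images")
print(f"FID: {score:.2f}")
```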
- Release checkpoints and evaluation code
- Release the training code along with an easy-to-follow tutorial
- Release the ControlThinker paper
Visualization of images generated by ControlThinker and other baselines.
1️⃣ Code and tutorial will be available soon.
@article{han2025controlthinker,
title={ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning},
author={Han, Feng and Jiao, Yang and Chen, Shaoxiang and Xu, Junhao and Chen, Jingjing and Jiang, Yu-Gang},
journal={arXiv preprint arXiv:2506.03596},
year={2025}
}
ControlThinker is licensed under the Apache License 2.0.