Zhengyao Lv*1,
Tianlin Pan*2,3,
Chenyang Si2‡†,
Zhaoxi Chen4,
Wangmeng Zuo5,
Ziwei Liu4†,
Kwan-Yee K. Wong1†
1The University of Hong Kong
2Nanjing University
3University of Chinese Academy of Sciences
4Nanyang Technological University
5Harbin Institute of Technology
(*Equal Contribution. ‡Project Leader. †Corresponding Author.)
Paper | Project Page | LoRA Weights
We propose TACA, a parameter-efficient method that dynamically rebalances cross-modal attention in multimodal diffusion transformers to improve text-image alignment.
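The README does not spell out how the rebalancing works. As a purely illustrative sketch, one way to rebalance cross-modal attention is to scale the logits that an image-token query assigns to text keys before the softmax. The function names, the `gamma` parameter, and the toy logits below are all hypothetical; this is not the code in this repository.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def rebalanced_attention_weights(logits, num_text_tokens, gamma=1.0):
    """Attention weights for one image-token query over [text keys | image keys].

    logits: pre-softmax scores, text keys first, then image keys.
    gamma:  hypothetical scale applied only to the text-key logits
            (gamma=1.0 leaves standard attention unchanged).
    """
    scaled = [
        gamma * s if i < num_text_tokens else s
        for i, s in enumerate(logits)
    ]
    return softmax(scaled)


def text_mass(weights, num_text_tokens):
    """Total attention weight that lands on the text keys."""
    return sum(weights[:num_text_tokens])


# Toy example (assumed values): 2 text keys followed by 6 image keys.
logits = [1.0, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
base = rebalanced_attention_weights(logits, num_text_tokens=2, gamma=1.0)
boosted = rebalanced_attention_weights(logits, num_text_tokens=2, gamma=1.5)
print(text_mass(base, 2), text_mass(boosted, 2))
```

In this toy example the text keys are outnumbered by image keys, and scaling their (positive) logits shifts more of the attention weight toward the text tokens.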
For Stable Diffusion 3.5, simply run:
python infer/infer_sd3.py
For FLUX.1, run:
python infer/infer_flux.py
Comparison of alignment evaluation on T2I-CompBench for FLUX.1-Dev-based and SD3.5-Medium-based models.
Model | Attribute Binding (Color) | Attribute Binding (Shape) | Attribute Binding (Texture) | Object Relationship (Spatial) | Object Relationship (Non-Spatial) | Complex |
---|---|---|---|---|---|---|
FLUX.1-Dev | 0.7678 | 0.5064 | 0.6756 | 0.2066 | 0.3035 | 0.4359 |
FLUX.1-Dev + TACA ( | 0.7843 | 0.5362 | 0.6872 | 0.2405 | 0.3041 | 0.4494 |
FLUX.1-Dev + TACA ( | 0.7842 | 0.5347 | 0.6814 | 0.2321 | 0.3046 | 0.4479 |
SD3.5-Medium | 0.7890 | 0.5770 | 0.7328 | 0.2087 | 0.3104 | 0.4441 |
SD3.5-Medium + TACA ( | 0.8074 | 0.5938 | 0.7522 | 0.2678 | 0.3106 | 0.4470 |
SD3.5-Medium + TACA ( | 0.7984 | 0.5834 | 0.7467 | 0.2374 | 0.3111 | 0.4505 |
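To make the size of the gains easier to read, the short script below computes the absolute change of the first TACA row over each baseline. The scores are copied from the table; the variable names and the `deltas` helper are illustrative only and are not part of this repository.

```python
METRICS = ["Color", "Shape", "Texture", "Spatial", "Non-Spatial", "Complex"]

# Scores copied from the T2I-CompBench table (baseline vs. first TACA row).
flux_base = [0.7678, 0.5064, 0.6756, 0.2066, 0.3035, 0.4359]
flux_taca = [0.7843, 0.5362, 0.6872, 0.2405, 0.3041, 0.4494]
sd_base = [0.7890, 0.5770, 0.7328, 0.2087, 0.3104, 0.4441]
sd_taca = [0.8074, 0.5938, 0.7522, 0.2678, 0.3106, 0.4470]


def deltas(baseline, variant):
    """Absolute score change per metric, rounded to 4 decimals."""
    return {m: round(v - b, 4) for m, b, v in zip(METRICS, baseline, variant)}


flux_delta = deltas(flux_base, flux_taca)
sd_delta = deltas(sd_base, sd_taca)

for name, d in [("FLUX.1-Dev", flux_delta), ("SD3.5-Medium", sd_delta)]:
    best = max(d, key=d.get)  # metric with the largest absolute gain
    print(f"{name}: largest gain on {best} ({d[best]:+.4f})")
```

For both base models, the largest absolute gain in these rows is on the Spatial metric, while Non-Spatial changes by less than 0.001.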