
TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers (ICCV 2025)

¹The University of Hong Kong       ²Nanjing University
³University of Chinese Academy of Sciences       ⁴Nanyang Technological University
⁵Harbin Institute of Technology
(*Equal contribution.    Project leader.    Corresponding author.)

Paper | Project Page | LoRA Weights

About

We propose TACA, a parameter-efficient method that dynamically rebalances cross-modal attention in multimodal diffusion transformers to improve text-image alignment.

[Teaser video: teaser.mp4]
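For intuition, here is a minimal sketch of one way such rebalancing can look inside a joint-attention (MM-DiT) layer, assuming the adjustment is a scalar boost on the text↔image blocks of the attention logits. Every name and the `gamma` value here are illustrative only; this is not the repository's actual implementation, which is distributed as LoRA weights.

```python
# Illustrative sketch only -- NOT the repo's API. Assumes cross-modal
# rebalancing is a scalar boost `gamma` on the text<->image blocks of the
# joint attention logits in an MM-DiT layer.
import torch

def rebalanced_joint_attention(q, k, v, n_txt, gamma=1.25):
    """Joint attention over concatenated [text; image] tokens, with the
    cross-modal (text<->image) logit blocks scaled by `gamma`.

    q, k, v: (batch, heads, n_txt + n_img, head_dim)
    n_txt:   number of text tokens at the front of the sequence
    gamma:   cross-modal boost; gamma = 1.0 recovers standard attention
    """
    scale = q.shape[-1] ** -0.5
    logits = (q @ k.transpose(-2, -1)) * scale          # (B, H, N, N)

    # Mark which positions are text tokens, then flag the logit entries
    # whose query and key come from different modalities.
    n = logits.shape[-1]
    is_txt = torch.zeros(n, dtype=torch.bool, device=logits.device)
    is_txt[:n_txt] = True
    cross = is_txt[:, None] ^ is_txt[None, :]           # True on cross-modal blocks

    # Boost only the cross-modal blocks; same-modality blocks are untouched.
    logits = torch.where(cross, logits * gamma, logits)

    attn = logits.softmax(dim=-1)
    return attn @ v
```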

Usage

For Stable Diffusion 3.5, simply run:

python infer/infer_sd3.py
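If you prefer to wire the released LoRA weights into your own pipeline, a minimal diffusers sketch looks roughly like the following; the LoRA path, prompt, and sampler settings are placeholders, not necessarily what `infer/infer_sd3.py` uses.

```python
# Hedged sketch with diffusers; the LoRA path below is a placeholder --
# substitute the released TACA LoRA weights.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("path/to/taca_sd3.5_lora.safetensors")  # placeholder path

image = pipe(
    "a red cube on top of a blue sphere",  # example compositional prompt
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("taca_sd3.png")
```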

For FLUX.1, run:

python infer/infer_flux.py
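The FLUX.1 case is analogous; again, the LoRA path and settings below are placeholders rather than the script's actual defaults.

```python
# Hedged sketch for FLUX.1-Dev; the LoRA path is a placeholder.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("path/to/taca_flux_lora.safetensors")  # placeholder path

image = pipe(
    "a dog to the left of a bicycle",  # example spatial-relation prompt
    num_inference_steps=50,
    guidance_scale=3.5,
).images[0]
image.save("taca_flux.png")
```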

Benchmark

Alignment evaluation on T2I-CompBench for FLUX.1-Dev-based and SD3.5-Medium-based models. Color, Shape, and Texture fall under Attribute Binding; Spatial and Non-Spatial under Object Relationship.

| Model | Color $\uparrow$ | Shape $\uparrow$ | Texture $\uparrow$ | Spatial $\uparrow$ | Non-Spatial $\uparrow$ | Complex $\uparrow$ |
|---|---|---|---|---|---|---|
| FLUX.1-Dev | 0.7678 | 0.5064 | 0.6756 | 0.2066 | 0.3035 | 0.4359 |
| FLUX.1-Dev + TACA ($r = 64$) | 0.7843 | 0.5362 | 0.6872 | 0.2405 | 0.3041 | 0.4494 |
| FLUX.1-Dev + TACA ($r = 16$) | 0.7842 | 0.5347 | 0.6814 | 0.2321 | 0.3046 | 0.4479 |
| SD3.5-Medium | 0.7890 | 0.5770 | 0.7328 | 0.2087 | 0.3104 | 0.4441 |
| SD3.5-Medium + TACA ($r = 64$) | 0.8074 | 0.5938 | 0.7522 | 0.2678 | 0.3106 | 0.4470 |
| SD3.5-Medium + TACA ($r = 16$) | 0.7984 | 0.5834 | 0.7467 | 0.2374 | 0.3111 | 0.4505 |

Showcases
