A clean, modular, and scalable system for training text-to-{image,video} Flow Matching (FM) models.
First two rows: image generation results from the flux-tiny model trained on ImageNet 256 with our minFM training system. Third row: image generation results from inference with a checkpoint loaded from FLUX.1 [dev].
- NVIDIA GPUs
- Linux environment
GPU Configuration Notes:
- The total number of GPUs must be divisible by the `shard_size` parameter in your config
- The total number of GPUs must also be divisible by the number of GPUs in each DiT balancer sharding group, specified by `dit_balancer_specs` (e.g., "g1n4" means 4 GPUs per group, so the total must be divisible by 4)
- See the Configuration section below for details on adjusting these parameters
Before running training or inference, prepare the required data and checkpoints.
# Before running the script, specify the folders for model caching (e.g., VAE) and data:
export MINFM_CACHE_DIR="<minfm-cache-dir>"
export MINFM_DATA_DIR="<minfm-data-dir>"
# Prepare: cache everything required for inference and training
# Requires HF_TOKEN (https://huggingface.co/settings/tokens)
# This downloads:
# - T5, CLIP, FLUX VAE, and FLUX.1 [dev] checkpoints to MINFM_CACHE_DIR
# - ImageNet dataset to MINFM_DATA_DIR
# Requires ~210G of storage in total.
export HF_TOKEN="<your-hf-token>"
bash ./scripts/cache_everything.sh
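Since the downloads are large (~210G in total), it can help to confirm that enough space is available before running the script; a simple check, assuming both directories already exist:
# Optional: check free space on the cache and data volumes
df -h "$MINFM_CACHE_DIR" "$MINFM_DATA_DIR"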
For training, the easiest way to get started is using the provided training script:
# Run training with the default config
bash run.sh train ./configs/flux_tiny_imagenet.yaml
# Run training with a custom config
bash run.sh train path/to/your/config.yaml
# To use Weights & Biases logging, set wandb_mode to "online" in the config
# and then set the WANDB_API_KEY environment variable
WANDB_API_KEY=<YOUR_WANDB_KEY> bash run.sh train path/to/your/config.yaml
For inference (text-to-image generation):
# Run inference using a pretrained FLUX model
# See the `inferencer` section for inference parameters
# Note that the shard_size and dit_balancer_specs in the config are pre-set
# for 4*K GPUs; adjust the values to accommodate your available GPUs.
bash run.sh inference ./configs/flux_inference.yaml
# Run inference with custom config
bash run.sh inference path/to/your/config.yaml
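If you only want to use a subset of the GPUs on a machine, one common option is to restrict which devices are visible before launching. This assumes run.sh derives its process count from the visible GPUs (the usual torchrun setup), so verify against your environment:
# Example (assumption): expose only 4 GPUs so the default 4-GPU inference settings apply as-is
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run.sh inference ./configs/flux_inference.yaml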
The `run.sh` script automatically:
- ✅ Sets up the environment with `uv`
- ✅ Runs distributed training/inference with proper settings
- ✅ Starts a background HTTP server for easy inspection of intermediate results
- ✅ Starts background periodic checkpoint clean-up
We provide a complete training example using the tiny FLUX DiT model (560.25M parameters) trained on ImageNet with the `flux_tiny_imagenet.yaml` configuration.
- Training Time: ~4 days on 8× H100 GPUs
- Total Steps: 380K steps with 1K batch size
- Model Size: 560.25M parameters
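For a rough sense of scale, these numbers imply the following aggregate throughput (a back-of-the-envelope estimate only, since the wall-clock time is approximate):
# 380K steps x 1K images/step over ~4 days (~345,600 s)
echo $(( 380000 * 1000 / 345600 ))  # ~1,100 images/s across 8 GPUs, i.e. roughly 140 images/s per H100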
All training artifacts are hosted on the HuggingFace minFM repository:
- 📈 Training & Validation Curves - Complete W&B training metrics and loss curves
- 🎨 Intermediate Visualizations - Generated samples every 2k steps
# Download, extract, and view locally
wget https://huggingface.co/datasets/Kai-46/minFM/resolve/main/flux-tiny_imagenet_intermediate_results.tar.gz
tar -xzf flux-tiny_imagenet_intermediate_results.tar.gz
cd flux-tiny_imagenet_intermediate_results
python -m http.server 8000
# Open http://localhost:8000 in your browser
- 💾 Final Checkpoint - Ready-to-use model weights at step 380k; contains the float32 model, EMA, and optimizer states
- Evaluation Metrics
Sampling solver: 50-step DDIM-style SDE
| CFG Scale | Inception Score | FID   | sFID  | Precision | Recall |
|-----------|-----------------|-------|-------|-----------|--------|
| 1.5       | 248.19          | 3.44  | 9.48  | 0.818     | 0.515  |
| 2.0       | 342.52          | 5.33  | 8.61  | 0.884     | 0.435  |
| 5.0       | 478.90          | 16.54 | 10.08 | 0.934     | 0.205  |

Reproduction:
# Download the above checkpoint to ./experiments/flux_tiny_imagenet
# Comment out the inferencer section in ./configs/flux_tiny_imagenet.yaml intended for metrics computation
# Sample images
bash run.sh inference ./configs/flux_tiny_imagenet.yaml
# Compute metrics
python scripts/eval_imagenet_metrics.py <path/to/sampled/images.npz>
If you prefer manual setup:
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install dependencies
uv sync --no-dev
# For development:
uv sync; pre-commit install
# Run training/inference (adjust --nproc_per_node to match your GPU count)
uv run torchrun --nproc_per_node=<num_gpus> -m entrypoint --config path/to/your/config.yaml --mode <train/inference>
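For example, to launch the provided ImageNet training config on 8 GPUs without going through run.sh:
# 8-GPU training run using the provided config
uv run torchrun --nproc_per_node=8 -m entrypoint --config ./configs/flux_tiny_imagenet.yaml --mode train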
The system uses YAML configuration files. See `configs/flux_tiny_imagenet.yaml` for a complete example.
When using a different number of GPUs than the default configs, you need to adjust two key parameters:
- `shard_size`: Found in the FSDP sections of model components (text_encoder, clip_encoder, denoiser)
  - Must be a divisor of your total GPU count
  - Example: For 6 GPUs, you can use `shard_size: 1`, `2`, `3`, or `6`
- `dit_balancer_specs`: Found in the balancer section
  - Format: "1x{gpus_per_group}", where {gpus_per_group} specifies the number of GPUs per sharding group (the example configs below write this as "g1n4"/"g1n8")
  - Your total GPU count must be divisible by this number
  - Example: "1x4" for 4 GPUs per group (works with 4, 8, 12, 16, ... total GPUs)
Example for 4 GPUs:
model:
text_encoder:
fsdp:
shard_size: 1 # or 2, or 4
denoiser:
fsdp:
shard_size: 1 # or 2, or 4
balancer:
dit_balancer_specs: "g1n4" # 4 GPUs per group, total GPUs (4) divisible by 4
Example for 8 GPUs with 4 GPUs per group:
model:
text_encoder:
fsdp:
shard_size: 1 # or 2, 4, or 8
denoiser:
fsdp:
shard_size: 1 # or 2, 4, or 8
balancer:
dit_balancer_specs: "g1n8" # 8 GPUs per group, total GPUs (8) divisible by 8
# you can also try 1x4, which means 2 4-GPU groups
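When adapting a config to a new GPU count, it can be convenient to first list the relevant keys; the flat grep below is just a quick lookup (the keys appear nested inside the fsdp and balancer sections), not a validator:
# List the GPU-layout-related settings in a config
grep -nE "shard_size|dit_balancer_specs" configs/flux_tiny_imagenet.yaml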
- Denoiser: Primary model architecture
- VAE: Image compression/decompression
- Text Encoder (T5): Text embeddings
- Text Encoder (CLIP): Text embeddings
- Patchifier: Image tokenization into patches
- TimeSampler: Timestep sampling
- TimeWarper: Adaptive timestep scheduling based on sequence length
- TimeWeighter: Loss weighting based on timesteps
- 🔄 Native Packed Sequences - Operate natively on packed interleaved text and image sequences
- ⚖️ KnapFormer Sequence Balancer - Balance compute workloads across GPUs for optimal performance
- 🔄 Configurable Gradient Accumulation - Automatic gradient accumulation with configurable total batch sizes
- 💾 Flexible Checkpoint Loading - Selective loading of model, EMA, optimizer, scheduler and step components
- 🚀 Distributed Training and Async Checkpointing - FSDP2 support with EMA
- 📦 Modular Design - Mix and match components with structured YAML files
- ⚡ Highly Optimized - FlashAttention variable-length support for H100/A100
# Download Parti prompts
curl -s https://raw.githubusercontent.com/google-research/parti/main/PartiPrompts.tsv \
| tail -n +2 | cut -f1 | awk 'NF' \
> resources/parti_prompts.txt
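As a quick check that the download worked, the resulting file should contain on the order of 1.6k non-empty prompt lines:
# Sanity check: count the downloaded prompts
wc -l resources/parti_prompts.txt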
This project is licensed under the Apache-2.0 License — see the LICENSE file for details.
# If you use this repo, please cite:
@misc{zhang2025minfm,
title={minFM},
author={Zhang, Kai and Wang, Peng and Bi, Sai and Zhang, Jianming and Xiong, Yuanjun},
publisher = {GitHub},
journal = {GitHub repository},
howpublished={\url{https://github.com/Kai-46/minFM/}},
year={2025}
}
# If you use the KnapFormer sequence balancer, please also cite:
@misc{zhang2025knapformer,
title={KnapFormer},
author={Zhang, Kai and Wang, Peng and Bi, Sai and Zhang, Jianming and Xiong, Yuanjun},
publisher = {GitHub},
journal = {GitHub repository},
howpublished={\url{https://github.com/Kai-46/KnapFormer/}},
year={2025}
}
# If you use the energy-preserving CFG in utils/sampler.py, please also cite:
@article{zhang2024ep,
title={EP-CFG: Energy-Preserving Classifier-Free Guidance},
author={Zhang, Kai and Luan, Fujun and Bi, Sai and Zhang, Jianming},
journal={arXiv preprint arXiv:2412.09966},
year={2024}
}
This repository may be relocated to the adobe-research organization, with this copy serving as a mirror.