Images above illustrate reconstruction capabilities of different models after transforming input image to a codebook. Any 384x384 image can be represented as a combination of the 8192 learned vectors and can be used to reconstruct the original image with insignificant perceptual loss
Latent diffusion model describes a method to convert any image into a discrete codebook ( a first stage step). This discrete representation of codes is then used for downstream tasks (second stage) like image generation (e.g. given an initial sequence of tokens representing an image syntheisize the remaining; or condition image generation based on a label). The utility of these discrete representations for discriminative tasks (e.g. classification) is yet to be proven
This fork reproduces the first stage - Transforming an input image to a codebook and then reconstructing image back from codebook. It also has utilities to convert images in batches to codebook form, as well as visualize the codebook vectors. The reproduced snapshot is used in the Google Colab notebook and docker image TBD
- Original repo
- FAQ
- Google Colab link
- [Docker Link - TBD]
1. Why reproduce and freeze an existing result?
Reproduction and freezing addresses the problem of Github repos (including their notebooks) breaking over time due to updates on the dependent packages. This problem is circumvented by taking the environment snapshot of a working version
The notebook downloads a working environment snapshot (made using conda-pack), including all required models. The docker version is essentially the same environment packaged in a container.
2. What are the limitations?
1. Conda-pack was performed on Ubuntu 20.04.4 LTS. The target OS needs to be the same for the notebook to work. This limitation is not there for the docker container even though conda-pack is used in its creation, given the container abstraction wrapped around it
2. Reproducibility is achieved by using conda-packed enviromment which needs to be run prior to execution of any code in the repository. This imposes a level of indirection in interactive coding in the notebook. Edits to python code needs to be made in a python file. The notebook cell merely serves as a command line interface to execute the python file or function.
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach*,
Andreas Blattmann*,
Dominik Lorenz,
Patrick Esser,
Björn Ommer
* equal contribution
-
Thanks to Katherine Crowson, classifier-free guidance received a ~2x speedup and the PLMS sampler is available. See also this PR.
-
Our 1.45B latent diffusion LAION model was integrated into Huggingface Spaces 🤗 using Gradio. Try out the Web Demo:
-
More pre-trained LDMs are available:
- A 1.45B model trained on the LAION-400M database.
- A class-conditional model on ImageNet, achieving a FID of 3.6 when using classifier-free guidance Available via a colab notebook
.
A suitable conda environment named ldm
can be created
and activated with:
conda env create -f environment.yaml
conda activate ldm
A general list of all available checkpoints is available in via our model zoo. If you use any of these models in your work, we are always happy to receive a citation.
Download the pre-trained weights (5.7GB)
mkdir -p models/ldm/text2img-large/
wget -O models/ldm/text2img-large/model.ckpt https://ommer-lab.com/files/latent-diffusion/nitro/txt2img-f8-large/model.ckpt
and sample with
python scripts/txt2img.py --prompt "a virus monster is playing guitar, oil on canvas" --ddim_eta 0.0 --n_samples 4 --n_iter 4 --scale 5.0 --ddim_steps 50
This will save each sample individually as well as a grid of size n_iter
x n_samples
at the specified output location (default: outputs/txt2img-samples
).
Quality, sampling speed and diversity are best controlled via the scale
, ddim_steps
and ddim_eta
arguments.
As a rule of thumb, higher values of scale
produce better samples at the cost of a reduced output diversity.
Furthermore, increasing ddim_steps
generally also gives higher quality samples, but returns are diminishing for values > 250.
Fast sampling (i.e. low values of ddim_steps
) while retaining good quality can be achieved by using --ddim_eta 0.0
.
Faster sampling (i.e. even lower values of ddim_steps
) while retaining good quality can be achieved by using --ddim_eta 0.0
and --plms
(see Pseudo Numerical Methods for Diffusion Models on Manifolds).
For certain inputs, simply running the model in a convolutional fashion on larger features than it was trained on
can sometimes result in interesting results. To try it out, tune the H
and W
arguments (which will be integer-divided
by 8 in order to calculate the corresponding latent size), e.g. run
python scripts/txt2img.py --prompt "a sunset behind a mountain range, vector image" --ddim_eta 1.0 --n_samples 1 --n_iter 1 --H 384 --W 1024 --scale 5.0
to create a sample of size 384x1024. Note, however, that controllability is reduced compared to the 256x256 setting.
The example below was generated using the above command.
Download the pre-trained weights
wget -O models/ldm/inpainting_big/last.ckpt https://heibox.uni-heidelberg.de/f/4d9ac7ea40c64582b7c9/?dl=1
and sample with
python scripts/inpaint.py --indir data/inpainting_examples/ --outdir outputs/inpainting_results
indir
should contain images *.png
and masks <image_fname>_mask.png
like
the examples provided in data/inpainting_examples
.
Available via a notebook .
We also provide a script for sampling from unconditional LDMs (e.g. LSUN, FFHQ, ...). Start it via
CUDA_VISIBLE_DEVICES=<GPU_ID> python scripts/sample_diffusion.py -r models/ldm/<model_spec>/model.ckpt -l <logdir> -n <\#samples> --batch_size <batch_size> -c <\#ddim steps> -e <\#eta>
For downloading the CelebA-HQ and FFHQ datasets, proceed as described in the taming-transformers repository.
The LSUN datasets can be conveniently downloaded via the script available here.
We performed a custom split into training and validation images, and provide the corresponding filenames
at https://ommer-lab.com/files/lsun.zip.
After downloading, extract them to ./data/lsun
. The beds/cats/churches subsets should
also be placed/symlinked at ./data/lsun/bedrooms
/./data/lsun/cats
/./data/lsun/churches
, respectively.
The code will try to download (through Academic
Torrents) and prepare ImageNet the first time it
is used. However, since ImageNet is quite large, this requires a lot of disk
space and time. If you already have ImageNet on your disk, you can speed things
up by putting the data into
${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/
(which defaults to
~/.cache/autoencoders/data/ILSVRC2012_{split}/data/
), where {split}
is one
of train
/validation
. It should have the following structure:
${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/
├── n01440764
│ ├── n01440764_10026.JPEG
│ ├── n01440764_10027.JPEG
│ ├── ...
├── n01443537
│ ├── n01443537_10007.JPEG
│ ├── n01443537_10014.JPEG
│ ├── ...
├── ...
If you haven't extracted the data, you can also place
ILSVRC2012_img_train.tar
/ILSVRC2012_img_val.tar
(or symlinks to them) into
${XDG_CACHE}/autoencoders/data/ILSVRC2012_train/
/
${XDG_CACHE}/autoencoders/data/ILSVRC2012_validation/
, which will then be
extracted into above structure without downloading it again. Note that this
will only happen if neither a folder
${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/
nor a file
${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/.ready
exist. Remove them
if you want to force running the dataset preparation again.
Logs and checkpoints for trained models are saved to logs/<START_DATE_AND_TIME>_<config_spec>
.
Configs for training a KL-regularized autoencoder on ImageNet are provided at configs/autoencoder
.
Training can be started by running
CUDA_VISIBLE_DEVICES=<GPU_ID> python main.py --base configs/autoencoder/<config_spec>.yaml -t --gpus 0,
where config_spec
is one of {autoencoder_kl_8x8x64
(f=32, d=64), autoencoder_kl_16x16x16
(f=16, d=16),
autoencoder_kl_32x32x4
(f=8, d=4), autoencoder_kl_64x64x3
(f=4, d=3)}.
For training VQ-regularized models, see the taming-transformers repository.
In configs/latent-diffusion/
we provide configs for training LDMs on the LSUN-, CelebA-HQ, FFHQ and ImageNet datasets.
Training can be started by running
CUDA_VISIBLE_DEVICES=<GPU_ID> python main.py --base configs/latent-diffusion/<config_spec>.yaml -t --gpus 0,
where <config_spec>
is one of {celebahq-ldm-vq-4
(f=4, VQ-reg. autoencoder, spatial size 64x64x3),ffhq-ldm-vq-4
(f=4, VQ-reg. autoencoder, spatial size 64x64x3),
lsun_bedrooms-ldm-vq-4
(f=4, VQ-reg. autoencoder, spatial size 64x64x3),
lsun_churches-ldm-vq-4
(f=8, KL-reg. autoencoder, spatial size 32x32x4),cin-ldm-vq-8
(f=8, VQ-reg. autoencoder, spatial size 32x32x4)}.
All models were trained until convergence (no further substantial improvement in rFID).
Model | rFID vs val | train steps | PSNR | PSIM | Link | Comments |
---|---|---|---|---|---|---|
f=4, VQ (Z=8192, d=3) | 0.58 | 533066 | 27.43 +/- 4.26 | 0.53 +/- 0.21 | https://ommer-lab.com/files/latent-diffusion/vq-f4.zip | |
f=4, VQ (Z=8192, d=3) | 1.06 | 658131 | 25.21 +/- 4.17 | 0.72 +/- 0.26 | https://heibox.uni-heidelberg.de/f/9c6681f64bb94338a069/?dl=1 | no attention |
f=8, VQ (Z=16384, d=4) | 1.14 | 971043 | 23.07 +/- 3.99 | 1.17 +/- 0.36 | https://ommer-lab.com/files/latent-diffusion/vq-f8.zip | |
f=8, VQ (Z=256, d=4) | 1.49 | 1608649 | 22.35 +/- 3.81 | 1.26 +/- 0.37 | https://ommer-lab.com/files/latent-diffusion/vq-f8-n256.zip | |
f=16, VQ (Z=16384, d=8) | 5.15 | 1101166 | 20.83 +/- 3.61 | 1.73 +/- 0.43 | https://heibox.uni-heidelberg.de/f/0e42b04e2e904890a9b6/?dl=1 | |
f=4, KL | 0.27 | 176991 | 27.53 +/- 4.54 | 0.55 +/- 0.24 | https://ommer-lab.com/files/latent-diffusion/kl-f4.zip | |
f=8, KL | 0.90 | 246803 | 24.19 +/- 4.19 | 1.02 +/- 0.35 | https://ommer-lab.com/files/latent-diffusion/kl-f8.zip | |
f=16, KL (d=16) | 0.87 | 442998 | 24.08 +/- 4.22 | 1.07 +/- 0.36 | https://ommer-lab.com/files/latent-diffusion/kl-f16.zip | |
f=32, KL (d=64) | 2.04 | 406763 | 22.27 +/- 3.93 | 1.41 +/- 0.40 | https://ommer-lab.com/files/latent-diffusion/kl-f32.zip |
Running the following script downloads und extracts all available pretrained autoencoding models.
bash scripts/download_first_stages.sh
The first stage models can then be found in models/first_stage_models/<model_spec>
Datset | Task | Model | FID | IS | Prec | Recall | Link | Comments |
---|---|---|---|---|---|---|---|---|
CelebA-HQ | Unconditional Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=0) | 5.11 (5.11) | 3.29 | 0.72 | 0.49 | https://ommer-lab.com/files/latent-diffusion/celeba.zip | |
FFHQ | Unconditional Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=1) | 4.98 (4.98) | 4.50 (4.50) | 0.73 | 0.50 | https://ommer-lab.com/files/latent-diffusion/ffhq.zip | |
LSUN-Churches | Unconditional Image Synthesis | LDM-KL-8 (400 DDIM steps, eta=0) | 4.02 (4.02) | 2.72 | 0.64 | 0.52 | https://ommer-lab.com/files/latent-diffusion/lsun_churches.zip | |
LSUN-Bedrooms | Unconditional Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=1) | 2.95 (3.0) | 2.22 (2.23) | 0.66 | 0.48 | https://ommer-lab.com/files/latent-diffusion/lsun_bedrooms.zip | |
ImageNet | Class-conditional Image Synthesis | LDM-VQ-8 (200 DDIM steps, eta=1) | 7.77(7.76)* /15.82** | 201.56(209.52)* /78.82** | 0.84* / 0.65** | 0.35* / 0.63** | https://ommer-lab.com/files/latent-diffusion/cin.zip | *: w/ guiding, classifier_scale 10 **: w/o guiding, scores in bracket calculated with script provided by ADM |
Conceptual Captions | Text-conditional Image Synthesis | LDM-VQ-f4 (100 DDIM steps, eta=0) | 16.79 | 13.89 | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/text2img.zip | finetuned from LAION |
OpenImages | Super-resolution | LDM-VQ-4 | N/A | N/A | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/sr_bsr.zip | BSR image degradation |
OpenImages | Layout-to-Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=0) | 32.02 | 15.92 | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/layout2img_model.zip | |
Landscapes | Semantic Image Synthesis | LDM-VQ-4 | N/A | N/A | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/semantic_synthesis256.zip | |
Landscapes | Semantic Image Synthesis | LDM-VQ-4 | N/A | N/A | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/semantic_synthesis.zip | finetuned on resolution 512x512 |
The LDMs listed above can jointly be downloaded and extracted via
bash scripts/download_models.sh
The models can then be found in models/ldm/<model_spec>
.
- More inference scripts for conditional LDMs.
- In the meantime, you can play with our colab notebook https://colab.research.google.com/drive/1xqzUi2iXQXDqXBHQGP9Mqt2YrYW6cx-J?usp=sharing
-
Our codebase for the diffusion models builds heavily on OpenAI's ADM codebase and https://github.com/lucidrains/denoising-diffusion-pytorch. Thanks for open-sourcing!
-
The implementation of the transformer encoder is from x-transformers by lucidrains.
@misc{rombach2021highresolution,
title={High-Resolution Image Synthesis with Latent Diffusion Models},
author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
year={2021},
eprint={2112.10752},
archivePrefix={arXiv},
primaryClass={cs.CV}
}