VQGAN

Created with conversation with Claude Sonnet 4.6

VQGAN trains a discrete autoencoder with a VQ codebook bottleneck (like VQ-VAE) under an adversarial + perceptual loss recipe, then fits an autoregressive transformer prior over the resulting tokens for generation. Its lasting contribution is the VAE training recipe, which latent-space generators from LDM to SD3 to FLUX have adopted even when the discrete bottleneck is dropped.

The VAE Training Recipe

Four loss terms applied to a continuous or VQ-regularized autoencoder:

$$L_{VAE} = L_{rec} + \lambda_{perc} L_{perc} + \lambda_{adv} L_{adv} + \lambda_{KL} L_{KL}$$

Loss	Purpose
L1 reconstruction	Pixel-level fidelity
LPIPS (perceptual)	Semantic sharpness via pretrained VGG features
PatchGAN adversarial	Local realism; prevents blurring that LPIPS misses
KL regularization (optional, tiny $\lambda_{KL}$)	Keeps latent magnitude bounded; not a true bottleneck
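
As a rough illustration of how the four terms combine, the sketch below computes the closed-form KL between a diagonal Gaussian posterior and a standard normal, then forms the weighted sum. The function names and the default weights are assumptions made for this example, not values taken from any particular codebase; the point is only that a tiny $\lambda_{KL}$ leaves the KL term as a mild magnitude penalty.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))

def vae_loss(l_rec, l_perc, l_adv, l_kl, lam_perc=1.0, lam_adv=0.5, lam_kl=1e-6):
    """Weighted sum of the four terms; the default weights are illustrative only."""
    return l_rec + lam_perc * l_perc + lam_adv * l_adv + lam_kl * l_kl

# A latent that matches the prior incurs no KL penalty.
kl_at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# A latent with large magnitude is penalized, but only weakly after weighting.
kl_large = kl_to_standard_normal([10.0, -10.0], [0.0, 0.0])

# Placeholder scalar values stand in for the other three loss terms.
total = vae_loss(l_rec=0.10, l_perc=0.20, l_adv=0.05, l_kl=kl_large)
```

With these made-up numbers the KL term contributes about 1e-4 to the total, which is why it bounds the latent scale without acting as an information bottleneck.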

L1 Reconstruction

Straightforward pixel-level L1 between input $x$ and reconstruction $\hat{x} = D(E(x))$, where $E$ is the encoder and $D$ the decoder. Necessary but insufficient: minimizing L1 is equivalent to maximizing a Laplace likelihood, whose per-pixel optimum is the conditional median. Wherever the decoder is unsure, that optimum collapses the plausible fine details into a smooth, blurry estimate.
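
A toy 1D example can make the blurring concrete. It assumes the decoder knows a region contains a sinusoidal texture but not its phase, so every phase is equally plausible. The expected L1 loss is minimized independently at each pixel by the median over those possibilities, and that median is flat. The names and the texture model here are hypothetical, chosen only for illustration.

```python
import math
import statistics

N_PHASES = 64  # evenly spaced phases standing in for equally plausible textures
phases = [2 * math.pi * k / N_PHASES for k in range(N_PHASES)]

def texture(x, phase):
    """One plausible texture: a unit-amplitude sinusoid with the given phase."""
    return math.sin(x + phase)

xs = [0.3 * i for i in range(20)]  # pixel positions

# Per-pixel minimizer of expected L1 error: the median over plausible textures.
l1_optimum = [statistics.median(texture(x, p) for p in phases) for x in xs]

# Any single plausible texture, for comparison.
single_sample = [texture(x, phases[5]) for x in xs]

l1_contrast = max(l1_optimum) - min(l1_optimum)            # essentially zero
sample_contrast = max(single_sample) - min(single_sample)  # close to 2
```

The L1-optimal output has almost no contrast even though every individual plausible texture has full contrast, which is the gap the perceptual and adversarial terms are meant to close.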

PatchGAN — Patch-Based Adversarial Loss

Paper: Isola et al. 2017, “Image-to-Image Translation with Conditional Adversarial Networks” (pix2pix)

A full-image discriminator (real/fake for the whole image) is expensive, gives a single weak gradient signal, and tends to focus on global structure while ignoring local texture. PatchGAN replaces it with a fully convolutional discriminator that produces a spatial grid of real/fake scores, each depending only on a local patch of the input (its receptive field, e.g. 70×70 pixels):

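$$L_{adv} = \mathbb{E}[\log D_{\text{disc}}(x)] + \mathbb{E}[\log(1 - D_{\text{disc}}(\hat{x}))]$$

where $D_{\text{disc}}$ is the discriminator (not the decoder $D$) and outputs a grid, not a scalar; the loss is averaged over grid positions. The discriminator maximizes this objective and the autoencoder minimizes it. The generator must fool every patch independently, so it cannot hide blurriness in any local region. This specifically targets the high-frequency local texture that L1 and LPIPS both fail to enforce.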
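
To see where a 70×70 patch comes from, the sketch below computes the receptive field and output grid size of a small convolution stack. The layer list is an assumption based on the commonly used pix2pix discriminator configuration (five 4×4 convolutions with strides 2, 2, 2, 1, 1 and padding 1); other implementations may differ.

```python
# (kernel, stride, padding) per conv layer; assumed pix2pix-style configuration.
PATCHGAN_LAYERS = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 1), (4, 1, 1)]

def receptive_field(layers):
    """Side length (in input pixels) seen by one output score."""
    rf, jump = 1, 1  # jump = spacing of this layer's inputs measured in input pixels
    for kernel, stride, _ in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

def output_size(n, layers):
    """Side length of the score grid for an n x n input."""
    for kernel, stride, padding in layers:
        n = (n + 2 * padding - kernel) // stride + 1
    return n

rf = receptive_field(PATCHGAN_LAYERS)     # 70
grid = output_size(256, PATCHGAN_LAYERS)  # 30
```

Under these assumptions a 256×256 image yields a 30×30 grid of scores, each judging an overlapping 70×70 region, so the discriminator supplies 900 local real/fake signals instead of one global one.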

Why PatchGAN complements LPIPS

LPIPS catches semantic/structural deviations. PatchGAN catches local sharpness failures. They cover different failure modes of pure reconstruction losses, which is why both are needed.

Tokens + AR Transformer

The original paper pairs the discrete autoencoder with a GPT-style autoregressive transformer over the resulting tokens for generation. LDM kept the VAE recipe but replaced the AR transformer with a diffusion model over continuous latents, which is the formulation that scaled.
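
The step from encoder output to tokens can be sketched as a nearest-neighbor lookup: each latent vector is replaced by the index of its closest codebook entry, and the index grid is flattened into the sequence the transformer models. The tiny codebook, the 2×2 latent grid, and the row-major flattening below are assumptions for illustration; real codebooks hold thousands of learned, higher-dimensional entries.

```python
def quantize(z, codebook):
    """Return the index of the codebook vector nearest to z (squared Euclidean)."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(z, codebook[i]))

# Hypothetical 4-entry codebook of 2-D vectors.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical 2x2 grid of encoder outputs.
latent_grid = [
    [[0.1, -0.1], [0.9, 0.2]],
    [[0.2, 0.8], [0.7, 0.9]],
]

# Flatten in row-major order to get the token sequence for the AR prior.
tokens = [quantize(z, codebook) for row in latent_grid for z in row]

# Decoding starts from the quantized vectors, not the raw encoder outputs.
quantized = [codebook[t] for t in tokens]
```

The autoregressive transformer then models the probability of each token given the preceding ones, and sampling a full sequence followed by decoding produces an image.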

What “VQGAN-style” Usually Means Today

When people say “VQGAN-style training” they mean the four-loss recipe, not the discrete tokens or AR transformer. SD’s VAE, DiT’s VAE, SD3/FLUX VAEs all use this recipe with a KL-regularized continuous latent. The discrete VQ bottleneck is the part that didn’t survive at scale.
