LPIPS

Created in discussion with Claude Sonnet 4.6

LPIPS (Learned Perceptual Image Patch Similarity) is a perceptual similarity loss computed in the feature space of a pretrained network. It is a standard reconstruction loss for image generators where pixel-space L1/L2 is inadequate.

The Problem with Pixel-Space Losses

L1/L2 in pixel space is a poor proxy for perceptual similarity. A 1-pixel spatial shift can produce a large L2 error yet look nearly identical to a human. Conversely, two images can have a low L2 distance yet look completely different. Minimizing L1 or L2 also corresponds to maximizing a Laplace or Gaussian likelihood, respectively. The optimum is then the per-pixel median (L1) or mean (L2) over all plausible outputs, which averages over uncertainty and produces blurry results.
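A toy illustration of both failure modes, assuming a synthetic 8x8 striped image rather than real data. The helper names `mse` and `shift_right` are made up for this sketch. Shifting the stripes by one pixel gives the largest possible error, while a flat gray image (the pixelwise average of the original and the shifted copy) scores as "closer" despite having no stripes at all.

```python
def mse(a, b):
    """Mean squared error between two equally sized 2D images (lists of rows)."""
    n = sum(len(row) for row in a)
    return sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)) / n


def shift_right(img, k=1):
    """Circularly shift every row k pixels to the right."""
    return [row[-k:] + row[:-k] for row in img]


size = 8
# Vertical stripes: columns alternate between 0.0 and 1.0.
stripes = [[float(c % 2) for c in range(size)] for _ in range(size)]
# Same texture, moved by one pixel.
shifted = shift_right(stripes, 1)
# Flat gray image: the pixelwise mean of `stripes` and `shifted`.
gray = [[0.5] * size for _ in range(size)]

mse_shifted = mse(stripes, shifted)  # 1.0: every pixel flips
mse_gray = mse(stripes, gray)        # 0.25: every pixel is off by 0.5

print(f"MSE(stripes, shifted) = {mse_shifted}")
print(f"MSE(stripes, gray)    = {mse_gray}")
```

The gray image is also what an L2-trained model would predict if it could not tell which of the two stripe phases is correct, which is the blurring effect in miniature.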

How It Works

Pass both $x$ and $\hat{x}$ through a pretrained network (VGG or AlexNet), extract intermediate feature maps at multiple layers, compute L2 distance in that feature space, then take a weighted sum across layers:

$$L_{LPIPS} = \sum_{l} w_{l} \| \phi_{l}(x) - \phi_{l}(\hat{x}) \|_{2}^{2}$$

The weights $w_{l}$ are learned on a dataset of human perceptual judgments (people choosing which of two distorted images looks more similar to a reference). The intuition: if two images activate the same intermediate CNN features, they look similar to a human, because those features capture texture, structure, and semantics rather than pixel coincidence.
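The weighted sum over layers can be sketched without a real network. This is a toy under stated assumptions, not the actual LPIPS implementation. The two "layers" (`edge_features`, `pooled_features`) are hand-written stand-ins for $\phi_{l}$, the weights are fixed at 1.0 instead of being learned, and the squared distance is averaged over positions.

```python
def edge_features(img):
    """Toy 'layer 1': absolute horizontal differences (circular), a crude texture cue."""
    return [[abs(row[c] - row[c - 1]) for c in range(len(row))] for row in img]


def pooled_features(img):
    """Toy 'layer 2': 2x2 average pooling, a crude coarse-structure cue."""
    return [
        [(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4
         for c in range(0, len(img[0]), 2)]
        for r in range(0, len(img), 2)
    ]


def mean_sq_diff(f, g):
    """Squared difference between two feature maps, averaged over positions."""
    diffs = [(a - b) ** 2 for rf, rg in zip(f, g) for a, b in zip(rf, rg)]
    return sum(diffs) / len(diffs)


def feature_distance(x, x_hat, layers, weights):
    """Weighted sum over layers of the feature-space distance."""
    return sum(w * mean_sq_diff(phi(x), phi(x_hat)) for phi, w in zip(layers, weights))


size = 8
stripes = [[float(c % 2) for c in range(size)] for _ in range(size)]
shifted = [row[-1:] + row[:-1] for row in stripes]  # 1-pixel circular shift
gray = [[0.5] * size for _ in range(size)]          # flat image, no texture

layers = [edge_features, pooled_features]
weights = [1.0, 1.0]  # hypothetical; in LPIPS these are learned from human judgments

d_shifted = feature_distance(stripes, shifted, layers, weights)  # 0.0
d_gray = feature_distance(stripes, gray, layers, weights)        # 1.0

print(f"feature distance, shifted stripes: {d_shifted}")
print(f"feature distance, flat gray:       {d_gray}")
```

In this toy feature space the shifted stripes are at distance 0 and the flat gray image is far away, the reverse of the pixel-space ranking. The exact zero is an artifact of the hand-picked features; a real pretrained network is only approximately tolerant to small shifts.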

LPIPS doesn't fully replace adversarial losses

LPIPS penalizes semantic deviation well but still tolerates some blurring, because the feature distances are averaged spatially over each feature map. This is why VQGAN pairs it with a PatchGAN discriminator. When some sharpness can be sacrificed for training stability, LPIPS is often used on its own, without an adversarial term.

Yanda's Random Notes
