This note was generated from a conversation with ChatGPT 5.2.


Core Idea

Definition

Polyak Averaging (a.k.a. Exponential Moving Average, EMA) updates parameters via:

    \bar{\theta}_t = \beta \, \bar{\theta}_{t-1} + (1 - \beta) \, \theta_t, \qquad \beta \in [0, 1) \text{ (typically close to 1)}
This is a linear interpolation in parameter space, not in function space.

We are averaging weight vectors in ℝ^d.
The resulting function is not a linear combination of network outputs — only the parameters are averaged.
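As a concrete sketch, the update can be applied per tensor over a dict of parameter arrays. This is a minimal NumPy illustration; the helper name `ema_update` and the dict layout are assumptions, not any specific library's API:

```python
import numpy as np

def ema_update(avg_params, params, beta=0.999):
    """One EMA step per tensor: avg <- beta * avg + (1 - beta) * params."""
    for name in avg_params:
        avg_params[name] = beta * avg_params[name] + (1 - beta) * params[name]
    return avg_params

# Live weights and their running average start out identical.
params = {"w": np.ones(3), "b": np.zeros(2)}
avg = {k: v.copy() for k, v in params.items()}

params["w"] += 1.0                      # pretend an optimizer step moved the weights
avg = ema_update(avg, params, beta=0.9)
print(avg["w"])                         # 0.9 * 1.0 + 0.1 * 2.0 = 1.1 per entry
```

In practice the averaged copy is kept alongside the live weights and used only for evaluation.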


What EMA Really Does

Unrolling the recursion (taking \bar{\theta}_0 = \theta_0):

    \bar{\theta}_t = \beta^{t} \, \theta_0 + (1 - \beta) \sum_{k=1}^{t} \beta^{t-k} \, \theta_k

Each past iterate \theta_k receives a geometrically decaying weight, so recent parameters dominate the average.
Intuition

EMA is a low-pass filter over the training trajectory. It smooths SGD noise and tracks a moving average of recent models.
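The geometric weighting can be checked numerically: running the recursion over a random scalar trajectory matches the unrolled closed form. This is a small NumPy sketch; the 50-step scalar "trajectory" is a stand-in for real parameter iterates:

```python
import numpy as np

rng = np.random.default_rng(1)
thetas = rng.standard_normal(50)       # scalar stand-in for parameter iterates
beta = 0.9

# Recursive EMA, initialized at the first iterate.
avg = thetas[0]
for th in thetas[1:]:
    avg = beta * avg + (1 - beta) * th

# Unrolled closed form: geometrically decaying weights on past iterates.
t = len(thetas) - 1
closed = beta**t * thetas[0] + (1 - beta) * sum(
    beta ** (t - k) * thetas[k] for k in range(1, t + 1)
)
print(abs(avg - closed))               # agrees up to float rounding
```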


Geometric Interpretation

Modern neural networks are highly overparameterized.

Empirically:

  • Many minima are connected by low-loss paths.
  • Solutions often lie inside wide flat valleys, not sharp isolated pits.

Insight

SGD tends to wander inside a flat basin. EMA pulls the solution toward the center of that basin.

The center of a flat valley is often:

  • More stable
  • Flatter
  • Better generalizing
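A toy illustration of this pull toward the basin center: noisy SGD on a 2-D quadratic bowl keeps jittering around the optimum, while the EMA iterate settles much closer to it. This is a hypothetical simulation with injected gradient noise, not a real training run:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([5.0, -3.0])          # start away from the optimum at the origin
ema = theta.copy()
beta, lr, noise = 0.99, 0.1, 0.5

for _ in range(2000):
    grad = theta + noise * rng.standard_normal(2)  # noisy gradient of 0.5 * ||theta||^2
    theta -= lr * grad                             # SGD keeps bouncing around the minimum
    ema = beta * ema + (1 - beta) * theta          # EMA averages the jitter away

print(np.linalg.norm(theta), np.linalg.norm(ema))
```

The final raw iterate fluctuates at a distance set by the noise level, while the EMA point sits near the basin center.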

When Polyak Averaging Works Best

Success

EMA works best when:

  • Training remains inside a single basin
  • The minimum is wide and flat
  • SGD has moderate noise (small/medium batch size)
  • Learning rate is not extremely small
  • Model is large and overparameterized

Large transformers and CNNs benefit strongly.


When It May Fail

Failure

EMA can fail when:

  • Averaging across disconnected basins
  • Solutions are very far apart in parameter space
  • The minimum is extremely sharp
  • Models differ by hidden-unit permutations

Averaging unrelated runs can land in high-loss regions.
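The permutation failure mode is easy to demonstrate: two one-hidden-layer ReLU nets that compute exactly the same function, but with their hidden units swapped, average to a different function. This is a hand-built NumPy example; the specific weights are illustrative:

```python
import numpy as np

def mlp(x, W, v):
    """One-hidden-layer ReLU net: f(x) = v . relu(W x)."""
    return v @ np.maximum(W @ x, 0.0)

W1, v1 = np.array([[2.0, 0.0], [0.0, 1.0]]), np.array([1.0, 3.0])
W2, v2 = W1[::-1].copy(), v1[::-1].copy()   # same net with hidden units swapped

x = np.array([1.0, 1.0])
print(mlp(x, W1, v1), mlp(x, W2, v2))       # 5.0 5.0 -- functionally identical

W_avg, v_avg = (W1 + W2) / 2, (v1 + v2) / 2
print(mlp(x, W_avg, v_avg))                 # 6.0 -- the average is a different function
```

EMA along a single training run avoids this, because consecutive checkpoints share the same hidden-unit alignment.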


Special Case: Reinforcement Learning

In Q-learning and actor-critic methods:

  • Targets are bootstrapped
  • Instability compounds over time

Important

EMA stabilizes target drift. It smooths oscillations and prevents divergence.

That is why Polyak averaging is standard in modern RL.
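In RL code this usually appears as a "soft" target-network update applied after each learning step. A minimal sketch, assuming dicts of weight arrays; the name `soft_update` is illustrative, though τ ≈ 0.005 is a common choice in DDPG-style methods:

```python
import numpy as np

def soft_update(target, online, tau=0.005):
    """Polyak target update: target <- (1 - tau) * target + tau * online.
    Same as the EMA recursion with beta = 1 - tau."""
    for name in target:
        target[name] = (1 - tau) * target[name] + tau * online[name]

online = {"w": np.full(4, 2.0)}        # pretend the online net has settled at 2.0
target = {"w": np.zeros(4)}

for _ in range(1000):                  # one soft update per learning step
    soft_update(target, online, tau=0.005)

print(target["w"][0])                  # ~1.987: drifts smoothly toward the online weights
```

Because the target moves only a fraction τ per step, the bootstrapped targets change slowly even when the online network oscillates.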


Big Picture

Summary

Polyak averaging works not because neural networks are linear, but because modern loss landscapes are wide and connected.

EMA moves parameters toward high-volume, flat regions, which improves stability and often generalization.