This note was generated from my conversation with ChatGPT 5.2.
Core Idea
Definition
Polyak Averaging (a.k.a. Exponential Moving Average, EMA) updates parameters via:

$$\bar\theta_t = \beta\,\bar\theta_{t-1} + (1 - \beta)\,\theta_t$$

where $\theta_t$ are the raw SGD iterates, $\bar\theta_t$ is the averaged copy, and $\beta \in [0, 1)$ is the decay.
This is a linear interpolation in parameter space, not in function space.
We are averaging weight vectors in $\mathbb{R}^d$.
The resulting function is not a linear combination of network outputs — only the parameters are averaged.
What EMA Really Does
Unrolling the recursion with decay $\beta$:

$$\bar\theta_t = (1 - \beta)\sum_{k=0}^{t-1} \beta^{k}\,\theta_{t-k} + \beta^{t}\,\bar\theta_0$$

The average is an exponentially weighted mixture of recent iterates, with weights shrinking by a factor of $\beta$ per step into the past.
Intuition
EMA is a low-pass filter over the training trajectory. It smooths SGD noise and tracks a moving average of recent models.
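This smoothing effect can be seen in a minimal sketch (NumPy; the synthetic trajectory, decay $\beta = 0.9$, and noise level are illustrative assumptions, not from the original conversation):

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.9  # decay: higher = smoother, but slower to track changes

# Noisy "parameter trajectory": a drift toward 1.0 plus SGD-like noise.
theta = 1.0 - np.exp(-0.05 * np.arange(200)) + 0.3 * rng.standard_normal(200)

# Apply the EMA recursion to the trajectory.
ema = np.empty_like(theta)
ema[0] = theta[0]
for t in range(1, len(theta)):
    ema[t] = beta * ema[t - 1] + (1 - beta) * theta[t]

# The averaged trajectory has much lower variance than the raw one.
print("raw std:", np.std(theta[100:]), " ema std:", np.std(ema[100:]))
```

The standard deviation of the averaged tail is far smaller than that of the raw iterates, which is exactly the low-pass-filter behavior described above.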
Geometric Interpretation
Modern neural networks are highly overparameterized.
Empirically:
- Many minima are connected by low-loss paths.
- Solutions often lie inside wide flat valleys, not sharp isolated pits.
Insight
SGD tends to wander inside a flat basin. EMA pulls the solution toward the center of that basin.
The center of a flat valley is often:
- More stable
- Flatter
- Better generalizing
When Polyak Averaging Works Best
Success
EMA works best when:
- Training remains inside a single basin
- The minimum is wide and flat
- SGD has moderate noise (small/medium batch size)
- Learning rate is not extremely small
- Model is large and overparameterized
Large transformers and CNNs benefit strongly.
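In practice this is implemented by keeping a shadow copy of the weights alongside training. A minimal sketch in plain NumPy (the class name, decay value, and dict-of-arrays parameter format are assumptions for illustration):

```python
import numpy as np

class EMAWeights:
    """Maintain a shadow (EMA) copy of model parameters."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = {name: p.copy() for name, p in params.items()}

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current
        d = self.decay
        for name, p in params.items():
            self.shadow[name] = d * self.shadow[name] + (1 - d) * p

# Usage: after each optimizer step, call ema.update(current_params);
# evaluate with ema.shadow instead of the raw weights.
params = {"w": np.zeros(3)}
ema = EMAWeights(params, decay=0.5)
for _ in range(3):
    params["w"] += 1.0   # stand-in for an optimizer step
    ema.update(params)
print(ema.shadow["w"])   # lags behind params["w"] == 3.0
```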
When It May Fail
Failure
EMA can fail when:
- Averaging across disconnected basins
- Solutions are very far apart in parameter space
- The minimum is extremely sharp
- Models differ by hidden-unit permutations
Averaging unrelated runs can land in high-loss regions.
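The permutation failure mode can be shown directly: a tiny ReLU network and its hidden-unit permutation compute the same function, yet averaging their weights yields a different function (the architecture and values below are illustrative assumptions):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

# Tiny one-hidden-layer net: f(x) = v @ relu(W @ x)
def f(W, v, x):
    return v @ relu(W @ x)

W = np.array([[1.0, 0.0], [0.0, -1.0]])
v = np.array([2.0, 3.0])

# Permuting the hidden units gives an identical function...
perm = [1, 0]
Wp, vp = W[perm], v[perm]

x = np.array([0.5, -0.7])
print(f(W, v, x), f(Wp, vp, x))  # equal outputs

# ...but averaging the two parameter sets does not:
Wa, va = (W + Wp) / 2, (v + vp) / 2
print(f(Wa, va, x))              # generally a different output
```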
Special Case: Reinforcement Learning
In Q-learning and actor-critic methods:
- Targets are bootstrapped
- Instability compounds over time
Important
EMA stabilizes target drift. It smooths oscillations and prevents divergence.
That is why Polyak averaging is standard in modern RL.
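The soft target update used in methods such as DDPG and SAC takes the standard form target ← (1 − τ)·target + τ·online. A minimal sketch (the value of τ and the dict-of-arrays format are assumptions for illustration):

```python
import numpy as np

def soft_update(target, online, tau=0.005):
    """Polyak soft update: target <- (1 - tau) * target + tau * online."""
    for name in target:
        target[name] = (1.0 - tau) * target[name] + tau * online[name]

online = {"w": np.ones(2)}
target = {"w": np.zeros(2)}
for _ in range(100):
    soft_update(target, online, tau=0.01)
# The target network drifts slowly toward the online parameters.
print(target["w"])  # ~ 1 - 0.99**100 ≈ 0.634
```

Because τ is small, the target network changes slowly even when the online network oscillates, which is the stabilizing effect described above.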
Big Picture
Summary
Polyak averaging works not because neural networks are linear, but because modern loss landscapes are wide and connected.
EMA moves parameters toward high-volume, flat regions, which improves stability and often generalization.