Numbers
- Training GPT-3 (2020) took 3.14e23 FLOPs
- Training GPT-4 (2023) is speculated to have taken ~2e25 FLOPs
- An A100 has a peak performance of 3.12e14 FLOP/s (bfloat16, dense)
- An H100 has a peak performance of 9.90e14 FLOP/s (bfloat16, dense)
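These numbers support a quick back-of-envelope estimate. As a sketch (assuming 100% utilization, which real training never reaches; actual model FLOP utilization is typically well under half of peak):

```python
# Rough wall-clock estimate: how long would GPT-3's training FLOPs
# take on a single A100 at peak throughput?
# Purely illustrative; assumes perfect (100%) utilization.
gpt3_flops = 3.14e23           # total training FLOPs for GPT-3
a100_flops_per_sec = 3.12e14   # A100 peak bfloat16, dense

seconds = gpt3_flops / a100_flops_per_sec
days = seconds / 86400
print(f"~{days:.0f} A100-days")
```

Dividing by a realistic utilization factor and the number of GPUs gives a first-order training time estimate.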
Linear model
- Forward: one multiplication (x[i][j] * w[j][k]) and one addition per (i, j, k) triple, i.e., 2 FLOPs per triple.
- Backward: 4 FLOPs per triple (2 each for the gradients with respect to x and w), so 6 FLOPs per triple in total.
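The counting above can be sketched concretely for a linear layer y = x @ w. The shapes B, D, K below are illustrative placeholders (batch size, input dim, output dim):

```python
import numpy as np

B, D, K = 4, 8, 16  # batch, input dim, output dim (illustrative sizes)
x = np.random.randn(B, D)
w = np.random.randn(D, K)

# Forward: y[i][k] = sum_j x[i][j] * w[j][k]
# -> one multiply + one add per (i, j, k) triple = 2 * B * D * K FLOPs.
y = x @ w
forward_flops = 2 * B * D * K

# Backward: grad_x = grad_y @ w.T and grad_w = x.T @ grad_y,
# each another matmul over the same (i, j, k) triples
# -> 2 * (2 * B * D * K) = 4 * B * D * K FLOPs.
backward_flops = 4 * B * D * K

total_flops = forward_flops + backward_flops  # 6 * B * D * K
print(forward_flops, backward_flops, total_flops)
```

This is the origin of the common rule of thumb that training costs about 6 FLOPs per parameter per token (2 forward + 4 backward).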