Latency & throughput
Latency is determined by memory I/O: each decoding step must read all model parameters and the entire KV cache from memory.
latency = memory / memory_bandwidth

Throughput is the inverse of latency, but we're generating B tokens in parallel (one per sequence in the batch).

throughput = B / latency

Tradeoff between latency and throughput:
- Smaller batch sizes yield better latency but worse throughput
- Larger batch sizes yield better throughput but worse latency (the KV cache to read grows with B)
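The tradeoff above can be sketched numerically. This is a minimal model, assuming illustrative numbers not taken from the text: a 70B-parameter model in bf16 (140 GB of weights), ~2 GB of KV cache per sequence, and ~3.35e12 bytes/s of memory bandwidth (roughly an H100's HBM3). Per-step latency grows with B because of the KV cache, while throughput B / latency still grows, just sublinearly.

```python
def step_latency(params_bytes: float, kv_bytes_per_seq: float,
                 batch_size: int, bandwidth: float) -> float:
    """Seconds per decode step: read all parameters plus every sequence's KV cache."""
    return (params_bytes + batch_size * kv_bytes_per_seq) / bandwidth

def throughput(batch_size: int, latency_s: float) -> float:
    """Tokens generated per second across the whole batch."""
    return batch_size / latency_s

# Illustrative (assumed) numbers: 140 GB of weights, 2 GB KV cache per
# sequence, 3.35e12 bytes/s of memory bandwidth.
PARAMS, KV_PER_SEQ, BW = 140e9, 2e9, 3.35e12

for B in (1, 8, 64):
    lat = step_latency(PARAMS, KV_PER_SEQ, B, BW)
    print(f"B={B:3d}: {lat * 1e3:6.1f} ms/step, {throughput(B, lat):7.1f} tok/s")
```

With these numbers, going from B=1 to B=64 roughly doubles per-step latency but raises throughput by more than 30x, which is why serving systems batch aggressively when per-token latency budgets allow it.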
Easy parallelism: if you launch M copies of the model, latency stays the same and throughput increases by M!

Harder parallelism: shard the model and the KV cache across devices. See the Scaling Book chapter on Transformers.