Latency & throughput
Latency is determined by memory I/O: each decoding step must read all model parameters and the entire KV cache from memory.
latency = memory / memory_bandwidth

Throughput is the inverse of latency, but we're generating B tokens in parallel (one per sequence in the batch).

throughput = B / latency

Tradeoff between latency and throughput:
- Smaller batch sizes yield better latency but worse throughput
- Larger batch sizes yield better throughput but worse latency (the KV cache to read grows with B)
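The tradeoff above can be sketched numerically. This is a minimal model, assuming illustrative numbers not taken from the text: a 70B-parameter model in bf16 (140 GB of weights), ~2 GB of KV cache per sequence, and ~3.35e12 bytes/s of memory bandwidth (roughly an H100's HBM3). Per-step latency grows with B because of the KV cache, while throughput B / latency still grows, just sublinearly.

```python
def step_latency(params_bytes: float, kv_bytes_per_seq: float,
                 batch_size: int, bandwidth: float) -> float:
    """Seconds per decode step: read all parameters plus every sequence's KV cache."""
    return (params_bytes + batch_size * kv_bytes_per_seq) / bandwidth

def throughput(batch_size: int, latency_s: float) -> float:
    """Tokens generated per second across the whole batch."""
    return batch_size / latency_s

# Illustrative (assumed) numbers: 140 GB of weights, 2 GB KV cache per
# sequence, 3.35e12 bytes/s of memory bandwidth.
PARAMS, KV_PER_SEQ, BW = 140e9, 2e9, 3.35e12

for B in (1, 8, 64):
    lat = step_latency(PARAMS, KV_PER_SEQ, B, BW)
    print(f"B={B:3d}: {lat * 1e3:6.1f} ms/step, {throughput(B, lat):7.1f} tok/s")
```

With these numbers, going from B=1 to B=64 roughly doubles per-step latency but raises throughput by more than 30x, which is why serving systems batch aggressively when per-token latency budgets allow it.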
Easy parallelism: if you launch M copies of the model, latency stays the same and throughput increases by M!

Harder parallelism: shard the model and the KV cache across devices. See the Scaling Book chapter on Transformers.