motpod
Dwarkesh Podcast · May 13, 2026

Reiner Pope – The math behind how LLMs are trained and served

AI generated article / en / study

Source podcast

Dwarkesh Podcast / Dwarkesh Patel

In this blackboard lecture, Reiner Pope (CEO of MatX, former Google TPU architect) walks through the fundamental equations that govern how frontier language models are trained and served, revealing that a surprisingly small set of parameters—batch size, sparsity, memory bandwidth, and compute throughput—determine everything from API pricing to model architecture decisions. The conversation moves from first-principles roofline analysis through practical implications for inference costs, the physical constraints of GPU racks, and the tradeoffs that explain why models are trained the way they are, all while demonstrating how much can be deduced about what AI labs are doing from public API prices and basic hardware specs.

0:00 Batch Size, Latency, and the Economics of Inference

The episode opens with a concrete question: why do services like Claude and Cursor offer a "fast mode" that charges 6x more for 2.5x faster token generation, and could you push this further—paying 100x for even faster speeds, or waiting minutes for dramatically cheaper service? Pope explains that the core mechanism is batch size, and he proceeds to quantify exactly how it affects both latency and cost.

Pope introduces a roofline analysis framework for running a transformer model on a Blackwell NVL72 cluster (a rack of 72 GPUs). The analysis considers two fundamental constraints: memory bandwidth and compute performance. The time to run inference must be at least the maximum of two quantities: the time to fetch all weights and KV cache from memory, and the time to perform all the matrix multiplications. The KV cache is the stored internal representations of previous tokens that the attention mechanism needs to reference when generating each new token.

The key insight emerges when Pope plots batch size against latency. The compute time grows linearly with batch size, while the memory time has two components: a constant term for fetching weights (independent of batch size) and a linear term for fetching KV cache (which grows with batch size). The overall latency curve shows a lower bound determined by how fast the system can read all parameters from memory—you cannot beat the time required to move the weights from HBM to the compute units.

When converting this to cost per token (time divided by batch size), the picture becomes even more striking. At batch size 1, cost per token is at its highest because the entire weight fetch is charged to a single token. As batch size grows, the weight fetch is amortized over more tokens and becomes negligible, and the system eventually becomes compute-bound, hitting a lower bound on cost. This explains why a "slow mode" could not be made dramatically cheaper: such a service would already sit on the flat part of the cost curve, where the remaining costs (compute and KV cache reads) are unique to each sequence and cannot be amortized across a larger batch.
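The two curves described above can be sketched in a few lines of Python. This is a minimal illustration of the roofline bound, not Pope's exact blackboard formulas: the function names and every hardware and model number in `SYSTEM` are assumed round figures chosen only to make the shape of the curves visible.

```python
def step_time_s(batch, weight_bytes, kv_bytes_per_seq,
                flops_per_token, mem_bw, peak_flops):
    """Roofline lower bound on one decode step, in seconds."""
    # Memory: weights are read once per step; KV cache is read once per sequence.
    memory_time = (weight_bytes + batch * kv_bytes_per_seq) / mem_bw
    # Compute: every sequence in the batch needs its own matrix multiplications.
    compute_time = batch * flops_per_token / peak_flops
    return max(memory_time, compute_time)


def cost_per_token_s(batch, **system):
    """Rack-seconds spent per generated token at a given batch size."""
    return step_time_s(batch, **system) / batch


# Assumed round numbers, for illustration only.
SYSTEM = dict(
    weight_bytes=1e12,      # 1 TB of weights
    kv_bytes_per_seq=1e8,   # 100 MB of KV cache per sequence
    flops_per_token=2e11,   # 2 FLOPs x 100B active parameters
    mem_bw=5e14,            # aggregate memory bandwidth, bytes/s
    peak_flops=1.5e17,      # aggregate compute, FLOP/s
)

for batch in (1, 16, 256, 4096):
    print(batch, step_time_s(batch, **SYSTEM), cost_per_token_s(batch, **SYSTEM))
```

With these assumed inputs the step time never drops below the weight fetch time, while cost per token falls steeply at first and then flattens once the compute term takes over.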

12:34 The Optimal Batch Size and Sparsity

Pope derives a remarkably simple formula for the batch size needed to balance memory and compute. Equating weight fetch time to compute time, and rearranging terms, he finds that the optimal batch size is approximately 300 times the sparsity ratio (total parameters divided by active parameters). For a model like DeepSeek V3, which activates 32 out of 256 experts, this gives a sparsity factor of 8, meaning the optimal batch is around 2,400 tokens.

This result is striking because it depends only on sparsity, not on absolute model size. The hardware constant (roughly 300) comes from the ratio of compute throughput to memory bandwidth, measured in a way that accounts for the precision of operations. Pope notes that this ratio has remained remarkably stable across GPU generations from A100 to H100 to B100, even as both flops and memory bandwidth have increased substantially.
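The balance condition can be written out as a short calculation. The sketch below assumes the hardware constant folds in the factor of 2 FLOPs per parameter and the bytes per parameter; that grouping, the function names, and the placeholder hardware figures are assumptions, while the value 300 and the 256/32 expert counts are the ones quoted above.

```python
def hardware_constant(peak_flops, mem_bw, bytes_per_param):
    """Batch size at which a dense model's weight fetch equals its compute time.

    weight fetch time = total_params * bytes_per_param / mem_bw
    compute time      = 2 * active_params * batch / peak_flops
    Setting the two equal and solving for batch gives
    batch = constant * (total_params / active_params).
    """
    return peak_flops * bytes_per_param / (2 * mem_bw)


def balanced_batch(total_params, active_params, constant=300):
    """Balanced batch size, in tokens, for a given sparsity ratio."""
    sparsity = total_params / active_params
    return constant * sparsity


# Expert counts as quoted in the text: 32 active out of 256.
print(balanced_batch(256, 32))  # 2400.0

# Placeholder hardware figures chosen so the constant comes out at 300.
print(hardware_constant(peak_flops=6e17, mem_bw=1e15, bytes_per_param=1))
```

Note that absolute model size cancels out: only the ratio of total to active parameters enters the result.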

The practical implication is that a single rack can serve about 128,000 tokens per second at optimal efficiency—a significant but not enormous number. When compared to global traffic numbers (Gemini reportedly serving hundreds of millions of tokens per second), this suggests that serving at competitive scale requires many racks, but the economies of scale from batching are not as extreme as one might think. The optimal batch size of roughly 2,000 concurrent sequences is achievable with a modest user base.

Pope then examines how far sparsity can be pushed. Citing a paper on unified scaling laws for routed language models, he shows that increasing sparsity (more experts, fewer active) yields diminishing returns in model quality. For example, a 370 million parameter model with 64 experts performs about as well as a dense 1.3 billion parameter model: a roughly 4x improvement in active parameter efficiency requires a 64x increase in total parameters. From a systems perspective, however, this tradeoff is a pure win. The extra memory cost of storing more total parameters is amortized over larger batch sizes, and the compute savings from fewer active parameters are direct. The only constraint is having enough users to fill the larger batches.

32:09 How Mixture of Experts Models Are Laid Out Across GPU Racks

Pope shifts to explaining how mixture of experts (MoE) layers are physically mapped onto GPU racks, revealing the communication constraints that shape modern AI infrastructure. In a standard MoE layer, tokens pass through a router that sends each token to a small subset of experts (perhaps 1 in 32). Each expert is a normal MLP with up and down projections and a nonlinearity. The outputs from selected experts are summed and added to a residual connection.
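The layer described above can be sketched in plain Python. This is a toy illustration, not any lab's implementation: the ReLU nonlinearity, the softmax weighting of the selected experts, the random initialization, and all dimensions are assumptions made to keep the example self-contained.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def moe_layer(token, router, experts, top_k):
    """Route one token to its top_k experts and add their outputs to the residual."""
    logits = matvec(router, token)
    chosen = sorted(range(len(experts)), key=lambda e: logits[e], reverse=True)[:top_k]
    # Softmax over the selected experts only (an assumed gating scheme).
    peak = max(logits[e] for e in chosen)
    weights = [math.exp(logits[e] - peak) for e in chosen]
    total = sum(weights)
    output = list(token)  # residual connection
    for expert_id, weight in zip(chosen, weights):
        w_up, w_down = experts[expert_id]
        hidden = [max(0.0, h) for h in matvec(w_up, token)]  # up projection + ReLU
        expert_out = matvec(w_down, hidden)                  # down projection
        output = [o + (weight / total) * y for o, y in zip(output, expert_out)]
    return output, chosen


rng = random.Random(0)
d_model, d_hidden, num_experts = 8, 16, 32
router = random_matrix(num_experts, d_model, rng)
experts = [(random_matrix(d_hidden, d_model, rng), random_matrix(d_model, d_hidden, rng))
           for _ in range(num_experts)]
token = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

out, chosen = moe_layer(token, router, experts, top_k=1)  # 1 of 32 experts active
print(chosen, len(out))
```

Only the chosen experts' weights are touched for a given token, which is what makes the active parameter count much smaller than the total.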

The standard practice is expert parallelism: different experts live on different GPUs. For a DeepSeek-style model with 256 experts running on a Blackwell rack with 72 GPUs, you might use 64 GPUs with 4 experts each. The communication pattern is all-to-all: any GPU may need to send tokens to any other GPU, depending on the router's decisions. This creates a perfect fit for the NVLink topology within a single rack, where every GPU can talk to every other GPU in just two hops through the central NV switches.
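A small sketch of this placement and the resulting all-to-all traffic follows. The contiguous expert-to-GPU mapping, the function names, and the rule that a token is sent once per destination GPU are illustrative assumptions; the 256-expert, 64-GPU, 4-per-GPU layout is the one described above.

```python
from collections import Counter

NUM_EXPERTS = 256
NUM_GPUS = 64                              # 64 of the rack's 72 GPUs hold experts
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS  # 4


def gpu_for_expert(expert_id):
    """Contiguous placement: experts 0-3 on GPU 0, 4-7 on GPU 1, and so on."""
    return expert_id // EXPERTS_PER_GPU


def all_to_all_traffic(routing):
    """Count token copies per (source GPU, destination GPU) pair.

    routing: iterable of (source_gpu, expert_ids) pairs, one per token.
    """
    traffic = Counter()
    for source_gpu, expert_ids in routing:
        # Assumed: a token is sent once per destination GPU,
        # even if it hits two experts that live on the same GPU.
        for dest_gpu in {gpu_for_expert(e) for e in expert_ids}:
            traffic[(source_gpu, dest_gpu)] += 1
    return traffic


# Three hypothetical tokens with made-up router decisions.
routing = [(0, [0, 1, 200]), (5, [20, 255]), (63, [3])]
traffic = all_to_all_traffic(routing)
for pair, copies in sorted(traffic.items()):
    print(pair, copies)
```

Because the router's choices are data-dependent, any (source, destination) pair can appear, which is why a topology where every GPU reaches every other GPU at full speed suits this pattern.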

The problem arises when you need to cross rack boundaries. The scale-out network connecting racks is typically about 8x slower than the scale-up network within a rack. If half the tokens need to go to GPUs in another rack, that slower connection becomes the bottleneck. This is why one rack effectively bounds the size of an expert layer you can efficiently deploy, and it explains the industry trend toward larger interconnect domains—from Hopper's 8 GPUs to Blackwell's 72 to Rubin's projected 500+.
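The size of the penalty is easy to estimate. The sketch below assumes the in-rack and cross-rack links transfer in parallel, so the slower of the two finishes last, and it measures time in units where sending all traffic inside the rack takes 1.0. Both the model and the function name are simplifications introduced here.

```python
def all_to_all_time(remote_fraction, scale_up_bw=1.0, slowdown=8.0):
    """Relative time for one all-to-all when some traffic leaves the rack."""
    scale_out_bw = scale_up_bw / slowdown
    local_time = (1.0 - remote_fraction) / scale_up_bw
    remote_time = remote_fraction / scale_out_bw
    # Assumed: both networks run concurrently, so the slower one sets the pace.
    return max(local_time, remote_time)


print(all_to_all_time(0.0))  # 1.0: everything stays inside the rack
print(all_to_all_time(0.5))  # 4.0: half the tokens cross the 8x slower link
```

Under these assumptions, splitting an expert layer evenly across two racks makes the exchange about four times slower than keeping it in one rack, even though only half the traffic uses the slow link.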

Pope explains that the physical constraints on rack size are surprisingly mundane: power delivery, weight (racks need enough metal to not sag), cooling, and cable density. The cables themselves are a major constraint—connector density, bend radius, and the physical space to route thousands of cables all limit how many GPUs can be packed into a single rack. The jump from 8 to 72 GPUs was primarily a product decision to switch from tray-based to rack-based form factors, while the jump to 500+ requires genuinely new physical design to manage cable complexity.

47:12 Pipeline Parallelism and the Limits of Multi-Rack Deployment

Pope introduces pipeline parallelism as an alternative for using multiple racks, where different layers of the model are placed on different racks. The key question is whether the communication cost of moving data between racks becomes a bottleneck. He derives a ratio comparing scale-up time (within a rack) to scale-out time (between racks), accounting for the 8x bandwidth difference, the number of activated experts per token, the number of layers per pipeline stage, and a factor of 2 for the round trip.

The result is that pipeline parallelism can work efficiently because the data sent between racks is relatively small—just the token representations—while within a rack, tokens must be broadcast to many experts. The ratio of scale-up to scale-out time can easily exceed 1, meaning the slower inter-rack connection is not the bottleneck. This allows a pipeline of racks where each rack handles one or a few layers before passing data to the next.
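One way to write down that ratio, using only the factors listed above, is sketched here. The exact form of Pope's expression is not reproduced in this summary, so the formula below is a reconstruction under the assumption that each transfer moves one token representation of the same size; the function name and example values are hypothetical.

```python
def scale_up_to_scale_out_ratio(active_experts, layers_per_stage, bandwidth_gap=8.0):
    """Ratio of in-rack communication time to between-rack communication time.

    Units: the time to move one token representation over the scale-up network.
    """
    # Inside the rack: per layer, the token goes out to each active expert
    # and the result comes back (factor of 2 for the round trip).
    scale_up_time = 2 * active_experts * layers_per_stage
    # Between racks: one token representation per stage, over a slower link.
    scale_out_time = 1 * bandwidth_gap
    return scale_up_time / scale_out_time


print(scale_up_to_scale_out_ratio(active_experts=8, layers_per_stage=1))  # 2.0
print(scale_up_to_scale_out_ratio(active_experts=8, layers_per_stage=4))  # 8.0
```

A ratio above 1 means the in-rack traffic takes longer than the hop between racks, so the slower scale-out link is not the limiting factor.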

Pope notes an interesting convergence: the best parallelism strategy physically resembles the model architecture itself. Experts go on different GPUs, layers go on different racks. This is not some "galaxy brain" optimization but a natural mapping of the model's structure onto the hardware topology.

However, pipeline parallelism has significant drawbacks. During training, it creates "pipeline bubbles": idle time when some stages are waiting for others. This is why Ilya Sutskever reportedly said, "As we now know, pipelining is not wise." The problem is that to avoid bubbles, you need to split the batch into micro-batches, which means you can't amortize weight loading across the full batch. To keep every stage busy, the number of micro-batches in flight must be at least the number of pipeline stages, and this cancels out the memory savings from pipelining for the KV cache. Pope shows that while pipelining reduces weight memory per stage in proportion to the number of stages, the KV cache memory per GPU stays constant because proportionally more sequences are in flight simultaneously.
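The memory accounting behind that claim can be sketched as follows. The sketch assumes exactly one micro-batch in flight per stage and that each stage stores only its own layers' share of every sequence's KV cache; the byte counts are placeholder values.

```python
def per_stage_memory(weight_bytes, kv_bytes_per_seq, microbatch_seqs, stages):
    """Return (weight bytes, KV cache bytes) held by one pipeline stage."""
    # Weights are split across stages.
    weights = weight_bytes / stages
    # Assumed: one micro-batch in flight per stage keeps the pipeline full.
    seqs_in_flight = microbatch_seqs * stages
    # Each stage stores only its layers' slice of every in-flight sequence's KV cache.
    kv_cache = seqs_in_flight * kv_bytes_per_seq / stages
    return weights, kv_cache


for stages in (1, 2, 4, 8):
    weights, kv_cache = per_stage_memory(
        weight_bytes=1e12, kv_bytes_per_seq=1e8, microbatch_seqs=64, stages=stages)
    print(stages, weights, kv_cache)
```

Weight memory per stage halves each time the stage count doubles, while the KV cache column does not move: the extra sequences in flight exactly offset the smaller per-sequence slice.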

For inference, pipelining is neither better nor worse for latency—the total time is the same whether layers are spread across racks or stacked in one rack. Its main benefit is reducing memory capacity requirements per rack, but with modern racks having terabytes of HBM (enough for a trillion-parameter model plus KV cache), this benefit is often unnecessary. The real value of larger scale-up domains is not memory capacity but memory bandwidth: having more GPUs in parallel to load weights dramatically reduces latency.

1:18:59 How RL Changes the Optimal Training-Inference Tradeoff

Pope tackles the question of how much models are "overtrained" beyond what Chinchilla scaling laws would recommend, and how reinforcement learning (RL) generation changes this calculus. The Chinchilla-optimal point minimizes training compute for a given model quality, but the real objective is minimizing total compute (training + inference) for a given quality delivered to users. With RL, there's an additional term: the compute spent generating trajectories during RL training.

Pope's heuristic approach is that when minimizing a sum of costs, the optimum tends to occur where the costs are equalized. He breaks total cost into three components: pretraining (6 × active parameters × pretraining data), RL (2-6 × active parameters × RL data, depending on whether backward passes are done on all rollouts), and inference (2 × active parameters × inference data). The factor 6 comes from the famous "6ND" formula for training flops (forward + backward pass), while inference is just forward pass (factor 2).

Setting these costs equal and solving, Pope finds that the number of inference tokens should roughly equal the number of pretraining tokens, which should roughly equal the number of RL tokens, within factors that depend on the efficiency of RL training. Using plausible numbers (500 million tokens per second for a frontier model, deployed for 2 months), he estimates roughly 200 trillion inference tokens. The inputs as summarized here do not multiply out exactly: 500 million tokens per second over 60 days is about 2,600 trillion tokens, so the 200 trillion figure corresponds to an average rate closer to 40 million tokens per second. Comparing 200 trillion to Chinchilla-optimal pretraining (about 2 trillion tokens for a 100 billion active parameter model) suggests models are overtrained by a factor of about 100x.

This is a striking conclusion: frontier models are trained on roughly 100 times more data than would be optimal for training alone, because the savings in inference cost (from having a smaller, more efficient model) outweigh the extra training cost. Pope emphasizes that this is a rough estimate with large error bars, but the methodology—setting costs equal and solving—is powerful. It means that from public API pricing and usage numbers, one can deduce roughly how much data went into pretraining a model, even without insider information.
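The cost terms and the overtraining estimate can be reproduced in a few lines. The sketch uses the figures quoted in this section (100 billion active parameters, 200 trillion inference tokens) and assumes the common rule of thumb of about 20 Chinchilla-optimal tokens per parameter, which is consistent with the 2 trillion figure above. Function names are illustrative.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # assumed rule of thumb


def pretraining_flops(active_params, tokens):
    return 6 * active_params * tokens  # forward + backward pass


def rl_flops(active_params, tokens, flops_per_param_token=6):
    # Between 2 (forward only) and 6 (backward pass on every rollout).
    return flops_per_param_token * active_params * tokens


def inference_flops(active_params, tokens):
    return 2 * active_params * tokens  # forward pass only


active_params = 100e9
inference_tokens = 200e12
chinchilla_tokens = CHINCHILLA_TOKENS_PER_PARAM * active_params

# Taking "pretraining tokens roughly equal inference tokens" at face value.
rough_overtraining = inference_tokens / chinchilla_tokens

# Keeping the 6-versus-2 coefficients when equalizing FLOPs instead of tokens.
balanced_pretrain_tokens = inference_tokens * 2 / 6
balanced_overtraining = balanced_pretrain_tokens / chinchilla_tokens

print(rough_overtraining, balanced_overtraining)
```

The face-value estimate gives 100x. Equalizing FLOPs rather than token counts lowers it to roughly 33x, which is the kind of constant-factor slack the large error bars are meant to cover.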

1:33:02 Deducing Model Architecture from API Pricing

Pope demonstrates how much can be learned about a model's architecture from its public API pricing structure. He focuses on three pricing features: the increase in cost at longer context lengths, the difference between input and output token prices, and the pricing of cached versus uncached contexts.

For context length pricing (e.g., Gemini charging 50% more above 200k tokens), Pope shows that this inflection point reveals where the system transitions from being compute-bound to memory-bound. Using the roofline equations, he solves for the bytes per token in the KV cache at the crossover point. Assuming 200k context length and 100 billion active parameters, he calculates approximately 2 kilobytes per token. This is plausible for either dense attention with 8 KV heads and a head dimension of 128, or sparse attention with a sparsity factor. The fact that the pricing bump occurs at 200k suggests this is where memory bandwidth costs start to dominate compute costs.
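The crossover calculation can be sketched as below. It assumes the hardware ratio is about 300 FLOPs per byte of memory bandwidth, taking the constant quoted earlier at face value, and it ignores weight fetches on the grounds that they are amortized at large batch sizes. The per-layer helper assumes one byte per stored value. These choices and the function names are assumptions, so only the order of magnitude is meaningful.

```python
def kv_bytes_per_token_at_crossover(active_params, context_tokens, flops_per_byte):
    """KV cache bytes per context token at which KV reads cost as much as compute.

    Per sequence and per decode step:
      compute time = 2 * active_params / peak_flops
      KV read time = context_tokens * bytes_per_token / mem_bw
    Equating the two, with flops_per_byte = peak_flops / mem_bw.
    """
    return 2 * active_params / (context_tokens * flops_per_byte)


def dense_kv_bytes_per_layer(kv_heads, head_dim, bytes_per_value):
    """Keys plus values stored for one token in one attention layer."""
    return 2 * kv_heads * head_dim * bytes_per_value


crossover = kv_bytes_per_token_at_crossover(
    active_params=100e9, context_tokens=200_000, flops_per_byte=300)
per_layer = dense_kv_bytes_per_layer(kv_heads=8, head_dim=128, bytes_per_value=1)

print(round(crossover), per_layer)
```

With these inputs the crossover lands at a few kilobytes per token, the same order of magnitude as the roughly 2 kilobytes quoted above, and 8 KV heads of dimension 128 at one byte per value come to 2,048 bytes for a single layer. Whether the lecture's figure is per layer or summed over layers is not stated in this summary.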

For the difference between input (prefill) and output (decode) pricing, where output is typically 3-5x more expensive, Pope explains that the gap reveals how memory-bandwidth-bound the system is during decode. During prefill, many tokens of the same sequence are processed in parallel, so memory bandwidth costs are amortized across them, making the system compute-bound. During decode, each sequence produces only one new token per step, so memory bandwidth (loading weights and KV cache) dominates. A 5x price ratio suggests that memory bandwidth costs are about 5x the compute costs during decode, confirming that these systems are heavily memory-bandwidth-limited.

For cached context pricing (10x cheaper for cache hits), Pope analyzes the memory hierarchy tradeoff. He compares three ways to produce KV cache: rematerializing from scratch (costs compute), storing in HBM (costs expensive memory capacity), and storing in slower tiers like DDR or flash. The key insight is that the optimal storage tier depends on how long you'll hold the data. The 5-minute versus 1-hour cache durations in API pricing likely correspond to flash and spinning disk tiers respectively. The drain time (capacity divided by bandwidth) of flash is roughly minutes, while for spinning disk it's roughly hours—matching the pricing tiers. This reveals that API providers are using a multi-tier memory hierarchy, with the cheapest tier being surprisingly old technology (spinning disks).
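The drain-time comparison is a one-line formula. The device capacities and bandwidths below are assumed ballpark figures for a single HBM-equipped accelerator, one NVMe flash drive, and one hard disk; they are not numbers from the episode and real deployments will differ.

```python
def drain_time_s(capacity_bytes, bandwidth_bytes_per_s):
    """Time to read (or write) the whole device once at full bandwidth."""
    return capacity_bytes / bandwidth_bytes_per_s


# Assumed ballpark device figures, for illustration only.
TIERS = {
    "HBM":           (192e9, 8e12),   # ~192 GB at ~8 TB/s
    "flash":         (4e12, 7e9),     # ~4 TB at ~7 GB/s
    "spinning disk": (20e12, 250e6),  # ~20 TB at ~250 MB/s
}

for name, (capacity, bandwidth) in TIERS.items():
    seconds = drain_time_s(capacity, bandwidth)
    print(f"{name}: {seconds:.3g} s ({seconds / 60:.3g} min)")
```

With these assumptions, HBM drains in a fraction of a second, flash in roughly ten minutes, and a spinning disk in many hours. Data held for much less than a tier's drain time leaves its bandwidth as the bottleneck, and data held much longer leaves its capacity as the bottleneck, which is why the natural tier changes with how long a cache entry must live.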

2:04:02 Convergent Evolution Between Neural Networks and Cryptography

In the final segment, Pope explores the fascinating convergence between neural network architectures and cryptographic protocols. Both need to "jumble" information across all their inputs—cryptography to make structured data look random, neural networks to extract structure from seemingly random data. The mechanisms are similar, but the optimization goals are opposite.

Pope highlights that randomly initialized neural networks might actually serve as reasonable cryptographic ciphers, since random weights naturally scramble inputs. What makes neural networks trainable is gradient descent, which requires differentiable operations. Residual connections and layer normalization keep gradients well-behaved. In contrast, cryptographic ciphers operate over binary fields and deliberately avoid differentiability: the entire point is that small input differences produce large output differences (the "avalanche effect").

The most productive crossover has been the Feistel cipher construction, which allows building invertible functions from non-invertible components. This was imported into neural networks in the 2017-18 "RevNet" (reversible networks) paper. The construction takes two inputs, applies a function to one, adds it to the other, and swaps them—creating an invertible layer. For transformers, this means the entire network can be run backwards, eliminating the need to store activations during training. Instead of saving all intermediate activations to HBM (which is the largest memory footprint during training), you can rematerialize them on the fly during the backward pass.
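The construction is small enough to show in full. The sketch below implements a basic Feistel round on lists of integers, so the round trip is exact. The `mix` function is a stand-in for a neural network sub-layer and is deliberately non-invertible; the RevNet paper uses a closely related additive coupling rather than this exact round.

```python
def mix(values):
    """A non-invertible function: 3 and 4 both map to 2, for example."""
    return [(v * v) % 7 for v in values]


def feistel_forward(left, right, fn):
    """Apply fn to one half, add it to the other half, then swap the halves."""
    new_right = [l + f for l, f in zip(left, fn(right))]
    return right, new_right


def feistel_inverse(left, right, fn):
    """Undo one round using only the round's outputs and fn itself."""
    original_right = left
    original_left = [r - f for r, f in zip(right, fn(original_right))]
    return original_left, original_right


def run_forward(left, right, fn, rounds):
    for _ in range(rounds):
        left, right = feistel_forward(left, right, fn)
    return left, right


def run_backward(left, right, fn, rounds):
    for _ in range(rounds):
        left, right = feistel_inverse(left, right, fn)
    return left, right


x_left, x_right = [1, 2, 3], [4, 5, 6]
y_left, y_right = run_forward(x_left, x_right, mix, rounds=4)
recovered = run_backward(y_left, y_right, mix, rounds=4)
print(recovered == (x_left, x_right))  # True
```

The inverse never needs to invert `mix`; it only re-evaluates it on a value that is still available. That is what allows the activations of a reversible network to be recomputed during the backward pass instead of stored.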

This is a compute-for-memory tradeoff: you spend more computation to save memory. Pope notes this is the opposite of the KV cache tradeoff (spending memory to save compute), and that the KV cache tradeoff is generally more profitable given current hardware economics. The Feistel-inspired reversible architecture remains more of an interesting idea than a practical necessity, but it demonstrates how deeply the mathematical structures of these two fields are connected.

Conclusion

This episode matters because it demonstrates that the seemingly opaque world of frontier AI—model sizes, training data volumes, architecture choices, and pricing strategies—can be understood through a remarkably small set of equations and physical constraints. Pope's blackboard lecture reveals that the same roofline analysis that explains why fast mode costs more also explains why models are trained on 100x more data than Chinchilla-optimal, why context lengths have plateaued around 200k tokens, and why GPU racks are getting bigger rather than faster per chip. The conversation leaves the listener with a powerful toolkit for reasoning about AI progress: given hardware specs and API prices, you can deduce what the labs are actually doing, and given the fundamental constraints of memory bandwidth and compute throughput, you can predict where the bottlenecks will emerge next.

Key takeaways

  • The optimal batch size for inference is approximately 300 times the sparsity ratio (total/active parameters), independent of absolute model size, and this determines the latency-cost tradeoff that underlies "fast mode" pricing.
  • Mixture of experts models are fundamentally constrained by the physical limits of GPU racks—cable density, power, and cooling—which is why larger scale-up domains (from 8 to 72 to 500+ GPUs) are the primary driver of progress.
  • Pipeline parallelism solves memory capacity problems but not memory bandwidth problems, and its benefits for inference are marginal given modern racks with terabytes of HBM.
  • Frontier models are likely overtrained by roughly 100x beyond Chinchilla-optimal because the inference cost savings from smaller models outweigh the extra training cost, especially when RL generation is factored in.
  • API pricing reveals model architecture: the context length where prices increase shows the memory-compute crossover point, the input/output price ratio reveals how memory-bandwidth-bound the system is, and cache pricing tiers correspond to different memory technologies (HBM, flash, spinning disk).
  • The convergent evolution between neural networks and cryptography is real but limited—both use mixing structures, but neural networks require differentiability while ciphers deliberately avoid it, though the Feistel construction has been successfully imported for reversible networks.
  • The memory wall (limited bandwidth and capacity of HBM) is the fundamental constraint on longer context lengths, and sparse attention provides only partial relief before quality degrades.