
Some thoughts on the Sutton interview
Dwarkesh Podcast / Dwarkesh Patel
Overview
In this solo episode of the Dwarkesh Podcast, the host reflects deeply on Richard Sutton's worldview following a previous interview, offering a nuanced steelman of Sutton's position from his famous "Bitter Lesson" essay before articulating where he disagrees. The central tension is whether current large language models (LLMs) trained primarily on human data represent a dead-end path to artificial general intelligence, or whether imitation learning from humans is a necessary and complementary stepping stone to true reinforcement learning from the environment. Dwarkesh argues that Sutton's conceptual dichotomies—imitation versus reinforcement learning, world models versus human models, batch training versus continual learning—are not as mutually exclusive as Sutton suggests, and that current LLMs may already be on a trajectory that converges with Sutton's vision, even if they don't start there.
---
The Steelman of Sutton's Position
Dwarkesh begins by reconstructing what he believes is the strongest version of Richard Sutton's argument, acknowledging that his understanding has deepened since the interview itself. Sutton's "Bitter Lesson" is not a crude argument for throwing unlimited compute at problems, but rather a precise claim about which techniques most effectively and scalably leverage compute. The core critique is that current LLMs are extraordinarily inefficient in how they use compute: most of the compute spent on an LLM occurs during deployment (inference), yet the model learns nothing during this entire period—it only updates during a separate training phase.
This training phase is itself highly sample-inefficient: the models are trained on the equivalent of tens of thousands of years of human experience. Moreover, everything they learn comes from human-generated data. Even reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) operate within "human-furnished playgrounds" that teach prescribed skills rather than allowing organic, self-directed engagement with the world. Because human data is an inelastic, hard-to-scale resource, Sutton's view is that this approach cannot scale.
Furthermore, Sutton argues that LLMs do not build true world models—models that predict how the environment changes in response to different actions. Instead, they build models of what a human would say next, relying on human-derived concepts. Dwarkesh illustrates this with a thought experiment: if you trained an LLM on all data up to the year 1900, it likely could not derive relativity from scratch, because it lacks the causal understanding of the physical world that comes from interactive experience. The deeper architectural limitation is that LLMs cannot learn on the job; they require a separate training phase. Sutton envisions a future architecture that enables continual learning, where agents learn on the fly like humans and animals, rendering the current paradigm obsolete.
---
Imitation Learning Is Continuous with and Complementary to RL
Dwarkesh's primary disagreement with Sutton is that the concepts used to distinguish LLMs from true intelligence are not actually dichotomous. He argues that imitation learning and reinforcement learning exist on a continuum and can complement each other. The key question he posed to Sutton during the interview was whether pre-trained LLMs can serve as a useful prior upon which reinforcement learning can accumulate toward AGI.
To illustrate this, Dwarkesh invokes Ilya Sutskever's analogy comparing pre-training data to fossil fuels. Just because fossil fuels are non-renewable does not mean civilization took a dead-end path by using them—they were essential for transitioning from water wheels to solar panels and fusion. Similarly, human data may be a necessary intermediate step. The comparison between AlphaGo (initialized with human games) and AlphaZero (bootstrapped from scratch) is instructive: AlphaZero was better, but AlphaGo was still superhuman. The human data was not actively detrimental; it just became unhelpful at sufficient scale, and AlphaZero also used significantly more compute.
Dwarkesh points out that human cultural learning itself is far more analogous to imitation learning than to reinforcement learning from scratch. Humans do not invent their languages, legal systems, or the technologies in their phones; these are accumulated by thousands or millions of predecessors and passed down. While human imitation learning is not literally next-token prediction, neither is human learning perfectly described by any single machine learning regime. He suggests that supervised learning may be to human cultural learning what planes are to birds: not identical, but serving an analogous function.
Crucially, Dwarkesh argues that imitation learning and RL are not categorically different: imitation learning is simply "short-horizon RL" in which the episode is one token long. The LLM makes a conjecture about the next token based on its understanding of the world and receives a reward proportional to how well it predicted that token. The more relevant question is whether imitation learning can help models learn better from ground truth, and Dwarkesh's answer is clearly yes. After RL was applied to pre-trained base models, those models achieved gold-medal results at the International Mathematical Olympiad and built working applications from scratch, both tasks that require ground-truth verification. In his view, these achievements would not have been possible from scratch, without the prior provided by human data.
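The "one-token episode" framing can be made concrete with a small sketch. It assumes a softmax policy over a toy three-token vocabulary and a 0/1 reward for sampling the observed next token; that reward, the function names, and the logits are illustrative assumptions, not something specified in the episode. Under those assumptions, the expected REINFORCE gradient equals the gradient of the usual log-likelihood objective scaled by the probability the model already assigns to the correct token, so the two updates point in the same direction.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_likelihood_grad(logits, target):
    """Gradient of log pi(target) w.r.t. the logits: onehot(target) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == target else 0.0) - p for i, p in enumerate(probs)]


def expected_reinforce_grad(logits, target):
    """Expected REINFORCE gradient for a one-token episode.

    Assumed reward: 1 if the sampled token equals `target`, else 0.
    """
    probs = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for action, p_action in enumerate(probs):
        reward = 1.0 if action == target else 0.0
        # Score function: gradient of log pi(action) w.r.t. the logits.
        score = [(1.0 if i == action else 0.0) - probs[i] for i in range(n)]
        for i in range(n):
            grad[i] += p_action * reward * score[i]
    return grad


if __name__ == "__main__":
    logits, target = [2.0, 0.5, -1.0], 1  # hypothetical toy values
    print("imitation gradient:", log_likelihood_grad(logits, target))
    print("RL gradient:       ", expected_reinforce_grad(logits, target))
    print("pi(target):        ", softmax(logits)[target])
```

The scaling factor is one way the two regimes differ in practice: when the model assigns the correct token a low probability, the sampled-reward signal is weak, whereas the supervised signal is not.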
---
World Models vs. Human Models: A Semantic Debate
Dwarkesh pushes back on Sutton's claim that LLMs lack true world models, arguing that the distinction between a "model of humans" and a "model of the world" may be less important than Sutton suggests. What matters is whether the model of humans helps the system start learning from ground truth—that is, whether it can become a true world model through subsequent RL. He compares this to pasteurizing milk: you boil it as an intermediate step even though the final goal is to serve it cold.
He offers a pragmatic argument: LLMs clearly develop deep representations of the world because their training process incentivizes them to do so. He personally uses LLMs to learn about biology, AI, and history, and they demonstrate remarkable flexibility and coherence. The objection that LLMs are not specifically trained to model how their actions affect the world is, in Dwarkesh's view, a definitional move that defines "world model" by the process used to build it rather than by the capabilities it implies. If a system can answer questions about causal relationships and predict outcomes across diverse domains, refusing to call its representations a world model seems like semantic hair-splitting.
---
Continual Learning and the Context Window
Dwarkesh acknowledges that continual learning—the ability to learn from the environment in a high-throughput way—is a genuine gap in current LLMs. An LLM trained with RL on outcome-based rewards learns roughly one bit per episode, where an episode might be tens of thousands of tokens long. Humans and animals clearly extract far more information from each interaction than just the final reward signal. Conceptually, animals are learning to model the world through observations, with an outer-loop RL process incentivizing some other learning system to extract maximum signal from the environment. In Sutton's OAK architecture, this is called the transition model.
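The "one bit per episode" figure follows from a short calculation: a pass/fail outcome is a single binary variable, so it can carry at most one bit, and less when the success rate is far from 50%. The sketch below spreads that bit over the tokens in an episode. The 20,000-token episode length and the success rates are hypothetical values chosen for illustration.

```python
import math


def binary_entropy_bits(p_success):
    """Entropy, in bits, of a pass/fail outcome with the given success rate."""
    if p_success in (0.0, 1.0):
        return 0.0  # a certain outcome carries no information
    q = 1.0 - p_success
    return -(p_success * math.log2(p_success) + q * math.log2(q))


def outcome_bits_per_token(p_success, episode_tokens):
    """Upper bound on reward information per generated token."""
    return binary_entropy_bits(p_success) / episode_tokens


if __name__ == "__main__":
    episode_tokens = 20_000  # hypothetical episode length
    for p in (0.5, 0.9, 0.99):  # hypothetical success rates
        print(p, binary_entropy_bits(p), outcome_bits_per_token(p, episode_tokens))
```

At an even success rate this gives 1/20,000 of a bit per token, which is the sense in which outcome-only RL extracts little signal from each interaction.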
If one tried to shoehorn this kind of high-throughput learning from observations into modern LLMs, the naive approach would be to fine-tune on all observed tokens. Dwarkesh reports from conversations with researcher friends that this straightforward method does not work well in practice, but he suggests there may be relatively simple workarounds. For example, one could make supervised fine-tuning a tool call for the model: the outer-loop RL would then incentivize the model to teach itself, using supervised learning, to solve problems that don't fit in the context window.
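One way to picture the tool-call idea is with a deliberately tiny stand-in. In the sketch below, the "model" is a single weight fitted by gradient descent, the "context window" is a two-item buffer, and `sft` is a tool the agent can invoke to write what is in context into its weight. All names (`ToyAgent`, `sft`, `observe`) are hypothetical, the outer-loop RL that would decide when to call the tool is not modeled, and nothing here reflects how such a tool would be built for a real LLM.

```python
from collections import deque


class ToyAgent:
    """Toy stand-in: one trainable weight, a tiny context buffer, one tool."""

    def __init__(self, context_size=2, learning_rate=0.1):
        self.context = deque(maxlen=context_size)  # short-lived "context window"
        self.weight = 0.0                          # persistent parameter: y ~ weight * x
        self.learning_rate = learning_rate
        self.tools = {"sft": self.sft}             # hypothetical tool registry

    def observe(self, x, y):
        """Add an (input, target) pair to context; the oldest pair is evicted when full."""
        self.context.append((x, y))

    def sft(self, steps=50):
        """Tool call: gradient descent on squared error over the pairs now in context."""
        for _ in range(steps):
            for x, y in self.context:
                error = self.weight * x - y
                self.weight -= self.learning_rate * error * x

    def predict(self, x):
        return self.weight * x


if __name__ == "__main__":
    agent = ToyAgent()
    agent.observe(1.0, 3.0)
    agent.observe(2.0, 6.0)
    agent.tools["sft"]()        # the agent consolidates its context into the weight
    agent.context.clear()       # the context is gone
    print(agent.predict(4.0))   # the learned relationship is still available
```

The design point is only that consolidation becomes an action the agent takes during deployment, rather than a separate training phase run on its behalf.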
Dwarkesh is genuinely agnostic about how well such techniques will work, but he would not be surprised if they effectively replicate continual learning. His reason is that models already demonstrate something resembling human continual learning within their context windows. In-context learning emerged spontaneously from the training incentive to process long sequences. If information could flow across spans longer than the context limit, models might meta-learn the same flexibility they already show in-context. This suggests that the gap between current LLMs and true continual learning may be narrower than Sutton's critique implies.
---
Concluding Thoughts: Evolution's Path vs. Our Path
Dwarkesh offers a final synthesis that frames the debate in evolutionary terms. Evolution performs meta-RL to create an RL agent, and that agent can then selectively do imitation learning. With LLMs, we are going in the opposite direction: we first build a base model that does pure imitation learning, then hope that enough RL will transform it into a coherent agent with goals and self-awareness. This might not work, but Dwarkesh argues that Sutton's first-principles arguments (for example, that LLMs lack true world models) prove less than Sutton thinks and are not strictly accurate for today's models, which are already undergoing substantial RL on ground truth.
However, Dwarkesh gives Sutton his due: even if Sutton's Platonic ideal does not end up being the path to the first AGI, his first-principles critique identifies genuine gaps that are so pervasive in the current paradigm that we barely notice them. Sutton's decades-long perspective makes these gaps obvious: the lack of continual learning, the abysmal sample efficiency, the dependence on exhaustible human data. Dwarkesh's prediction is that LLMs will likely reach AGI first, but the successor systems they build will almost certainly be based on Sutton's vision—systems that learn continuously from interaction with the environment rather than from static human data.
---
Conclusion
This episode matters because it crystallizes one of the most important debates in contemporary AI research: whether the current LLM paradigm is on the right track or fundamentally misguided. Dwarkesh does not simply defend the status quo; he takes Sutton's critique seriously, steelmans it, and then offers a nuanced counterargument that acknowledges genuine weaknesses while rejecting the claim that they are fatal. The listener comes away with a clearer understanding of both positions, the specific technical disagreements (sample efficiency, world models, continual learning), and the pragmatic question of whether human data is a crutch or a ladder. The episode's lasting impression is that the path to AGI may be messier and more iterative than either pure Suttonians or pure scaling enthusiasts would like to admit.
---
Key takeaways
- Sutton's "Bitter Lesson" is not about using as much compute as possible, but about finding techniques that most scalably leverage compute—and current LLMs fail this test because they learn nothing during deployment and are sample-inefficient during training.
- Dwarkesh argues that imitation learning and reinforcement learning are continuous, not dichotomous: imitation learning is simply short-horizon RL, and human data can serve as a necessary prior for subsequent RL from ground truth.
- The distinction between "world models" and "models of humans" may be semantic: what matters is whether the human-data prior enables learning from ground truth, which current LLMs demonstrably do after RL fine-tuning.
- Continual learning is a genuine gap in current LLMs, but in-context learning suggests that models may already have the capacity for it if information could flow across context windows.
- Ilya Sutskever's "fossil fuels" analogy for pre-training data captures why human data may be essential as an intermediate step even if it is not the final destination.
- Evolution builds RL agents that can do imitation learning; we are building imitation learners and hoping RL will make them agents—the opposite direction, but not necessarily doomed.
- Sutton's critique identifies real blind spots in the current paradigm (sample efficiency, lack of continual learning, dependence on human data) that are invisible to those immersed in it.
- Dwarkesh predicts LLMs will reach AGI first, but the successor systems they build will likely follow Sutton's vision of continual learning from environmental interaction.