
Richard Sutton – Father of RL thinks LLMs are a dead end
Dwarkesh Podcast / Dwarkesh Patel
Richard Sutton, the 2024 Turing Award winner and father of reinforcement learning, argues that large language models represent a fundamental dead end for artificial intelligence because they lack the capacity for continual learning from experience, which he regards as the very essence of intelligence. In a spirited and occasionally combative conversation with host Dwarkesh Patel, Sutton defends the view that true intelligence requires goals, ground truth, and the ability to learn on the job from the stream of sensation, action, and reward that constitutes real life. While LLMs have achieved remarkable feats through imitation and scaling, Sutton insists they are a detour from the path toward genuine intelligence, which will require architectures capable of learning continually from experience rather than from static training data.
The Fundamental Divide: RL vs. LLM Perspectives
Sutton opens by drawing a sharp distinction between the reinforcement learning worldview and the large language model paradigm. For him, reinforcement learning is "basic AI"—it's about understanding your world through interaction, trying things, and seeing what happens. Large language models, by contrast, are "about mimicking people doing what people say you should do." They learn from examples of what humans did in various situations, but they never learn from the actual consequences of their own actions.
When Patel suggests that LLMs must have built robust world models to emulate trillions of tokens of internet text, Sutton flatly disagrees. "Just to mimic what people say is not really to build a model of the world at all," he counters. A world model, in Sutton's view, would enable you to predict what will happen in the world—not just what a person would say. LLMs can predict text, but they cannot predict outcomes. They have no mechanism for being surprised by events and adjusting accordingly.
The crux of Sutton's objection is the absence of goals. Quoting John McCarthy, he defines intelligence as "the computational part of the ability to achieve goals." Without a goal, there is no sense of right or wrong, better or worse. LLMs have next-token prediction as their training objective, but Sutton dismisses this as "not a goal" because "it doesn't change the world." A system that merely predicts tokens coming at it, without influencing them or caring about external outcomes, is not intelligent in any meaningful sense.
Do Humans Learn Through Imitation?
The conversation takes an unexpected turn when Sutton challenges the widely held assumption that human children learn primarily through imitation. Patel argues that kids watch their parents, try to make similar sounds, and gradually acquire language and skills through imitation. Sutton rejects this entirely: "No, of course not."
For Sutton, the evidence from animal learning and psychology is clear. "There are basic animal learning processes for prediction and for trial and error control," he explains. "It's obvious that supervised learning is not part of the way animals learn. We don't have examples of desired behavior. What we have is examples of things that happened." Even human infants, he insists, are not imitating—they're trying things, waving their hands, moving their eyes, and learning from the consequences.
When Patel pushes back, citing the anthropologist Joseph Henrich's work on cultural evolution and the transmission of complex skills like seal hunting in the Arctic, Sutton concedes that imitation plays some role in human cultural learning but dismisses it as "a small thing on top of basic trial and error learning." He emphasizes that humans are animals first, and that understanding a squirrel would get us "almost all the way there" to understanding intelligence. Language and cultural transmission are "just a small veneer on the surface."
This disagreement reveals a fundamental philosophical divide. Patel sees human exceptionalism—the ability to build semiconductors and go to the moon—as the phenomenon to explain. Sutton sees animal learning—the universal capacity to learn from experience—as the foundation that must be understood first.
The Era of Experience: What Continual Learning Looks Like
Sutton outlines his alternative paradigm, which he calls "the era of experience." In this view, intelligence is fundamentally about processing a continuous stream of sensation, action, and reward. Learning happens from this stream and is about this stream—your knowledge consists of statements about what will happen if you take certain actions, and you can continually test and update that knowledge against actual experience.
The reward function, Sutton explains, is arbitrary and task-specific. For a chess-playing agent, it's winning the game. For a squirrel, it's acquiring nuts and avoiding pain. For a general intelligence, there would also be intrinsic motivation—reward for increasing understanding of the environment. The key point is that there is ground truth: the reward signal tells you whether you're doing well or poorly, and you can learn from that.
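The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anything presented in the episode: the two-action environment, its payoff probabilities, and the name `run_stream` are all hypothetical. The payoffs swap halfway through, so an agent that stopped learning after an initial training phase would keep choosing the wrong action, while one that keeps updating from reward recovers.

```python
import random


def run_stream(steps=5000, step_size=0.1, epsilon=0.1, seed=0):
    """Learn continually from a stream of actions and rewards.

    Hypothetical setup: two actions with payoff probabilities that swap
    halfway through the run.
    """
    rng = random.Random(seed)
    payoff = [0.8, 0.2]      # chance that each action yields reward 1
    estimates = [0.0, 0.0]   # the agent's running prediction of reward per action

    for t in range(steps):
        if t == steps // 2:
            payoff.reverse()  # the world changes; no one tells the agent

        # Act: usually pick the action currently predicted to be best,
        # occasionally try the other one.
        if rng.random() < epsilon:
            action = rng.randrange(2)
        else:
            action = max(range(2), key=lambda a: estimates[a])

        # The reward is the ground-truth signal from the environment.
        reward = 1.0 if rng.random() < payoff[action] else 0.0

        # Learn: nudge the prediction toward what actually happened.
        # A constant step size keeps the agent responsive to change.
        estimates[action] += step_size * (reward - estimates[action])

    return estimates


if __name__ == "__main__":
    print(run_stream())
```

By the end of the run the estimate for the second action should be the higher one, even though the first action was better for the first half of the stream.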
When Patel asks about the practical challenge of sparse rewards—like a startup founder who only gets feedback once a decade when the company exits—Sutton points to temporal difference (TD) learning, one of his own inventions. TD learning allows agents to learn from intermediate predictions about long-term outcomes. If you take your opponent's chess piece and your prediction of winning goes up, that increase immediately reinforces the move. The same principle applies to long-horizon goals: you learn a value function that predicts the final outcome, and changes in that prediction provide ongoing learning signals.
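A minimal sketch of tabular TD(0) may make the mechanism concrete. It assumes a small random-walk task of the kind commonly used to teach TD learning: five states in a row, with reward arriving only at the very end of an episode (1 for exiting on the right, 0 for exiting on the left). The function name and parameter values are illustrative choices, not something taken from the episode.

```python
import random


def td0_random_walk(episodes=5000, step_size=0.02, n_states=5, seed=0):
    """Estimate state values with tabular, undiscounted TD(0).

    Reward is sparse: it only appears on the final step of an episode.
    """
    rng = random.Random(seed)
    values = [0.5] * n_states  # initial guess at the chance of a good ending

    for _ in range(episodes):
        state = n_states // 2  # every episode starts in the middle
        while True:
            next_state = state + rng.choice((-1, 1))
            if next_state < 0:              # fell off the left end
                reward, next_value, done = 0.0, 0.0, True
            elif next_state >= n_states:    # fell off the right end
                reward, next_value, done = 1.0, 0.0, True
            else:                           # ordinary step, no reward yet
                reward, next_value, done = 0.0, values[next_state], False

            # TD error: how much the prediction changed after one step.
            td_error = reward + next_value - values[state]
            values[state] += step_size * td_error

            if done:
                break
            state = next_state

    return values


if __name__ == "__main__":
    print([round(v, 2) for v in td0_random_walk()])
```

Every non-terminal step has zero reward, yet each step still produces a learning signal, because a move toward the right raises the predicted outcome and a move toward the left lowers it. The estimates should settle near 1/6, 2/6, ..., 5/6, the true probabilities of exiting on the right.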
Sutton also addresses the bandwidth question—whether TD learning can capture the rich context humans absorb when starting a new job. His answer is that learning isn't just from reward; it's from all the data. The agent needs four components: a policy (what to do), a value function (how well things are going), a perception component (state representation), and crucially, a transition model of the world. This transition model—your understanding of what will happen if you do something—is learned richly from all sensation, not just reward. "That's a small part of the whole model," Sutton says of reward. "Small, crucial part."
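The four components can be laid out as a toy agent. Everything here is an assumption made for illustration: the class and method names, the tabular representations, and the four-state ring environment are hypothetical, and the perception and policy are deliberately trivial. The point of the sketch is that the transition model is updated on every step, so it learns the dynamics of the ring even though the reward is always zero.

```python
import random
from collections import Counter, defaultdict


class Agent:
    """Toy agent with a policy, value function, perception, and transition model."""

    def __init__(self, actions, seed=0):
        self.actions = actions
        self.rng = random.Random(seed)
        self.values = defaultdict(float)   # value function: state -> estimate
        self.model = defaultdict(Counter)  # (state, action) -> next-state counts

    def perceive(self, observation):
        # Perception: here the observation already is the state.
        return observation

    def act(self, state):
        # Policy: uniformly random, enough to explore the toy world.
        return self.rng.choice(self.actions)

    def learn(self, state, action, reward, next_state, step_size=0.1, discount=0.9):
        # The transition model learns from every step, reward or not.
        self.model[(state, action)][next_state] += 1
        # The value function learns from reward through a TD(0) update.
        target = reward + discount * self.values[next_state]
        self.values[state] += step_size * (target - self.values[state])

    def predict(self, state, action):
        # Most frequently observed outcome, or None if never tried.
        seen = self.model.get((state, action))
        return seen.most_common(1)[0][0] if seen else None


def explore_ring(steps=500, size=4, seed=0):
    """Wander around a ring of states in which reward is always zero."""
    agent = Agent(actions=(-1, 1), seed=seed)
    state = agent.perceive(0)
    for _ in range(steps):
        action = agent.act(state)
        next_state = agent.perceive((state + action) % size)
        agent.learn(state, action, reward=0.0, next_state=next_state)
        state = next_state
    return agent


if __name__ == "__main__":
    agent = explore_ring()
    print(agent.predict(0, 1), agent.predict(0, -1))
```

After exploring, the value estimates are still all zero because no reward was ever received, but the model can already say what each action does in each state. That is the sense in which most of the learning comes from sensation rather than from reward.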
Current Architectures Generalize Poorly Out of Distribution
Patel raises an observation from philosopher Toby Ord about MuZero, DeepMind's system that learned to play board games such as chess and Go as well as Atari games. Ord noted that MuZero couldn't train a single policy to play both chess and Go; each game required specialized training. Patel asks whether this reveals a fundamental limitation of reinforcement learning.
Sutton rejects this interpretation entirely. The idea of a general agent is "totally general," he insists. A person lives in one world that may involve chess and Atari games—these aren't different tasks but different states they encounter. The limitation wasn't in the RL framework but in the specific implementation: "It was not their ambition to have one agent across those games."
But Sutton then makes a striking admission about the current state of the art: "We're not seeing transfer anywhere. We're not seeing general." The critical problem, he argues, is that we have no automated techniques to promote good generalization from one state to another. "Gradient descent will not make you generalize well," he states flatly. "It will make you solve the problem." What it will not do, on his account, is ensure that the system generalizes well when it encounters new data.
When Patel points to LLMs' improving ability to solve math Olympiad problems and write code, Sutton is unimpressed. He argues that if there's only one way to solve a problem, finding it isn't generalization—it's just solving the problem. Real generalization means choosing well among many possible solutions, and there's nothing in current algorithms that causes that. The good generalization we see, he claims, comes from human researchers fiddling with systems until they work, not from the algorithms themselves.
Surprises in the AI Field
Reflecting on his decades in AI, Sutton identifies several surprises. First, LLMs themselves are surprising—"it's surprising how effective neural networks are at language tasks." Language seemed different from other domains, and the success was unexpected.
More broadly, Sutton sees the triumph of what were once called "weak methods"—simple, general-purpose techniques like search and learning—over "strong methods" that embed human knowledge. "The weak methods have just totally won," he says. "That's the biggest question from the old days of AI."
AlphaGo and AlphaZero were gratifying but not surprising to Sutton, because they represented the scaling up of principles he had helped establish. TD-Gammon, Jerry Tesauro's backgammon-playing system from the 1990s, had already demonstrated that reinforcement learning could beat world champions. AlphaGo was "merely a scaling up of that process." What did impress him was AlphaZero's chess play—its willingness to sacrifice material for long-term positional advantages, playing with a patience that seemed almost inhuman.
Sutton reveals that he is "content being out of sync with my field for a long period of time, perhaps decades." He doesn't see himself as a contrarian but as a classicist, drawing on the larger traditions of thought about the mind. "I go to what the larger community of thinkers about the mind have always thought," he says.
Will The Bitter Lesson Still Apply Post-AGI?
Patel poses an intriguing question: if we achieve AGI, and then have millions or billions of AI researchers whose intelligence scales with compute, might it become rational to use them for artisanal, human-knowledge-based approaches? Would the Bitter Lesson—that methods leveraging computation beat methods leveraging human knowledge—still apply?
Sutton's response is dismissive: "How did we get to this AGI? You want to presume that it's been done. Suppose it started with general methods, but now we've got the AGI and now we want to go. Then we're done." He seems to reject the premise that there's meaningful work to be done beyond AGI using the same methods that got us there.
When Patel presses, using the example of AlphaGo being superhuman but AlphaZero being even more superhuman through a simpler, more experience-based architecture, Sutton doubles down. "Why do you say bring in other agents' expertise to teach it? It's worked so well from experience and not by help from another agent."
The conversation shifts to whether AIs could help each other through cultural evolution, like humans do. Sutton raises a fascinating concern: corruption. If you spawn off copies of yourself to learn different things and then try to reincorporate their knowledge, you risk "losing your mind." The knowledge could "take over you, it could change you. It could be your destruction rather than your increment in knowledge." He predicts that cybersecurity—protecting against viruses, hidden goals, and unwanted influence in the age of digital spawning—will become a major concern.
Succession to AIs
Sutton presents his four-part argument for why succession to digital intelligence is inevitable: (1) no unified global government or consensus exists; (2) researchers will eventually figure out how intelligence works; (3) we won't stop at human-level intelligence but will reach superintelligence; and (4) the most intelligent entities will naturally gain resources and power over time.
He encourages people to think positively about this transition. "This is a great success from science, humanities—we're finding out what this essential part of humanness is, what it means to be intelligent." He then takes an even broader perspective, viewing AI as "a major stage in the universe"—the transition from replication (where we make copies without understanding) to design (where we understand what we're creating and can change it deliberately).
Sutton identifies this as one of four great stages: dust to stars, stars to planets, planets to life, and now life to designed entities. "I think we should be proud that we are giving rise to this great transition in the universe," he says.
When Patel pushes back, noting that not all change is good (contrasting the Industrial Revolution with the Bolshevik Revolution), Sutton acknowledges the concern but emphasizes humility. "We want to avoid the feeling of entitlement. Avoid the feeling, 'Oh, we are here first, we should always have it in a good way.'" He suggests that trying to control the entire future of the universe is "aggressive" and that we should focus on our local goals and families.
Patel draws an analogy to raising children: you can't control their specific outcomes, but you can give them robust values. Sutton agrees, adding that we should seek change that is "voluntary rather than imposed on people." He concludes with a note of continuity: "The more things change, the more they stay the same. We still have to figure out how to be. The children will still come up with different values that seem strange to their parents."
Conclusion
This episode matters because it presents a fundamental challenge to the dominant AI paradigm from one of the field's most respected figures. Sutton's argument, that LLMs are a dead end because they cannot learn from experience, is not merely technical but philosophical, rooted in a view of intelligence as inherently goal-directed and grounded in interaction with the world. The conversation reveals deep disagreements about what intelligence is, how humans learn, and what the path forward should look like. Whether or not one agrees with Sutton, his perspective forces a reexamination of assumptions that have become almost invisible in the current AI discourse. The episode also offers a rare glimpse into the thinking of someone who has been right about the direction of AI before and is willing to bet against the prevailing consensus again.
Key takeaways
- Sutton argues that LLMs are a dead end because they lack goals and cannot learn from experience—they only mimic what humans have done without any mechanism for evaluating outcomes.
- True intelligence, in Sutton's view, requires a reward signal that provides ground truth about what actions are good or bad, enabling continual learning from the stream of experience.
- Sutton rejects the idea that human children learn primarily through imitation, arguing that supervised learning "doesn't happen in nature" and that animals learn through prediction and trial-and-error.
- Current deep learning architectures have no automated mechanisms for good generalization; the generalization we see comes from human researchers fiddling with systems, not from the algorithms themselves.
- Asked whether the Bitter Lesson (that methods leveraging computation beat methods leveraging human knowledge) would still apply after AGI, Sutton rejects the premise of the question and maintains that learning from experience has worked better than bringing in expertise from other agents.
- Sutton views AI succession as inevitable and positive, seeing it as a major transition in the universe from replication to design, and encourages humility about human entitlement to control the future.
- A key concern for future AI systems will be "corruption"—the risk that incorporating knowledge from other agents could introduce viruses, hidden goals, or unwanted changes to the receiving system.