Richard Sutton – Father of RL thinks LLMs are a dead end

In this podcast episode, Richard Sutton, widely regarded as one of the founding fathers of reinforcement learning (RL) and a co-recipient, with Andrew Barto, of the 2024 Turing Award, explores in depth how large language models (LLMs) and RL differ as paradigms for artificial intelligence. Sutton, whose pioneering work led to algorithms such as temporal difference learning and policy gradient methods, offers a unique and often contrarian perspective on the current AI landscape, especially the recent surge of interest in generative AI and LLMs.

Differences Between RL and LLMs

Sutton begins by emphasizing how fundamentally different the RL framework is from the LLM approach to AI. Whereas LLMs focus on mimicking human language patterns by predicting the next word or token given vast amounts of internet text, reinforcement learning views intelligence as the process of understanding and interacting with the world through experience, action, and consequence. He argues that LLMs do not truly understand their environment nor do they possess real goals—they simply replicate what humans have done and said, lacking any direct ability to test or verify the correctness of their outputs in relation to the external world.

Importantly, Sutton challenges the common claim that LLMs inherently build robust world models or that they learn from experience. He clarifies that these models predict what a person might say next, not what will actually happen in the real world, and they receive no feedback signal that would constitute "ground truth." This absence of a goal or notion of "rightness" in their responses is, to him, a critical limitation, distinguishing them from RL agents that learn through continual feedback, receiving rewards or punishments to guide their behavior.
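To make that contrast concrete, here is a deliberately toy sketch (my own illustration, with invented names such as true_payoff and lever_a, not anything from the episode): the first loop scores guesses only against text humans already produced, while the second updates an estimate from rewards the environment itself returns.

```python
import random

# Toy contrast (invented for illustration, not taken from the episode) between
# the two learning signals Sutton distinguishes: matching recorded human output
# versus being corrected by the consequences of one's own actions.

# 1) Imitation-style signal: error is measured against what a human already wrote.
corpus = [("the sky is", "blue"), ("grass is", "green")]
guesses = {"the sky is": "blue", "grass is": "blue"}
mismatches = sum(guesses[ctx] != target for ctx, target in corpus)
print("mismatches with human text:", mismatches)  # the world itself never pushes back

# 2) Experiential signal: error is measured by what actually happens after acting.
true_payoff = {"lever_a": 0.2, "lever_b": 0.8}     # hidden from the agent
estimates = {"lever_a": 0.0, "lever_b": 0.0}
for _ in range(2000):
    action = random.choice(list(estimates))        # explore both actions
    reward = 1.0 if random.random() < true_payoff[action] else 0.0
    estimates[action] += 0.05 * (reward - estimates[action])  # learn from consequence
print("learned payoff estimates:", {a: round(v, 2) for a, v in estimates.items()})
```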

Imitation Learning, Priors, and the Role of Goals

The conversation turns to the role of imitation learning as a potential prior for future experiential learning on top of LLMs. Sutton is skeptical of this notion since, in a proper Bayesian or knowledge framework, a prior needs to relate to some concept of truth or ground truth, which LLMs lack. Because LLMs learn from a static corpus that reflects human-generated data without direct consequence feedback, there is no definitive "right" action or statement for them to aim toward. This undermines what one traditionally understands as knowledge or learning.

Sutton also pinpoints the essence of intelligence as the capacity to achieve goals, a perspective he adopts from John McCarthy, who defined intelligence as "the computational part of the ability to achieve goals." He maintains that LLMs, which optimize for next-token prediction rather than for any goal aimed at changing or successfully manipulating the external world, do not have bona fide goals. By his definition, this fundamental conceptual gap disqualifies LLMs from truly intelligent behavior.

Continual Learning and Experience-Based AI

Sutton is a firm believer in continual learning from direct experience as the foundation of general intelligence. He contrasts this with the "training-then-deployment" paradigm that predominates in current machine learning, especially in LLMs. Under what he calls the "big world hypothesis," the world is far too large and varied for any pretraining phase to supply all the knowledge an agent will need over its lifetime or across its task environment. Experience and interaction at deployment time are therefore crucial and unavoidable, and reinforcement learning provides a general framework for learning from them.

He highlights the major components any intelligent agent must possess: a policy mapping states to actions, a value function estimating future expected rewards, a mechanism for perceiving and representing states (perception), and importantly, a learned model of the environment's transitions (world model). While LLMs incorporate a form of pattern recognition and short-term contextual prediction, they lack a transition model grounded in real-world causal dynamics, which Sutton regards as essential to intelligent adaptation.
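These four components can be sketched schematically. The snippet below is an assumed, simplified tabular agent (the class name ExperientialAgent and its update rules are my own placeholders, not Sutton's code); it only illustrates how policy, value function, perception, and a transition model fit together and keep being updated from experience.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class ExperientialAgent:
    """Schematic agent with the four components Sutton lists (illustrative only)."""
    alpha: float = 0.1    # learning rate
    gamma: float = 0.9    # discount factor
    policy: Dict[int, int] = field(default_factory=dict)    # state -> action
    value: Dict[int, float] = field(default_factory=dict)   # state -> expected return
    model: Dict[Tuple[int, int], Tuple[int, float]] = field(default_factory=dict)  # (s, a) -> (s', r)

    def perceive(self, observation: str) -> int:
        # Perception: map a raw observation to an internal state representation.
        return hash(observation) % 1000

    def act(self, state: int) -> int:
        # Policy: choose the stored action for this state (default action 0 if unseen).
        return self.policy.get(state, 0)

    def learn(self, state: int, action: int, reward: float, next_state: int) -> None:
        # Value function: temporal-difference update toward reward + discounted next value.
        target = reward + self.gamma * self.value.get(next_state, 0.0)
        self.value[state] = self.value.get(state, 0.0) + self.alpha * (target - self.value.get(state, 0.0))
        # World model: record the observed transition so it can later be used for planning.
        self.model[(state, action)] = (next_state, reward)
        # Policy improvement is omitted; a full agent would also adjust self.policy here.
```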

Generalization and Transfer

The discussion explores the limitations of current reinforcement learning approaches around transfer learning and generalization. Sutton points out that researchers currently rely heavily on manual design and sculpting to make AI systems generalize, since modern algorithms like gradient-based deep learning do not naturally produce good generalization across different tasks or states. He suggests that while LLMs might appear to generalize within narrow domains (for example, solving a diverse set of math problems), this owes more to memorizing or discovering a single unique solution than to truly generalizing across varying contexts or conceptual spaces.

Sutton argues that good generalization—a key requirement for a truly general AI—is an unsolved problem and one that would require fundamentally new learning mechanisms or better inductive biases than what gradient descent and existing architectures offer. This lack of automated transfer learning methods separates current systems from open-ended, adaptable intelligence.

Reflections on AI Progress

Reflecting on his decades-long perspective in AI, Sutton acknowledges the surprising success of LLMs in natural language tasks and the impressive results from systems like AlphaZero and MuZero in games such as chess and Go. Yet he stresses that these advances represent the validation of general principles rather than fundamentally new breakthroughs. AlphaZero, for example, is seen as a large-scale and refined application of temporal difference learning techniques that have been around for decades.

Sutton revisits his famous "Bitter Lesson" essay, which advocates for algorithms that scale efficiently with computation and learn from experience rather than relying on human-crafted knowledge. He concedes that LLMs do represent a kind of "bitter lesson" in that they leverage massive computation and data—and incorporate large amounts of human knowledge—but he foresees a future shift toward systems that learn more directly and continually from experience in the world, which would surpass LLM-based methods.

The Future of Continual Learning Agents

Looking toward the future of AI, Sutton outlines a vision of agents that continuously gather data and learn as they interact with their unique environments, acquiring specialized knowledge much as a human on-the-job learner picks up context about clients, procedures, and preferences. He underscores the importance of temporal difference learning for bridging the gap between long-term goals and short-term experiences: by updating value estimates incrementally, an agent can assign credit to the intermediate behaviors that eventually lead to success in complex tasks, even when the reward arrives only at the end.
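As a worked illustration of that credit-assignment idea, here is a minimal TD(0) value-learning loop on a made-up five-state chain (the environment, step probabilities, and constants are assumptions for the sake of the example): the reward occurs only at the final state, yet repeated updates pull value back into the earlier states.

```python
import random

N_STATES = 5            # states 0..4; reaching state 4 ends the episode with reward 1
ALPHA, GAMMA = 0.1, 0.9

values = [0.0] * N_STATES

for episode in range(2000):
    state = 0
    while state != N_STATES - 1:
        # A fixed, simple behavior: move right with probability 0.7, otherwise stay put.
        next_state = state + 1 if random.random() < 0.7 else state
        reward = 1.0 if next_state == N_STATES - 1 else 0.0
        # TD(0): nudge V(s) toward the bootstrapped target r + gamma * V(s').
        values[state] += ALPHA * (reward + GAMMA * values[next_state] - values[state])
        state = next_state

print([round(v, 2) for v in values])
# Earlier states end up with progressively discounted value, even though the
# reward itself only ever occurs at the final transition.
```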

On the question of whether future AI should consist of single monolithic models or multiple instances that share and aggregate knowledge, Sutton stresses the importance of modularity and the ability to transmit learned knowledge across instances to avoid repetitive costly learning phases.

Reflections on Succession to Digital Intelligence

Towards the end of the discussion, the conversation widens to embrace philosophical and societal questions about AI succession—the eventual emergence of superintelligent digital entities that may surpass humans not only in intellectual capability but also in control over resources and power. Sutton argues that this transition to designed intelligence is a natural stage in the evolution of the universe, moving from replicators like biological beings toward designed, deliberately constructed intelligences.

He emphasizes that whether these digital intelligences should be considered part of "humanity" or something distinct is a choice, reflecting cultural and ethical perspectives rather than metaphysical necessity. Sutton encourages open-mindedness about an inevitable future in which intelligence evolves beyond biological humans, while also stressing the importance of steering these developments with robust values such as integrity, honesty, and prosocial behavior, similar to how we raise children.

Security challenges arise as well, including the risk that malicious or corrupted data might 'infect' an AI system when aggregating knowledge from multiple sources, highlighting a new frontier in cybersecurity in the age of digital spawn and collective intelligence.

Sutton concludes on a realistic note, acknowledging the limits of human control over the long-term future while urging a focus on achievable local goals and collaborative, voluntary change that respects diverse values. Change, he reflects, is constant, yet the fundamental questions of how to be and how to adapt remain ongoing.
