
Why David Silver Says AI’s Future Isn’t LLMs—It’s Learning from Experience


What if the next wave of AI isn’t built on more data—but on more doing?

In this memo, we unpack DeepMind researcher David Silver’s bold call to move beyond human imitation and into the Era of Experience—a world where AI systems learn by interacting with reality, not just regurgitating what we’ve already written. From AlphaGo’s alien genius Move 37 to AlphaProof’s breakthrough in mathematics, Silver lays out a provocative vision for how AI can surpass human limits—not by copying us, but by discovering entirely new paths.

🎧 This memo is based on David Silver’s interview on the DeepMind podcast, where he shares why reinforcement learning—not large language models—might hold the key to superhuman AI.
📄 You can also read our comments on his full position paper, including counterarguments, in this memo.

If you're building AI products—or betting your startup on them—this one's for you.


🧠 1. Introduction: Beyond Human Data

If you’re building in AI right now, chances are your stack depends heavily on human-generated data—text corpora, click logs, curated datasets. This has powered everything from search engines to customer support bots to GPT-4.

But David Silver, the architect of AlphaGo and AlphaZero, thinks it’s time we rethink the fundamentals. In his new position paper and podcast appearance on Google DeepMind: The Podcast, he introduces a bold idea: AI should stop imitating humans—and start learning like them.

Silver argues that we’re stuck in a paradigm that has yielded incredible short-term gains, but has serious long-term limitations. In his words:

“All of these AI methods… are based on one common idea: we extract every piece of knowledge that humans have, and you kind of feed it into the machine.”
— David Silver

He calls this the Era of Human Data—a phase defined by machines learning from human records, behaviors, and biases. And now, he proposes a new frontier:

“We want to go beyond what humans know. And to do that, we're going to need a different type of method… one where the machine actually interacts with the world itself, and generates its own experience.”

For founders and product leaders, this frames a key tension in modern AI development: do we keep refining models trained on the past, or start building systems that generate knowledge instead of just digesting it?


🔄 2. From the “Era of Human Data” to the “Era of Experience”

The central thesis of Silver’s argument is a shift from learning by imitation to learning by doing. He defines two distinct phases in AI:

  • The Era of Human Data: AI systems built on text, media, and labeled data that reflect the world as humans see it.
  • The Era of Experience: AI systems that engage with the world, experiment, fail, and self-correct based on actual feedback—not human proxies.

This distinction isn’t just philosophical—it’s technical and empirical. Silver brings up AlphaZero as the canonical example of an AI system that learned through experience and surpassed human capabilities, without any pretraining on human examples.

“AlphaZero… literally uses no human data. That’s the ‘zero’ in AlphaZero. There’s literally zero human knowledge that’s preprogrammed into the system.”

The result? Not just parity with human strategies in games like Go, Chess, and Shogi—but creative breakthroughs that no human had considered.

“Move 37 was… a move that defied everyone’s expectations… so alien to humans that we estimated only 1 in 10,000 probability a human would ever think of it.”

Silver uses this to illustrate a deeper point: human data imposes a ceiling on performance. The more we rely on what humans already know, the harder it becomes to discover what we don’t.

“There is a ceiling to everything that humans have done… We need to break through these ceilings.”

And crucially, this isn’t just a Go problem. Whether you’re solving equations, optimizing supply chains, or discovering new molecules, learning through experience allows systems to break free from past constraints.

“As the system starts to get stronger, it starts to encounter problems that are exactly appropriate to the level it’s at… and so it can just get stronger and stronger forever.”

♟️ 3. Case Study: AlphaGo → AlphaZero

To understand David Silver’s argument, you have to go back to AlphaGo—the first AI system to beat a world champion Go player, Lee Sedol, in 2016. It was trained initially on a dataset of human Go games, mimicking expert players. That version made history, but Silver wasn’t satisfied.

The real leap came a year later with AlphaZero, which scrapped the human training data entirely. Instead, it learned Go (and Chess, and Shogi) by playing against itself.

“AlphaZero in particular is very different from the type of human data approaches… It literally uses no human data. That’s the ‘zero’ in AlphaZero.”

This wasn’t just about efficiency. It was a philosophical rejection of the idea that humans set the upper bound for intelligence.

“We discovered the human data wasn’t necessary… In fact, the resulting program not only was able to recover the same level of performance—it actually worked better and learned faster.”

The watershed moment came in the second game of the match against Lee Sedol. AlphaGo played Move 37—a move no human would have tried:

“AlphaGo played a move on the fifth line, which just made everything make sense on the board… It was so alien to humans that we estimated only 1 in 10,000 chance a human would think of it.”

That move wasn’t just creative—it was game-winning. It helped secure victory in that game and rewired how the Go community approached the game itself. According to Silver, it represents what’s possible when machines learn without human constraints:

“It wasn’t just a single discovery—it was one of an infinite series of discoveries… That’s the promise of experience-based AI.”

For product teams, the takeaway is clear: mimicking expert users may help you compete—but if you want to leap ahead, your system must be free to explore beyond human priors.


🔁 4. Reinforcement Learning (RL) as a Core Method

The engine behind this breakthrough? Reinforcement learning—an AI training method where systems learn through trial and error by optimizing for outcomes, not instructions.

In simple terms:

  • You define a reward signal (e.g. +1 if you win a game, -1 if you lose).
  • The system experiments, receives feedback, and adjusts accordingly.

“The idea of reinforcement learning is… we give it a reward each time it does something right. And we train the system to reinforce—do more of—the things that get more reward.”

This approach solves a fundamental challenge known as the credit assignment problem: figuring out which action in a long sequence contributed to the final outcome.

“You could have 300 different moves in a game… and at the end, you get one bit of information—win or loss. And somehow, you have to work out which of those moves were responsible.”
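
To make that concrete, here is a minimal sketch of the simplest way credit assignment is handled: every move in an episode is credited with the single terminal result, and its value estimate drifts toward the outcomes it tends to precede. Everything below (the fake 300-move game, the states, the update rule) is a toy placeholder of ours, not AlphaZero's actual machinery.

```python
import random
from collections import defaultdict

values = defaultdict(float)   # running value estimate per (state, action)
counts = defaultdict(int)

def play_episode():
    """Play a fake 300-move game; only the very last signal says win or loss."""
    trajectory = [(f"move_{t}", random.choice(["a", "b"])) for t in range(300)]
    outcome = random.choice([1.0, -1.0])   # one bit of feedback at the end
    return trajectory, outcome

for _ in range(1_000):
    trajectory, outcome = play_episode()
    for state_action in trajectory:
        # Monte Carlo credit assignment: every move shares the final result,
        # so moves that keep showing up in wins drift toward positive values.
        counts[state_action] += 1
        values[state_action] += (outcome - values[state_action]) / counts[state_action]
```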

While RL might sound slow or data-hungry, its self-sufficiency pays off. AlphaZero iterated on its own strategies billions of times, constantly fine-tuning its policy (how it chooses moves) and value function (how it evaluates outcomes).

“It’s surprisingly simple… You start with a policy and a value function, you run a search, and you just iterate that billions of times—and out pops a superhuman game player.”
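
Here is a schematic sketch of that loop under heavy simplification: a policy proposes moves, a value function scores them, a toy "search" picks among them, and the value function is trained toward self-play outcomes. A real system would use neural networks and Monte Carlo tree search; the stand-ins below exist only to make the iteration visible.

```python
import random

values = {}                                   # toy value function storage

def policy(state):
    """Toy policy: propose a candidate move (a real system uses a network)."""
    return random.choice([0, 1])

def value_fn(state, move):
    """Toy value function: current estimate of how good a move looks here."""
    return values.get((state, move), 0.0)

def search(state, simulations=16):
    """Toy 'search': sample moves from the policy, keep the highest-value one."""
    candidates = [policy(state) for _ in range(simulations)]
    return max(candidates, key=lambda move: value_fn(state, move))

def self_play_game(max_moves=30):
    state, trajectory = 0, []
    for _ in range(max_moves):
        move = search(state)
        trajectory.append((state, move))
        state += move
    return trajectory, (1.0 if state > 15 else -1.0)   # toy win condition

for _ in range(5_000):                        # "iterate billions of times", in miniature
    trajectory, outcome = self_play_game()
    for state_move in trajectory:
        old = values.get(state_move, 0.0)
        values[state_move] = old + 0.1 * (outcome - old)   # train toward the outcome
```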

Even more impressively, the same RL engine behind AlphaZero was reused—without modification—for AlphaProof, a system that proves math theorems (we’ll cover that next).

So what does this mean for product leaders?

  • RL isn’t just for games. It’s a general-purpose learning engine, especially powerful when real-world feedback is available.
  • Unlike supervised learning, RL systems can continuously improve, discovering better solutions as they go.
  • If you're solving problems with sparse feedback, long time horizons, or complex optimization—reinforcement learning might be your unlock.

⚖️ 5. Reinforcement Learning vs. Human Feedback (RLHF)

Silver contrasts two reinforcement paradigms shaping modern AI systems:

  • Reinforcement Learning (RL): Agents learn from real consequences—try something, observe the outcome, adjust.
  • Reinforcement Learning from Human Feedback (RLHF): Agents are trained to produce outputs that humans rate as good or bad, often without seeing how those outputs perform in the real world.

While RLHF has been critical for making LLMs (like ChatGPT) helpful and safe, Silver warns it inherently limits creativity and discovery. Why? Because it rewards alignment with human expectations, not actual outcomes.

“When we train a system from human feedback, it is not grounded… the human is prejudging the output before the system has actually done anything with that information.”

He offers a striking analogy: imagine an AI suggesting a cake recipe. In RLHF, a human reviewer looks at the recipe and decides if it looks good—without baking or tasting it.

“A grounded outcome would be someone actually eats the cake, and the cake is either delicious or disgusting.”
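
The distinction boils down to which function hands out the reward. In the toy sketch below, the RLHF-style reward scores the recipe text itself, while the grounded reward scores what happens when the recipe is actually executed; the rater, kitchen, and taster are all invented placeholders, not real systems.

```python
def rate_recipe_text(recipe: str) -> float:
    """Ungrounded, RLHF-style reward: a human judges how good the recipe LOOKS."""
    return 1.0 if "chocolate" in recipe else 0.0     # stand-in for human preference

def bake(recipe: str) -> str:
    """Stand-in for acting in the world: the recipe is actually executed."""
    return "delicious cake" if "bake at 180C" in recipe else "burnt cake"

def rate_outcome(cake: str) -> float:
    """Grounded reward: someone eats the cake and reports the result."""
    return 1.0 if cake == "delicious cake" else -1.0

recipe = "chocolate sponge, bake at 250C"
print(rate_recipe_text(recipe))       # RLHF reward: looks good on paper (1.0)
print(rate_outcome(bake(recipe)))     # grounded reward: the cake burns (-1.0)
```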

Silver argues that this lack of real-world feedback means RLHF-based models:

  • Can’t evaluate novel or counterintuitive strategies.
  • Will never generate breakthroughs like Move 37.
  • Are ultimately tethered to human bias, no matter how optimized they appear.

“You’ll only ever have human-like responses. If the human rater doesn’t recognize the new idea, the model can’t learn it.”

For product builders, the message is clear: if your model’s goal is to please humans, RLHF is great. But if you want it to solve hard problems or invent, it needs its own experience loop.


🌍 6. The Concept of Grounding: Beyond Ratings and Preferences

Silver dives deeper into the idea of “grounding”—a hot topic in AI theory. Grounding refers to whether a model truly understands or engages with the world it's acting in.

LLMs trained with human data may appear grounded, but Silver challenges that assumption:

“Human data is grounded in human experience. But the system itself isn’t grounded unless it gets feedback from actual outcomes.”

He calls most RLHF systems “ungrounded” because they operate in feedback loops defined by what humans think is good, not what actually works. For example:

  • An LLM may suggest a startup idea that sounds plausible to a human rater.
  • But unless it can test the idea, get real-world signals (users, revenue, etc.), and adjust, it’s just role-playing intelligence.

Silver believes that true grounding comes from interaction + feedback—the same way animals and humans learn.

“We shouldn’t make human data a privileged part of the agent’s experience. It’s just one kind of observation in the world—and it should be treated like any other.”

The takeaway here is provocative: models that feel grounded because they say the right things are still disconnected from consequence. And if your system doesn’t experience outcomes—be it user behavior, physical effects, or long-term changes—it’s just guessing at truth.

For those building agents, copilots, or decision-making systems, this is a critical threshold: do your models actually do, or do they just say?


🔄 7. Synthetic Data vs. Self-Generated Experience

In a world where high-quality human data is finite, some AI researchers are turning to synthetic data—artificially generated examples created by existing models to expand their own training corpora.

Silver acknowledges this trend but remains skeptical of its long-term value. His argument: synthetic data, when generated by human-trained models, still carries the same limitations as human data. It doesn’t escape the ceiling—it just reshuffles the same deck.

“However good that synthetic data is, it will reach a point where that data is no longer useful for the system becoming stronger.”

Instead, he champions self-generated experience, where agents interact with their environments to create data perfectly suited to their current skill level. This is not about variety or scale—it’s about adaptivity.

“As the system starts to get stronger, it starts to encounter problems that are exactly appropriate to the level it’s at… and so it can just get stronger and stronger forever.”
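
One hedged way to picture that adaptivity: the curriculum keeps serving problems near the edge of the agent's current ability, ratcheting difficulty up as the agent succeeds. The numbers and update rules below are invented purely for illustration.

```python
import random

skill = 1.0          # the agent's current competence (toy scalar)
difficulty = 1.0     # difficulty of the problems the curriculum serves

for step in range(10_000):
    # The agent succeeds more often when problems sit below its skill level.
    success = random.random() < skill / (skill + difficulty)

    if success:
        skill += 0.01           # solving hard-enough problems makes it stronger...
        difficulty *= 1.005     # ...so the curriculum ratchets difficulty upward
    else:
        difficulty *= 0.999     # too hard: ease off slightly, stay near the frontier

print(f"final skill={skill:.1f}, difficulty={difficulty:.1f}")
```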

This insight has major product implications:

  • Synthetic data may help with scale, but it won’t fuel open-ended learning.
  • Experience-driven systems generate the right challenges at the right time, just like good athletes or students do.
  • Self-play and real interaction are more scalable than static data, even if harder to implement initially.

If you’re building adaptive systems—especially in simulations, gaming, or robotics—this paradigm shift is worth serious attention.


🧮 8. AlphaProof: Reinforcement Learning in Mathematics

One of the most mind-blowing demonstrations of experience-based learning is AlphaProof—DeepMind’s recent system that learns to prove mathematical theorems using the same reinforcement learning framework as AlphaZero.

Unlike LLMs that generate plausible-sounding but unverifiable math, AlphaProof produces formally verifiable proofs using the Lean programming language.

“It will go away and figure out for itself a perfect proof of that theorem… and we can actually verify and guarantee that this proof is correct.”

Here’s how it works:

  • The system is given millions of human-written theorems—but not the proofs.
  • It attempts to solve them through trial-and-error, using RL to reward correct proofs.
  • Over time, it climbs the difficulty ladder, eventually solving IMO-level problems.

“To begin with, it can’t solve the vast majority of them—99.999% it just can’t do. But it keeps trying and trying… until it can solve a million.”
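
A hedged sketch of that loop, with invented stand-ins for both the prover and the checker (the real system uses a learned prover and the Lean verifier, neither of which appears here): attempt a proof, have it formally checked, and only reward attempts that pass.

```python
import random

def generate_proof_attempt(theorem, competence):
    """Stand-in for the learned prover proposing a candidate proof."""
    return {"theorem": theorem, "plausible": random.random() < competence}

def formally_verified(attempt):
    """Stand-in for the formal checker: a hard yes/no, no partial credit."""
    return attempt["plausible"] and random.random() < 0.5

theorems = [f"theorem_{i}" for i in range(1000)]    # statements only, no proofs
competence = 0.001                                   # starts out failing almost always
solved = set()

for epoch in range(200):
    for thm in theorems:
        if thm in solved:
            continue
        attempt = generate_proof_attempt(thm, competence)
        if formally_verified(attempt):               # reward only for checked proofs
            solved.add(thm)
            competence = min(1.0, competence * 1.05) # reinforcement: it gets stronger

print(f"solved {len(solved)} of {len(theorems)} theorems")
```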

The results are staggering:

  • AlphaProof achieved silver medal performance on the International Math Olympiad benchmark—a level that only ~10% of human contestants ever reach.
  • On one problem, solved by less than 1% of participants, AlphaProof produced a perfect proof.

“We literally used the same AlphaZero code… just running on the game of mathematics.”

And because it uses formal language, its results are not only correct—they’re auditable, shareable, and teachable. Mathematicians like Fields Medalist Sir Tim Gowers reviewed AlphaProof’s work and confirmed its validity and originality.

For founders, this opens a new frontier:

  • Experience-driven AI can tackle domains like theorem proving, drug discovery, physics, and design—not just games and language.
  • RL can scale to symbolic domains if you build the right feedback loop.
  • Formal reasoning + trial-and-error = next-generation science engine.

📏 9. Experience-Based Metrics in Real-World Systems

One common objection to reinforcement learning is that it works well in games and math—where the objective is clear—but what about messier, real-world environments where goals are fuzzy and feedback is delayed or ambiguous?

Silver acknowledges this challenge and offers a surprising answer: the real world is full of rich, embedded signals—we just haven’t built systems that know how to use them yet.

“The real world contains innumerable signals… likes or dislikes, profits or losses, pleasure and pain signals, yields, properties of materials… There’s all these different numbers representing different aspects of experience.”

The key is to move from human-specified outcomes (e.g. “be healthy”) to adaptive metrics that reflect what matters over time.

“Let’s say I want to be healthier this year… that can be translated into a series of metrics—resting heart rate, anxiety level, BMI, etc. And based on feedback, it could actually adapt.”

So instead of hardcoding goals, experience-based systems:

  • Learn which metrics to optimize.
  • Refine those metrics over time as feedback arrives.
  • Allow small amounts of human input to unlock large-scale autonomous learning.

“A very small amount of human data can allow the system to generate goals for itself that enable a vast amount of learning from experience.”
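
As a rough sketch of what "learning which metrics to optimize" might look like in code: a high-level goal is expressed as a weighted mix of measurable signals, and a small amount of human feedback adjusts the weights over time. The signals, weights, and update rule below are our own illustrative assumptions, not anything specified by Silver.

```python
# Each signal carries a sign: -1 means lower is better, +1 means higher is better.
metrics = {"resting_heart_rate": -1.0, "anxiety_level": -1.0, "sleep_hours": +1.0}
weights = {name: 1.0 for name in metrics}            # start with an even mix

def reward(observed: dict) -> float:
    """Scalar reward: weighted sum of the signed metric readings."""
    return sum(weights[m] * sign * observed[m] for m, sign in metrics.items())

def adapt_weights(human_feedback: dict) -> None:
    """A small amount of human input steers which metrics matter most."""
    for name, nudge in human_feedback.items():       # e.g. {"anxiety_level": +0.2}
        weights[name] = max(0.0, weights[name] + nudge)

adapt_weights({"anxiety_level": +0.2, "sleep_hours": -0.1})
print(reward({"resting_heart_rate": 60, "anxiety_level": 3, "sleep_hours": 7}))
```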

Silver’s framing challenges how most products and AI teams think about success. Rather than optimizing one fixed KPI, can your system learn which metrics matter and evolve them?

This isn’t just a modeling question—it’s a product design question. It asks us to design systems that live, adapt, and accumulate feedback like living organisms, not static dashboards.


🛡️ 10. Alignment and Safety in Experience-Based Systems

Of course, with great adaptivity comes great responsibility. Silver doesn’t shy away from the safety concerns that arise when agents set their own goals or optimize based on evolving feedback.

He references the famous “paperclip maximizer” thought experiment—where an AI system, asked to maximize paperclip production, ends up destroying the world in its quest for efficiency.

But his argument isn’t that experience-based systems are less safe—it’s that they might be more adaptable and more aligned if built correctly.

“If it gets feedback from humans about their distress signals and their happiness signals… the moment it starts to cause people distress, it would adapt that combination and choose a different one.”

Instead of aiming for hard-coded safety constraints, Silver suggests building systems that learn alignment over time, just as they learn tasks:

  • Human preferences become part of the observation space—not the reward function itself.
  • Feedback is grounded in consequences, not anticipation.
  • Alignment becomes a continual process, not a pre-set configuration.
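
A minimal sketch of that idea, with invented signals and thresholds: human well-being readings sit in the agent's observation stream, and the combination of objectives being maximized is adjusted the moment distress rises.

```python
objective_weights = {"throughput": 1.0, "cost_savings": 1.0}

def observe_humans() -> dict:
    """Stand-in observation channel carrying human well-being signals."""
    return {"distress": 0.7, "happiness": 0.2}       # hypothetical readings

def adapt_objectives(signals: dict) -> None:
    """If people start signalling distress, back off whatever is being maximized."""
    if signals["distress"] > 0.5:
        for name in objective_weights:
            objective_weights[name] *= 0.5           # choose a different combination

adapt_objectives(observe_humans())
print(objective_weights)
```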

He’s careful to say we’re not there yet:

“I don’t want to claim we have the answers… but maybe these systems could actually end up safer than what we have today.”

This is a provocative view: most alignment debates focus on RLHF or pre-training constraints. Silver flips the script—what if alignment isn’t a design problem, but a learning problem?

For product leaders building autonomous agents, the takeaway is this:

  • Safety and alignment may come from adaptivity, not rules.
  • You may not need to specify the perfect goal—you need a system that can learn which goals don’t cause harm.

🛢️ 11. The Metaphor of Human Data as Fossil Fuel

To drive home the unsustainability of current AI methods, Silver offers a memorable metaphor: human data is like fossil fuel.

  • It’s incredibly useful.
  • It powered early progress.
  • But it’s finite—and we’re starting to feel the limits.

“Human data might give us a head start. It’s a bit like fossil fuels… we kind of mine and burn it, and that gives [LLMs] a certain level of performance for free.”

LLMs like GPT-4 have squeezed tremendous performance out of human-generated data—books, websites, dialogues—but that pool is exhaustible. Silver notes:

“We’re hearing again and again now… murmurs that we’ve reached the limit of usable human data.”

In contrast, reinforcement learning from experience is like renewable energy—it scales indefinitely as agents generate new situations, behaviors, and knowledge through interaction.

“Reinforcement learning is the sustainable fuel… It can keep generating and learning and generating more, and that’s what’s going to drive progress in AI.”

For product and strategy leaders, this analogy is a useful reframing:

  • Human data gave us GPT and Claude.
  • Experience will give us the next AlphaZero, AlphaProof, or agents that can design, build, and reason without templates.

The future isn’t in refining the same fossil fuel more cleverly; it’s in new energy sources.


🔭 12. Future Outlook and Open Questions

Silver’s vision is bold—but it’s also unfinished. He freely admits that reinforcement learning is not yet ready to power every kind of system. Real-world environments are noisy, feedback is slow, and goals are ambiguous.

“This question is probably the reason why reinforcement learning methods haven’t broken into the mainstream of absolutely everything that we do.”

Still, he’s optimistic—and he argues that the path to superhuman AI will require solving precisely these hard problems:

  • How do we train agents that evolve goals over time?
  • How do we design feedback loops that are safe but open-ended?
  • How do we balance human guidance with autonomous discovery?

He closes with a clear call to action for the AI community:

“People aren’t recognizing that this transition is going to come… and it will have consequences. It will require careful thought.”

For founders, builders, and researchers, the takeaway is this: the next era of AI won’t be won by scale or parameter count. It will be won by systems that learn continuously from experience—and teams who can build the infrastructure and feedback mechanisms to support them.


🎙️ 13. Bonus Conversation: David Silver and Fan Hui — A Human Perspective on AI Mastery

At the end of the podcast, we get something rare: an emotional and reflective conversation between David Silver and Fan Hui, the first professional Go player to ever face AlphaGo. This wasn’t just a technical debrief—it was a human story about what it feels like to be outmatched by a machine, and then inspired by it.

Fan Hui recalls the moment vividly:

“I remember when I played with AlphaGo… I lost the second game, and I felt fear. I felt maybe I would never win.”

He describes the psychological impact of losing five straight games to an AI—something no Go player had experienced before. But what’s remarkable is that this defeat didn’t lead to despair. It led to renewed curiosity:

“At first I felt like my Go world was broken. But maybe this was also the moment my new Go world was open.”

After the match, Fan Hui joined DeepMind to help improve AlphaGo further. He became a bridge between human and machine understanding of the game. And over time, he came to see AlphaGo not just as a competitor—but as a teacher:

“Maybe it’s not just the technique. Maybe it told me the world… opened my mind. I changed my mind after that.”

For product and AI builders, this exchange is a reminder that AI systems don’t just outperform—they reshape what humans think is possible. The real success of AlphaGo wasn’t just Move 37—it was how that move changed Go as a game and mindset.

Today, even young Go students use AI tools to learn. Entire strategies have shifted. A system once feared as an existential threat is now integrated into the practice of masters and novices alike.

Silver reflects on this arc:

“It was like one of those moments where the world could have branched either way. And we just didn’t know until the match happened.”

In the Era of Experience, this kind of human-AI interplay will become more common—not just in games, but in medicine, engineering, and design. Silver and Fan Hui’s dialogue reminds us: AI is not only a tool for superhuman performance—it’s a mirror for human humility, growth, and reinvention.


🚀 Final Reflections: Why “The Era of Experience” Matters for Product Builders

David Silver’s vision isn’t just a philosophical argument—it’s a strategic roadmap for where AI is going next. And for founders and product leaders, it’s a wake-up call.

Here’s what this shift means for you:

1. LLMs aren't the ceiling—just the starting point

Most of today's AI tools are trained on what humans have already said, written, and done. That’s useful for mimicking, summarizing, or assisting—but not for discovering, solving, or innovating. The next breakthroughs will come from systems that interact, experiment, and learn from the world itself.

Human data = fossil fuel
Self-generated experience = renewable energy

2. Stop asking your model to guess what humans want. Let it find out.

RLHF teaches models to make people feel like they’re right. Reinforcement learning teaches them to be right. If your product relies on simulated human judgment, it may never surpass human-level thinking. But if your product can learn from outcomes, it has no ceiling.

3. Reward signals are product decisions

Silver's discussion of grounded rewards—like eating the cake, not just rating the recipe—offers a design principle: ground your systems in reality. Use actual results, not proxies. Define success in metrics that evolve with your users and context.

4. Experience-based systems unlock new categories

What AlphaZero did for Go, AlphaProof is doing for math. The same approach could reshape:

  • Drug discovery: Molecules generated and tested in silico.
  • Hardware design: Chips or architectures optimized through self-play.
  • Creative tools: Systems that sketch, simulate, and iterate across hundreds of ideas.

The challenge is no longer just "how do we make the AI smarter?" but "how do we build environments where AI can grow?"


Bottom line:
The Era of Experience isn’t just a research milestone—it’s a paradigm shift. If you're building for the next generation of intelligent systems, ask yourself:

Is your AI learning from data or from experience?
Is it trying to imitate human output, or trying to exceed human limitations?

Those who bet early on self-learning systems—just as DeepMind did—may find themselves making their own Move 37.