
The Rise of Reasoning Models: How AI is Learning to Think Step by Step

A concise introduction to reasoning models as the next frontier in AI, highlighting their structured thinking capabilities beyond text prediction.

GÁBOR HORVÁTH

A new paradigm is now mainstream: reasoning models. These advanced language models represent a significant leap forward in AI capabilities, offering more than just text generation: they produce structured, logical thinking that mimics human reasoning processes. Let's explore what makes these models special and how they're changing the AI landscape today.

What is AI reasoning?

Reasoning is the cognitive process through which we draw conclusions from available information, encompassing logical thinking and problem-solving. In AI terms, reasoning involves:

  • Deductive reasoning: Drawing specific conclusions from general principles.
  • Inductive reasoning: Forming generalizations based on specific observations.
  • Abductive reasoning: Inferring the most likely explanation from incomplete data.

The significance of reasoning in AI is profound: it enables machines to simulate human decision-making and problem-solving abilities.

Unassailable logic

  • Starting point #1: Witches burn.
  • Starting point #2: Wood burns.
  • Conclusion: Witches are made of wood.
  • Starting point #3: Wood floats on water.
  • Starting point #4: Ducks float on water.
  • Conclusion: Therefore, if a woman weighs the same as a duck, she is made of wood, and thus a witch.
Monty Python and the Holy Grail, witch scene: https://www.youtube.com/watch?v=yp_l5ntikaU

 

The limitations of current LLMs

Large language models have already achieved impressive capabilities:

  • Natural language understanding: Processing and generating human-like text
  • Retrieval capabilities: Accessing vast amounts of information to provide relevant answers
  • Basic reasoning: Solving simple logical tasks

However, they still struggle with complex reasoning challenges. When faced with multi-step problems requiring deep logical analysis, traditional LLMs often falter, producing plausible-sounding but incorrect results.

 


Example of a faulty output, demonstrated by the infamous “strawberry problem.”

 

Chain-of-Thought: the first step forward

One of the earliest approaches to improve reasoning was Chain-of-Thought (CoT) prompting: a technique that encourages models to generate intermediate reasoning steps before providing a final answer.
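
For illustration, a CoT prompt can be as simple as appending an instruction to reason step by step before answering. The wording below is a hypothetical example, not a fixed template:

question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"
cot_prompt = (
    f"{question}\n"
    "Let's think step by step, then give the final answer on its own line."
)
print(cot_prompt)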

 

Linear CoT prompt sequence template

 

Advantages of Chain-of-Thought

  • Increases transparency and accuracy in the model's reasoning process.
  • Makes the "thinking" visible and verifiable.

Disadvantages of Chain-of-Thought

  • Significantly more tokens used, increasing costs.
  • Still relies on the model's inherent capabilities.

The birth of true AI reasoning models

Reasoning models take this concept to the next level. Unlike standard LLMs with Chain-of-Thought prompting bolted on, these are models specifically trained to:

  • Think before answering, using dedicated <think> phases.
  • Structure their reasoning explicitly.
  • Verify and correct their own thought processes.

 

As DeepSeek researchers observed in their groundbreaking paper on the R1 model:

 "These models first think about the reasoning process in their mind and then provide the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively."

 

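Because the reasoning trace and the final answer live in explicit tags, they are easy to separate programmatically. A minimal sketch in Python (the tag names follow the DeepSeek-R1 convention quoted above; the parsing code itself is illustrative):

import re

def split_reasoning(output: str) -> tuple[str, str]:
    """Separate the reasoning trace from the final answer in a tagged output."""
    think = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    return (
        think.group(1).strip() if think else "",
        answer.group(1).strip() if answer else output.strip(),
    )

reasoning, final = split_reasoning(
    "<think>45 minutes is 0.75 h, so 60 / 0.75 = 80</think><answer>80 km/h</answer>"
)
print(final)  # 80 km/h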

 

The key insight is that by increasing inference compute—specifically by extending the thinking phase—accuracy improves dramatically. This seems to contradict traditional understanding of autoregressive processes, where generating longer outputs typically increases the probability of errors.

The paradox of reasoning models

This brings us to a fascinating contradiction. According to traditional understanding of autoregressive models (as highlighted by researchers like Yann LeCun), the error probability in large language models should increase with output length—a phenomenon known as "compounding error."
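
The argument can be sketched with a back-of-the-envelope calculation: if each generated token independently stays "on track" with probability 1 - e, the chance that an entire n-token output contains no error shrinks exponentially with n. This is a deliberately simplified model (it assumes independent per-token errors), but it captures the intuition:

per_token_error = 0.01  # assumption: a 1% chance of derailing on any single token
for n in (10, 100, 1000):
    p_on_track = (1 - per_token_error) ** n
    print(f"{n:>5} tokens -> P(no error anywhere) ~ {p_on_track:.5f}")
# roughly 0.90438 for 10 tokens, 0.36603 for 100, and 0.00004 for 1000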

 

Yann LeCun – A Path Towards Autonomous Machine Intelligence: https://www.youtube.com/watch?v=OKkEdTchsiE

 

Yet reasoning models demonstrate the opposite effect. When given more "thinking time" (more tokens dedicated to reasoning), these models become significantly more accurate. This challenges fundamental assumptions about how large language models function and learn.

How reasoning models are trained

Creating a reasoning model involves several key stages that differ from traditional LLM development:

The complete pipeline (DeepSeek-R1 Example)

  • Cold Start Phase: Begin with an existing pretrained model and fine-tune it on cleaned Chain-of-Thought outputs.
  • Reasoning Reinforcement Learning: Apply specialized RL techniques (such as GRPO, Group Relative Policy Optimization) to reward sound reasoning; a brief sketch of the GRPO idea follows this list.
  • Rejection Sampling SFT: Further fine-tune using only the best reasoning examples.
  • Diverse RL: Apply RLAIF (Reinforcement Learning from AI Feedback) to increase versatility.
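
The core of GRPO can be summarized in a few lines: for each prompt, a group of candidate answers is sampled and scored, and each answer's reward is normalized against its own group's statistics, replacing the separate value network used in classic PPO. The sketch below shows only this group-relative advantage computation, not the full training loop:

import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantage: compare each sampled answer to the mean and
    spread of its own group instead of using a learned value model."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero spread
    return [(r - mean) / std for r in rewards]

# Four answers sampled for one prompt, rewarded 1.0 if correct and 0.0 otherwise.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]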

The key innovation: rule-based reinforcement learning

Unlike traditional supervised fine-tuning (which teaches models to imitate specific reasoning patterns), reasoning models employ reinforcement learning in easily verifiable domains (a minimal reward sketch follows the list below):

  • They let the model attempt to solve problems using its own Chain-of-Thought approach.
  • The model receives rewards based on reaching correct conclusions, not following prescribed reasoning paths.
  • This encourages exploration of diverse reasoning strategies.
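
In a verifiable domain such as math, the reward rule can be extremely simple. The sketch below mirrors the idea of an accuracy reward plus a format reward; the exact weights and checks are illustrative, not DeepSeek's published values:

import re

def rule_based_reward(output: str, ground_truth: str) -> float:
    """Toy reward: 1.0 for a correct final answer plus a small bonus for
    keeping the expected <think>/<answer> structure."""
    match = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    format_ok = "<think>" in output and match is not None
    correct = match is not None and match.group(1).strip() == ground_truth.strip()
    return (1.0 if correct else 0.0) + (0.1 if format_ok else 0.0)

print(rule_based_reward("<think>2 + 2 = 4</think><answer>4</answer>", "4"))  # 1.1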

 

This approach is particularly effective because:

  • Verification is simpler than generation: It's often much easier to check if an answer is correct than to produce the correct reasoning path
  • Binary feedback promotes learning: Clear right/wrong signals in domains like mathematics provide strong learning signals
  • Format consistency is enforced: Models learn to maintain proper <think> and <answer> structures

The Aha Moment: emergent capabilities

Perhaps the most fascinating aspect of reasoning models is what DeepSeek researchers call the "aha moment"—when the model, trained with reinforcement learning, spontaneously develops self-correction mechanisms:

  • Backtracking: Recognizing dead-end reasoning paths.
  • Self-reflection: Questioning its own assumptions.
  • Re-evaluation: Reconsidering earlier approaches.
  • Exploring alternative solutions: Testing different problem-solving strategies.
Example of a mathematical solution with an "aha moment" from the DeepSeek-R1 paper

 

These capabilities emerged without explicit training, demonstrating how reinforcement learning can yield behaviors beyond what was directly programmed.

Generalization: the true test

The million-dollar question for reasoning models is generalization: can reasoning skills learned in closed, reward-rich domains (like mathematics) transfer to open-ended problems?

Research shows a stark contrast between supervised fine-tuning and reinforcement learning approaches:

  • SFT models: Perform well on in-distribution problems but fail catastrophically on out-of-distribution challenges.
  • RL-trained reasoning models: Maintain reasonable performance even on novel problem types.
Graph showing RL vs SFT performance on out-of-distribution tasks 
https://arxiv.org/pdf/2501.17161 

 

This transfer learning capability is critical for the real-world utility of reasoning models. As Andrej Karpathy noted, the success of reasoning models depends heavily on whether "knowledge acquired in closed reward-rich domains can transfer to open-ended problems."

The current state of play: available AI reasoning models

Several leading reasoning models are now available, each with different capabilities:

 

Model Provider | Model | Public Access | Math Performance (AIME) | Science (GPQA) | General Reasoning (ARC)
OpenAI | o1-mini | Chat only | 63.6% | 60% | 9.5%
OpenAI | o1 | Chat only | 83.3% | 78.8% | 25-32%
OpenAI | o3-mini | Chat & API | 60-87.3% | 70.6-79.7% | 11-35%
OpenAI | o3 | Through products | 96.7% | 87.7% | 76-87%
DeepSeek | R1 | Chat & API / weights | 79.8% | 71.5% | 15.8%
Google | Gemini 2.0 (Flash Thinking) | Chat & API | 73.3% | 74.2% | -

 

Notably, these models also show impressive performance on coding tasks, with most achieving high scores on benchmarks such as Codeforces and SWE-bench Verified.

How to prompt reasoning models

Prompting reasoning models differs significantly from working with traditional LLMs:

 

Technique | Traditional LLMs | Reasoning Models
Zero-shot prompting | Weak performance | Good performance
Few-shot prompting | Significantly improves accuracy | Can actually harm reasoning performance
RAG (Retrieval-Augmented Generation) | Recommended | Minimally helpful
Structured prompts (XML tags) | Helpful | Essential
Chain-of-Thought prompting | Very beneficial | Not recommended; may interfere with built-in reasoning
Ensemble methods | Improves performance | Minimal improvement

 

As the team at PromptHub notes: "Reasoning models work best when given clean, direct instructions without examples that might constrain their thinking process."
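
In practice this means keeping the prompt short and declarative. A minimal sketch using the OpenAI Python SDK (the model name and the task are illustrative; check the current API documentation before relying on specific parameters):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A direct, example-free instruction: no few-shot demos, no "think step by step".
response = client.chat.completions.create(
    model="o3-mini",  # illustrative reasoning-model name
    messages=[
        {
            "role": "user",
            "content": (
                "Plan the migration of a 2 TB PostgreSQL database to another "
                "region with under 5 minutes of downtime. List the steps and the main risks."
            ),
        }
    ],
)
print(response.choices[0].message.content)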

What's next: the future of AI reasoning models

Looking ahead, several key developments seem imminent:

  • Product-focused delivery: As OpenAI has demonstrated with Deep Research, advanced reasoning capabilities may increasingly be packaged as complete products rather than raw API access.
  • The rise of agents: 2025 is shaping up to be the year of AI agents, with offerings like OpenAI's Operator, Claude's Computer Use, and Claude Code leveraging reasoning capabilities.
  • Model distillation: Making reasoning abilities available in smaller, locally-runnable models.

 

What's still missing?

Despite their impressive capabilities, reasoning models still have significant limitations:

  • They remain fundamentally language models with statistical foundations.
  • They lack true multimodality (beyond text).
  • They don't have grounded knowledge in physical reality.
  • They can't learn in real-time during inference like humans.
  • They lack environment-adaptive capabilities.

You can read more about solving these shortcomings in our previous blog: Path to AGI.

Potential applications and impact of reasoning models

The practical applications of reasoning models are vast.

In daily work

  • Tackling complex reasoning-intensive workplace problems.
  • Reducing the need for extensive prompt engineering.

In development workflows

  • Enabling truly agentic systems rather than static workflows.
  • Making high-quality models available for on-premises deployment.

In software development

  • Raising the abstraction level of programming with tools like Cursor and Replit.
  • Supporting privacy-preserving local model usage through tools like Continue and Cline.

Broader implications

The rise of reasoning models signals several important shifts in AI development:

  • A new scaling paradigm: pretraining data + pretraining compute + inference-time compute.
  • Accelerated release cycles and innovation.
  • Superhuman capabilities becoming accessible.
  • Full task automation becoming feasible.
  • Initially decreasing LLM development costs.
  • Increasing value of specialized AI chips (Jevons paradox).

 

Conclusion

Reasoning models represent a significant evolutionary step for artificial intelligence. By incorporating explicit reasoning processes and leveraging the power of reinforcement learning, these models are pushing the boundaries of what AI can accomplish.

While we're still in the early days of this technology, the trajectory is clear: AI systems that can think through problems step by step, evaluate their own reasoning, and arrive at solutions through logical deduction are becoming reality. The implications for industries, knowledge work, and software development are profound and far-reaching.

As these technologies continue to develop, we're likely to see an increasing shift toward agentic systems that can operate with greater autonomy and tackle increasingly complex reasoning challenges.

 


 

Sources

Language Models are Few-Shot Learners

Instruction Tuning for Large Language Models: A Survey

Training language models to follow instructions with human feedback

Deep reinforcement learning from human preferences

Constitutional AI: Harmlessness from AI Feedback

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Learning to reason with LLMs

Model distillation – Improve smaller models with distillation techniques

DeepSeek R1 Distill Now Available in Private LLM for iOS and macOS

Reward Hacking in Reinforcement Learning

OpenAI o3-mini

OpenAI o3 and o3-mini—12 Days of OpenAI: Day 12

Gemini 2.0 Flash Thinking

ARC-AGI – About the Benchmark

Humanity's Last Exam

Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku

Introducing deep research

Introducing SWE-bench Verified

Introducing Operator

OpenAI o1 System Card

OpenAI o3 Model Is a Message From the Future: Update All You Think You Know About AI

Prompt Engineering with Reasoning Models

Reasoning models – Explore advanced reasoning and problem-solving models

Anthropic Cookbook on Github

OpenAI Scale Ranks Progress Toward ‘Human-Level’ Problem Solving

OpenAI Plots Charging $20,000 a Month For PhD-Level Agents

Claude Code overview

Jevons paradox
