Reasoning models are the single most significant AI capability shift since ChatGPT. They represent a fundamentally different way to use compute at inference time — spending more tokens on internal deliberation before producing an answer — and the quality improvement on hard problems is dramatic. OpenAI's o1 opened the category in late 2024. Anthropic's extended-thinking Claude, Google's Gemini reasoning modes, DeepSeek R1, and a growing list of other reasoning variants followed. By 2026 reasoning is standard at the frontier of serious AI use. This guide explains what reasoning models actually are, how they differ from traditional models, when they genuinely help, when they do not justify their cost, and how to integrate them into production workflows.

The chain-of-thought breakthrough

The idea behind reasoning models has been around for years. "Chain of thought" prompting — explicitly asking a language model to reason step by step before answering — was shown to dramatically improve accuracy on hard problems as early as 2022. The breakthrough of reasoning models is that chain-of-thought is built in. Rather than a prompting trick, it is a trained behaviour, and the model is taught specifically to use extensive internal deliberation effectively.

The training regime for reasoning models involves generating long reasoning traces, evaluating the correctness of the final answer, and reinforcing the traces that led to correct answers. Over billions of training examples, the model internalises that longer, more thorough reasoning on hard problems produces better outcomes.

The result is a qualitatively different kind of AI. When you give a reasoning model a hard problem, you can (optionally) watch the model spend seconds or minutes thinking — trying different approaches, checking work, backtracking, considering edge cases — before committing to an answer. For the right problems, this dramatically outperforms even frontier non-reasoning models.

o1, o3, and Claude extended thinking

The major reasoning models in 2026.

OpenAI o-series. o1 (launched late 2024), o3, and subsequent variants. Notably strong on mathematical and scientific reasoning. Accessible through ChatGPT Plus and Pro, and via the OpenAI API. Prices are typically 5-20x those of non-reasoning models, depending on the task.

Claude with extended thinking. A mode of Claude (Opus and Sonnet) where you enable extended deliberation. Configurable thinking-token budgets let you trade depth for cost. Available through Claude.ai, Anthropic API, AWS Bedrock, and GCP Vertex.

Gemini reasoning modes. Google's equivalent, integrated into Gemini. Strong on maths and scientific reasoning with good grounding capabilities.

DeepSeek R1 and successors. The Chinese-origin reasoning model that shocked the industry with its cost-efficiency. Open-weight; can be self-hosted for strong reasoning at much lower cost than closed alternatives.

All of these serve the same conceptual purpose but have different strengths on specific benchmarks. For the hardest tasks, benchmarking them on your specific problem matters more than relying on published scores.

When reasoning models earn their cost

Reasoning models cost more and take longer than non-reasoning models. They are not always worth it. The situations where they earn their premium.

Complex mathematical problems. Competition-level maths (AIME, Olympiad problems), advanced calculus, proof-writing — reasoning models dramatically outperform non-reasoning ones.

Scientific reasoning. Chemistry problems requiring multi-step analysis, physics derivations, biology questions involving complex causal chains. The extended deliberation catches errors that quick responses miss.

Code requiring careful algorithmic thought. Non-trivial algorithms, systems design questions, debugging that requires considering many cases. Reasoning models often produce markedly better code for these problems.

Multi-step analytical questions. Business analysis requiring integrated reasoning across data, legal analysis with multiple factors, strategic planning with tradeoffs. Reasoning models handle the integration better.

High-stakes decisions. Even if a regular model could handle it, for decisions that matter, the extra compute for extended thinking is cheap insurance against plausible-but-wrong answers.

The common pattern: reasoning models are worth their premium for the subset of queries that are genuinely hard. For routine queries, they are overkill.

When reasoning models are overkill

Equally important: tasks where reasoning models add cost without adding value.

Simple factual questions. "What is the capital of France?" does not benefit from extended deliberation.

Well-defined code completions. If the task is clear and the pattern is common, a fast non-reasoning model produces the same output faster and cheaper.

Creative writing. Reasoning models do not consistently write better creative content. Sometimes they over-deliberate and produce stiffer prose.

High-throughput classification and extraction tasks. For operations at scale, the per-query latency and cost of reasoning models is prohibitive without compensating quality gains.

Chat applications with real-time latency requirements. Reasoning models are slower. Users waiting 30+ seconds for an answer feel the lag.

Cost-sensitive bulk processing. Reasoning modes used indiscriminately blow up API bills fast.

How chain of thought actually works inside the model

Under the hood, reasoning models generate long internal thought traces before their final answer. A typical reasoning trace for a hard problem might include hypothesis generation, trying approaches and abandoning them, intermediate calculations, consistency checks, and considered judgement about the best final answer.

These traces are trained to be useful, not theatrical. The reasoning does real cognitive work — the final answer emerges from the deliberation, not in spite of it. A reasoning model with its thinking suppressed would perform worse than the same model with thinking enabled.

Different vendors handle the thinking visibility differently. OpenAI's o-series generally hides the reasoning traces; Claude's extended thinking makes them visible by default; DeepSeek R1 shows detailed reasoning openly. Visible reasoning helps users understand and verify; hidden reasoning protects certain business interests of the provider.

For production use, whether you need visible reasoning depends on your use case. For user-facing applications with technical audiences, visible reasoning builds trust. For consumer products, hidden reasoning with clean final output often produces better UX.

Thinking budgets and control

Modern reasoning APIs let you configure how much thinking to allow per request.

Claude's extended thinking lets you set a budget in thinking tokens — typically anywhere from 1,000 to 64,000. Higher budgets allow more deliberation but cost more and take longer.

OpenAI's o-series offers a "reasoning effort" parameter ("low", "medium", "high") that similarly controls depth.

This control is more important than it first appears. For applications that mix easy and hard queries, adaptive thinking budgets — low for easy queries, high for hard ones — produce dramatic cost improvements without sacrificing quality.

The practical pattern. Classify incoming queries by difficulty (with a cheap classifier). Route easy queries to non-reasoning models or low thinking budgets. Route hard queries to reasoning models with generous budgets. The combined cost is much lower than using reasoning uniformly, with similar quality.
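The practical pattern above can be sketched in a few lines. The keyword heuristic, model names, and budget numbers below are placeholder assumptions for illustration, not vendor defaults:

```python
# Difficulty-based routing sketch. The heuristic, model names, and budget
# numbers are illustrative assumptions, not vendor recommendations.

REASONING_KEYWORDS = ("prove", "derive", "optimise", "debug", "why does")

def classify_difficulty(query: str) -> str:
    """Cheap stand-in for a learned difficulty classifier."""
    if len(query) > 400 or any(k in query.lower() for k in REASONING_KEYWORDS):
        return "hard"
    return "easy"

def route(query: str) -> dict:
    """Pick a model tier and thinking budget for a query."""
    if classify_difficulty(query) == "hard":
        return {"model": "reasoning-model", "thinking_budget_tokens": 16_000}
    return {"model": "fast-model", "thinking_budget_tokens": 0}

assert route("What is the capital of France?")["model"] == "fast-model"
assert route("Debug this intermittent race condition")["model"] == "reasoning-model"
```

In production the stub classifier would be replaced by a small model or a tuned heuristic, but the shape of the decision — tier plus budget per query — stays the same.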

Integrating reasoning into production applications

A production-ready pattern for leveraging reasoning models.

Step 1: default to fast non-reasoning models. Most queries route here. Cheap, fast, good enough for the majority of cases.

Step 2: identify hard queries. Either through explicit classification (a small classifier predicts difficulty), heuristics (question length, presence of reasoning keywords), or fallback (non-reasoning model produces a response it is uncertain about).

Step 3: escalate hard queries to reasoning models. Use appropriate thinking budgets; the hardest queries get the largest budgets.

Step 4: cache aggressively. Reasoning outputs for identical or near-identical queries can often be cached, amortising the expensive computation across many users.

Step 5: monitor costs and quality. Reasoning is expensive; track spend by query type. If certain query types are costing too much, consider tuning the classification or caching.
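The caching in step 4 can be as simple as hashing a normalised query. A minimal sketch, assuming a single-process dictionary stands in for what would realistically be a shared store (Redis or similar) with near-duplicate matching:

```python
import hashlib

# Minimal reasoning-output cache (illustrative; a production system would
# use a shared store and semantic near-duplicate matching).
_cache: dict[str, str] = {}

def _key(query: str) -> str:
    # Normalise whitespace and case so trivially different phrasings collide.
    return hashlib.sha256(" ".join(query.lower().split()).encode()).hexdigest()

def answer(query: str, call_reasoning_model) -> str:
    k = _key(query)
    if k not in _cache:
        _cache[k] = call_reasoning_model(query)  # the expensive step
    return _cache[k]

calls = 0
def fake_model(q: str) -> str:
    global calls
    calls += 1
    return "expensive answer"

answer("Why is the sky   blue?", fake_model)
answer("why is the sky blue?", fake_model)  # normalised hit: no second call
assert calls == 1
```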

This pattern is standard in mature reasoning-model applications. Teams that try to use reasoning for everything burn through budgets; teams that use it selectively get most of the benefit at a fraction of the cost.

Evaluating reasoning quality

How to know whether reasoning is actually helping on your tasks.

Build a test set of hard queries where you know the right answers. Compare outputs from a non-reasoning model and a reasoning model on the same queries.

Grade not just final accuracy but the quality of reasoning. If the final answer is right, does the reasoning chain also make sense? Hollow reasoning that lands on the right answer by coincidence is less reliable.

Track failure modes. Reasoning models sometimes produce long, confident-sounding reasoning that is nonetheless wrong. The mistakes are more subtle than with non-reasoning models; the reasoning looks plausible, and the wrong answer is buried in otherwise good logic.

Measure cost-per-correct-answer. If a reasoning model is 20x more expensive but only answers 10% more queries correctly, its cost per correct answer is far higher than the non-reasoning option's.
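The metric is a one-line division. The prices and accuracies below are purely illustrative assumptions:

```python
def cost_per_correct(cost_per_query: float, accuracy: float) -> float:
    """Expected spend per correct answer."""
    return cost_per_query / accuracy

# Illustrative numbers only: a fast model at $0.01/query and 70% accuracy
# versus a reasoning model at 20x the price and 80% accuracy.
fast = cost_per_correct(0.01, 0.70)       # ~$0.014 per correct answer
reasoning = cost_per_correct(0.20, 0.80)  # $0.25 per correct answer
assert reasoning > fast
```

Under these assumptions the reasoning model is worse per correct answer despite being more accurate — exactly the trap the metric exists to catch.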

The downside: reasoning hallucination

Reasoning models do not eliminate hallucination. They introduce a specific new failure mode: plausible-looking but wrong reasoning chains.

A reasoning model can walk confidently through a false chain of logic. Each step looks reasonable; the conclusion follows from the chain; but one of the intermediate steps was a hallucination that the model failed to catch.

This failure mode is particularly insidious because the reasoning chain provides a plausible-sounding justification for the wrong answer. Reviewers who skim the reasoning and accept the final conclusion may miss that a crucial intermediate step was invented.

Countermeasures. Verify key intermediate steps for high-stakes applications. Cross-check with a different model. Prefer tool-use-augmented reasoning where possible (having the model use calculators, verified databases, or code execution to check intermediate steps rather than just reasoning about them).
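One hedged sketch of the tool-use countermeasure: re-evaluate a claimed arithmetic step from the reasoning trace instead of trusting the model's in-head calculation. The `check_step` helper is hypothetical; real systems would run model-generated code in a sandbox rather than a hand-rolled evaluator:

```python
import ast
import operator

# Safe evaluator for simple arithmetic expressions pulled from a trace.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node):
    if isinstance(node, ast.Expression):
        return _eval(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")

def check_step(expression: str, claimed: float, tol: float = 1e-9) -> bool:
    """True if the claimed value matches the actually computed one."""
    return abs(_eval(ast.parse(expression, mode="eval")) - claimed) <= tol

assert check_step("17 * 24", 408)      # the model's intermediate step holds
assert not check_step("17 * 24", 398)  # a hallucinated intermediate result
```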

Reasoning models are a net improvement but not a panacea. The discipline of verifying outputs remains essential.

Reasoning plus tool use

The most powerful combination in 2026: reasoning models with tool use.

A reasoning model with a code interpreter can reason about a maths problem and run code to verify calculations. A reasoning model with web search can reason about a factual question and verify specific claims. A reasoning model with a database tool can reason about analytical questions and query real data.

This combination addresses the reasoning-hallucination failure mode. The model reasons, but it also verifies via tools. The combined system is more reliable than reasoning or tools alone.

Most major reasoning models support tool use natively. The integration is straightforward via API. For production applications where accuracy matters, reasoning-plus-tools is the right default pattern.

A worked example: using reasoning for bug investigation

Making the value concrete. Consider a hard production bug — an intermittent race condition in a distributed system that only appears under specific load patterns.

Non-reasoning approach. Paste the symptoms and code into Claude Sonnet, get a list of plausible causes, try the most likely first. After three wrong hypotheses and several hours, eventually find the real cause.

Reasoning approach. Paste the same context into Claude with extended thinking or o3. The model spends a few minutes internally considering many hypotheses, eliminating ones that do not fit the specific failure pattern, and focusing on the ones consistent with all the evidence. The final answer is often correct on first try.

Cost comparison. The non-reasoning approach used moderate compute but wasted hours of engineer time. The reasoning approach used 10x more compute but saved hours. At typical engineer hourly rates versus reasoning-model costs, the reasoning approach usually wins economically for hard bugs.
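The economics can be sanity-checked with a back-of-envelope calculation. Every number below is an assumption for illustration, not a measured figure:

```python
# Back-of-envelope economics for the bug-investigation example.
# All numbers are illustrative assumptions, not measured figures.

engineer_rate = 100.0      # $/hour, assumed
fast_api_cost = 0.50       # $ per investigation, assumed
reasoning_api_cost = 5.0   # ~10x the compute, assumed

# Non-reasoning: 4 hours chasing wrong hypotheses. Reasoning: right first try,
# 1 hour to confirm and fix.
fast_total = fast_api_cost + 4 * engineer_rate
reasoning_total = reasoning_api_cost + 1 * engineer_rate

assert reasoning_total < fast_total  # $105.00 vs $400.50 under these assumptions
```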

The value of reasoning models for hard problems is rarely about the API cost in isolation. It is about the total engineering time saved when the model gets the answer right on the first attempt.

Cost patterns that work

Concrete cost-management patterns for reasoning-heavy applications.

Routing. Classify each query's difficulty; route accordingly. A small non-reasoning model can classify queries cheaply; the savings from routing easy queries away from reasoning models are substantial.

Thinking budget tuning. For queries that do use reasoning, use the minimum thinking budget that produces acceptable quality. Over-budgeting wastes compute.

Caching. Cache reasoning outputs aggressively. For queries that recur, the cached response is essentially free.

Tiered models. Use cheaper reasoning variants (Sonnet with extended thinking) for most cases, escalating to premium reasoning (Opus with extended thinking, o3) only when quality demands it.

Batch processing. For non-latency-sensitive reasoning workloads, batch processing at discount rates can reduce costs substantially.
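The tiered-model pattern can be sketched as an escalation ladder: try the cheap tier first and escalate only when its answer fails a confidence check. The tier names and the confidence predicate are illustrative stand-ins:

```python
# Tiered escalation sketch. Tier names and the confidence check are
# illustrative assumptions, not vendor APIs.

def solve(query: str, tiers, confident) -> tuple[str, str]:
    """tiers: list of (tier_name, call_fn), cheapest first.
    Returns (tier_name, answer), escalating until an answer passes
    the confidence check or the premium tier is reached."""
    for name, call in tiers[:-1]:
        answer = call(query)
        if confident(answer):
            return name, answer
    name, call = tiers[-1]  # last tier is the premium fallback
    return name, call(query)

tiers = [
    ("sonnet-thinking", lambda q: "unsure"),
    ("opus-thinking", lambda q: "final answer"),
]
tier, ans = solve("hard query", tiers, confident=lambda a: a != "unsure")
assert tier == "opus-thinking"
```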

Teams that apply all these patterns often run reasoning-heavy applications at costs 5-10x lower than naive deployments.

Common mistakes with reasoning models

Anti-patterns.

Using reasoning for everything. The most common mistake. It blows up your budget without improving quality on easy queries.

Ignoring the hallucination failure mode. Reasoning traces look authoritative but can be wrong. Verify high-stakes outputs.

Skipping tool use. Reasoning alone is good; reasoning plus tools is much better for factual work.

Setting reasoning budgets blindly. The default budgets are often too high for simple reasoning tasks and too low for the hardest ones. Tune per task type.

Not monitoring costs. Reasoning bills can surprise you quickly. Track by query type and iterate.

Assuming faster reasoning models are coming soon. They are, but today's reasoning models have inherent latency from the extended thinking. Design UX around the current reality, not the hoped-for future.

What is coming next

Near-term developments in reasoning.

Faster reasoning. Model efficiency improvements are shrinking the latency gap between reasoning and non-reasoning models. Expect sub-second reasoning on moderate problems by 2027.

Integrated reasoning. The distinction between "reasoning mode" and "non-reasoning mode" may blur. Models may start reasoning automatically on hard queries without explicit mode selection.

More accessible reasoning. Open-weight reasoning models (DeepSeek R1 and successors) are closing the gap with closed reasoning. Self-hosted reasoning becomes economically attractive at scale.

Specialised reasoning models. Models tuned for specific domains (legal reasoning, medical diagnosis, scientific computation) will deliver better performance in their domains than general-purpose reasoning models.

Multi-agent reasoning. Orchestration patterns where multiple reasoning models collaborate on hard problems, cross-checking each other. Emerging in advanced AI systems today; will become standard.

Reasoning versus expertise: where humans still lead

A frank note: reasoning models are good at the kind of hard problems benchmark-creators can construct. They are less reliably good at the hard problems real expertise encounters, which often require knowledge the model does not have, judgement shaped by decades of experience, or creative insight that no training regime fully captures.

For medical diagnosis, legal reasoning, complex engineering design, creative strategy, and similar domains, reasoning models augment expert judgement but do not replace it. They handle the parts of the problem that are structurally well-defined; humans handle the parts that require taste, context, and wisdom.

Over time, this balance may shift further toward AI. But in 2026, the right framing is reasoning-models-as-powerful-assistants, not reasoning-models-as-substitutes-for-experts.

Reasoning models spend extra compute to think before answering, and they crush hard problems — but they are 5-20x slower and pricier. Route thoughtfully; do not use them for queries they cannot improve.

The short version

Reasoning models represent arguably the most significant AI capability shift since the public launch of ChatGPT. They dramatically outperform non-reasoning models on hard maths, science, coding, and analytical problems — at 5-20x the cost and latency. Use them selectively for the queries that benefit; use fast models for everything else. Thinking budgets, smart routing by difficulty, and aggressive caching of outputs keep costs sane. Pair reasoning with tool use for maximum reliability. Watch for reasoning-chain hallucinations in high-stakes applications. The field is still evolving rapidly, but reasoning modes have already become a permanent feature of any serious AI stack in 2026 and beyond.
