From Reflex to Reason: How AI Learns to Think

The step-by-step guide from Base LLMs to Reasoning Models

Gus Levinson

June 3, 2025

Large Language Models (LLMs) exploded into public consciousness as the brains behind ChatGPT and have become a core pillar of modern AI. In this article, I’ll explore the two pivotal categories: Base LLMs, which learn language’s seemingly invisible patterns, and Reasoning LLMs, which build on that foundation to perform multi-step, deliberate thinking. You’ll discover how Base LLMs evolve into their more sophisticated counterparts, how they draw inspiration from human psychology, and why this progression represents a crucial step toward autonomous AI agents.

For clarity, I’ll focus primarily on OpenAI’s models, as they’ve pioneered many of the key breakthroughs we’ll discuss. However, companies like Anthropic, DeepSeek, and Google are equally pushing the boundaries of what’s possible.

Before diving into the distinction between base and reasoning LLMs, let’s start with the fundamentals: what exactly is an AI model?

Contents

  1. What is an AI Model?
  2. Base Models: The First Layer of Intelligence
  3. Beyond Base Models
  4. From Thinking to Acting

What is an AI Model?

An AI model is fundamentally a mathematical function. You give it inputs and it produces outputs, much like the simple equation y = x + z. If x = 1 and z = 2, then the function outputs y = 3.

AI models work in a similar way, but their equations are vastly more complex. Modern AI models like LLMs discover these intricate mathematical relationships through a process called training, where the model analyses enormous datasets to learn how inputs should map to outputs. This approach is called machine learning, where the AI learns patterns from data instead of following hand-coded rules.
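
To make this concrete, here’s a tiny, illustrative sketch in Python. Rather than hand-coding y = x + z, we hand a simple model a handful of example inputs and outputs (invented for illustration) and let gradient descent discover the relationship itself.

    # A minimal sketch of "learning" a function from data rather than hand-coding it.
    # The data below follows y = x + z; gradient descent discovers weights close to
    # a = 1 and b = 1 on its own.
    data = [(1, 2, 3), (2, 1, 3), (3, 3, 6), (0, 4, 4), (5, 2, 7)]  # (x, z, y)

    a, b = 0.0, 0.0                      # start with no knowledge of the rule
    learning_rate = 0.01

    for _ in range(5000):
        grad_a = grad_b = 0.0
        for x, z, y in data:
            error = (a * x + b * z) - y  # how far the current guess is off
            grad_a += 2 * error * x / len(data)
            grad_b += 2 * error * z / len(data)
        a -= learning_rate * grad_a      # nudge the weights to reduce the error
        b -= learning_rate * grad_b

    print(round(a, 2), round(b, 2))      # ≈ 1.0 1.0, i.e. the model "learned" y = x + z

Real LLMs learn in the same way in principle, only with billions of learned values rather than two.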

LLMs are a specific type of AI model that work with human language: they read text and generate text responses. While many leading LLMs can also process images, audio, and/or video, we’ll stick to text examples to keep things simple.


Base Models: The First Layer of Intelligence

The ChatGPT Breakthrough

Cast your mind back to November 30, 2022. A groundbreaking AI product had just been released to the public: ChatGPT. Within two months, it surpassed 100 million users, making it the fastest-growing consumer app in history.

ChatGPT is not the name of an AI model, but a product from OpenAI that lets you interact with their models. When it launched, it ran on GPT-3.5, which is what we now call a base model.

“GPT” stands for Generative Pretrained Transformer:

  • Generative means the AI can generate entirely new content
  • Pretrained refers to the model’s training process on huge quantities of text
  • Transformer is the underlying architecture, first introduced by Google researchers in 2017

Since ChatGPT’s launch, progress has been extraordinary. OpenAI has released newer base LLMs such as GPT-4 and GPT-4.1, each faster, more affordable, and more capable than the last.

How Base Models Learn Language

Base LLMs learn to imitate vast swathes of internet text through a process called pretraining. These models discover statistical patterns of which words tend to follow one another in different contexts, empowering them to mimic human language.

What’s remarkable is that all LLMs really do is predict the next token (roughly corresponding to a word) based on the context so far. Each input token is converted into numbers, passed through the model’s internal function, then decoded back into a token and added to the conversation. This loop continues, token by token, until the response is complete.

For example, if you prompt a base model to “Tell me a story”, it might begin like this:

  • Step 1 → “Once”
  • Step 2 → “Once upon”
  • Step 3 → “Once upon a”
  • Step 4 → “Once upon a time”

Why? Because the model has learned that “Once upon a time” is a statistically common story opening.
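
Here’s a minimal sketch of that loop. The lookup table is a made-up stand-in for the model’s internal function, and it always picks the single most likely continuation; as we’ll see next, real models don’t behave quite so rigidly.

    # A toy version of the token-by-token loop. The lookup table stands in for
    # the model's internal function; a real LLM computes its prediction from the
    # numeric representation of the whole context.
    toy_model = {
        (): "Once",
        ("Once",): "upon",
        ("Once", "upon"): "a",
        ("Once", "upon", "a"): "time",
    }

    def generate(max_tokens=10):
        context = []
        for _ in range(max_tokens):
            next_token = toy_model.get(tuple(context))
            if next_token is None:       # no learned continuation: stop
                break
            context.append(next_token)   # feed the new token back into the context
        return " ".join(context)

    print(generate())  # -> "Once upon a time"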

That’s not to say the model will always start a story with those same words. Under the hood, it estimates how likely each token in its vocabulary is to come next. For the prompt “Once upon a”, it might judge “time” to be 85% likely, “dream” 7%, “midnight” 5%, with the remaining 3% split across other tokens. It then draws a random number between 0 and 1. If it falls within the first 85%, the model selects “time”; if it lands in the next 7%, it picks “dream”; if it’s in the following 5%, it outputs “midnight”, and so on.

This process is known as sampling, and it’s what predominantly makes LLMs non-deterministic: the same prompt can yield different outputs on different runs. Since LLMs are modelling language, this behaviour suits the task. Language, by nature, is not fixed or formulaic. There are countless ways to express the same concept. If someone asked you to tell a story, you might begin differently depending on your audience or the context of the conversation. LLMs behave similarly, mirroring the open-ended nature of human language and thought.
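
Here’s what that single sampling step looks like in code, using the illustrative probabilities above (“land” is a made-up stand-in for the remaining 3% of other tokens):

    import random

    # One sampling step: pick the next token by seeing where a random draw
    # between 0 and 1 lands across the cumulative probabilities.
    distribution = [("time", 0.85), ("dream", 0.07), ("midnight", 0.05), ("land", 0.03)]

    def sample_next_token(distribution):
        r = random.random()                  # random draw between 0 and 1
        cumulative = 0.0
        for token, probability in distribution:
            cumulative += probability
            if r < cumulative:               # the draw landed in this token's slice
                return token
        return distribution[-1][0]           # guard against floating-point rounding

    print(sample_next_token(distribution))   # usually "time", occasionally something else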

This unpredictability is a key distinction between LLMs and traditional software. The latter is deterministic by design: given the same input, it always produces the same output. This determinism is fundamental to ensure code behaves reliably and systems act consistently. For example, a calculator will always output 4 when you input 2+2, no matter how many times you try. In contrast, LLMs are non-deterministic, which can make building with them very challenging. While it’s possible to configure a model to respond more consistently, doing so often comes at the expense of answer creativity and quality. LLMs tend to perform best when given the freedom to traverse language with greater flexibility.
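
One common consistency knob is the sampling temperature, which rescales the model’s raw token scores before they become probabilities. Here’s a minimal sketch with invented scores: a low temperature concentrates nearly all the probability on one token (consistent but predictable), while a higher temperature spreads it out (varied but less reliable).

    import math

    # Illustrative raw scores ("logits") for a few candidate next tokens.
    logits = {"time": 4.0, "dream": 1.5, "midnight": 1.2, "land": 0.5}

    def probabilities(logits, temperature):
        # Divide each score by the temperature, then normalise with a softmax.
        scaled = {token: score / temperature for token, score in logits.items()}
        total = sum(math.exp(s) for s in scaled.values())
        return {token: math.exp(s) / total for token, s in scaled.items()}

    print(probabilities(logits, temperature=0.2))  # nearly all the mass lands on "time"
    print(probabilities(logits, temperature=1.5))  # probability spreads across the tokens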

After pretraining, base models undergo post-training, which typically involves two key steps. First, supervised fine-tuning trains the model on carefully curated examples of high-quality conversations, teaching it how to be a helpful assistant rather than just mimicking random internet text. Then comes Reinforcement Learning from Human Feedback (RLHF), where the model generates multiple possible responses to a prompt, and a human ranks them based on how well they align with desired values or behaviours. This feedback is used to fine-tune the model’s outputs, nudging it toward responses that are more helpful, safe, and aligned with human preferences.
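
To visualise these two steps, here are made-up examples of the kinds of data involved (real post-training datasets are far larger and much more carefully curated):

    # Supervised fine-tuning: curated conversations showing how an assistant should respond.
    sft_example = {
        "messages": [
            {"role": "user", "content": "How do I boil an egg?"},
            {"role": "assistant", "content": "Place the egg in boiling water for 6-10 minutes, "
                                             "depending on how firm you like the yolk."},
        ]
    }

    # RLHF: the model's candidate responses to one prompt, ranked by a human.
    preference_example = {
        "prompt": "Explain photosynthesis to a 10-year-old.",
        "responses_ranked_best_to_worst": [
            "Plants use sunlight, water and air to make their own food...",
            "Photosynthesis is the conversion of light energy into chemical energy...",
            "I don't know.",
        ],
    }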

How AI Thinks: Fast and Slow

AI has long been inspired by neuroscience and psychology. After all, if you’re building an artificial intelligence, it makes sense to draw from the most advanced and relatable form of intelligence we know: the human brain. Cognitive psychology identifies two distinct modes of thought: System 1 and System 2.

System 1 refers to fast, intuitive and automatic responses. These are the kinds of answers that come to mind instantly, like your name or the names of your family members. You don’t have to consciously think them through; they arise automatically.

System 2 thinking is slower, more deliberate and logically structured. Think of the kind of thoughts required to solve a math problem or plan a complex project.

Base LLMs are instinctively System 1 thinkers. These models excel at producing fast, fluent, knee-jerk responses. But many of the most valuable human tasks require something more: multi-step reasoning, careful deliberation, and the ability to follow a logical chain of thought. In other words, System 2 thinking.

Chaining Thoughts

You can work around a base LLM’s instinctive System 1 thinking and encourage System 2 reasoning by prompting it to “think step by step.” This technique is known as Chain-of-Thought (CoT) prompting, and it often dramatically improves the quality and reliability of the model’s responses, especially for more complex and valuable tasks.

This makes intuitive sense. A knee-jerk response to a complex question is likely to be wrong. But thinking it through methodically greatly improves your chances of getting it right. The same goes for LLMs: prompting them to reason step by step generally leads to more accurate and reliable answers. For example:

Prompt:

Tom has 3 apples. He buys 2 more apples. Then he gives 1 apple to his friend. How many apples does Tom have now?
Think step by step.

Model output:

  1. Tom starts with 3 apples.
  2. He buys 2 more apples: 3 + 2 = 5
  3. He gives 1 apple to a friend: 5 - 1 = 4

While this is a simple example, the underlying concept is incredibly powerful. By encouraging step-by-step reasoning, Chain-of-Thought prompting allows base models to tackle far more complex problems than they otherwise could, greatly expanding their usefulness and value.
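
In practice, applying Chain-of-Thought prompting can be as simple as appending the instruction to your prompt. Here’s a minimal sketch using the OpenAI Python SDK; the model name is illustrative, and an API key is assumed to be available in the environment.

    from openai import OpenAI

    # Minimal Chain-of-Thought prompting sketch.
    # The client reads OPENAI_API_KEY from the environment.
    client = OpenAI()

    question = (
        "Tom has 3 apples. He buys 2 more apples. "
        "Then he gives 1 apple to his friend. How many apples does Tom have now?"
    )

    response = client.chat.completions.create(
        model="gpt-4.1",  # illustrative model name
        messages=[{"role": "user", "content": question + "\nThink step by step."}],
    )

    print(response.choices[0].message.content)  # a step-by-step worked answer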

Tiny wording, huge lift: how researchers nudged an LLM into step-by-step reasoning.

However, Chain-of-Thought only works when the model is explicitly prompted to reason step by step. What if models could do this kind of reasoning automatically, without needing to be asked?


Beyond Base Models

Reasoning LLMs transform base models from fast, instinctive responders into deliberate, methodical thinkers. The goal is to make step-by-step reasoning the default behavior, where the model works through each part of a problem systematically rather than giving quick, surface-level answers. To achieve this, we can take inspiration from how humans learn.

Human Learning, Machine Learning

At a high level, humans learn in two ways: by imitation and by trial and error.

Imitation is when a child learns by observing their parents, like how to sit on a chair or drink from a glass.

Trial and error is when a child learns through curiosity and experimentation. If a child touches a hot stove while trying to help in the kitchen, they quickly learn from that painful moment to be more careful around heat.

When AI learns, it uses similar approaches. Imitation maps to supervised learning, where the model learns by observing correct and incorrect examples. Trial and error mirrors reinforcement learning, where the model explores different approaches and is rewarded for good ones and penalised for bad ones, gradually learning which choices lead to better outcomes.

AI learns through imitation (supervised learning; left) and trial and error (reinforcement learning; right).
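
To make the trial-and-error side concrete, here’s a deliberately tiny sketch; it is not how LLM training works in practice, but it shows the core loop of acting, being rewarded, and updating.

    import random

    # A toy trial-and-error learner: it tries two approaches, is rewarded more
    # often for the better one, and gradually learns to prefer it.
    reward_probability = {"approach_a": 0.2, "approach_b": 0.8}  # hidden from the agent
    value_estimate = {"approach_a": 0.0, "approach_b": 0.0}      # what the agent believes
    counts = {"approach_a": 0, "approach_b": 0}

    for step in range(1000):
        # Mostly exploit the best-looking approach, but keep exploring occasionally.
        if random.random() < 0.1:
            choice = random.choice(list(value_estimate))
        else:
            choice = max(value_estimate, key=value_estimate.get)
        reward = 1.0 if random.random() < reward_probability[choice] else 0.0
        counts[choice] += 1
        # Update the running average reward for the chosen approach.
        value_estimate[choice] += (reward - value_estimate[choice]) / counts[choice]

    print(value_estimate)  # "approach_b" ends up valued far higher than "approach_a"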

Instinctive Reasoning

Reasoning LLMs are essentially base LLMs that have been further trained to imitate high-quality chains of thought.

AI researchers begin by prompting a base model to “think step by step” on a range of problems. They collect the best reasoning examples and fine-tune the model using supervised learning. This teaches the model to reason systematically by default, similar to how it first learned language patterns during pretraining.
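
A collected reasoning example might look something like this (the problem and steps are invented for illustration):

    # An illustrative reasoning trace of the kind gathered for this fine-tuning stage:
    # a problem, the step-by-step chain of thought, and the final answer.
    reasoning_example = {
        "problem": "A train travels 120 km in 2 hours. How far does it travel in 5 hours at the same speed?",
        "chain_of_thought": [
            "Speed = 120 km / 2 h = 60 km/h.",
            "Distance in 5 hours = 60 km/h * 5 h = 300 km.",
        ],
        "answer": "300 km",
    }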

The key breakthrough: self-correction. Reasoning models can evaluate whether each step brings them closer to their goal and backtrack when needed.

Imagine trying to solve a complex problem. You might start with one approach, realise halfway through that it’s leading nowhere, then backtrack and try a different method. The ability to recognise when you’re on the wrong track and course-correct is essential for tackling difficult challenges.

Through reinforcement learning during training, reasoning models learn to recognise the patterns of productive vs unproductive reasoning. This enables them to evaluate their own chains of thought as they work through problems, learn from missteps in real-time, and effectively traverse the tree of logical possibilities by abandoning the dead ends and exploring more promising alternatives.
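
As an analogy (not the model’s actual mechanism), you can picture this as a search over a tree of candidate reasoning steps, pruning the branches judged to be dead ends and backtracking to try more promising ones:

    # A toy search over a hand-written "reasoning tree": branches judged to be
    # dead ends are abandoned, and the search backtracks to try alternatives.
    reasoning_tree = {
        "start": ["try algebra", "guess and check"],
        "try algebra": ["dead end"],
        "guess and check": ["found answer"],
    }

    def solve(step, path):
        path = path + [step]
        if step == "found answer":
            return path                      # a productive chain of thought
        for next_step in reasoning_tree.get(step, []):
            if next_step == "dead end":      # evaluate the step; backtrack if unpromising
                continue
            result = solve(next_step, path)
            if result:
                return result
        return None                          # nothing promising below this step

    print(solve("start", []))  # -> ['start', 'guess and check', 'found answer']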

At OpenAI, all reasoning models start with the letter “o”. Yes, even they acknowledge that the naming convention is ridiculous. The first reasoning model to be released was o1, followed more recently by o3 and o4-mini. The latter two models not only rank among the most capable available today, but also represent a major leap in independence and autonomy.

From Thinking to Acting

We’ve seen how reasoning models can think through complex problems step by step. But what happens when you combine this reasoning ability with the power to take action in the real world? In The Road to Autonomy, releasing next week, I explore how AI agents are evolving from passive chatbots into autonomous digital workers that can independently complete complex tasks.

We’re hiring

If you’re interested in solving hard problems in the legal space by building and leveraging agentic AI systems, check out our open roles and get in touch via our Careers Page. Even if nothing listed quite matches your experience, feel free to connect and message our CTO, Andrew Thompson, directly on LinkedIn. He’s always happy to have a casual chat over video or coffee.

Gus Levinson

AI Engineer