If you have ever typed a question into ChatGPT, Claude, or Gemini and received a coherent, contextual, often uncannily helpful reply, you have witnessed a Large Language Model in action. The reply probably felt effortless. It is anything but. Underneath sits a tangle of mathematics, a dataset the size of the public internet, and a training run that costs more than a city hospital. This guide walks through what actually happens when you send a prompt — tokens, embeddings, attention, feed-forward layers, sampling — using plain language, no heavy equations, and just enough precision to make the rest of the AI world suddenly make sense.

The core loop: next-token prediction, billions of times

A large language model is, fundamentally, a next-token predictor. That is the entire trick. Given a sequence of text, it predicts the most likely next token — roughly a syllable or a short word — based on everything it has seen before. Then it appends that prediction to the sequence and predicts the next one. Then the next. An entire chatbot reply is generated by doing this loop, one token at a time, at terrifying speed.

What feels like reasoning, conversation, and creativity is emergent behaviour from this simple loop running billions of times during training. The model does not "know" facts the way you do. It has absorbed statistical regularities from vast text, and those regularities turn out to encode what we would call knowledge, style, reasoning, and personality.

Think of it as the world's most overqualified autocomplete. When your phone's keyboard guesses the next word you will type, it is doing a miniature version of the same thing. The difference is that an LLM has been trained on trillions of words, with billions of parameters guiding its guesses, and the result is autocomplete that can hold a conversation.
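The loop above can be sketched in a few lines of Python. The `predict_next_token` stand-in below is a hypothetical hard-coded lookup table, not a real model — the point is the shape of the loop, not the prediction:

```python
def predict_next_token(tokens):
    # Stand-in for a real model: a toy bigram table mapping a token
    # to its "most likely" successor. Entirely invented for illustration.
    table = {"the": "cat", "cat": "sat", "sat": "down", "down": "."}
    return table.get(tokens[-1], ".")

def generate(prompt_tokens, max_new_tokens=10, stop="."):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = predict_next_token(tokens)  # one full forward pass per token
        tokens.append(nxt)                # append and go round again
        if nxt == stop:
            break
    return tokens

print(generate(["the"]))  # ['the', 'cat', 'sat', 'down', '.']
```

A real model replaces the lookup table with billions of parameters, but the outer loop — predict, append, repeat until a stop token — is exactly this.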

Step 1: Tokenisation — turning words into numbers

A neural network cannot read text. It only does math. The first job, before anything else, is to convert your prompt into numbers.

This is called tokenisation. A tokeniser breaks text into small pieces called tokens. A token is not a word; it is somewhere between a letter and a word. Common words are one token; less common words get broken into multiple tokens. Punctuation, spaces, capitalisation — all encoded as tokens too. For English, roughly one token equals three to four characters, or three-quarters of a word.

Each token maps to a unique integer in the model's vocabulary, which typically contains between 30,000 and 200,000 entries. So "hello world" might become, say, [15339, 1917] — a short list of numbers. Those numbers are the only thing the model ever sees.
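A toy greedy tokeniser makes the idea concrete. The vocabulary and integer IDs below are invented for illustration; real tokenisers learn their vocabularies from data using algorithms like byte-pair encoding:

```python
# Invented vocabulary: piece -> integer ID. Real vocabularies hold
# tens of thousands of learned subword pieces.
VOCAB = {"hello": 0, " world": 1, "wor": 2, "ld": 3, "he": 4, "llo": 5, " ": 6}

def tokenise(text):
    ids = []
    i = 0
    while i < len(text):
        # Greedily match the longest vocabulary piece starting at position i.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

print(tokenise("hello world"))  # [0, 1]
```

Note how "hello world" becomes two tokens because both pieces are common enough to earn their own vocabulary slots; a rarer string would fall back to smaller fragments like "wor" + "ld".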

Tokens matter because they are how AI services price you: every major LLM API charges per token in and per token out. They also explain why non-English languages and long or unusual words get more expensive — they break into more tokens than common English words do. Understanding tokenisation is the difference between a five-dollar prompt and a fifty-dollar one.

Step 2: Embeddings — numbers that mean something

Once the model has a list of token IDs, it looks each one up in a table called the embedding matrix. The embedding matrix is a giant lookup — one row per token in the vocabulary, each row a list of several thousand numbers. This list of numbers is the token's embedding: a dense vector that represents the token's meaning.

Here is where the magic of LLMs starts. In embedding space, similar tokens have similar vectors. The embedding for "king" sits close to the embedding for "queen." "Paris" is close to "London." "Run" is close to "running." The training process slowly shaped the embedding matrix so that semantic and syntactic relationships are encoded geometrically. This was the breakthrough of Word2Vec in 2013, and modern LLMs take the idea much further.

Every prompt you send, after tokenisation, gets looked up in the embedding matrix, yielding a sequence of vectors — one per token. This sequence is what flows through the rest of the network. None of this looks like language any more; it is pure geometry.
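A minimal sketch of what "similar tokens have similar vectors" means, using invented 4-dimensional embeddings (real models use thousands of dimensions, learned during training):

```python
import math

# Toy embeddings, invented for illustration only.
EMB = {
    "king":   [0.9, 0.8, 0.1, 0.0],
    "queen":  [0.8, 0.9, 0.1, 0.1],
    "banana": [0.0, 0.1, 0.9, 0.8],
}

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

print(cosine(EMB["king"], EMB["queen"]))   # close to 1.0
print(cosine(EMB["king"], EMB["banana"]))  # close to 0.0
```

The geometry is the whole trick: "king" and "queen" point in nearly the same direction, "banana" points somewhere else entirely.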

Step 3: The transformer block — where attention happens

The stack of embeddings is fed through many identical layers called transformer blocks. A modern LLM has dozens, sometimes over a hundred, stacked in sequence. Each block transforms the sequence of embeddings, enriching them with context.

Inside each block is the star of the show: the attention mechanism. Attention lets every token look at every other token in the sequence and decide how much each of them matters. Consider the sentence "The trophy did not fit in the suitcase because it was too big." What does "it" refer to — the trophy or the suitcase? You know by context. Attention is how the model does the same. When processing "it," the model learns to weight the embedding of "trophy" heavily and the embedding of "suitcase" less, because the sentence's logic requires it.

Attention is computed in parallel for every token against every other token, making it fantastically expensive — compute grows with the square of sequence length. This is why long contexts get so costly. Each block runs its own version of attention, and stacked together they let the model build increasingly abstract representations of the sequence.
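A bare-bones sketch of scaled dot-product attention in plain Python, with invented 2-dimensional vectors. Real implementations add learned query/key/value projections, multiple heads, and causal masking — but the nested loop over tokens, which is where the quadratic cost comes from, is visible here:

```python
import math

def softmax(xs):
    m = max(xs)                         # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(queries, keys, values):
    """Each token's output is a weighted mix of every token's value vector."""
    d = len(queries[0])
    out = []
    for q in queries:                   # for every token...
        # ...score it against every other token: this is the n^2 part.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)       # how much each token matters
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# Three tokens with 2-dimensional vectors (illustrative numbers only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(attention(x, x, x))  # each row is a blend of all three inputs
```

Doubling the number of tokens doubles both loops, quadrupling the work — the quadratic scaling described above, in miniature.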

Modern optimisations (multi-query attention, grouped-query attention, flash attention, mixture of experts) change the details but not the fundamental picture: attention is the engine by which context flows through the model.

Step 4: The feed-forward layers — where knowledge lives

After attention, each transformer block passes the context through a feed-forward neural network — a pair of large dense layers with a non-linear function in between. This is where most of the model's parameters actually live. While attention is the routing mechanism, the feed-forward layers are where the learned "knowledge" of the model is stored.
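A minimal sketch of one position-wise feed-forward block, with invented weights: expand the vector into a wider hidden layer, apply a non-linearity (GELU here, as in most modern transformers), and project back down:

```python
import math

def gelu(x):
    # Tanh approximation of GELU, the non-linearity used in most modern LLMs.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def feed_forward(x, w_up, w_down):
    """Expand, apply non-linearity, project back. The weight matrices are
    invented here; in a real model they hold most of the parameters."""
    hidden = [gelu(sum(xi * w for xi, w in zip(x, row))) for row in w_up]
    return [sum(hi * w for hi, w in zip(hidden, row)) for row in w_down]

# A 2-dim input expanded to 4 hidden units, then projected back to 2 dims.
w_up = [[0.5, -0.2], [0.1, 0.9], [-0.3, 0.4], [0.7, 0.2]]
w_down = [[0.2, -0.1, 0.5, 0.3], [0.4, 0.6, -0.2, 0.1]]
print(feed_forward([1.0, 0.5], w_up, w_down))
```

In production models the expansion factor is typically around four — a 4,096-dim embedding widens to roughly 16,000 hidden units — which is why these two matrices dominate the parameter count.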

If you ask a model "what is the capital of France," the attention mechanism figures out that this is a capital-lookup question. The feed-forward layers are where the answer "Paris" actually emerges from billions of weights that have been tuned to encode such associations during training. Recent research has shown that specific facts often live in specific neurons, and you can even edit a model's knowledge by surgically modifying those neurons — though in practice this is still experimental.

The key insight: the attention mechanism gets most of the press, but the feed-forward layers are where the heavy lifting of storing and retrieving information happens.

Step 5: The output — converting back to text

After passing through all the transformer blocks, each position in the sequence has a final enriched embedding. For next-token prediction, only the very last position matters. The model projects that final vector onto the vocabulary — every possible token gets a score. Higher scores mean more likely next tokens.

A normalisation step called softmax turns those scores into probabilities. Then the model samples from the distribution: "the" might be 35% likely, "a" 12%, "some" 3%, and so on. Depending on the sampling strategy (greedy, top-k, nucleus, temperature-adjusted), the model picks one. That token is appended to the input sequence, and the whole forward pass runs again to generate the next one. And the next. Until either a stop token is generated or a length limit is hit.
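The sampling step can be sketched directly. The three-token score table below is hypothetical — a real model emits one score per vocabulary entry — but the greedy, top-k, and temperature mechanics are the same:

```python
import math, random

def sample_next(scores, temperature=1.0, top_k=None, rng=random):
    """Turn raw scores (logits) into probabilities and sample one token."""
    items = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        items = items[:top_k]              # keep only the k highest scorers
    if temperature == 0:
        return items[0][0]                 # greedy: always the top token
    exps = [math.exp(s / temperature) for _, s in items]
    total = sum(exps)
    probs = [e / total for e in exps]      # softmax over the candidates
    r = rng.random()
    cum = 0.0
    for (tok, _), p in zip(items, probs):
        cum += p
        if r <= cum:
            return tok
    return items[-1][0]

scores = {"the": 3.1, "a": 2.0, "some": 0.5}
print(sample_next(scores, temperature=0))    # always "the"
print(sample_next(scores, temperature=1.0))  # usually "the", sometimes "a"
```

Greedy decoding (temperature zero) gives the same reply every time; any positive temperature re-introduces randomness, which is why the same prompt can yield different answers.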

This autoregressive generation is why LLMs produce text word by word rather than all at once, and why you can watch the reply streaming in real time. Each token costs one full pass through the model.

How training shaped all those parameters

An LLM has billions or trillions of parameters: the weights inside attention mechanisms, inside feed-forward layers, inside the embedding matrix, inside layer-normalisation scales. Every one starts at a random value. Training is the process of adjusting them to do useful work.

Training happens in two broad phases. First, pretraining: the model is fed trillions of tokens of text scraped from the internet, books, code, and papers. Its task is simple — predict the next token, compare to the actual next token, and nudge the weights to reduce the error. Repeat trillions of times. After months of this on specialised hardware, the model has absorbed broad linguistic, factual, and reasoning patterns.
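The error being reduced has a precise form: cross-entropy, the negative log of the probability the model assigned to the token that actually came next. A sketch with a hypothetical prediction:

```python
import math

def cross_entropy(predicted_probs, actual_token):
    """Pretraining loss for one position: negative log-probability of the
    token that actually followed. Lower is better; zero means certainty."""
    return -math.log(predicted_probs[actual_token])

# Hypothetical model output after seeing "the capital of France is".
probs = {"Paris": 0.70, "Lyon": 0.05, "the": 0.10, "a": 0.15}
print(cross_entropy(probs, "Paris"))  # small loss: the model was confident
print(cross_entropy(probs, "Lyon"))   # large loss: weights get nudged harder
```

Backpropagation turns that loss into a tiny adjustment for every weight, and "repeat trillions of times" is all there is to pretraining.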

Second, alignment: the raw pretrained model is unpleasant to talk to — it will complete whatever you start rather than answer questions. So a second phase trains it to behave helpfully, honestly, and harmlessly using supervised examples and reinforcement learning from human feedback (RLHF). This is what turns a raw language model into a product like ChatGPT or Claude.

Modern alignment also includes constitutional training (Anthropic's approach), synthetic preference data, and advanced techniques like direct preference optimisation. The outcome is a model that speaks in friendly, structured prose, refuses harmful requests, and follows instructions.

Context windows and why long conversations get expensive

The context window is how many tokens a model can see at once: your prompt plus its generated reply. Early GPT models had a 2,000-token window. GPT-4 launched at 8K. Claude later offered 200K. Gemini went to 1M. Specific Claude variants pushed to multi-million-token windows in 2026.

Attention's quadratic scaling means that doubling the context makes attention four times more expensive. API providers absorb some of that with careful engineering, but the cost still falls on you. A 100K-token prompt costs roughly a hundred times a 10K-token prompt in attention alone, before the reply is even generated. This is why serious applications cache repeated context (prompt caching) and use retrieval (RAG) to fetch only the relevant pieces.
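The scaling arithmetic is one line: under pure quadratic cost, the relative expense is the square of the length ratio.

```python
def attention_cost_ratio(long_ctx, short_ctx):
    """Relative attention compute, assuming pure quadratic scaling in
    sequence length (real systems soften this with engineering tricks)."""
    return (long_ctx / short_ctx) ** 2

print(attention_cost_ratio(100_000, 10_000))  # 100.0
print(attention_cost_ratio(20_000, 10_000))   # 4.0
```

Ten times the tokens, a hundred times the attention compute — which is why trimming context and caching repeated prefixes pay off so quickly.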

Long context also degrades quality. Every frontier model performs worse the further a fact is from where it needs to use it — the "needle in a haystack" problem. The 2024–2026 reasoning models are markedly better than their predecessors, but the degradation is still real.

Emergent behaviour: where reasoning comes from

What is genuinely surprising about LLMs is that you can train them on "predict the next token" and out pops something that can write code, play chess, summarise court cases, and explain a mathematical proof. This is called emergent behaviour. It is not separately programmed. It emerges because predicting the next token, done at scale, requires the model to internally simulate reasoning, knowledge, and intention.

This does not mean LLMs are sentient or "understanding" in the human sense. It means they have learned patterns of reasoning that look enough like human cognition to fool us most of the time. The mysterious gap between "statistical pattern recogniser" and "creature that can explain a joke" is one of the open research questions of the decade.

Where LLMs fail, and why

LLMs fail in predictable ways. They hallucinate — generate plausible but false content — because every token is the product of probability, not of lookup from a knowledge base. They forget instructions in long conversations when the earlier context gets pushed to the edges of attention. They struggle with precise arithmetic because numbers are not well represented in tokenised text. They are slow at multi-step logical reasoning unless deliberately coaxed with techniques like chain-of-thought prompting.

Recent reasoning models (OpenAI o1/o3, Claude with extended thinking, Gemini reasoning modes) attack these failures by spending more compute on internal deliberation before emitting a final answer. They are slower and more expensive, but noticeably better at maths, code, and science.

A worked example: "What is the capital of France?"

To tie the whole pipeline together, follow a single short prompt from keystroke to reply.

You type "What is the capital of France?" and hit send. The tokeniser splits this into, say, eight tokens, producing a list of integers. Each integer is looked up in the embedding matrix, producing eight dense vectors of several thousand numbers each. Those vectors enter the first transformer block.

Attention in the first block lets each token peek at every other. The token for "capital" learns that it is inside a question about France; the token for "France" learns it is being asked about geographically. The feed-forward layer that follows re-shapes the vectors, pushing them toward the region of embedding space that corresponds to geographic-capital queries.

This repeats through dozens of blocks. Each block makes the internal representation more abstract, more contextual, and more task-aware. By the final block, the model has built up a rich understanding — mathematically — of what is being asked.

Only the final position of the sequence is used for output. Its vector gets projected onto the full vocabulary. Every possible next token receives a score. "Paris" receives by far the highest score, because billions of training examples have taught the model that the text following "capital of France is" is overwhelmingly likely to be "Paris." The model samples "Paris," appends it, and runs the whole pipeline again. This time the highest-scoring next token might be a period, or a comma followed by additional explanation. It generates them one by one until an end-of-reply token is sampled.

The whole thing takes perhaps a few hundred milliseconds on a modern GPU. You see a streaming reply arrive word by word. What you are watching is the same forward pass running once per token, a few dozen times, each pass consuming billions of multiply-accumulate operations. That is the true cost of a conversational reply — and why API providers price in fractions of a cent per token.

What all this means for how you use LLMs

A few practical consequences once the internals click into place.

Prompt structure matters. Clear, well-structured prompts get better attention weights and therefore better outputs. Vague, rambling prompts get muddled attention and weaker responses.

Temperature controls creativity. A temperature near zero makes the model almost always pick the top-scoring token, producing deterministic answers. Higher temperatures widen the sampling pool, producing more varied and creative output.
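Under the hood, temperature is just a divisor applied to the scores before normalisation. A quick illustration with three invented scores:

```python
import math

def softmax_with_temperature(scores, temperature):
    # Divide scores by temperature before normalising: low temperature
    # sharpens the distribution, high temperature flattens it.
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [round(e / total, 3) for e in exps]

scores = [3.0, 2.0, 1.0]  # illustrative logits for three candidate tokens
print(softmax_with_temperature(scores, 0.5))  # sharply peaked on the top score
print(softmax_with_temperature(scores, 2.0))  # much flatter
```

At low temperature nearly all the probability mass piles onto the top-scoring token; at high temperature the runners-up get a real chance, which is where the "creativity" comes from.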

Context is money. Every token you add to a prompt costs compute at inference time. Trim aggressively; use retrieval; cache when you can.

Facts need grounding. Because the model is a probability engine, not a database, any factual claim should be grounded through retrieval, tool use, or verification before being trusted.


The short version

An LLM is a next-token predictor at its core. It tokenises your prompt, looks up embeddings, processes them through many transformer blocks that blend attention and feed-forward layers, and outputs a probability distribution over the next token. Repeat until done. The magic is not in any single step but in what billions of parameters, trained on trillions of tokens, can compute when the whole apparatus is pointed at a question you just typed. Everything else — API pricing, context limits, hallucinations, reasoning modes — follows from this core loop, and once you see it, the rest of the AI world suddenly makes sense. Pricing maps to tokens consumed; context limits map to the fixed-size attention window; hallucinations map to probability-based generation without grounding; reasoning modes map to intermediate token sequences that decompose problems before answering. The pattern is remarkably consistent once you have the underlying model in mind.
