If you trace every modern AI product you use — ChatGPT, Claude, Gemini, Midjourney, Sora, GitHub Copilot — back to a single technical idea, you land on a 2017 research paper titled "Attention Is All You Need." That paper introduced the transformer architecture, and the entire subsequent decade of AI has been essentially one long application of its ideas. This guide walks through what a transformer is, how it works, why it swept away the previous generation of neural networks, and why it has turned out to apply far beyond the text problem it was originally designed for — all without a single equation.
The 2017 paper that changed everything
Before 2017, the dominant architecture for sequence data — language, speech, time series — was the recurrent neural network, or RNN, most famously in its LSTM variant. RNNs processed sequences one step at a time, carrying a hidden state from step to step. They worked well enough to power Google Translate, Siri, and a lot of other products, but they had two durable problems. They were slow to train because they were inherently sequential. And they struggled with long-range dependencies; by the time they had processed the end of a long sentence, the information from the beginning had faded.
A team at Google (Ashish Vaswani and seven co-authors, who listed themselves as equal contributors) proposed replacing recurrence with a different mechanism called self-attention. The idea was radical. Instead of processing tokens one at a time, let every token look at every other token simultaneously and decide which ones matter. This was massively parallelisable, trained much faster, and handled long-range dependencies directly.
The paper demonstrated this on machine translation, where the transformer beat the previous state of the art. But the real significance was not the benchmark. It was that transformers scaled. In the years that followed, researchers discovered that making a transformer bigger and training it on more data kept improving quality in a remarkably predictable way. This property — that scale paid off reliably — is what produced GPT, Claude, and the rest.
The one-sentence version
A transformer is a neural network made by stacking blocks, where each block does two things: it lets every token in the sequence look at every other token (attention), then processes each token independently (a feed-forward layer). Stack enough of these blocks, train them on enough data, and you get a model that can translate languages, write code, summarise documents, and hold conversations.
That is the architecture. The rest is engineering, scale, and a few clever tricks that keep the whole thing numerically stable. You can draw a transformer on a napkin, and most engineers who ship AI products in 2026 could do so from memory.
Self-attention: every token peeks at every other
Self-attention is the heart of the transformer. Given a sequence of tokens — say, a sentence — the attention mechanism produces, for every token, a weighted blend of all other tokens that captures context.
Consider the sentence "The trophy did not fit in the suitcase because it was too big." When the model processes the word "it," self-attention lets it look back at every earlier word and ask: which of these is "it" referring to? The answer depends on context — here, "trophy" is the likely referent. Self-attention computes a weight between "it" and every other token; higher weights mean more relevance. The final representation of "it" is a blend of the other tokens weighted by those scores.
The mechanics, in plain language: each token produces three vectors called query, key, and value. The query of one token is compared against the keys of all others to compute similarity scores. These scores are normalised (softmax) so they sum to one, then used to take a weighted average of the value vectors. That weighted average is the attended representation for that token. The same operation runs in parallel for every token in the sequence.
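Those mechanics fit in a few lines of code. The following is a toy single-head attention pass with tiny, illustrative dimensions; the random matrices stand in for weights that a real model would learn:

```python
import numpy as np

# Toy sizes: four tokens, an 8-dimensional hidden state (real models use thousands)
seq_len, d_model = 4, 8
rng = np.random.default_rng(0)

x = rng.standard_normal((seq_len, d_model))   # stand-in token embeddings

# Query, key, and value projections (random here, learned in a real model)
W_q = rng.standard_normal((d_model, d_model))
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Every query compared against every key, scaled for numerical stability
scores = Q @ K.T / np.sqrt(d_model)           # shape (seq_len, seq_len)

# Softmax: each row of weights sums to one
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Each token's new representation is a weighted average of all value vectors
out = weights @ V
print(out.shape)                  # (4, 8)
print(weights.sum(axis=-1))       # each row sums to (approximately) 1
```

Note that nothing here is sequential: every row of `weights` is computed in one batched matrix multiplication, which is the parallelism the next paragraph describes.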
The genius of self-attention is that it is fully parallel. Every token's attention can be computed independently, making it ideal for GPU training. The catch is that it scales quadratically with sequence length: double the context, quadruple the compute. This is why long context windows are expensive and why mechanisms like prompt caching, sliding-window attention, and sparse attention have become important in 2026.
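The quadratic cost is easy to see directly: the score matrix has one entry per (query, key) pair, so doubling the context quadruples the number of entries. A quick back-of-envelope illustration:

```python
# One attention score per (query, key) pair: cost grows with the
# square of the context length. Doubling the context quadruples it.
for context in (1_000, 2_000, 4_000):
    entries = context * context
    print(f"{context:>5} tokens -> {entries:>12,} attention scores per head per layer")
```

Multiply that by the number of heads and layers in a frontier model and the expense of very long contexts becomes clear.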
Multi-head attention in plain English
Single-head attention is useful, but it has a limitation: a single attention pattern can only focus on one aspect of the relationship between tokens. A word might relate to others grammatically, semantically, and thematically all at once. You would want to model several of these relationships in parallel.
Multi-head attention does exactly that. Instead of one attention mechanism, the model runs several (typically 8 to 128) in parallel, each with its own learned Q/K/V projections. Each head can attend to different aspects of the input. One head might specialise in syntactic relationships (subject-verb agreement); another might specialise in coreference (which pronoun refers to which noun); another might specialise in topical similarity. The outputs of all heads are concatenated and projected back to the model's hidden dimension.
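In code, the change from single-head attention is small: the projections shrink to a per-head dimension, each head attends independently, and the results are concatenated. A sketch with two heads and toy dimensions (random matrices again standing in for learned weights):

```python
import numpy as np

seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads        # each head works in a smaller subspace

rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Each head has its own Q/K/V projections and its own attention pattern
head_outputs = []
for _ in range(n_heads):
    W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    attn = softmax(Q @ K.T / np.sqrt(d_head))
    head_outputs.append(attn @ V)            # (seq_len, d_head)

# Concatenate the heads and project back to the model dimension
W_o = rng.standard_normal((d_model, d_model))
out = np.concatenate(head_outputs, axis=-1) @ W_o
print(out.shape)                             # (4, 8)
```

Because each head sees only a slice of the hidden dimension, the total cost is close to that of one full-width head, which is why models can afford dozens of them.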
In practice, researchers have found that different attention heads in a well-trained transformer do specialise in interpretable ways. Some attend to punctuation; some attend to rare words; some attend to the previous line of text. This emergent specialisation happens without anyone explicitly programming it. Training simply discovers that having multiple parallel attention patterns produces better outputs.
Modern refinements include grouped-query attention and multi-query attention, which reduce the cost of attention by sharing some projections across heads. These are engineering optimisations; the core idea is the same.
Positional encoding: how order survives
If self-attention treats tokens as a set — every token looking at every other — how does the model know that "the dog bit the man" is different from "the man bit the dog"? Word order matters enormously in language, but the raw attention mechanism is position-agnostic.
The answer is positional encoding. Before the first transformer block, the model adds position information to each token's embedding. The original transformer used sinusoidal functions: each position gets a unique wave-like pattern added to its embedding, so that tokens at position 1 look distinguishable from tokens at position 2, and so on. Subsequent models have used learned absolute positions, relative positions, and most recently rotary position embeddings (RoPE) that encode position through rotation of the query and key vectors.
RoPE has become the de facto standard for frontier LLMs in 2026. It has good extrapolation properties — models trained on 8K context can often handle 32K or more with minor tweaks — and integrates more elegantly with attention than the original sinusoidal scheme. Yet another reminder that the 2017 paper established the architecture; nearly every component has since been refined or replaced with a better variant.
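The original sinusoidal scheme is simple enough to sketch directly (RoPE works differently and is not shown here). Each position gets a fixed vector of sines and cosines at different frequencies, added to the token embedding:

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """Positional encodings from the original 2017 transformer."""
    pos = np.arange(seq_len)[:, None]            # positions 0..seq_len-1
    dim = np.arange(0, d_model, 2)[None, :]      # even embedding dimensions
    angles = pos / (10_000 ** (dim / d_model))   # one frequency per dimension pair
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)                # even dims get sine
    enc[:, 1::2] = np.cos(angles)                # odd dims get cosine
    return enc

enc = sinusoidal_encoding(seq_len=16, d_model=8)
print(enc.shape)                      # (16, 8): one vector per position
print(np.allclose(enc[1], enc[2]))    # False: positions stay distinguishable
```

Adding `enc` to the token embeddings before the first block is all it takes for order to survive the otherwise position-agnostic attention.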
The feed-forward layers: where knowledge lives
After attention mixes information across tokens, the feed-forward layer processes each token independently. Typically it expands the hidden dimension by a factor of four, applies a non-linearity (GELU or SwiGLU in modern designs), and projects back. Two large matrix multiplications with a non-linearity in between.
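As a sketch, the whole layer is two matrix multiplications around a non-linearity, applied to each token independently (GELU here, in its common tanh approximation; dimensions are illustrative):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, used in several transformer variants
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

seq_len, d_model = 4, 8
d_hidden = 4 * d_model            # the classic 4x expansion

rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))   # per-token inputs from attention
W1 = rng.standard_normal((d_model, d_hidden)) # expand
W2 = rng.standard_normal((d_hidden, d_model)) # project back

out = gelu(x @ W1) @ W2
print(out.shape)                  # (4, 8): same shape in and out
```

The parameter count of this sketch is 2 × d_model × 4 × d_model = 8 × d_model² per layer, which is where the "most parameters live here" claim below comes from.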
This is where most of the model's parameters live. In a typical transformer, the feed-forward layers account for roughly two-thirds of the parameter count. Recent interpretability research suggests that specific facts and associations are stored in specific feed-forward neurons. Editing facts in a language model — telling it that the Eiffel Tower is in Rome, say — can often be done by surgically modifying a small number of feed-forward weights.
The intuition is that attention decides "which other tokens matter for this position," and the feed-forward layer decides "given that context, what should this position compute next?" Together they alternate: gather context, process, gather more context, process more. Stacked through dozens of layers, this produces increasingly abstract representations of the input.
Encoders, decoders, and why LLMs picked one side
The original transformer had two halves: an encoder that processed input sequences (like the source language for translation) and a decoder that produced output sequences (like the target language). Both were stacks of transformer blocks, with the decoder having an additional cross-attention step that let it look at the encoder's output.
This encoder-decoder design made sense for translation. For general language modelling, it turned out to be overkill. Decoder-only models, transformers with masked attention that can only look backwards (the token at position n attends to positions 1 through n, never to anything later), proved simpler, scaled better, and generalised beautifully to almost any language task via in-context learning. GPT was the first decoder-only LLM to become famous; virtually every current frontier LLM is a decoder-only design.
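The "masked attention" that makes a model decoder-only is a one-line change to the attention sketch: future positions are set to negative infinity before the softmax, so they receive zero weight. A minimal illustration with uniform scores:

```python
import numpy as np

seq_len = 4
scores = np.zeros((seq_len, seq_len))   # stand-in attention scores

# Mask the upper triangle (strictly future positions) before the softmax
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))
# Row i spreads its weight only over positions 0..i:
# [[1.   0.   0.   0.  ]
#  [0.5  0.5  0.   0.  ]
#  [0.33 0.33 0.33 0.  ]
#  [0.25 0.25 0.25 0.25]]
```

This mask is also what makes decoder-only training efficient: every position's next-token prediction can be trained in the same forward pass, because no position can cheat by looking ahead.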
Encoder-only variants (BERT and descendants) still have important roles in classification, retrieval, and embedding, but they have quietly become a niche technology compared to decoder-only generative models. When someone says "LLM" in 2026, they almost always mean a decoder-only transformer.
Why transformers dethroned RNNs so fast
The takeover of transformers was remarkably complete. By 2020, almost no serious new language research used RNNs. What happened?
Transformers trained faster on the same hardware because attention is fully parallel across the sequence, whereas RNNs are inherently sequential. This meant researchers could iterate experiments faster and train bigger models in the same wall-clock time.
Transformers handled long-range dependencies directly through attention, while RNNs struggled because information had to flow through the hidden state step by step, fading over distance.
Transformers produced better benchmarks. On almost every sequence task of the era — translation, question answering, summarisation — transformers matched or exceeded RNN performance within a year of their introduction.
Transformers scaled beautifully. When researchers tried to make RNNs bigger, the quality gains were modest. When they made transformers bigger, quality gains were dramatic. The scaling laws discovered in 2020 all emerged from transformer experiments.
Collectively, these advantages were too large to ignore. RNNs are still occasionally used for niche tasks where sequential inductive bias matters, but the transformer architecture has won the sequence-modelling war decisively.
A walk-through: what happens when you prompt "Paris is in" and the model completes "France"
To see all the pieces working together, trace the simplest possible completion.
You type Paris is in. The tokeniser produces, say, four tokens: [Paris][ is][ in][end-of-prompt]. These get looked up in the embedding matrix, producing four dense vectors of, say, 4096 dimensions each. Positional encoding is added so the model knows "Paris" is first and "in" is third.
These vectors enter the first transformer block. The attention mechanism lets each token peek at every prior token. The token " in" attends heavily to "Paris" because the model has learned, across trillions of training tokens, that "city + in" strongly predicts geographic completion. The multi-head variant does this in parallel along several relationship axes.
The feed-forward layer then processes each position. For the " in" position, the feed-forward layer's learned weights push the vector toward a region of embedding space associated with geographic-completion continuations. The output vector of this layer now encodes something like "this position is about to generate a country name related to Paris."
This repeats through every transformer block — dozens of them — each refining the representation further. By the final block, the last-position vector is a highly enriched, context-aware embedding.
That final vector is projected onto the vocabulary, producing a score for every possible next token. "France" scores highest by a massive margin; "Italy" and "Germany" and "Europe" score lower but non-trivially. The softmax normalises these into probabilities. Sampling (with a low temperature for factual queries) picks "France," appends it to the sequence, and the whole pass runs again to generate the next token, and so on, until an end-of-reply signal appears.
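The final step, scores to probabilities to a sampled token, can be sketched directly. The vocabulary and logit values here are made up for illustration:

```python
import numpy as np

# Hypothetical top-of-vocabulary scores after the final projection
vocab = ["France", "Italy", "Germany", "Europe", "the"]
logits = np.array([9.0, 4.5, 4.0, 3.5, 1.0])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

temperature = 0.7                    # low temperature sharpens the distribution
probs = softmax(logits / temperature)

rng = np.random.default_rng(0)
next_token = vocab[rng.choice(len(vocab), p=probs)]
print(next_token)                    # almost certainly "France"
print(round(float(probs[0]), 3))     # France carries nearly all the mass
```

Lowering the temperature further pushes the distribution toward a deterministic argmax; raising it flattens the distribution and makes "Italy" or "Europe" plausible picks, which is why creative-writing settings use higher temperatures than factual ones.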
That is the complete transformer in action, running exactly the same mechanics that power every prompt you have ever sent to a modern LLM.
Emerging variants: linear attention, Mamba, and state-space models
Transformers have dominated since 2017, but research on alternatives continues. The main pressure point is attention's quadratic cost at long contexts.
Linear attention variants approximate full attention with linear-cost mechanisms. Performer, Linformer, and various kernel-based approaches have been proposed. They generally trade a modest quality drop for dramatic efficiency gains on very long sequences.
State-space models (SSMs) like Mamba process sequences with constant cost per token, independent of context length. Mamba matched transformer quality on several benchmarks in 2023-2024 and sparked a wave of hybrid models that mix SSM layers with attention layers.
Hybrid architectures like Jamba combine transformer blocks with Mamba blocks to get the best of both. Several 2025-2026 frontier models have explored these hybrids, especially for million-token contexts.
So far, none of these has unseated the transformer. But the competition ensures that research on sequence modelling remains lively, and future LLMs may well be descendants of these variants rather than pure transformers. The architectural winner of 2030 might not exist yet.
Transformers outside text: vision, audio, DNA, and more
Perhaps the most surprising thing about transformers is how well they have transferred beyond text.
Vision transformers (ViT) treat an image as a sequence of patches. The original 2020 ViT paper showed that transformers on image patches could match or exceed the performance of convolutional networks on large datasets. Today's frontier multimodal models are transformers that handle text and image patches in the same architecture.
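The "image as a sequence of patches" idea is mostly a reshape. A sketch using the original ViT's sizes (224 × 224 image, 16 × 16 patches):

```python
import numpy as np

image = np.zeros((224, 224, 3))   # height, width, channels
patch = 16                        # 16x16 patches, as in the original ViT

h, w, c = image.shape
# Slice the image into a grid of patches, then flatten each patch
patches = image.reshape(h // patch, patch, w // patch, patch, c)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

print(patches.shape)   # (196, 768): a 14x14 grid, each patch a 768-dim "token"
```

Each flattened patch is then linearly projected into the model dimension and treated exactly like a word embedding, positional encoding and all; from the transformer's point of view, an image is just a 196-token sentence.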
Audio transformers treat audio as a sequence of time-frequency patches or tokens. Whisper, OpenAI's widely used speech-recognition model, is a transformer. Text-to-speech models, music generators like Suno and Udio, and multimodal audio-video systems all use transformer backbones.
Biology transformers have emerged for protein structure prediction (AlphaFold uses transformer-inspired components), DNA sequence modelling, and single-cell transcriptomics. The universality of the architecture has been one of the most striking scientific stories of the decade.
Even robotics policies are increasingly transformer-based. A single architecture turning out to be the best model for text, images, audio, biology, and control is a remarkable empirical result — and suggests that whatever transformers capture is something fundamental about sequential data structure.
Things a transformer does not do
A final round of myth-correction, because the transformer mystique runs deep.
A transformer does not store explicit memories of training data. It stores statistical patterns. You cannot query it for "what was in your training set about topic X" directly.
A transformer does not have a goal or intention. It computes a fixed function over its input. Apparent goal-directedness in LLM conversations emerges from the training data (which contains lots of goal-directed human text) and from instruction tuning.
A transformer does not update its weights during a conversation. Every token generated uses the same frozen weights. "Learning from conversation" in chat products is implemented by maintaining context, not by changing the model.
A transformer does not understand time or the physical world except through text patterns. It has no clock, no gravity, no body. Responses about real-world causality emerge from having read many texts about it, not from any direct experience.
Keeping these limitations in mind helps you use the technology honestly — with its real strengths, not its imagined ones.
Transformers replaced sequence models by letting every token attend to every other token at once. That one trick unlocked the last decade of AI.
The short version
A transformer is a stack of blocks, each consisting of a self-attention layer that lets every token peek at every other and a feed-forward layer that processes each token independently. Multi-head attention lets multiple relationship patterns run in parallel. Positional encoding preserves order. Decoder-only variants dominate large language modelling. The architecture has generalised spectacularly beyond text, to images, audio, biology, and control. Every major AI product you use is a transformer under the hood. The 2017 paper really was that important, and everything since has been one long elaboration of its ideas.