Every AI model you have heard of comes with a parameter count tattooed to its name: GPT-3 had 175 billion, Llama 3 comes in 8B and 70B and 405B, Claude and Gemini do not officially disclose theirs, DeepSeek ships a 671B mixture-of-experts, and Microsoft's Phi quietly hits frontier-like quality at 3.8B. What does any of this mean? Does a model with twice as many parameters think twice as well? Is bigger always better? This guide walks through what a parameter actually is, where scale helps, where it stops helping, and how to read a model spec sheet without getting fooled by the marketing.

What a parameter actually is

A parameter is a single number inside the model that the training process adjusted from its initial random value. That is it. There are no tiers of parameters, no special ones that matter more than others; they are all just numbers stored in very large matrices and vectors. A 70-billion-parameter model contains 70 billion of these numbers, held in GPU memory and multiplied against activations on every forward pass, roughly two floating-point operations per parameter per generated token.

The parameters live in three main places: the embedding matrix that converts tokens to vectors, the weight matrices of every attention mechanism in every transformer block, and the weight matrices of every feed-forward layer. A small number live in layer-normalisation scales and biases. For a typical large language model, the majority of parameters sit in the feed-forward layers, which is where most of the learned "knowledge" is thought to be stored.

Parameters are not knowledge in any symbolic sense. The model does not have a parameter that says "Paris is the capital of France." Facts are distributed across millions of parameters through patterns of activation. This is why you cannot simply look inside a model to see what it knows — and why editing factual knowledge is a genuine research problem rather than a database update.

Each parameter is typically stored as a 16-bit or 32-bit floating point number. At 16-bit precision, a 70B-parameter model takes roughly 140 gigabytes of memory just to hold its weights, before any activations or gradients. This is why big models need clusters of GPUs to run at all.

Where the billions come from: an inventory

It helps to see where the parameter count actually accumulates. Consider a notional 7B-parameter transformer with 32 layers, a hidden size of 4096, a vocabulary of 32,000 tokens, and an attention head dimension of 128.

The embedding matrix holds 32,000 × 4,096 ≈ 131 million parameters. Many modern designs tie the output projection to the embedding matrix, sharing these weights and saving another 131 million parameters here; on large-vocabulary models the saving can approach half a billion.

Each attention block holds four large weight matrices (Q, K, V, output projection), each 4,096 × 4,096. That is roughly 67 million parameters per block's attention layer. Across 32 layers, that is about 2.1 billion parameters in attention alone.

Each feed-forward block holds two large weight matrices, typically expanding the hidden size by a factor of four (4,096 → 16,384 → 4,096). That is around 134 million parameters per block, or 4.3 billion across 32 layers. The feed-forward layers are where most of the model's bulk lives.

Layer norms and biases add only a few million more, and the remaining gap to the headline figure comes from details this sketch omits, such as gated feed-forward variants and an untied output head. Either way, you land at roughly 7 billion. For a 70B model, the numbers scale up commensurately; for a 405B model, further still. The naming convention (7B, 13B, 70B, 405B) refers to total parameter count, not some special measure of capability.
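The tally above is easy to reproduce. The dimensions below are the notional ones from this section, not those of any released model, and the script ignores the small layer-norm and bias terms.

```python
# Parameter inventory for the notional 7B transformer described above.
layers, hidden, vocab, ffn_mult = 32, 4096, 32_000, 4

embedding = vocab * hidden                                # token embedding matrix
attention = layers * 4 * hidden * hidden                  # Q, K, V, output projections
feed_forward = layers * 2 * hidden * (ffn_mult * hidden)  # up- and down-projections

total = embedding + attention + feed_forward
print(f"embedding:    {embedding / 1e6:.0f}M")
print(f"attention:    {attention / 1e9:.2f}B")
print(f"feed-forward: {feed_forward / 1e9:.2f}B")
print(f"total:        {total / 1e9:.2f}B")
```

Running this lands at about 6.6B; the gap to the 7B headline is exactly the kind of architectural detail, gated feed-forwards and untied output heads, that the sketch leaves out.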

Scaling laws: the idea that bigger was predictably better

Around 2020, researchers at OpenAI, Anthropic, and DeepMind began formalising something they had been observing empirically for years: model quality improves predictably as you scale up three things together — model size, training data, and compute budget. These became known as scaling laws.

The key insight was that the relationship is smooth and extrapolatable. If you double the compute and scale the model and data appropriately, the loss on held-out text decreases by a specific, predictable amount. There was no obvious ceiling. Every time researchers scaled up an order of magnitude, they got a better model. This is why the industry raced to build bigger models: you could plot the curve, look at the next point, and project the capability.
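As a toy illustration of that smoothness, the parameter-count term of such a power law fits in a few lines. The exponent and constant are the approximate published fits from Kaplan et al. (2020); treat the numbers as illustrative, not as a predictor for any particular model.

```python
# Power-law loss curve in parameter count: L(N) = (N_c / N) ** alpha.
# alpha ~ 0.076 and N_c ~ 8.8e13 are approximate Kaplan et al. fits (illustrative).
ALPHA, N_C = 0.076, 8.8e13

def predicted_loss(n_params: float) -> float:
    """Predicted held-out loss as a function of parameter count alone."""
    return (N_C / n_params) ** ALPHA

for n in (1e9, 1e10, 1e11, 1e12):
    print(f"{n:.0e} params -> loss {predicted_loss(n):.3f}")
```

Each tenfold increase in parameters shaves a predictable slice off the loss, which is exactly why the curve was so easy to extrapolate.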

Scaling laws gave the AI industry a road map from 2020 to 2024. GPT-3 to GPT-3.5 to GPT-4 followed it. Claude's progression followed it. Gemini's progression followed it. Every one of these jumps could have been predicted roughly from the compute budget before training even began.

The downside of scaling laws is that they made progress look mechanical. For a while it seemed like capability was simply a function of how many GPUs you could afford. That was never quite right, but it was close enough to be a useful approximation.

Where scaling breaks: data, compute, and diminishing returns

The simple scaling story ran into three problems as models got bigger.

First, data ran out. Scaling laws require that data grows along with model size. By 2024, frontier models were approaching the limit of high-quality text available on the public internet. Synthetic data, private data, and cross-modal data became the new frontiers. Simply continuing to scale without proportional data no longer worked; models trained on too few tokens showed diminishing returns.

Second, compute got expensive. Training a frontier model in 2026 costs hundreds of millions of dollars just in electricity and chip time. The next tenfold jump in compute costs a billion; the one after that, ten. Physical data centre power and chip supply became real constraints.

Third, benchmarks saturated. As models got bigger, standard benchmarks stopped being useful measurement tools because every frontier model scored above 95%. New, harder benchmarks had to be invented, and the perception of progress slowed because it became harder to show clean numerical gains.

The industry responded by moving to a different axis: rather than bigger pretraining, spend more compute at inference time through reasoning models. This is why o1, o3, Claude extended thinking, and Gemini reasoning modes have been the story of 2024-2026 rather than just-bigger-models.

Data quality vs parameter count: the Chinchilla lesson

A landmark 2022 paper from DeepMind, nicknamed the Chinchilla paper, upended the industry's default assumption. DeepMind showed that for a fixed compute budget, previous frontier models were massively under-trained: its 70B-parameter Chinchilla, trained on 1.4 trillion tokens, beat the 280B-parameter Gopher trained with the same compute. The optimal ratio turned out to be roughly 20 training tokens per parameter, far more data per parameter than was being used.

This had immediate consequences. Post-Chinchilla models shrank in parameter count but grew dramatically in training data. Llama 2 was smaller than GPT-3 but trained on far more data and matched or beat it on most benchmarks. Mistral, Llama 3, Phi, and others took this lesson even further, training relatively small models on exceptionally large amounts of high-quality data.

The practical takeaway is that parameter count alone tells you very little. A 70B model under-trained on mediocre data can be worse than a 7B model lovingly trained on a curated trillion-token dataset. By 2026 this is widely understood, and vendors quietly compete as much on data curation as on model size.
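Combining the roughly 20-tokens-per-parameter rule with the common C ≈ 6·N·D approximation for training compute gives a back-of-the-envelope sizing formula; both constants are rules of thumb, not exact fits.

```python
import math

# Chinchilla-style sizing: tokens D ~ 20*N and compute C ~ 6*N*D imply C ~ 120*N^2.
def compute_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) that spend a training budget compute-optimally."""
    n_params = math.sqrt(compute_flops / 120)
    return n_params, 20 * n_params

n, d = compute_optimal(1e24)  # a notional 1e24-FLOP training budget
print(f"~{n:.1e} params on ~{d:.1e} tokens")
```

A notional 1e24-FLOP budget comes out around 90 billion parameters and 1.8 trillion tokens, roughly the regime the post-Chinchilla 70B-class models landed in.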

Mixture-of-experts: cheating scale

One of the most important architectural innovations of the last few years is mixture-of-experts (MoE). A dense model uses all of its parameters on every token. An MoE model has many specialised "expert" sub-networks but routes each token to only a few of them. The model has, say, 400 billion total parameters, but only 40 billion activate for any given inference.
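A token router can be sketched in a few lines of NumPy. Everything here, the shapes, the top-2 choice, and the softmax gate, is a toy assumption rather than the routing scheme of any named model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model, top_k = 8, 16, 2

# Each "expert" is a small weight matrix; the router scores experts per token.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Run one token vector through its top-k experts, mixed by a softmax gate."""
    scores = x @ router                    # one routing score per expert
    chosen = np.argsort(scores)[-top_k:]   # indices of the k highest-scoring experts
    gates = np.exp(scores[chosen] - scores[chosen].max())
    gates /= gates.sum()                   # softmax over the chosen experts only
    return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))

y = moe_forward(rng.standard_normal(d_model))  # only 2 of the 8 experts did any work
```

All eight expert matrices sit in memory (the "total" count), but each token pays the compute cost of just two of them (the "active" count).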

The upside is obvious: you get the capacity of a much larger model with the inference cost of a much smaller one. Mixtral, DeepSeek V3, and several frontier closed models in 2025-2026 are MoE architectures. The naming convention now distinguishes "total parameters" from "active parameters" because they differ dramatically.

The downside is operational complexity. MoE models are harder to train stably, harder to serve efficiently, and need specialised infrastructure to route tokens to the right experts. They also tend to perform worse than dense models of equivalent active-parameter count on some tasks. But the economics are compelling enough that every major lab has been shipping MoE variants.

When you see a spec sheet advertising "671B parameters, 37B active," it means MoE. Both numbers matter; neither tells the whole story alone.

How to read a model spec without getting misled

Put the pieces together and you get a more sophisticated way to read model specs.

Total parameter count tells you the ceiling of what the model could know, but not what it actually knows. A 1T-parameter model trained on bad data is worse than a 10B-parameter model trained on good data.

Active parameter count tells you the inference cost, roughly. Dense models have active = total. MoE models have active much smaller than total.

Training tokens tell you whether the model was adequately trained. The Chinchilla-optimal ratio is roughly 20 tokens per parameter; models trained on much less than that are under-trained, while modern models built for cheap inference are often deliberately over-trained far beyond it.

Benchmark scores tell you how the model performs on standardised tasks, but benchmarks are often gamed or saturated. Always evaluate on your own use cases.

Context window tells you how much text the model can consider at once — useful but unrelated to parameter count.

Price per million tokens tells you real economic cost, which reflects a combination of model size, architecture efficiency, and vendor margin. This is usually the most practically useful number on a spec sheet.
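Those numbers can be folded into a small checklist structure. The field names are made up for this sketch, and the example figures are in the style of DeepSeek V3's reported numbers, not an official schema.

```python
from dataclasses import dataclass

@dataclass
class ModelSpec:
    total_params: float      # headline parameter count
    active_params: float     # equals total_params for dense models
    training_tokens: float

    @property
    def is_moe(self) -> bool:
        return self.active_params < self.total_params

    @property
    def tokens_per_param(self) -> float:
        # compare against the ~20 tokens/parameter Chinchilla rule of thumb
        return self.training_tokens / self.total_params

spec = ModelSpec(total_params=671e9, active_params=37e9, training_tokens=14.8e12)
print(spec.is_moe, f"{spec.tokens_per_param:.1f} tokens/param")
```

The example works out to about 22 tokens per total parameter, comfortably inside the adequately-trained range, and the total/active gap immediately flags it as MoE.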

A 2026 snapshot of parameter counts

A rough guide to the landscape, approximate because several frontier labs do not disclose.

Frontier closed models: Claude, GPT, Gemini — undisclosed but estimated in the hundreds of billions of total parameters, with active counts likely lower due to MoE architectures. These run on large GPU clusters and are priced accordingly.

Frontier open-weight models: Llama 3 405B, DeepSeek V3 671B (37B active), Mistral Large — these are what the open community has access to and are approaching frontier quality on many benchmarks.

Mid-tier open models: Llama 3 70B, Mixtral 8x7B, Qwen 72B — excellent general-purpose models that run on a single high-end GPU or a small cluster.

Small language models: Phi-4 (14B), Gemma 9B, Llama 3 8B — the workhorses of on-device and cost-sensitive applications. Quality has climbed dramatically at this tier.

Edge models: Phi-mini 3.8B, Gemma 2B, and specialised distilled variants that run on phones and laptops. Not as capable as the big ones, but astonishingly close for well-scoped tasks.

Why parameter counts are sometimes misleading numbers

Several subtle points make raw parameter counts less informative than they seem.

Quantisation changes the real memory footprint without changing parameter count. A 70B model quantised to 4 bits needs roughly 35GB of GPU memory for its weights; the same model at 16-bit precision needs 140GB. The underlying parameter count is identical, but the running system is radically different. Modern on-device deployments almost always quantise.
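The arithmetic behind those figures is worth having on hand; this counts weight storage only and ignores activations, KV cache, and quantisation bookkeeping overhead.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone: parameters x bits / 8, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"70B model at {bits:>2}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
```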

Embedding parameters vs compute parameters. A model with a huge vocabulary has many parameters in its embedding matrix that contribute almost nothing to per-token compute, because only the row for the current token is looked up. Comparing a code model (large vocabulary, many symbols) to a prose model (smaller vocabulary) on raw parameter count is apples to oranges.

Tied vs untied weights. Many modern models share weights between the input embedding and the output projection ("tied embeddings"). This can save half a billion parameters on a large vocabulary model. Two models that look identical on paper may differ in subtle ways that affect this count.

Distillation and compression. A "70B" model distilled from a 400B teacher can match the teacher on many tasks while having the smaller model's parameter count. The nominal size hides the fact that most of the capability came from a larger source.

Once you account for these nuances, parameter count becomes a rough guide rather than a crisp metric.

A glossary of parameter jargon you will see

A few terms that will appear in every model's technical report.

Dense: every parameter participates in every forward pass. The default for most transformers.

Sparse / MoE: only a subset of parameters activates per token. Total vs active counts diverge.

Attention heads: number of parallel attention mechanisms per layer, typically 8 to 128.

Hidden size: the width of each vector flowing through the network, typically 2048 to 16384.

Layer count: depth of the transformer stack, typically 24 to 128.

Context length: maximum tokens the model can process at once; independent of parameter count but bound by attention compute.

FLOPs: total compute during training. Often a better scale measure than parameter count alone.

These numbers, taken together, tell you much more about a model than its headline parameter count ever will.

Why small models are catching up

Perhaps the biggest quiet trend of 2024-2026 is how rapidly small models have been closing the gap with frontier ones. A 2024 Phi-3.5-mini or Gemma 2 9B, scored on a well-scoped benchmark from 2022, would have been considered frontier-level at the time. This is because of better data curation, better training recipes, distillation from larger models, and smarter architectures.

For many real-world applications, the right answer is no longer "call GPT-4 via API" but "run a fine-tuned small model on your own hardware for a hundredth of the cost." The parameter-count arms race is becoming a two-track competition: the frontier still grows at the top end, but the value is increasingly being delivered by carefully curated smaller models. Betting on small models is one of the most reliably lucrative positions for any AI practitioner in 2026.

Practical implications for buyers and builders

A few takeaways worth pinning to the wall.

Do not judge a model by parameter count alone. Always evaluate on your own task with your own data. Headline numbers in vendor decks are marketing, not engineering reality.

If a model twice the size is only marginally better on your task, buy the smaller one. The gap between "best on benchmarks" and "best on your problem" is often dramatic.

For self-hosting, start small and scale up only when quality demands it. A well-fine-tuned 8B model often beats a zero-shot 70B model on narrow tasks at a tenth the cost.

For API consumption, experiment with cheaper tiers before paying for flagships. Haiku, Mini, and Flash-class models have improved enough that they handle most production traffic; you should reserve premium models for genuinely hard queries.

And remember that 2026's expensive model is 2028's commodity. Build your stack so you can swap models out as prices fall and quality climbs.

Parameters are learned weights. More is usually better, but only when paired with enough high-quality data, smart training, and a sensible architecture. The era of "scale is all you need" is over.

The short version

A parameter is a single number inside a model that training tuned from random. Bigger models have more of them and can represent more complex patterns, but only if you feed them enough high-quality data and enough compute. Scaling laws predicted smooth, compounding returns for a few years, then ran into data limits, compute limits, and benchmark saturation. Mixture-of-experts and reasoning models became the new frontier. Small well-trained models keep catching up. The practical lesson for anyone evaluating a model: parameter count is a proxy, not a verdict. Look at active parameters, training tokens, architecture, and your own benchmark results before you believe the headline.
