Two numbers determine how much your AI conversation costs and how much of your document the model can actually see: tokens and context windows. Both are simple concepts dressed up in awkward vocabulary, and both are at the heart of every cost surprise and quality regression that teams encounter when they start shipping AI products. This guide walks through what a token actually is, how context windows work in 2026, why long conversations get expensive in ways that feel disproportionate, and the practical techniques — prompt caching, retrieval, summarisation, compression — that production teams use to keep both cost and quality under control.
What a token actually is
A token is the smallest unit of text that a language model sees. It is somewhere between a letter and a word. Common short English words are one token. Longer words get split into multiple tokens. Punctuation, spaces, capital letters, and special characters each take their own tokens.
A few concrete examples using a typical tokeniser. The word "hello" is one token. "world" is one token. "unbelievable" might be three (roughly "un," "believ," "able"). A newline is a token. A comma is a token. The space before a word is usually bundled with the word itself, so " hello" (with leading space) is one token distinct from "hello" with no leading space.
On average, for English prose, one token works out to roughly 3 to 4 characters, or about three-quarters of a word. A 1000-word essay is typically 1300 to 1500 tokens. A typical page of a book is 400 to 500 tokens. A long email is perhaps 300 tokens. These numbers are worth internalising because they determine everything about AI cost and context planning.
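These rules of thumb are easy to encode. A minimal sketch of the ratios above — heuristics only, not exact counts; use the model's real tokeniser before pricing anything:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English prose: ~4 characters per token."""
    return max(1, round(len(text) / 4))

def estimate_tokens_from_words(word_count: int) -> int:
    """~3/4 of a word per token, i.e. ~4/3 tokens per word."""
    return round(word_count * 4 / 3)

# A 1000-word essay lands in the 1300-1500 token range the rule predicts.
print(estimate_tokens_from_words(1000))  # → 1333
```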
Other languages tokenise differently. Mandarin Chinese, written without spaces, often has one token per character. Japanese and Korean are similar. European languages with lots of accents and compound words can tokenise less efficiently than English. This is a reason why AI services sometimes feel more expensive for non-English content: you are paying for more tokens per word.
How tokenisation actually happens
The algorithm behind modern tokenisation is called byte-pair encoding, or BPE. It works by starting with individual characters and progressively merging the most common pairs into new tokens, building up a vocabulary of 30,000 to 200,000 tokens over a very large training corpus. Common character sequences like "the" or "ing" end up as single tokens; rare sequences like "zymogenic" get split into multiple pieces.
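The merge loop can be sketched in a few lines. This is a toy character-level version for illustration; production tokenisers operate on bytes over enormous corpora, but the core algorithm is the same:

```python
from collections import Counter

def bpe_train(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Toy byte-pair encoding: start from single characters and repeatedly
    merge the most frequent adjacent pair into a new token."""
    words = [list(w) for w in corpus]  # each word as a list of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        for w in words:  # apply the merge everywhere it occurs
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

# Frequent sequences get merged first: "i"+"n", then "in"+"g".
print(bpe_train(["sing", "ring", "king", "thing"], 3))
```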
The specific vocabulary is part of the model. GPT-4 has one tokeniser; Claude has another; Llama has another. Token counts can differ by 10-20% between tokenisers for the same input text. The tiktoken library (from OpenAI) and the tokenizers library (from Hugging Face) let you compute exact token counts for any supported model before sending a prompt, which is essential for predicting cost.
One practical consequence: every API provider prices in tokens. "Millions of tokens per dollar" is the standard unit. You pay for input tokens (everything you send) and output tokens (everything the model generates). Output tokens are typically priced 3-5 times higher than input tokens because generating is more expensive than reading. Understanding this pricing structure is the difference between an AI product with healthy margins and one that bleeds money on every query.
What a context window is
The context window is the maximum number of tokens the model can process in a single call — the combined length of your prompt plus the response the model generates. If a model has a 128K context window, and your prompt is 100K tokens, the model can only generate up to 28K tokens of response before hitting the limit.
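The budget arithmetic is worth making explicit — a one-liner, using the round numbers from the example (real window sizes vary by model):

```python
def max_output_tokens(context_window: int, prompt_tokens: int) -> int:
    """Tokens left for the response once the prompt occupies the window."""
    return max(0, context_window - prompt_tokens)

# The example above: a 128K window with a 100K-token prompt.
print(max_output_tokens(128_000, 100_000))  # → 28000
```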
Context windows have grown dramatically in the last five years. GPT-3 launched with 2K. GPT-4 launched with 8K. Claude offered 100K, then 200K. Gemini pushed to 1M, then 2M. By 2026, Claude has multi-million-token variants for specific workloads. Reasoning models have "thinking" budgets measured in hundreds of thousands of tokens spent on internal deliberation.
Larger context windows are mostly a quality-of-life improvement. They let you feed whole codebases, whole books, whole case files into a single prompt without splitting. They enable simpler architectures that skip retrieval for smaller corpora. And they unlock use cases — analysing entire research papers, processing hour-long meeting transcripts, reviewing complete legal contracts — that were previously impossible.
But larger windows come with two costs: money and quality. Both are worth understanding.
Why long conversations get expensive
The cost of processing a prompt is not linear in its length. Attention — the core operation in a transformer — scales quadratically with sequence length. A prompt twice as long takes four times more compute for attention alone.
In practice, API providers absorb some of this complexity through clever engineering (flash attention, paged attention, KV caching), but the cost signal still reaches the user. A 100K-token prompt is not ten times more expensive than a 10K-token prompt; it is considerably worse than that on the raw compute axis, even if the dollar pricing is set linearly by the provider for simplicity.
On top of that, long conversations in chat products often pass the entire history back to the model on every turn. So if you have been chatting for twenty messages and each averages 500 tokens, your next turn sends 10,000+ tokens of history plus the new message. The twentieth turn costs roughly twenty times the first turn in input tokens. This is why long chat sessions appear to slow down and get more expensive as they continue.
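The growth is easy to quantify. A sketch of the per-turn and cumulative input bill under the resend-everything pattern, using the 500-token average from the example (and ignoring the system prompt for simplicity):

```python
def input_tokens_per_turn(turn: int, avg_message_tokens: int = 500) -> int:
    """Input tokens for turn n when the full history is resent:
    (n - 1) prior messages plus the new one."""
    return turn * avg_message_tokens

def cumulative_input_tokens(turns: int, avg_message_tokens: int = 500) -> int:
    """Total input tokens billed across a whole conversation.
    Grows quadratically: 500 * n(n+1)/2."""
    return sum(input_tokens_per_turn(t, avg_message_tokens)
               for t in range(1, turns + 1))

print(input_tokens_per_turn(20))    # → 10000 (20x turn 1)
print(cumulative_input_tokens(20))  # → 105000, not 20 x 500 = 10000
```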
The main tools for controlling this cost: prompt caching, retrieval, and summarisation. Each solves a different part of the problem.
Prompt caching: the cheapest cost reduction you are not using
Prompt caching is the most impactful cost-saving feature of the 2024-2026 era, and it is shockingly underused. Here is the idea. If a large part of your prompt is repeated across many requests — a system prompt, a static document, a knowledge base — the provider can cache the computed intermediate state for that prefix and reuse it across requests. You pay full price the first time; subsequent requests with the same prefix cost a fraction (often one-tenth).
Anthropic, OpenAI, and Google all support prompt caching in 2026, with slightly different APIs but similar economics. For a chatbot with a 5000-token system prompt handling 10,000 queries a day, that prompt alone is 50 million input tokens daily — $150/day at $3 per million — and caching cuts it to roughly $15/day. For document-grounded use cases where you repeatedly query the same large document, the savings are even more dramatic.
The gotcha is that caching keys on exact prefix matches. If your system prompt changes character-by-character between requests (because of a timestamp, a random ID, a user-specific field at the top), caching does nothing. Structuring prompts with static content first and dynamic content at the end unlocks caching for free.
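A sketch of the cache-friendly ordering. The prompt contents and helper name here are illustrative, not any provider's API; the point is that everything before the dynamic suffix is byte-identical across requests:

```python
from datetime import datetime, timezone

SYSTEM_PROMPT = "You are a support assistant for ExampleCo."  # static, cacheable
KNOWLEDGE_BASE = "...large static policy document..."         # static, cacheable

def build_prompt(user_name: str, user_message: str) -> str:
    """Static content first so the provider's prefix cache can reuse it;
    anything that varies per request (user fields, timestamps) goes last."""
    static_prefix = SYSTEM_PROMPT + "\n\n" + KNOWLEDGE_BASE
    dynamic_suffix = (
        f"\n\nCurrent user: {user_name}"
        f"\nTime: {datetime.now(timezone.utc).isoformat()}"
        f"\n\n{user_message}"
    )
    return static_prefix + dynamic_suffix

# Two requests share the same cacheable prefix even for different users.
p1 = build_prompt("alice", "How do I reset my password?")
p2 = build_prompt("bob", "What is your refund policy?")
shared = len(SYSTEM_PROMPT) + 2 + len(KNOWLEDGE_BASE)
print(p1[:shared] == p2[:shared])  # → True
```

Had the timestamp or user name come first, the prefixes would diverge at the first byte and caching would never fire.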
If you are shipping an AI product at scale and not using prompt caching, you are leaving a meaningful fraction of your API bill on the table.
When long context degrades quality
A million-token window is not the same as a million tokens of effective attention. Every frontier model retrieves information placed in the middle of a long prompt less reliably than information near the beginning or end — a phenomenon called "lost in the middle."
Empirical studies consistently show that information near the beginning or end of a long context is retrieved more reliably than information buried in the middle. For the most demanding long-context tasks, accuracy drops measurably at 50K, 100K, or 500K tokens compared to the same task at 10K tokens.
The practical consequence is that stuffing a million tokens into a prompt, while technically possible, is rarely the best approach. Retrieval-augmented generation — fetching just the relevant pieces — typically beats long-context stuffing on both cost and accuracy. Reasoning models that reason about the context before answering can partially compensate, but do not eliminate the degradation.
A good rule of thumb: if you are pushing beyond 50K tokens of input, switch to RAG unless you have tested that the model actually handles your specific task well at that scale.
Summarisation, compression, and memory management
For chat applications with long histories, the standard trick is to summarise older turns as they age out. A chat with 50 turns might keep the most recent 10 turns verbatim, summarise turns 11-30 into a single paragraph, and drop turns 31-50 entirely (or archive them to RAG for later retrieval).
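The windowing policy described above can be sketched directly. The `summarise` parameter is a stand-in for a call to a small, cheap model; everything else is plain list bookkeeping:

```python
def manage_history(turns, keep_verbatim=10, summarise_until=30, summarise=None):
    """Keep the newest `keep_verbatim` turns verbatim, summarise the next
    batch into one entry, drop (or archive) everything older."""
    if summarise is None:
        # Placeholder for a cheap-LLM summarisation call.
        summarise = lambda ts: "Summary of %d earlier turns." % len(ts)
    recent = turns[-keep_verbatim:]
    middle = turns[-summarise_until:-keep_verbatim]
    dropped = turns[:-summarise_until] if len(turns) > summarise_until else []
    history = []
    if middle:
        history.append(summarise(middle))
    history.extend(recent)
    return history, dropped  # `dropped` could be archived to a RAG store

turns = [f"turn {i}" for i in range(1, 51)]  # a 50-turn chat
history, archived = manage_history(turns)
print(len(history), len(archived))  # → 11 20 (1 summary + 10 verbatim; 20 archived)
```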
Compression via summarisation is an ongoing research area. Simple approaches use a smaller, cheaper LLM to summarise old history. More sophisticated approaches use learned compression — training small models to produce summaries that preserve specific kinds of information. Some production systems use rolling summaries, where each new turn updates a short running summary of the conversation state.
Agentic and long-horizon applications take this further with explicit memory systems: external databases (often vector databases with episodic memory) that store structured facts about the user, their preferences, and conversation history, retrieved as needed rather than always present in context.
Counting tokens, estimating cost, and measuring reality
Before shipping, every serious AI project should have a way to estimate token usage and cost in advance. The basic building blocks.
For any major model, the provider's tokeniser is accessible as a library. Count the tokens in your prompt template, your retrieved context, and your expected output. Multiply by the per-million-token price for input and output separately. This gives you the cost per query.
For cost projection, multiply by expected query volume. Run this calculation before you write much code, because it is often the constraint that forces architecture decisions. A product that costs $5 per query is not shippable at consumer prices; a product that costs $0.002 per query is.
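The whole calculation fits in a few lines. Default prices below are the article's illustrative Sonnet-class numbers, not a quote — check current pricing:

```python
def cost_per_query(input_tokens, output_tokens,
                   input_price_per_m=3.0, output_price_per_m=15.0):
    """Dollar cost of one call: tokens times per-million-token price."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

def monthly_cost(input_tokens, output_tokens, queries_per_day, days=30, **prices):
    """Project per-query cost out to a monthly bill."""
    return cost_per_query(input_tokens, output_tokens, **prices) * queries_per_day * days

# A 6,500-token prompt with a 300-token response, 50,000 queries/day:
print(round(cost_per_query(6_500, 300), 4))     # → 0.024
print(round(monthly_cost(6_500, 300, 50_000)))  # → 36000
```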
In production, log token counts on every API call. Track input tokens, output tokens, cache-hit tokens, and total tokens per user, per feature, and per model. Anomalies in these metrics are often the first sign of quality regressions, prompt bugs, or users trying to abuse the system with oversized inputs.
A worked example: cost breakdown of a chat product at scale
To see how these numbers play out in reality, consider a typical AI chat product with 10,000 daily active users. Each user sends, on average, 5 messages per day. Each conversation has a 2,000-token system prompt, 500 tokens of retrieved RAG context, and builds up to an average of 4,000 tokens of conversation history by the end of a session. Average response length: 300 tokens.
Without prompt caching, each message sends 6,500 input tokens and generates 300 output tokens. At Claude Sonnet pricing (roughly $3 per million input, $15 per million output), that is $0.0195 + $0.0045 = $0.024 per message. Multiplied by 50,000 messages per day: $1,200 per day in API costs, or about $36,000 per month.
With prompt caching applied to the 2,000-token system prompt, the cached portion costs one-tenth: $0.0006 instead of $0.006 per message, a saving of $0.0054. Multiplied out: about $270/day in savings, or roughly $8,100/month. Over a fifth of the API bill reclaimed by one engineering change.
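Checking the caching arithmetic explicitly (the one-tenth discount is the illustrative figure used above; actual cache pricing varies by provider):

```python
INPUT_PRICE = 3.0 / 1_000_000   # $ per input token (article's Sonnet-class figure)
CACHE_DISCOUNT = 0.10           # cached tokens billed at one-tenth (illustrative)

system_prompt_tokens = 2_000
messages_per_day = 50_000

full = system_prompt_tokens * INPUT_PRICE        # $0.006 per message uncached
cached = full * CACHE_DISCOUNT                   # $0.0006 per message cached
saving_per_message = full - cached               # $0.0054

print(round(saving_per_message * messages_per_day))       # → 270  ($/day)
print(round(saving_per_message * messages_per_day * 30))  # → 8100 ($/month)
```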
Moving from Sonnet to a fine-tuned Haiku for the majority of queries, while keeping Sonnet for the hard ones through smart routing, can further halve the bill without measurably hurting quality. These are the kinds of optimisations that separate AI products with healthy unit economics from ones that bleed money at scale.
Structured outputs and format constraints
A quiet but important 2026 development is that all major model providers now support strict structured output — the model is guaranteed to produce JSON matching a specified schema, or a function call with specified arguments, or XML matching a grammar. This feature interacts with tokens in interesting ways.
Structured outputs cost the same per token but remove an entire class of parsing errors and retries. Without them, you might have to retry a prompt three times to get valid JSON, which is three times the output tokens and three times the latency. With strict outputs, the first response is always valid.
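The retry penalty is easy to estimate in expectation. Assuming each unconstrained attempt independently yields parseable output with some probability (the 80% figure below is a made-up example, not a measurement):

```python
def expected_attempts(p_valid: float) -> float:
    """Expected number of tries until the first valid response, if each
    attempt independently parses with probability p_valid."""
    return 1.0 / p_valid

# If unconstrained JSON parses ~80% of the time, you pay ~1.25x the output
# tokens and latency on average; strict structured output pays exactly 1x.
print(expected_attempts(0.8))  # → 1.25
```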
This also affects how you design prompts. You do not need to spend hundreds of tokens explaining the exact JSON format you want; the schema definition (fed to the structured output API) handles it. Prompts become shorter and more focused on intent rather than format.
If you are sending structured data to or from an AI model, use structured outputs. They are effectively free, save tokens, and eliminate a frustrating class of bugs.
When long context beats RAG
RAG is usually cheaper and often more accurate, but there are cases where long-context stuffing genuinely wins.
Small, static corpora. If the entire document you need is 30K tokens, just include it in the prompt. Prompt caching makes this cheap on repeated queries. RAG adds complexity you do not need.
Tasks that need holistic understanding. Summarising a whole novel, or catching inconsistencies across a long contract, benefits from seeing everything at once. RAG might miss the cross-cutting pattern.
Unfamiliar content where chunking strategy is unclear. Sometimes the work of designing good chunking exceeds the savings. Long context lets you prototype quickly before deciding whether RAG is worth it.
For anything larger than a single document or any system serving many diverse queries, RAG wins. Know both tools and choose deliberately.
Tokens, quietly, matter to users too
A final point worth making. Tokens are not purely a back-end concern. They influence user experience in ways users feel without naming.
Streaming responses feel faster because tokens arrive one by one — even though total time to completion may be identical to a non-streaming response. Users perceive interactivity, which is why every serious chatbot streams.
Token limits cause visible quality degradation. When a conversation nears the context window, older messages get dropped and the model "forgets" earlier parts of the discussion. Users describe the bot as becoming "inconsistent" or "losing the plot"; usually the real reason is context truncation.
Output token limits cause the classic "the response got cut off" problem. The model simply ran out of room. Thoughtful products handle this gracefully — detecting a truncated response and offering to continue — rather than leaving users staring at an abrupt period.
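A sketch of the graceful handling. The field names here are hypothetical — each provider reports its own stop reason (a "ran out of tokens" vs "natural end" distinction); map yours accordingly:

```python
def handle_response(resp: dict) -> str:
    """Detect a truncated response and offer to continue, rather than
    leaving the user staring at an abrupt cut-off.

    `resp` mimics a generic API response; "stop_reason" and "max_tokens"
    are placeholder names, not any specific provider's schema."""
    text = resp["text"]
    if resp.get("stop_reason") == "max_tokens":
        return text + "\n\n[Response truncated — reply 'continue' for more.]"
    return text

print(handle_response({"text": "...a long answer", "stop_reason": "max_tokens"}))
```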
Every AI product team should instrument token counts per request and alert on anomalies. A sudden jump in average input tokens usually signals a prompt-injection attack, a retrieval bug, or a runaway conversation. A sudden jump in output tokens usually signals a model mis-behaving and rambling. These signals are more actionable than most other metrics and almost free to collect.
A token is roughly 3-4 characters. The context window is how many tokens the model can see at once. Longer context feels magical but costs real money — and understanding both is the single most useful skill for shipping AI economically.
The short version
Tokens are the units in which AI models think and are priced. Context windows are the limit of how many tokens a single call can contain. Long contexts unlock new use cases but get expensive quickly because attention scales quadratically. Prompt caching, RAG, summarisation, and memory management are the production tools that keep costs sane. Measure token usage from day one; ignoring these numbers is how AI products quietly become unprofitable.