When you type a message to ChatGPT, Claude, or any large language model, something remarkable happens in the fraction of a second before you see a response begin streaming back. Your text is disassembled into pieces, converted to numbers, passed through billions of mathematical operations, and reassembled into coherent language—one token at a time.
Understanding this process isn’t just academic. It explains why AI has context limits, why longer conversations cost more, why certain prompts work better than others, and why models sometimes produce unexpected outputs. This knowledge is essential for anyone building production AI systems or working to control the hidden costs of AI projects. This guide walks through the complete journey of a message, from the moment you hit send to the final token of the response.
What Are Tokens?
Tokens are the fundamental units that language models actually process. They’re not characters, not words, not sentences—they’re something in between, optimized for how language actually works.
The Problem with Characters and Words
A naive approach might process text character by character. But individual characters carry little meaning—the letter “t” tells you almost nothing without context. Processing at the character level would also make sequences extremely long, which is computationally expensive.
Processing word by word seems more natural, but it creates other problems:
- Vocabulary explosion. English has hundreds of thousands of words. Add technical terms, names, misspellings, and multiple languages, and you need millions of entries. Each entry requires its own learned representation, making the model enormous.
- Unknown words. Any word not in your vocabulary becomes unparseable. “Cryptocurrency” might work, but “DeFi” or “zkSync” might not.
- Morphological blindness. The words “run,” “running,” “runner,” and “ran” would be completely separate entries with no shared understanding of their common root.
Subword Tokenization
Modern LLMs use subword tokenization—algorithms that break text into pieces larger than characters but often smaller than words. The most common approaches are:
Byte Pair Encoding (BPE): Starts with individual characters and iteratively merges the most frequent pairs. After training on a large corpus, common words like “the” become single tokens, while rare words get split into pieces. GPT models use a variant of BPE.
WordPiece: Similar to BPE but uses a different merging criterion based on likelihood. Used by BERT and related models.
SentencePiece: A language-agnostic approach that treats the input as a raw byte stream, handling any language or character set uniformly. Used by models like Llama 4 and T5.
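The BPE merge loop described above can be sketched in a few lines of Python. This is a toy trainer for illustration, not any real tokenizer's implementation: it starts from individual characters and repeatedly merges the most frequent adjacent pair.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def bpe_train(text, num_merges):
    """Learn `num_merges` merge rules from a tiny corpus of characters."""
    tokens = list(text)  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        merges.append(pair)
        tokens = merge_pair(tokens, pair)
    return tokens, merges
```

Training on `"low lower lowest"` for a few merges quickly produces a `"low"` token, mirroring how frequent substrings become single vocabulary entries. Production tokenizers add byte-level fallbacks, pre-tokenization rules, and far larger corpora.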
Tokenization in Practice
Here’s how GPT-style tokenization handles various inputs:
"Hello, world!" → ["Hello", ",", " world", "!"] (4 tokens)
"The quick brown fox" → ["The", " quick", " brown", " fox"] (4 tokens)
"tokenization" → ["token", "ization"] (2 tokens)
"GPT-4.5" → ["G", "PT", "-", "4", ".", "5"] (6 tokens)
"こんにちは" → ["こん", "にち", "は"] (3 tokens, varies by tokenizer)
"🚀" → ["🚀"] or multiple tokens depending on model
Notice several things:
- Spaces are often attached to the following word. “ quick” is one token, not “quick” preceded by a space token. This is intentional—it helps the model learn that words after spaces behave differently than words at the start of text.
- Common words stay whole. “Hello” and “The” are single tokens because they appear constantly in training data.
- Rare words get split. “tokenization” becomes “token” + “ization” because the full word is less common than its parts.
- Non-English text often requires more tokens. Languages not well-represented in training data get split into smaller pieces. This means the same semantic content in Japanese might use 2-3x more tokens than English.
- Special characters and emoji vary. Some tokenizers handle emoji efficiently; others break them into multiple tokens representing their underlying byte sequences.
Why Token Count Matters
Every LLM has a context window—the maximum number of tokens it can process at once. This includes both your input and the model’s output:
| Model | Context Window |
|---|---|
| GPT-4.5 | 128,000 tokens |
| Claude Opus 4.5 | 200,000 tokens |
| Claude Sonnet 4 | 200,000 tokens |
| Gemini 2.0 Pro | 2,000,000 tokens |
| Llama 4 (70B) | 128,000 tokens |
A rough heuristic: 1 token ≈ 4 characters or 100 tokens ≈ 75 words in English. But this varies significantly:
- Code is often less efficient (more special characters, precise formatting)
- Technical writing with jargon uses more tokens per concept
- Non-Latin scripts typically require more tokens
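The 4-characters-per-token heuristic can be wrapped in a tiny helper for budgeting. This is a rough estimator only; exact counts require the model's own tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Ballpark token count for English text using the ~4 chars/token
    heuristic. Use the model's real tokenizer for billing-accurate counts;
    code and non-Latin scripts will typically exceed this estimate."""
    return max(1, round(len(text) / 4))
```

For example, `estimate_tokens("The quick brown fox")` gives 5, close to the actual 4 tokens shown earlier.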
Understanding tokenization helps you:
- Estimate costs (API pricing is per-token)
- Manage context windows effectively
- Write more efficient prompts
- Debug unexpected model behavior (sometimes tokens split in surprising places)
The Journey of a Message
Now let’s follow a message through the complete inference pipeline.
Step 1: Preprocessing and Tokenization
When you send “What is the capital of France?” to an API, several things happen before the model sees it:
┌─────────────────────────────────────────────────────────────────┐
│ Raw Input │
│ "What is the capital of France?" │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ System Prompt Prepended │
│ "You are a helpful assistant.\n\nWhat is the capital of..." │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Tokenization │
│ [9906, 374, 264, 11190, 18328, 13, 198, 198, 3923, ...] │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Embedding Lookup │
│ Each token ID → 4096-dimensional vector (or similar) │
└─────────────────────────────────────────────────────────────────┘
Tokenization converts each piece of text to a token ID—an integer that indexes into the model’s vocabulary. A typical vocabulary has 32,000 to 100,000+ entries.
Embedding lookup converts each token ID into a high-dimensional vector (commonly 4,096 or 8,192 dimensions for large models). These embedding vectors are learned during training and capture semantic relationships. Words with similar meanings have vectors that point in similar directions.
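Mechanically, the embedding lookup is just row indexing into a learned matrix. A minimal NumPy sketch with toy sizes (real models use vocabularies of tens of thousands and dimensions around 4,096-8,192; the random values stand in for trained weights):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 64  # toy sizes for illustration

# The embedding table is a learned (vocab_size, d_model) weight matrix;
# random values stand in for trained ones here.
embedding_table = rng.standard_normal((vocab_size, d_model))

token_ids = np.array([906, 374, 264])    # IDs produced by the tokenizer
embeddings = embedding_table[token_ids]  # lookup is just row indexing
```

The result is one vector per token, ready for positional encoding and the transformer stack.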
Step 2: Positional Encoding
Transformers process all tokens in parallel, which means they have no inherent sense of order. The sentence “dog bites man” would look identical to “man bites dog” without additional information.
Positional encodings solve this by adding position information to each token’s embedding:
Token embedding for "capital" (position 5):
[0.23, -0.45, 0.12, ...] (semantic meaning)
+ [0.01, 0.02, -0.01, ...] (position 5 encoding)
= [0.24, -0.43, 0.11, ...] (input to transformer)
Modern models use various positional encoding schemes:
- Absolute positional encodings: Simple addition of position-specific vectors
- Rotary Position Embeddings (RoPE): Encodes position through rotation in the embedding space, enabling better length generalization
- ALiBi (Attention with Linear Biases): Adds position-dependent penalties to attention scores rather than modifying embeddings
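The simplest of these, absolute sinusoidal encodings from the original transformer paper, can be computed directly (RoPE and ALiBi work differently, as noted above):

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Absolute sinusoidal encodings:
    PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle).
    Nearby positions get similar vectors; distant ones diverge."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added elementwise to the token embeddings before the first layer:
# x = token_embeddings + sinusoidal_positions(seq_len, d_model)
```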
Step 3: The Transformer Stack
The positioned embeddings now pass through the transformer—a stack of repeated layers, each containing two main components:
┌─────────────────────────────────────────────────────────────────┐
│ Transformer Layer (×N) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Multi-Head Self-Attention │ │
│ │ • Each token attends to all previous tokens │ │
│ │ • Learns which context is relevant for each position │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ Add & Normalize │
│ │ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Feed-Forward Network │ │
│ │ • Processes each position independently │ │
│ │ • Transforms representations non-linearly │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ Add & Normalize │
└─────────────────────────────────────────────────────────────────┘
│
(repeat N times)
A model like GPT-4.5 has roughly 120 transformer layers. Llama 4 70B has 80 layers. Each layer refines the representations, building increasingly abstract understanding.
Self-Attention: The Core Innovation
Self-attention is what makes transformers powerful. For each token, it computes how much to “pay attention” to every other token when updating its representation.
For the input “What is the capital of France?”, when processing the word “capital,” attention might look like:
Attention weights for "capital":
"What" → 0.05 (low relevance)
"is" → 0.02 (low relevance)
"the" → 0.08 (moderate—articles often matter)
"capital" → 0.15 (self-attention)
"of" → 0.25 (high—"capital of" is a phrase)
"France" → 0.35 (highest—this is what we're finding the capital OF)
"?" → 0.10 (indicates question context)
The model learns these attention patterns during training. Different attention heads (typically 32-128 per layer) learn different patterns—some focus on syntax, others on semantic relationships, others on coreference.
Crucial detail for generation: During inference, models use causal masking—each token can only attend to tokens that came before it. This prevents “cheating” by looking at future tokens and enables autoregressive generation.
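Single-head causal attention is compact enough to write out in full. A NumPy sketch (real models use many heads, learned Q/K/V projections, and fused GPU kernels):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask:
    position i may only attend to positions <= i."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq, seq) similarities
    mask = np.triu(np.ones_like(scores), k=1)     # 1s above the diagonal
    scores = np.where(mask == 1, -1e9, scores)    # block future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights
```

Each row of `weights` is a probability distribution like the "capital" example above, and the masked upper triangle is what enforces left-to-right generation.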
Feed-Forward Networks
After attention, each token’s representation passes through a feed-forward network—typically two linear transformations with a non-linearity between them:
FFN(x) = activation(x · W₁ + b₁) · W₂ + b₂
Recent models use activation functions like SwiGLU or GeGLU instead of simple ReLU. The feed-forward layers are where much of the model’s “knowledge” is stored—factual associations learned during training.
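The FFN formula above translates directly to code. This sketch uses a GELU approximation in place of the SwiGLU/GeGLU variants mentioned, to keep it to two weight matrices:

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU; a stand-in for SwiGLU/GeGLU here."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: expand to d_ff, apply the non-linearity,
    project back to d_model. Applied independently at every position."""
    return gelu(x @ W1 + b1) @ W2 + b2

# Shapes: x is (seq_len, d_model), W1 is (d_model, d_ff), W2 is (d_ff, d_model);
# d_ff is typically ~4x d_model.
```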
Step 4: Output Projection and Sampling
After passing through all transformer layers, the final hidden state for the last token is projected to vocabulary size:
┌─────────────────────────────────────────────────────────────────┐
│ Final Hidden State (last position) │
│ [4096 dimensions] │
└─────────────────────────────────────────────────────────────────┘
│
Linear projection (4096 → vocab_size)
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Logits (unnormalized scores) │
│ Token 0: -2.3 Token 1: 1.2 Token 2: -0.8 ... │
│ [vocab_size dimensions] │
└─────────────────────────────────────────────────────────────────┘
│
Softmax
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Probability Distribution │
│ "Paris": 0.82 "The": 0.03 "France": 0.02 ... │
└─────────────────────────────────────────────────────────────────┘
The model doesn’t deterministically pick the highest-probability token. Instead, it samples from this distribution, controlled by parameters:
Temperature: Scales the logits before softmax. Temperature > 1 flattens the distribution (more random), temperature < 1 sharpens it (more deterministic). Temperature = 0 is equivalent to always picking the highest-probability token (greedy decoding).
Temperature = 0.3 (focused): "Paris": 0.95, "The": 0.01, ...
Temperature = 1.0 (balanced): "Paris": 0.82, "The": 0.03, ...
Temperature = 1.5 (creative): "Paris": 0.45, "The": 0.12, ...
Top-p (nucleus sampling): Instead of considering all tokens, only sample from the smallest set whose cumulative probability exceeds p. With top_p=0.9, if “Paris” (0.82) + “The” (0.03) + “France” (0.02) + “Lyon” (0.02) + “It” (0.01) = 0.90, only these five tokens are candidates.
Top-k: Only consider the k highest-probability tokens. Simpler than top-p but can cut off reasonable options if probability is spread across many tokens.
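Temperature and top-p combine naturally in one sampling function. A simplified sketch of how an inference server might apply them, not any particular implementation:

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_p=1.0, rng=None):
    """Sample a token ID from raw logits with temperature scaling and
    nucleus (top-p) filtering. temperature=0 falls back to greedy decoding."""
    rng = rng or np.random.default_rng()
    if temperature == 0:
        return int(np.argmax(logits))            # greedy decoding
    scaled = logits / temperature                # flatten or sharpen
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                         # softmax
    order = np.argsort(probs)[::-1]              # highest probability first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1  # smallest nucleus
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()       # renormalize the nucleus
    return int(rng.choice(keep, p=kept))
```

With `top_p=0.9` and the probabilities from the example above, exactly the five listed tokens would form the nucleus.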
Step 5: Autoregressive Generation
Here’s the key insight: the model generates one token at a time. After predicting “Paris,” the entire process repeats with “Paris” appended to the input:
Input: "What is the capital of France?"
→ Predict: "Paris"
Input: "What is the capital of France? Paris"
→ Predict: ","
Input: "What is the capital of France? Paris,"
→ Predict: " the"
Input: "What is the capital of France? Paris, the"
→ Predict: " capital"
... and so on until generating <EOS> (end of sequence) or hitting max tokens
This is why longer responses take longer to generate—each token requires a full forward pass through the model. It’s also why streaming works: each token can be sent to the user as soon as it’s generated.
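The loop itself is simple. Here `model` is a hypothetical stand-in for a real LLM forward pass: any callable mapping a token-ID sequence to next-token logits (greedy decoding shown for simplicity):

```python
import numpy as np

def generate(model, prompt_ids, max_new_tokens, eos_id):
    """Autoregressive loop: each step feeds the whole sequence back through
    the model and appends the predicted next token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(model(ids)))  # pick most likely next token
        ids.append(next_id)
        if next_id == eos_id:                 # stop at end-of-sequence
            break
    return ids
```

With a real model, every iteration is a full forward pass, which is exactly why a 500-token answer costs roughly 500 passes and why each token can be streamed as soon as the loop produces it.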
The KV Cache
Recomputing attention for all previous tokens on every generation step would be extremely wasteful. Instead, models use a key-value cache:
During attention, each token produces key (K) and value (V) vectors. Once computed, these are cached. On subsequent generation steps, the model only computes K and V for the new token, then attends over the cached values.
This is why GPU memory usage grows during generation—the KV cache accumulates:
KV cache memory ≈ 2 × num_layers × num_heads × head_dim × sequence_length × bytes_per_param
For a 70B parameter model generating a 4,000-token response, the KV cache alone might consume 10+ GB of GPU memory.
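Plugging the formula above into code makes the growth concrete. The configuration below uses illustrative dense-attention numbers; models with grouped-query attention cache far fewer KV heads and thus much less memory:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, bytes_per_param=2):
    """Memory for keys plus values (the leading factor of 2) across all
    layers, assuming fp16/bf16 storage at 2 bytes per parameter."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_param

# Illustrative 70B-class dense config: 80 layers, 64 KV heads of dim 128,
# a 4,000-token sequence, fp16.
gb = kv_cache_bytes(80, 64, 128, 4000) / 1e9  # ~10.5 GB
```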
What the Model “Knows” vs. Computes
A common misconception is that LLMs “look things up” like a database. The reality is more nuanced.
Knowledge in Weights
During training, the model adjusts billions of parameters to predict the next token across trillions of tokens of text. Factual associations get encoded into the weights—particularly in the feed-forward layers. When you ask “What is the capital of France?”, the model isn’t searching a database. The association between “France,” “capital,” and “Paris” is distributed across millions of parameters in ways that produce “Paris” as the high-probability next token.
This has implications:
- Knowledge has a training cutoff. The model knows what was in its training data, nothing after.
- Knowledge is fuzzy. The model might “know” something well enough to usually get it right, but occasionally hallucinate related-but-wrong information.
- Knowledge competes. If training data contained conflicting information, the model learns a blend—which can produce confident-sounding wrong answers.
Computation in Inference
Some capabilities emerge from computation during inference rather than memorized knowledge:
- Logical reasoning (to a degree) happens through the transformer’s attention and feed-forward operations
- Following instructions involves attending to the prompt and adjusting generation accordingly
- In-context learning works by attending to examples in the prompt and extracting patterns
This is why chain-of-thought prompting helps with math and reasoning—it forces the model to “show its work,” using the generated tokens as working memory to perform multi-step computation. For practical techniques on leveraging this, see our guide on prompt engineering patterns for production systems.
Why Things Go Wrong
Understanding the inference process explains common failure modes:
Hallucinations
The model is always predicting the most likely next token given context. If the training data associated “the first person to walk on the moon” with both “Neil Armstrong” (correct) and occasionally with other astronaut names, the model might confidently generate a wrong name—especially if the prompt context subtly shifts probabilities.
Context Window Limits
Once you exceed the context window, early tokens literally cannot be attended to. The model isn’t “forgetting”—that information simply isn’t in the computation anymore. This is why long conversations can seem to lose coherence.
Repetition
Autoregressive generation creates feedback loops. If the model generates a phrase that slightly increases the probability of generating that phrase again, it can get stuck in loops. Repetition penalties in sampling help but don’t fully solve this.
Prompt Sensitivity
Because generation depends entirely on the input token sequence, small changes can shift probability distributions significantly. The token boundaries matter too—phrasing that tokenizes differently might produce different results even if the meaning is identical.
The Cost Equation
Inference cost scales with:
- Input tokens: Must all be processed through the transformer
- Output tokens: Each requires a full forward pass
- Model size: Larger models require more computation per token
- Batch size: Processing multiple requests together is more efficient
This is why API pricing distinguishes input vs. output tokens—output tokens are roughly 2-3x more expensive because they can’t be parallelized across the sequence (each depends on all previous).
Typical API pricing structure:
Input: $X per 1M tokens
Output: $3X per 1M tokens
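A per-request cost estimate follows directly from that structure. The prices below are placeholders matching the $X / $3X pattern above, not any provider's real rates:

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_m=3.0, output_price_per_m=9.0):
    """Cost in dollars of one API call, given per-million-token prices.
    Placeholder prices; substitute your provider's published rates."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000
```

For example, a call with a 1,000-token prompt and a 500-token response at these placeholder rates costs $0.0075, with the output tokens accounting for most of it.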
Understanding this helps with cost optimization:
- Cache common system prompts
- Use shorter prompts where possible
- Set appropriate max_tokens limits
- Choose model size appropriate to task complexity
Practical Implications
This technical understanding translates to practical guidance:
For prompt engineering:
- Tokens are the real unit—think about how your prompt tokenizes
- Position matters—recent tokens are attended to more strongly
- Clear structure helps the model’s attention mechanisms
- Few-shot examples work because attention can extract patterns
For system design:
- Context window is a hard limit—plan for it
- Generation speed is token-bound—streaming helps perceived latency
- KV cache means memory grows during generation
- Batching improves throughput but adds latency
For cost management:
- Output tokens cost more—be precise about what you need
- Long system prompts are amortized across many user queries
- Smaller models are dramatically cheaper—use them where sufficient
- Token-efficient prompting compounds savings at scale
Looking Forward
The transformer architecture and autoregressive generation have proven remarkably capable, but active research continues on:
- Longer context windows through sparse attention and architectural innovations
- More efficient inference through quantization, speculative decoding, and distillation
- Better reasoning through chain-of-thought, tree-of-thought, and agent architectures
- Reduced hallucination through retrieval augmentation and improved training—learn more in our guide on designing RAG pipelines
Understanding the fundamentals helps you evaluate these advances and apply them appropriately. The model isn’t magic—it’s math. Complex, learned, surprisingly capable math—but math nonetheless. That understanding is the foundation for using these tools effectively.
