
Foundations of Transformer Reasoning

A technical deep-dive into transformer architectures, attention mechanisms, scaling laws, and emerging techniques for reliable AI reasoning.

S5 Labs Team, February 3, 2026

This paper provides a rigorous treatment of the theoretical and practical foundations underlying modern large language models, with particular emphasis on reasoning capabilities, scaling behavior, and architectural innovations. We synthesize results from across the field to present a unified view of where the technology stands, why current systems exhibit their characteristic strengths and limitations, and what architectural and algorithmic approaches show promise for the next generation of AI systems.

The intended audience includes researchers, engineers building production AI systems, and technical leaders making architectural decisions. We assume familiarity with deep learning fundamentals, linear algebra, and probability theory.

1. The Transformer as a Computational Primitive

The transformer architecture, introduced by Vaswani et al. (2017), has become the dominant paradigm for sequence modeling. Understanding its computational properties—what it can and cannot express—is foundational to understanding both its capabilities and limitations.

1.1 Attention as Soft Dictionary Lookup

Self-attention can be understood as a differentiable dictionary lookup operation. Given an input sequence X \in \mathbb{R}^{n \times d}, attention computes:

\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V

where Q = XW_Q, K = XW_K, V = XW_V are linear projections, and d_k is the dimension of the key vectors (typically d_k = d_{\text{model}} / n_{\text{heads}}). The \sqrt{d_k} scaling factor prevents the dot products from growing too large in magnitude as dimensionality increases, which would push the softmax into regions of extremely small gradients.

The softmax-weighted combination creates a soft lookup: rather than retrieving a single value, each query retrieves a weighted combination of all values, with weights determined by query-key similarity. This soft retrieval is what makes attention differentiable and trainable.
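The soft lookup described above fits in a few lines of NumPy. This is a minimal sketch (single head, no masking or batching), which makes the structure explicit: each row of the attention pattern is a probability distribution over keys.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))  # (n_q, n_k) attention pattern
    return A @ V, A

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 16, 8
Q, K = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out, A = attention(Q, K, V)
# Each row of A sums to 1: a soft, weighted retrieval over all values.
```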

Key insight: The attention pattern matrix is:

A = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)

This represents a learned, input-dependent routing of information. Unlike convolutions (fixed local patterns) or recurrence (sequential propagation), attention allows any position to directly access any other position in a single operation.

1.2 Expressiveness and Computational Complexity

What transformers can compute: The computational power of transformers depends critically on implementation details—particularly the precision of arithmetic and the form of attention used.

Pérez et al. (2021) showed that transformers with hard attention (where attention weights are exactly 0 or 1) are Turing complete under certain assumptions. However, this result does not directly apply to standard soft-attention transformers used in practice.

For bounded-precision (log-precision) transformers, the picture is more constrained. Merrill & Sabharwal (2023) demonstrate that such transformers can be simulated by constant-depth logspace-uniform threshold circuits (\mathsf{TC}^0). This places fundamental limits on what bounded-precision transformers can compute in a single forward pass—they cannot solve problems outside \mathsf{TC}^0 without growing depth or precision.

The depth-width tradeoff: Theoretical results demonstrate that transformer depth is fundamentally more powerful than width for certain computations. Specifically, there exist functions computable by O(\log n)-depth transformers that require exponential width in constant-depth transformers. This has practical implications: for tasks requiring multi-step reasoning, depth matters.

Computational complexity: Standard self-attention is O(n^2 d) in sequence length n and dimension d. This quadratic scaling motivates extensive research into efficient attention variants:

| Variant | Complexity | Notes |
| --- | --- | --- |
| Standard attention | O(n^2 d) | Full expressiveness |
| Sparse attention | O(n k d) | k = attended keys per query; depends on sparsity pattern (local, strided, block) |
| Linear attention | O(n d r) | r = feature/rank dimension of kernel approximation; trades expressiveness for speed |
| FlashAttention | O(n^2 d) | Same complexity, ~2-4× faster via memory hierarchy optimization (SRAM vs HBM) |
| Ring attention | O(n^2 d / p) compute | Distributed across p devices; communication overhead not shown |

FlashAttention (Dao et al., 2022) deserves special mention: it computes exactly the same attention output as the standard algorithm but with dramatically better wall-clock time by respecting the GPU memory hierarchy. This is a reminder that algorithmic complexity is not the only consideration—constant factors and hardware characteristics matter enormously at scale.

1.3 Positional Encoding and Length Generalization

Transformers process positions in parallel and thus require explicit position information. The choice of positional encoding significantly affects model behavior, particularly for length generalization.

Absolute positional encodings (sinusoidal or learned) embed each position as a vector added to token embeddings. Simple but limited: models trained on sequences up to length L often fail catastrophically on longer sequences.

Rotary Position Embeddings (RoPE) from Su et al. (2021) encode position through rotation in embedding space:

f_q(x_m, m) = R_{\Theta,m} W_q x_m

where R_{\Theta,m} is a rotation matrix with angle proportional to position m. RoPE encodes relative position information directly in the attention computation: the dot product q_m^\top k_n depends on the relative position m - n. This improves length generalization but does not fully solve it.
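A small NumPy sketch illustrates the relative-position property, assuming the standard base frequency of 10000 and consecutive dimension pairs as rotation planes: rotated query-key dot products agree whenever the offset m − n agrees.

```python
import numpy as np

def rope_rotate(x, pos, theta=10000.0):
    """Apply RoPE to an even-dimensional vector: rotate each 2-d pair by pos * freq_i."""
    d = x.shape[-1]
    freqs = theta ** (-np.arange(0, d, 2) / d)  # one frequency per dimension pair
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(1)
q, k = rng.normal(size=8), rng.normal(size=8)
# The rotated dot product depends only on the relative offset m - n:
s1 = rope_rotate(q, 7) @ rope_rotate(k, 4)    # offset 3
s2 = rope_rotate(q, 12) @ rope_rotate(k, 9)   # offset 3 again -> same score
```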

ALiBi (Attention with Linear Biases) from Press et al. (2022) adds position-dependent biases directly to attention logits:

\mathrm{Attention}_{\mathrm{ALiBi}}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + B\right)V

where the bias matrix is defined as:

B_{i,j} = -m \cdot |i-j| \quad \text{(head-specific slope } m\text{)}

The bias penalizes attention to distant positions, with different heads using different slopes (geometrically spaced, e.g., m \in \{2^{-8}, 2^{-7}, \ldots\}) to capture different distance scales. ALiBi provides strong length generalization with minimal computational overhead.
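A minimal sketch of the bias construction, assuming the geometric slope scheme 2^{-8i/n_heads} from the ALiBi paper:

```python
import numpy as np

def alibi_bias(n, n_heads):
    """ALiBi bias tensor B[h, i, j] = -m_h * |i - j| with geometric head slopes."""
    # For n_heads a power of two, slopes are 2^-1, 2^-2, ..., 2^-8 style sequences.
    slopes = 2.0 ** (-np.arange(1, n_heads + 1) * (8.0 / n_heads))
    dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    return -slopes[:, None, None] * dist[None, :, :]

B = alibi_bias(4, 8)  # (n_heads, n, n); added to attention logits before softmax
```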

The length generalization problem remains open. Current models can generalize somewhat beyond training lengths with appropriate positional encodings and training procedures, but robust extrapolation to arbitrary lengths remains challenging. This has implications for tasks requiring long-range reasoning.

1.4 Inference-Time Systems Constraints

Production deployment introduces constraints beyond raw model capability:

KV-cache memory: During autoregressive generation, key and value vectors for all previous tokens must be stored. Memory (in bytes) grows as:

\text{KV}_{\text{bytes}} = 2 \times L \times n_{\text{tokens}} \times n_{\text{kv\_heads}} \times d_{\text{head}} \times b_{\text{elem}} \times B

Notation definitions:

  • L = number of transformer layers
  • n_{\text{tokens}} = sequence length (cached tokens so far)
  • n_{\text{kv\_heads}} = number of key/value heads (may be < n_{\text{heads}} under GQA/MQA)
  • d_{\text{head}} = per-head dimension (typically d_{\text{model}} / n_{\text{heads}})
  • b_{\text{elem}} = bytes per element (2 for fp16/bf16, 1 for int8, 4 for fp32)
  • B = batch size

For a 70B model (80 layers, 8 KV heads with GQA, 128-dim heads) generating 4K tokens in fp16: 2 \times 80 \times 4096 \times 8 \times 128 \times 2 \times 1 \approx 1.34\text{ GB} (≈1.25 GiB).
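The formula is easy to wrap in a helper; this sketch reproduces the 70B example above:

```python
def kv_cache_bytes(n_layers, n_tokens, n_kv_heads, d_head, bytes_per_elem, batch_size=1):
    """KV-cache size: 2 (K and V) * layers * tokens * kv_heads * head dim * elem bytes * batch."""
    return 2 * n_layers * n_tokens * n_kv_heads * d_head * bytes_per_elem * batch_size

# 70B-class model: 80 layers, GQA with 8 KV heads, 128-dim heads,
# 4096 cached tokens in fp16 (2 bytes), batch size 1.
b = kv_cache_bytes(80, 4096, 8, 128, 2, 1)
print(b / 1e9)  # → 1.34217728 (GB)
```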

Multi-Query and Grouped-Query Attention: MQA (Shazeer, 2019) shares key-value heads across all query heads, reducing KV-cache by the number of heads. GQA (Ainslie et al., 2023) groups query heads to share KV heads, trading off between MQA’s efficiency and full multi-head attention’s expressiveness. Most production models now use GQA.

Batching and throughput: Inference systems batch requests to amortize fixed costs, but different sequence lengths complicate batching. Continuous batching, speculative decoding, and paged attention (vLLM) address these challenges.

2. Scaling Laws and Emergent Capabilities

One of the most significant empirical discoveries in deep learning is the existence of predictable scaling laws governing model performance as a function of compute, parameters, and data.

2.1 The Scaling Laws

Kaplan et al. (2020) established that language model loss follows power laws in parameters, data, and compute:

L(N) \propto N^{-\alpha_N}, \quad L(D) \propto D^{-\alpha_D}, \quad L(C) \propto C^{-\alpha_C}

where N is parameters, D is data (tokens), C is compute (FLOPs), and the \alpha exponents are empirically determined (approximately 0.076, 0.095, and 0.050 respectively). Loss decreases as each factor increases—the negative exponents capture this inverse relationship.

Chinchilla scaling from Hoffmann et al. (2022) refined these laws with a crucial insight: for compute-optimal training, parameters and data should scale approximately equally. The original GPT-3 (175B parameters, 300B tokens) was significantly undertrained by this analysis. Chinchilla (70B parameters, 1.4T tokens) achieved better performance with the same compute by using more data and fewer parameters.

The Chinchilla-optimal ratio is approximately:

D \approx 20N

where D is training tokens and N is parameters. This has major practical implications:

  • Many deployed models are undertrained relative to compute optimum
  • Training data volume is increasingly the bottleneck
  • Inference-optimized models (fewer parameters, more training) may be preferable in deployment
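A back-of-envelope sketch of the compute-optimal allocation, using the common approximation C ≈ 6ND for training FLOPs together with D ≈ 20N (both are approximations, not exact fits):

```python
import math

def chinchilla_optimal(compute_flops):
    """Compute-optimal params and tokens under C ≈ 6*N*D with D ≈ 20*N."""
    n_params = math.sqrt(compute_flops / 120.0)  # solve 6 * N * (20 * N) = C
    return n_params, 20.0 * n_params

# Sanity check against Chinchilla itself: ~70B params on ~1.4T tokens,
# so C ≈ 6 * 70e9 * 1.4e12 FLOPs.
N, D = chinchilla_optimal(6 * 70e9 * 1.4e12)  # → roughly (70e9, 1.4e12)
```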

2.2 Emergence and Phase Transitions

Perhaps the most fascinating aspect of scaling is the emergence of capabilities that appear discontinuously—near-zero performance below some scale threshold, then rapid improvement above it.

Documented emergent capabilities include multi-digit arithmetic, word unscrambling, instruction following, and effective chain-of-thought reasoning—each appearing only above some characteristic model scale.

The emergence controversy: Wei et al. (2022) initially characterized emergence as sharp phase transitions in capability. Subsequent work by Schaeffer et al. (2023) argued that some apparent emergence is an artifact of evaluation metrics—with continuous metrics, performance improvements look gradual rather than discontinuous.

The truth likely involves both: some capabilities genuinely require sufficient model capacity and exhibit threshold behavior, while others improve gradually but appear sudden due to task structure or evaluation methodology. Distinguishing these cases is important for predicting capability development.

Theoretical frameworks for emergence:

  • Skill composition: Arora & Goyal (2023) propose that emergence occurs when models acquire sufficient component skills that compose to solve harder tasks.
  • Grokking: Power et al. (2022) demonstrated that models can suddenly generalize after extended training, well past the point of memorizing training data—suggesting emergence may relate to phase transitions in the loss landscape.
  • Superposition release: Emergence may correspond to when model capacity becomes sufficient to represent task-relevant features without destructive interference (see Section 6).

2.3 The Bitter Lesson and Its Implications

Rich Sutton’s “Bitter Lesson” observes that historically, methods leveraging computation at scale outperform methods incorporating human knowledge. The transformer era exemplifies this: relatively simple architectures with massive scale outperform carefully engineered alternatives.

However, the bitter lesson has limits:

  1. Data efficiency: Humans learn from far less data than current models. Techniques that improve data efficiency compound the benefits of scale.
  2. Reliability: Scale improves average performance but does not eliminate failure modes. Architectural innovations for robustness complement scaling.
  3. Inference cost: Training compute scales down the inference cost curve, but inference still matters. Architectural efficiency remains important.

The pragmatic synthesis: scale is necessary but not sufficient. The most capable systems combine scale with architectural innovations that improve efficiency, reliability, and capability.

3. The Reasoning Problem

Despite remarkable capabilities, current language models exhibit characteristic failures on tasks requiring systematic reasoning. Understanding these failures—and developing approaches to address them—is central to advancing AI capabilities.

3.1 What “Reasoning” Means for Language Models

We can decompose reasoning into several components:

  1. Retrieval: Accessing relevant information from parameters or context
  2. Composition: Combining information according to rules or relationships
  3. Multi-step inference: Chaining multiple reasoning steps
  4. Verification: Checking whether conclusions follow from premises
  5. Search: Exploring possible inference paths when direct inference fails

Standard transformer inference—a single forward pass—performs all these operations implicitly. This works remarkably well for tasks that fit within the model’s “implicit reasoning depth” but fails when explicit multi-step computation is required.

3.2 Empirical Failure Modes

Length generalization in arithmetic: Transformers trained on n-digit addition fail on (n+1)-digit addition, despite the algorithm being identical. The model learns a procedure entangled with specific position patterns rather than the abstract algorithm.

Compositional generalization: Models struggle with novel compositions of known primitives. A model that understands “the cat sat on the mat” and “the mat is red” may fail to infer properties of “the cat sat on the red mat” in novel configurations.

Reasoning path consistency: When models produce multi-step reasoning (via chain-of-thought prompting), the stated reasoning often doesn’t causally determine the answer. Lanham et al. (2023) showed that models sometimes produce correct answers with incorrect reasoning chains, and corrupting reasoning steps doesn’t always corrupt answers—suggesting the reasoning is sometimes post-hoc rationalization rather than genuine computation.

Reversal curse: Berglund et al. (2023) demonstrated that models trained on “A is B” do not automatically learn “B is A.” If trained that “Tom Cruise’s mother is Mary Lee Pfeiffer,” models fail to answer “Who is Mary Lee Pfeiffer’s son?” This reveals fundamental limitations in how knowledge is stored and accessed.

3.3 Theoretical Analysis of Transformer Reasoning

Several theoretical results illuminate these limitations:

Circuit complexity bounds: Merrill & Sabharwal (2023) analyze transformers with bounded (logarithmic) precision as circuit families. Their key result: log-precision transformers can be simulated by logspace-uniform \mathsf{TC}^0 circuits (constant-depth threshold circuits with polynomial size). This places an upper bound on what such transformers can compute in a single forward pass—they are limited to problems in \mathsf{TC}^0, regardless of model width.

Important caveats:

  • These results apply specifically to the log-precision model analyzed (where intermediate values are represented with O(\log n) bits).
  • Real transformers use floating-point arithmetic with different precision characteristics; the mapping between theoretical precision models and practical behavior remains an active research area.
  • The result is a simulation upper bound; it does not directly prove lower bounds on specific problems like graph connectivity for transformers, though it does imply that problems known to be outside \mathsf{TC}^0 cannot be solved by log-precision transformers in a single pass.

Chain-of-thought as computation extension: Feng et al. (2023) analyze the role of intermediate generation steps in overcoming depth limitations. Their key insight: bounded-depth transformers face impossibility results for directly outputting answers to certain problems unless model size grows rapidly with input size. However, generating intermediate steps (chain-of-thought) allows constant-size autoregressive transformers to solve these problems by effectively extending computation depth through the generation process. This provides theoretical grounding for why CoT helps—it’s not just about “showing work” but about enabling deeper computation.

3.4 The Path Forward: Compute at Inference Time

If single forward passes cannot reliably solve multi-step reasoning problems, the natural response is to allocate more computation at inference time. This is the core insight behind several lines of research:

Chain-of-thought (CoT) prompting from Wei et al. (2022) instructs models to produce intermediate reasoning steps before the final answer. This works by:

  1. Converting internal reasoning into external tokens
  2. Allowing the model to “read” its own reasoning in subsequent forward passes
  3. Effectively increasing the computation allocated to the problem

CoT provides substantial improvements on math, logic, and multi-step reasoning tasks. However, it has limitations: the reasoning must fit the model’s implicit capabilities, and stated reasoning may not reflect actual computation.

Self-consistency from Wang et al. (2022) samples multiple reasoning chains and takes the majority answer. This provides error correction when reasoning paths are noisy but individual samples are above chance.
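Once a final answer has been extracted from each sampled chain, self-consistency reduces to a majority vote. A minimal sketch (answer extraction is assumed to happen upstream):

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over final answers extracted from independently sampled CoT chains."""
    return Counter(answers).most_common(1)[0][0]

# Five sampled chains; three agree on "42" despite noisy individual reasoning.
samples = ["42", "42", "17", "42", "35"]
final = self_consistency(samples)  # → "42"
```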

Tree-of-thought from Yao et al. (2023) generalizes CoT to explicit search over reasoning trees:

                    [Problem]
                    /   |   \
               [Step A][Step B][Step C]
                 /  \      |      \
            [A1][A2] [B1]  [C1]
             |        |
          [Ans1]  [Ans2]

The model generates multiple candidate steps at each node, evaluates or scores them, and searches (breadth-first, beam, or other strategies) for solutions. This transforms reasoning from implicit single-pass computation to explicit search—at the cost of significantly more inference compute.
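The search described above can be sketched as a generic beam search over reasoning states. Here `expand` and `score` are toy stand-ins for the model's step proposer and step evaluator, applied to a trivial problem:

```python
def tree_of_thought(root, expand, score, beam_width=2, depth=3):
    """Beam search over reasoning states; keep the best partial states at each level."""
    frontier = [root]
    for _ in range(depth):
        candidates = [nxt for state in frontier for nxt in expand(state)]
        if not candidates:
            break
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(frontier, key=score)

# Toy problem standing in for LLM calls: build a digit string whose digit sum is 9.
expand = lambda s: [s + d for d in "123"]
score = lambda s: -abs(9 - sum(int(c) for c in s))
best = tree_of_thought("", expand, score, beam_width=3, depth=3)  # → "333"
```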

Process reward models (PRMs) from Lightman et al. (2023) train separate models to evaluate intermediate reasoning steps, enabling more reliable search. PRMs trained on step-level human feedback provide better signal for search than outcome-only supervision.

Formal verification integration: For domains with formal semantics (mathematics, code), integrating formal verification tools allows models to verify reasoning steps with certainty. AlphaProof (DeepMind, 2024) combines a language model with the Lean proof assistant to solve International Mathematical Olympiad problems at silver-medal level. AlphaGeometry (Trinh et al., 2024) combines neural language models with a symbolic deduction engine for geometry theorem proving. These systems are not general-purpose reasoners but demonstrate how neural and symbolic approaches can complement each other in domains with formal structure.

4. Retrieval Augmentation and Tool Use

A fundamental limitation of parametric models is that knowledge is static—frozen at training time. Retrieval-augmented generation (RAG) addresses this by conditioning generation on dynamically retrieved information.

4.1 RAG Architecture and Theory

The standard RAG formulation from Lewis et al. (2020):

p(y|x) = \sum_{z \in \text{top-}k} p(z|x) \cdot p(y|x, z)

where z represents retrieved documents. In practice:

  1. Encode query: Map input x to embedding e_x = E(x)
  2. Retrieve documents: Find z_1, \ldots, z_k maximizing similarity \text{sim}(e_x, E(z_i))
  3. Generate with context: Compute p(y|x, z_1, \ldots, z_k)

Dense retrieval uses learned embeddings (e.g., Contriever, E5, BGE) where semantic similarity in embedding space approximates relevance. Training typically uses contrastive losses on query-document pairs.
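The retrieve step is a nearest-neighbor search in embedding space. A toy sketch, with hand-made 3-d vectors standing in for the output of a learned encoder:

```python
import numpy as np

def top_k_retrieve(query_emb, doc_embs, k=2):
    """Cosine-similarity retrieval: indices of the k most similar documents."""
    q = query_emb / np.linalg.norm(query_emb)
    D = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = D @ q
    return np.argsort(-sims)[:k], sims

# Toy 3-d "embeddings"; a real system would use a learned encoder (e.g. E5, BGE).
docs = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.9, 0.1, 0.0]])
idx, sims = top_k_retrieve(np.array([1.0, 0.05, 0.0]), docs, k=2)
# Documents 0 and 2 point in nearly the query's direction, so they rank first.
```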

Retrieval quality is paramount. Empirically, RAG systems are bottlenecked by retrieval quality more than generation quality. A sophisticated generator cannot compensate for irrelevant retrieved documents. Investment in retrieval—including query understanding, document chunking, embedding model selection, and relevance filtering—typically provides better returns than generator improvements.

4.2 Advanced RAG Techniques

Query transformation: The user’s query may not be optimal for retrieval. Techniques include:

  • Query expansion (add related terms)
  • Hypothetical document embeddings (HyDE): generate a hypothetical answer, embed it, retrieve similar documents
  • Multi-query retrieval: generate multiple query variants, retrieve for each, merge results

Hierarchical retrieval: For large corpora, two-stage retrieval is often necessary:

  1. Fast, approximate retrieval (BM25, approximate nearest neighbors) to get candidate set
  2. Expensive, accurate reranking (cross-encoder models) on candidates

Chunk optimization: Document chunking strategy significantly affects performance:

  • Too small: loses context, increases noise
  • Too large: dilutes relevance, exceeds context limits
  • Overlapping chunks with metadata (document title, section headers) often outperform naive fixed-size chunking

Self-RAG from Asai et al. (2023) trains models to decide when to retrieve, what to retrieve, and how to use retrieved information—making retrieval decisions adaptive rather than fixed.

4.3 Tool Use and Agency

Tool use extends retrieval to arbitrary external capabilities: calculators, code execution, APIs, databases, web search. The model generates structured calls, receives results, and conditions subsequent generation on them.

Tool use as a language modeling problem: Tool calls can be represented as special tokens in the generation vocabulary. Training on traces of successful tool use teaches models when and how to invoke tools. This approach powers function calling in GPT-4, Claude, and other systems.

The agency spectrum: Systems exhibit increasing levels of autonomous action:

| Level | Description | Example |
| --- | --- | --- |
| 0 | No tools | Pure generation |
| 1 | Single tool call | Calculator, search |
| 2 | Multi-step tool use | Research + synthesis |
| 3 | Planning and execution | Decompose task, execute steps |
| 4 | Autonomous agents | Long-running, self-directed |

Higher agency levels introduce compounding failure modes: each step can fail, errors propagate, and the system may enter unrecoverable states. Robust agent architectures require explicit error handling, state management, and human oversight mechanisms.

For production RAG implementation, see our guide on designing RAG pipelines.

5. Multimodal Architectures

Modern AI systems increasingly process multiple modalities—text, images, audio, video—in unified frameworks. Understanding multimodal architectures is essential for building systems that perceive and reason about the world.

5.1 Vision-Language Models

Dual-encoder architectures (CLIP, ALIGN): Separate encoders for each modality, trained to align representations via contrastive learning. The foundational CLIP paper (Radford et al., 2021) uses an InfoNCE-style contrastive loss. For a batch of N image-text pairs, define the similarity logits:

s_{ij} = \frac{\mathrm{sim}(e_I^i, e_T^j)}{\tau}

The loss encourages matching pairs (diagonal) to have high similarity while non-matching pairs have low similarity:

\mathcal{L}_{I \to T} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ij})}

\mathcal{L}_{T \to I} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ji})}

\mathcal{L} = \frac{1}{2}(\mathcal{L}_{I \to T} + \mathcal{L}_{T \to I})

where e_I^i and e_T^i are image and text embeddings for pair i, \tau is a learned temperature parameter, and \mathrm{sim}(\cdot,\cdot) is typically cosine similarity.
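The symmetric loss above, sketched in NumPy (cosine similarity; a fixed rather than learned temperature is assumed here):

```python
import numpy as np

def clip_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE cross-entropy over the N x N cosine-similarity matrix."""
    I = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    T = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    S = (I @ T.T) / tau                      # s_ij = sim(e_I^i, e_T^j) / tau

    def xent_diag(logits):                   # -mean log-softmax of the diagonal
        logits = logits - logits.max(axis=1, keepdims=True)
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (xent_diag(S) + xent_diag(S.T))

# Perfectly aligned pairs give near-zero loss; mismatched pairs give a large loss.
E = np.eye(4)
aligned = clip_loss(E, E)
shuffled = clip_loss(E, np.roll(E, 1, axis=0))
```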

Emergent capabilities: CLIP-style models exhibit remarkable zero-shot transfer. The text encoder can represent arbitrary class descriptions, enabling classification without task-specific training. This emerges from the scale and diversity of image-text training data.

Limitations: Dual encoders compute independent embeddings—they cannot model fine-grained interactions between modalities. “A cat sitting on a red mat” and “A red cat sitting on a mat” may have similar embeddings despite different meanings.

Unified transformer architectures (Flamingo, GPT-4V, Gemini): Process all modalities in a single transformer:

[Image tokens] [Adapter] → [Transformer] ← [Text tokens]

                        [Output tokens]

Image patches are embedded (via CNN, ViT, or similar), optionally projected through an adapter (mapping to the LLM’s embedding space), and concatenated with text tokens. The unified transformer attends over both modalities, enabling rich cross-modal reasoning.

Key architectural decisions:

  • Early vs. late fusion: Early fusion (concatenate at input) allows maximum interaction; late fusion (separate processing, merge at output) is more efficient but less expressive
  • Adapter architecture: Linear projection, MLP, cross-attention, or Q-Former (BLIP-2, Li et al., 2023) all work, with tradeoffs in expressiveness and training efficiency
  • Frozen vs. tuned vision encoder: Freezing (BLIP-2, LLaVA) enables rapid adaptation with less data; end-to-end tuning (Gemini) maximizes capability but requires more compute

5.2 Image Generation

Diffusion models have become the dominant paradigm for image generation. The forward process gradually adds Gaussian noise:

q(x_t \mid x_{t-1}) = \mathcal{N}\bigl(x_t;\, \sqrt{1-\beta_t}\,x_{t-1},\, \beta_t I\bigr)

The reverse process learns to denoise:

p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\bigl(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\bigr)

Text conditioning: Text-to-image generation conditions the denoising process on text embeddings. Classifier-free guidance from Ho & Salimans (2022) interpolates between conditional and unconditional predictions:

\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + s \cdot (\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing))

where s > 1 amplifies the influence of conditioning. Higher guidance scales produce images more aligned with prompts but with reduced diversity.
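Classifier-free guidance is a one-line combination of the two noise predictions; a sketch on dummy estimates:

```python
import numpy as np

def cfg(eps_uncond, eps_cond, s):
    """Classifier-free guidance: eps_uncond + s * (eps_cond - eps_uncond)."""
    return eps_uncond + s * (eps_cond - eps_uncond)

e_u = np.array([0.1, -0.2])    # dummy unconditional noise prediction
e_c = np.array([0.3,  0.1])    # dummy conditional noise prediction
guided = cfg(e_u, e_c, s=2.0)  # s > 1 extrapolates past the conditional prediction
```

Note that s = 1 recovers the plain conditional prediction and s = 0 the unconditional one.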

Latent diffusion from Rombach et al. (2022) (Stable Diffusion): Performing diffusion in a learned latent space rather than pixel space dramatically reduces computational cost while maintaining quality. An autoencoder compresses images to latents; diffusion operates in latent space; a decoder reconstructs pixels.

Architectural innovations:

  • DiT (Diffusion Transformer): Peebles & Xie (2023) replaces U-Net with transformer architecture, scaling better with compute
  • Consistency models: Song et al. (2023) enable few-step or single-step generation by training to map any point on the diffusion trajectory to the final output
  • Rectified flows: Liu et al. (2022) provides alternative formulation with straighter sampling trajectories

5.3 Video and Temporal Modeling

Video introduces temporal consistency as a key challenge. A sequence of independently generated frames produces flickering, inconsistent results.

Temporal attention: Extend attention to include the temporal dimension. For video input X \in \mathbb{R}^{T \times H \times W \times C}, temporal attention attends across frames at each spatial position:

\text{TemporalAttn}(X)_{t,h,w} = \text{Attn}(X_{:,h,w})_t

Combined with spatial attention, this enables joint spatiotemporal modeling.

3D architectures: Inflate 2D convolutions/attention to 3D, processing spatiotemporal volumes directly. More expressive but more computationally expensive.

Autoregressive video models: Generate frames sequentially, conditioning each frame on previous frames. Can leverage powerful image models by fine-tuning for next-frame prediction. Struggle with long-range temporal coherence.

Video diffusion: Apply diffusion process to video latents, with temporal attention ensuring frame consistency. Notable examples include OpenAI’s Sora, with Sora 2 (September 2025) showing significant improvements in duration and coherence; Runway’s Gen-3 and subsequent versions; Google DeepMind’s Veo; and Kling from Kuaishou. This remains an active area with rapid progress—capabilities improve substantially every few months.

6. Mechanistic Interpretability

Understanding what neural networks actually compute—rather than treating them as black boxes—is crucial for alignment, debugging, and capability development. Mechanistic interpretability aims to reverse-engineer learned computations.

6.1 Features and Circuits

The linear representation hypothesis: Neural networks represent human-interpretable features as directions in activation space. Evidence:

  • Word embeddings exhibit linear structure (king - man + woman ≈ queen)
  • Probing classifiers find task-relevant information in intermediate representations
  • Activation patching shows causal role of specific directions

Features: A feature is a property the model represents, corresponding to a direction in activation space. Features include concrete concepts (cat, running) and abstract ones (negation, uncertainty, first-person perspective).

Circuits: Features are computed by circuits—subnetworks that implement specific computations. Circuit analysis traces how features are computed from inputs through intermediate features.

Example: Indirect Object Identification from Wang et al. (2023): In “Mary gave the book to John. She…”, the model must determine that “She” refers to Mary. This is implemented by a circuit involving:

  1. Name mover heads that copy names to the final position
  2. S-inhibition heads that suppress the subject from the indirect object position
  3. Backup name mover heads providing redundancy

Understanding this circuit required painstaking analysis—activation patching to identify relevant components, then detailed study of attention patterns and how they compose.

6.2 Superposition and Polysemanticity

A key finding: neural networks represent more features than they have dimensions. This is possible through superposition—features are encoded as nearly-orthogonal directions in a shared space.

The toy model: Elhage et al. (2022) study superposition in simple settings. With n features and m < n dimensions, sparse features can be packed into the space if they rarely activate simultaneously:

\text{Feature } i: e_i \in \mathbb{R}^m, \qquad \text{Representation: } \sum_i a_i e_i

If features rarely co-occur, their representations rarely interfere. The network tolerates small reconstruction errors to represent more features than dimensions allow.
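A quick numerical illustration of why this works: random unit vectors pack 256 "features" into 64 dimensions with small but nonzero pairwise interference.

```python
import numpy as np

# Pack n = 256 random unit-norm feature directions into m = 64 dimensions.
rng = np.random.default_rng(0)
n_feat, m_dim = 256, 64
E = rng.normal(size=(n_feat, m_dim))
E /= np.linalg.norm(E, axis=1, keepdims=True)

# Interference: off-diagonal dot products are small (on the order of 1/sqrt(m))
# but nonzero, so sparsely active features coexist with only occasional collisions.
G = E @ E.T
off_diag = np.abs(G - np.eye(n_feat))
```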

Implications:

  • Individual neurons are polysemantic—they respond to multiple unrelated features
  • Features are distributed across neurons, not localized
  • This makes interpretation harder: you cannot simply study what each neuron does

6.3 Sparse Autoencoders for Feature Discovery

Sparse autoencoders (SAEs) address polysemanticity by learning an overcomplete basis of monosemantic features:

\text{Encode: } f(x) = \mathrm{ReLU}\bigl(W_e (x - b_d) + b_e\bigr), \qquad \text{Decode: } \hat{x} = W_d\, f(x) + b_d

Source: f(x) = ReLU(W_e (x - b_d) + b_e); x_hat = W_d f(x) + b_d

The SAE is trained on model activations with a sparsity penalty, encouraging it to find a sparse set of interpretable directions.
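A minimal NumPy sketch of the encode/decode equations and the training loss. The weights here are random and untrained; real SAEs are optimized with a gradient method on streamed model activations, and often constrain decoder columns to unit norm:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 512          # activation dim, dictionary size (overcomplete: k > d)

W_e = rng.normal(scale=0.1, size=(k, d))
W_d = rng.normal(scale=0.1, size=(d, k))
b_e = np.zeros(k)
b_d = np.zeros(d)

def sae_forward(x):
    """Encode/decode matching the equations above."""
    f = np.maximum(0.0, (x - b_d) @ W_e.T + b_e)   # sparse feature activations
    x_hat = f @ W_d.T + b_d                        # reconstruction
    return f, x_hat

def sae_loss(x, l1_coef=1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the features."""
    f, x_hat = sae_forward(x)
    return np.mean((x - x_hat) ** 2) + l1_coef * np.abs(f).mean()

x = rng.normal(size=(8, d))        # stand-in for model activations
print(float(sae_loss(x)))
```

The L1 coefficient trades reconstruction fidelity against sparsity: larger values yield fewer, more interpretable active features per input.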

Results: SAEs trained on language model activations discover features corresponding to:

  • Semantic concepts (DNA sequences, programming languages, legal text)
  • Syntactic roles (subjects, objects, modifiers)
  • Meta-cognitive states (uncertainty, refusal, roleplay)
  • Formatting patterns (markdown, lists, code blocks)

Anthropic’s research on monosemanticity provides extensive documentation of discovered features.

Limitations:

  • SAEs may find the “wrong” decomposition—there are many valid sparse decompositions
  • Scaling to large models is computationally expensive
  • Completeness is hard to verify—how do we know if we’ve found all important features?

6.4 Applications to Alignment

Mechanistic interpretability has direct applications to AI safety:

Detecting deception: If we understand how truthfulness is represented, we might detect when models are being deceptive by examining internal activations rather than outputs.

Steering vectors: The linear representation hypothesis suggests we can modify model behavior by adding directions corresponding to desired features:

h' = h + \alpha \cdot v_{\text{target}}

Source: h' = h + α * v_target

where v_target is a direction identified through interpretability research (e.g., “honesty” or “helpfulness”). Turner et al. (2023) demonstrate this approach for behavioral control without fine-tuning.
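The operation itself is a single vector addition at a chosen layer. In the sketch below, `v_honesty` is a random placeholder for a direction that would in practice be extracted from model internals (e.g., from contrast pairs of prompts):

```python
import numpy as np

def steer(hidden, v_target, alpha=4.0):
    """Activation addition: shift a layer's hidden states along a direction."""
    return hidden + alpha * v_target

rng = np.random.default_rng(0)
d_model = 16
hidden = rng.normal(size=(5, d_model))    # (seq_len, d_model) activations
v_honesty = rng.normal(size=d_model)      # placeholder steering direction
v_honesty /= np.linalg.norm(v_honesty)

steered = steer(hidden, v_honesty, alpha=4.0)
# Every position moves by the same amount along the chosen direction.
print(np.allclose(steered - hidden, 4.0 * v_honesty))
```

In a real model this would be applied via a forward hook at one layer during generation; the scale α trades steering strength against output coherence.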

Understanding failures: When models fail, interpretability tools can diagnose why. Which circuits activated incorrectly? What features were misrecognized? This enables targeted fixes.

Verification: For safety-critical systems, we want not just empirical testing but understanding of why the system behaves as it does. Mechanistic interpretability provides this, at least partially.

7. Alignment and Robustness

Building AI systems that reliably do what we intend—and avoid what we don’t intend—is both a technical and philosophical challenge. Here we focus on the technical approaches.

7.1 The Alignment Problem Formalized

The alignment problem can be stated as: ensure that the model’s optimization target (what it actually pursues) matches the intended objective (what we want it to pursue).

Sources of misalignment:

  1. Specification gaming: The objective function doesn’t capture what we really want; the model optimizes the specified objective in unintended ways
  2. Goal misgeneralization: The model learns a proxy of the intended goal that diverges in new situations
  3. Deceptive alignment: The model appears aligned during training but pursues different objectives in deployment

7.2 RLHF and Its Limitations

Reinforcement Learning from Human Feedback (RLHF), formalized by Christiano et al. (2017) and applied to language models by Ouyang et al. (2022), trains models to produce outputs that humans prefer:

  1. Collect comparisons: Humans rank model outputs for the same prompt
  2. Train reward model: Learn r(x, y) predicting human preferences
  3. Optimize policy: Fine-tune model to maximize expected reward via PPO or similar

\max_\theta \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(\cdot \mid x)}\bigl[r(x, y)\bigr] - \beta \cdot D_{\mathrm{KL}}\bigl[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr]

Source: max_θ E[r(x,y)] - β * D_KL[π_θ || π_ref]

The KL penalty prevents the model from deviating too far from the reference policy (usually the supervised fine-tuned model), which helps avoid reward hacking.
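The per-sample objective can be sanity-checked numerically. The sketch below uses the common single-sample KL estimate (the log-prob difference between policy and reference on the sampled response); this is an illustrative approximation, not any particular library's implementation:

```python
def rlhf_objective(reward, logp_policy, logp_ref, beta=0.1):
    """Per-sample RLHF objective: reward minus a KL penalty estimate.

    logp_policy / logp_ref are the summed log-probabilities of the
    sampled response under the policy and the frozen reference model;
    their difference is a single-sample estimate of the KL term.
    """
    kl_est = logp_policy - logp_ref
    return reward - beta * kl_est

# Toy numbers: the policy likes this sample more than the reference does,
# so part of the reward is paid back as a KL penalty.
print(rlhf_objective(reward=1.5, logp_policy=-20.0, logp_ref=-22.0, beta=0.1))
```

Raising β pulls the policy harder toward the reference model, trading reward for stability.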

Limitations of RLHF:

  • Reward hacking: Models can learn to exploit the reward model rather than satisfy human preferences. The reward model has its own failure modes.
  • Human feedback quality: Humans are inconsistent, have biases, and struggle to evaluate complex outputs. High-quality feedback is expensive.
  • Scalable oversight: As models become more capable, humans become less able to evaluate outputs accurately. We cannot RLHF our way to superhuman performance if we cannot evaluate superhuman outputs.

7.3 Constitutional AI and Self-Improvement

Constitutional AI from Bai et al. (2022) reduces reliance on human feedback by having models critique and revise their own outputs according to a set of principles (the “constitution”):

  1. Generate initial response
  2. Critique response against constitutional principles
  3. Revise response based on critique
  4. Use revised responses as training signal

This generates training data without human labeling for each example. Humans specify principles (the constitution) rather than labeling instances.
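The four steps above can be sketched as a loop. `toy_model` below is a stand-in callable used only to show the data flow (one generation, then a critique and a revision per principle); it is not a real model API:

```python
def constitutional_revision(model, prompt, principles, n_rounds=1):
    """Generate, critique against each principle, revise; return the
    final (prompt, response) pair as a training example."""
    response = model(prompt)
    for _ in range(n_rounds):
        for principle in principles:
            critique = model(
                f"Critique this response against the principle "
                f"'{principle}':\n{response}"
            )
            response = model(
                f"Revise the response to address this critique:\n"
                f"{critique}\nOriginal response:\n{response}"
            )
    return prompt, response

# Toy stand-in model that just tags each call, to trace the data flow.
calls = []
def toy_model(text):
    calls.append(text)
    return f"<output {len(calls)}>"

pair = constitutional_revision(toy_model, "Explain X.", ["be harmless", "be honest"])
print(len(calls))   # 1 generation + 2 principles * (critique + revise) = 5
```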

RLAIF (RL from AI Feedback): Replace human comparisons with model-generated comparisons. A capable model evaluates which outputs better satisfy the constitution, providing reward signal for training.

Advantages:

  • Scales better than human feedback
  • Principles can be explicit and auditable
  • Can enforce consistency across many examples

Risks:

  • Model’s own biases are amplified
  • Principles may be incomplete or conflicting
  • Self-improvement processes can be unstable

7.4 Direct Preference Optimization

Direct Preference Optimization (DPO) from Rafailov et al. (2023) eliminates the explicit reward model by deriving a closed-form solution for the optimal policy:

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]

Source: -E[ log σ( β * log(π_θ(y_w|x) / π_ref(y_w|x)) - β * log(π_θ(y_l|x) / π_ref(y_l|x)) ) ]

where y_w and y_l are preferred and dispreferred outputs, σ is the sigmoid function, β controls deviation from the reference policy, and the log-ratios log(π_θ / π_ref) are the key structural element (the implicit reward difference). This directly optimizes the policy on preference data without an intermediate reward model.
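Only four log-probabilities per preference pair are needed to evaluate the loss. A minimal sketch (not tied to any particular training library):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triple.

    logp_w / logp_l are summed log-probs of the chosen and rejected
    responses under the policy; ref_logp_* under the frozen reference.
    """
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-logits)))   # -log sigmoid(logits)

# If the policy already prefers the chosen response more strongly than
# the reference does, the implicit reward margin is positive and the
# loss falls below log(2) (the zero-margin value).
print(dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.0))
```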

Advantages:

  • Simpler pipeline (no reward model training)
  • More stable training
  • Equivalent to RLHF in theory, often comparable in practice

Variants: IPO (Identity Preference Optimization), KTO (Kahneman-Tversky Optimization, Ethayarajh et al., 2024), ORPO (Odds Ratio Preference Optimization) each address specific limitations of DPO.

7.5 Robustness and Adversarial Attacks

Aligned models must be robust to adversarial inputs—prompts designed to elicit unintended behaviors.

Jailbreaking: Prompts that circumvent safety training:

  • Role-play scenarios (“Pretend you are an evil AI…”)
  • Encoding attacks (Base64, pig Latin)
  • Context manipulation
  • Gradient-based prompt optimization (GCG attack, Zou et al., 2023)

Defenses:

  • Training robustness: Include adversarial examples in safety training
  • Input filtering: Detect and block adversarial patterns
  • Output filtering: Screen responses for policy violations
  • Ensemble approaches: Multiple independent checks

No known defense is complete. This is an active adversarial game between attack and defense research.

8. Evaluation and Measurement

Proper evaluation is critical for understanding model capabilities and guiding development. Different benchmarks measure different capabilities, and metric choice significantly affects conclusions.

8.1 Reasoning vs. Retrieval vs. Tool Use

Reasoning benchmarks (GSM8K, MATH, ARC, BBH) test multi-step inference. Key considerations:

  • Data contamination: models may have seen test problems during training
  • Solution style: does the benchmark reward correct answers or correct reasoning?
  • Difficulty distribution: aggregate metrics can hide bimodal behavior (perfect on easy, zero on hard)

Knowledge/retrieval benchmarks (TriviaQA, Natural Questions, MMLU) test factual recall. These conflate retrieval from parameters with reasoning about retrieved information.

Tool use evaluation remains underdeveloped. Existing benchmarks (ToolBench, API-Bank) often have narrow coverage and don’t capture realistic failure modes like cascading errors in multi-step tool chains.

8.2 Avoiding Misleading Metrics

Accuracy on held-out sets can be misleading when:

  • Distribution shift exists between benchmark and deployment
  • Benchmark saturates (ceiling effects hide capability differences)
  • Models are optimized for benchmark performance specifically

Calibration matters: a model that says “I’m 90% confident” should be right 90% of the time. Overconfident models are dangerous in production.
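Expected calibration error (ECE) is one standard way to quantify this: bucket predictions by stated confidence and compare each bucket's accuracy to its average confidence. A minimal binned implementation:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: size-weighted average of |accuracy - confidence| per
    confidence bin. Lower is better-calibrated."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()
            conf = confidences[mask].mean()
            ece += mask.mean() * abs(acc - conf)
    return ece

# A model that says "90% confident" but is right half the time:
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))  # -> 0.4
```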

Human preference ratings can be gamed by verbosity, confident tone, and other superficial features that don’t correlate with correctness.

Recommendation: Use multiple complementary benchmarks, include held-out evaluation sets not used for development, measure calibration alongside accuracy, and validate that benchmark performance predicts deployment performance.

9. Long-Context Modeling

Extending context length beyond training-time limits is critical for many applications (document analysis, code understanding, long conversations).

9.1 Positional Encoding Extension

Beyond the encodings discussed in Section 1.3, several techniques extend context at inference time:

Position interpolation: Scale position indices to fit within training range. If trained on 4K context, interpolate positions for 16K context by dividing position indices by 4. Works but can degrade performance.
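The index rescaling amounts to one division. A sketch, assuming a hypothetical 4K-token training window:

```python
import numpy as np

def interpolate_positions(seq_len, train_len=4096):
    """Position interpolation: squeeze inference-time position indices
    back into the range the model saw during training."""
    positions = np.arange(seq_len, dtype=float)
    scale = max(1.0, seq_len / train_len)   # e.g. 16K / 4K = 4
    return positions / scale

pos = interpolate_positions(16384, train_len=4096)
print(pos.max())   # largest position index stays below train_len
```

The rescaled (fractional) positions then feed the positional encoding, e.g. RoPE's rotation angles, in place of the raw integer indices.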

YaRN (Yet another RoPE extensioN): Peng et al. (2023) combine interpolation with attention scaling, maintaining performance better than naive interpolation.

NTK-aware scaling: Modifies RoPE’s frequency basis to better handle extrapolation.

9.2 Efficient Long-Context Attention

Beyond sparse attention variants:

Landmark attention: Select key “landmark” tokens that summarize regions, attend fully to landmarks, then attend locally.

Streaming/recurrent approaches: Process context in chunks, maintaining a compressed state. Models like Mamba (Gu & Dao, 2023) use state-space models with linear scaling in sequence length.

Retrieval over context: For very long contexts, retrieve relevant portions rather than attending to everything. This trades exact attention for practical scalability.

9.3 Training for Long Context

Simply training on longer sequences is expensive (attention is O(n^2) in sequence length) but effective. Techniques to reduce cost:

Progressive training: Train on short sequences first, gradually increase length.

Sparse attention during training: Use efficient attention for most training, fine-tune with full attention on target length.

Synthetic long-context data: Generate training data specifically exercising long-range dependencies.

10. Future Directions and Open Problems

We conclude with significant open problems and promising research directions.

10.1 Fundamental Capability Improvements

Reliable reasoning: Despite progress on chain-of-thought and search-based methods, models still fail on tasks requiring reliable multi-step inference. Neurosymbolic approaches combining neural and formal methods show promise but lack generality.

Sample efficiency: Humans learn from far less data than current models. Techniques for better data efficiency—curriculum learning, meta-learning, causal representation learning—could dramatically reduce training costs or improve capability at fixed compute.

Continual learning: Current models are static after training. Efficiently incorporating new knowledge without catastrophic forgetting remains challenging. Retrieval augmentation provides a partial solution; true continual learning would be more powerful.

10.2 Architectural Innovation

Alternative sequence models: Mamba (Gu & Dao, 2023) and other state-space models show promise with linear scaling in sequence length, but have not displaced transformers for language. Hybrid architectures combining attention and state-space layers are an active area.

Mixture of Experts (MoE): Sparse MoE models route inputs to subsets of parameters, enabling larger total parameter counts with fixed per-example compute. Switch Transformer (Fedus et al., 2021), Mixtral, and Arctic demonstrate this approach. Optimal routing, load balancing, and training stability remain active research areas.

Memory and state: Transformers have no persistent state beyond the context window. Architectures incorporating external memory (Memorizing Transformers, Wu et al., 2022) or recurrent state could extend effective context without quadratic cost.

10.3 Alignment and Safety

Scalable oversight: How do we supervise systems more capable than ourselves? Proposals include:

  • Debate: Irving et al. (2018) propose models argue positions, humans judge arguments
  • Recursive reward modeling: Models help evaluate other models
  • Iterated amplification: Christiano et al. (2018) gradually build evaluation capability through AI assistance

None are proven at scale.

Interpretability at scale: Current interpretability techniques work on small models or small portions of large models. Scaling mechanistic interpretability to frontier models is essential for understanding and controlling them.

Evaluating dangerous capabilities: We lack reliable methods to evaluate deception, manipulation, and long-horizon planning capabilities before deployment. Developing robust evaluations is crucial for responsible development.

10.4 World Models and Embodiment

World models: Learning predictive models of environments that support planning and reasoning. Video prediction models, game simulators, and robotics foundation models all point toward this direction. True world models that generalize across domains remain distant.

Embodiment: Grounding language in physical action. Current language models reason about the world through text; embodied systems must act in it. The robotics foundation model approach (RT-2, Brohan et al., 2023) shows progress, but general-purpose robots remain beyond current capabilities.

Multimodal unification: Current multimodal models handle limited modality combinations. True multimodal systems would seamlessly integrate text, images, audio, video, 3D, actions, and other modalities. Architectural approaches that scale to many modalities without modality-specific engineering are needed.

Conclusion

The transformer architecture, scaled with unprecedented compute and data, has produced AI systems with remarkable capabilities. Yet significant limitations remain: reasoning is unreliable, knowledge is static, alignment is incomplete, and our understanding of what these systems compute is fragmentary.

Progress will come from multiple fronts: architectural innovations that improve efficiency and capability, algorithmic advances in reasoning and retrieval, alignment techniques that scale with capability, and interpretability methods that illuminate system behavior.

The path forward requires both engineering rigor—building systems that work reliably in production—and scientific understanding—comprehending why they work and when they fail. This paper has aimed to provide foundations for both.


References

Ainslie, J., et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.

Arora, S., & Goyal, A. (2023). A Theory for Emergence of Complex Skills in Language Models.

Asai, A., et al. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection.

Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback.

Berglund, L., et al. (2023). The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”.

Brohan, A., et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.

Brown, T., et al. (2020). Language Models are Few-Shot Learners.

Christiano, P., et al. (2017). Deep Reinforcement Learning from Human Preferences.

Christiano, P., et al. (2018). Supervising Strong Learners by Amplifying Weak Experts.

Dao, T., et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.

DeepMind. (2024). AI Solves IMO Problems at Silver Medal Level.

Elhage, N., et al. (2022). Toy Models of Superposition.

Ethayarajh, K., et al. (2024). KTO: Model Alignment as Prospect Theoretic Optimization.

Fedus, W., et al. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.

Feng, G., et al. (2023). Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective.

Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces.

Ho, J., & Salimans, T. (2022). Classifier-Free Diffusion Guidance.

Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models.

Irving, G., et al. (2018). AI Safety via Debate.

Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models.

Lanham, T., et al. (2023). Measuring Faithfulness in Chain-of-Thought Reasoning.

Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.

Li, J., et al. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models.

Lightman, H., et al. (2023). Let’s Verify Step by Step.

Liu, X., et al. (2022). Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow.

Merrill, W., & Sabharwal, A. (2023). The Parallelism Tradeoff: Limitations of Log-Precision Transformers.

Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback.

Peebles, W., & Xie, S. (2023). Scalable Diffusion Models with Transformers.

Peng, B., et al. (2023). YaRN: Efficient Context Window Extension of Large Language Models.

Pérez, J., et al. (2021). Attention is Turing-Complete.

Power, A., et al. (2022). Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets.

Press, O., et al. (2022). Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.

Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision.

Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model.

Rombach, R., et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models.

Schaeffer, R., et al. (2023). Are Emergent Abilities of Large Language Models a Mirage?

Shazeer, N. (2019). Fast Transformer Decoding: One Write-Head is All You Need.

Song, Y., et al. (2023). Consistency Models.

Su, J., et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding.

Trinh, T., et al. (2024). Solving Olympiad Geometry without Human Demonstrations.

Turner, A., et al. (2023). Activation Addition: Steering Language Models Without Optimization.

Vaswani, A., et al. (2017). Attention Is All You Need.

Wang, K., et al. (2023). Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small.

Wang, X., et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models.

Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.

Wei, J., et al. (2022). Emergent Abilities of Large Language Models.

Wu, Y., et al. (2022). Memorizing Transformers.

Yao, S., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models.

Zou, A., et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models.

Anthropic. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.
