This paper provides a rigorous treatment of the theoretical and practical foundations underlying modern large language models, with particular emphasis on reasoning capabilities, scaling behavior, and architectural innovations. We synthesize results from across the field to present a unified view of where the technology stands, why current systems exhibit their characteristic strengths and limitations, and what architectural and algorithmic approaches show promise for the next generation of AI systems.
The intended audience includes researchers, engineers building production AI systems, and technical leaders making architectural decisions. We assume familiarity with deep learning fundamentals, linear algebra, and probability theory.
1. The Transformer as a Computational Primitive
The transformer architecture, introduced by Vaswani et al. (2017), has become the dominant paradigm for sequence modeling. Understanding its computational properties—what it can and cannot express—is foundational to understanding both its capabilities and limitations.
1.1 Attention as Soft Dictionary Lookup
Self-attention can be understood as a differentiable dictionary lookup operation. Given an input sequence X ∈ R^{n×d}, attention computes:

Attention(Q, K, V) = softmax((Q K^T) / sqrt(d_k)) V

where Q = X W_Q, K = X W_K, V = X W_V are linear projections, and d_k is the dimension of the key vectors (typically d_model / n_heads). The 1/sqrt(d_k) scaling factor prevents the dot products from growing too large in magnitude as dimensionality increases, which would push the softmax into regions of extremely small gradients.
The softmax-weighted combination creates a soft lookup: rather than retrieving a single value, each query retrieves a weighted combination of all values, with weights determined by query-key similarity. This soft retrieval is what makes attention differentiable and trainable.
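The soft lookup above can be sketched directly in NumPy (single head, no batching; the function and variable names are ours):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # softmax: each row sums to 1
    return A @ V                                  # soft lookup: weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(attention(Q, K, V).shape)  # (4, 8)
```

Because each softmax row sums to 1, the output for every query is a convex combination of value vectors, which is exactly the "soft dictionary" behavior described above.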
Key insight: The attention pattern matrix A ∈ R^{n×n} is:

A = softmax((Q K^T) / sqrt(d_k))
This represents a learned, input-dependent routing of information. Unlike convolutions (fixed local patterns) or recurrence (sequential propagation), attention allows any position to directly access any other position in a single operation.
1.2 Expressiveness and Computational Complexity
What transformers can compute: The computational power of transformers depends critically on implementation details—particularly the precision of arithmetic and the form of attention used.
Pérez et al. (2021) showed that transformers with hard attention (where attention weights are exactly 0 or 1) are Turing complete under certain assumptions. However, this result does not directly apply to standard soft-attention transformers used in practice.
For bounded-precision (log-precision) transformers, the picture is more constrained. Merrill & Sabharwal (2023) demonstrate that such transformers can be simulated by constant-depth logspace-uniform threshold circuits (TC^0). This places fundamental limits on what bounded-precision transformers can compute in a single forward pass—they cannot solve problems outside TC^0 without growing depth or precision.
The depth-width tradeoff: Theoretical results demonstrate that transformer depth is fundamentally more powerful than width for certain computations. Specifically, there exist functions computable by O(log n)-depth transformers that require exponential width in constant-depth transformers. This has practical implications: for tasks requiring multi-step reasoning, depth matters.
Computational complexity: Standard self-attention is O(n^2 d) in sequence length n and model dimension d. This quadratic scaling motivates extensive research into efficient attention variants:
| Variant | Complexity | Notes |
|---|---|---|
| Standard attention | O(n^2 d) | Full expressiveness |
| Sparse attention | O(n k d) | k = attended keys per query; depends on sparsity pattern (local, strided, block) |
| Linear attention | O(n r d) | r = feature/rank dimension of kernel approximation; trades expressiveness for speed |
| FlashAttention | O(n^2 d) | Same complexity, ~2-4× faster via memory hierarchy optimization (SRAM vs HBM) |
| Ring attention | O(n^2 d) compute | Distributed across devices; communication overhead not shown |
FlashAttention (Dao et al., 2022) deserves special mention: it achieves the same exact computation as standard attention but with dramatically better wall-clock time by respecting GPU memory hierarchy. This is a reminder that algorithmic complexity is not the only consideration—constant factors and hardware characteristics matter enormously at scale.
1.3 Positional Encoding and Length Generalization
Transformers process positions in parallel and thus require explicit position information. The choice of positional encoding significantly affects model behavior, particularly for length generalization.
Absolute positional encodings (sinusoidal or learned) embed each position as a vector added to token embeddings. Simple but limited: models trained on sequences up to length L often fail catastrophically on sequences longer than L.
Rotary Position Embeddings (RoPE) from Su et al. (2021) encode position through rotation in embedding space:

q_m = R_m (W_Q x_m),  k_n = R_n (W_K x_n)

where R_m is a block-diagonal rotation matrix with angles proportional to position m. RoPE encodes relative position information directly in the attention computation: the dot product q_m^T k_n depends on the relative position n − m. This improves length generalization but does not fully solve it.
ALiBi (Attention with Linear Biases) from Press et al. (2022) adds position-dependent biases directly to attention logits:

softmax((Q K^T)/sqrt(d_k) + B) V, where B_{i,j} = -m * |i-j|

The bias penalizes attention to distant positions, with different heads using different slopes m (geometrically spaced, e.g., 1/2, 1/4, …, 1/256 for 8 heads) to capture different distance scales. ALiBi provides strong length generalization with minimal computational overhead.
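A minimal NumPy sketch of the ALiBi bias construction, assuming the standard geometrically spaced slopes 1/2, 1/4, … (the helper name is ours):

```python
import numpy as np

def alibi_bias(n, n_heads=8):
    """Per-head ALiBi biases B[h, i, j] = -m_h * |i - j|, added to attention logits."""
    slopes = 2.0 ** -np.arange(1, n_heads + 1)            # 1/2, 1/4, ..., 1/2^n_heads
    dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])  # |i - j|
    return -slopes[:, None, None] * dist                  # shape (n_heads, n, n)

B = alibi_bias(4, n_heads=2)
print(B[0])  # head 0 (slope 1/2): zero on the diagonal, -0.5 per step of distance
```

Each head's bias is zero at the diagonal and grows linearly with distance, so flatter-slope heads retain longer-range attention while steeper heads stay local.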
The length generalization problem remains open. Current models can generalize somewhat beyond training lengths with appropriate positional encodings and training procedures, but robust extrapolation to arbitrary lengths remains challenging. This has implications for tasks requiring long-range reasoning.
1.4 Inference-Time Systems Constraints
Production deployment introduces constraints beyond raw model capability:
KV-cache memory: During autoregressive generation, key and value vectors for all previous tokens must be stored. Memory (in bytes) grows as:
KV_bytes = 2 * L * n_tokens * n_kv_heads * d_head * bytes_per_elem * batch_size
Notation definitions:
- L = number of transformer layers
- n_tokens = sequence length (cached tokens so far)
- n_kv_heads = number of key/value heads (may be < n_heads under GQA/MQA)
- d_head = per-head dimension (typically d_model / n_heads)
- bytes_per_elem = bytes per element (2 for fp16/bf16, 1 for int8, 4 for fp32)
- batch_size = batch size
For a 70B model (80 layers, 8 KV heads with GQA, 128-dim heads) generating 4K tokens in fp16: 2 × 80 × 4096 × 8 × 128 × 2 = 1,342,177,280 bytes (≈1.25 GiB).
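The formula above can be checked with a small helper (a sketch; the function name is ours):

```python
def kv_cache_bytes(n_layers, n_tokens, n_kv_heads, d_head,
                   bytes_per_elem=2, batch_size=1):
    """KV-cache size: 2 (K and V) * layers * tokens * kv_heads * head_dim * dtype * batch."""
    return 2 * n_layers * n_tokens * n_kv_heads * d_head * bytes_per_elem * batch_size

# 70B-class model: 80 layers, 8 KV heads (GQA), 128-dim heads, 4K tokens, fp16
print(kv_cache_bytes(80, 4096, 8, 128) / 2**30)  # 1.25 (GiB)
```

Scaling batch_size or n_tokens scales the cache linearly, which is why long contexts and large batches compete for the same accelerator memory.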
Multi-Query and Grouped-Query Attention: MQA (Shazeer, 2019) shares a single key-value head across all query heads, reducing the KV-cache by a factor of the number of heads. GQA (Ainslie et al., 2023) groups query heads to share KV heads, trading off between MQA’s efficiency and full multi-head attention’s expressiveness. Most production models now use GQA.
Batching and throughput: Inference systems batch requests to amortize fixed costs, but different sequence lengths complicate batching. Continuous batching, speculative decoding, and paged attention (vLLM) address these challenges.
2. Scaling Laws and Emergent Capabilities
One of the most significant empirical discoveries in deep learning is the existence of predictable scaling laws governing model performance as a function of compute, parameters, and data.
2.1 The Scaling Laws
Kaplan et al. (2020) established that language model loss follows power laws in parameters, data, and compute:
L(N) ∝ N^(-α_N), L(D) ∝ D^(-α_D), L(C) ∝ C^(-α_C)
where N is parameters, D is data (tokens), C is compute (FLOPs), and the exponents α_N, α_D, α_C are empirically determined (approximately 0.076, 0.095, and 0.050 respectively). Loss decreases as each factor increases—the negative exponents capture this inverse relationship.
Chinchilla scaling from Hoffmann et al. (2022) refined these laws with a crucial insight: for compute-optimal training, parameters and data should scale approximately equally. The original GPT-3 (175B parameters, 300B tokens) was significantly undertrained by this analysis. Chinchilla (70B parameters, 1.4T tokens) achieved better performance with the same compute by using more data and fewer parameters.
The Chinchilla-optimal ratio is approximately:

D ≈ 20 N

where D is training tokens and N is parameters. This has major practical implications:
- Many deployed models are undertrained relative to compute optimum
- Training data volume is increasingly the bottleneck
- Inference-optimized models (fewer parameters, more training) may be preferable in deployment
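As a back-of-envelope sketch of the D ≈ 20 N rule (the 6·N·D training-FLOPs approximation below is a common estimate, not from this text):

```python
def chinchilla_optimal_tokens(n_params):
    """Compute-optimal training tokens via the ~20 tokens-per-parameter rule of thumb."""
    return 20 * n_params

def train_flops(n_params, n_tokens):
    """Common 6*N*D approximation for dense-transformer training FLOPs (an assumption here)."""
    return 6 * n_params * n_tokens

# Chinchilla itself: 70B parameters -> ~1.4T tokens, matching the paper's setup
print(chinchilla_optimal_tokens(70e9) / 1e12)  # 1.4
```

Plugging GPT-3's 175B parameters into the same rule suggests ~3.5T tokens, far more than the 300B it was trained on, which is the sense in which it was undertrained.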
2.2 Emergence and Phase Transitions
Perhaps the most fascinating aspect of scaling is the emergence of capabilities that appear discontinuously—near-zero performance below some scale threshold, then rapid improvement above it.
Documented emergent capabilities include:
- Few-shot learning (Brown et al., 2020)
- Chain-of-thought reasoning (Wei et al., 2022)
- Instruction following
- Multi-step arithmetic
The emergence controversy: Wei et al. (2022) initially characterized emergence as sharp phase transitions in capability. Subsequent work by Schaeffer et al. (2023) argued that some apparent emergence is an artifact of evaluation metrics—with continuous metrics, performance improvements look gradual rather than discontinuous.
The truth likely involves both: some capabilities genuinely require sufficient model capacity and exhibit threshold behavior, while others improve gradually but appear sudden due to task structure or evaluation methodology. Distinguishing these cases is important for predicting capability development.
Theoretical frameworks for emergence:
- Skill composition: Arora & Goyal (2023) propose that emergence occurs when models acquire sufficient component skills that compose to solve harder tasks.
- Grokking: Power et al. (2022) demonstrated that models can suddenly generalize after extended training, well past the point of memorizing training data—suggesting emergence may relate to phase transitions in the loss landscape.
- Superposition release: Emergence may correspond to when model capacity becomes sufficient to represent task-relevant features without destructive interference (see Section 6).
2.3 The Bitter Lesson and Its Implications
Rich Sutton’s “Bitter Lesson” observes that historically, methods leveraging computation at scale outperform methods incorporating human knowledge. The transformer era exemplifies this: relatively simple architectures with massive scale outperform carefully engineered alternatives.
However, the bitter lesson has limits:
- Data efficiency: Humans learn from far less data than current models. Techniques that improve data efficiency compound the benefits of scale.
- Reliability: Scale improves average performance but does not eliminate failure modes. Architectural innovations for robustness complement scaling.
- Inference cost: Training compute scales down the inference cost curve, but inference still matters. Architectural efficiency remains important.
The pragmatic synthesis: scale is necessary but not sufficient. The most capable systems combine scale with architectural innovations that improve efficiency, reliability, and capability.
3. The Reasoning Problem
Despite remarkable capabilities, current language models exhibit characteristic failures on tasks requiring systematic reasoning. Understanding these failures—and developing approaches to address them—is central to advancing AI capabilities.
3.1 What “Reasoning” Means for Language Models
We can decompose reasoning into several components:
- Retrieval: Accessing relevant information from parameters or context
- Composition: Combining information according to rules or relationships
- Multi-step inference: Chaining multiple reasoning steps
- Verification: Checking whether conclusions follow from premises
- Search: Exploring possible inference paths when direct inference fails
Standard transformer inference—a single forward pass—performs all these operations implicitly. This works remarkably well for tasks that fit within the model’s “implicit reasoning depth” but fails when explicit multi-step computation is required.
3.2 Empirical Failure Modes
Length generalization in arithmetic: Transformers trained on n-digit addition fail on (n+1)-digit addition, despite the algorithm being identical. The model learns a procedure entangled with specific position patterns rather than the abstract algorithm.
Compositional generalization: Models struggle with novel compositions of known primitives. A model that understands “the cat sat on the mat” and “the mat is red” may fail to infer properties of “the cat sat on the red mat” in novel configurations.
Reasoning path consistency: When models produce multi-step reasoning (via chain-of-thought prompting), the stated reasoning often doesn’t causally determine the answer. Lanham et al. (2023) showed that models sometimes produce correct answers with incorrect reasoning chains, and corrupting reasoning steps doesn’t always corrupt answers—suggesting the reasoning is sometimes post-hoc rationalization rather than genuine computation.
Reversal curse: Berglund et al. (2023) demonstrated that models trained on “A is B” do not automatically learn “B is A.” If trained that “Tom Cruise’s mother is Mary Lee Pfeiffer,” models fail to answer “Who is Mary Lee Pfeiffer’s son?” This reveals fundamental limitations in how knowledge is stored and accessed.
3.3 Theoretical Analysis of Transformer Reasoning
Several theoretical results illuminate these limitations:
Circuit complexity bounds: Merrill & Sabharwal (2023) analyze transformers with bounded (logarithmic) precision as circuit families. Their key result: log-precision transformers can be simulated by logspace-uniform TC^0 circuits (constant-depth threshold circuits with polynomial size). This places an upper bound on what such transformers can compute in a single forward pass—they are limited to problems in TC^0, regardless of model width.
Important caveats:
- These results apply specifically to the log-precision model analyzed (where intermediate values are represented with O(log n) bits).
- Real transformers use floating-point arithmetic with different precision characteristics; the mapping between theoretical precision models and practical behavior remains an active research area.
- The result is a simulation upper bound; it does not directly prove lower bounds on specific problems like graph connectivity for transformers, though it does imply that problems known to be outside TC^0 cannot be solved by log-precision transformers in a single pass.
Chain-of-thought as computation extension: Feng et al. (2023) analyze the role of intermediate generation steps in overcoming depth limitations. Their key insight: bounded-depth transformers face impossibility results for directly outputting answers to certain problems unless model size grows rapidly with input size. However, generating intermediate steps (chain-of-thought) allows constant-size autoregressive transformers to solve these problems by effectively extending computation depth through the generation process. This provides theoretical grounding for why CoT helps—it’s not just about “showing work” but about enabling deeper computation.
3.4 The Path Forward: Compute at Inference Time
If single forward passes cannot reliably solve multi-step reasoning problems, the natural response is to allocate more computation at inference time. This is the core insight behind several lines of research:
Chain-of-thought (CoT) prompting from Wei et al. (2022) instructs models to produce intermediate reasoning steps before the final answer. This works by:
- Converting internal reasoning into external tokens
- Allowing the model to “read” its own reasoning in subsequent forward passes
- Effectively increasing the computation allocated to the problem
CoT provides substantial improvements on math, logic, and multi-step reasoning tasks. However, it has limitations: the reasoning must fit the model’s implicit capabilities, and stated reasoning may not reflect actual computation.
Self-consistency from Wang et al. (2022) samples multiple reasoning chains and takes the majority answer. This provides error correction when reasoning paths are noisy but individual samples are above chance.
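Self-consistency reduces to a majority vote over sampled chains; a minimal sketch (the stand-in "model" below is hypothetical):

```python
from collections import Counter

def self_consistency(sample_chain, n_samples=5):
    """Sample several reasoning chains and return the majority final answer."""
    answers = [sample_chain() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Stand-in for a model: each call yields the final answer of one sampled chain
chains = iter(["17", "23", "17", "17", "19"])
print(self_consistency(lambda: next(chains)))  # 17 wins 3 of 5 votes
```

The vote helps exactly when individual samples are right more often than any single wrong answer, which is the "above chance" condition noted above.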
Tree-of-thought from Yao et al. (2023) generalizes CoT to explicit search over reasoning trees:
[Problem]
/ | \
[Step A][Step B][Step C]
/ \ | \
[A1][A2] [B1] [C1]
| |
[Ans1] [Ans2]
The model generates multiple candidate steps at each node, evaluates or scores them, and searches (breadth-first, beam, or other strategies) for solutions. This transforms reasoning from implicit single-pass computation to explicit search—at the cost of significantly more inference compute.
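The search described above can be sketched as beam search over partial reasoning chains (the toy `expand` and `score` functions below are stand-ins for model generation and step evaluation):

```python
import heapq

def tree_of_thought(root, expand, score, beam_width=2, depth=3):
    """Beam search over reasoning steps: keep the best-scoring partial chains per level."""
    beam = [(score(root), root)]
    for _ in range(depth):
        candidates = [(score(nxt), nxt)
                      for _, state in beam for nxt in expand(state)]
        if not candidates:
            break
        beam = heapq.nlargest(beam_width, candidates, key=lambda t: t[0])
    return max(beam, key=lambda t: t[0])[1]

# Toy task: grow a 3-step chain; the scorer prefers steps labeled 'a'
expand = lambda s: [s + "a", s + "b"]
score = lambda s: s.count("a")
print(tree_of_thought("", expand, score))  # aaa
```

In a real system, `expand` samples candidate next steps from the model and `score` comes from self-evaluation or a process reward model; the inference cost scales with beam width times depth.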
Process reward models (PRMs) from Lightman et al. (2023) train separate models to evaluate intermediate reasoning steps, enabling more reliable search. PRMs trained on step-level human feedback provide better signal for search than outcome-only supervision.
Formal verification integration: For domains with formal semantics (mathematics, code), integrating formal verification tools allows models to verify reasoning steps with certainty. AlphaProof (DeepMind, 2024) combines a language model with the Lean proof assistant to solve International Mathematical Olympiad problems at silver-medal level. AlphaGeometry (Trinh et al., 2024) combines neural language models with a symbolic deduction engine for geometry theorem proving. These systems are not general-purpose reasoners but demonstrate how neural and symbolic approaches can complement each other in domains with formal structure.
4. Retrieval Augmentation and Tool Use
A fundamental limitation of parametric models is that knowledge is static—frozen at training time. Retrieval-augmented generation (RAG) addresses this by conditioning generation on dynamically retrieved information.
4.1 RAG Architecture and Theory
The standard RAG formulation from Lewis et al. (2020):

p(y | x) = Σ_z p(z | x) · p(y | x, z)

where z represents retrieved documents. In practice:
- Encode query: Map input x to an embedding q
- Retrieve documents: Find the top-k documents z maximizing similarity sim(q, emb(z))
- Generate with context: Compute p(y | x, z_1, …, z_k)
Dense retrieval uses learned embeddings (e.g., Contriever, E5, BGE) where semantic similarity in embedding space approximates relevance. Training typically uses contrastive losses on query-document pairs.
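A minimal sketch of dense retrieval by cosine similarity (toy 2-d embeddings; a real system would use a learned encoder and an approximate nearest-neighbor index):

```python
import numpy as np

def retrieve(query_emb, doc_embs, k=2):
    """Return indices of the top-k documents by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    D = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    return np.argsort(-(D @ q))[:k]                     # highest similarity first

docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])  # toy document embeddings
print(retrieve(np.array([1.0, 0.05]), docs))           # indices of the two nearest docs
```

Normalizing both sides makes the dot product equal to cosine similarity, so ranking is insensitive to embedding magnitude.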
Retrieval quality is paramount. Empirically, RAG systems are bottlenecked by retrieval quality more than generation quality. A sophisticated generator cannot compensate for irrelevant retrieved documents. Investment in retrieval—including query understanding, document chunking, embedding model selection, and relevance filtering—typically provides better returns than generator improvements.
4.2 Advanced RAG Techniques
Query transformation: The user’s query may not be optimal for retrieval. Techniques include:
- Query expansion (add related terms)
- Hypothetical document embeddings (HyDE): generate a hypothetical answer, embed it, retrieve similar documents
- Multi-query retrieval: generate multiple query variants, retrieve for each, merge results
Hierarchical retrieval: For large corpora, two-stage retrieval is often necessary:
- Fast, approximate retrieval (BM25, approximate nearest neighbors) to get candidate set
- Expensive, accurate reranking (cross-encoder models) on candidates
Chunk optimization: Document chunking strategy significantly affects performance:
- Too small: loses context, increases noise
- Too large: dilutes relevance, exceeds context limits
- Overlapping chunks with metadata (document title, section headers) often outperform naive fixed-size chunking
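A sketch of overlapping fixed-size chunking (character-based for simplicity; production systems usually chunk by tokens or sentence boundaries and attach metadata):

```python
def chunk(text, size=200, overlap=50):
    """Split text into fixed-size chunks; consecutive chunks share `overlap` characters."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

print([len(c) for c in chunk("x" * 500)])  # [200, 200, 200]
```

The overlap ensures a sentence split at a chunk boundary appears whole in at least one chunk, at the cost of indexing some text twice.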
Self-RAG from Asai et al. (2023) trains models to decide when to retrieve, what to retrieve, and how to use retrieved information—making retrieval decisions adaptive rather than fixed.
4.3 Tool Use and Agency
Tool use extends retrieval to arbitrary external capabilities: calculators, code execution, APIs, databases, web search. The model generates structured calls, receives results, and conditions subsequent generation on them.
Tool use as a language modeling problem: Tool calls can be represented as special tokens in the generation vocabulary. Training on traces of successful tool use teaches models when and how to invoke tools. This approach powers function calling in GPT-4, Claude, and other systems.
The agency spectrum: Systems exhibit increasing levels of autonomous action:
| Level | Description | Example |
|---|---|---|
| 0 | No tools | Pure generation |
| 1 | Single tool call | Calculator, search |
| 2 | Multi-step tool use | Research + synthesis |
| 3 | Planning and execution | Decompose task, execute steps |
| 4 | Autonomous agents | Long-running, self-directed |
Higher agency levels introduce compounding failure modes: each step can fail, errors propagate, and the system may enter unrecoverable states. Robust agent architectures require explicit error handling, state management, and human oversight mechanisms.
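The tool-use loop at levels 1-2 can be sketched as follows (the message format and the "calc" tool are hypothetical; real systems add error handling per step):

```python
def run_agent(model_step, tools, max_steps=5):
    """Minimal tool-use loop: the model emits a tool call or a final answer each step."""
    observation = None
    for _ in range(max_steps):
        action = model_step(observation)
        if action.get("final") is not None:
            return action["final"]
        tool = tools[action["tool"]]            # look up the requested tool
        observation = tool(*action["args"])     # execute and feed the result back
    raise RuntimeError("agent exceeded step budget")

# Toy 'model': call the calculator once, then answer with the observed result
def model_step(obs):
    if obs is None:
        return {"tool": "calc", "args": (6, 7)}
    return {"final": obs}

print(run_agent(model_step, {"calc": lambda a, b: a * b}))  # 42
```

The step budget is the simplest guard against the unrecoverable-state problem noted above; production agents add validation of tool arguments and rollback on failure.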
For production RAG implementation, see our guide on designing RAG pipelines.
5. Multimodal Architectures
Modern AI systems increasingly process multiple modalities—text, images, audio, video—in unified frameworks. Understanding multimodal architectures is essential for building systems that perceive and reason about the world.
5.1 Vision-Language Models
Dual-encoder architectures (CLIP, ALIGN): Separate encoders for each modality, trained to align representations via contrastive learning. The foundational CLIP paper (Radford et al., 2021) uses an InfoNCE-style contrastive loss. For a batch of N image-text pairs, define the similarity logits:

s_{ij} = sim(u_i, v_j) / τ

The loss encourages matching pairs (diagonal) to have high similarity while non-matching pairs have low similarity:

L = -(1 / 2N) Σ_i [ log( exp(s_{ii}) / Σ_j exp(s_{ij}) ) + log( exp(s_{ii}) / Σ_j exp(s_{ji}) ) ]

where u_i and v_i are image and text embeddings for pair i, τ is a learned temperature parameter, and sim is typically cosine similarity.
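A NumPy sketch of the symmetric InfoNCE loss (function and variable names are ours; real training backpropagates through both encoders):

```python
import numpy as np

def clip_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE: cross-entropy over rows (image->text) and columns (text->image)."""
    u = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    v = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    s = u @ v.T / tau                         # similarity logits; diagonal = matching pairs
    def ce(logits):                           # cross-entropy with the diagonal as targets
        logits = logits - logits.max(axis=1, keepdims=True)
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))
    return 0.5 * (ce(s) + ce(s.T))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
print(clip_loss(x, x))  # near zero: identical embeddings are perfectly aligned
```

Feeding unrelated embeddings for the two modalities yields a much larger loss, which is the gradient signal that pulls matching pairs together.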
Emergent capabilities: CLIP-style models exhibit remarkable zero-shot transfer. The text encoder can represent arbitrary class descriptions, enabling classification without task-specific training. This emerges from the scale and diversity of image-text training data.
Limitations: Dual encoders compute independent embeddings—they cannot model fine-grained interactions between modalities. “A cat sitting on a red mat” and “A red cat sitting on a mat” may have similar embeddings despite different meanings.
Unified transformer architectures (Flamingo, GPT-4V, Gemini): Process all modalities in a single transformer:
[Image tokens] [Adapter] → [Transformer] ← [Text tokens]
↓
[Output tokens]
Image patches are embedded (via CNN, ViT, or similar), optionally projected through an adapter (mapping to the LLM’s embedding space), and concatenated with text tokens. The unified transformer attends over both modalities, enabling rich cross-modal reasoning.
Key architectural decisions:
- Early vs. late fusion: Early fusion (concatenate at input) allows maximum interaction; late fusion (separate processing, merge at output) is more efficient but less expressive
- Adapter architecture: Linear projection, MLP, cross-attention, or Q-Former (BLIP-2, Li et al., 2023) all work, with tradeoffs in expressiveness and training efficiency
- Frozen vs. tuned vision encoder: Freezing (BLIP-2, LLaVA) enables rapid adaptation with less data; end-to-end tuning (Gemini) maximizes capability but requires more compute
5.2 Image Generation
Diffusion models have become the dominant paradigm for image generation. The forward process gradually adds Gaussian noise:
q(x_t | x_{t-1}) = N(x_t; sqrt(1 - β_t) * x_{t-1}, β_t * I)
The reverse process learns to denoise:
p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))
Text conditioning: Text-to-image generation conditions the denoising process on text embeddings. Classifier-free guidance from Ho & Salimans (2022) interpolates between conditional and unconditional predictions:

ε̃_θ(x_t, c) = ε_θ(x_t, ∅) + w · (ε_θ(x_t, c) - ε_θ(x_t, ∅))

where w > 1 amplifies the influence of the conditioning c. Higher guidance scales produce images more aligned with prompts but with reduced diversity.
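Classifier-free guidance is a one-line interpolation of the two denoiser predictions (a sketch; the ε values here are toy arrays standing in for network outputs):

```python
import numpy as np

def cfg_predict(eps_cond, eps_uncond, w=7.5):
    """Guided noise prediction: eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_c = np.array([1.0, 0.0])   # toy conditional prediction
eps_u = np.array([0.5, 0.0])   # toy unconditional prediction
print(cfg_predict(eps_c, eps_u, w=2.0))  # [1.5 0. ]
```

At w = 1 the guided prediction equals the conditional one; larger w extrapolates past it, which is why high guidance scales sharpen prompt alignment while reducing diversity.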
Latent diffusion from Rombach et al. (2022) (Stable Diffusion): Performing diffusion in a learned latent space rather than pixel space dramatically reduces computational cost while maintaining quality. An autoencoder compresses images to latents; diffusion operates in latent space; a decoder reconstructs pixels.
Architectural innovations:
- DiT (Diffusion Transformer): Peebles & Xie (2023) replaces U-Net with transformer architecture, scaling better with compute
- Consistency models: Song et al. (2023) enable few-step or single-step generation by training to map any point on the diffusion trajectory to the final output
- Rectified flows: Liu et al. (2022) provides alternative formulation with straighter sampling trajectories
5.3 Video and Temporal Modeling
Video introduces temporal consistency as a key challenge. A sequence of independently generated frames produces flickering, inconsistent results.
Temporal attention: Extend attention to include the temporal dimension. For video input x ∈ R^{T × H × W × d}, temporal attention attends across the T frames at each spatial position (h, w).
Combined with spatial attention, this enables joint spatiotemporal modeling.
3D architectures: Inflate 2D convolutions/attention to 3D, processing spatiotemporal volumes directly. More expressive but more computationally expensive.
Autoregressive video models: Generate frames sequentially, conditioning each frame on previous frames. Can leverage powerful image models by fine-tuning for next-frame prediction. Struggle with long-range temporal coherence.
Video diffusion: Apply diffusion process to video latents, with temporal attention ensuring frame consistency. Notable examples include OpenAI’s Sora, with Sora 2 (September 2025) showing significant improvements in duration and coherence; Runway’s Gen-3 and subsequent versions; Google DeepMind’s Veo; and Kling from Kuaishou. This remains an active area with rapid progress—capabilities improve substantially every few months.
6. Mechanistic Interpretability
Understanding what neural networks actually compute—rather than treating them as black boxes—is crucial for alignment, debugging, and capability development. Mechanistic interpretability aims to reverse-engineer learned computations.
6.1 Features and Circuits
The linear representation hypothesis: Neural networks represent human-interpretable features as directions in activation space. Evidence:
- Word embeddings exhibit linear structure (king - man + woman ≈ queen)
- Probing classifiers find task-relevant information in intermediate representations
- Activation patching shows causal role of specific directions
Features: A feature is a property the model represents, corresponding to a direction in activation space. Features include concrete concepts (cat, running) and abstract ones (negation, uncertainty, first-person perspective).
Circuits: Features are computed by circuits—subnetworks that implement specific computations. Circuit analysis traces how features are computed from inputs through intermediate features.
Example: Indirect Object Identification from Wang et al. (2023): In “Mary gave the book to John. She…”, the model must determine that “She” refers to Mary. This is implemented by a circuit involving:
- Name mover heads that copy names to the final position
- S-inhibition heads that suppress the subject from the indirect object position
- Backup name mover heads providing redundancy
Understanding this circuit required painstaking analysis—activation patching to identify relevant components, then detailed study of attention patterns and how they compose.
6.2 Superposition and Polysemanticity
A key finding: neural networks represent more features than they have dimensions. This is possible through superposition—features are encoded as nearly-orthogonal directions in a shared space.
The toy model: Elhage et al. (2022) study superposition in simple settings. With n features and m < n dimensions, sparse features can be packed into the space as nearly-orthogonal directions provided they rarely activate simultaneously. If features rarely co-occur, their representations rarely interfere. The network tolerates small reconstruction errors to represent more features than dimensions allow.
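A quick numerical illustration of why this packing is possible: random unit directions in even a modest-dimensional space interfere only weakly (the numbers are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, dim = 512, 64                 # many more features than dimensions
W = rng.normal(size=(n_features, dim))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Interference = cosine similarity between distinct feature directions
G = W @ W.T
off = G[~np.eye(n_features, dtype=bool)]
print(float(np.abs(off).mean()) < 0.15)   # True: random directions are nearly orthogonal
```

With sparse activations, these small overlaps contribute only occasional, small interference, so the network can trade a little reconstruction error for 8× more features than dimensions.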
Implications:
- Individual neurons are polysemantic—they respond to multiple unrelated features
- Features are distributed across neurons, not localized
- This makes interpretation harder: you cannot simply study what each neuron does
6.3 Sparse Autoencoders for Feature Discovery
Sparse autoencoders (SAEs) address polysemanticity by learning an overcomplete basis of monosemantic features:

f = ReLU(W_e x + b_e),  x̂ = W_d f + b_d

The SAE is trained on model activations x to minimize reconstruction error ||x - x̂||² plus a sparsity penalty λ ||f||_1, encouraging it to find a sparse set of interpretable directions.
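A sketch of a single SAE forward pass and training loss (random weights for illustration; real SAEs are trained on large caches of model activations):

```python
import numpy as np

def sae_forward(x, W_e, b_e, W_d, b_d, lam=1e-3):
    """One SAE pass: sparse features f, reconstruction x_hat, and the training loss."""
    f = np.maximum(0.0, x @ W_e + b_e)             # ReLU encoder into an overcomplete basis
    x_hat = f @ W_d + b_d                          # linear decoder
    loss = np.mean((x - x_hat) ** 2) + lam * np.abs(f).mean()  # reconstruction + L1 sparsity
    return f, x_hat, loss

rng = np.random.default_rng(0)
d, m = 16, 64                                      # 4x overcomplete dictionary
x = rng.normal(size=(8, d))                        # stand-in for cached activations
W_e, b_e = 0.1 * rng.normal(size=(d, m)), np.zeros(m)
W_d, b_d = 0.1 * rng.normal(size=(m, d)), np.zeros(d)
f, x_hat, loss = sae_forward(x, W_e, b_e, W_d, b_d)
print(f.shape, x_hat.shape)  # (8, 64) (8, 16)
```

The L1 term drives most entries of f to zero, so each activation is explained by a few dictionary directions, which is what makes the learned features candidate monosemantic units.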
Results: SAEs trained on language model activations discover features corresponding to:
- Semantic concepts (DNA sequences, programming languages, legal text)
- Syntactic roles (subjects, objects, modifiers)
- Meta-cognitive states (uncertainty, refusal, roleplay)
- Formatting patterns (markdown, lists, code blocks)
Anthropic’s research on monosemanticity provides extensive documentation of discovered features.
Limitations:
- SAEs may find the “wrong” decomposition—there are many valid sparse decompositions
- Scaling to large models is computationally expensive
- Completeness is hard to verify—how do we know if we’ve found all important features?
6.4 Applications to Alignment
Mechanistic interpretability has direct applications to AI safety:
Detecting deception: If we understand how truthfulness is represented, we might detect when models are being deceptive by examining internal activations rather than outputs.
Steering vectors: The linear representation hypothesis suggests we can modify model behavior by adding directions corresponding to desired features:

h′ = h + α · v

where v is a direction identified through interpretability research (e.g., “honesty” or “helpfulness”) and α scales the intervention. Turner et al. (2023) demonstrate this approach for behavioral control without fine-tuning.
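Activation addition is a one-line intervention at a chosen layer (a sketch; the "honesty" direction here is a placeholder basis vector, not a real learned feature):

```python
import numpy as np

def steer(h, v, alpha=4.0):
    """Activation addition: shift hidden states h along a unit feature direction v."""
    v = v / np.linalg.norm(v)
    return h + alpha * v              # broadcasts over (n_tokens, d_model)

h = np.zeros((3, 8))                  # toy hidden states
v = np.eye(8)[0]                      # placeholder 'honesty' direction
print(steer(h, v, alpha=2.0)[0, 0])   # 2.0
```

In practice v is extracted as a difference of mean activations between contrasting prompts, and α is tuned so the shift changes behavior without degrading fluency.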
Understanding failures: When models fail, interpretability tools can diagnose why. Which circuits activated incorrectly? What features were misrecognized? This enables targeted fixes.
Verification: For safety-critical systems, we want not just empirical testing but understanding of why the system behaves as it does. Mechanistic interpretability provides this, at least partially.
7. Alignment and Robustness
Building AI systems that reliably do what we intend—and avoid what we don’t intend—is both a technical and philosophical challenge. Here we focus on the technical approaches.
7.1 The Alignment Problem Formalized
The alignment problem can be stated as: ensure that the model’s optimization target (what it actually pursues) matches the intended objective (what we want it to pursue).
Sources of misalignment:
- Specification gaming: The objective function doesn’t capture what we really want; the model optimizes the specified objective in unintended ways
- Goal misgeneralization: The model learns a proxy of the intended goal that diverges in new situations
- Deceptive alignment: The model appears aligned during training but pursues different objectives in deployment
7.2 RLHF and Its Limitations
Reinforcement Learning from Human Feedback (RLHF), formalized by Christiano et al. (2017) and applied to language models by Ouyang et al. (2022), trains models to produce outputs that humans prefer:
- Collect comparisons: Humans rank model outputs for the same prompt
- Train reward model: Learn r_φ(x, y) predicting human preferences
- Optimize policy: Fine-tune the model to maximize expected reward via PPO or similar:

max_θ E[r_φ(x, y)] - β * D_KL[π_θ || π_ref]

The KL penalty prevents the model from deviating too far from the reference policy (usually the supervised fine-tuned model), which helps avoid reward hacking.
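A Monte-Carlo sketch of the KL-regularized objective from sampled outputs (toy numbers; real systems estimate this per token inside PPO rather than per sequence):

```python
import numpy as np

def rlhf_objective(rewards, logp_theta, logp_ref, beta=0.1):
    """Sample estimate of E[r(x, y)] - beta * KL(pi_theta || pi_ref)."""
    kl = np.mean(logp_theta - logp_ref)   # KL estimated from samples drawn under pi_theta
    return float(np.mean(rewards) - beta * kl)

r = np.array([1.0, 0.5, 0.8])             # toy reward-model scores
lp_t = np.array([-2.0, -2.5, -1.8])       # log-probs under the policy
lp_r = np.array([-2.2, -2.4, -2.0])       # log-probs under the reference model
print(round(rlhf_objective(r, lp_t, lp_r), 4))  # 0.7567
```

Raising β penalizes outputs the reference model finds unlikely, which is the lever that trades reward maximization against staying close to the supervised model.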
Limitations of RLHF:
- Reward hacking: Models can learn to exploit the reward model rather than satisfy human preferences. The reward model has its own failure modes.
- Human feedback quality: Humans are inconsistent, have biases, and struggle to evaluate complex outputs. High-quality feedback is expensive.
- Scalable oversight: As models become more capable, humans become less able to evaluate outputs accurately. We cannot RLHF our way to superhuman performance if we cannot evaluate superhuman outputs.
7.3 Constitutional AI and Self-Improvement
Constitutional AI from Bai et al. (2022) reduces reliance on human feedback by having models critique and revise their own outputs according to a set of principles (the “constitution”):
- Generate initial response
- Critique response against constitutional principles
- Revise response based on critique
- Use revised responses as training signal
This generates training data without human labeling for each example. Humans specify principles (the constitution) rather than labeling instances.
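The four steps above can be sketched as a single generate → critique → revise cycle. The `generate`, `critique`, and `revise` functions here are hypothetical stubs standing in for real language-model calls, and the critique criterion is a toy placeholder; only the control flow reflects the method:

```python
# Sketch of the Constitutional AI data-generation loop. The "model" calls
# below are hypothetical stubs, not an actual API.

CONSTITUTION = [
    "Avoid harmful instructions.",
    "Be honest about uncertainty.",
]

def generate(prompt):
    # Stub: a real system would sample a response from the language model.
    return f"draft response to: {prompt}"

def critique(response, principles):
    # Stub: a real system would ask the model which principles the
    # response violates. Toy criterion used here for illustration only.
    return [p for p in principles if "draft" in response]

def revise(response, critiques):
    # Stub: a real system would ask the model to rewrite the response
    # so that it satisfies the criticized principles.
    return response.replace("draft", "revised") if critiques else response

def constitutional_step(prompt):
    """One generate -> critique -> revise cycle; the revised output becomes
    training data, with no per-example human label."""
    response = generate(prompt)
    critiques = critique(response, CONSTITUTION)
    return revise(response, critiques)
```

The point of the structure is that humans author only `CONSTITUTION`; everything downstream of it is model-generated.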
RLAIF (RL from AI Feedback): Replace human comparisons with model-generated comparisons. A capable model evaluates which outputs better satisfy the constitution, providing reward signal for training.
Advantages:
- Scales better than human feedback
- Principles can be explicit and auditable
- Can enforce consistency across many examples
Risks:
- Model’s own biases are amplified
- Principles may be incomplete or conflicting
- Self-improvement processes can be unstable
7.4 Direct Preference Optimization
Direct Preference Optimization (DPO) from Rafailov et al. (2023) eliminates the explicit reward model by deriving a closed-form solution for the optimal policy:
Source: -E[ log σ( β * log(π_θ(y_w|x) / π_ref(y_w|x)) - β * log(π_θ(y_l|x) / π_ref(y_l|x)) ) ]
where y_w and y_l are the preferred and dispreferred outputs, σ is the sigmoid function, β controls deviation from the reference policy, and the log-ratios are the key structural element (the implicit reward difference). This directly optimizes the policy on preference data without an intermediate reward model.
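For a single preference pair, the loss depends only on four probabilities: the policy and reference probabilities of y_w and y_l. A minimal sketch in pure Python (the probability values are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(pi_w, ref_w, pi_l, ref_l, beta=0.1):
    """DPO loss for one preference pair.

    pi_w, pi_l:  policy probabilities of the preferred / dispreferred output
    ref_w, ref_l: reference-policy probabilities of the same outputs
    The margin is the implicit reward difference: the gap between the
    two log-ratios, scaled by beta.
    """
    margin = beta * (math.log(pi_w / ref_w) - math.log(pi_l / ref_l))
    return -math.log(sigmoid(margin))

# If the policy favors y_w more (and y_l less) than the reference does,
# the margin is positive and the loss drops below -log(0.5).
loss_improved = dpo_loss(pi_w=0.6, ref_w=0.4, pi_l=0.1, ref_l=0.3)
loss_neutral  = dpo_loss(pi_w=0.4, ref_w=0.4, pi_l=0.3, ref_l=0.3)
```

When the policy equals the reference, both log-ratios vanish and the loss sits at -log(0.5); gradient descent pushes the log-ratio of y_w up and that of y_l down.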
Advantages:
- Simpler pipeline (no reward model training)
- More stable training
- Equivalent to RLHF in theory, often comparable in practice
Variants: IPO (Identity Preference Optimization), KTO (Kahneman-Tversky Optimization, Ethayarajh et al., 2024), ORPO (Odds Ratio Preference Optimization) each address specific limitations of DPO.
7.5 Robustness and Adversarial Attacks
Aligned models must be robust to adversarial inputs—prompts designed to elicit unintended behaviors.
Jailbreaking: Prompts that circumvent safety training:
- Role-play scenarios (“Pretend you are an evil AI…”)
- Encoding attacks (Base64, pig latin)
- Context manipulation
- Gradient-based prompt optimization (GCG attack, Zou et al., 2023)
Defenses:
- Training robustness: Include adversarial examples in safety training
- Input filtering: Detect and block adversarial patterns
- Output filtering: Screen responses for policy violations
- Ensemble approaches: Multiple independent checks
No known defense is complete; jailbreaking remains an active arms race between attack and defense research.
8. Evaluation and Measurement
Proper evaluation is critical for understanding model capabilities and guiding development. Different benchmarks measure different capabilities, and metric choice significantly affects conclusions.
8.1 Reasoning vs. Retrieval vs. Tool Use
Reasoning benchmarks (GSM8K, MATH, ARC, BBH) test multi-step inference. Key considerations:
- Data contamination: models may have seen test problems during training
- Solution style: does the benchmark reward correct answers or correct reasoning?
- Difficulty distribution: aggregate metrics can hide bimodal behavior (perfect on easy, zero on hard)
Knowledge/retrieval benchmarks (TriviaQA, Natural Questions, MMLU) test factual recall. These conflate retrieval from parameters with reasoning about retrieved information.
Tool use evaluation remains underdeveloped. Existing benchmarks (ToolBench, API-Bank) often have narrow coverage and don’t capture realistic failure modes like cascading errors in multi-step tool chains.
8.2 Avoiding Misleading Metrics
Accuracy on held-out sets can be misleading when:
- Distribution shift exists between benchmark and deployment
- Benchmark saturates (ceiling effects hide capability differences)
- Models are optimized for benchmark performance specifically
Calibration matters: a model that says “I’m 90% confident” should be right 90% of the time. Overconfident models are dangerous in production.
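Calibration is commonly quantified with expected calibration error (ECE): bin predictions by stated confidence and compare each bin's average confidence to its empirical accuracy. A minimal binned-ECE sketch (toy inputs, not from any real model):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: |accuracy - avg confidence| per bin, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, ok))
    ece, n = 0.0, len(confidences)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece

# A model that says 0.9 and is right 9 times out of 10 contributes zero
# error; one that says 0.9 but is right half the time contributes 0.4.
```

Reporting ECE alongside accuracy catches the overconfident-but-accurate-on-average failure mode that accuracy alone hides.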
Human preference ratings can be gamed by verbosity, confident tone, and other superficial features that don’t correlate with correctness.
Recommendation: Use multiple complementary benchmarks, include held-out evaluation sets not used for development, measure calibration alongside accuracy, and validate that benchmark performance predicts deployment performance.
9. Long-Context Modeling
Extending context length beyond training-time limits is critical for many applications (document analysis, code understanding, long conversations).
9.1 Positional Encoding Extension
Beyond the encodings discussed in Section 1.3, several techniques extend context at inference time:
Position interpolation: Scale position indices to fit within training range. If trained on 4K context, interpolate positions for 16K context by dividing position indices by 4. Works but can degrade performance.
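Position interpolation amounts to one change: divide position indices by the extension factor before computing rotary angles, so out-of-range positions map back into the trained range. A sketch of scaled RoPE angles (function and parameter names are illustrative; compare Su et al., 2021 for the underlying encoding):

```python
def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotary angles for one position across dim/2 frequency pairs.

    scale > 1 interpolates: positions are compressed so a longer context
    reuses the angle range the model saw during training.
    """
    return [
        (position / scale) * base ** (-2 * i / dim)
        for i in range(dim // 2)
    ]

# Trained on 4K context, running at 16K: with scale=4, position 16000
# produces exactly the angles that position 4000 produced in training.
assert rope_angles(16000, dim=8, scale=4.0) == rope_angles(4000, dim=8)
```

The compression is also why naive interpolation degrades quality: nearby positions become harder to distinguish, which is the problem YaRN and NTK-aware scaling address.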
YaRN (Yet another RoPE extensioN): Peng et al. (2023) combines interpolation with attention scaling, maintaining performance better than naive interpolation.
NTK-aware scaling: Modifies RoPE’s frequency basis to better handle extrapolation.
9.2 Efficient Long-Context Attention
Beyond sparse attention variants:
Landmark attention: Select key “landmark” tokens that summarize regions, attend fully to landmarks, then attend locally.
Streaming/recurrent approaches: Process context in chunks, maintaining a compressed state. Models like Mamba (Gu & Dao, 2023) use state-space models with linear scaling in sequence length.
Retrieval over context: For very long contexts, retrieve relevant portions rather than attending to everything. This trades exact attention for practical scalability.
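The streaming pattern above can be sketched as a fold over chunks with a fixed-size carried state. The state update here is a toy running mean chosen only to make the control flow concrete; real models (e.g., state-space layers) learn the compression:

```python
def stream_process(tokens, chunk_size=4):
    """Process a long sequence chunk by chunk with O(1) carried state.

    The state update (a running mean) is a toy stand-in for a learned
    compressed state; only the chunked, constant-memory structure is
    the point of this sketch.
    """
    state, seen = 0.0, 0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        state = (state * seen + sum(chunk)) / (seen + len(chunk))
        seen += len(chunk)
    return state  # memory is constant regardless of sequence length
```

Contrast with full attention, where every token must remain addressable: here each chunk is folded into the state and discarded, which is what buys linear time and constant memory.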
9.3 Training for Long Context
Simply training on longer sequences is expensive (attention cost is quadratic in sequence length) but effective. Techniques to reduce cost:
Progressive training: Train on short sequences first, gradually increase length.
Sparse attention during training: Use efficient attention for most training, fine-tune with full attention on target length.
Synthetic long-context data: Generate training data specifically exercising long-range dependencies.
10. Future Directions and Open Problems
We conclude with significant open problems and promising research directions.
10.1 Fundamental Capability Improvements
Reliable reasoning: Despite progress on chain-of-thought and search-based methods, models still fail on tasks requiring reliable multi-step inference. Neurosymbolic approaches combining neural and formal methods show promise but lack generality.
Sample efficiency: Humans learn from far less data than current models. Techniques for better data efficiency—curriculum learning, meta-learning, causal representation learning—could dramatically reduce training costs or improve capability at fixed compute.
Continual learning: Current models are static after training. Efficiently incorporating new knowledge without catastrophic forgetting remains challenging. Retrieval augmentation provides a partial solution; true continual learning would be more powerful.
10.2 Architectural Innovation
Alternative sequence models: Mamba (Gu & Dao, 2023) and other state-space models show promise with linear scaling in sequence length, but have not displaced transformers for language. Hybrid architectures combining attention and state-space layers are an active area.
Mixture of Experts (MoE): Sparse MoE models route inputs to subsets of parameters, enabling larger total parameter counts with fixed per-example compute. Switch Transformer (Fedus et al., 2021), Mixtral, and Arctic demonstrate this approach. Optimal routing, load balancing, and training stability remain active research areas.
Memory and state: Transformers have no persistent state beyond the context window. Architectures incorporating external memory (Memorizing Transformers, Wu et al., 2022) or recurrent state could extend effective context without quadratic cost.
10.3 Alignment and Safety
Scalable oversight: How do we supervise systems more capable than ourselves? Proposals include:
- Debate: Irving et al. (2018) propose models argue positions, humans judge arguments
- Recursive reward modeling: Models help evaluate other models
- Iterated amplification: Christiano et al. (2018) gradually build evaluation capability through AI assistance
None are proven at scale.
Interpretability at scale: Current interpretability techniques work on small models or small portions of large models. Scaling mechanistic interpretability to frontier models is essential for understanding and controlling them.
Evaluating dangerous capabilities: We lack reliable methods to evaluate deception, manipulation, and long-horizon planning capabilities before deployment. Developing robust evaluations is crucial for responsible development.
10.4 World Models and Embodiment
World models: Learning predictive models of environments that support planning and reasoning. Video prediction models, game simulators, and robotics foundation models all point toward this direction. True world models that generalize across domains remain distant.
Embodiment: Grounding language in physical action. Current language models reason about the world through text; embodied systems must act in it. The robotics foundation model approach (RT-2, Brohan et al., 2023) shows progress, but general-purpose robots remain beyond current capabilities.
Multimodal unification: Current multimodal models handle limited modality combinations. True multimodal systems would seamlessly integrate text, images, audio, video, 3D, actions, and other modalities. Architectural approaches that scale to many modalities without modality-specific engineering are needed.
Conclusion
The transformer architecture, scaled with unprecedented compute and data, has produced AI systems with remarkable capabilities. Yet significant limitations remain: reasoning is unreliable, knowledge is static, alignment is incomplete, and our understanding of what these systems compute is fragmentary.
Progress will come from multiple fronts: architectural innovations that improve efficiency and capability, algorithmic advances in reasoning and retrieval, alignment techniques that scale with capability, and interpretability methods that illuminate system behavior.
The path forward requires both engineering rigor—building systems that work reliably in production—and scientific understanding—comprehending why they work and when they fail. This paper has aimed to provide foundations for both.
References
Arora, S., & Goyal, A. (2023). A Theory for Emergence of Complex Skills in Language Models.
Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback.
Berglund, L., et al. (2023). The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”.
Brown, T., et al. (2020). Language Models are Few-Shot Learners.
Christiano, P., et al. (2017). Deep Reinforcement Learning from Human Preferences.
Christiano, P., et al. (2018). Supervising Strong Learners by Amplifying Weak Experts.
Dao, T., et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
DeepMind. (2024). AI Solves IMO Problems at Silver Medal Level.
Elhage, N., et al. (2022). Toy Models of Superposition.
Ethayarajh, K., et al. (2024). KTO: Model Alignment as Prospect Theoretic Optimization.
Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces.
Ho, J., & Salimans, T. (2022). Classifier-Free Diffusion Guidance.
Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models.
Irving, G., et al. (2018). AI Safety via Debate.
Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models.
Lanham, T., et al. (2023). Measuring Faithfulness in Chain-of-Thought Reasoning.
Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
Lightman, H., et al. (2023). Let’s Verify Step by Step.
Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback.
Peebles, W., & Xie, S. (2023). Scalable Diffusion Models with Transformers.
Peng, B., et al. (2023). YaRN: Efficient Context Window Extension of Large Language Models.
Pérez, J., et al. (2021). Attention is Turing-Complete.
Power, A., et al. (2022). Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets.
Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision.
Rombach, R., et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models.
Schaeffer, R., et al. (2023). Are Emergent Abilities of Large Language Models a Mirage?
Shazeer, N. (2019). Fast Transformer Decoding: One Write-Head is All You Need.
Song, Y., et al. (2023). Consistency Models.
Su, J., et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding.
Trinh, T., et al. (2024). Solving Olympiad Geometry without Human Demonstrations.
Turner, A., et al. (2023). Activation Addition: Steering Language Models Without Optimization.
Vaswani, A., et al. (2017). Attention Is All You Need.
Wang, X., et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models.
Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.
Wei, J., et al. (2022). Emergent Abilities of Large Language Models.
Wu, Y., et al. (2022). Memorizing Transformers.
Yao, S., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models.
Zou, A., et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models.
Anthropic. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.
