Capability dashboard

State of the Frontier

The leading measures of AI capability — how capable, how autonomous, how in demand, how fast it is improving, and how much it costs to push the frontier. The Intelligence Index and the autonomy horizon refresh automatically each month; the remaining series are periodically curated — each chart shows the date its data was last updated.

Intelligence Index

The Race No One Agreed To Run

The Artificial Analysis Intelligence Index measures AI model capability across reasoning, knowledge, and instruction-following. The frontier has advanced from single-digit scores to over 60 in just three years.

Each line tracks a lab's most capable model over time. The steepening curves show AI capability growing faster each year — and the gap between leaders shrinking.

Data: Artificial Analysis Intelligence Index — Last updated:

The Intelligence Index is a composite benchmark measuring model performance across reasoning, knowledge, mathematics, and coding tasks. Higher scores indicate greater capability. Used with attribution for educational purposes.

Autonomous task length

Autonomy Horizon

The longest task, measured by how long it takes a skilled human, that AI agents can complete with 50% success. In 2019, AI could manage a 2-second task. By April 2026, the frontier had reached roughly 17 hours of skilled work, though METR notes that measurements above 16 hours are unreliable with the current task suite. Since 2023, the horizon has been doubling roughly every 129 days.

This measures how long AI agents can work autonomously on real software engineering tasks. The y-axis is logarithmic: each gridline represents a tenfold increase in autonomous work duration.

Data: METR — Horizon Benchmark v1.1. Cite: arXiv:2503.14499 (NeurIPS 2025). Used with attribution for educational purposes. · Data as of 7 Apr 2026

The p50 horizon length is the task duration (for a skilled human) at which AI agents succeed 50% of the time. Shaded bands show 95% confidence intervals. The frontier line connects state-of-the-art models at time of release. The dashed trend line shows the 129-day doubling rate from 2023 onward.
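The doubling trend and the logarithmic axis described above come down to a few lines of arithmetic. The sketch below is illustrative only and is not METR's code: the function names are hypothetical, and it assumes a constant doubling time, which the trend line claims only from 2023 onward.

```python
import math

DOUBLING_DAYS = 129.0  # doubling time quoted for the 2023-onward trend line


def horizon_after(start_hours: float, days_elapsed: float,
                  doubling_days: float = DOUBLING_DAYS) -> float:
    """Horizon after days_elapsed, assuming constant exponential doubling."""
    return start_hours * 2.0 ** (days_elapsed / doubling_days)


def implied_doubling_days(h0_hours: float, h1_hours: float,
                          days_between: float) -> float:
    """Doubling time implied by two horizon measurements taken days_between apart."""
    if h0_hours <= 0 or h1_hours <= h0_hours or days_between <= 0:
        raise ValueError("need 0 < h0_hours < h1_hours and days_between > 0")
    return days_between * math.log(2.0) / math.log(h1_hours / h0_hours)


def decades_spanned(h0_hours: float, h1_hours: float) -> float:
    """Number of tenfold gridlines between two horizons on a log10 axis."""
    return math.log10(h1_hours / h0_hours)


if __name__ == "__main__":
    # One doubling period after a 17-hour horizon.
    print(horizon_after(17.0, 129.0))  # 34.0
    # Gridlines separating a 2-second task from a 17-hour task.
    print(round(decades_spanned(2.0 / 3600.0, 17.0), 2))
```

`implied_doubling_days` runs the same relationship in reverse: given two measured horizons and the time between them, it returns the doubling time that would connect them. Any projection past the measured range is an extrapolation of the trend, not a measurement.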

AI-agent demand & capability

The Horizon Extends

AI agents are advancing on two axes at once. The labor market reveals demand: a skill cluster that did not meaningfully exist before 2024 — postings for "agentic AI," "AI agents," "LangGraph" — exploded across 2025 and now appears in roughly 48,000 US job postings, overtaking decade-old categories like neural networks, robotics, and visual recognition. And the agents themselves are gaining capability: on four real-world benchmarks where the best models scored 1–20% as recently as 2024, they now hit 66–93%, closing in on human performance across domains as distinct as web navigation, terminal commands, computer use, and cybersecurity.

Demand view: each line tracks the share of US job postings mentioning an AI-related skill cluster, 2010–2025. The cyan AI agent line emerged from zero in 2024 — note how quickly it overtook visual recognition, robotics, and neural networks. Capability view: each line tracks the best AI model on a real-world agent benchmark; dashed lines mark human performance. Terminal-Bench is monthly; the others are annual.

Data: Stanford AI Index 2026 (CC BY-ND 4.0). Demand: Economy chapter, Fig 4.4.3 (Lightcast 2010–2025) and Fig 4.4.7 (agent-skill breakdown, 2024 vs 2025). Capability: Technical Performance chapter, Fig 2.5.2 (Terminal-Bench 2.0), Fig 2.6.2 (OSWorld), Fig 2.6.3 (WebArena), Fig 2.6.5 (Cybench).

Demand intermediate-year values were read off Stanford's published chart to within ±0.05 percentage points; 2025 endpoints are exact. Capability values are all exact (printed on Stanford's charts).

Frontier benchmarks

Closing the Gap

These seven benchmarks were designed to measure the frontier of AI capability. Each line tracks the best model score over time. FrontierMath went from <2% to ~25%, ARC-AGI from ~5% to ~88%, and SWE-bench from 2% to ~70%. On Humanity's Last Exam, the best model scored just 3% at launch; within months the top score had been pushed to 27%. Problems designed to resist progress for years are falling in months.

Each line tracks AI performance on a different standardized test. When a line approaches the top of the chart, AI has matched or surpassed human expert performance on that task.

Data: Epoch AI Benchmark Hub (CC BY 4.0). Individual sources: SWE-bench, Epoch FrontierMath, ARC Prize, GPQA, SimpleQA, MATH, Humanity's Last Exam.

The Epoch Capabilities Index (ECI) is Epoch AI's aggregate score combining performance across multiple benchmarks into a single comparable metric per model.

Training compute

The Engine Room

Training compute for frontier AI models has grown by over 10,000× since 2020, doubling every 9 months. Each point on this chart represents a decision by an organization to invest more computational resources in a model than had been spent on any previous model in history.

Each dot represents an AI model. The vertical axis shows the computing power used to train it — note the logarithmic scale, where each step represents a 10× increase. Training compute has grown roughly 4× per year since 2010.
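Growth figures quoted as a doubling time, an annual multiple, or a cumulative multiple can be converted into one another. The sketch below shows the conversions, assuming smooth exponential growth. The function names are hypothetical, and nothing here comes from Epoch AI's methodology.

```python
import math


def annual_factor_from_doubling(doubling_months: float) -> float:
    """Yearly growth multiple implied by a doubling time in months."""
    return 2.0 ** (12.0 / doubling_months)


def doubling_months_from_annual(annual_factor: float) -> float:
    """Doubling time in months implied by a yearly growth multiple (> 1)."""
    if annual_factor <= 1.0:
        raise ValueError("annual_factor must be greater than 1")
    return 12.0 * math.log(2.0) / math.log(annual_factor)


def cumulative_growth(annual_factor: float, years: float) -> float:
    """Total growth multiple after the given number of years."""
    return annual_factor ** years


def years_to_reach(total_factor: float, annual_factor: float) -> float:
    """Years of steady growth needed to reach a total growth multiple."""
    return math.log(total_factor) / math.log(annual_factor)


if __name__ == "__main__":
    print(annual_factor_from_doubling(6.0))            # 4.0
    print(round(annual_factor_from_doubling(9.0), 2))  # 2.52
    print(cumulative_growth(4.0, 6))                   # 4096.0
    print(round(years_to_reach(10_000.0, 4.0), 2))
```

Under these formulas, a 9-month doubling time corresponds to roughly 2.5× per year, while 4× per year corresponds to a 6-month doubling time. Converting every quoted rate to the same unit makes the figures easier to compare.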

Data compiled from Epoch AI research, published papers, and industry reports. Training compute estimates carry uncertainty — see Epoch AI documentation for methodology. CC BY 4.0.

Hardware Scaling: transistor counts from Our World in Data (Moore's Law), GPU FLOPS from manufacturer specifications (NVIDIA, AMD). Moore's original observation: Gordon Moore, "Cramming More Components onto Integrated Circuits," Electronics, 1965.
