Capability dashboard
State of the Frontier
The leading measures of AI capability — how capable, how autonomous, how in demand, how fast it is improving, and how much it costs to push the frontier. The Intelligence Index and the autonomy horizon refresh automatically each month; the remaining series are periodically curated — each chart shows the date its data was last updated.
Intelligence Index
The Race No One Agreed To Run
The Artificial Analysis Intelligence Index measures AI model capability across reasoning, knowledge, and instruction-following. The frontier has advanced from single-digit scores to over 60 in just three years.
Each line tracks a lab's most capable model over time. The steepening curves show AI capability growing faster each year — and the gap between leaders shrinking.
Data: Artificial Analysis Intelligence Index — Last updated:
The Intelligence Index is a composite benchmark measuring model performance across reasoning, knowledge, mathematics, and coding tasks. Higher scores indicate greater capability. Used with attribution for educational purposes.
Recap
- What you saw
Intelligence Index scores climbing from ~10 to over 60 since 2023.
- What it means
Capability is accelerating faster than any prior general-purpose technology on record.
- What to watch
The next Artificial Analysis quarterly release, and which lab sets the frontier.
Autonomous task length
Autonomy Horizon
The autonomy horizon is the length of a task, measured in skilled-human working time, that AI agents complete with 50% success. In 2019, AI could manage a 2-second task. By April 2026, the frontier had reached roughly 17 hours of skilled work, though METR notes that measurements above 16 hours are unreliable with the current task suite. The horizon is doubling every 129 days.
The chart shows the length of real software engineering tasks, in skilled-human time, that AI agents can complete autonomously. The y-axis is logarithmic: each gridline represents a tenfold increase in task duration.
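As a rough arithmetic check on the figures quoted here, the sketch below counts how many doublings separate a 2-second task from a 17-hour one and converts a 129-day doubling time into a per-year growth factor. It assumes a constant doubling time, which is a simplification, and the helper names `doublings` and `project_horizon` are illustrative rather than part of any METR tooling.

```python
import math


def doublings(start_seconds: float, end_seconds: float) -> float:
    """Number of doublings needed to grow from start_seconds to end_seconds."""
    return math.log2(end_seconds / start_seconds)


def project_horizon(current_hours: float, days_ahead: float,
                    doubling_days: float = 129.0) -> float:
    """Extrapolate a horizon forward, ASSUMING the doubling time stays constant."""
    return current_hours * 2 ** (days_ahead / doubling_days)


# Figures quoted in the text: 2 seconds (2019) and roughly 17 hours (April 2026).
n = doublings(2.0, 17 * 3600)
print(f"doublings from 2 s to 17 h: {n:.1f}")

# Growth over one year implied by a 129-day doubling time.
annual_factor = 2 ** (365 / 129)
print(f"implied growth per year: {annual_factor:.1f}x")

# Hypothetical extrapolation only: one doubling period past a 17-hour horizon.
print(f"after 129 more days: {project_horizon(17, 129):.0f} hours")
```

The gap works out to roughly 15 doublings, and a 129-day doubling time corresponds to about a sevenfold increase per year. On a logarithmic axis with tenfold gridlines, that is a little under one gridline per year.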
Data: METR — Horizon Benchmark v1.1. Cite: arXiv:2503.14499 (NeurIPS 2025). Used with attribution for educational purposes. · Data as of 7 Apr 2026
The p50 horizon length is the task duration (for a skilled human) at which AI agents succeed 50% of the time. Shaded bands show 95% confidence intervals. The frontier line connects state-of-the-art models at time of release. The dashed trend line shows the 129-day doubling rate from 2023 onward.
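One way to picture the p50 horizon is as the crossing point of a fitted success curve. The sketch below assumes success probability follows a logistic function of log2 task duration; the intercept and slope are made-up illustrative values, not METR's fitted parameters, and the function names are hypothetical.

```python
import math


def success_probability(task_minutes: float, intercept: float, slope: float) -> float:
    """Logistic success model in log2(task duration). Assumed form, for illustration."""
    z = intercept + slope * math.log2(task_minutes)
    return 1.0 / (1.0 + math.exp(-z))


def p50_horizon_minutes(intercept: float, slope: float) -> float:
    """Task duration at which the modelled success probability equals 50%.

    The logistic curve crosses 0.5 where intercept + slope * log2(t) = 0,
    so t = 2 ** (-intercept / slope).
    """
    return 2 ** (-intercept / slope)


# Hypothetical coefficients: success falls as tasks get longer (negative slope).
INTERCEPT, SLOPE = 4.0, -0.5

horizon = p50_horizon_minutes(INTERCEPT, SLOPE)
print(f"p50 horizon: {horizon:.0f} minutes")
print(f"success at the horizon: {success_probability(horizon, INTERCEPT, SLOPE):.2f}")
print(f"success on a 16-minute task: {success_probability(16, INTERCEPT, SLOPE):.2f}")
```

Shorter tasks sit above 50% success and longer tasks below it, which is why a single duration can summarize a model's whole success curve.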
Recap
- What you saw
METR's autonomy horizon extending from 2 seconds (2019) to roughly 17 hours (April 2026), with measurements above 16 hours flagged by METR as less reliable.
- What it means
Agents are beginning to operate on workday timescales, not chat timescales.
- What to watch
Whether the 129-day doubling continues through 2027.
AI-agent demand & capability
The Horizon Extends
AI agents are advancing on two axes at once. The labor market reveals demand: a skill cluster that did not meaningfully exist before 2024 — postings for "agentic AI," "AI agents," "LangGraph" — exploded across 2025 and now appears in roughly 48,000 US job postings, overtaking decade-old categories like neural networks, robotics, and visual recognition. And the agents themselves are gaining capability: on four real-world benchmarks where the best models scored 1–20% as recently as 2024, they now hit 66–93%, closing in on human performance across domains as distinct as web navigation, terminal commands, computer use, and cybersecurity.
Demand view: each line tracks the share of US job postings mentioning an AI-related skill cluster, 2010–2025. The cyan AI agent line emerged from zero in 2024 — note how quickly it overtook visual recognition, robotics, and neural networks. Capability view: each line tracks the best AI model on a real-world agent benchmark; dashed lines mark human performance. Terminal-Bench is monthly; the others are annual.
Data: Stanford AI Index 2026 (CC BY-ND 4.0). Demand: Economy chapter, Fig 4.4.3 (Lightcast 2010–2025) and Fig 4.4.7 (agent-skill breakdown, 2024 vs 2025). Capability: Technical Performance chapter, Fig 2.5.2 (Terminal-Bench 2.0), Fig 2.6.2 (OSWorld), Fig 2.6.3 (WebArena), Fig 2.6.5 (Cybench).
Demand intermediate-year values were read off Stanford's published chart to within ±0.05 percentage points; 2025 endpoints are exact. Capability values are all exact (printed on Stanford's charts).
Recap
- What you saw
A skill cluster emerging from zero in 2024 and overtaking decade-old categories — paired with agent benchmarks closing on human performance across four distinct domains.
- What it means
Demand and capability are advancing together. Companies are hiring for agents before they can buy them off the shelf — and the agents are improving fast enough to meet that demand.
- What to watch
Whether enterprise deployment catches up. Stanford reports agent deployment is still in the single-digit percentages across most business functions; the bottleneck is shifting from capability to integration.
Frontier benchmarks
Closing the Gap
Seven benchmarks designed to measure the frontier of AI capability. Each line tracks the best model score over time. FrontierMath went from <2% to ~25%, ARC-AGI from ~5% to ~88%, and SWE-bench from 2% to ~70%. Humanity's Last Exam, on which the best model scored just 3% at launch, has already been pushed to 27% within months. Problems designed to resist progress for years are falling in months.
Each line tracks AI performance on a different standardized test. When a line approaches the top of the chart, AI has matched or surpassed human expert performance on that task.
Data: Epoch AI Benchmark Hub (CC BY 4.0). Individual sources: SWE-bench, Epoch FrontierMath, ARC Prize, GPQA, SimpleQA, MATH, Humanity's Last Exam.
The Epoch Capabilities Index (ECI) is Epoch AI's aggregate score, combining performance across multiple benchmarks into a single comparable metric per model.
Recap
- What you saw
Seven frontier benchmarks, including Humanity's Last Exam, collapsing in months.
- What it means
The evaluation community is running out of tests the frontier cannot already pass.
- What to watch
The next benchmark release from Scale AI and the Center for AI Safety — and how long it holds.
Training compute
The Engine Room
Training compute for frontier AI models has grown by over 10,000× since 2020 — doubling every 9 months. Each point on this chart represents a decision by an organization to invest more computational resource than any previous model in history.
Each dot represents an AI model. The vertical axis shows the computing power used to train it — note the logarithmic scale, where each step represents a 10× increase. Training compute has grown roughly 4× per year since 2010.
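Growth rates here are quoted in three forms: a doubling time, a per-year multiple, and a cumulative multiple. The sketch below converts between them so the figures can be compared on a common footing. It assumes smooth exponential growth at a constant rate, and the function names are illustrative.

```python
import math


def annual_factor_from_doubling(months: float) -> float:
    """Per-year growth multiple implied by a doubling time given in months."""
    return 2 ** (12 / months)


def doubling_months_from_annual(factor: float) -> float:
    """Doubling time in months implied by a per-year growth multiple."""
    return 12 * math.log(2) / math.log(factor)


def years_to_reach(multiple: float, annual_factor: float) -> float:
    """Years of constant growth needed to reach a cumulative multiple."""
    return math.log(multiple) / math.log(annual_factor)


# Doubling every 9 months, expressed as a per-year multiple.
nine_month = annual_factor_from_doubling(9)
print(f"9-month doubling = {nine_month:.2f}x per year")

# 4x per year, expressed as a doubling time.
print(f"4x per year = doubling every {doubling_months_from_annual(4):.1f} months")

# Time needed for a cumulative 10,000x increase under each rate.
print(f"10,000x at 4x per year: {years_to_reach(10_000, 4):.1f} years")
print(f"10,000x at 9-month doubling: {years_to_reach(10_000, nine_month):.1f} years")
```

Under these assumptions, a 9-month doubling time is about 2.5× per year, while 4× per year corresponds to a 6-month doubling time. A cumulative 10,000× increase takes between six and seven years at 4× per year and about ten years at a 9-month doubling time.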
Data compiled from Epoch AI research, published papers, and industry reports. Training compute estimates carry uncertainty — see Epoch AI documentation for methodology. CC BY 4.0.
Hardware Scaling: transistor counts from Our World in Data (Moore's Law), GPU FLOPS from manufacturer specifications (NVIDIA, AMD). Moore's original observation: Gordon Moore, "Cramming More Components onto Integrated Circuits," Electronics, 1965.
Recap
- What you saw
Training compute growing over 10,000-fold since 2020, doubling every nine months.
- What it means
Frontier AI has become a material, industrial undertaking with a planetary footprint.
- What to watch
Whether any publicly announced training run crosses 10²⁷ FLOP in the next two years.