The Benchmark Is Broken: How AI Labs Are Winning Evals While Losing on Safety
Frontier models are gaming the scoreboards, third-party auditors can't keep up, and the safety metrics that matter most are the ones nobody wants to publish.

The leaderboard says everything is fine. The actual numbers say otherwise.
Across every major AI benchmark, 2026 has delivered the same headline: frontier models keep improving, labs keep shipping, capability curves keep bending upward. What the press releases don't say is that the evals themselves have become the product — carefully selected, quietly optimized for, and increasingly divorced from how these systems behave when they leave the lab. The result is a growing gap between what benchmarks report and what safety researchers actually find. And that gap is now wide enough to matter.
The Benchmark Treadmill
MMLU is effectively solved. Every frontier model scores above 88% — the differences between the top systems fall within measurement noise. Big-Bench Hard followed the same trajectory, got saturated, got replaced by BBEH, and BBEH will be saturated within a year. This is the benchmark treadmill: the moment a test becomes the thing that matters, labs start optimizing for the test. Not the underlying capability. The test.
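One way to see what "within measurement noise" can mean is a rough margin-of-error calculation. The sketch below treats every benchmark question as an independent pass/fail trial and uses a normal approximation, which is a simplification. The question count of 14,042 is an assumption (the commonly cited size of the MMLU test split), and `margin_of_error` is a hypothetical helper, not part of any benchmark toolkit.

```python
import math


def margin_of_error(accuracy: float, n_questions: int, z: float = 1.96) -> float:
    """Normal-approximation margin of error for a benchmark score, in percentage points.

    accuracy    : reported score in percent (e.g. 88.0)
    n_questions : number of scored questions (assumed independent)
    z           : z-value for the confidence level (1.96 is roughly 95%)
    """
    p = accuracy / 100
    standard_error = math.sqrt(p * (1 - p) / n_questions)
    return z * standard_error * 100


# Assumed inputs: an 88% score on a 14,042-question test split.
moe = margin_of_error(88.0, 14_042)
print(f"+/- {moe:.2f} points")
```

Under these assumptions a single score carries an uncertainty of roughly half a point either way, so sub-point differences between top models are hard to distinguish from sampling noise.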
"The moment GPQA Diamond became the benchmark that mattered, labs started optimizing for GPQA Diamond — not for the reasoning it was designed to measure."
The ARC Prize consortium designed ARC-AGI specifically to resist this. Tasks are constructed to be absent from any training corpus, forcing genuine generalization over memorization. ARC-AGI-3, released this year, has broken every agent tested against it: no model has scored above the human baseline on the hardest tier. That result is honest. It is also uncomfortable, because it sits next to press releases about "superhuman reasoning."
SWE-Bench tells a version of the same story from the other direction. OpenAI's internal audit found training-data overlap across all frontier models tested, and 59.4% of the hard tasks contain flawed tests. OpenAI quietly stopped reporting Verified scores. Scale AI's SEAL leaderboard built SWE-Bench Pro to fix the contamination problem, and the same Claude Opus 4.5 that scores 80.9% on Verified drops to 45.9% on Pro. The gap is not noise. It is the sound of a benchmark being gamed.
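The size of that gap can be made concrete with a short calculation. The sketch below is illustrative only: it uses the two scores quoted above, and `score_gap` is a hypothetical helper rather than part of any benchmark tooling.

```python
def score_gap(original: float, decontaminated: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop as a percent of the original score)."""
    absolute = original - decontaminated
    relative = absolute / original * 100
    return round(absolute, 1), round(relative, 1)


# The two scores quoted for the same model on the two benchmark variants.
points, percent = score_gap(80.9, 45.9)
print(f"{points} points lost, {percent}% of the original score")
# prints "35.0 points lost, 43.3% of the original score"
```

Put differently, more than four in ten of the points the model earns on the original benchmark do not survive the move to the contamination-resistant version.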
The Safety Reporting Void
Here is a finding Stanford's 2026 AI Index buried in its Responsible AI chapter: almost every frontier model developer publishes results on capability benchmarks, but the same is not true for safety benchmarks. Among the major labs, only one (Anthropic, with Claude Opus 4.5) reports results on more than two of the tracked responsible AI benchmarks.
That is a structural problem. Labs have every incentive to publish capability scores and almost none to publish safety scores, because safety scores are harder to contextualize, easier to misread, and occasionally embarrassing. The International AI Safety Report 2026 flagged this directly: transparency is declining at exactly the moment when capability is accelerating. Third-party auditors — the intended check on that decline — are not keeping pace. Frontier AI Auditing, a working paper circulating in policy circles this spring, documents the gap between what rigorous third-party assessment would require and what labs actually allow auditors to see.
Jailbreaks Are Getting Easier, Not Harder
The red-teaming data that does exist is alarming in a specific, granular way. Single-turn jailbreak attempts succeed roughly 20–28% of the time across current frontier models. Multi-turn attacks push that to 39–55%. Role-play framing, the oldest trick in the adversarial prompt playbook, succeeds 89.6% of the time in structured adversarial evaluations. Sustained multi-turn campaigns reach a reported 97% success rate within five conversation turns on at least some tested models.
These are not exotic lab results. This is what happens when determined users apply moderate effort.
RLHF reduces jailbreak success rates by up to 30% in controlled testing, per Mindgard's 2026 red-teaming survey. That is meaningful. It is also math: read as a relative cut, a 30% reduction on a 97% multi-turn success rate still leaves about 68%; read as 30 percentage points, it leaves 67%. Either way, you are somewhere you don't want to be. The StrongREJECT benchmark, developed to standardize jailbreak resistance measurement, is now the closest thing the field has to a shared adversarial eval standard. Claude 3.7 Sonnet led detection rates in independent trials, identifying 46.9% of adversarial challenges, the highest of any tested model, which is both a compliment and a reminder that the best model in the field is still missing more than half.
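The arithmetic behind that point can be checked directly. The survey figure as quoted does not say whether "up to 30%" is a relative cut or a drop in percentage points, so this sketch computes both readings. `residual_rate` is a hypothetical helper, and the baselines are the upper-end success rates quoted earlier in this section.

```python
def residual_rate(baseline: float, reduction: float, relative: bool = True) -> float:
    """Attack success rate (in percent) that remains after a mitigation.

    relative=True  : reduction is a percentage of the baseline (30% of 97%).
    relative=False : reduction is in percentage points (97% minus 30 points),
                     clamped at zero.
    """
    if relative:
        return round(baseline * (1 - reduction / 100), 1)
    return round(max(baseline - reduction, 0.0), 1)


# Upper-end attack success rates quoted in the article, in percent.
baselines = {
    "single-turn": 28.0,
    "multi-turn": 55.0,
    "sustained multi-turn": 97.0,
}

for label, rate in baselines.items():
    rel = residual_rate(rate, 30.0)
    pts = residual_rate(rate, 30.0, relative=False)
    print(f"{label}: {rel}% (relative cut), {pts}% (percentage points)")
```

On the relative reading, the best-case mitigation leaves the sustained multi-turn attack succeeding about two times in three.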
Only 24% of organizations deploying AI systems have implemented what researchers classify as strong safety safeguards.
What Third Parties Are Finding
The auditing infrastructure being built around these models is producing its own signal. The Frontier AI Auditing paper proposes a framework for rigorous third-party assessment and, in doing so, catalogs what is currently missing: independent access to training data, red-team outputs, pre-deployment evaluations, and internal incident logs. None of the major labs provides all four. Most provide none.
The practical consequence: when VentureBeat reported in April that frontier models are failing one in three production attempts and "getting harder to audit," it wasn't describing a capability shortfall. It was describing an opacity problem. Models are performing well enough on benchmarks to ship, failing often enough in deployment to matter, and structured in ways that make the failure modes opaque to everyone outside the lab — sometimes including the lab itself.
The integrated eval platforms — Confident AI, Arize, DeepEval — are trying to close this gap from the deployment side. Pre-production evals, production observability, and adversarial red-teaming running in the same workspace is the architecture the serious operators have converged on. It helps. It is not the same as meaningful third-party auditing of what ships.
The Accountability Gap
The real story of benchmarks and safety in 2026 is not that AI is dangerous. It is that the measurement infrastructure is lagging the capability infrastructure by enough to create a credibility problem, and that credibility problem is not hypothetical. It shows up when a model scores roughly 81% on SWE-Bench Verified and 46% on SWE-Bench Pro. It shows up when a multi-turn jailbreak reaches 97% success. It shows up when the only lab reporting results on more than two responsible AI benchmarks is the one that built its commercial identity around safety messaging.
The benchmark is not lying, exactly. It is just not telling you what you actually need to know.
ARC-AGI-3 is the most honest benchmark active right now, precisely because it was designed to resist the optimization pressure that corrupts every other eval. Its scores are modest. Its methodology is transparent. The labs that score badly on it do not lead with those numbers.
Until the industry treats safety benchmark reporting the way it treats capability benchmark reporting — as table stakes for shipping a frontier model — the leaderboard will continue to tell a cleaner story than the red-team logs. Policymakers are starting to notice. The question is whether they move faster than the next benchmark gets saturated.
The answer, based on the current trajectory, is probably not.
