Researchers just audited 43 AI benchmarks spanning over 72,000 test tasks, and found something that should bother anyone buying AI tools right now.
All of them combined cover less than 8% of the actual labor market.
Software engineering gets most of the attention. Management (11 million workers, $120K median salary, 88% digitized workflows) gets 1.4% of benchmark tasks. Legal gets 0.3%.
The entire category of "interacting with others" (negotiation, stakeholder management, coordination, giving feedback) is basically absent.
I watched this play out in real time last month. We had a company summit, brought in a presenter who showed our agents how to generate AI songs. The room went nuts. Everyone was obsessed.
These are people who close multi-million dollar property deals for a living. AI-generated jingles will influence exactly zero of those transactions. But it was flashy, it demo'd well, and it felt like the future.
That's the benchmark problem in miniature. We're measuring AI where it's easy to measure (code generation, content creation, classification) and calling it progress. Meanwhile the hard stuff (reading a room, knowing when to push and when to back off, managing a relationship through a six-month deal cycle) isn't being tested, because nobody knows how to score it.
Every recipe says "prep time: 15 minutes." That's in a test kitchen with sharp knives and a chef who's done it 400 times. Your kitchen is different. Your knives are different. The benchmark doesn't know that.
Same deal with AI. A model that scores 94% on a coding benchmark might be useless for the thing you actually need it to do, because nobody ever tested it on that.
Before you trust a benchmark score, ask what kitchen it was tested in.
What's your favorite metric that completely hides the real story? A baseball stat, a vendor claim, a job posting requirement. I want to hear the good ones.