Also, ARC AGI reported they've been unable to independently replicate OpenAI's claimed breakthrough score from December. There's just too much money at stake now not to treat all AI model performance testing as an adversarial, no-holds-barred brawl. The default assumption should be that all entrants will cheat in any way they can. Commercial entrants with large teams of highly incentivized people will search and optimize for every possible advantage - if not cheat outright. As a result, smaller academic, student or community teams working part-time will tend to score lower than they would on a level playing field.
Short version: the thing I care most about in this paper is that well-funded vendors can apparently submit dozens of variations of their models to the leaderboard and then selectively publish only the one that did best.
This gives them a huge advantage. I want to know if they did that. A top-placed model with a footnote saying "they tried 22 variants, most of which scored lower than this one" helps me understand what's going on.
If the top model tried 22 times and scored lower on 21 of those tries, whereas the model in second place only tried once, I'd like to hear about it.
There's a crux that makes it easy to understand why we should expect this. If you code (I assume you do), you probably (hopefully) know that you can't test your way into proving your code is correct. Test Driven Development (TDD) is a flawed paradigm. You should use tests, but they are hints. That's why Cohere is quoting Goodhart at the top of the intro [0]. There is NO metric that is perfectly aligned with the intent you had when you implemented that metric in the first place. This is fucking alignment 101 here. Which is why it is really ironic how pervasive this attitude is in ML [1]. I'm not sure I believe any person or company that claims they can make safe AI while shoving benchmarks at you.
Pay close attention: evaluation is very hard, and it is getting harder. Remember reward hacking? It is still alive and well (it is Goodhart's Law in action). You have to think about what criteria actually meet your objective. This is true for any job! But think about RLHF and similar strategies: what else maximizes the reward function? If the reward is human preference, deception maximizes it just as well as (or better than) accuracy. This is a bad design pattern. You want to make errors as loud as possible, but this paradigm makes errors as quiet as possible, and quiet errors must not be confused with an absence of errors. It makes evaluation incredibly difficult.
Metrics are guides, not targets
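To make the deception-vs-accuracy point concrete, here's a toy hill-climbing sketch. Everything in it - the questions, the "judge", the weights - is invented for illustration and is not a model of any real RLHF pipeline; the point is only that an optimizer can max out a preference-style proxy while quietly locking in a wrong answer the judge happens to like.

```python
"""Toy illustration of reward hacking (Goodhart's law) in a preference-style setup.

A hill-climber optimizes the score handed out by a shallow "judge" that rewards
confident tone and agreement with its own (partly wrong) beliefs. All values here
are invented for illustration.
"""
import random

random.seed(0)

TRUTH = {"q1": "A", "q2": "B", "q3": "C"}
# The judge "believes" the right answer for q1/q2 but a popular misconception for q3.
JUDGE_BELIEF = {"q1": "A", "q2": "B", "q3": "D"}
OPTIONS = ["A", "B", "C", "D"]


def true_quality(policy):
    """What we actually care about: agreement with ground truth."""
    return sum(policy[q][0] == TRUTH[q] for q in TRUTH) / len(TRUTH)


def judge_reward(policy):
    """What gets optimized: the judge has no access to ground truth, so it
    rewards confident tone and answers matching its own beliefs."""
    score = 0.0
    for q in TRUTH:
        answer, confident = policy[q]
        score += 1.0 if answer == JUDGE_BELIEF[q] else 0.0
        score += 0.5 if confident else 0.0
    return score


def mutate(policy):
    """Random local edit: change one answer, or make it sound more confident."""
    new = dict(policy)
    q = random.choice(list(TRUTH))
    answer, confident = new[q]
    if random.random() < 0.5:
        new[q] = (random.choice(OPTIONS), confident)
    else:
        new[q] = (answer, True)
    return new


policy = {q: (random.choice(OPTIONS), False) for q in TRUTH}
for _ in range(300):
    candidate = mutate(policy)
    if judge_reward(candidate) >= judge_reward(policy):
        policy = candidate

print("judge reward:", judge_reward(policy))  # typically maxed out (4.5 here)
print("true quality:", true_quality(policy))  # stuck at 2/3: q3 ends up confidently wrong
# The proxy is fully satisfied while an error sits silently in the policy --
# the judge can't see it, so the optimizer never will either.
```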
[0] Users who recognize me may remember me for mentioning 'Goodhart's Hell', the adoption of Goodhart's Law as a feature instead of a bug. It is pervasive, and it is a problem.
[1] We used to say that when people say "AI" instead of "ML", you should put your guard up. But a heuristic that's been true for years is: "if people try to prove their claims by benchmarks alone, they're selling snake oil." There should always be analysis in addition to metrics.
Why spend evaluation resources on outsiders? Everyone wants to know who exactly is first, second, and so on; after #10 it's "do your own evaluation if this is important to you."
Thus, we have this inequality.
In the context of genetic programming and other non-traditional ML techniques, this effect has made it hard for me to find a simple fitness function that reliably proxies natural-language string similarity.
For example, say you use something like common prefix length to measure how close a candidate's output string is to an objective string given an input string. The underlying learner will inevitably start doing things like repeating the input verbatim, especially if the input/output training tuples often share long prefixes. So you might try something like reversing the input to force learning to take a less crappy path [0]. The learner may respond degenerately by inventing a string-reversing technique and repeating its prior behavior. So you iterate again and try something like base64-encoding the input. This might work for a while, but eventually you wind up with so many weird hacks that the learner can't make progress and the meaning of the quantities evaporates.
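Here's a tiny sketch of the first failure mode. The training pairs and both "candidates" are made up, and in a real GP run the candidates would be evolved programs rather than hand-written functions; the point is just how much of the reward the echo strategy collects for free.

```python
"""Tiny sketch of a common-prefix-length fitness getting gamed.

The training pairs and both 'candidates' are invented for illustration.
"""

# Pairs whose inputs and targets share long prefixes, e.g. a light
# typo-normalization task that leaves most of the string untouched.
PAIRS = [
    ("the cat sat on teh mat", "the cat sat on the mat"),
    ("the dog ate my homewrok", "the dog ate my homework"),
    ("the rain in spain stays", "the rain in spain stays mainly"),
]


def common_prefix_len(a: str, b: str) -> int:
    """Naive similarity proxy: length of the shared prefix."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def fitness(candidate) -> int:
    """Total prefix overlap between candidate output and target output."""
    return sum(common_prefix_len(candidate(inp), out) for inp, out in PAIRS)


def honest_candidate(inp: str) -> str:
    """A candidate that actually attempts the task."""
    return inp.replace("teh", "the").replace("homewrok", "homework")


def echo_candidate(inp: str) -> str:
    """The degenerate strategy: repeat the input verbatim."""
    return inp


print("honest:", fitness(honest_candidate))
print("echo  :", fitness(echo_candidate))
# The echo candidate collects most of the achievable score without doing the
# task at all, because the metric pays for shared prefixes, not for the
# intended transformation.
```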
Every metric I've ever looked at gets cheated in some way. The holy grail is probably normalized information distance (approximated by normalized compression distance), but then you have a whole new problem of finding an ideal universal compressor which definitely doesn't exist.
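For what it's worth, the usual NCD approximation is easy to sketch with an ordinary compressor standing in for that ideal one (zlib here; the example strings are arbitrary):

```python
"""Sketch of normalized compression distance (NCD) with zlib standing in
for the ideal (nonexistent) universal compressor."""
import zlib


def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))"""
    bx, by = x.encode(), y.encode()
    cx, cy, cxy = compressed_size(bx), compressed_size(by), compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Similar strings should score lower than unrelated ones, though zlib's
# per-stream overhead makes the absolute values noisy on short inputs.
print(ncd("the cat sat on the mat", "the cat sat on the mat"))
print(ncd("the cat sat on the mat", "quarterly revenue fell by nine percent"))
```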
[0]: https://arxiv.org/abs/1409.3215 (Figure 1)
Arena scores seem to reward:
- Lots of bullet points in every response.
- Emoji.
...even at the expense of accurate answers. And I'm beginning to wonder if the sycophantic behavior of recent models ("That's a brilliant and profound idea") is also being driven by Arena scores.
Perhaps LLM users actually do want lots of bullets, emoji and fawning praise. But this seems like a perverse dynamic, similar to the way that social media users often engage more with content that outrages them.
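As a toy illustration of how cheap that dynamic is, here's a quick simulation of head-to-head preference votes where a minority of voters pick the flashier answer regardless of correctness. All the numbers (accuracies, the style-bias fraction, vote count) are invented; the more accurate model still ends up well behind on the implied Elo.

```python
"""Toy simulation of style bias in head-to-head preference voting.

Model A is more often correct; model B is flashier. A fraction of simulated
voters pick the flashier answer regardless of correctness. All numbers are
invented for illustration.
"""
import math
import random

random.seed(0)

ACCURACY = {"A": 0.90, "B": 0.80}  # chance each model's answer is actually correct
STYLE_BIAS = 0.35                  # fraction of votes decided on presentation alone


def vote() -> str:
    """One simulated head-to-head comparison; returns the winner."""
    if random.random() < STYLE_BIAS:
        return "B"  # the flashier answer wins on looks
    a_ok = random.random() < ACCURACY["A"]
    b_ok = random.random() < ACCURACY["B"]
    if a_ok == b_ok:
        return random.choice(["A", "B"])  # tie on substance -> coin flip
    return "A" if a_ok else "B"


wins = {"A": 0, "B": 0}
for _ in range(20_000):
    wins[vote()] += 1

p_a = wins["A"] / sum(wins.values())
# Rating gap implied by the observed win rate under the usual Elo model.
elo_gap = 400 * math.log10(p_a / (1 - p_a))
print(f"A win rate: {p_a:.3f}, implied Elo gap: {elo_gap:+.0f}")
# With these made-up numbers, A is clearly more accurate yet lands roughly
# a hundred Elo points behind B once a third of votes go to style.
```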
Once you set an evaluation metric, it ceases to be a useful metric.
A social deduction game for both LLMs and humans. All the past games are available to anyone.
I'm open to feedback.