Also, ARC AGI reported they've been unable to independently replicate OpenAI's claimed breakthrough score from December. There's just too much money at stake now not to treat all AI model performance testing as an adversarial, no-holds-barred brawl. The default assumption should be that all entrants will cheat in any way they can. Commercial entrants with large teams of highly incentivized people will search and optimize for every possible advantage - if not cheat outright. As a result, smaller academic, student or community teams working part-time will tend to score lower than they would on a level playing field.
Short version: the thing I care most about in this paper is that well-funded vendors can apparently submit dozens of variations of their models to the leaderboard and then selectively publish only the one that did best.
This gives them a huge advantage. I want to know if they did that. A top-placed model with a footnote saying "they tried 22 variants, most of which scored lower than this one" helps me understand what's going on.
If the top model tried 22 times and scored lower on 21 of those tries, whereas the model in second place only tried once, I'd like to hear about it.
There's a crux that makes it easy to understand why we should expect this. If you code (I assume you do), you probably (hopefully) know that you can't test your way into proving your code is correct. Test Driven Development (TDD) is a flawed paradigm. You should use tests, but they are hints. That's why Cohere is quoting Goodhart at the top of the intro [0]. There is NO metric that is perfectly aligned with the intent you had when you implemented that metric in the first place. This is fucking alignment 101 here. Which is why it is really ironic how pervasive this attitude is in ML [1]. I'm not sure I believe any person or company that claims they can make safe AI while shoving benchmarks at you.
Pay close attention: evaluation is very hard, and it is getting harder. Remember reward hacking? It is still alive and well (it is Goodhart's Law in action). You have to think about what criteria actually meet your objective. This is true for any job! But think about RLHF and similar strategies: what else maximizes the reward function? If the reward is human preference, deception maximizes it just as well as (or better than) accuracy. This is a bad design pattern. You want to make errors as loud as possible, but this paradigm makes errors as quiet as possible, and quiet errors must not be confused with an absence of errors. It makes evaluation incredibly difficult.
Metrics are guides, not targets
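To make the deception-vs-accuracy point concrete, here's a toy hill-climbing sketch. Everything in it - the questions, the "judge", the weights - is invented for illustration and is not a model of any real RLHF pipeline; the point is only that an optimizer can max out a preference-style proxy while quietly locking in a wrong answer the judge happens to like.

```python
"""Toy illustration of reward hacking (Goodhart's law) in a preference-style setup.

A hill-climber optimizes the score handed out by a shallow "judge" that rewards
confident tone and agreement with its own (partly wrong) beliefs. All values here
are invented for illustration.
"""
import random

random.seed(0)

TRUTH = {"q1": "A", "q2": "B", "q3": "C"}
# The judge "believes" the right answer for q1/q2 but a popular misconception for q3.
JUDGE_BELIEF = {"q1": "A", "q2": "B", "q3": "D"}
OPTIONS = ["A", "B", "C", "D"]


def true_quality(policy):
    """What we actually care about: agreement with ground truth."""
    return sum(policy[q][0] == TRUTH[q] for q in TRUTH) / len(TRUTH)


def judge_reward(policy):
    """What gets optimized: the judge has no access to ground truth, so it
    rewards confident tone and answers matching its own beliefs."""
    score = 0.0
    for q in TRUTH:
        answer, confident = policy[q]
        score += 1.0 if answer == JUDGE_BELIEF[q] else 0.0
        score += 0.5 if confident else 0.0
    return score


def mutate(policy):
    """Random local edit: change one answer, or make it sound more confident."""
    new = dict(policy)
    q = random.choice(list(TRUTH))
    answer, confident = new[q]
    if random.random() < 0.5:
        new[q] = (random.choice(OPTIONS), confident)
    else:
        new[q] = (answer, True)
    return new


policy = {q: (random.choice(OPTIONS), False) for q in TRUTH}
for _ in range(300):
    candidate = mutate(policy)
    if judge_reward(candidate) >= judge_reward(policy):
        policy = candidate

print("judge reward:", judge_reward(policy))  # typically maxed out (4.5 here)
print("true quality:", true_quality(policy))  # stuck at 2/3: q3 ends up confidently wrong
# The proxy is fully satisfied while an error sits silently in the policy --
# the judge can't see it, so the optimizer never will either.
```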
[0] Users who recognize me may remember me for mentioning 'Goodhart's Hell', the adoption of Goodhart's Law as a feature instead of a bug. It is pervasive, and it is a problem.
[1] We used to say that when people say "AI" instead of "ML", you should put your guard up. But a heuristic that's been true for years is: "if people try to prove their claims by benchmarks alone, they're selling snake oil." There should always be analysis in addition to metrics.
Why spend evaluation resources on outsiders? Everyone wants to know who exactly is first, second, and so on; after #10 it's "do your own evaluation if this is important to you."
Thus, we have this inequality.
In the context of genetic programming and other non-traditional ML techniques, this effect has made it hard for me to find a simple fitness function that reliably proxies natural-language string similarity.
For example, say you use something like common prefix length to measure how close a candidate's output string is to an objective string given an input string. The underlying learner will inevitably start doing things like repeating the input verbatim, especially if the input/output training tuples often share long prefixes. So you might try something like reversing the input to force learning to take a less crappy path [0]. The learner may respond degenerately by inventing a string-reversing technique and repeating its prior behavior. So you iterate again and try something like base64-encoding the input. This might work for a while, but eventually you wind up with so many weird hacks that the learner can't make progress and the meaning of the quantities evaporates.
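Here's a tiny sketch of the first failure mode. The training pairs and both "candidates" are made up, and in a real GP run the candidates would be evolved programs rather than hand-written functions; the point is just how much of the reward the echo strategy collects for free.

```python
"""Tiny sketch of a common-prefix-length fitness getting gamed.

The training pairs and both 'candidates' are invented for illustration.
"""

# Pairs whose inputs and targets share long prefixes, e.g. a light
# typo-normalization task that leaves most of the string untouched.
PAIRS = [
    ("the cat sat on teh mat", "the cat sat on the mat"),
    ("the dog ate my homewrok", "the dog ate my homework"),
    ("the rain in spain stays", "the rain in spain stays mainly"),
]


def common_prefix_len(a: str, b: str) -> int:
    """Naive similarity proxy: length of the shared prefix."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def fitness(candidate) -> int:
    """Total prefix overlap between candidate output and target output."""
    return sum(common_prefix_len(candidate(inp), out) for inp, out in PAIRS)


def honest_candidate(inp: str) -> str:
    """A candidate that actually attempts the task."""
    return inp.replace("teh", "the").replace("homewrok", "homework")


def echo_candidate(inp: str) -> str:
    """The degenerate strategy: repeat the input verbatim."""
    return inp


print("honest:", fitness(honest_candidate))
print("echo  :", fitness(echo_candidate))
# The echo candidate collects most of the achievable score without doing the
# task at all, because the metric pays for shared prefixes, not for the
# intended transformation.
```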
Every metric I've ever looked at gets cheated in some way. The holy grail is probably normalized information distance (approximated by normalized compression distance), but then you have a whole new problem of finding an ideal universal compressor which definitely doesn't exist.
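For what it's worth, the usual NCD approximation is easy to sketch with an ordinary compressor standing in for that ideal one (zlib here; the example strings are arbitrary):

```python
"""Sketch of normalized compression distance (NCD) with zlib standing in
for the ideal (nonexistent) universal compressor."""
import zlib


def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))"""
    bx, by = x.encode(), y.encode()
    cx, cy, cxy = compressed_size(bx), compressed_size(by), compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Similar strings should score lower than unrelated ones, though zlib's
# per-stream overhead makes the absolute values noisy on short inputs.
print(ncd("the cat sat on the mat", "the cat sat on the mat"))
print(ncd("the cat sat on the mat", "quarterly revenue fell by nine percent"))
```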
[0]: https://arxiv.org/abs/1409.3215 (Figure 1)
Arena scores seem to reward:
- Lots of bullet points in every response.
- Emoji.
...even at the expense of accurate answers. And I'm beginning to wonder if the sycophantic behavior of recent models ("That's a brilliant and profound idea") is also being driven by Arena scores.
Perhaps LLM users actually do want lots of bullets, emoji and fawning praise. But this seems like a perverse dynamic, similar to the way that social media users often engage more with content that outrages them.
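As a toy illustration of how cheap that dynamic is, here's a quick simulation of head-to-head preference votes where a minority of voters pick the flashier answer regardless of correctness. All the numbers (accuracies, the style-bias fraction, vote count) are invented; the more accurate model still ends up well behind on the implied Elo.

```python
"""Toy simulation of style bias in head-to-head preference voting.

Model A is more often correct; model B is flashier. A fraction of simulated
voters pick the flashier answer regardless of correctness. All numbers are
invented for illustration.
"""
import math
import random

random.seed(0)

ACCURACY = {"A": 0.90, "B": 0.80}  # chance each model's answer is actually correct
STYLE_BIAS = 0.35                  # fraction of votes decided on presentation alone


def vote() -> str:
    """One simulated head-to-head comparison; returns the winner."""
    if random.random() < STYLE_BIAS:
        return "B"  # the flashier answer wins on looks
    a_ok = random.random() < ACCURACY["A"]
    b_ok = random.random() < ACCURACY["B"]
    if a_ok == b_ok:
        return random.choice(["A", "B"])  # tie on substance -> coin flip
    return "A" if a_ok else "B"


wins = {"A": 0, "B": 0}
for _ in range(20_000):
    wins[vote()] += 1

p_a = wins["A"] / sum(wins.values())
# Rating gap implied by the observed win rate under the usual Elo model.
elo_gap = 400 * math.log10(p_a / (1 - p_a))
print(f"A win rate: {p_a:.3f}, implied Elo gap: {elo_gap:+.0f}")
# With these made-up numbers, A is clearly more accurate yet lands roughly
# a hundred Elo points behind B once a third of votes go to style.
```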
Once you set an evaluation metric, it ceases to be a useful metric.
A social deduction game for both LLMs and humans. All the past games are available to anyone.
I'm open to feedback.