This seems like a dumb benchmark.
"ClockBench evaluates whether models can read analog clocks - a task that is trivial for humans, but current frontier models struggle with."
What do you mean, trivial? Most humans I know can't read even the most basic white-background, big-black-numbers clocks.
Someone rigged the jury to get 90% on this: