The average across all the different benchmarks can be thought of as a kind of 'average intelligence', though in reality it's more of a gradient and vibe kind of thing.
Many models are "benchmaxxed": trained to answer the exact kinds of questions the tests ask, which often makes benchmark results a poor predictor of real-world use. Treat them as general indicators, but don't take them too seriously.
All model families differ in ways that you only really understand by spending time with them. Don't forget to set the right chat template and the recommended sampling parameter ranges for each model. The Open LLM Leaderboard is a good place to start.
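As a rough illustration, here's a minimal sketch of applying a model's own chat template and its recommended sampling settings using Hugging Face transformers. The model name and the temperature/top_p values are placeholders, not recommendations; always check the model card for what that particular family expects.

```python
# Minimal sketch: load a model, apply ITS chat template, and sample with
# ITS recommended settings. The model name and sampling values below are
# placeholders -- check the model card for the real recommendations.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what a chat template does."},
]

# apply_chat_template wraps the messages in the exact special tokens this
# model family was trained on -- getting this wrong quietly degrades quality
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,   # placeholder -- use the value the model card suggests
    top_p=0.8,         # ditto
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Using the wrong template or wildly off sampling values is one of the most common reasons a model feels much dumber locally than its benchmark scores suggest.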