Tech firms are gaming the most popular ranking of AI models, researchers claim

May 1, 2025

TORONTO — Top tech firms like Meta and Google are gaming a widely watched leaderboard of AI models, making their systems seem better than they are in the real world, according to a new study co-authored by researchers at Toronto-based Cohere.

Here’s what you need to know.

Talking Points

Tech firms like Meta and Google have an unfair edge in Chatbot Arena, a popular leaderboard of AI models, according to a new paper co-authored by researchers at Cohere and its non-profit research arm
Some developers are privately testing lots of versions of their models to find the best-scoring one, and getting much more data than rivals to improve the performance of their systems, the study claims

The test: Chatbot Arena prompts users to ask questions of two different, unidentified large language models (LLMs), and choose which response they prefer. It collects those results into scores on a leaderboard.
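As a rough illustration of how head-to-head votes can be turned into leaderboard scores, here is a minimal sketch. It assumes a simple Elo-style update; the article does not say which rating method Chatbot Arena uses, and the model names, starting rating and `k` value are hypothetical.

```python
from collections import defaultdict

def elo_ratings(votes, k=32.0, start=1000.0):
    """Fold a sequence of (winner, loser) votes into Elo-style ratings.

    Assumption: a plain Elo update, used here for illustration only.
    """
    ratings = defaultdict(lambda: start)
    for winner, loser in votes:
        # Probability the winner was expected to win, given current ratings.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # An upset (low expected_win) moves the ratings more than an expected result.
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)

# Hypothetical votes: each tuple is (preferred model, other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]
ratings = elo_ratings(votes)
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard)
```

Because every vote adds to one rating exactly what it subtracts from the other, a model that appears in more battles simply gets more chances to move its score.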

The issue: Some developers are privately testing many variants of their models before launch, then making public only the one that scores best, according to the paper, published late Tuesday on the open-access site arXiv. For example, the researchers found Meta had tried out 27 different systems before it launched Llama 4, its latest LLM, last month; Google tested 10 versions of either its flagship Gemini system or Gemma 3, its open-source counterpart.
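To see why publishing only the best of many private variants can inflate a score, consider a minimal simulation. It assumes every variant has the same true skill and that each leaderboard measurement adds independent random noise; the rating of 1200 and the noise level of 15 points are hypothetical, and only the count of 27 variants comes from the paper's Meta example.

```python
import random
import statistics

def expected_best_of_n(n_variants, true_skill=1200.0, noise_sd=15.0,
                       trials=5000, seed=0):
    """Average published score when only the top-scoring variant is released.

    Assumption: all variants are equally good; scores differ only by noise.
    """
    rng = random.Random(seed)
    best_scores = []
    for _ in range(trials):
        # Each variant gets one noisy measurement of the same underlying skill.
        scores = [rng.gauss(true_skill, noise_sd) for _ in range(n_variants)]
        # The developer keeps the highest-scoring variant and withdraws the rest.
        best_scores.append(max(scores))
    return statistics.mean(best_scores)

single = expected_best_of_n(1)
best_of_27 = expected_best_of_n(27)
print(f"one submission:   {single:.1f}")
print(f"best of 27:       {best_of_27:.1f}")
print(f"apparent uplift:  {best_of_27 - single:.1f}")
```

Under these assumptions, the best-of-27 score comes out well above the single-submission score even though no variant is actually better than any other.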

Other developers don’t know they have the option to do this pre-launch testing, so they end up with lower scores, according to the paper, which has not yet been peer-reviewed. Eight of the paper’s 13 co-authors are affiliated with Cohere, the Toronto AI startup, or with Cohere Labs, its non-profit research arm. Researchers at five schools including the University of Waterloo and Stanford University also contributed.

Their study also claims Chatbot Arena is putting LLMs from some major AI firms in more of the head-to-head battles than others, giving those developers more information to boost the performance of their products. It estimates that Google and OpenAI have each received about a fifth of the data the contest has produced, even though more than a dozen firms have submitted models.

Combined, the two problems mean developers are teaching their models how to do well on the test rather than just trying to make them better, according to the researchers. “It makes it difficult to distinguish between models that have legitimately improved versus those that have exploited statistical shortcuts,” they wrote.

The result is that models with high Chatbot Arena scores don’t always do as well in the real world—and vice versa. In an X post, Canadian computer scientist Andrej Karpathy cited Anthropic’s Claude 3.5, which was “top tier in my personal use” but “ranked very low on the arena.” Karpathy, who was not involved in the study, is a former star AI researcher at OpenAI and Tesla, and recently launched education technology startup Eureka Labs.

The test-maker: A group of University of California, Berkeley students launched Chatbot Arena in May 2023 as a crowdsourced way to test all the new AI models tech firms were releasing. Last month, the volunteer project evolved into a for-profit startup.

Chatbot Arena has become a go-to benchmark for developers building generative tools, who can choose between a variety of LLMs. It’s also sometimes offered sneak peeks of AI’s next headline-grabbing moment. Chinese startup DeepSeek’s models started climbing the leaderboard a few days before their performance sparked a stock market freak-out in January.

Google’s Gemini 2.5 Pro model currently tops Chatbot Arena’s language rankings, followed by OpenAI’s o3 system. Cohere’s latest model, Command A, currently ranks at 19 on that leaderboard. Nick Frosst, the Canadian firm’s co-founder, has previously said the test doesn’t show whether models are suited for Cohere’s target market of businesses.

The response: “Pre-release testing helps model providers identify which variant our community likes best. But this doesn’t mean the leaderboard is biased,” LMArena, the startup behind Chatbot Arena, said in a post on X. The startup said people like being able to try out new systems before they launch, and it’s good that developers are tweaking their products based on that feedback.


The fixes: Chatbot Arena has “democratized access to many models and enabled a large and varied user base to weigh in on what matters in the real world for model selection,” the paper says. But it suggests the platform should cap how many versions of a model a developer can privately test pre-release, and stop companies from withdrawing lower-scoring ones. It also recommends a new system for deciding which models get served up when a user shows up to chat.

The alternatives: Toronto’s Vector Institute recently launched a mega-evaluation that put 11 leading models through 16 tests.

Correction: Following publication of the paper, Cohere Labs said it had corrected the Google pre-release testing figures. This story has been updated.

#artificial intelligence #Chatbot Arena #Cohere #Cohere Labs #Google #Meta #Tech
