TORONTO — Top tech firms like Meta and Google are gaming a widely-watched leaderboard of AI models, making their systems seem better than they are in the real world, according to a new study co-authored by researchers at Toronto-based Cohere.
Here’s what you need to know.
Talking Points
The test: Chatbot Arena prompts users to ask questions of two different, unidentified large language models (LLMs), and choose which response they prefer. It collects those results into scores on a leaderboard.
The issue: Some developers are privately testing lots of different variants of their models before launch, then picking the one that does best to make public, according to the paper, published late Tuesday on open-access site arXiv. For example, the researchers found Meta had tried out 27 different systems before it launched Llama 4, its latest LLM, last month; Google checked 10 versions of its flagship Gemini system or Gemma 3, an open-source version.
Other developers don’t know they have the option to do this pre-launch testing, so they end up with lower scores, according to the paper, which has not yet been peer-reviewed. Eight of the paper’s 13 co-authors are affiliated with Cohere, the Toronto AI startup, or with Cohere Labs, its non-profit research arm. Researchers at five schools including the University of Waterloo and Stanford University also contributed.
Their study also claims Chatbot Arena is putting LLMs from some major AI firms in more of the head-to-head battles than others, giving those developers more information to boost the performance of their products. It estimates that Google and OpenAI have each gotten about a fifth of the data the contest has produced, despite over a dozen firms having submitted.
Combined, the two problems mean developers are teaching their models how to do well on the test rather than just trying to make them better, according to the researchers. “It makes it difficult to distinguish between models that have legitimately improved versus those that have exploited statistical shortcuts,” they wrote.
The result is that models with high Chatbot Arena scores don’t always do as well in the real world—and vice versa. In an X post, Canadian computer scientist Andrej Karpathy cited Anthropic’s Claude 3.5, which was “top tier in my personal use” but “ranked very low on the arena.” Karpathy, who was not involved in the study, is a former star AI researcher at OpenAI and Tesla, and recently launched education technology startup Eureka Labs.
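The selection effect behind private pre-launch testing can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's method: every variant is assumed to have the same true quality, each measured score is that quality plus random noise, and the developer publishes only the highest-scoring variant. The rating of 1200, the noise level of 15 points and the function names are all hypothetical, chosen only for illustration.

```python
import random
import statistics


def published_score(true_skill, n_variants, noise_sd, rng):
    """Return the best measured score among n_variants noisy measurements.

    Assumption: all variants share the same true_skill, so any difference
    between them is measurement noise rather than real quality.
    """
    measured = [rng.gauss(true_skill, noise_sd) for _ in range(n_variants)]
    return max(measured)


def mean_published_score(n_variants, trials=2000, true_skill=1200.0,
                         noise_sd=15.0, seed=0):
    """Average published score over many simulated launches."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return statistics.fmean(
        published_score(true_skill, n_variants, noise_sd, rng)
        for _ in range(trials)
    )


# Hypothetical comparison: one submission versus 27 private variants.
single = mean_published_score(1)
many = mean_published_score(27)
print(f"1 variant:   {single:.1f}")
print(f"27 variants: {many:.1f}")
```

In this toy setup the developer that tests 27 variants publishes a noticeably higher average score than the one that submits once, even though no variant is actually better. The size of the gap depends entirely on the assumed noise level.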
The test-maker: A group of University of California, Berkeley students launched Chatbot Arena in May 2023, as a crowdsourced way to test all the new AI models tech firms were releasing. Last month, the volunteer project evolved into a for-profit startup.
Chatbot Arena has become a go-to benchmark for developers building generative tools, who can choose between a variety of LLMs. It’s also sometimes offered sneak peeks of AI’s next headline-grabbing moment. Chinese startup DeepSeek’s models started climbing the leaderboard a few days before their performance sparked a stock market freak-out in January.
Google’s Gemini 2.5 Pro model currently tops Chatbot Arena’s language rankings, followed by OpenAI’s o3 system. Cohere’s latest model, Command A, currently ranks at 19 on that leaderboard. Nick Frosst, the Canadian firm’s co-founder, has previously said the test doesn’t show whether models are suited for Cohere’s target market of businesses.
The response: “Pre-release testing helps model providers identify which variant our community likes best. But this doesn’t mean the leaderboard is biased,” LMArena, the company behind Chatbot Arena, said in a post on X. The startup said people like being able to try out new systems before they launch, and it’s good that developers are tweaking their products based on that feedback.
The fixes: Chatbot Arena has “democratized access to many models and enabled a large and varied user base to weigh in on what matters in the real world for model selection,” the paper says. But it suggests the platform should cap how many versions of a model a developer can privately test pre-release, and stop companies from withdrawing lower-scoring ones. It also recommends a new system for deciding which models get served up when a user shows up to chat.
The alternatives: Toronto’s Vector Institute recently launched a mega-evaluation that put 11 leading models through 16 tests.
Correction: Following publication of the paper, Cohere Labs said it had corrected the Google pre-release testing figures. This story has been updated.