Canada's Business and Tech Newsroom

News

Tech firms are gaming the most popular ranking of AI models, researchers claim

Meta and Google are testing lots of versions of their products on Chatbot Arena before launch so they can pick the ones that score best, according to a new study

By Murad Hemmadi
Meta founder and CEO Mark Zuckerberg at LlamaCon 2025, an AI developer conference, in Menlo Park, Calif., in April 2025. Photo: AP Photo/Jeff Chiu
May 1, 2025
TORONTO — Top tech firms like Meta and Google are gaming a widely watched leaderboard of AI models, making their systems seem better than they are in the real world, according to a new study co-authored by researchers at Toronto-based Cohere.

Here’s what you need to know.

Talking Points

  • Tech firms like Meta and Google have an unfair edge in Chatbot Arena, a popular leaderboard of AI models, according to a new paper co-authored by researchers at Cohere and its non-profit research arm
  • Some developers are privately testing lots of versions of their models to find the best-scoring one, and getting much more data than rivals to improve the performance of their systems, the study claims

The test: Chatbot Arena prompts users to ask questions of two different, unidentified large language models (LLMs), and choose which response they prefer. It collects those results into scores on a leaderboard.
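To make the mechanics concrete, here is a minimal sketch of how pairwise votes can be turned into leaderboard scores. It assumes an Elo-style update, which is a common way to rate head-to-head contests; the article does not say which formula Chatbot Arena uses, and the model names, starting rating and `k` value below are hypothetical.

```python
from collections import defaultdict


def elo_leaderboard(votes, k=32.0, base=1000.0):
    """Turn (winner, loser) votes into Elo-style ratings, highest first.

    Assumption: an Elo-style update stands in for whatever scoring
    method the real leaderboard uses.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Probability the winner was expected to win, given current ratings.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # An unexpected win moves the ratings more than an expected one.
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes: each tuple is (preferred model, other model).
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
board = elo_leaderboard(votes)
for name, rating in board:
    print(f"{name}: {rating:.1f}")
```

In this sketch every vote shifts points from the losing model to the winning one, so a model that appears in more battles has more chances for its rating to move.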

The issue: Some developers privately test many variants of their models before launch, then make public only the one that scores best, according to the paper, published late Tuesday on the open-access site arXiv. For example, the researchers found Meta had tried out 27 different systems before it launched Llama 4, its latest LLM, last month; Google tested 10 variants of its flagship Gemini system or of Gemma 3, an open-source version.

Other developers don’t know they have the option to do this pre-launch testing, so they end up with lower scores, according to the paper, which has not yet been peer-reviewed. Eight of the paper’s 13 co-authors are affiliated with Cohere, the Toronto AI startup, or with Cohere Labs, its non-profit research arm. Researchers at five schools including the University of Waterloo and Stanford University also contributed.
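The selection effect described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's method: every variant is given the same hypothetical true skill (1200), each measured score carries random noise (standard deviation 20), and the developer publishes only the best-scoring variant. The figure of 27 variants comes from the article; every other number and all function names are invented for illustration.

```python
import random


def best_of_n_score(true_skill, n_variants, noise_sd, rng):
    """Return the highest of n noisy score estimates for equally skilled variants."""
    return max(rng.gauss(true_skill, noise_sd) for _ in range(n_variants))


def average_published_score(n_variants, trials=2000, true_skill=1200.0,
                            noise_sd=20.0, seed=0):
    """Average the published (best-of-n) score over many simulated launches."""
    rng = random.Random(seed)
    total = sum(best_of_n_score(true_skill, n_variants, noise_sd, rng)
                for _ in range(trials))
    return total / trials


# Hypothetical comparison: one submission versus the best of 27 private tests.
single = average_published_score(1)
many = average_published_score(27)
print(f"one variant: {single:.1f}, best of 27: {many:.1f}")
```

Because the variants are identical in skill by construction, any gap between the two averages comes purely from picking the luckiest measurement, which is the kind of inflation the researchers say a developer with a single submission cannot match.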

Their study also claims Chatbot Arena puts LLMs from some major AI firms into more head-to-head battles than others, giving those developers more data with which to boost the performance of their products. It estimates that Google and OpenAI have each received about a fifth of the data the contest has produced, even though more than a dozen firms have submitted models.

Combined, the two problems mean developers are teaching their models how to do well on the test rather than just trying to make them better, according to the researchers. “It makes it difficult to distinguish between models that have legitimately improved versus those that have exploited statistical shortcuts,” they wrote.

The result is that models with high Chatbot Arena scores don’t always do as well in the real world—and vice versa. In an X post, Canadian computer scientist Andrej Karpathy cited Anthropic’s Claude 3.5, which was “top tier in my personal use” but “ranked very low on the arena.” Karpathy, who was not involved in the study, is a former star AI researcher at OpenAI and Tesla, and recently launched education technology startup Eureka Labs. 

The test-maker: A group of University of California, Berkeley students launched Chatbot Arena in May 2023 as a crowdsourced way to test all the new AI models tech firms were releasing. Last month, the volunteer project evolved into a for-profit startup, LMArena.

Chatbot Arena has become a go-to benchmark for developers building generative tools, who can choose between a variety of LLMs. It’s also sometimes offered sneak peeks of AI’s next headline-grabbing moment. Chinese startup DeepSeek’s models started climbing the leaderboard a few days before their performance sparked a stock market freak-out in January.

Google’s Gemini 2.5 Pro model currently tops Chatbot Arena’s language rankings, followed by OpenAI’s o3 system. Cohere’s latest model, Command A, currently ranks 19th on that leaderboard. Nick Frosst, the Canadian firm’s co-founder, has previously said the test doesn’t show whether models are suited for Cohere’s target market of businesses.

The response: “Pre-release testing helps model providers identify which variant our community likes best. But this doesn’t mean the leaderboard is biased,” LMArena said in a post on X. The startup said people like being able to try out new systems before they launch, and it’s good that developers are tweaking their products based on that feedback.

The fixes: Chatbot Arena has “democratized access to many models and enabled a large and varied user base to weigh in on what matters in the real world for model selection,” the paper says. But it suggests the platform should cap how many versions of a model a developer can privately test pre-release, and stop companies from withdrawing lower-scoring ones. It also recommends a new system for deciding which models get served up when a user shows up to chat. 

The alternatives: Toronto’s Vector Institute recently launched a mega-evaluation that put 11 leading models through 16 tests. 

Correction: Following publication of the paper, Cohere Labs said it had corrected the Google pre-release testing figures. This story has been updated.


© 2026 The Logic Inc. All Rights Reserved.
