Canada's Business and Tech Newsroom

News

This new test shows the pros and cons of major AI models


OpenAI, Meta, Cohere and others all claim they’ve got the top-performing models. A new suite of exams from Toronto’s Vector Institute hopes to provide an objective assessment.

By Murad Hemmadi
OpenAI CEO Sam Altman speaks at the Asia-Pacific Economic Cooperation CEO Summit in San Francisco, in November 2023.
Tech firms releasing new AI products regularly tout their performance on popular benchmarks, but such tests can lack objectivity. Photo: AP Photo/Eric Risberg
Apr 21, 2025
TORONTO — As tech firms make ever-loftier claims about AI’s capabilities, the Vector Institute has been putting leading models through a barrage of tests to give a clearer picture of what they can really do, and which do it best.

Earlier this month, the Toronto-based non-profit released its first “state of evaluation” study, which assessed 11 products from providers like Alibaba, Cohere, Meta and OpenAI. Vector tested the models on 16 benchmarks that evaluated their capabilities at math, coding and reasoning, as well as their knowledge on topics like finance and history.

Talking Points

  • Every AI company claims its products top the sector’s most popular tests, but benchmarks can be easy to fudge. Toronto’s Vector Institute ran 11 leading models through 16 exams to provide an independent evaluation of their capabilities.
  • The study could help companies pick the right tool for their needs, and policymakers understand the technology’s capabilities, said Deval Pandya, Vector’s vice-president of AI engineering.

Businesses, researchers and policymakers are all looking for independent checks on the most advanced AI systems, said Deval Pandya, the institute’s vice-president of AI engineering. “We wanted to provide a very objective and more comprehensive evaluation,” he said. A team of five Vector engineers began working on the study last summer, and started testing models around December. 

Vector didn’t set out to crown a winner. But it found that open-source systems still lagged commercial models, particularly on more complicated tasks. OpenAI’s o1 and Anthropic’s Claude 3.5 Sonnet beat the rest of the field on the Massive Multi-discipline Multimodal Understanding (MMMU) benchmark. Developed by University of Waterloo researcher Wenhu Chen, the test checks whether the AI tools can figure out answers to questions that include images rather than just text.

Tech firms releasing new AI products regularly tout their performance on popular benchmarks, though developers argue over which evaluations are the most useful. But how testers set up the exams and the models can have a “huge impact” on the scores, said John Willes, an engineering manager at Vector. “You can take these benchmarks and give them to three different evaluators, and they’ll come back with different results.”


That variability can lead to some embarrassment. For example, Meta highlighted the scores its new Llama 4 Maverick system got on the popular LM Arena platform, but used a different version than the one it publicly released. Ahmad Al-Dahle, Meta’s head of generative AI, also denied that the firm had used the exams’ answers to train the model, an accusation raised after the system initially failed to match its reported benchmark performance. Both OpenAI and xAI have also been accused of promoting misleading, or at least overly optimistic, benchmark results.

Businesses buying or building products powered by AI models can’t keep up with all the new releases, or their developers’ contradictory claims. “Large companies have built this infrastructure to do evaluations internally,” said Pandya, “but most of it is not openly accessible to others.” Vector’s study aims to fill that gap for smaller firms.

Users typically won’t just choose the model with the highest scores, however. Newer models may be significantly more expensive, but perform only slightly better. Some models are better at generating marketing messages, while others produce less buggy code. 

AI systems used by businesses are also rarely based on just one model, said Pandya. For example, a firm might use an OpenAI model as the brains for a new tool, but run its sensitive customer data through an open-source model hosted on its own hardware. Vector’s mega-evaluation “will help guide those choices about which models to use for what kind of task,” Pandya said.

Policymakers could also use the evaluation as they look to regulate AI. Vector has already worked with the U.K. AI Security Institute to add evaluations to an open-source testing platform the agency launched last year.


Vector is now working with the new Canadian AI Safety Institute, and Pandya is a member of an advisory group set up to counsel federal officials on the technology’s risks. “Understanding capabilities is very important,” he said, though he acknowledged that governments around the world are still figuring out how to translate that understanding into rules. 

Vector is publishing the code it used to run the exams, the questions posed to each model, and the systems’ responses, so that other test makers can study and improve on the benchmarks it used, Willes said. The institute hopes other research organizations will pick up some of the work. “There’s no way we can continuously evaluate every single model which is coming out,” said Pandya. 

© 2026 The Logic Inc. All Rights Reserved.
