Canada's Business and Tech Newsroom

This new test shows the pros and cons of major AI models

OpenAI, Meta, Cohere and others all claim they’ve got the top-performing models. A new suite of exams from Toronto’s Vector Institute hopes to provide an objective assessment.

By Murad Hemmadi
OpenAI CEO Sam Altman speaks at the Asia-Pacific Economic Cooperation CEO Summit in San Francisco, in November 2023.
Tech firms releasing new AI products regularly tout their performance on popular benchmarks, but such tests can lack objectivity. Photo: AP Photo/Eric Risberg
Apr 21, 2025

TORONTO — As tech firms make ever-loftier claims about AI’s capabilities, the Vector Institute has been putting leading models through a barrage of tests to give a clearer picture of what they can really do, and which do it best.

Earlier this month, the Toronto-based non-profit released its first “state of evaluation” study, which assessed 11 products from providers like Alibaba, Cohere, Meta and OpenAI. Vector tested the models on 16 benchmarks that evaluated their capabilities at math, coding and reasoning, as well as their knowledge of topics like finance and history.

Talking Points

  • Every AI company claims its products top the sector’s most popular tests, but benchmarks can be easy to fudge. Toronto’s Vector Institute ran 11 leading models through 16 exams to provide an independent evaluation of their capabilities.
  • The study could help companies pick the right tool for their needs, and policymakers understand the technology’s capabilities, said Deval Pandya, Vector’s vice-president of AI engineering.

Businesses, researchers and policymakers are all looking for independent checks on the most advanced AI systems, said Deval Pandya, the institute’s vice-president of AI engineering. “We wanted to provide a very objective and more comprehensive evaluation,” he said. A team of five Vector engineers began working on the study last summer, and started testing models around December. 

Vector didn’t set out to crown a winner. But it found that open-source systems still lagged commercial models, particularly on more complicated tasks. OpenAI’s o1 and Anthropic’s Claude 3.5 Sonnet beat the rest of the field on the Massive Multi-discipline Multimodal Understanding (MMMU) benchmark. Developed by University of Waterloo researcher Wenhu Chen, the test checks whether the AI tools can figure out answers to questions that include images rather than just text.

Tech firms releasing new AI products regularly tout their performance on popular benchmarks, though developers argue over which evaluations are the most useful. But how testers set up the exams and the models can have a “huge impact” on the scores, said John Willes, an engineering manager at Vector. “You can take these benchmarks and give them to three different evaluators, and they’ll come back with different results.”

That variability can lead to some embarrassment. For example, Meta highlighted the scores its new Llama 4 Maverick system got on the popular LM Arena platform, but used a different version than the one it publicly released. Ahmad Al-Dahle, Meta’s head of generative AI, also denied that the firm had used the exams’ answers to train the model, a claim raised to explain why it initially failed to match its performance on technology tests. Both OpenAI and xAI have also been accused of promoting misleading, or at least overly optimistic, benchmark results.

Businesses buying or building products powered by AI models can’t keep up with all the new releases, or their developers’ contradictory claims. “Large companies have built this infrastructure to do evaluations internally,” said Pandya, “but most of it is not openly accessible to others.” Vector’s study aims to fill that gap for smaller firms.

Users typically won’t just choose the model with the highest scores, however. Newer models may be significantly more expensive, but perform only slightly better. Some models are better at generating marketing messages, while others produce less buggy code. 

AI systems used by businesses are also rarely based on just one model, said Pandya. For example, a firm might use an OpenAI model as the brains for a new tool, but run its sensitive customer data through an open-source model hosted on its own hardware. Vector’s mega-evaluation “will help guide those choices about which models to use for what kind of task,” Pandya said.

Policymakers could also use the evaluation as they look to regulate AI. Vector has already worked with the U.K. AI Security Institute to add evaluations to an open-source testing platform the agency launched last year.

Vector is now working with the new Canadian AI Safety Institute, and Pandya is a member of an advisory group set up to counsel federal officials on the technology’s risks. “Understanding capabilities is very important,” he said, though he acknowledged that governments around the world are still figuring out how to translate that understanding into rules. 

Vector is publishing the code it used to run the exams, the questions posed to each model, and the systems’ responses, so that other test makers can study and improve on the benchmarks it used, Willes said. The institute hopes other research organizations will pick up some of the work. “There’s no way we can continuously evaluate every single model which is coming out,” said Pandya. 

#artificial intelligence #Tech #Vector Institute
