This new test shows the pros and cons of major AI models

Apr 21, 2025

TORONTO — As tech firms make ever-loftier claims about AI’s capabilities, the Vector Institute has been putting leading models through a barrage of tests to give a clearer picture of what they can really do, and which do it best.

Earlier this month, the Toronto-based non-profit released its first “state of evaluation” study, which assessed 11 products from providers like Alibaba, Cohere, Meta and OpenAI. Vector tested the models on 16 benchmarks that evaluated their capabilities at math, coding and reasoning, as well as their knowledge of topics like finance and history.

Talking Points

Every AI company claims its products top the sector’s most popular tests, but benchmarks can be easy to fudge. Toronto’s Vector Institute ran 11 leading models through 16 exams to provide an independent evaluation of their capabilities.
The study could help companies pick the right tool for their needs, and policymakers understand the technology’s capabilities, said Deval Pandya, Vector’s vice-president of AI engineering.

Businesses, researchers and policymakers are all looking for independent checks on the most advanced AI systems, said Deval Pandya, the institute’s vice-president of AI engineering. “We wanted to provide a very objective and more comprehensive evaluation,” he said. A team of five Vector engineers began working on the study last summer, and started testing models around December.

Vector didn’t set out to crown a winner. But it found that open-source systems still lagged commercial models, particularly on more complicated tasks. OpenAI’s o1 and Anthropic’s Claude 3.5 Sonnet beat the rest of the field on the Massive Multi-discipline Multimodal Understanding (MMMU) benchmark. Developed by University of Waterloo researcher Wenhu Chen, the test checks whether the AI tools can figure out answers to questions that include images rather than just text.

Tech firms releasing new AI products regularly tout their performance on popular benchmarks, though developers argue over which evaluations are the most useful. But how testers set up the exams and the models can have a “huge impact” on the scores, said John Willes, an engineering manager at Vector. “You can take these benchmarks and give them to three different evaluators, and they’ll come back with different results.”

That variability can lead to some embarrassment. For example, Meta highlighted the scores its new Llama 4 Maverick system got on the popular LM Arena platform, but used a different version than the one it publicly released. Ahmad Al-Dahle, Meta’s head of generative AI, also rejected the suggestion that the model initially fell short of its scores on technology tests because the firm had used the exams’ answers to train it. Both OpenAI and xAI have also been accused of promoting misleading, or at least overly optimistic, benchmark results.

Businesses buying or building products powered by AI models can’t keep up with all the new releases, or their developers’ contradictory claims. “Large companies have built this infrastructure to do evaluations internally,” said Pandya, “but most of it is not openly accessible to others.” Vector’s study aims to fill that gap for smaller firms.

Users typically won’t just choose the model with the highest scores, however. Newer models may be significantly more expensive, but perform only slightly better. Some models are better at generating marketing messages, while others produce less buggy code.

AI systems used by businesses are also rarely based on just one model, said Pandya. For example, a firm might use an OpenAI model as the brains for a new tool, but run its sensitive customer data through an open-source model hosted on its own hardware. Vector’s mega-evaluation “will help guide those choices about which models to use for what kind of task,” Pandya said.

Policymakers could also use the evaluation as they look to regulate AI. Vector has already worked with the U.K. AI Security Institute to add evaluations to an open-source testing platform the agency launched last year.

Vector is now working with the new Canadian AI Safety Institute, and Pandya is a member of an advisory group set up to counsel federal officials on the technology’s risks. “Understanding capabilities is very important,” he said, though he acknowledged that governments around the world are still figuring out how to translate that understanding into rules.

Vector is publishing the code it used to run the exams, the questions posed to each model, and the systems’ responses, so that other test makers can study and improve on the benchmarks it used, Willes said. The institute hopes other research organizations will pick up some of the work. “There’s no way we can continuously evaluate every single model which is coming out,” said Pandya.

