TORONTO — As tech firms make ever-loftier claims about AI’s capabilities, the Vector Institute has been putting leading models through a barrage of tests to give a clearer picture of what they can really do, and which do it best.
Earlier this month, the Toronto-based non-profit released its first “state of evaluation” study, which assessed 11 products from providers like Alibaba, Cohere, Meta and OpenAI. Vector tested the models on 16 benchmarks that evaluated their capabilities in math, coding and reasoning, as well as their knowledge of topics like finance and history.
Businesses, researchers and policymakers are all looking for independent checks on the most advanced AI systems, said Deval Pandya, the institute’s vice-president of AI engineering. “We wanted to provide a very objective and more comprehensive evaluation,” he said. A team of five Vector engineers began working on the study last summer, and started testing models around December.
Vector didn’t set out to crown a winner. But it found that open-source systems still lagged commercial models, particularly on more complicated tasks. OpenAI’s o1 and Anthropic’s Claude 3.5 Sonnet beat the rest of the field on the Massive Multi-discipline Multimodal Understanding (MMMU) benchmark. Developed by University of Waterloo researcher Wenhu Chen, the test checks whether the AI tools can figure out answers to questions that include images rather than just text.
Tech firms releasing new AI products regularly tout their performance on popular benchmarks, though developers argue over which evaluations are the most useful. But how testers set up the exams and the models can have a “huge impact” on the scores, said John Willes, an engineering manager at Vector. “You can take these benchmarks and give them to three different evaluators, and they’ll come back with different results.”
That variability can lead to some embarrassment. For example, Meta highlighted the scores its new Llama 4 Maverick system got on the popular LM Arena platform, but used a different version than the one it publicly released. Ahmad Al-Dahle, Meta’s head of generative AI, also denied that the firm had used benchmark exams’ answers to train the model, a charge raised after the model initially failed to match its performance on technology tests. Both OpenAI and xAI have also been accused of promoting misleading, or at least overly optimistic, benchmark results.
Businesses buying or building products powered by AI models can’t keep up with all the new releases, or their developers’ contradictory claims. “Large companies have built this infrastructure to do evaluations internally,” said Pandya, “but most of it is not openly accessible to others.” Vector’s study aims to fill that gap for smaller firms.
Users typically won’t just choose the model with the highest scores, however. Newer models may be significantly more expensive, but perform only slightly better. Some models are better at generating marketing messages, while others produce less buggy code.
AI systems used by businesses are also rarely based on just one model, said Pandya. For example, a firm might use an OpenAI model as the brains for a new tool, but run its sensitive customer data through an open-source model hosted on its own hardware. Vector’s mega-evaluation “will help guide those choices about which models to use for what kind of task,” Pandya said.
Policymakers could also use the evaluation as they look to regulate AI. Vector has already worked with the U.K. AI Security Institute to add evaluations to an open-source testing platform the agency launched last year.
Vector is now working with the new Canadian AI Safety Institute, and Pandya is a member of an advisory group set up to counsel federal officials on the technology’s risks. “Understanding capabilities is very important,” he said, though he acknowledged that governments around the world are still figuring out how to translate that understanding into rules.
Vector is publishing the code it used to run the exams, the questions posed to each model, and the systems’ responses, so that other test makers can study and improve on the benchmarks it used, Willes said. The institute hopes other research organizations will pick up some of the work. “There’s no way we can continuously evaluate every single model which is coming out,” said Pandya.