Stop chasing AI benchmarks—create your own

Every few months, a new large language model (LLM) is anointed AI champion, with record-breaking benchmark scores. But these celebrated metrics of LLM performance—such as testing graduate-level reasoning and abstract math—rarely reflect real business needs or represent truly novel AI frontiers. For companies in the market for enterprise AI models, basing the decision of which models to use on these leaderboards alone can lead to costly mistakes—from wasted budgets to misaligned capabilities and potentially harmful, domain-specific errors that benchmark scores rarely capture.

Public benchmarks can be helpful to individual users by providing directional indicators of AI capabilities. And admittedly, some code-completion and software-engineering benchmarks, like SWE-Bench or Codeforces, are valuable for companies within a narrow range of coding-related, LLM-based business applications. But the most common benchmarks and public leaderboards often distract both businesses and model developers, pushing innovation toward marginal improvements that are unhelpful for businesses and disconnected from genuine AI breakthroughs.

The challenge for executives, therefore, lies in designing business-specific evaluation frameworks that test potential models in the environments where they’ll actually be deployed. To do that, companies will need to adopt tailored evaluation strategies to run at scale using relevant and realistic data.

The mismatch between benchmarks and business needs

The flashy benchmarks that model developers tout in their releases are often detached from the realities of enterprise applications. Consider some of the most popular ones: graduate-level reasoning (GPQA Diamond) and high school-level math tests, like MATH-500 and AIME 2024. Each of these was cited in the releases for GPT o1, Claude Sonnet 3.7, or DeepSeek’s R1. But none of these indicators is helpful in assessing common enterprise applications like knowledge management tools, design assistants, or customer-facing chatbots.

Instead of assuming that the “best” model on a given leaderboard is the obvious choice, businesses should use metrics tailored to their specific needs to work backward and identify the right model. Start by testing models on your actual context and data—real customer queries, domain-specific documents, or whatever inputs your system will encounter in production. When real data is scarce or sensitive, companies can craft synthetic test cases that capture the same challenges. 
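
To make this concrete, a business-specific evaluation can start as little more than a list of real or synthetic queries paired with the facts an acceptable answer must contain. The following sketch is purely illustrative: the test cases, expected phrases, and call_model function are hypothetical placeholders for whichever model client and domain data a company actually uses.

```python
# A minimal, illustrative business-specific evaluation harness.
# The test cases and `call_model` are placeholders, not a real product's data.
from dataclasses import dataclass


@dataclass
class TestCase:
    prompt: str                   # a real customer query or a synthetic stand-in
    expected_phrases: list[str]   # facts an acceptable answer must contain


TEST_CASES = [
    TestCase(
        prompt="What is the cancellation window for an annual subscription?",
        expected_phrases=["30 days", "full refund"],
    ),
    TestCase(
        prompt="Summarize the warranty terms for the X200 model.",
        expected_phrases=["two-year", "manufacturing defects"],
    ),
]


def call_model(prompt: str) -> str:
    """Placeholder for the model under evaluation (API call, local model, etc.)."""
    raise NotImplementedError


def run_suite() -> float:
    """Return the fraction of test cases whose answer contains every expected phrase."""
    passed = 0
    for case in TEST_CASES:
        answer = call_model(case.prompt).lower()
        if all(phrase.lower() in answer for phrase in case.expected_phrases):
            passed += 1
    return passed / len(TEST_CASES)
```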

Without real-world tests, companies can end up with ill-fitting models that may, for instance, require too much memory for edge devices, have latency that’s too high for real-time interactions, or have insufficient support for the on-premises deployment sometimes mandated by data governance standards.

Salesforce has tried to bridge this gap between common benchmarks and its actual business requirements by developing an internal benchmark for its CRM-related needs. The company created evaluation criteria specifically for tasks like prospecting, nurturing leads, and generating service case summaries—the actual work that marketing and sales teams need AI to perform.

Reaching beyond stylized metrics

Popular benchmarks are not only insufficient for informed business decision-making but can also be misleading. Media coverage of LLMs, including all three of the major recent release announcements, often uses benchmarks to compare models based on their average performance, with each benchmark distilled into a single dot, number, or bar.

The trouble is that generative AI models are stochastic, highly input-sensitive systems, which means that slight variations of a prompt can make them behave unpredictably. A recent research paper from Anthropic rightly argues that, as a result, single dots on a performance comparison chart are insufficient because of the large error ranges of the evaluation metrics. A recent study by Microsoft found that using a statistically more accurate cluster-based evaluation on the same benchmarks can significantly change the rank ordering of—and public narratives about—models on a leaderboard.

That’s why business leaders need to ensure reliable measurements of model performance across a reasonable range of variations, done at scale, even if it requires hundreds of test runs. This thoroughness becomes even more critical when multiple systems are combined through AI and data supply chains, potentially increasing variability. For industries like aviation or healthcare, the acceptable margin of error is far smaller than what current AI benchmarks typically guarantee, so relying solely on leaderboard metrics can obscure substantial operational risk in real-world deployments.
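
One way to put this into practice is to run the same evaluation suite many times, with paraphrased prompts and different random seeds, and report a confidence interval rather than a single score. The sketch below assumes a hypothetical run_suite_once function that performs one evaluation pass and returns a pass rate.

```python
# Illustrative sketch: report a bootstrap confidence interval over many eval runs
# instead of a single point score. `run_suite_once` is a hypothetical function
# that runs one evaluation pass (e.g., with a paraphrased prompt set) and
# returns a pass rate between 0 and 1.
import random
import statistics


def bootstrap_ci(scores: list[float], n_resamples: int = 10_000, alpha: float = 0.05):
    """Mean score plus a percentile-bootstrap confidence interval."""
    means = []
    for _ in range(n_resamples):
        resample = random.choices(scores, k=len(scores))
        means.append(statistics.mean(resample))
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples)]
    return statistics.mean(scores), (low, high)


# Hypothetical usage with scores gathered from 200 evaluation runs:
# scores = [run_suite_once(seed=i) for i in range(200)]
# mean, (low, high) = bootstrap_ci(scores)
# print(f"pass rate {mean:.2%}, 95% CI [{low:.2%}, {high:.2%}]")
```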

Businesses must also test models in adversarial scenarios to ensure the security and robustness of a model—such as a chatbot’s resistance to manipulation by bad actors attempting to bypass guardrails—that cannot be measured by conventional benchmarks. LLMs are notably vulnerable to being fooled by sophisticated prompting techniques. Depending on the use case, implementing strong safeguards against these vulnerabilities could determine your technology choice and deployment strategy. The resilience of a model in the face of a potential bad actor could be a more important metric than the model’s math or reasoning capabilities. In our view, making AI “foolproof” is an exciting and impactful next barrier to break for AI researchers, one that may require novel model development and testing techniques.

Putting evaluation into practice: Four keys to a scalable approach

Start with existing evaluation frameworks. Companies should start by leveraging the strengths of existing automated tools (along with human judgment and practical but repeatable measurement goals). Specialized AI evaluation toolkits, such as DeepEval, LangSmith, TruLens, Mastra, or ARTKIT, can expedite and simplify testing, allowing for consistent comparison across models and over time.

Bring human experts to the testing ground. Effective AI evaluation requires that automated testing be supplemented with human judgment wherever possible. Automated evaluation could include a comparison of LLM answers to ground-truth answers, or the use of proxy metrics, such as automated ROUGE or BLEU scores, to gauge the quality of text summarization.
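
For summarization tasks, such a proxy metric can be computed in a few lines. The sketch below is a minimal example that assumes the open-source rouge-score Python package is installed; the reference and candidate texts are made up.

```python
# Minimal sketch of an automated proxy metric: ROUGE-L overlap between a model
# summary and a ground-truth reference. Assumes the open-source `rouge-score`
# package (pip install rouge-score); the example texts are invented.
from rouge_score import rouge_scorer


def rouge_l_f1(reference: str, candidate: str) -> float:
    """F1 of the longest-common-subsequence overlap (ROUGE-L)."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, candidate)["rougeL"].fmeasure


reference = "The customer requested a refund because the device arrived damaged."
candidate = "Customer asked for a refund after receiving a damaged device."
print(f"ROUGE-L F1: {rouge_l_f1(reference, candidate):.2f}")
```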

For nuanced assessments where machines still struggle, however, human evaluation remains vital. This could include domain experts or end users conducting a “blind” review of a sample of model outputs. Such reviews can also flag potential biases in responses, such as LLMs giving responses about job candidates that are skewed by gender or race. This human layer of review is labor-intensive, but it can provide critical additional insight, such as whether a response is actually useful and well presented.

The value of this hybrid approach can be seen in a recent case study in which a company evaluated an HR-support chatbot using both human and automated tests. The company’s iterative internal evaluation process, with human involvement, showed that a significant share of LLM response errors stemmed from flawed updates to enterprise data. The discovery highlights how human evaluation can uncover systemic issues beyond the model itself.

Focus on tradeoffs, not isolated dimensions of assessment. When evaluating models, companies must look beyond accuracy to consider the full spectrum of business requirements: speed, cost efficiency, operational feasibility, flexibility, maintainability, and regulatory compliance. A model that performs marginally better on accuracy metrics might be prohibitively expensive or too slow for real-time applications. A great example of this is how OpenAI’s GPT o1 (a leader in many benchmarks at release time) performed when applied to the ARC-AGI prize. To the surprise of many, the o1 model performed poorly, largely due to ARC-AGI’s “efficiency limit” on the computing power used to solve the benchmark tasks. The o1 model would often take too long, using more compute time to try to come up with a more accurate answer. Most popular benchmarks don’t have a time limit, even though time would be a critically important factor for many business use cases.
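
One practical way to make these tradeoffs explicit is a simple weighted scorecard that puts accuracy, latency, and cost on a common scale. In the sketch below, the model names, numbers, and weights are illustrative assumptions, not real measurements; the point is only that a balanced model can outrank the single-metric leader.

```python
# Illustrative tradeoff scorecard: normalize accuracy, latency, and cost to a
# 0-1 scale and combine them with business-specific weights. All figures are
# made-up assumptions for the sake of the example.
candidates = {
    # accuracy (0-1), p95 latency in seconds, cost in dollars per 1,000 requests
    "model_a": {"accuracy": 0.91, "latency_s": 6.0, "cost_usd": 12.0},
    "model_b": {"accuracy": 0.87, "latency_s": 1.2, "cost_usd": 2.5},
    "model_c": {"accuracy": 0.89, "latency_s": 2.0, "cost_usd": 4.0},
}
weights = {"accuracy": 0.5, "latency_s": 0.3, "cost_usd": 0.2}


def normalize(value: float, column: list[float], lower_is_better: bool) -> float:
    """Min-max normalize a value within its column; flip if lower is better."""
    low, high = min(column), max(column)
    if high == low:
        return 1.0
    score = (value - low) / (high - low)
    return 1.0 - score if lower_is_better else score


def scorecard() -> dict[str, float]:
    results = {}
    for name, metrics in candidates.items():
        total = 0.0
        for metric, weight in weights.items():
            column = [m[metric] for m in candidates.values()]
            total += weight * normalize(metrics[metric], column, metric != "accuracy")
        results[name] = round(total, 3)
    return results


print(scorecard())  # the balanced model_c outranks the most accurate model_a here
```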

Tradeoffs become even more important in the growing world of (multi-)agentic applications, where simpler tasks can be handled by cheaper, quicker models (overseen by an orchestration agent), while the most complex steps (such as solving a customer’s request that has been broken into a series of subproblems) could require a more powerful reasoning model to succeed.

Microsoft Research’s HuggingGPT, for example, orchestrates specialized models for different tasks under a central language model. Being prepared to change models for different tasks requires building flexible tooling that isn’t hard-coded to a single model or provider. This built-in flexibility allows companies to easily pivot and change models based on evaluation results. While this may sound like a lot of extra development work, there are a number of available tools, like LangChain, LlamaIndex, and Pydantic AI, that can simplify the process.
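
In practice, that flexibility can be as simple as writing business logic against a thin, provider-agnostic interface instead of a specific vendor SDK. The sketch below uses hypothetical client classes as stand-ins for real provider calls; any of the tools mentioned above can play the same role.

```python
# Illustrative provider-agnostic interface: business logic depends only on a
# small protocol, so models can be swapped based on evaluation results. The
# client classes are hypothetical stand-ins for real provider SDK calls.
from typing import Protocol


class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...


class ProviderAClient:
    def complete(self, prompt: str) -> str:
        raise NotImplementedError  # call provider A's SDK here


class ProviderBClient:
    def complete(self, prompt: str) -> str:
        raise NotImplementedError  # call provider B's (cheaper, faster) SDK here


def summarize_ticket(model: ChatModel, ticket_text: str) -> str:
    """Business logic is written against the interface, not a vendor."""
    return model.complete(f"Summarize this support ticket in two sentences:\n{ticket_text}")


# Routing simple tasks to a cheaper model then becomes a configuration choice:
# model = ProviderBClient() if task_is_simple else ProviderAClient()
```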

Turn model testing into a culture of continuous evaluation and monitoring. As technology evolves, ongoing assessment ensures AI solutions remain optimal while maintaining alignment with business objectives. Much like how software engineering teams implement continuous integration and regression testing to catch bugs and prevent performance degradation in traditional code, AI systems require regular evaluation against business-specific benchmarks. Similar to the practice of pharmacovigilance among users of new medicines, feedback from LLM users and affected stakeholders also needs to be continuously gathered and analyzed to ensure AI “behaves as expected” and doesn’t drift from its intended performance targets.
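
In engineering terms, this can look like a regression gate that runs in continuous integration and fails the build when the business-specific evaluation score drops below an agreed baseline. The sketch below is a hypothetical example: run_suite stands in for the evaluation suite described earlier, and the 0.90 threshold is an assumed target.

```python
# Illustrative regression gate for continuous evaluation, written as a test so it
# can run in CI alongside ordinary software tests. `run_suite` and the baseline
# are placeholders for a company's own evaluation suite and agreed target.
BASELINE_PASS_RATE = 0.90  # assumed business-specific target


def run_suite() -> float:
    """Placeholder: run the business-specific evaluation suite, return a pass rate."""
    raise NotImplementedError


def test_model_meets_baseline():
    pass_rate = run_suite()
    assert pass_rate >= BASELINE_PASS_RATE, (
        f"Eval pass rate {pass_rate:.2%} fell below the {BASELINE_PASS_RATE:.0%} baseline"
    )
```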

This kind of bespoke evaluation framework fosters a culture of experimentation and data-driven decision-making. It also enforces the new and critical mantra: AI may be used for execution, but humans are in control and must govern AI.

Conclusion

For business leaders, the path to AI success lies not in chasing the latest benchmark champions but in developing evaluation frameworks tailored to your specific business objectives. Think of this approach as “a leaderboard for every user,” as one Stanford paper suggests. The true value of AI deployment comes from three key actions: defining metrics that directly measure success in your business context; implementing statistically robust testing in realistic situations, using your actual data in your actual context; and fostering a culture of continuous monitoring, evaluation, and experimentation that draws on both automated tools and human expertise to assess tradeoffs across models.

By following this approach, executives will be able to identify solutions optimized for their specific needs without paying premium prices for “top-notch models.” Doing so can help steer the model-development industry away from chasing marginal improvements on the same metrics—falling victim to Goodhart’s law with capabilities of limited use for business—and instead free developers up to explore new avenues of innovation and the next AI breakthrough.

Read other Fortune columns by François Candelon

Francois Candelon is a partner at private equity firm Seven2 and the former global director of the BCG Henderson Institute.

Theodoros Evgeniou is a professor at INSEAD and a cofounder of the trust and safety company Tremau.

Max Struever is a principal engineer at BCG-X and an ambassador at the BCG Henderson Institute.

David Zuluaga Martínez is a partner at Boston Consulting Group and an ambassador at the BCG Henderson Institute.

Some of the companies mentioned in this column are past or present clients of the authors’ employers.


This story was originally featured on Fortune.com




‘I haven’t seen sunlight in 3 months’: American law firm trainees in London endure 13-hour days for eye-watering six-figure starting salaries


A survey of trainees and junior lawyers at American law firms’ offices in London shows that they spend as much as 13 hours a day at work—roughly twice the length of the average working day in the U.K.

That comes with a lifestyle of Deliveroo dinners and picking up calls at “ungodly hours or on days off,” an anonymous employee told Legal Cheek, a legal news site that surveyed 2,000 workers across London’s various law firms, in November.

“I haven’t seen sunlight in three months,” said another anonymous employee. 

Yet another participant said that although vacation time was respected, they were always expected to answer work calls. 

Yes, all the tropes that shows like Suits make you believe about how long and hard law firms work their new staff might just be true.

While it has the trappings of a toxic work culture people would try to avoid, working long hours at law firms comes with handsome pay. Starting salaries in the top firms are over £170,000, or nearly five times the U.K.’s median income in 2023. 

The likes of Kirkland & Ellis and Paul Hastings, American law firms with practices in London, pay £172,000 and demand an average of 12 to 13 hours a day, The Times reported. In contrast, British firms make employees work slightly shorter hours on average while capping starting pay at £150,000.

To be sure, not every firm in the industry has brutally long hours in exchange for a six-figure paycheck. Several of the firms listed by Legal Cheek in its survey limit their workday to 9 hours or so for freshly qualified solicitors. 

Still, that’s a far cry from the average workweek in the U.K., which spans 36.6 hours or 7.3 hours a day.

Billable hours are the metric law firms often use to measure the performance of their lawyers. In some cases, those hours tick up to 2,000 a year. U.S. firms demand a higher number of billable hours on average than British ones.

However, the model has been controversial amid cost pressures and demands for a more transparent system. Lawyers also argue that there could be more efficient ways to do the same work without a billable hours structure that determines pay. With AI’s emergence into public consciousness, the legal profession is already beginning to change.

That hasn’t hit hiring momentum, at least at the top level. London’s top law firms hired partners at record speed in 2024, driven by American law firms’ appetite to compete for talent in the British capital. 

Part of the appeal for fresh talent at U.S.-based firms is the high pay they can swing relative to British ones. The most esteemed law firms are rethinking their partner pay structure in response to the growing competition.   

“The impact of the covetous New Yorker on the highest levels of the London legal services market over such a short period has been profound,” a report by recruiting firm Edward Gibson said in July.    

A version of this story was originally published on Fortune.com on Nov. 5, 2024.

This story was originally featured on Fortune.com




Crypto exchange OKX relaunches in U.S. two months after settling with DOJ for $500 million


Seychelles-based OKX announced on Tuesday that it is relaunching the U.S. version of its crypto exchange and unveiled a new wallet for American users to store as well as trade cryptocurrencies. The company also named Roshan Robert, a longtime employee of Barclays, as its U.S. CEO and revealed it would locate its U.S. regional headquarters in San Jose, California.

“It is not just the rebrand. The entire technology interface, everything has changed,” said Robert, who was recently an executive at the crypto prime broker Hidden Road, which was acquired by Ripple for $1.25 billion in April.

OKX’s renewed focus on the U.S. follows a settlement the exchange’s international entity reached with the Department of Justice in February. Prosecutors alleged that OKX failed to implement adequate anti-money laundering processes and solicited U.S. customers even though its international entity wasn’t registered in the States. As part of the agreement, OKX paid a $500 million fine, pled guilty to one count of operating an unlicensed money transmitting business, and agreed to pay for an external compliance consultant through February 2027.

“For over seven years, OKX knowingly violated anti-money laundering laws and avoided implementing required policies to prevent criminals from abusing our financial system,” Matthew Podolsky, Acting U.S. Attorney for the Southern District of New York, said in a statement announcing the settlement.

“There were no allegations of customer harm, no charges against any company employee and no government appointed monitor as part of the settlement,” OKX said in a blog post.

The exchange’s U.S. relaunch also comes amid a more favorable regulatory environment for crypto under President Donald Trump. Robert, the U.S. CEO, said OKX’s plans to increase its U.S. presence predate Trump’s second term. He started talking with the crypto exchange in the summer of 2024 and was officially brought on in September. “We were preparing our compliance infrastructure, our risk management infrastructure for the last year and a half or so,” he added.

That said, Robert welcomes the Trump administration’s less aggressive approach to crypto. “The rulemaking will take some time, but there is a path that we can see,” he said.

As Robert steers the new, relaunched OKX U.S., he’s facing stiff competition from incumbents Coinbase and Kraken. However, he believes that the market in the U.S. isn’t zero sum and thinks that younger generations’ appetite for risky crypto bets will grow the pie. “The whole digital asset market is an expanding universe,” he said.

Hong Fang, OKX’s global president, previously oversaw OKX’s U.S. entity, which was formerly named OKcoin. 

This story was originally featured on Fortune.com


