Connect with us

Business

Stop chasing AI benchmarks—create your own

Published

on



Every few months, a new large language model (LLM) is anointed AI champion, with record-breaking benchmark scores. But these celebrated metrics of LLM performance—such as testing graduate-level reasoning and abstract math—rarely reflect real business needs or represent truly novel AI frontiers. For companies in the market for enterprise AI models, basing the decision of which models to use on these leaderboards alone can lead to costly mistakes—from wasted budgets to misaligned capabilities and potentially harmful, domain-specific errors that benchmark scores rarely capture.

Public benchmarks can be helpful to individual users by providing directional indicators of AI capabilities. And admittedly, some code-completion and software-engineering benchmarks, like SWE-Bench or Codeforces, are valuable for companies within a narrow range of coding-related, LLM-based business applications. But the most common benchmarks and public leaderboards often distract both businesses and model developers, pushing innovation toward marginal improvements in areas unhelpful for businesses or unrelated to areas of breakthrough AI innovation. 

The challenge for executives, therefore, lies in designing business-specific evaluation frameworks that test potential models in the environments where they’ll actually be deployed. To do that, companies will need to adopt tailored evaluation strategies to run at scale using relevant and realistic data.

The mismatch between benchmarks and business needs

The flashy benchmarks that model developers tout in their releases are often detached from the realities of enterprise applications. Consider some of the most popular ones: graduate-level reasoning (GPQA Diamond) and high school-level math tests, like MATH-500 and AIME2024. Each of these was cited in the releases for GPT o1Sonnet 3.7, or DeepSeek’s R1. But none of these indicators is helpful in assessing common enterprise applications like knowledge management tools, design assistants, or customer-facing chatbots.

Instead of assuming that the “best” model on a given leaderboard is the obvious choice, businesses should use metrics tailored to their specific needs to work backward and identify the right model. Start by testing models on your actual context and data—real customer queries, domain-specific documents, or whatever inputs your system will encounter in production. When real data is scarce or sensitive, companies can craft synthetic test cases that capture the same challenges. 

Without real-world tests, companies can end up ill-fitting models that may, for instance, require too much memory for edge devices, have latency that’s too high for real-time interactions, or have insufficient support for the on-premises deployment sometimes mandated by data governance standards.

Salesforce has tried to bridge this gap between common benchmarks and their actual business requirements by developing its own internal benchmark for its CRM-related needs. The company created its own evaluation criteria specifically for tasks like prospecting, nurturing leads, and generating service case summaries—the actual work that marketing and sales teams need AI to perform.

Reaching beyond stylized metrics

Popular benchmarks are not only insufficient for informed business decision-making but can also be misleading. Often LLM media coverage, including all three major recent release announcements, uses benchmarks to compare models based on their average performance. Specific benchmarks are distilled into a single dot, number, or bar

The trouble is that generative AI models are stochastic, highly input-sensitive systems, which means that slight variations of a prompt can make them behave unpredictably.  A recent research paper from Anthropic rightly argues that, as a result, single dots on a performance comparison chart are insufficient because of the large error ranges of the evaluation metrics. A recent study by Microsoft found that using a statistically more accurate clustered-based evaluation in the same benchmarks can significantly change the rank ordering of—and public narratives about—models on a leaderboards.

That’s why business leaders need to ensure reliable measurements of model performance across a reasonable range of variations, done at scale, even if it requires hundreds of test runs. This thoroughness becomes even more critical when multiple systems are combined through AI and data supply chains, potentially increasing variability. For industries like aviation or healthcare, the margin of error is small and far beyond what current AI benchmarks typically guarantee, such that solely relying on leaderboard metrics can obscure substantial operational risk in real-world deployments. 

Businesses must also test models in adversarial scenarios to ensure the security and robustness of a model—such as a chatbot’s resistance to manipulation by bad actors attempting to bypass guardrails—that cannot be measured by conventional benchmarks. LLMs are notably vulnerable to being fooled by sophisticated prompting techniques. Depending on the use case, implementing strong safeguards against these vulnerabilities could determine your technology choice and deployment strategy. The resilience of a model in the face of a potential bad actor could be a more important metric than the model’s math or reasoning capabilities. In our view, making AI “foolproof” is an exciting and impactful next barrier to break for AI researchers, one that may require novel model development and testing techniques.

Putting evaluation into practice: Four keys to a scalable approach

Start with existing evaluation frameworks. Companies should start by leveraging the strengths of existing automated tools (along with human judgment and practical but repeatable measurement goals). Specialized AI evaluation toolkits, such as DeepEvalLangSmithTruLensMastra, or ARTKIT, can expedite and simplify testing, allowing for consistent comparison across models and over time. 

Bring human experts to the testing ground.  Effective AI evaluation requires that automated testing be supplemented with human judgment wherever possible. Automated evaluation could include a comparison of LLM answers to ground truth answers, or the use of proxy metrics, such as automated ROUGE or BLEU scores, to gauge the quality of text summarization. 

For nuanced assessments, however, ones where machines still struggle, human evaluation remains vital. This could include domain experts or end-users conducting a “blind” review of a sample of model outputs. Such actions can also flag potential biases in responses, such as LLMs giving responses about job candidates that are biased by gender or race. This human layer of review is labor-intensive, but can provide additional critical insight, like whether a response is actually useful and well-presented.

The value of this hybrid approach can be seen in a recent case study where a company evaluated an HR-support chatbot using both human and automated tests. The company’s iterative internal evaluation process with human involvement showed a significant source of LLM response errors was due to flawed updates to enterprise data. The discovery highlights how human evaluation can uncover systemic issues beyond the model itself.

Focus on tradeoffs, not isolated dimensions of assessmentWhen evaluating models, companies must look beyond accuracy to consider the full spectrum of business requirements: speed, cost efficiency, operational feasibility, flexibility, maintainability, and regulatory compliance. A model that performs marginally better on accuracy metrics might be prohibitively expensive or too slow for real-time applications. A great example of this is how Open AI’s GPT o1(a leader in many benchmarks at release time) performed when applied to the ARC-AGI prize. To the surprise of many, the o1 model performed poorly, largely due to ARC-AGI’s “efficiency limit” on the computing power used to solve the benchmark tasks. The o1 model would often take too long, using more compute time to try to come up with a more accurate answer. Most popular benchmarks don’t have a time limit even though time would be a critically important factor for many business use cases. 

Tradeoffs become even more important in the growing world of (multi)-agentic applications, where simpler tasks can be handled by cheaper, quicker models (overseen by an orchestration agent), while the most complex steps (such as solving the broken-out series of problems from a customer) could need a more powerful version with reasoning to be successful. 

Microsoft Research’s HuggingGPT, for example, orchestrates specialized models for different tasks under a central language model. Being prepared to change models for different tasks requires building flexible tooling that isn’t hard-coded to a single model or provider. This built-in flexibility allows companies to easily pivot and change models based on evaluation results. While this may sound like a lot of extra development work, there are a number of available tools, like LangChainLlamaIndex, and Pydantic AI, that can simplify the process.

Turn model testing into a culture of continuous evaluation and monitoring. As technology evolves, ongoing assessment ensures AI solutions remain optimal while maintaining alignment with business objectives. Much like how software engineering teams implement continuous integration and regression testing to catch bugs and prevent performance degradation in traditional code, AI systems require regular evaluation against business-specific benchmarks. Similar to the practice of pharmacovigilance among users of new medicines, feedback from LLM users and affected stakeholders also needs to be continuously gathered and analyzed to ensure AI “behaves as expected” and doesn’t drift from its intended performance targets

This kind of bespoke evaluation framework fosters a culture of experimentation and data-driven decision-making. It also enforces the new and critical mantra: AI may be used for execution, but humans are in control and must govern AI.

Conclusion

For business leaders, the path to AI success lies not in chasing the latest benchmark champions but in developing evaluation frameworks for your specific business objectives. Think of this approach as “a leaderboard for every user,” as one Stanford paper suggests. The true value of AI deployment comes from three key actions: defining metrics that directly measure success in your business context; implementing statistically robust testing in realistic situations using your actual data and in your actual context; and fostering a culture of continuous monitoring, evaluation and experimentation that draws on both automated tools and human expertise to assess tradeoffs across models.

By following this approach, executives will be able to identify solutions optimized for their specific needs without paying premium prices for “top-notch models.” Doing this can hopefully help steer the model development industry away from chasing marginal improvements on the same metrics—falling victim to Goodhart’s law with capabilities of limited use for business—and instead free them up to explore new avenues of innovation and the next AI breakthrough. 

Read other Fortune columns by François Candelon

Francois Candelon is a partner at private equity firm Seven2 and the former global director of the BCG Henderson Institute.

Theodoros Evgeniou is a professor at INSEAD and a cofounder of the trust and safety company Tremau.

Max Struever is a principal engineer at BCG-X and an ambassador at the BCG Henderson Institute.

David Zuluaga Martínez is a partner at 
Boston Consulting Group and an ambassador at the BCG Henderson Institute.

Some of the companies mentioned in this column are past or present clients of the authors’ employers.


This story was originally featured on Fortune.com



Source link

Continue Reading

Business

The White House moves to cancel thousands of immigrants’ Social Security numbers using a ‘death master file,’ NYT reports

Published

on

The Trump administration is reportedly escalating its tactics to revoke the temporary legal status of immigrants allowed into the U.S. under the Biden administration. The latest effort includes adding migrants who are here lawfully to Social Security Administration’s “death master file,” effectively blacklisting them from the U.S. financial system, the New York Times reports.

Sources including family members, funeral homes, financial institutions, and more report deaths in the U.S. to Social Security, which the agency records in the so-called death master file database. Once there, outside financial and medical agencies as well as other governmental agencies are notified, and banking and financial institutions scour the list themselves to prevent identity theft.

The Trump administration hopes that adding migrants to this list—which will cut them off from most financial services—will make them more likely to “self-deport,” the Times reports. These Social Security numbers were legally obtained under a program created by President Joe Biden, which gave some migrants temporary legal status in the U.S. that allowed them to work.

“The goal is to cut those people off from using crucial financial services like bank accounts and credit cards, along with their access to government benefits,” the Times reports.

More than 6,000 people were added earlier this week. The Trump administration says the migrants who have been added are convicted criminals and “suspected terrorists,” the Times reports, although the list included minors. But current and former SSA employees told the Times they are concerned that erroneous data could mean others are improperly placed on the list, including Americans citizens.

The Social Security Administration and the White House did not respond to a request for comment.

Being added to the master file has ripple effects throughout someone’s entire life: Their medical insurance benefits or Medicare coverage can be halted, credit cards can be cancelled, and pensions can be lost. They can lose access to their bank accounts and even their homes, as well as government benefits from agencies like the Department of Veterans Affairs, the Department of Defense, and so on.

This is not the first time Social Security’s death master file has become politicized by the Trump administration. Earlier this year, advisor Elon Musk and his Department of Government Efficiency made claims that SSA is paying tens of thousands of people who are over 100 years old, using that as proof that the agency needs an overhaul. SSA denies these inaccuracies.

Also this week, the Trump administration made moves to share long off-limits IRS data with Immigration and Customs Enforcement to identify and deport undocumented immigrants. Several top officials at the IRS resigned as a result, including the acting commissioner. Undocumented immigrants paid $96.7 billion in federal, state, and local taxes in 2022, according to the Institute on Taxation and Economic Policy.

This story was originally featured on Fortune.com



Source link

Continue Reading

Business

10 days and a $10 trillion market swing: How Trump’s tariffs changed the global economy, and what comes next

Published

on

In U.S. financial history, there are weeks that live in infamy—like the “Black Tuesday” stock market crash of 1929, or the 2008 Lehman Brothers bankruptcy, or the COVID-induced shock of 2020. To this list we can add the “Liberation Day” market meltdown triggered by President Donald Trump’s successive announcements of sweeping tariffs over the past two weeks.

The ensuing damage was enormous, as panicked investors sold off both stocks and U.S. Treasuries, leading to a $10 trillion wipeout in global equities between April 2 and April 9, when the president “paused” some of his tariffs. While the markets had recovered much of those losses by Friday, most major stock indexes are still in correction territory, and remain in a state of turmoil defined by wild volatility, as many traders fear there’s far more economic damage to come.

Amid the enduring chaos, a clearer picture is emerging of what happened, and what the potential long-term effects could be. Most obviously, of course, the tariffs themselves have caused the meltdown as, for the first time in nearly a century, the U.S. government has forsaken free markets in favor of mercantilism. On top of this fire, Trump has poured the gasoline of uncertainty, rolling out tariffs in the name of increasing U.S. industrial production, but implementing them in haphazard ways, and then abruptly reversing his decisions. No one knows what he will do next; communication between the White House and business leaders has been scant.

The capital markets have made clear that they don’t like what’s happening. In the most alarming development, many investors have reacted by selling off U.S. Treasury bills—the asset that for decades has represented the safest of safe havens. For the first time in living memory, some players have lost confidence in the U.S. financial system and are seeking security instead in Europe and Asia.

This erosion of confidence is being felt not only in the sell-off of T-bills, but in the sudden plunge of the U.S. dollar. Business leaders, who stayed mostly silent until the Liberation Day fallout, are beginning to sound the alarm about long-term harm to the American economy, while companies are scrambling to address upended supply chains and plan for the future. Tariffs, meanwhile, are widely regarded as inflationary, and consumers are increasingly frightened as they realize that the cost of eggs is not coming down, and that cars and many other goods could soon be punishingly expensive, if not out of their reach altogether.

In these unprecedented economic conditions, how should executives and investors react? To offer some guidance, Fortune has tapped into the expertise of its veteran business journalists to provide both sharp analysis—such as an overview of the “murder mystery” in the bond market that led Trump to flinch on his latest round of tariffs—and practical advice. We’ve curated some of our best recent coverage in a special report, The Economy In Crisis.

Our guidance includes a look at how some popular assets, including gold and Bitcoin, are performing, and advice on how to adjust your portfolio to account for a declining dollar and other potentially permanent shifts in the financial landscape. We report on how Walmart, Apple, and other Fortune 500 companies are adjusting to a climate in which free cross-border trade can no longer be taken for granted. Together, the stories help explain what happened in the most chaotic 10 days in the market in years, and offer a sense of what could happen next in an economic storm that appears far from over. Read on—and buckle up.

This story was originally featured on Fortune.com



Source link

Continue Reading

Business

Jamie Dimon argues JPMorgan can help fix the bond chaos if regulators get on board — ‘It’s not relief to the banks, it’s relief to the markets’

Published

on



  • The Federal Reserve can’t allow the Treasury market to seize up like it did in 2008, one reason JPMorgan CEO Jamie Dimon claims bank capital requirements need to be fixed. These regulations are in place to prevent a repeat of the Global Financial Crisis, but Treasury Secretary Scott Bessent, Fed Chair Jerome Powell, and many economists agree certain adjustments would allow banks and broker-dealers to step in during times of market stress. 

A bond market sell-off has made investors question the safe-haven status of U.S. debt and fear another credit crunch—when liquidity dries up and economic activity grinds to a halt. JPMorgan Chase CEO Jamie Dimon said the world’s biggest lenders can help, but only if regulations developed to prevent a repeat of the Global Financial Crisis are scaled back. 

Treasury Secretary Scott Bessent, Federal Chair Jerome Powell, and many economists agree that certain changes could help banks and broker-dealers hold more Treasuries in times of market stress. Dimon went further, however, calling for sweeping reform of capital requirements, which the industry has long argued are onerous and stunt consumer lending. The current framework, he said, contains deep flaws. 

“And remember, it’s not relief to the banks,” Dimon said during JPMorgan’s first-quarter earnings call Friday. “It’s relief to the markets.”

Capital requirements aim to ensure banks, especially those deemed “too big to fail,” can survive if they sustain heavy losses. JPMorgan was one of only a few major lenders that didn’t need a controversial government bailout in 2008—but Dimon took the money anyway at the insistence of then-Treasury Secretary Henry Paulson. 

The Treasury market helps the global economy go-round, and Wall Street is watching closely for signs the Fed may be forced to intervene. Many suspect bond market turmoil is what truly forced President Donald Trump to announce a 90-day pause on his sweeping “reciprocal tariffs,” but the fixed-income selling spree is not over. A confounding spike in yields, which rise as bond prices fall, has persisted as investors sour on Treasuries, long considered some of the world’s safest assets.

The Trump administration has been clear it wants to see a lower yield on the 10-year Treasury, the benchmark for interest rates on mortgages, car loans, and other common types of borrowing throughout the economy. It spiked as high as 4.59% on Friday, however, up over 30 basis points from Wednesday’s low and more than 70 points from where it began its climb on Monday.

“The textbook would be saying that when the stock market is going down, long-term interest rates should also be going down,” Torsten Sløk, chief economist at private equity giant Apollo, wrote in a note Friday. “But this is not what is happening at the moment.”

Why banks can’t step in

One of the culprits for this “murder mystery,” as Sløk told Fortune earlier this week, could be the so-called “basis trade,” when hedge funds borrow heavily to take advantage of tiny price discrepancies between Treasuries and futures linked to those bonds. In normal times, they profit handsomely, and, in turn, help keep money markets humming.

During periods of extreme volatility, however, hedge funds can be forced to unwind the $800 billion trade, which spells trouble if the market struggles to absorb a massive increase in the supply of Treasuries. Foreign selling could exacerbate the problem, and that appeared to be at play on Thursday and Friday as the dollar fell. 

Big banks and broker-dealers can’t step in, however, because of restrictions like the supplementary leverage ratio. As the name implies, this measure curbs the amount of borrowed funds lenders can use to make investments. 

“These limitations have, of course, become more tight after the financial crisis in 2008,” Sløk said, “and that’s why the Wall Street banks are working less as shock absorbers in the current environment.”

U.S. debt is the dominant form of collateral in so-called repo markets, a crucial part of the financial system that allows banks and companies to meet their commitments with short-term loans. In short, the Fed doesn’t want the Treasury market to seize up like it did in 2008, which is why Dimon and other critics of current capital requirements say these regulations need to be fixed.

“When you have a lot of volatile markets and very wide spreads and low liquidity in Treasuries,” Dimon said, “it affects all other capital markets. That’s the reason to do it, not as a favor to the banks themselves.”

Such changes would not be without precedent. During the COVID-19 pandemic, the Fed exempted Treasuries and bank reserves from the calculation of the supplementary leverage ratio, allowing banks to snap up more U.S. debt.

Bessent has indicated he wants to make that change permanent as part of a broader deregulatory push. Even though the Fed recently lost a bitter fight with big banks, particularly JPMorgan, over bumping up capital requirements, Powell has said he agrees. Several academics are also in favor of a slight adjustment, which they say can be made without undermining the foundations of the Dodd-Frank reforms instituted after the financial crisis.

Regardless, the Fed still had to buy $1.6 trillion of Treasuries to stabilize money markets at the onset of the pandemic. Dimon said the central bank will again be forced to take similar action eventually. 

“There will be a kerfuffle in the Treasury markets because of all the rules and regulations,” he said.

Dimon hopeful for change under Trump

Dimon was not just referring to small changes in the supplementary leverage ratio, however. Fixing several different types of capital requirements, he said, could free up “hundreds of billions of dollars” for JPMorgan to lend across the banking system.

Banks have been very successful at pushing back on the Fed’s efforts to fully implement Basel III, a set of international standards developed after the 2008 crash to prevent the collapse of so-called “globally systemic banks.” After huge blowback from the industry, the Fed scrapped a proposal last year that would have raised capital requirements by 19%. The central bank’s top regulator later stepped down, allowing Trump to appoint Michelle Bowman, who voted against the more stringent regulations, to the role.

On Friday’s earnings call, Dimon, who has been credited with convincing Trump to scale back his tariffs, was asked whether he thought there was a better chance of bank-friendly reforms with the current administration than under Biden.

“I think there’s a deep recognition of the flaws in the system,” he said, “and fortunately, they’re going to take a good look at it.”

This story was originally featured on Fortune.com



Source link

Continue Reading

Trending

Copyright © Miami Select.