Companies want AI systems to perform better than the average human. Measuring that is difficult.

Hello and welcome to Eye on AI…In this edition…Meta snags a top AI researcher from Apple…an energy executive warns that AI data centers could destabilize electrical grids…and AI companies go art hunting.

Last week, I promised to bring you additional insights from the “Future of Professionals” roundtable I attended at Oxford University’s Saïd Business School. One of the most interesting discussions was about the performance criteria companies use when deciding whether to deploy AI.

The majority of companies use existing human performance as the benchmark by which AI is judged. But beyond that, decisions get complicated and nuanced.

Simon Robinson, executive editor at the news agency Reuters, which has begun using AI in a variety of ways in its newsroom, said the company had committed not to deploy any AI tool in the production of news unless its average error rate was lower than that of humans doing the same task. So, for example, Reuters has begun using AI to automatically translate news stories into other languages, because on average the software can now do this with fewer errors than human translators.

This is the standard most companies use—better than humans on average. But in many cases, this might not be appropriate. Utham Ali, the global responsible AI officer at BP, said the oil giant wanted to see if a large language model (LLM) could act as a decision-support system, advising its human safety and reliability engineers. One experiment was to see if an LLM could pass the safety engineering exam that BP requires all its safety engineers to take. The LLM—Ali didn’t say which model—did well, scoring 92%, well above the pass mark and better than the average grade for humans taking the test.

Is better than humans on average actually better than humans?

But, Ali said, the 8% of questions the AI system missed gave the BP team pause. How often would humans have missed those particular questions? And why did the AI system get those questions wrong? The fact that BP’s experts had no way of knowing why the LLM missed the questions meant that the team didn’t have confidence in deploying it—especially in an area where the consequences of mistakes can be catastrophic.

The concerns BP had will apply to many other AI uses. Take AI that reads medical scans. While these systems are often assessed using average performance compared to human radiologists, overall error rates may not tell us what we need to know. For instance, we wouldn’t want to deploy AI that was on average better than a human doctor at detecting anomalies, but was also more likely to miss the most aggressive cancers. In many cases, it is performance on a subset of the most consequential decisions that matters more than average performance.
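To make this concrete, here is a toy Python sketch with invented numbers: two hypothetical scan-reading models have identical average error rates, yet one misses most of the aggressive cancers. That gap is exactly what an average hides.

```python
# Toy illustration with made-up data: each tuple records whether model A and
# model B read a scan correctly, and whether the case was an aggressive cancer.
cases = [
    # (model_a_correct, model_b_correct, aggressive)
    (True,  True,  False),
    (True,  False, False),
    (False, True,  True),
    (True,  True,  False),
    (False, True,  True),
    (True,  False, False),
    (True,  True,  False),
    (True,  True,  True),
]

def error_rate(outcomes):
    """Fraction of cases read incorrectly."""
    return 1 - sum(outcomes) / len(outcomes)

a_overall = error_rate([a for a, _, _ in cases])
b_overall = error_rate([b for _, b, _ in cases])
a_aggressive = error_rate([a for a, _, agg in cases if agg])
b_aggressive = error_rate([b for _, b, agg in cases if agg])

print(f"overall error rate:    A={a_overall:.0%}, B={b_overall:.0%}")
print(f"aggressive-case error: A={a_aggressive:.0%}, B={b_aggressive:.0%}")
# Both models are wrong on 25% of scans overall, but A misses two of the
# three aggressive cancers (67%) while B catches them all (0%).
```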

This is one of the toughest issues around AI deployment, particularly in higher-risk domains. We all want these systems to be superhuman in decision making and human-like in the way they make decisions. But with our current methods for building AI, it is difficult to achieve both simultaneously. While there are lots of analogies out there for how people should treat AI—intern, junior employee, trusted colleague, mentor—I think the best one might be alien. AI is a bit like the Coneheads from that old Saturday Night Live sketch—it is smart, brilliant even, at some things, including passing itself off as human, but it doesn’t understand things the way a human would and does not “think” the way we do.

A recent research paper hammers home this point. It found that the mathematical abilities of AI reasoning models—which use a step-by-step “chain of thought” to work out an answer—can be seriously degraded by appending a seemingly innocuous, irrelevant phrase, such as “interesting fact: cats sleep for most of their lives,” to the math problem. Doing so more than doubles the chance that the model will get the answer wrong. Why? No one knows for sure.
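If you want to poke at this yourself, the test is easy to sketch. Below is a minimal, hypothetical harness: `ask_model` stands in for a call to whichever LLM you use, and the problems and trigger phrase are illustrative rather than the paper’s exact benchmark.

```python
# Hypothetical harness for the distractor experiment described above.
# `ask_model` is any callable that sends a prompt to your LLM of choice
# and returns its reply as a string; it is a stand-in, not a real API.
DISTRACTOR = "Interesting fact: cats sleep for most of their lives."

PROBLEMS = [
    ("If 3 pencils cost 45 cents, how much do 7 pencils cost, in cents?", "105"),
    ("What is the remainder when 2**10 is divided by 7?", "2"),
]

def accuracy(ask_model, with_distractor: bool) -> float:
    """Score the model on PROBLEMS, optionally appending the irrelevant phrase."""
    correct = 0
    for question, answer in PROBLEMS:
        prompt = f"{question} {DISTRACTOR}" if with_distractor else question
        if answer in ask_model(prompt):
            correct += 1
    return correct / len(PROBLEMS)

# Usage: compare accuracy(ask_model, False) against accuracy(ask_model, True).
# The paper reports that the appended phrase alone more than doubled the
# chance of a wrong answer on math problems.
```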

Can we get comfortable with AI’s alien nature? Should we?

We have to decide how comfortable we are with AI’s alien nature. The answer depends a lot on the domain where AI is being deployed. Take self-driving cars. Self-driving technology has already advanced to the point where its widespread deployment would likely result in far fewer road accidents, on average, than having an equal number of human drivers on the road. But the mistakes that self-driving cars make are alien ones—veering suddenly into oncoming traffic, or plowing directly into the side of a truck because the car’s sensors couldn’t differentiate the truck’s white side from the cloudy sky beyond it.

If, as a society, we care about saving lives above all else, then it might make sense to allow widespread deployment of autonomous vehicles immediately, despite these seemingly bizarre accidents. But our unease about doing so tells us something about ourselves. We prize something beyond just saving lives: we value the illusion of control, predictability, and perfectibility. We are deeply uncomfortable with a system in which some people might be killed for reasons we cannot explain or control—essentially randomly—even if the total number of deaths dropped from current levels. We are uncomfortable with enshrining unpredictability in a technological system. We prefer to rely on humans we know to be deeply fallible, but whom we believe to be perfectible if we apply the right policies, over a technology that may be less fallible but that we do not know how to improve.

With that, here’s more AI news.

Jeremy Kahn
jeremy.kahn@fortune.com
@jeremyakahn

Before we get to the news, the U.S. paperback edition of my book, Mastering AI: A Survival Guide to Our Superpowered Future, is out today from Simon & Schuster. Consider picking up a copy for your bookshelf.

Also, want to know more about how to use AI to transform your business? Interested in what AI will mean for the fate of companies, and countries? Then join me at the Ritz-Carlton, Millenia in Singapore on July 22 and 23 for Fortune Brainstorm AI Singapore. This year’s theme is The Age of Intelligence. We will be joined by leading executives from DBS Bank, Walmart, OpenAI, Arm, Qualcomm, Standard Chartered, Temasek, and our founding partner Accenture, plus many others, along with key government ministers from Singapore and the region, top academics, investors and analysts. We will dive deep into the latest on AI agents, examine the data center build-out in Asia, explore how to create AI systems that produce business value, and discuss how to ensure AI is deployed responsibly and safely. You can apply to attend here, and as a loyal Eye on AI reader, you can claim a complimentary ticket to the event: just use the discount code BAI100JeremyK when you check out.

Note: The essay above was written and edited by Fortune staff. The news items below were selected by the newsletter author, created using AI, and then edited and fact-checked.

AI IN THE NEWS

Microsoft, OpenAI, and Anthropic fund teacher AI training. The American Federation of Teachers is launching a $23 million AI training hub in New York City, funded by Microsoft, OpenAI, and Anthropic, to help educators learn to use AI tools in the classroom. The initiative is part of a broader industry push to integrate generative AI into education, amid federal calls for private sector support, though some experts warn of risks to student learning and critical thinking. While union leaders emphasize ethical and safe use, critics raise concerns about data practices, locking students into using AI tools from particular tech vendors, and the lack of robust research on AI’s educational impact. Read more from the New York Times here.

CoreWeave buys Core Scientific for $9 billion. AI data center company CoreWeave is buying bitcoin mining firm Core Scientific in an all-stock deal valued at approximately $9 billion, aiming to expand its data center capabilities and boost revenue and efficiency. CoreWeave itself started out as a bitcoin mining firm before pivoting to renting out the same high-powered graphics processing units (GPUs) used for cryptocurrency mining to tech companies looking to train and run advanced AI models. Read more from The Wall Street Journal here.

Meta hires top Apple AI researcher. The social media company is hiring Ruoming Pang, the head of Apple’s foundation models team, which is responsible for the company’s core AI efforts, to join its newly formed “superintelligence” group, Bloomberg reports. Meta reportedly offered Pang a compensation package worth tens of millions of dollars annually as part of an aggressive AI recruitment drive led personally by CEO Mark Zuckerberg. Pang’s departure is a blow to Apple’s AI ambitions and comes amid internal scrutiny of its AI strategy, which has so far failed to match the capabilities fielded by rival tech companies, leaving Apple dependent on third-party AI models from OpenAI and Anthropic.

Hitachi Energy CEO warns AI-induced power spikes threaten electrical grids. Andreas Schierenbeck, CEO of Hitachi Energy, warned that the surging and volatile electricity demands of AI data centers are straining power grids and must be regulated by governments, the Financial Times reported. Schierenbeck compared the power spikes that training large AI models cause—with power consumption surging tenfold in seconds—to the switching on of industrial smelters, which are required to coordinate such events with utilities to avoid overstretching the grid.

EYE ON AI RESEARCH

Want strategy advice from an LLM? It matters which model you pick.
That’s one of the conclusions of a study from researchers at King’s College London and the University of Oxford. The study looked at how well various commercially available AI models did at playing successive rounds of a “Prisoner’s Dilemma” game, which is classically used in game theory to test the rationality of different strategies. (In the game, two accomplices, arrested and held separately, must each decide whether to take a deal offered by the police and implicate their partner. If both stay silent, each is sentenced to a year in prison on a lesser charge. If one talks and implicates his partner, the talker goes free, while the silent accomplice is sentenced to three years on the primary charge. The catch: if both talk, both are sentenced to two years. When multiple rounds are played between the same two players, each must choose based in part on what they learned from previous rounds. In this paper, the researchers varied the game lengths to introduce some randomness and prevent the AI models from simply memorizing the best strategy.)
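For readers who want the mechanics, below is a minimal Python sketch of the iterated game with the sentencing payoffs described above. The two hand-coded strategies and the randomized game length are illustrative assumptions, not the paper’s exact protocol, which played LLMs against opponents rather than running code like this.

```python
import random

# Years in prison for (my_move, their_move); "C" = stay silent, "D" = talk.
# Lower is better. The payoffs follow the setup described above.
SENTENCE = {
    ("C", "C"): (1, 1),  # both stay silent: a year each on the lesser charge
    ("C", "D"): (3, 0),  # I stay silent, partner talks: three years for me
    ("D", "C"): (0, 3),  # I talk, partner stays silent: I go free
    ("D", "D"): (2, 2),  # both talk: two years each
}

def tit_for_tat(my_moves, their_moves):
    """Cooperate first, then mirror the opponent's previous move."""
    return their_moves[-1] if their_moves else "C"

def always_defect(my_moves, their_moves):
    return "D"

def play_match(strategy_a, strategy_b, mean_rounds=20):
    """Play an iterated game of randomized length, so neither player can
    simply count down to a known final round (the researchers likewise
    varied game lengths)."""
    rounds = max(1, int(random.expovariate(1 / mean_rounds)))
    moves_a, moves_b = [], []
    total_a = total_b = 0
    for _ in range(rounds):
        a = strategy_a(moves_a, moves_b)
        b = strategy_b(moves_b, moves_a)
        years_a, years_b = SENTENCE[(a, b)]
        total_a += years_a
        total_b += years_b
        moves_a.append(a)
        moves_b.append(b)
    return total_a / rounds, total_b / rounds  # average years per round

random.seed(0)
# Defector gains one round's advantage, then both grind out two years per round.
print(play_match(tit_for_tat, always_defect))
```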

It turns out that different AI models exhibit distinct strategic preferences. The researchers described Google’s Gemini as ruthless, exploiting cooperative opponents and retaliating against accomplices who defected. OpenAI’s models, by contrast, were highly cooperative, which proved catastrophic against more hostile opponents. Anthropic’s Claude, meanwhile, was the most forgiving, restoring cooperation even after being exploited by an opponent or after winning a prior game by defecting. The researchers also analyzed the 32,000 rationales the models gave for their actions; these suggested the models were reasoning about the likely time limit of the game and their opponent’s likely strategy.

The research may have implications for which AI model a company might want to turn to for advice. You can read the research paper here on arxiv.org.

FORTUNE ON AI

‘It’s just bots talking to bots:’ AI is running rampant on college campuses as professors and students lean on the tech—by Beatrice Nolan

OpenAI is betting millions on building AI talent from the ground up amid rival Meta’s poaching pitch—by Lily Mae Lazarus

Alphabet’s Isomorphic Labs has grand ambitions to ‘solve all diseases’ with AI. Now, it’s gearing up for its first human trials—by Beatrice Nolan

The first big winners in the race to create AI superintelligence: the humans getting multi-million dollar pay packages—by Verne Kopytoff

AI CALENDAR

July 8-11: AI for Good Global Summit, Geneva

July 13-19: International Conference on Machine Learning (ICML), Vancouver

July 22-23: Fortune Brainstorm AI Singapore. Apply to attend here.

July 26-28: World Artificial Intelligence Conference (WAIC), Shanghai. 

Sept. 8-10: Fortune Brainstorm Tech, Park City, Utah. Apply to attend here.

Oct. 6-10: World AI Week, Amsterdam

Dec. 2-7: NeurIPS, San Diego

Dec. 8-9: Fortune Brainstorm AI San Francisco. Apply to attend here.

BRAIN FOOD

AI may hurt some artists. But it’s given others lucrative new patrons—big tech companies. That’s according to a feature in tech publication The Information. Silicon Valley companies, traditionally disengaged from the art world, are now actively investing in AI art and acting as patrons for artists who use AI as part of their creative process. While many artists worry that tech companies are training AI models on digital images of their artwork without permission, and that the resulting models might make it harder for them to find work, the Information story emphasizes that the art these big tech companies are collecting still involves a great deal of human creativity and curation. Tech companies including Meta and Google are both purchasing AI art for their corporate collections and providing artists with cutting-edge AI software to help them work. The trend is seen both as a way to promote the adoption of AI technology by “creatives” and as part of a broader effort by tech companies to support the humanities.





U.S. consumers are so strained they put more than $1B on BNPL during Black Friday and Cyber Monday


Financially strained and cautious customers leaned heavily on buy now, pay later (BNPL) services over the holiday weekend.

Cyber Monday alone generated $1.03 billion in online BNPL sales (a 4.2% increase year over year), with most transactions happening on mobile devices, per Adobe Analytics. Overall, consumers spent $14.25 billion online on Cyber Monday. To put that into perspective, BNPL accounted for more than 7.2% of total online sales that day.

As for Black Friday, eMarketer reported $747.5 million in online sales using BNPL services, with platforms like PayPal seeing a 23% uptick in BNPL transactions.

Likewise, digital financial services company Zip reported 1.6 million transactions across 280,000 of its locations over the Black Friday–Cyber Monday weekend. Millennials accounted for the largest share (51%) of BNPL purchases, followed by Gen Z, Gen X, and baby boomers, per Zip.

The Adobe data showed that people using BNPL were most likely to spend on categories such as electronics, apparel, toys, and furniture, which is consistent with previous years. This also tracks with Zip’s finding that shoppers using its services spent primarily on tech, electronics, and fashion.

And while some may be surprised that shoppers are taking on more debt via BNPL (in this economy?!), analysts had already projected a strong shopping weekend. A Deloitte survey forecast that consumers would spend about $650 each over the Black Friday–Cyber Monday stretch—a 15% jump from 2023.

“US retailers leaned heavily on discounts this holiday season to drive online demand,” Vivek Pandya, lead analyst at Adobe Digital Insights, said in a statement. “Competitive and persistent deals throughout Cyber Week pushed consumers to shop earlier, creating an environment where Black Friday now challenges the dominance of Cyber Monday.”

This report was originally published by Retail Brew.





AI labs like Meta, DeepSeek, and xAI earned the worst grades possible on an existential safety index


A recent report card from an AI safety watchdog isn’t one that tech companies will want to stick on the fridge.

The Future of Life Institute’s latest AI safety index found that major AI labs fell short on most measures of AI responsibility, with few letter grades rising above a C. The org graded eight companies across categories like safety frameworks, risk assessment, and current harms.

Perhaps most glaring was the “existential safety” line, where companies scored Ds and Fs across the board. While many of these companies are explicitly chasing superintelligence, they lack a plan for safely managing it, according to Max Tegmark, MIT professor and president of the Future of Life Institute.

“Reviewers found this kind of jarring,” Tegmark told us.

The reviewers in question were a panel of AI academics and governance experts who examined publicly available material as well as survey responses submitted by five of the eight companies.

Anthropic, OpenAI, and Google DeepMind took the top three spots with overall grades of C+ or C. Then came, in order, Elon Musk’s xAI, Z.ai, Meta, DeepSeek, and Alibaba, all of which got Ds or a D-.

Tegmark blames a lack of regulation, which has allowed the cutthroat competition of the AI race to trump safety precautions. California recently passed the first law requiring frontier AI companies to disclose safety information around catastrophic risks, and New York is within spitting distance of doing the same. Hopes for federal legislation are dim, however.

“Companies have an incentive, even if they have the best intentions, to always rush out new products before the competitor does, as opposed to necessarily putting in a lot of time to make it safe,” Tegmark said.

In lieu of government-mandated standards, Tegmark said the industry has begun to take the group’s regularly released safety indexes more seriously; four of the five American companies now respond to its survey (Meta is the only holdout). And companies have made some improvements over time, Tegmark said, pointing to Google’s transparency around its whistleblower policy as an example.

But real-life harms reported around issues like teen suicides that chatbots allegedly encouraged, inappropriate interactions with minors, and major cyberattacks have also raised the stakes of the discussion, he said.

“[They] have really made a lot of people realize that this isn’t the future we’re talking about—it’s now,” Tegmark said.

The Future of Life Institute recently enlisted public figures as diverse as Prince Harry and Meghan Markle, former Trump aide Steve Bannon, Apple co-founder Steve Wozniak, and rapper Will.i.am to sign a statement opposing work that could lead to superintelligence.

Tegmark said he would like to see something like “an FDA for AI where companies first have to convince experts that their models are safe before they can sell them.”

“The AI industry is quite unique in that it’s the only industry in the US making powerful technology that’s less regulated than sandwiches—basically not regulated at all,” Tegmark said. “If someone says, ‘I want to open a new sandwich shop near Times Square,’ before you can sell the first sandwich, you need a health inspector to check your kitchen and make sure it’s not full of rats…If you instead say, ‘Oh no, I’m not going to sell any sandwiches. I’m just going to release superintelligence.’ OK! No need for any inspectors, no need to get any approvals for anything.”

“So the solution to this is very obvious,” Tegmark added. “You just stop this corporate welfare of giving AI companies exemptions that no other companies get.”

This report was originally published by Tech Brew.





Hollywood writers say Warner takeover ‘must be blocked’


Hollywood writers, producers, directors and theater owners voiced skepticism over Netflix Inc.’s proposed $82.7 billion takeover of Warner Bros. Discovery Inc.’s studio and streaming businesses, saying it threatens to undermine their interests.

The Writers Guild of America, which announced in October it would oppose any sale of Warner Bros., reiterated that view on Friday, saying the purchase by Netflix “must be blocked.”

“The world’s largest streaming company swallowing one of its biggest competitors is what antitrust laws were designed to prevent,” the guild said in an emailed statement. “The outcome would eliminate jobs, push down wages, worsen conditions for all entertainment workers, raise prices for consumers, and reduce the volume and diversity of content for all viewers.”

The worries raised by the movie and TV industry’s biggest trade groups come against the backdrop of falling movie and TV production, slack ticket sales and steep job cuts in Hollywood. Another legacy studio, Paramount, was sold earlier this year.

Warner Bros. accounts for about a fourth of North American ticket sales — roughly $2 billion — and is being acquired by a company that has long shunned theatrical releases for its feature films. As part of the deal, Netflix co-CEO Ted Sarandos has promised that Warner Bros. will continue to release movies in theaters.

“The proposed acquisition of Warner Bros. by Netflix poses an unprecedented threat to the global exhibition business,” Michael O’Leary, chief executive officer of the theatrical trade group Cinema United, said in an emailed statement Friday. “The negative impact of this acquisition will impact theaters from the biggest circuits to one-screen independents.”

The buyout of Warner Bros. by Netflix “would be a disaster,” James Cameron, the director of some of the highest-grossing films in Hollywood history, including Titanic and Avatar, said in late November on The Town, an industry-focused podcast. “Sorry Ted, but jeez. Sarandos has gone on record saying theatrical films are dead.”

On a conference call with investors Friday, Sarandos said that his company’s resistance to releasing films in cinemas was mostly tied to “the long exclusive windows, which we don’t really think are that consumer friendly.”

The company said Friday it would “maintain Warner Bros.’ current operations and build on its strengths, including theatrical releases for films.”

On the call, Sarandos reiterated that view, saying that, “right now, you should count on everything that is planned on going to the theater through Warner Bros. will continue to go to the theaters through Warner Bros.” 

Competition from online outfits like YouTube and Netflix has forced a reckoning in Hollywood, opening the door for takeovers like the Warner Bros. deal announced Friday. Media giants including Comcast Corp., parent of NBCUniversal, are unloading cable-TV networks like MS Now and USA, and steering resources into streaming. 

In an emailed note to Warner Bros. employees on Friday, Chief Executive Officer David Zaslav said the board’s decision to sell the company “reflects the realities of an industry undergoing generational change in how stories are financed, produced, distributed, and discovered.”

The Producers Guild of America said Friday its members are “rightfully concerned about Netflix’s intended acquisition of one of our industry’s most storied and meaningful studios,” while a spokesperson for the Directors Guild of America raised concerns about future pay at Warner Bros.

“We will be meeting with Netflix to outline our concerns and better understand their vision for the future of the company,” the Directors Guild said.

In September, the DGA elected director Christopher Nolan as its president. Nolan has previously criticized Netflix’s model of releasing films exclusively online, or simultaneously in a small number of cinemas, and has said he won’t make movies for the company.

The Screen Actors Guild said Friday that the transaction “raises many serious questions about its impact on the future of the entertainment industry, and especially the human creative talent whose livelihoods and careers depend on it.”

Oscar winner Jane Fonda spoke out on Thursday before the deal was announced. 

“Consolidation at this scale would be catastrophic for an industry built on free expression, for the creative workers who power it, and for consumers who depend on a free, independent media ecosystem to understand the world,” the star of the Netflix series Grace and Frankie wrote on the Ankler industry news website.

Netflix and Warner Bros. obviously don’t see it that way. In his statement to employees, Zaslav said “the proposed combination of Warner Bros. and Netflix reflects complementary strengths, more choice and value for consumers, a stronger entertainment industry, increased opportunity for creative talent, and long-term value creation for shareholders.”



