From deciding whether to bring an umbrella to work tomorrow to how to prepare for retirement, humans make predictions constantly. A common refrain these days is to ‘just ask AI’—but can AI reliably forecast the future? Haifeng Xu, Assistant Professor of Computer Science and Data Science at the University of Chicago and the Data Science Institute, and his team at the Strategic IntelliGence for Machine Agents (SIGMA) Lab are trying to answer this question. They recently launched Prophet Arena, a benchmarking tool that evaluates large language models’ (LLMs) intelligence through real-world forecasting.

For those unfamiliar, this arena is “a platform where people can measure how clever or intelligent the language models are,” Xu explains.

In 2023, the launch of Chatbot Arena showed the power of crowdsourced evaluation: the wildly popular platform run by LMSYS has collected over 800,000 votes from millions of users pitting language models against each other in head-to-head “battles.” First developed by researchers at UC Berkeley, Chatbot Arena revolutionized AI evaluation by moving beyond academic benchmarks to capture real-world human preferences for smooth, “humanlike” communication.

But when groups like the Center for AI Safety and Scale AI created standardized tests for AI (e.g., ‘Humanity’s Last Exam’), even the most advanced LLMs flopped. This inconsistent performance, where models ace some benchmarks yet fail dramatically on others, was a key motivation behind Prophet Arena. “We needed a more suitable benchmark of intelligence,” says Xu.

Now, Prophet Arena is bringing the “arena” concept into the realm of forecasting to craft a better benchmark of intelligence. Where Chatbot Arena asked “which response do humans prefer?” Prophet Arena asks, “Which model makes the best prophet?”

How Prophet Arena is Taking on the Challenges at the Frontier of AI

Forecasting the future demands sophisticated capabilities that are fundamental to how we as humans reason about the world: understanding existing information, thinking critically about its reliability, weighing the credibility of news sources, integrating different perspectives, and synthesizing everything into a forecast. Prophet Arena challenges LLMs to exercise these capabilities, making general-purpose forecasts across domains without specialized fine-tuning.

Traditional AI benchmarks rely on fixed datasets with known answers. As Xu explains, “Frontier models are trained on every possible piece of data these days, so gradually, they’ll memorize the answers to any static benchmark,” a phenomenon known as overfitting. “They might be able to ace a sophisticated exam, but they still won’t be able to reason accurately in the wild.”

So the team considered a different approach: “If you ask a model to forecast the future, which is dynamic,” Xu explains, “you won’t run into that problem. Unless you can bend time, you’ll never overfit the future. You can’t memorize tomorrow’s news, because it hasn’t happened yet.”

Prophet Arena presents AI systems with live, unresolved real-world events, a design innovation that led Boris Power (Head of Applied Research at OpenAI) to call it “the only benchmark that can’t be hacked and will stay relevant for decades.”

How does Prophet Arena work? The platform uses a three-stage pipeline:

Information Sourcing: The platform gathers news reports across domains like politics, economics, science, and sports, along with market data (specifically, high-trade-volume, high-liquidity contracts from the prediction market Kalshi) to curate events of interest.
Forecast Submission: Rather than requiring simple yes/no answers, Prophet Arena elicits probability distributions from AI models, allowing them to express confidence levels and uncertainty. Alongside these structured forecasts, models submit detailed rationales. Until an event takes place, models continue to integrate new developments and shifting sentiment into their understanding and assessment of the evolving landscape.
Outcome Resolution: Each of these forecasts ladders up to an overall evaluation of model performance: the platform uses a combination of established and novel metrics, each measuring a different aspect of performance, that together “provide a more complete picture of what makes a model a good prophet” for a given issue, says Xu. “You can think of it like an IQ test for LLMs.” As each event resolves, a live leaderboard tracks model performance in real time.
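The article does not name Prophet Arena’s exact metrics, but a standard way to score a probabilistic forecast once an event resolves is the Brier score, which rewards calibration and penalizes misplaced confidence. A minimal sketch, purely as an illustration of the idea:

```python
def brier_score(forecast: dict[str, float], outcome: str) -> float:
    """Mean squared error between a probability distribution and the
    realized outcome (1 for what happened, 0 for everything else).
    Lower is better; a perfect forecast scores 0.0."""
    return sum(
        (p - (1.0 if event == outcome else 0.0)) ** 2
        for event, p in forecast.items()
    ) / len(forecast)

# A confident, correct forecast scores better than a 50/50 hedge.
confident = {"yes": 0.9, "no": 0.1}
hedged = {"yes": 0.5, "no": 0.5}
print(brier_score(confident, "yes"))  # ~0.01
print(brier_score(hedged, "yes"))     # 0.25
```

Scoring distributions rather than yes/no picks is what lets a leaderboard distinguish a model that was confidently right from one that merely guessed correctly.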

A Case Study

So how can a casual user engage with the platform today? Say, for example, you want to explore the likelihood that national gas prices will pass $3.25 this month. On Prophet Arena, you can explore not just what major LLMs have to say about the odds, but how contextual factors influence their reasoning.

The live forecasts and transparent evaluation metrics provide a window into AI forecasting.

Designed to facilitate collaboration between humans and AI, the interface documents the sources considered and explains the reasoning behind different weights or decisions made in the model. Further, users can adjust forecasts based on their personal risk tolerance, from completely risk-neutral to highly risk-averse.
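The article doesn’t say how the risk-tolerance adjustment is implemented; one textbook way to turn the risk-neutral-to-risk-averse dial into numbers is fractional Kelly staking, where a risk parameter scales down the bankroll fraction the full Kelly criterion would commit. A hypothetical sketch, with illustrative numbers:

```python
def kelly_fraction(p: float, price: float, risk_tolerance: float = 1.0) -> float:
    """Fraction of bankroll to stake on a $1-payout contract priced at
    `price` dollars, given an estimated win probability `p`.

    risk_tolerance scales the stake: 1.0 = full Kelly, 0.5 = half Kelly,
    0.0 = never bet. Negative-edge bets are clipped to zero."""
    b = (1.0 - price) / price               # net odds received on a win
    full_kelly = (p * (b + 1.0) - 1.0) / b  # (bp - q) / b
    return max(0.0, full_kelly * risk_tolerance)

# A risk-averse user (half Kelly) stakes half of what a
# risk-neutral user would on the same forecast.
print(kelly_fraction(0.30, 0.11))       # full Kelly: ~0.21 of bankroll
print(kelly_fraction(0.30, 0.11, 0.5))  # half Kelly: ~0.11
```

Fractional Kelly is a common practical choice because full Kelly is notoriously volatile when the underlying probability estimate is noisy.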

The interface also allows users to log in to rate forecasts, compare source quality to help models learn which sources are more reliable, and explore how forecasts change with new information. In this way, the platform supports multi-step reasoning for the model while inviting humans to question their own assumptions.

Initial Findings

We know that LLMs handle the same inputs differently based on their respective training. Prophet Arena is showing us how: models exhibit distinct reasoning styles or “personas” in their approaches. For example, in forecasting whether AI regulation will become federal law before 2026, models show dramatically different confidence levels (ranging from 35% to 75%) despite analyzing the same legislative developments. “Some models are very critical, and some trust all news sources absolutely,” Xu explains. “The more advanced models tend to be more hesitant rather than completely trusting. The less sophisticated will shift opinion very quickly.”

Some models are better than others at detecting a strategic edge. In one case, o3-mini gave Toronto Football Club a 30% chance of winning while the market priced it at only 11%, ultimately earning a 9x return when Toronto pulled off the upset.
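The arithmetic behind that edge is straightforward: a prediction-market price can be read as an implied probability, and a contract that pays $1 on a win returns 1/price per dollar staked. A back-of-envelope sketch using the figures from the Toronto example (the expected-value framing is our illustration, not the platform’s):

```python
model_prob = 0.30    # o3-mini's estimated win probability
market_price = 0.11  # market's implied probability (contract price in dollars)

payout_multiple = 1.0 / market_price           # ~9.1x if Toronto wins
expected_value = model_prob * payout_multiple  # ~2.7x per dollar staked

# The model sees a positive-expected-value bet whenever its
# probability exceeds the market's implied probability.
has_edge = model_prob > market_price
print(f"{payout_multiple:.1f}x payout, EV {expected_value:.2f}, edge={has_edge}")
```

Of course, a single 9x win proves little on its own; that is why the platform aggregates many resolved events before ranking models.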

Right now, you can use the platform to explore questions from “Who will win this year’s Nobel Peace Prize?” to “How likely is it that AI regulation becomes federal law this year?” Other questions you can investigate with Prophet Arena range from fun to serious:

Which films are most likely to receive Oscar nominations for Best Picture?
How likely is a recession this year?
Which vaccines are likely to see their recommendations ended in the US this year?

Xu’s vision extends further: “Our single greatest vision is to democratize forecasting,” he says. “While prediction is fundamental to human activity, it’s difficult for most people to do. But armed with LLMs that can combine many people’s opinions, and communicate their reasoning, more people can access these tools to help them plan ahead.”

The platform’s next priority is increasing user engagement, because as forecasting becomes more accessible, it also creates a virtuous cycle: more participation improves the forecasts, and better forecasts attract more users.

Advancing the Science of Forecasting to Develop an AI-Augmented Future

As Xu and his collaborators continue developing Prophet Arena, they’re working to go beyond benchmarking toward meaningful collaboration between humans and AI in forecasting: by enhancing collective foresight and informing high-stakes policy-making, they hope to help society make more informed decisions.

The Prophet Arena project demonstrates Xu’s expertise in strategic intelligence for machine agents and the Data Science Institute’s mission to explore how emerging technologies can benefit society. Xu has published regularly at flagship venues on machine learning, AI and computational economics, and his research has been recognized by multiple major awards including an AI2050 Early Career Fellowship, an IJCAI Early Career Spotlight, a WWW’24 Best Paper Award, and a Google Faculty Research Award.

This story originated from the Data Science Institute. 
