
Introducing AIMIP: The AI weather and climate model intercomparison project

May 13, 2026

Brian Henn - Ai2


A new generation of AI models can simulate aspects of Earth’s climate far more efficiently than traditional systems, but the field still needs rigorous, shared ways to test whether those models are accurate and reliable. 

To address that gap, we’ve been leading a community effort called AIMIP (AI Model Intercomparison Project) to support scientific understanding and open evaluation of AI models for climate forecasting. AIMIP brings together multiple modeling groups, including NVIDIA, Google Research, and others, around a shared benchmark experiment and dataset—making it easier to compare systems on common outputs and evaluation criteria and helping build confidence in how these models are assessed. 

As part of AIMIP Phase 1, we’re releasing a dataset of AI weather and climate model forecasts for this shared benchmark experiment, along with a report and evaluations showing that AI models are competitive on key climate metrics but still struggle in some areas.

Leveraging a revolution in weather and climate forecasts

AI climate models are relatively new, but they build on several years of rapid development in using AI to predict short-term weather patterns. Trained on ERA5, a large reanalysis dataset of historical weather observations spanning the entire atmosphere, AI weather models now regularly beat conventional models on key skill metrics for forecasts 1-10 days in the future, as demonstrated on WeatherBench, a benchmark and leaderboard for data-driven weather forecasting. And they do so with extraordinary speed, using far less computational power than conventional models.
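For concreteness, one of WeatherBench’s headline skill metrics is latitude-weighted root-mean-square error (RMSE) between a forecast field and the corresponding ERA5 field. Here is a minimal sketch of that calculation, with synthetic data standing in for real forecasts:

```python
import numpy as np

def lat_weighted_rmse(forecast, truth, lats_deg):
    """Latitude-weighted RMSE over a regular lat-lon grid.

    forecast, truth: arrays of shape (lat, lon).
    lats_deg: latitude of each grid row in degrees.
    Grid cells shrink toward the poles, so each row is weighted
    by cos(latitude) before averaging, as in WeatherBench.
    """
    weights = np.cos(np.deg2rad(lats_deg))
    weights = weights / weights.mean()  # normalize to mean 1
    sq_err = (forecast - truth) ** 2
    return float(np.sqrt((sq_err * weights[:, None]).mean()))

# Toy example on a 1-degree grid with a synthetic temperature field
lats = np.arange(-89.5, 90.0, 1.0)
rng = np.random.default_rng(0)
truth = 288.0 + 30.0 * np.cos(np.deg2rad(lats))[:, None] * np.ones((1, 360))
forecast = truth + rng.normal(0.0, 1.5, truth.shape)
print(f"latitude-weighted RMSE: {lat_weighted_rmse(forecast, truth, lats):.2f} K")
```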

The development of AI-driven climate models, in turn, has built on these advances in AI weather forecasting, but it faces a unique set of challenges. Until quite recently, few AI models could simulate the climate over long timescales in a way that resembles a traditional climate model. And unlike weather forecasting, which has WeatherBench, AI climate modeling has no obvious benchmarks and metrics for evaluation.

To understand why these challenges exist, it helps to first understand what climate models do and how they’re usually tested.

Climate models and the MIPs

Developed over the last several decades, physically-based climate models aim to simulate Earth's climate under particular scenarios over periods of decades or centuries. They do this by using physical laws to predict the weather on short timescales, over and over for the entire globe, as they advance through a simulation period. The resulting averages and extremes of weather make up the climate—the average temperature and precipitation for a given location, for example, but also its likelihood of experiencing an extreme event such as a heat wave or a tropical storm. 
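As a toy illustration of how a long daily simulation reduces to climate statistics, here is a short sketch with xarray (the synthetic data and variable names are ours, not AIMIP’s):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a decade of daily near-surface temperature at one location
time = pd.date_range("1979-01-01", "1988-12-31", freq="D")
doy = time.dayofyear.values
rng = np.random.default_rng(0)
temps = 288.0 + 10.0 * np.sin(2 * np.pi * doy / 365.25) + rng.normal(0.0, 3.0, time.size)
tas = xr.DataArray(temps, coords={"time": time}, dims="time", name="tas")

# The climate is the statistics of the simulated weather:
seasonal_cycle = tas.groupby("time.month").mean()  # average by calendar month
p99 = tas.quantile(0.99)                           # heat-extreme threshold
hot_days = int((tas > p99).sum())                  # number of days above it
print(seasonal_cycle.values.round(1), float(p99), hot_days)
```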

Climate models must also account for the effects of changes in the ocean and sea ice (among other parts of the Earth system) over time, because on these long timescales they meaningfully affect the weather. And they must be able to simulate a range of possible scenarios, such as rising greenhouse gas (GHG) emissions, and the hazards those scenarios bring.

The computational demands of climate modeling are tremendous as a result. Historically, only scientists with access to large high-performance computing systems – e.g., at national laboratories – were able to run the simulations, and only a limited number of models were developed around the world. That scarcity is one reason why shared evaluation frameworks have become so important in climate science.

To evaluate climate models, the scientific community uses tools called model intercomparison projects, or MIPs. A MIP is a standardized experiment that climate models execute, producing common outputs for evaluation. The ongoing Coupled Model Intercomparison Project, or CMIP, has been the driving force behind the community’s campaign to develop accurate model forecasts of the effects of GHG emissions, for example.

AI climate modeling offers the same promise as AI weather forecasting: forecasts that are made with revolutionary speed and efficiency as compared to physically-based climate models (with up to three orders of magnitude less compute), offering the potential to unlock scientific discovery for a much wider range of users. But only in the last two years or so have AI models from multiple groups, using a variety of AI architectures, demonstrated that they can make stable, high-fidelity predictions for decades and centuries. And their ability to correctly respond to different climate scenarios is still largely unknown. 

Existing intercomparison frameworks were built for conventional climate models, and don’t match the capabilities or address the questions surrounding today’s AI climate models. Thus, the time was ripe for AIMIP, which developed out of community conversations with both AI and conventional climate modeling groups. 

AIMIP Phase 1: Specification and submissions

AIMIP Phase 1 is the project’s first shared benchmark experiment, designed to compare AI climate models under a common setup while keeping the scope narrow enough for broad participation. It specifies that a model must forecast the state of the global atmosphere over 1979-2024 with monthly and daily output frequencies. Models must be trained only on the ERA5 historical observations from 1979-2014, leaving the last decade as test data, but the choice of AI architecture is up to the participating modeling groups.  

The ocean and sea ice states are prescribed with historically observed values, because at this early stage in AI climate modeling the goal is to focus on the behavior of the atmosphere alone. However, in future AIMIP phases, it may be possible for AI to simulate the ocean, sea ice, and other Earth system components via a “coupled” climate model (such as our SamudrACE model), and AIMIP will need to evolve to properly capture this.
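Schematically, a Phase 1 run steps the atmosphere forward autoregressively while reading the ocean surface from observations rather than predicting it. Here is a sketch under an entirely hypothetical model interface (not any participating group’s actual API):

```python
def run_prescribed_ocean(model, initial_atmos, boundary_forcings):
    """Autoregressive rollout with prescribed ocean and sea-ice states.

    model             -- hypothetical AI emulator: (atmos, boundary) -> next atmos
    initial_atmos     -- atmospheric state at the start of the simulation
    boundary_forcings -- observed SST / sea-ice fields, one per model step
    """
    state = initial_atmos
    trajectory = []
    for boundary in boundary_forcings:
        # Boundary conditions come from observations at every step;
        # only the atmosphere evolves freely under the AI model.
        state = model(state, boundary)
        trajectory.append(state)
    return trajectory
```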

In AIMIP Phase 1, models must output temperature, humidity, and winds at seven levels in the atmosphere, as well as temperature, precipitation, and other key weather variables at the surface. They must also make their outputs compatible with typical CMIP format specifications to facilitate intercomparison with conventional climate models and evaluation tools.
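In practice, CMIP compatibility means writing NetCDF output with CF-style variable names, units, and coordinate metadata so standard tools can read it. A minimal sketch with xarray (the exact attribute set AIMIP requires may differ from this illustration):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy monthly-mean near-surface air temperature on a coarse grid
lat = np.arange(-88.0, 90.0, 4.0)
lon = np.arange(0.0, 360.0, 5.0)
time = pd.date_range("1979-01-01", periods=12, freq="MS")
data = np.full((time.size, lat.size, lon.size), 288.0)

tas = xr.DataArray(
    data,
    coords={"time": time, "lat": lat, "lon": lon},
    dims=("time", "lat", "lon"),
    name="tas",  # the CMIP variable name for near-surface air temperature
    attrs={"standard_name": "air_temperature", "units": "K"},
)
tas["lat"].attrs.update(standard_name="latitude", units="degrees_north")
tas["lon"].attrs.update(standard_name="longitude", units="degrees_east")
tas.to_dataset().to_netcdf("tas_Amon_example.nc")  # CMIP-style filename pattern
```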

Ai2 Climate Modeling and five outside organizations – the ArchesWeather group, NVIDIA, the University of Washington, the University of Maryland, and Google Research – submitted eight model simulations to AIMIP Phase 1.

Faithful representation of the historical climate, but challenges in predicting its changes

With the dataset, we can evaluate how well AI climate models are simulating the historical climate and its changes over the past several decades. We find that AI models, almost regardless of architectural choices, do very well at simulating average historical climate patterns—typically beating a conventional physically-based climate model at this task. The most accurate AI climate models can reduce the time-averaged error in fields like near-surface air temperature by a factor of 2.
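The quantity behind that comparison is essentially the error of a model’s time-mean climate relative to ERA5. Here is a sketch of one way to compute it (array names are illustrative; the actual evaluations may differ in detail):

```python
import numpy as np

def climatology_rmse(model_fields, ref_fields, lats_deg):
    """Area-weighted RMSE of the time-averaged (climatological) field.

    model_fields, ref_fields -- arrays of shape (time, lat, lon)
    lats_deg                 -- latitude of each grid row in degrees
    """
    bias_map = model_fields.mean(axis=0) - ref_fields.mean(axis=0)
    weights = np.cos(np.deg2rad(lats_deg))
    weights = weights / weights.mean()
    return float(np.sqrt((bias_map**2 * weights[:, None]).mean()))
```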

A more demanding test is whether the models capture the long-term warming trend visible in the historical record, especially beyond their training period and into the held-out final decade of ERA5 data. There, the picture is more mixed. Some models track the warming trend quite well, while others underestimate it significantly. While generalizing to future conditions is essential for climate change projections, it may matter less for use cases confined to an AI model's training period, such as informatics or sampling of climate risk factors.
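One common way to quantify the warming trend is a linear fit to the global-mean, annual-mean near-surface temperature series, compared between each model and ERA5. A sketch on synthetic data:

```python
import numpy as np

years = np.arange(1979, 2025)
rng = np.random.default_rng(1)
# Synthetic global-mean temperature anomalies: ~0.02 K/yr trend plus noise
tas_anom = 0.02 * (years - years[0]) + rng.normal(0.0, 0.1, years.size)

slope, _ = np.polyfit(years, tas_anom, 1)
print(f"fitted warming trend: {slope * 10:.2f} K per decade")
```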

Additionally, we evaluated the submitted models’ ability to simulate atmospheric responses to El Niño ocean conditions, day-to-day atmospheric variability, and a truly out-of-sample “shock” in which the global ocean surface is instantaneously warmed by 2 or 4 degrees Celsius. The latter isn’t a physically likely scenario, but it’s useful for understanding how AI models might generalize to unseen conditions. Perhaps unsurprisingly, the models’ predictions diverge significantly in this out-of-sample case, with some producing what appear to be physically implausible results.
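The shock experiment itself is conceptually simple: rerun the same rollout with the prescribed sea-surface temperatures instantaneously and uniformly increased. Continuing the hypothetical interface sketched earlier:

```python
def run_sst_shock(model, initial_atmos, boundary_forcings, delta_k=2.0):
    """Out-of-sample test: uniformly warm the prescribed ocean surface.

    delta_k -- globally uniform SST increase in kelvin
               (2.0 or 4.0 in AIMIP Phase 1's shock experiment)
    """
    state = initial_atmos
    trajectory = []
    for boundary in boundary_forcings:
        warmed = dict(boundary)                  # hypothetical dict of fields
        warmed["sst"] = warmed["sst"] + delta_k  # "sst" is an assumed key name
        state = model(state, warmed)
        trajectory.append(state)
    return trajectory
```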

Going forward: Open dataset and community evaluations

The AIMIP Phase 1 dataset is being hosted through the German Climate Computing Center (DKRZ), with publication to the Earth System Grid Federation (ESGF) planned to make it broadly accessible to the climate science community.  Scientists are already using the dataset to carry out further evaluations of AI climate models, with our work serving as an entry point for continued research.

The results from AIMIP Phase 1 suggest that one of the central challenges for AI climate models is responding robustly to a range of climate scenarios. Generalization, in other words, will be critical if these models are to be widely adopted by the scientific community. In particular, researchers need to be able to trust how AI climate models behave under unseen GHG emissions scenarios. Conventional climate model outputs may provide training data for some of these cases, but additional AI-specific approaches will likely be needed.

If AIMIP Phase 1 proves valuable to the community, and if AI climate modeling continues to advance at its current pace, future AIMIP phases will follow. These would likely expand to more complex coupled modeling, including of the ocean and sea ice; a broader set of scenarios such as GHG emissions pathways; and more extensive output requirements and evaluations.

We’re grateful to the modeling groups and partners who helped make AIMIP possible. We hope the initial dataset and our analysis of it will give the field a useful foundation for evaluating AI climate models, comparing different approaches, and identifying where more progress is needed.
