Data-driven discovery with large generative models
May 16, 2024
How do you boil the ocean? That impossible task is what researchers in every field attempt when they sort through the existing research relevant to their endeavors. As of 2022, it is estimated that over 5.14 million academic articles are published per year, including short surveys, reviews, and conference proceedings. This proliferation means researchers might overlook new ideas in favor of popular or widely used work, miss interconnections, and find it hard to arrive at novel, meaningful conclusions. AI can help tackle this issue head-on.
A Tale of Two Researchers
Let's start with a typical day for two different types of researchers. First, an AI researcher working on OLMo, Ai2's LLM framework. They meticulously analyze how OLMo addresses complex ethical and moral questions, sifting through data to detect and understand patterns in the model's responses. Their insights are vital and could enhance data collection and boost the model's performance in future updates.
Meanwhile, a social scientist focuses on the effects of various behaviors on the physical health and financial habits of the U.S. population. Working with data from the National Longitudinal Surveys, they carefully interpret survey data, crafting hypotheses that bridge multiple disciplines. Their work seeks to uncover significant patterns that can inform decisions in health, public policy, and urban planning.
Though their fields and goals differ, both researchers transform data, formulate hypotheses, conduct exploratory and predictive analyses, and identify significant patterns that could explain underlying phenomena. Right at the heart of this scientific process lies the potential for transformative acceleration through automation and assistance. With continuous data ingestion, creative idea generation, and analytical reasoning at a massive scale, researchers can catalyze notable progress in scientific inquiry.
Untapped Data: Underutilized Scientific Goldmines
Many datasets in observational and experimental sciences are underutilized today, ranging from computational science, social science, and health to climate science and astrophysics. Recognizing this potential, we aimed to harness the power of massive datasets and advancements in Large Generative Models (LGMs) to accelerate scientific discovery. Our work initiates a series of research articles that seek to achieve this goal and build systems that scientists can use to improve scientific processes and efficiency.
Scientific discoveries are rapidly expanding our shared knowledge, making it challenging for scientists to keep up, understand how findings are connected, come up with new ideas, and make sense of it all. While our ultimate goal encompasses the full spectrum of scientific inquiry, we focus first on end-to-end discovery from observational or experimental data, for two reasons: (1) an abundance of large-scale datasets that would benefit greatly from automated discovery, and (2) the practicality of automated verification enabled by existing data, without the need for additional data collection.
Developing an end-to-end discovery system is challenging. Previous works have either severely lacked the requisite computational power, developed domain-specific bespoke methodologies, or involved substantial human intervention (e.g., wet lab experiments), and thus do not qualify as autonomous end-to-end systems. In our recent position paper, published at the International Conference on Machine Learning (ICML) 2024, we argue that a focus on data-driven discovery using large generative models addresses each of these shortcomings and presents a practical first step toward an end-to-end system for automating the scientific process.
Main Contributions
We are the first to apply large generative models in the context of data-driven discovery. Defining data-driven discovery as a computational task is non-trivial and has rarely been done before. We define this paradigm as a heuristic search framework that aims to describe a given set of observations by uncovering the laws that govern its data-generating process.
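To make this framing concrete, here is a minimal, self-contained sketch of discovery as heuristic search: candidate "laws" are proposed and scored against observations, and the best-fitting one is kept. The candidate generator and scoring function are hypothetical stand-ins - in a real system they would be LGM calls and statistical tests, not an enumerated toy family of lines.

```python
import random

# Toy observations: y is (noisily) twice x; "y = 2 * x" is the law to recover.
random.seed(0)
data = [(x, 2 * x + random.gauss(0, 0.1)) for x in range(20)]

def candidate_hypotheses():
    """Stand-in for an LGM proposing candidate laws: here, 'y = k * x'."""
    return [("y = %d * x" % k, k) for k in range(-3, 4)]

def score(k, observations):
    """Stand-in for verification: mean squared error of the proposed law."""
    return sum((y - k * x) ** 2 for x, y in observations) / len(observations)

# Heuristic search: keep the candidate that best describes the observations.
best = min(candidate_hypotheses(), key=lambda h: score(h[1], data))
print(best[0])  # prints "y = 2 * x"
```

The point of the sketch is the loop structure - propose, verify against data, select - not the toy hypothesis space.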
Concretely, we show that a simple LGM-based framework built on a model such as GPT-4, which we call DataVoyager, is capable of uncovering a chain of insights: it performs data transformations, selects essential variables given a broad goal, develops initial hypotheses, generates and executes code to verify those hypotheses, and finally concludes with salient insights deemed verifiably correct through independent investigation. With user moderation, DataVoyager could replicate hypotheses from a set of published papers in social science, starting from just the underlying dataset and a high-level research goal.
Though the problem of data-driven discovery has been studied for decades, we point to critical advantages as well as limitations of LGMs for this task and conjecture, through anecdotal examples, that LGMs alone are not enough: user moderation and fail-proof tool integration are necessary for scale and efficiency. We hope our work brings new life to automated scientific discovery, and that LGMs initiate a Cambrian explosion of discovery.
Key Results & Surprises
Our initial attempt to autonomously replicate three social science papers based on the National Longitudinal Surveys (NLS) was overwhelmingly successful. To date, thousands of research papers have been published by gleaning insights from NLS, a longitudinal dataset collected since 1979. Not only did DataVoyager replicate some of the results previously obtained from NLS, but it also surfaced novel interdisciplinary connections - a classic case of Swanson linking. Here is an example of DataVoyager's workflow on NLS. Starting from a high-level query, it navigates through cycles of hypothesis generation, validation, and analysis to uncover complex insights. This replicates the research workflow of our imaginary social scientist working on NLS data, reducing their effort and enriching their scientific findings.
Skeptics may argue that GPT-4's training data includes replicated workflows and analyses on NLS available on the internet, rendering DataVoyager's success in NLS analysis potentially a result of memorization. While DataVoyager's verification process seeks to mitigate memorization and hallucination risks by generating programs, the remote possibility of data leakage impacting assessment remains.
In addition, we synthetically inverted the relation between demographics and wealth accumulation in the NLS data to check DataVoyager's ability to react to surprising discoveries. The system correctly analyzed the inverted data and flagged the counterintuitive results as surprising - a necessary property of a scientific assistant.
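One simple way such an inversion can be constructed - a sketch with made-up column names and values, not the paper's actual procedure - is to reflect one variable around its mean, which flips the sign of its correlation with the other variable while preserving its magnitude:

```python
# Hypothetical toy columns standing in for NLS variables.
education_years = [8, 10, 12, 12, 14, 16, 16, 18, 20, 21]
net_worth = [20, 35, 50, 55, 80, 110, 120, 160, 210, 240]  # in $1000s

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Reflect net worth around its mean: deviations change sign, spread is kept.
mean_w = sum(net_worth) / len(net_worth)
inverted = [2 * mean_w - w for w in net_worth]

r_original = pearson(education_years, net_worth)
r_inverted = pearson(education_years, inverted)
print(round(r_original, 3), round(r_inverted, 3))  # same magnitude, opposite sign
```

A system that merely parrots known NLS findings would report the familiar positive association; one that actually analyzes the data in front of it should report, and be surprised by, the flipped sign.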
For a robustness check of our anecdotal results, we simulated a knowledge frontier: we took a popular language agent repository, Reflexion, and modified its experiment design following our recent language agent algorithm, which was published after GPT-4's knowledge cutoff. Feeding the new experimental data to DataVoyager yielded coherent findings with a sound chain of reasoning - a surprising and encouraging result in line with our original insights. This highlights how our system can help our imaginary AI researcher, who focuses on uncovering insights from experimental data.
However, despite these successes, DataVoyager lacks reasoning and planning ability in complex cases requiring multi-step analysis, mirroring verdicts from recent works on LGMs' reasoning capabilities. Most importantly, free-form code generation for hypothesis verification was limited, and these limits could only be mitigated by providing access to external tools. For example, DataVoyager could not achieve domain-specific (long-tail) data transformation and hypothesis verification even with state-of-the-art code generation models. One of our key arguments summarizes this observation: interfacing with fail-proof tools and inference-time functions, and catering to long-tail domains with user moderation, are required to build an accurate, reliable, and robust data-driven discovery system capable of advancing scientific progress with speed and reproducibility.
DataVoyager
We identify two main challenges in automating data-driven discovery: (1) hypothesis search - effectively using data and existing knowledge to formulate novel hypotheses through data-driven reasoning, and (2) hypothesis verification - evaluating these hypotheses for rapid iteration and continual discovery.
DataVoyager, our proof-of-concept system, semantically understands datasets, programmatically explores verifiable hypotheses, runs basic statistical tests (e.g., correlation and regression analyses) by invoking predefined functions or generating code snippets, and analyzes the outputs in detail.
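As an illustration of the kind of predefined analysis function such a system might invoke - a sketch, with the function name and toy data invented for this post - here is a correlation-plus-regression routine for a single predictor, built from scratch:

```python
# Hypothetical predefined analysis function: Pearson correlation plus
# ordinary least-squares fit for one predictor, computed from first principles.
def correlate_and_regress(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx                      # OLS slope
    intercept = my - slope * mx            # OLS intercept
    r = sxy / (sxx * syy) ** 0.5           # Pearson correlation
    return {"slope": slope, "intercept": intercept, "r": r}

# Toy data with an exact linear relation y = 3x + 1.
result = correlate_and_regress([1, 2, 3, 4, 5], [4, 7, 10, 13, 16])
print(result)  # slope 3.0, intercept 1.0, r 1.0
```

Exposing such routines as callable functions, rather than relying on free-form generated code, is exactly the "fail-proof tool" integration argued for above: the statistics are computed by vetted code, and the LGM's job reduces to choosing variables and interpreting outputs.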
The core components of DataVoyager include specialized agents - planner, programmer, data expert, and critic - designed to manage various aspects of the data-driven discovery process, along with structured functions or programs for specific data analyses. The capabilities of the underlying LLM, such as function calls, code generation, and language generation, are critical for success.
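A minimal sketch of this role-based orchestration follows, using the four role names from above. The handlers are illustrative stubs - in DataVoyager each role would be backed by an LLM call with its own prompt, and the shared state would carry real data, code, and critiques.

```python
# Each role reads and augments a shared state dict (stub implementations).
def planner(state):
    state["plan"] = ["load data", "test correlation", "summarize"]
    return state

def data_expert(state):
    state["notes"] = "variables x and y are both continuous"  # schema insight
    return state

def programmer(state):
    state["code"] = "pearson(df['x'], df['y'])"  # generated snippet (stub)
    return state

def critic(state):
    # Approve only once a plan, data notes, and code are all present.
    state["approved"] = all(k in state for k in ("plan", "notes", "code"))
    return state

AGENTS = [planner, data_expert, programmer, critic]

def run_pipeline(goal):
    """Pass the shared state through each role in turn."""
    state = {"goal": goal}
    for agent in AGENTS:
        state = agent(state)
    return state

state = run_pipeline("does x predict y?")
print(state["approved"])  # True once every role has contributed
```

The design choice worth noting is the shared, inspectable state: because every role's contribution is recorded, a human moderator (or the critic) can audit and veto any step before execution.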
We demonstrate that DataVoyager is capable of data understanding, hypothesis generation, multi-step planning, and interdisciplinary knowledge integration, showing promise for ideal data-driven discovery - a capability not achievable before the widespread adoption of LLMs. However, DataVoyager's shortcomings in data transformation, scalability, hypothesis verification, accommodating human feedback, and resistance to p-hacking confirm that LLMs alone are inadequate: integrating robust, scalable tools and user-centric interventions is crucial for a successful data-driven discovery system.
Lessons Learned
While GPT-4 performs well in DataVoyager, other open-source language models (e.g., Llama-2, Mixtral-8x7B) struggle with the complex functions required for data-driven discovery: multi-step planning, code generation, and data analysis.
Crucially, LLMs often suffer from output hallucinations, exacerbated by issues of memorization and superposition - most problematic during hypothesis generation, planning, and output comprehension. This undermines the automation benefits, necessitating external verification and user moderation. Most of our examples require additional human feedback.
Diverse scientific domains require specific workflows and tools, making it challenging for autonomous systems like DataVoyager to function effectively across them. We had to enable access to domain-specific tools and cross-language tool-calling capabilities, and it remains unclear how these tools will interact with each other in complex scenarios.
Finally, data-driven discovery is a nascent field, preceded only by automated data analysis and AutoML. Therefore, existing benchmarks do not adequately meet our needs for evaluating data-driven discovery as they are confined to specific domains or are narrow in scope. We have had to rely on anecdotal evidence to demonstrate the capabilities and limitations of existing LGMs.
New Research Frontiers
The lack of comprehensive data-discovery benchmarks stems from the challenges of automating the evaluation of scientific discovery, as human evaluation is impractical at scale. There are two ways to handle automatic evaluation: 1) An outcome-based evaluation - verifying if the generated hypothesis matches the gold standard - overlooks syntactically different yet correct hypotheses; 2) a process-based evaluation, which measures how closely hypothesis verification follows a gold-standard workflow, depends on the availability of such workflows, which are often difficult to collect or may not be optimal. Our current focus includes developing a novel evaluation scheme to enable automatic or scaled evaluation of generated hypotheses.
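The contrast between the two evaluation modes can be made concrete with a small sketch - the scoring functions, hypotheses, and workflow steps below are all invented for illustration:

```python
# Outcome-based: compare the final hypothesis to a gold answer.
def outcome_score(predicted_hypothesis, gold_hypothesis):
    # Exact match misses paraphrases that are semantically identical.
    return 1.0 if predicted_hypothesis == gold_hypothesis else 0.0

# Process-based: measure overlap with a gold-standard workflow.
def process_score(steps, gold_steps):
    # Fraction of gold workflow steps that appear in the system's workflow.
    return sum(1 for s in gold_steps if s in steps) / len(gold_steps)

gold = "income increases with education"
steps = ["load NLS", "regress income on education", "report coefficient"]
gold_steps = ["load NLS", "regress income on education",
              "check confounders", "report coefficient"]

o = outcome_score("education raises income", gold)
p = process_score(steps, gold_steps)
print(o, p)  # 0.0 despite equivalent meaning; 0.75 despite a sound analysis
```

The sketch shows both failure modes at once: the outcome score penalizes a correct hypothesis phrased differently, while the process score penalizes a workflow that deviates from one particular gold standard - motivating the need for a better evaluation scheme.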
We are also collecting real hypotheses and associated workflows from various scientific domains, including computer science, climate science, social science, and biology. We ensure diversity in the workflows and the potential presence of compositional sub-flows, enabling systematic evaluation and model improvement. Such a benchmark should not only test the scientific reasoning capabilities of the underlying LLMs, but also aid in enhancing their overall reasoning abilities through improved performance on these benchmarks.
Finally, we are exploring agentic solutions that interact with data by generating code or calling functions for intermediate analysis, forming the basis of a hypothesis. To accommodate the dynamic needs of researchers and various domains, we are enhancing DataVoyager to include robust and cross-language tool usage (e.g., Python to R). Our research on agent-based frameworks is paving the way toward systems capable of long-term planning, task decomposition, and continual adaptation to user feedback.
Breakthrough AI for Science
Our primary goal is to assist scientists in their daily work using automated discovery systems. DataVoyager focuses on uncovering data-driven insights without the need for additional data collection - crucial for observational and experimental sciences. Research outcomes often depend on time, skill, and objectives, leaving much potential knowledge undiscovered. With automation, scientists can leverage the capabilities of large generative models, delegating routine tasks to focus on creative ideation.
Similarly, data-rich industries can greatly benefit from our system. Many industry professionals lack the training to perform the efficient data analysis that is crucial to their roles. LGM-powered automated data analysis will lower the barrier to entry, enabling more professionals to harness the power of data-driven science.
Our project stems from AI2's mission of using AI to address some of the most pressing scientific challenges facing humanity and the planet, particularly in climate science, genomics, materials science, and AI itself. Our efforts can propel scientific discovery, leading to breakthroughs in living standards and social well-being. This work is a fundamental part of AI2's initiative to develop an AI research assistant to support scientists in their everyday tasks.