# Ronan Le Bras

## Research

Ronan Le Bras is a research scientist at the Allen Institute for AI. His research interests include commonsense reasoning through natural language understanding, as well as computational methods for large-scale combinatorial optimization, reasoning, machine learning and human computation. He earned his Ph.D. in Computer Science from Cornell University in 2016. Prior to his Ph.D., he received his M.S. and B.S. from Ecole Polytechnique Montreal in Computer Engineering and in Software Engineering.

## Papers

## Abductive Commonsense Reasoning

**Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Scott Yih, Yejin Choi**ICLR | 2020Abductive reasoning is inference to the most plausible explanation. For example, if Jenny finds her house in a mess when she returns from work, and remembers that she left a window open, she can hypothesize that a thief broke into her house and caused the mess, as the most plausible explanation. While abduction has long been considered to be at the core of how people interpret and read between the lines in natural language (Hobbs et al. (1988)), there has been relatively little NLP research in support of abductive natural language inference. We present the first study that investigates the viability of language-based abductive reasoning. We conceptualize a new task of Abductive NLI and introduce a challenge dataset, ART, that consists of over 20k commonsense narrative contexts and 200k explanations, formulated as multiple choice questions for easy automatic evaluation. We establish comprehensive baseline performance on this task based on state-of-the-art NLI and language models, which leads to 68.9% accuracy, well below human performance (91.4%). Our analysis leads to new insights into the types of reasoning that deep pre-trained language models fail to perform -- despite their strong performance on the related but fundamentally different task of entailment NLI -- pointing to interesting avenues for future research.

## WinoGrande: An Adversarial Winograd Schema Challenge at Scale

**Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi**AAAI | 2020The Winograd Schema Challenge (WSC), proposed by Levesque et al. (2011) as an alternative to the Turing Test, was originally designed as a pronoun resolution problem that cannot be solved based on statistical patterns in large text corpora. However, recent studies suggest that current WSC datasets, even when composed carefully by experts, are still prone to such biases that statistical methods can exploit. We introduce WINOGRANDE, a new collection of WSC problems that are adversarially constructed to be robust against spurious statistical biases. While the original WSC dataset provided only 273 instances, WINOGRANDE includes 43,985 instances, half of which are determined as adversarial. Key to our approach is a novel adversarial filtering algorithm AFLITE for systematic bias reduction, combined with a careful crowdsourcing design. Despite the significant increase in training data, the performance of existing state-of-the-art methods remains modest (61.6%) and contrasts with high human performance (90.8%) for the binary questions. In addition, WINOGRANDE allows us to use transfer learning for achieving new state-of-the-art results on the original WSC and related datasets. Finally, we discuss how biases lead to overestimating the true capabilities of machine commonsense.

## PIQA: Reasoning about Physical Commonsense in Natural Language

**Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi**AAAI | 2020To apply eyeshadow without a brush, should I use a cotton swab or a toothpick? Questions requiring this kind of physical commonsense pose a challenge to today’s natural language understanding systems. While recent pretrained models (such as BERT) have made progress on question answering over more abstract domains – such as news articles and encyclopedia entries, where text is plentiful – in more physical domains, text is inherently limited due to reporting bias. Can AI systems learn to reliably answer physical commonsense questions without experiencing the physical world? In this paper, we introduce the task of physical commonsense reasoning and a corresponding benchmark dataset Physical Interaction: Question Answering or PIQA . Though humans find the dataset easy (95% accuracy), large pretrained models struggle (∼77%). We provide analysis about the dimensions of knowledge that existing models lack, which offers significant opportunities for future research.

## Social IQa: Commonsense Reasoning about Social Interactions

**Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, Yejin Choi**EMNLP | 2019We introduce Social IQa, the first largescale benchmark for commonsense reasoning about social situations. Social IQa contains 38,000 multiple choice questions for probing emotional and social intelligence in a variety of everyday situations (e.g., Q: "Jordan wanted to tell Tracy a secret, so Jordan leaned towards Tracy. Why did Jordan do this?" A: "Make sure no one else could hear"). Through crowdsourcing, we collect commonsense questions along with correct and incorrect answers about social interactions, using a new framework that mitigates stylistic artifacts in incorrect answers by asking workers to provide the right answer to a different but related question. Empirical results show that our benchmark is challenging for existing question-answering models based on pretrained language models, compared to human performance (>20% gap). Notably, we further establish Social IQa as a resource for transfer learning of commonsense knowledge, achieving state-of-the-art performance on multiple commonsense reasoning tasks (Winograd Schemas, COPA).

## Cosmos QA: Machine Reading Comprehension with Contextual Commonsense Reasoning

**Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi**EMNLP | 2019Understanding narratives requires reading between the lines, which in turn, requires interpreting the likely causes and effects of events, even when they are not mentioned explicitly. In this paper, we introduce Cosmos QA, a large-scale dataset of 35,600 problems that require commonsense-based reading comprehension, formulated as multiple-choice questions. In stark contrast to most existing reading comprehension datasets where the questions focus on factual and literal understanding of the context paragraph, our dataset focuses on reading between the lines over a diverse collection of people's everyday narratives, asking such questions as "what might be the possible reason of ...?", or "what would have happened if ..." that require reasoning beyond the exact text spans in the context. To establish baseline performances on Cosmos QA, we experiment with several state-of-the-art neural architectures for reading comprehension, and also propose a new architecture that improves over the competitive baselines. Experimental results demonstrate a significant gap between machine (68.4%) and human performance (94%), pointing to avenues for future research on commonsense machine comprehension. Dataset, code and leaderboard is publicly available at this https URL.

## SemEval 2019 Task 10: Math Question Answering

**Mark Hopkins, Ronan Le Bras, Cristian Petrescu-Prahova, Gabriel Stanovsky, Hannaneh Hajishirzi, and Rik Koncel-Kedziorski**SemEval-NAACL | 2019We report on the SemEval 2019 task on math question answering. We provided a question set derived from Math SAT practice exams, including 2778 training questions and 1082 test questions. For a significant subset of these questions, we also provided SMT-LIB logical form annotations and an interpreter that could solve these logical forms. Systems were evaluated based on the percentage of correctly answered questions. The top system correctly answered 45% of the test questions, a considerable improvement over the 17% random guessing baseline.

## ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning

**Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi**AAAI | 2019We present ATOMIC, an atlas of everyday commonsense reasoning, organized through 300k textual descriptions. Compared to existing resources that center around taxonomic knowledge, ATOMIC focuses on inferential knowledge organized as typed if-then relations with variables (e.g., "if X pays Y a compliment, then Y will likely return the compliment"). We propose nine if-then relation types to distinguish causes v.s. effects, agents v.s. themes, voluntary v.s. involuntary events, and actions v.s. mental states. By generatively training on the rich inferential knowledge described in ATOMIC, we show that neural models can acquire simple commonsense capabilities and reason about previously unseen events. Experimental results demonstrate that multitask models that incorporate the hierarchical structure of if-then relation types lead to more accurate inference compared to models trained in isolation, as measured by both automatic and human evaluation.

## Beyond Sentential Semantic Parsing: Tackling the Math SAT with a Cascade of Tree Transducers

**Mark Hopkins, Cristian Petrescu-Prahova, Roie Levin, Ronan Le Bras, Alvaro Herrasti, and Vidur Joshi**EMNLP | 2017We present an approach for answering questions that span multiple sentences and exhibit sophisticated cross-sentence anaphoric phenomena, evaluating on a rich source of such questions – the math portion of the Scholastic Aptitude Test (SAT). By using a tree transducer cascade as its basic architecture, our system (called EUCLID) propagates uncertainty from multiple sources (e.g. coreference resolution or verb interpretation) until it can be confidently resolved. Experiments show the first-ever results (43% recall and 91% precision) on SAT algebra word problems. We also apply EUCLID to the public Dolphin algebra question set, and improve the state-of-the-art F1-score from 73.9% to 77.0%.

## Phase-Mapper: An AI Platform to Accelerate High Throughput Materials Discovery

**Yexiang Xue, Junwen Bai, Ronan Le Bras, Brendan Rappazzo, Richard Bernstein, Johan Bjorck, Liane Longpre, Santosh K. Suram, Robert B. van Dover, John Gregoire and Carla P. Gomes**IAAI | 2017 | Best paper awardHigh-throughput materials discovery involves the rapid synthesis, measurement, and characterization of many different but structurally related materials. A central problem in materials discovery, the phase map identification problem, involves the determination of the crystal structure of materials from materials composition and structural characterization data. We present Phase-Mapper, a novel solution platform that allows humans to interact with both the data and products of AI algorithms, including the incorporation of human feedback to constrain or initialize solutions. Phase-Mapper is compatible with any spectral demixing algorithm, including our novel solver, AgileFD, which is based on convolutive non-negative matrix factorization. AgileFD allows materials scientists to rapidly interpret XRD patterns, and can incorporate constraints to capture the physics of the materials as well as human feedback. We compare three solver variants with previously proposed methods in a large-scale experiment involving 20 synthetic systems, demonstrating the efficacy of imposing physical constraints using AgileFD. Since the deployment of Phase-Mapper at the Department of Energy’s Joint Center for Artificial Photosynthesis (JCAP), thousands of X-ray diffraction patterns have been processed and the results are yielding discovery of new materials for energy applications, as exemplified by the discovery of a new family of metal oxide solar light absorbers, among the previously unsolved Nb-Mn-V oxide system, which is provided here as an illustrative example. Phase-Mapper is also being deployed at the Stanford Synchrotron Radiation Lightsource (SSRL) to enable phase mapping on datasets in real time.

## In Search of Balance: The Challenge of Generating Balanced Latin Rectangles

**Mateo Diaz, Ronan Le Bras and Carla P. Gomes**CP-AI-OR | 2017Spatially Balanced Latin Squares are combinatorial struc- tures of great importance for experimental design. From a computational perspective they present a challenging problem and there is a need for efficient methods to generate them. Motivated by a real-world applica- tion, we consider a natural extension to this problem, balanced Latin Rectangles. Balanced Latin Rectangles appear to be even more defiant than balanced Latin Squares, to such an extent that perfect balance may not be feasible for Latin rectangles. Nonetheless, for real applications, it is still valuable to have well balanced Latin rectangles. In this work, we study some of the properties of balanced Latin rectangles, prove the nonexistence of perfect balance for an infinite family of sizes, and present several methods to generate the most balanced solutions.

## Automated Phase Mapping with AgileFD and its Application to Light Absorber Discovery in the V–Mn–Nb Oxide System

**Santosh K. Suram, Yexiang Xue, Junwen Bai, Ronan Le Bras, Brendan Rappazzo, Richard Bernstein, Johan Bjorck, Lan Zhou, R. Bruce van Dover, Carla P. Gomes and John M. Gregoire**ACS Combinatorial Science | 2017Rapid construction of phase diagrams is a central tenet of combinatorial materials science with accelerated materials discovery efforts often hampered by challenges in interpreting combinatorial X-ray diffraction data sets, which we address by developing AgileFD, an artificial intelligence algorithm that enables rapid phase mapping from a combinatorial library of X-ray diffraction patterns. AgileFD models alloying-based peak shifting through a novel expansion of convolutional nonnegative matrix factorization, which not only improves the identification of constituent phases but also maps their concentration and lattice parameter as a function of composition. By incorporating Gibbs’ phase rule into the algorithm, physically meaningful phase maps are obtained with unsupervised operation, and more refined solutions are attained by injecting expert knowledge of the system. The algorithm is demonstrated through investigation of the V–Mn–Nb oxide system where decomposition of eight oxide phases, including two with substantial alloying, provides the first phase map for this pseudoternary system. This phase map enables interpretation of high-throughput band gap data, leading to the discovery of new solar light absorbers and the alloying-based tuning of the direct-allowed band gap energy of MnV2O6. The open-source family of AgileFD algorithms can be implemented into a broad range of high throughput workflows to accelerate materials discovery.

## Variable Elimination in the Fourier Domain

**Yexiang Xue, Stefano Ermon, Ronan Le Bras, Carla P. Gomes and Bart Selman**ICML | 2016The ability to represent complex high dimensional probability distributions in a compact form is one of the key insights in the field of graphical models. Factored representations are ubiquitous in machine learning and lead to major computational advantages. We explore a different type of compact representation based on discrete Fourier representations, complementing the classical approach based on conditional independencies. We show that a large class of probabilistic graphical models have a compact Fourier representation. This theoretical result opens up an entirely new way of approximating a probability distribution. We demonstrate the significance of this approach by applying it to the variable elimination algorithm. Compared with the traditional bucket representation and other approximate inference algorithms, we obtain significant improvements.

## ClouDiA: a deployment advisor for public clouds

**Tao Zou, Ronan Le Bras, Marcos Vaz Salles, Alan Demers, Johannes Gehrke**The VLDB Journal | 2015An increasing number of distributed data-driven applications are moving into shared public clouds. By sharing resources and operating at scale, public clouds promise higher utilization and lower costs than private clusters. To achieve high utilization, however, cloud providers inevitably instances of a given application may end-up in physically distant machines in the cloud. This allocation strategy can lead to large differences in average latency between instances. For a large class of applications, this difference can result in significant performance degradation, unless care is taken in how application components are mapped to instances. In this paper, we propose ClouDiA, a general deployment advisor that selects application node deployments minimizing either (i) the largest latency between application nodes, or (ii) the longest critical path among all application nodes. ClouDiA employs a number of algorithmic techniques, including mixed-integer programming and constraint programming techniques, to efficiently search the space of possible mappings of application nodes to instances. Through experiments with synthetic and real applications in Amazon EC2, we show that mean latency is a robust metric to model communication cost in these applications and that our search techniques yield a 15–55 % reduction in time-to-solution or service response time, without any need for modifying application code.

## Pattern Decomposition with Complex Combinatorial Constraints: Application to Materials Discovery

**Stefano Ermon, Ronan Le Bras, Santosh Suram, John M. Gregoire, Carla Gomes, Bart Selman and Robert B. van Dover**AAAI | 2015Identifying important components or factors in large amounts of noisy data is a key problem in machine learning and data mining. Motivated by a pattern decomposition problem in materials discovery, aimed at discovering new materials for renewable energy, e.g. for fuel and solar cells, we introduce CombiFD, a framework for factor based pattern decomposition that allows the incorporation of a-priori knowledge as constraints, including complex combinatorial constraints. In addition , we propose a new pattern decomposition algorithm , called AMIQO, based on solving a sequence of (mixed-integer) quadratic programs. Our approach considerably outperforms the state of the art on the materials discovery problem, scaling to larger datasets and recovering more precise and physically meaningful de-compositions. We also show the effectiveness of our approach for enforcing background knowledge on other application domains.

## A Human Computation Framework for Boosting Combinatorial Solvers

**Ronan Le Bras, Yexiang Xue, Richard Bernstein, Carla P. Gomes and Bart Selman**HCOMP | 2014We propose a general framework for boosting combinatorial solvers through human computation. Our framework combines insights from human workers with the power of combinatorial optimization. The combinatorial solver is also used to guide requests for the workers , and thereby obtain the most useful human feedback quickly. Our approach also incorporates a problem decomposition approach with a general strategy for discarding incorrect human input. We apply this framework in the domain of materials discovery, and demonstrate a speedup of over an order of magnitude.

## On the Erdos Discrepancy Problem

**Ronan Le Bras, Carla P. Gomes, and Bart Selman**CP | 2014According to the Erdős discrepancy conjecture, for any infinite ±1 sequence, there exists a homogeneous arithmetic progression of unbounded discrepancy. In other words, for any ±1 sequence (x1, x2, ...) and a discrepancy C, there exist integers m and d such that | m i=1 x i·d | > C. This is an 80-year-old open problem and recent development proved that this conjecture is true for discrepancies up to 2. Paul Erd˝ os also conjectured that this property of unbounded discrepancy even holds for the restricted case of completely multiplicative sequences (CMSs), namely sequences (x1, x2, ...) where x a·b = xa · x b for any a, b ≥ 1. The longest CMS with discrepancy 2 has been proven to be of size 246. In this paper, we prove that any completely multiplicative sequence of size 127, 646 or more has discrepancy at least 4, proving the Erd˝ os discrepancy conjecture for CMSs of discrepancies up to 3. In addition, we prove that this bound is tight and increases the size of the longest known sequence of discrepancy 3 from 17, 000 to 127, 645. Finally, we provide inductive construction rules as well as streamlining methods to improve the lower bounds for sequences of higher discrepancies.

## A Computational Challenge Problem in Materials Discovery: Synthetic Problem Generator and Real-World Datasets

**Ronan Le Bras, Richard Bernstein, John M. Gregoire, Santosh K. Suram, Carla P. Gomes, Bart Selman and Robert B. van Dover**AAAI | 2014Newly-discovered materials have been central to recent technological advances. They have contributed significantly to breakthroughs in electronics, renewable energy and green buildings, and overall, have promoted the advancement of global human welfare. Yet, only a fraction of all possible materials have been explored. Accelerating the pace of discovery of materials would foster technological innovations, and would potentially address pressing issues in sustainability, such as energy production or consumption. The bottleneck of this discovery cycle lies, however, in the analysis of the materials data. As materials scientists have recently devised techniques to efficiently create thousands of materials and experimentalists have developed new methods and tools to characterize these materials, the limiting factor has become the data analysis itself. Hence, the goal of this paper is to stimulate the development of new computational techniques for the analysis of materials data, by bringing together the complimentary expertise of materials scientists and computer scientists. In collaboration with two major research laboratories in materials science, we provide the first publicly available dataset for the phase map identification problem. In addition, we provide a parameterized synthetic data generator to assess the quality of proposed approaches, as well as tools for data visualization and solution evaluation.

## Crowdsourcing Backdoor Identification for Combinatorial Optimization

**Ronan Le Bras, Richard Bernstein, Carla P. Gomes, Bart Selman and Robert B. van Dover**IJCAI | 2013We will show how human computation insights can be key to identifying so-called backdoor variables in combinatorial optimization problems. Backdoor variables can be used to obtain dramatic speed-ups in combinatorial search. Our approach leverages the complementary strength of human input, based on a visual identification of problem structure , crowdsourcing, and the power of combina-torial solvers to exploit complex constraints. We describe our work in the context of the domain of materials discovery. The motivation for considering the materials discovery domain comes from the fact that new materials can provide solutions for key challenges in sustainability, e.g., in energy, new catalysts for more efficient fuel cell technology.

## Double-Wheel Graphs Are Graceful

**Ronan Le Bras, Carla P. Gomes and Bart Selman**IJCAI | 2013We present the first polynomial time construction procedure for generating graceful double-wheel graphs. A graph is graceful if its vertices can be labeled with distinct integer values from {0, ..., e}, where e is the number of edges, such that each edge has a unique value corresponding to the absolute difference of its endpoints. Graceful graphs have a range of practical application domains, including in radio astronomy, X-ray crystallography, cryptography , and experimental design. Various families of graphs have been proven to be graceful, while others have only been conjectured to be. In particular, it has been conjectured that so-called double-wheel graphs are graceful. A double-wheel graph consists of two cycles of N nodes connected to a common hub. We prove this conjecture by providing the first construction for graceful double-wheel graphs, for any N > 3, using a framework that combines streamlined constraint reasoning with insights from human computation. We also use this framework to provide a polynomial time construction for diagonally ordered magic squares.

## ClouDiA: A Deployment Advisor for Public Clouds

**Tao Zou, Ronan Le Bras, Marcos Vaz Salles, Alan Demers and Johannes Gehrke**VLDB | 2013An increasing number of distributed data-driven applications are moving into shared public clouds. By sharing resources and operating at scale, public clouds promise higher utilization and lower costs than private clusters. To achieve high utilization, however, cloud providers inevitably allocate virtual machine instances non-contiguously; i.e., instances of a given application may end-up in physically distant machines in the cloud. This allocation strategy can lead to large differences in average latency between instances. For a large class of applications, this difference can result in significant performance degradation, unless care is taken in how application components are mapped to instances. In this paper, we propose ClouDiA, a general deployment advisor that selects application node deployments minimizing either (i) the largest latency between application nodes, or (ii) the longest critical path among all application nodes. ClouDiA employs a number of algorithmic techniques, including mixed-integer programming and constraint programming techniques, to efficiently search the space of possible mappings of application nodes to instances. Through experiments with synthetic and real applications in Amazon EC2, we show that mean latency is a robust metric to model communication cost in these applications and that our search techniques yield a 15–55 % reduction in time-to-solution or service response time, without any need for modifying application code.

## Robust Network Design for Multispecies Conservation

**Ronan Le Bras, Bistra Dilkina, Yexiang Xue, Carla P. Gomes, Kevin S. McKelvey, Claire Montgomery and Michael K. Schwartz**AAAI | 2013Our work is motivated by an important network design application in computational sustainability concerning wildlife conservation. In the face of human development and climate change, it is important that conservation plans for protecting landscape connectivity exhibit certain level of robustness. While previous work has focused on conservation strategies that result in a connected network of habitat reserves, the ro-bustness of the proposed solutions has not been taken into account. In order to address this important aspect, we formalize the problem as a node-weighted bi-criteria network design problem with connectivity requirements on the number of disjoint paths between pairs of nodes. While in most previous work on survivable network design the objective is to minimize the cost of the selected network, our goal is to optimize the quality of the selected paths within a specified budget , while meeting the connectivity requirements. We characterize the complexity of the problem under different restrictions. We provide a mixed-integer programming encoding that allows for finding solutions with optimality guarantees, as well as a hybrid local search method with better scaling behavior but no guarantees. We evaluate the typical-case performance of our approaches using a synthetic benchmark, and apply them to a large-scale real-world network design problem concerning the conservation of wolverine and lynx populations in the U.S. Rocky Mountains (Montana).

## Large Landscape Conservation - Synthetic and Real-World Datasets

**Bistra Dilkina, Katherine Lai, Ronan Le Bras, Yexiang Xue, Carla P. Gomes, Ashish Sabharwal, Jordan Suter, Kevin S. McKelvey, Michael K. Schwartz and Claire Montgomery**AAAI | 2013Biodiversity underpins ecosystem goods and services and hence protecting it is key to achieving sustainability. However, the persistence of many species is threatened by habitat loss and fragmentation due to human land use and climate change. Conservation efforts are implemented under very limited economic resources, and therefore designing scalable, cost-efficient and systematic approaches for conservation planning is an important and challenging computational task. In particular, preserving landscape connectivity between good habitat has become a key conservation priority in recent years. We give an overview of landscape connectivity conservation and some of the underlying graph-theoretic optimization problems. We present a synthetic generator capable of creating families of randomized structured problems, capturing the essential features of real-world instances but allowing for a thorough typical-case performance evaluation of different solution methods. We also present two large-scale real-world datasets, including economic data on land cost, and species data for grizzly bears, wolverines and lynx.

## Solutions for Hard and Soft Constraints Using Optimized Probabilistic Satisfiability

**Marcelo Finger, Ronan Le Bras, Carla P. Gomes and Bart Selman**SAT | 2013Practical problems often combine real-world hard constraints with soft constraints involving preferences, uncertainties or flexible requirements. A probability distribution over the models that meet the hard constraints is an answer to such problems that is in the spirit of incorporating soft constraints. We propose a method using SAT-based reasoning, probabilistic reasoning and linear programming that computes such a distribution when soft constraints are interpreted as constraints whose violation is bound by a given probability. The method, called Optimized Probabilistic Satis-fiability (oPSAT), consists of a two-phase computation of a probability distribution over the set of valuations of a SAT formula. Algorithms for both phases are presented and their complexity is discussed. We also describe an application of the oPSAT technique to the problem of combinatorial materials discovery.

## From Streamlined Combinatorial Search to Efficient Constructive Procedures

**Ronan Le Bras, Carla P. Gomes and Bart Selman**AAAI | 2012In recent years, significant progress in the area of search, constraint satisfaction, and automated reasoning has been driven in part by the study of challenge problems from combinatorics and finite algebra. This work has led to the discovery of interesting discrete structures with intricate mathematical properties. While some of those results have resolved open questions and conjectures, a shortcoming is that they generally do not provide further mathematical insights, from which one could derive more general observations. We propose an approach that integrates specialized combina-torial search, using so-called streamlining, with a human computation component. We use this approach to discover efficient constructive procedures for generating certain classes of combinatorial objects of any size. More specifically, using our framework, we discovered two complementary efficient constructions for generating so-called Spatially Balanced Latin squares (SBLS) of any order N, such that 2N+1 is prime. Previously constructions for SBLSs were not known. Our approach also enabled us to derive a new lower bound for so-called weak Schur numbers, improving on a series of earlier results for Schur numbers.

## Materials Discovery: New Opportunities at the Intersection of Constraint Reasoning and Learning

**Ronan Le Bras, Stefano Ermon, Theodoros Damoulas, Richard Bernstein, Carla P. Gomes, Bart Selman and Robert B. van Dover**CompSust | 2012Combinatorial materials science involves the rapid, high-throughput synthesis, measurement, and analysis of a large number of different but structurally related materials. In combinatorial materials discovery, materials scientists search for intermetallic compounds with desirable physical properties by obtaining measurements on hundreds of samples from a composition spread. Determining the structure of the materials formed in a composition spread is key to understanding composition and property relations and can potentially result in a breakthrough materials discovery. This is an important and exciting direction in the emerging field of computational sustainability [4] as it aims to achieve the best possible use of our available material resources. One ultimate objective is to help discover the next-generation materials for fuel-cell catalysis, as such materials have the potential of dramatically increasing fuel cell capacity while reducing their cost. The analysis of composition spreads remains, however, a manual and laborious task. Thus the need for new techniques to automatically analyze and interpret such data. Whereas the data-intensive aspect of the area of materials discovery seems to favor Data-Mining or Machine Learning techniques, the rigorous and highly-structured physical properties that govern the crystallization on the composition spread interestingly suggest that constraint reasoning is key to a physically meaningful analysis. In this paper, we describe two novel approaches to this problem that integrate domain-specific scientific background knowledge about the physical and chemical properties of the materials. Our first approach combines constraint programming (CP) and machine learning (ML), while the second is based on satisfiability modulo theory (SMT). We evaluate the performance of our methods on realistic synthetic measurements, and we show that it provides accurate and physically meaningful interpretations of the data, even in the presence of artificially added noise.

## SMT-Aided Combinatorial Materials Discovery

**Stefano Ermon, Ronan Le Bras, Carla P. Gomes, Bart Selman and Robert B. van Dover**SAT | 2012In combinatorial materials discovery, one searches for new materials with desirable properties by obtaining measurements on hundreds of samples in a single high-throughput batch experiment. As manual data analysis is becoming more and more impractical, there is a growing need to develop new techniques to automatically analyze and interpret such data. We describe a novel approach to the phase map identification problem where we integrate domain-specific scientific background knowledge about the physical and chemical properties of the materials into an SMT reasoning framework. We evaluate the performance of our method on realistic synthetic measurements, and we show that it provides accurate and physically meaningful interpretations of the data, even in the presence of artificially added noise.

## Constraint Reasoning and Kernel Clustering for Pattern Decomposition With Scaling

**Ronan Le Bras, Theodoros Damoulas, John M. Gregoire, Ashish Sabharwal, Carla P. Gomes and Robert B. van Dover**CP | 2011Motivated by an important and challenging task encountered in material discovery, we consider the problem of finding K basis patterns of numbers that jointly compose N observed patterns while enforcing additional spatial and scaling constraints. We propose a Constraint Programming (CP) model which captures the exact problem structure yet fails to scale in the presence of noisy data about the patterns. We alleviate this issue by employing Machine Learning (ML) techniques, namely kernel methods and clustering, to decompose the problem into smaller ones based on a global data-driven view, and then stitch the partial solutions together using a global CP model. Combining the complementary strengths of CP and ML techniques yields a more accurate and scalable method than the few found in the literature for this complex problem.

## Efficient Generic Search Heuristics within the EMBP Framework

**Ronan Le Bras, Alessandro Zanarini and Gilles Pesant**CP | 2009Accurately estimating the distribution of solutions to a problem , should such solutions exist, provides efficient search heuristics. The purpose of this paper is to propose new ways of computing such estimates , with different degrees of accuracy and complexity. We build on the Expectation-Maximization Belief-Propagation (EMPB) framework proposed by Hsu et al. to solve Constraint Satisfaction Problems (CSPs). We propose two general approaches within the EMBP framework: we firstly derive update rules at the constraint level while enforcing domain consistency and then derive update rules globally, at the problem level. The contribution of this paper is twofold: first, we derive new generic update rules suited to tackle any CSP; second, we propose an efficient EMBP-inspired approach, thereby improving this method and making it competitive with the state of the art. We evaluate these approaches experimentally and demonstrate their effectiveness.