Understanding the World Through Code

Funded through the NSF Expeditions in Computing Program

Overview

In almost every field of science, it is now possible to capture large amounts of data. This has led machine learning to play an increasingly important role in scientific discovery, for example by sifting through large amounts of data to identify interesting events. But modern machine learning techniques are less well suited to the critical tasks of devising hypotheses consistent with the data or imagining new experiments to test those hypotheses.

The goal of our project is to develop new learning techniques that can help automate this process of generating scientific theories from data. In particular, we are working to develop methods for learning neurosymbolic models that combine neural elements capable of identifying complex patterns in data with symbolic constructs that are able to represent higher-level concepts. Our approach is based on the observation that programming languages provide a uniquely expressive formalism to describe complex models. Our aim is therefore to develop learning techniques that can produce models that look more like the models that scientists already write by hand in code. Our proposed techniques will more easily incorporate prior knowledge about the phenomena we want to model, and produce interpretable models that can be analyzed to devise new experiments or to infer causal relations. We believe these techniques have the potential to revolutionize the way scientists in a variety of fields interact with data. More broadly, our proposed techniques will be useful in any setting that requires learning more interpretable models with strong requirements on their desired behavior.

To drive this research, we have selected four domains where we believe these techniques can have significant impact: organic chemistry, RNA splicing, cognitive science / behavioral modeling, and computing systems. Machine learning is already demonstrating value in all of these domains, including predicting properties of organic compounds, recognizing complex social activities, and modeling CPU performance. However, our proposed techniques could have a transformative impact in each of them by helping scientists move from black-box predictions to a deeper understanding of the processes that give rise to the data.

This material is based upon work supported by the National Science Foundation under Grant No. 1918839. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.