Understanding the World Through Code

Funded through the NSF Expeditions in Computing Program


You can subscribe to our public mailing list to receive announcements about upcoming events.
10/2021: Neurosym webinar series. Xinyun Chen, Ph.D. candidate at UC Berkeley, will be talking about neural program synthesis.

Neural Program Synthesis for Language Understanding in the Wild

Deep neural networks have achieved remarkable success in natural language processing, especially with the advancement of pre-training techniques. Moreover, recent work shows that language models trained on a large-scale code corpus, such as Tabnine and Codex, can sometimes generate moderately complicated code from text descriptions. In this talk, I will discuss my research on deep learning for program synthesis with two central goals: (1) developing program synthesizers that learn to infer user intent for real-world deployment; and (2) improving the reasoning and generalization capabilities of existing language models via symbolic representations. First, I will discuss my SpreadsheetCoder work, where we aim to predict spreadsheet formulas only from the user-written tabular data, without requiring any explicit specification. The SpreadsheetCoder model was recently integrated into Google Sheets, and could potentially benefit hundreds of millions of users. In the second part of my talk, I will go beyond program synthesis applications, and discuss my work on neural-symbolic techniques for language understanding. Despite the tremendous achievements of pre-trained language models, large-scale training does not automatically yield the capability for complex reasoning beyond text pattern matching. By integrating a symbolic reasoning module that synthesizes and executes programs for the task of interest, our neural-symbolic models demonstrate superior compositional reasoning ability, including numerical reasoning and compositional generalization.

Bio: Xinyun Chen is a Ph.D. candidate at UC Berkeley, working with Prof. Dawn Song. Her research lies at the intersection of deep learning, programming languages, and security. Her recent research focuses on neural program synthesis and adversarial machine learning. She received the Facebook Fellowship in 2020, and was selected for Rising Stars in EECS in 2020 and 2021.

When: Tuesday October 26, 2021 04:00 PM Eastern Time (US and Canada)
Where: Zoom
9/2021: Neurosym webinar series. Petar Veličković, Senior Research Scientist at DeepMind, will be talking about neural algorithmic reasoning.

Neuralising a Computer Scientist: The Story So Far

Neural networks that are able to reliably execute algorithmic computation may hold transformative potential for both machine learning and theoretical computer science. On one hand, they could enable the kind of extrapolative generalisation scarcely seen with deep learning models. On the other, they may allow for running classical algorithms on inputs previously considered inaccessible to them. Both of these promises are shepherded by the neural algorithmic reasoning blueprint, which I have recently proposed in a position paper alongside Charles Blundell. On paper, this is a remarkably elegant pipeline for reasoning on natural inputs which carefully leverages the tried-and-tested power of deep neural networks as feature extractors. In practice, how far did we actually take it? In this talk, I will present three concrete steps we've recently taken towards viably deploying the blueprint at scale: (1) a dataset of algorithmic reasoning tasks, to be used as a bootstrapping basis; (2) using algorithmic reasoners to positively modulate self-supervised representations; and (3) data-efficient implicit planning using algorithmic reasoners. I will close with some thoughts on where we could go next.

Bio: Petar Veličković is a Senior Research Scientist at DeepMind. He holds a PhD in Computer Science from the University of Cambridge (Trinity College), obtained under the supervision of Pietro Liò. His research concerns geometric deep learning: devising neural network architectures that respect the invariances and symmetries in data (a topic he's co-written a proto-book about). Within this area, Petar focuses on graph representation learning and its applications in algorithmic reasoning and computational biology. He has published relevant research in these areas at both machine learning venues (NeurIPS, ICLR, ICML-W) and biomedical venues and journals (Bioinformatics, PLOS One, JCB, PervasiveHealth). In particular, he is the first author of Graph Attention Networks (a popular convolutional layer for graphs) and Deep Graph Infomax (a scalable local/global unsupervised learning pipeline for graphs, featured in ZDNet). Further, his research has been used in substantially improving the travel-time predictions in Google Maps (covered by outlets including CNBC, Engadget, VentureBeat, CNET, the Verge and ZDNet).

When: Tuesday Sept 28, 2021 04:00 PM Eastern Time (US and Canada)
Watch: Recorded Talk
09/2021: Our second annual meeting will be held September 13-14, 2021, at the CSAIL Stata Center. See the schedule.
7/2021: Neurosym webinar series. Hima Lakkaraju, Assistant Professor at Harvard University, will be talking about interpretable machine learning.

Towards Reliable and Robust Model Explanations

As machine learning black boxes are increasingly being deployed in domains such as healthcare and criminal justice, there is growing emphasis on building tools and techniques for explaining these black boxes in an interpretable manner. Such explanations are being leveraged by domain experts to diagnose systematic errors and underlying biases of black boxes. In this talk, I will present some of our recent research that sheds light on the vulnerabilities of popular post hoc explanation techniques such as LIME and SHAP, and also introduce novel methods to address some of these vulnerabilities. More specifically, I will first demonstrate that these methods are brittle, unstable, and vulnerable to a variety of adversarial attacks. Then, I will discuss two solutions to address some of the aforementioned vulnerabilities: (i) a Bayesian framework that captures the uncertainty associated with post hoc explanations and in turn allows us to generate explanations with user-specified levels of confidence, and (ii) a framework based on adversarial training that is designed to make post hoc explanations more stable and robust to shifts in the underlying data. I will conclude the talk by discussing our recent theoretical results which shed light on the equivalence and robustness of state-of-the-art explanation methods.

Bio: Hima Lakkaraju is an Assistant Professor at Harvard University focusing on explainability, fairness, and robustness of machine learning models. She has also been working with various domain experts in criminal justice and healthcare to understand the real-world implications of explainable and fair ML. Hima has recently been named one of the 35 innovators under 35 by MIT Tech Review, and has received best paper awards at the SIAM International Conference on Data Mining (SDM) and INFORMS. She has given invited workshop talks at ICML, NeurIPS, AAAI, and CVPR, and her research has also been covered by various popular media outlets including the New York Times, MIT Tech Review, TIME, and Forbes. For more information, please visit: https://himalakkaraju.github.io/

When: Tuesday Jul 27, 2021 04:00 PM Eastern Time (US and Canada)
Watch: Recorded Talk
4/2021: Neurosym webinar series. Percy Liang, Associate Professor of Computer Science at Stanford University, will be talking about machine learning for program repair.

Learning to Fix Programs

Programmers spend a huge amount of time fixing broken code. Our goal is to train neural models that can do this automatically. I will first present DrRepair, a system that learns to edit programs based on error messages. We leverage a large number of valid programs by artificially perturbing (and thus breaking) them. DrRepair obtains strong results on two tasks: fixing errors made by students and pseudocode-to-code translation. I will then present a new framework, Break-It-Fix-It (BIFI), which additionally leverages unlabeled broken code to learn a model that perturbs code to generate more realistic broken code. We show that this results in further improvements over DrRepair. Taken together, our work suggests that one can learn a lot just from unlabeled programs and a compiler, with no further manual annotations.
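As a rough illustration of the perturbation idea in the abstract (not DrRepair's actual implementation), the sketch below breaks valid Python snippets by deleting one punctuation character and keeps only the variants that fail to compile, yielding (broken, fixed) training pairs. The function names, the perturbation rule, and the use of Python's built-in compile() as the "compiler" are all assumptions made for this example.

```python
import random


def is_valid(source: str) -> bool:
    """Return True if `source` compiles as Python (our stand-in for a compiler check)."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


def perturb(source: str, rng: random.Random) -> str:
    """Hypothetical perturbation rule: delete one randomly chosen punctuation character."""
    positions = [i for i, ch in enumerate(source) if ch in "():,"]
    if not positions:
        return source
    i = rng.choice(positions)
    return source[:i] + source[i + 1:]


def make_training_pairs(programs, seed=0, tries=10):
    """Build (broken, fixed) pairs from valid programs, with no manual labels."""
    rng = random.Random(seed)
    pairs = []
    for good in programs:
        if not is_valid(good):
            continue  # only start from programs the compiler accepts
        for _ in range(tries):
            bad = perturb(good, rng)
            if not is_valid(bad):  # keep only perturbations that really break the code
                pairs.append((bad, good))
                break
    return pairs


programs = [
    "def add(a, b):\n    return a + b\n",
    "print(max(1, 2))\n",
]
pairs = make_training_pairs(programs)
for bad, good in pairs:
    print(repr(bad), "->", repr(good))
```

A repair model would then be trained to map each broken program (together with the compiler's error message) back to its valid original; that learning step is outside the scope of this sketch.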

Bio: Percy Liang is an Associate Professor of Computer Science at Stanford University (B.S. from MIT, 2004; Ph.D. from UC Berkeley, 2011). His research spans many topics in machine learning and natural language processing, including robustness, interpretability, semantics, and reasoning. He is also a strong proponent of reproducibility through the creation of CodaLab Worksheets. His awards include the Presidential Early Career Award for Scientists and Engineers (2019), IJCAI Computers and Thought Award (2016), an NSF CAREER Award (2016), a Sloan Research Fellowship (2015), a Microsoft Research Faculty Fellowship (2014), and multiple paper awards at ACL, EMNLP, ICML, and COLT.

When: Tuesday, April 27, 2021, 4-5pm Eastern Time
Watch: Recorded Talk
3/2021: Neurosym webinar series. Jacob Andreas, X Consortium Assistant Professor at MIT in EECS and CSAIL, will be talking about symbolic representation and reasoning in DNNs.

Implicit Symbolic Representation and Reasoning in Deep Neural Networks

Standard neural network architectures can *in principle* implement symbol processing operations like logical deduction and simulation of complex automata. But do current neural models, trained on standard tasks like image recognition and language understanding, learn to perform symbol manipulation *in practice*? I'll survey two recent findings about implicit symbolic behavior in deep networks. First, I will describe a procedure for automatically labeling neurons with compositional logical descriptions of their behavior. These descriptions surface interpretable learned abstractions in models for vision and language, reveal implicit logical "definitions" of visual and linguistic categories, and enable the design of simple adversarial attacks that exploit errors in definitions. Second, I'll describe ongoing work showing that neural models for language generation perform implicit simulation of entities and relations described by text. Representations in these language models can be (linearly) translated into logical representations of world state, and can be directly edited to produce predictable changes in generated output. Together, these results suggest that highly structured representations and behaviors can emerge even in relatively unstructured models trained on natural tasks. Symbolic models of computation can play a key role in helping us understand these models.

Bio: Jacob Andreas is the X Consortium Assistant Professor at MIT in EECS and CSAIL. He did his PhD work at Berkeley, where he was a member of the Berkeley NLP Group and the Berkeley AI Research Lab. He has also spent time with the Cambridge NLIP Group, and the Center for Computational Learning Systems and NLP Group at Columbia.

When: Tuesday, March 23, 2021, 4-5pm Eastern Time
Watch: Recorded Talk
2/2021: Neurosym webinar series. Mayur Naik, Professor of Computer and Information Science at the University of Pennsylvania, will be talking about differentiable reasoning.

Scallop: End-to-end Differentiable Reasoning at Scale

Approaches to systematically combine symbolic reasoning with deep learning have demonstrated remarkable promise in terms of accuracy and generalizability. However, the complexity of exact probabilistic reasoning renders these methods inefficient for real-world, data-intensive machine learning applications. I will present Scallop, a scalable differentiable probabilistic Datalog engine equipped with a top-k approximate inference algorithm. The algorithm significantly reduces the amount of computation needed for inference and learning tasks without affecting their principal outcomes. To evaluate Scallop, we have crafted a challenging dataset, VQAR, comprising 4 million Visual Question Answering (VQA) instances that necessitate reasoning about real-world images with external common-sense knowledge. Scallop not only scales to these instances but also outperforms state-of-the-art neural-based approaches by 12.44%.

Bio: Mayur Naik is a Professor of Computer and Information Science at the University of Pennsylvania. His research spans the area of programming languages, with a current emphasis on developing scalable techniques to reason about programs by combining machine learning and formal methods. He is also interested in foundations and applications of neuro-symbolic approaches that synergistically combine deep learning and symbolic reasoning. He received a Ph.D. in Computer Science from Stanford University in 2008. Previously, he was a researcher at Intel Labs, Berkeley from 2008 to 2011, and an assistant professor in the College of Computing at Georgia Tech from 2011 to 2016.

When: Tuesday, February 23 2021, 4-5pm EST
Watch: Recorded Talk
1/2021: Neurosym webinar series. Jiajun Wu, Assistant Professor of Computer Science at Stanford University, will be talking about some of his work on neurosymbolic approaches to computer vision.

Understanding the Visual World Through Code

Much of our visual world is highly regular: objects are often symmetric and have repetitive parts; indoor scenes such as corridors often consist of objects organized in a repetitive layout. How can we infer and represent such regular structures from raw visual data, and later exploit them for better scene recognition, synthesis, and editing? In this talk, I will present our recent work on developing neuro-symbolic methods for scene understanding. Here, symbolic programs and neural nets play complementary roles: symbolic programs are more data-efficient to train and generalize better to new scenarios, as they robustly capture high-level structure; deep nets effectively extract complex, low-level patterns from cluttered visual data. I will demonstrate the power of such hybrid models in three different domains: 2D image editing, 3D shape modeling, and human motion understanding.

Bio: Jiajun Wu is an Assistant Professor of Computer Science at Stanford University, working on computer vision, machine learning, and computational cognitive science. Before joining Stanford, he was a Visiting Faculty Researcher at Google Research. He received his PhD in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology. Wu's research has been recognized through the ACM Doctoral Dissertation Award Honorable Mention, the AAAI/ACM SIGAI Doctoral Dissertation Award, the MIT George M. Sprowls PhD Thesis Award in Artificial Intelligence and Decision-Making, the 2020 Samsung AI Researcher of the Year, the IROS Best Paper Award on Cognitive Robotics, and fellowships from Facebook, Nvidia, Samsung, and Adobe.

When: Tuesday January 26, 2021, 4-5pm EST
Watch: Recorded Talk
12/2020: Neurosym webinar series. Justin Gottschlich, Principal Scientist and the Director & Founder of Machine Programming Research at Intel Labs, will be talking about Machine Programming.

A Glance into Machine Programming @ Intel Labs

As defined by "The Three Pillars of Machine Programming", machine programming (MP) is concerned with the automation of software development. The three pillars partition MP into the following conceptual components: (i) intention, (ii) invention, and (iii) adaptation, with data being a foundational element that is generally necessary for all pillars. While the goal of MP is complete software automation – something that is likely decades away – we believe there are many seminal research opportunities waiting to be explored today across the three pillars.
In this talk, we will provide a glance into the new Pioneering Machine Programming Research effort at Intel Labs and how it has been established around the three pillars across the entire company. We will also discuss Intel Labs’ general charter for MP, as well as a few early research systems that we have built and are using today to improve the quality and rate at which we are developing software (and hardware) in production systems.

Bio: Justin Gottschlich is a Principal Scientist and the Director & Founder of Machine Programming Research at Intel Labs. He also has an academic appointment as an Adjunct Assistant Professor at the University of Pennsylvania. Justin is the Principal Investigator of the joint Intel/NSF CAPA research center, which focuses on simplifying the software programmability challenge for heterogeneous hardware. He co-founded the ACM SIGPLAN Machine Programming Symposium (previously Machine Learning and Programming Languages) and currently serves as its Steering Committee Chair. He is currently serving on two technical advisory boards: the 2020 NSF Expeditions “Understanding the World Through Code” and a new MP startup fully funded by Intel, which is currently in stealth.
Justin has a deep desire to build bridges with thought leaders across industry and academia to research disruptive technology as a community. Recently, he has been focused on machine programming, which is principally about automating software development. Justin currently has active collaborations with Amazon, Brown University, Georgia Tech, Google AI, Hebrew University, IBM Research, Microsoft Research, MIT, Penn, Stanford, UC-Berkeley, UCLA, and University of Wisconsin. He received his PhD in Computer Engineering from the University of Colorado-Boulder in 2011. Justin has 30+ peer-reviewed publications and 35+ issued patents, with 100+ patents pending.

When: Tuesday December 1, 2020, 4-5pm EST
Watch: Recorded Talk
10/2020: Neurosym webinar series. Abhinav Verma, PhD student at UT Austin, will talk about his recent work on reinforcement learning algorithms.

Programmatic Reinforcement Learning

We study reinforcement learning algorithms that generate policies that can be represented in expressive high-level Domain Specific Languages (DSLs). This work aims to simultaneously address four fundamental drawbacks of Deep Reinforcement Learning (Deep-RL), where the policy is represented by a neural network: interpretability, verifiability, reliability, and domain awareness. We formalize a new learning paradigm and provide empirical and theoretical evidence to show that we can generate policies in expressive DSLs that do not suffer from the above shortcomings of Deep-RL. To overcome the challenges of policy search in non-differentiable program space, we introduce a meta-algorithm that is based on mirror descent, program synthesis, and imitation learning. This approach leverages neurosymbolic learning, using synthesized symbolic programs to regularize Deep-RL and using the gradients available to Deep-RL to improve the quality of synthesized programs. Overall, this approach establishes a synergistic relationship between Deep-RL and program synthesis.

Bio: Abhinav Verma is a PhD student at UT Austin where he is supervised by Swarat Chaudhuri. His research lies at the intersection of machine learning and program synthesis, with a focus on programmatically interpretable learning. He is a recipient of the 2020 JP Morgan AI Research PhD Fellowship.

When: Tuesday October 27, 2020, 4-5pm Eastern Time
Watch: Recorded Talk
10/2020: We are having our official kickoff meeting. Some of the talks will be streamed online; see the schedule for the recordings.
9/2020: Neurosym webinar series. In the first talk in the series, Kevin Ellis — research scientist at Common Sense Machines, and soon to be faculty member at the Computer Science Department at Cornell — will talk about his recent work on growing domain specific languages.

Growing domain-specific languages alongside neural program synthesizers via wake-sleep program learning

Two challenges in engineering program synthesis systems are: (1) crafting specialized yet expressive domain specific languages, and (2) designing search algorithms that can tractably explore the space of expressions in this domain specific language. We take a step toward the joint learning of domain specific languages, and the search algorithms that perform synthesis in that language. We propose an algorithm which starts with a relatively minimal domain specific language, and then enriches that language by compressing out common syntactic patterns into a library of reusable domain specific code. In tandem, the system trains a neural network to guide search over expressions in the growing language. From a machine learning perspective, this system implements a wake-sleep algorithm similar to the Helmholtz machine. We apply this algorithm to AI and program synthesis problems, with the goal of understanding how domain specific languages and neural program synthesizers can mutually bootstrap one another.

Related paper

Bio: Kevin Ellis is a research scientist at Common Sense Machines, and recently finished a PhD at MIT under Armando Solar-Lezama and Josh Tenenbaum. He works on program synthesis and artificial intelligence. He will be moving to Cornell as an assistant professor in the computer science department starting fall 2021.

When: Tuesday September 29, 2020, 4-5pm Eastern Time
Watch: Recorded Talk
7/2020: Meet us at Tapia 2020. We will be present at Tapia 2020. If you are attending the (virtual) conference, come talk to us to learn more about the project and opportunities for undergraduate summer research.