Understanding the World Through Code

Funded through the NSF Expeditions in Computing Program

Software systems

While many of the research questions in this proposal are motivated by the natural sciences (e.g., organic chemistry or animal behavior), we believe that the proposed ideas can also be applied to a number of concerns in real-world software systems. In many ways, reasoning about large-scale software is akin to scientific discovery. For example, consider software security: as new attacks and exploits are discovered almost daily and existing software is patched to address them, it becomes increasingly important to construct models that explain what constitutes a security vulnerability or malicious behavior.

Working with code

Software engineering applications were one of the original driving forces behind the development of some of the neurosymbolic techniques leveraged by our project. We are continuing to explore applications ranging from program manipulation to bug finding and vulnerability detection. Below are some of our ongoing efforts in this space.

Neural-guided Program Synthesis for Code Transpilation

Due to the rapidly evolving nature of modern programming, many code bases need to be modernized, either by rewriting them in an entirely different language or by updating them to use different APIs. Motivated by this problem, our project [mariano2022automated-ours] aims to automate code transpilation using neural-guided program synthesis. We use a synthesis-based approach because the modernized version of the code is often written at a higher level of abstraction than the original version, which makes techniques like syntax-directed translation unsuitable in this setting. To cope with the difficulty of the synthesis problem, we take a neural-guided approach, meaning that the search performed by the synthesizer is guided by a neural network that has been trained offline. Our approach uses a new neural architecture, called a cognate grammar network, that is suited to the transpilation task, and it leverages a novel pruning technique to rule out incorrect translations. A publication summarizing these results will appear at OOPSLA’22, with extensions and applications to other problems (e.g., de-obfuscation) under way.
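
As a rough illustration of the general idea of guided search, the following sketch performs a best-first search over a tiny pipeline language and checks each candidate against the source program on a few test inputs. Everything in it is an assumption made for illustration: the operator set `OPS`, the fixed probability table `GUIDE` (a stand-in for a trained network), and the example function `legacy_sum_even_squares`. It does not reproduce the cognate grammar network or the pruning technique mentioned above.

```python
import heapq
import itertools
import math

# Hypothetical target language: pipelines of list-to-list operators.
OPS = {
    "keep_even": lambda xs: [x for x in xs if x % 2 == 0],
    "keep_positive": lambda xs: [x for x in xs if x > 0],
    "square": lambda xs: [x * x for x in xs],
    "double": lambda xs: [2 * x for x in xs],
    "total": lambda xs: [sum(xs)],
}

# Stand-in for a trained network: one fixed probability per operator.
GUIDE = {"keep_even": 0.3, "keep_positive": 0.1, "square": 0.3,
         "double": 0.1, "total": 0.2}


def legacy_sum_even_squares(xs):
    """Source program to modernize: an explicit accumulator loop."""
    acc = 0
    for x in xs:
        if x % 2 == 0:
            acc += x * x
    return [acc]


def run(pipeline, xs):
    """Apply each operator of the pipeline in order."""
    for name in pipeline:
        xs = OPS[name](xs)
    return xs


def synthesize(source, tests, max_len=3):
    """Best-first search: the most probable pipeline is expanded first."""
    counter = itertools.count()  # unique tie-breaker for heap entries
    frontier = [(0.0, next(counter), ())]
    while frontier:
        cost, _, pipeline = heapq.heappop(frontier)
        # Accept a candidate only if it agrees with the source on every test.
        if pipeline and all(run(pipeline, t) == source(t) for t in tests):
            return pipeline
        if len(pipeline) == max_len:
            continue
        for name, prob in GUIDE.items():
            # Cost is the negative log-probability of the whole pipeline.
            entry = (cost - math.log(prob), next(counter), pipeline + (name,))
            heapq.heappush(frontier, entry)
    return None


TESTS = [[1, 2, 3, 4], [-2, 5, 6], []]
pipeline = synthesize(legacy_sum_even_squares, TESTS)
print(pipeline)
```

Agreement on a handful of tests is only evidence of equivalence, not a proof; a realistic system would need a stronger correctness check.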

Working with data

Another early and promising application of neurosymbolic techniques is working with data, whether to manipulate and visualize it or to store and query it efficiently.

Querying and Visualizing Scientific Data using Program Synthesis

Data querying and visualization play a key role in many scientific disciplines, ranging from biology to physics. The goal of this project is to make it easier for scientists to query and visualize data using (neuro-symbolic) program synthesis.

One aspect of this project focuses on querying data that combines structured formats (e.g., a table or an XML document) with unstructured information (e.g., text). Such hybrid formats are very common in scientific applications, but they are not very amenable to data querying. In particular, purely neural approaches (e.g., those developed for natural language processing) fail to adequately handle the structured representation, while purely programmatic querying techniques (e.g., those based on SQL-like languages) fail to handle unstructured text. Our research addresses this problem by developing neuro-symbolic query DSLs, together with corresponding learning and synthesis techniques, that make it easier to query data in such hybrid formats. A publication summarizing our initial findings in this context appeared at PLDI’21.
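
To make the idea of a hybrid query concrete, here is a minimal sketch of a query that combines a symbolic predicate over a structured field with a text predicate over free-form notes. The `Record` schema, the combinators `field_gt`, `mentions`, `both`, and `select`, and the keyword table `LEXICON` are all hypothetical; in particular, `LEXICON` is a stand-in for a learned text model, and this is not the DSL from the publication.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Record:
    species: str    # structured field
    mass_kg: float  # structured field
    notes: str      # unstructured free text


Predicate = Callable[[Record], bool]

# Hypothetical concept lexicon standing in for a learned text model.
LEXICON = {"nocturnal": {"nocturnal", "night", "dusk"}}


def mentions(concept: str) -> Predicate:
    """'Neural' side: does the free text express the given concept?"""
    terms = LEXICON.get(concept, {concept})

    def check(record: Record) -> bool:
        words = {w.strip(".,;").lower() for w in record.notes.split()}
        return bool(words & terms)

    return check


def field_gt(name: str, bound: float) -> Predicate:
    """Symbolic side: compare a structured field against a bound."""
    return lambda record: getattr(record, name) > bound


def both(p: Predicate, q: Predicate) -> Predicate:
    """Conjunction of two predicates."""
    return lambda record: p(record) and q(record)


def select(records: List[Record], pred: Predicate) -> List[Record]:
    """Keep the records that satisfy the predicate."""
    return [r for r in records if pred(r)]


RECORDS = [
    Record("owl", 1.5, "Hunts at night near barns."),
    Record("hawk", 1.2, "Active in daylight."),
    Record("bat", 0.02, "Nocturnal; roosts in caves."),
]

QUERY = both(field_gt("mass_kg", 1.0), mentions("nocturnal"))
HEAVY_NOCTURNAL = [r.species for r in select(RECORDS, QUERY)]
print(HEAVY_NOCTURNAL)
```

Neither half of `QUERY` suffices alone: the mass filter cannot read the notes, and the text predicate knows nothing about the numeric column.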

Another aspect of this project focuses on generating visualizations from (tabular) data using program synthesis. We consider two ways to simplify visualization authoring. In one thread of work, we consider an interaction scenario in which the user creates a partial visualization with the aid of a graphical user interface, and our method completes it by synthesizing a visualization script that is consistent with the user-provided partial visualization. In another thread of work, we consider a natural language interface (NLI) for visualizations, wherein a visualization program is synthesized from the user’s natural language description. Initial results from this work appeared at POPL and CHI, with newer results involving NLIs currently under peer review.
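
The first scenario, completing a partial visualization, can be sketched as a small enumerative search. In the toy example below, the user has drawn a single bar, and the search returns every (x column, y column, aggregate) program whose rendered chart contains that bar. The table `TABLE`, the aggregate set `AGGREGATES`, and the functions `render` and `complete` are assumptions made for illustration and are far simpler than a real visualization language.

```python
from itertools import product

# Hypothetical input table.
TABLE = [
    {"site": "A", "year": 2020, "count": 3},
    {"site": "A", "year": 2021, "count": 5},
    {"site": "B", "year": 2020, "count": 4},
    {"site": "B", "year": 2021, "count": 1},
]

AGGREGATES = {"sum": sum, "max": max, "min": min}


def render(table, x, y, agg):
    """Evaluate a bar-chart program: group by x, aggregate the y values."""
    groups = {}
    for row in table:
        groups.setdefault(row[x], []).append(row[y])
    return {key: AGGREGATES[agg](values) for key, values in groups.items()}


def complete(table, partial):
    """Return all programs whose chart contains every user-drawn bar."""
    columns = list(table[0])
    consistent = []
    for x, y, agg in product(columns, columns, AGGREGATES):
        if x == y:
            continue
        try:
            bars = render(table, x, y, agg)
        except TypeError:
            continue  # e.g. summing a text column is not meaningful
        if all(bars.get(key) == height for key, height in partial.items()):
            consistent.append((x, y, agg))
    return consistent


# The user drew one bar: category "A" with height 8.
PROGRAMS = complete(TABLE, {"A": 8})
print(PROGRAMS)
print(render(TABLE, *PROGRAMS[0]))
```

When several programs are consistent with the partial visualization, a practical tool would need to rank them or ask the user for another example.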

Understanding software vulnerabilities

What constitutes a software vulnerability is not a static notion; it evolves as attackers discover new exploits and as once-reasonable security assumptions are challenged and become obsolete (e.g., regarding what constitutes a sufficiently good source of entropy). As a result, security experts must invest significant time and effort in understanding the root causes of vulnerable behavior, finding instances of these vulnerabilities, and recommending how to fix them. We believe that our proposed techniques can greatly facilitate the work of security analysts and software developers and ultimately make secure software development a much easier task than it is today. In this context, there are many sources of existing data that we can learn from, including known vulnerabilities cataloged in CVE databases and patches to existing software. We can learn from such data both a model of what constitutes a vulnerability and how to fix the underlying vulnerability. Recent efforts by our team and others have already demonstrated the feasibility of this approach for some specific types of vulnerabilities [long2015staged, Long2016apisan], and we believe that the ideas outlined in this proposal can help us dramatically expand the class of vulnerabilities that can be automatically detected and patched.

Characterizing and detecting malware

While vulnerabilities correspond to unintended mistakes in otherwise-benign code, another huge problem in software security is the presence of malicious applications that masquerade as useful software. Malicious applications are becoming a particularly severe concern due to the soaring popularity of smartphone applications and the near-infeasibility of manually vetting the enormous number of applications that are submitted to a growing number of app markets. We believe that our proposed ideas are ideally suited for understanding and detecting real-world malware by automatically synthesizing programmatic models of malicious behavior. While there is a large body of work on detecting malicious behavior using machine learning (and, in particular, deep learning), such techniques are not widely adopted in industry due to the opaque nature of these models. In particular, a significant drawback of standard machine learning models is that they fail to provide evidence of malice when an application is classified as malware. Prior work by PIs Dillig and Bastani has shown that it is feasible to automatically synthesize such malware-characterizing programs given samples of benign and malicious applications [feng2017automated]. Building on this prior work, we plan to develop techniques that can both synthesize models for a much larger class of malware and allow security analysts to gain insights into malicious behavior by querying the learned models (e.g., "which of these applications belong to the same malware family?" or "what changes are required to make this otherwise-useful application benign?").
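
To illustrate what a programmatic, evidence-bearing model could look like, the toy sketch below searches for the smallest set of behaviors that occurs in every malicious sample but never occurs together in a benign one. The behavior names, the sample sets `MALICIOUS` and `BENIGN`, and the functions `synthesize_signature` and `evidence` are hypothetical; this is not the method of the cited prior work, only an example of a model that can report why it flags an application.

```python
from itertools import combinations

# Hypothetical samples: each application is the set of behaviors it exhibits.
MALICIOUS = [
    {"read_contacts", "send_sms", "internet"},
    {"read_contacts", "send_sms", "camera"},
]
BENIGN = [
    {"read_contacts", "internet"},
    {"send_sms", "internet"},
    {"camera", "internet"},
]


def synthesize_signature(malicious, benign, max_size=3):
    """Smallest behavior set shared by all malware and matched by no benign app."""
    shared = sorted(set.intersection(*malicious))
    for size in range(1, max_size + 1):
        for candidate in combinations(shared, size):
            signature = set(candidate)
            if not any(signature <= app for app in benign):
                return signature
    return None


def evidence(app, signature):
    """Return the matched behaviors if the app is flagged, else None."""
    return sorted(signature) if signature <= app else None


SIGNATURE = synthesize_signature(MALICIOUS, BENIGN)
print(SIGNATURE)
print(evidence({"read_contacts", "send_sms", "bluetooth"}, SIGNATURE))
print(evidence({"send_sms", "internet"}, SIGNATURE))
```

Because the learned model is an explicit conjunction of behaviors, a flagged application comes with the matched behaviors as evidence, which an opaque classifier does not provide.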