Understanding the World Through Code

Funded through the NSF Expeditions in Computing Program

RNA Splicing

In most bacteria, the process of protein synthesis involves a transcription step, where a strand of messenger RNA is assembled as a copy of a gene with the help of RNA polymerase, followed by a translation step, where ribosomes decode the message into a sequence of amino acids that will fold into a protein. Back in the 1970s, however, co-PI Phillip Sharp and his team discovered that in eukaryotes, gene expression also involves splicing, where a complex of molecules called the spliceosome binds to the RNA and removes segments of non-coding RNA known as introns, leaving behind the expressed portions of the RNA strand known as exons.
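As a rough illustration of what splicing does to a transcript, the sketch below removes intron intervals from a toy pre-mRNA string and joins the remaining exons. The sequence and the intron coordinates are made up for the example, and the function name `splice` is hypothetical. The only biological detail it borrows is that introns typically begin with GU and end with AG.

```python
def splice(pre_mrna: str, introns: list[tuple[int, int]]) -> str:
    """Remove intron intervals (0-based, end-exclusive) and join the exons."""
    exons = []
    cursor = 0
    for start, end in sorted(introns):
        # Reject empty, overlapping, or out-of-range intervals.
        if start < cursor or start >= end or end > len(pre_mrna):
            raise ValueError(f"invalid intron interval: ({start}, {end})")
        exons.append(pre_mrna[cursor:start])  # exon preceding this intron
        cursor = end                          # skip over the intron
    exons.append(pre_mrna[cursor:])           # final exon
    return "".join(exons)


# Toy transcript: exon 1 + intron (starts with GU, ends with AG) + exon 2.
pre_mrna = "AUGGCC" + "GUAAGUUUUCAG" + "GAAUAA"
introns = [(6, 18)]  # assumed coordinates of the single intron

mature = splice(pre_mrna, introns)
print(mature)
```

In a real cell the hard part is deciding where the introns are, which is the prediction problem that the rest of this page is concerned with.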

In the years since that discovery, biologists have learned a great deal about the mechanisms involved in RNA splicing and the myriad RNA-binding proteins that regulate the action of the spliceosome. However, we are still far from a comprehensive model that would help us predict with certainty the effect that different interventions (whether mutations or the addition of small molecules) will have on how a given RNA sequence will be spliced.

[Check out our quick primer to learn more about splicing.]

Developing better models of splicing is critical to understanding a myriad of diseases. For example, many genetic diseases, including early-onset Parkinson's disease, spinal muscular atrophy, and amyotrophic lateral sclerosis, can be traced back to mutations that alter splicing. Splicing errors are also behind many forms of cancer.

Deep learning has already demonstrated some promise in turning the large amounts of data available about how particular genes are spliced into predictions about the effect of certain mutations on RNA splicing. However, the neurosymbolic learning techniques that we plan to develop as part of this proposal could allow us to develop richer models that capture more of what is known about the splicing process and the molecules involved. This could help us move beyond simply making predictions, to a real understanding of why particular mutations have the effects that they have.

We have made significant progress on this project, completing our first major objective: decoupling the identification of RBP binding sites from the prediction of splicing. Specifically, we have done this via the "Adjusted Motifs" approach, as depicted in the presentation below.

New research from our group (gupta2023improved-ours) aims to better understand how cells process genetic information to make proteins. Specifically, the model treats splicing as a two-stage process: it first models the attachment of RNA-binding proteins (RBPs), and then models splicing based on the locations of these attachments. This improves interpretability by allowing researchers to see which RNA-binding proteins are predicted to attach where, and to delete specific ones to see how splicing predictions change.
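The two-stage idea, and the "delete an RBP and re-predict" style of analysis it enables, can be sketched in a few lines. This is a toy illustration only, not the Adjusted Motifs model: the motif strings, the effect weights, the window size, and every name below (`predict_binding`, `splice_site_score`, `knockout`, `RBP_A`, `RBP_B`) are hypothetical placeholders, and exact string matching stands in for a learned binding model.

```python
# Hypothetical motifs and effects, for illustration only.
TOY_MOTIFS = {"RBP_A": "GAAGAA", "RBP_B": "UUUU"}
TOY_EFFECTS = {"RBP_A": 1.0, "RBP_B": -1.0}  # assumed enhancer / silencer weights


def predict_binding(seq, motifs):
    """Stage 1: return (rbp, position) for every exact motif match in seq."""
    sites = []
    for rbp, motif in motifs.items():
        start = seq.find(motif)
        while start != -1:
            sites.append((rbp, start))
            start = seq.find(motif, start + 1)  # allow overlapping matches
    return sorted(sites, key=lambda site: site[1])


def splice_site_score(position, sites, effects, window=10):
    """Stage 2: score a candidate splice site from binding sites alone.

    The raw sequence is deliberately not an input here, so every
    contribution to the score can be traced to a specific binding site.
    """
    return sum(effects[rbp] for rbp, pos in sites if abs(pos - position) <= window)


def knockout(sites, rbp):
    """Delete every predicted binding site of one RBP."""
    return [site for site in sites if site[0] != rbp]


seq = "CCGAAGAACAGGUAAGUCCUUUUGG"  # made-up sequence
candidate = 11                     # index of a GU we treat as a candidate site

sites = predict_binding(seq, TOY_MOTIFS)
baseline = splice_site_score(candidate, sites, TOY_EFFECTS)
without_b = splice_site_score(candidate, knockout(sites, "RBP_B"), TOY_EFFECTS)

print(sites)
print(baseline, without_b)
```

Because the second stage sees only the binding sites, the change between `baseline` and `without_b` can be attributed entirely to the removed protein, which is the kind of interpretability the paragraph above describes.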

Additionally, the project yielded improved models of RBP attachment, as validated using several experiments.

The key findings are:

Overall, this interpretable model integrates both lab measurements and genomic data to better reveal the "splicing code": the specific rules cells use to process RNA messages into proteins. In the future, the model could be used in both research and clinical applications.