In most bacteria, protein synthesis involves a transcription step, in which RNA polymerase assembles a strand of messenger RNA as a copy of a gene, followed by a translation step, in which ribosomes decode the messenger RNA into a sequence of amino acids that folds into a protein. In the 1970s, however, co-PI Phillip Sharp and his team discovered that in eukaryotes, transcription also involves splicing, in which a complex of molecules called the spliceosome binds to the RNA and removes segments of non-coding RNA known as introns, leaving behind the expressed portions of the RNA strand, known as exons.
In the years since that discovery, biologists have learned a great deal about the mechanisms involved in RNA splicing and the myriad RNA-binding proteins that regulate the action of the spliceosome. However, we are still far from a comprehensive model that would let us predict with certainty the effect that different interventions, whether mutations or the addition of small molecules, will have on how a given RNA sequence is spliced.
Developing better models of splicing is critical to understanding a wide range of diseases. For example, many genetic diseases, including early-onset Parkinson's disease, spinal muscular atrophy, and amyotrophic lateral sclerosis, can be traced back to mutations that alter splicing. Splicing errors are also behind many forms of cancer.
Deep learning has already shown some promise in turning the large amounts of available data on how particular genes are spliced into predictions about the effect of certain mutations on RNA splicing. However, the neurosymbolic learning techniques that we plan to develop as part of this proposal could allow us to build richer models that capture more of what is known about the splicing process and the molecules involved. This could help us move beyond simply making predictions to a real understanding of why particular mutations have the effects they do.