An RNA splicing primer
The problem of predicting RNA splicing is of crucial importance to biology, and
is one where formal methods and machine learning have a lot to contribute.
In this tutorial we explain the basics of the RNA splicing problem.
This content summarizes data from multiple sources
[WangCooper07, WangBurge08]
and is intended to serve as a quick primer for researchers from the ML and PL communities
who want to work on this problem.
The basics of DNA
To understand the RNA splicing problem, it is important to first understand how
cells make proteins. The starting point for the synthesis of a protein is a sequence of
DNA (the primary sequence). DNA is composed of 4 basic nucleotides:
adenine, guanine, thymine and cytosine, usually abbreviated as a, g, t and c.
The first two (adenine and guanine) are known as purines, and the other two
(thymine and cytosine)
are known as pyrimidines. The most important thing to know about DNA is that
adenine always pairs with thymine, and guanine with cytosine.
DNA is generally arranged as two complementary sequences of nucleotides, where every
base in the first sequence has a corresponding matching base in the second sequence
as illustrated in the figure. Each sequence has a starting point,
known as the 5' (five prime) end, and an end point,
known as the 3' end.
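The pairing rule can be sketched in a few lines of Python. This is a minimal illustration, not a bioinformatics library: the function name `reverse_complement` is our own, and the reversal relies on the standard fact (not stated above) that the two strands run in opposite directions, so the partner strand is reversed to keep the 5' to 3' reading convention.

```python
# Pairing rule from the text: a pairs with t, g pairs with c.
COMPLEMENT = {"a": "t", "t": "a", "g": "c", "c": "g"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand of `strand`, written 5' to 3'.

    Assumption: `strand` is given 5' to 3' and contains only a, g, t, c.
    """
    # Reverse first (the strands are antiparallel), then swap each base
    # for its pairing partner.
    return "".join(COMPLEMENT[base] for base in reversed(strand.lower()))


print(reverse_complement("gattaca"))
```

Applying the function twice returns the original strand, which is a quick sanity check that the pairing table is symmetric.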
A fragment of DNA that encodes a protein is known as a gene. The first step in
creating a protein from a gene is known as RNA transcription.
Transcription at a glance
The goal of transcription is to create a strand of RNA that copies the information
contained in the original DNA strand. During transcription, the two complementary
strands of DNA are separated, and an enzyme called RNA polymerase assembles a strand
of RNA by adding one nucleotide at a time to the 3' end of an RNA sequence. For a computer
scientist, the most important thing to know about RNA is that it is also composed
of nucleotides, but thymine is replaced with the pyrimidine uracil (abbreviated as u).
This strand of RNA is called a messenger RNA (mRNA) and it will
carry the information about how to construct the
protein out of the nucleus and into the
ribosomes where the proteins will be assembled.
Before that happens, and in fact while the
RNA polymerase is still assembling it, the RNA
must be spliced.
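At the level of information content, transcription can be sketched as a string rewrite. The two helper names below are our own. The sketch also assumes some standard background that the text above does not spell out: the polymerase reads one DNA strand (the template) and produces RNA that matches the other strand (the coding strand), with u in place of t.

```python
def transcribe(coding_strand: str) -> str:
    """RNA copy of the coding strand: same bases, with t replaced by u."""
    return coding_strand.lower().replace("t", "u")


# Pairing used when building RNA against a DNA template: a DNA 'a' pairs with RNA 'u'.
RNA_PARTNER = {"a": "u", "t": "a", "g": "c", "c": "g"}


def transcribe_from_template(template_strand: str) -> str:
    """Build the RNA from the template strand (given 5' to 3').

    The RNA grows at its 3' end, so the template is consumed in reverse.
    """
    return "".join(RNA_PARTNER[base] for base in reversed(template_strand.lower()))


coding = "atggtc"
template = "gaccat"  # the strand complementary to `coding`, written 5' to 3'
print(transcribe(coding))
print(transcribe_from_template(template))
```

Both calls produce the same RNA, which illustrates that the mRNA carries the same information as the coding strand of the gene.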
Splicing basics
Before the mRNA is spliced, it is called a pre-mRNA. The parts of the pre-mRNA
that will be cut out (spliced) are known as introns. Those that will stay as
part of the final RNA are called exons (the ex stands for expressed).
Some RNA sequences are always spliced the same way (this is known
as constitutive splicing), whereas in other instances
the same sequence can be spliced in multiple different ways depending on the
environment (this is known as alternative splicing).
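As a data-structure view of the above, a pre-mRNA can be treated as a string and its introns as index intervals. The sketch below is only an illustration under those assumptions: the function name `splice`, the half-open `(start, end)` interval convention, and the toy sequence are all ours.

```python
from typing import List, Tuple


def splice(pre_mrna: str, introns: List[Tuple[int, int]]) -> str:
    """Remove introns, given as non-overlapping half-open [start, end) intervals.

    What remains, concatenated in order, is the sequence of exons.
    """
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])  # exon before this intron
        position = end                          # skip over the intron
    exons.append(pre_mrna[position:])           # final exon
    return "".join(exons)


# Toy example (made-up sequence): exon + intron + exon.
exon1, intron, exon2 = "AUGG", "GUAAGUCUAACUUUCAG", "CUGA"
pre = exon1 + intron + exon2
intron_span = (len(exon1), len(exon1) + len(intron))

print(splice(pre, [intron_span]))
```

With this representation, alternative splicing corresponds to the same `pre_mrna` string being paired with different interval lists.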
The fundamental question that we want to answer is: How does splicing work?
More specifically, there are several different questions that we can ask.
The constitutive splicing code.
The simplest version of the question is: can we identify all the locations in
an RNA sequence that could be the start or end of an intron?
Some of the standard measures of quality for such a model are:
- False positives and false negatives in identifying start and end of introns
- Fraction of genes with correctly predicted splices
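The two measures above can be computed directly once predictions and labels share a representation. The sketch below assumes, purely for illustration, that each splice site is a `(kind, position)` pair and that a gene counts as correct only when its predicted set of sites matches the labeled set exactly; the text does not fix either convention.

```python
from typing import Dict, Set, Tuple

Site = Tuple[str, int]  # e.g. ("donor", 120) or ("acceptor", 348)


def site_errors(predicted: Set[Site], actual: Set[Site]) -> Tuple[int, int]:
    """Return (false positives, false negatives) for one sequence."""
    false_positives = len(predicted - actual)  # predicted but not real
    false_negatives = len(actual - predicted)  # real but missed
    return false_positives, false_negatives


def fraction_genes_correct(predicted: Dict[str, Set[Site]],
                           actual: Dict[str, Set[Site]]) -> float:
    """Fraction of genes whose predicted sites match the labels exactly."""
    correct = sum(
        1 for gene, sites in actual.items()
        if predicted.get(gene, set()) == sites
    )
    return correct / len(actual)


# Made-up labels and predictions for two hypothetical genes.
actual = {
    "geneA": {("donor", 10), ("acceptor", 50)},
    "geneB": {("donor", 7), ("acceptor", 30)},
}
predicted = {
    "geneA": {("donor", 10), ("acceptor", 50)},
    "geneB": {("donor", 7), ("acceptor", 33)},
}

print(site_errors(predicted["geneB"], actual["geneB"]))
print(fraction_genes_correct(predicted, actual))
```

Note that the per-gene measure is much stricter than the per-site one: a single misplaced acceptor makes the whole gene count as wrong.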
In the simplest case, we can just train a model with a set of RNA sequences
labeled with the start and end of all their introns and use it to make predictions
for new sequences. However, we would like to have models that satisfy the
following properties:
- Generalize well to unseen data
(bonus points if they generalize to brand new mutations).
- Transfer across species. It is unlikely that the models will generalize out-of-the-box
across species,
but it would be great to be able to do meta-learning to help you
train a model for a new species very efficiently with little data
based on knowledge accumulated from other species.
- Provide insight about the underlying splicing mechanisms.
- Incorporate our existing knowledge about splicing.
- Make use of all the different sources of data available.
Alternative splicing.
For sequences that are alternatively spliced, can we label each splice site
with the probability that splicing will occur at that point?
And even better, can we identify the actual sequences that can be produced from
a single sequence of pre-mRNA?
Some things we know about splicing
There is a broad literature on splicing, but below are a few of the things we know
about how splicing works that would be useful to incorporate into any model.
First, introns tend to follow a common pattern. The start of an intron (the 5' end),
often referred to as the donor site, usually begins with the sequence 'GU'.
The end of the intron (the 3' end), often referred to as the acceptor site,
usually ends with the sequence 'AG'. These patterns are known as consensus
sequences, and you can find published data on the matrices that define
these consensus sequences.
In the middle of the intron, there is a region known as the branch site that
always includes an 'A', and often has a distinctive pattern. Between the
branch site and the acceptor site, there is usually a
pyrimidine tract,
a sequence of 'U's or 'C's.
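One common way to use a consensus matrix is to score a short window of RNA by how much more likely each base is under the matrix than under a uniform background. The sketch below illustrates that idea only. The numbers in `DONOR_MATRIX` are HYPOTHETICAL values invented for this example, not published data, and the 4-base window and log-odds scoring are our assumptions.

```python
import math

# HYPOTHETICAL per-position base frequencies for the first 4 bases of an intron.
# Illustrative values only; real matrices should come from published data.
DONOR_MATRIX = [
    {"A": 0.01, "C": 0.01, "G": 0.97, "U": 0.01},  # position 1: almost always G
    {"A": 0.01, "C": 0.01, "G": 0.01, "U": 0.97},  # position 2: almost always U
    {"A": 0.60, "C": 0.05, "G": 0.30, "U": 0.05},
    {"A": 0.70, "C": 0.10, "G": 0.10, "U": 0.10},
]
BACKGROUND = 0.25  # assumed uniform base frequency


def matrix_score(window: str, matrix=DONOR_MATRIX) -> float:
    """Sum of log2 odds (matrix frequency vs background) over the window."""
    if len(window) != len(matrix):
        raise ValueError("window length must match the matrix length")
    return sum(
        math.log2(column[base] / BACKGROUND)
        for column, base in zip(matrix, window.upper())
    )


for window in ["GUAA", "GUCC", "CCAA"]:
    print(window, round(matrix_score(window), 2))
```

A window that matches the consensus at every position scores highest, one that matches only the 'GU' scores lower, and one that lacks the 'GU' is heavily penalized.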
Splicing is performed by a complex of
small nuclear ribonucleoproteins
(snRNPs, pronounced "snurps") known as the
spliceosome. The spliceosome
is composed of 5 snRNPs called U1, U2, U4, U5 and U6, and is aided and regulated by a
large number (about 150) of RNA-binding molecules.
The figure provides a cartoon illustration of how these snRNPs attach to the
RNA and manipulate it in order to splice out the intron into a structure known
as a lariat. The crucial question is: how do these snRNPs know where to attach?
The consensus sequences at the ends of the intron are not specific enough to regulate this
process. We cannot just say "Every GU is the start of an intron and every AG is
the end of one". Consensus sequences that don't correspond to the actual start or
end of an intron are known as cryptic splice sites.
We know the process is regulated by additional RNA-binding molecules
that attach to particular sequences of RNA known as Splicing Regulatory Elements (SREs).
Splicing regulatory elements (SREs)
There are 4 main types of SREs:
- ESE: Exonic splicing enhancer, a sequence within an exon that promotes splicing nearby.
- ESS: Exonic splicing silencer, a sequence within an exon that inhibits splicing nearby.
- ISE: Intronic splicing enhancer, a sequence within an intron that promotes splicing nearby.
- ISS: Intronic splicing silencer, a sequence within an intron that inhibits splicing nearby.
These splicing enhancers and silencers
are known as cis-acting elements (cis means "same") because they are patterns in the
RNA molecule that affect the splicing of that same molecule.
These elements work by recruiting trans-acting elements (trans means "different"):
separate proteins present in the nucleus that are recruited into the splicing
machinery.
For example, in the exon,
we generally find ESE sequences close to the exon-intron boundary. These help
recruit SR proteins (a type of RNA-binding protein) which then help the U1 snRNP
attach to the donor site. A consensus sequence that does not have this ESE nearby is
more likely to be a cryptic splice site. As another example, ESSs can bind
a type of protein called hnRNP I, which can inhibit splicing by blocking the
interaction between U1 and U2.
Usually, the SREs will correspond to short patterns. When modeling, it is common
to assume they are hexamers (aka 6mers or 6-base long patterns). One important
consideration is that some SRE sequences have been observed to be context
dependent, so they can act as silencers or enhancers depending on the context.
The patterns can also be species specific, and some categories of organisms rely more on intronic
SREs, whereas others rely more on exonic SREs.
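Under the hexamer assumption, a natural first feature for a model is the set of 6-mers found near a candidate splice site. The sketch below shows one way to extract them; the function names and the default window of 20 bases on each side are arbitrary choices of ours, not values from the literature.

```python
from collections import Counter


def count_kmers(rna: str, k: int = 6) -> Counter:
    """Count every overlapping k-mer in the sequence."""
    rna = rna.upper()
    return Counter(rna[i:i + k] for i in range(len(rna) - k + 1))


def kmers_near(rna: str, site: int, window: int = 20, k: int = 6) -> Counter:
    """Count k-mers that lie entirely within `window` bases of `site`."""
    low = max(0, site - window)
    high = min(len(rna), site + window)
    return count_kmers(rna[low:high], k)


# Made-up sequence with a repeated motif, to show overlapping counts.
counts = count_kmers("GAAGAAGAAG")
for hexamer, n in sorted(counts.items()):
    print(hexamer, n)
```

Because hexamers overlap, a 10-base sequence yields 5 of them. Context dependence, as described above, means that the same hexamer may need different weights depending on where it falls relative to the site, which this simple count does not capture.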
In many cases, the action of SREs is additive, so more enhancers nearby will
increase the probability that a particular consensus sequence is a real splice
site. However, sometimes the interactions are non-linear, so a particular
combination of enhancers will have a strong effect even if each enhancer individually
would have none.
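The difference between additive and non-linear effects can be made concrete with a small scoring function. This is a sketch of one possible model form (a logistic function over per-SRE weights plus pairwise interaction terms), not an established splicing model. All hexamers and weights below are HYPOTHETICAL and chosen only to illustrate the two behaviors.

```python
import math
from itertools import combinations
from typing import List

# HYPOTHETICAL weights, for illustration only.
ADDITIVE = {"GAAGAA": 1.0, "UUAGGG": -1.5}          # enhancer > 0, silencer < 0
PAIRWISE = {frozenset(["CAGCAG", "UGCAUG"]): 2.0}   # only matters as a pair


def site_probability(nearby_sres: List[str], bias: float = -1.0) -> float:
    """Probability that a consensus sequence is a real splice site.

    Additive part: every occurrence of an SRE contributes its own weight.
    Non-linear part: some pairs contribute only when both are present.
    """
    score = bias + sum(ADDITIVE.get(sre, 0.0) for sre in nearby_sres)
    present = sorted(set(nearby_sres))
    score += sum(
        PAIRWISE.get(frozenset(pair), 0.0) for pair in combinations(present, 2)
    )
    return 1.0 / (1.0 + math.exp(-score))


print(site_probability([]))
print(site_probability(["GAAGAA"]))
print(site_probability(["GAAGAA", "GAAGAA"]))
print(site_probability(["CAGCAG"]))
print(site_probability(["CAGCAG", "UGCAUG"]))
```

In this toy setup each extra copy of the additive enhancer raises the probability, whereas the two paired hexamers change nothing on their own and only have an effect together.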
A key aspect of the models we are trying to build will be to incorporate what
is known about existing SREs to help train models with less data, as well as
to generate new hypotheses about SREs and their interactions.