Splice site prediction in eukaryote genome sequences: the algorithmic issues.
Translating a gene into a protein starts with copying the part of the genome that codes for the protein into a primary transcript, also called precursor RNA (pre-mRNA). In eukaryotes, the primary transcripts of most protein-encoding nuclear genes are interrupted by introns, which are removed by a process called splicing. The transcript acts as a messenger that carries the code from the cell nucleus, which contains the DNA, to the cytoplasm, where the mature mRNA is translated into a protein. Before leaving the nucleus, the pre-mRNA is spliced to obtain the mature mRNA: the splicing process identifies the non-coding parts of the pre-mRNA, the introns, and excises them. The biological machinery that performs the actual splicing is called the spliceosome. It is a huge protein complex formed through the dynamic association of smaller complexes called snRNPs, each embedding an RNA component (snRNA); the snRNAs are labeled U1, U2, U4, U5 and U6. The snRNPs recognize features in introns that serve as landmarks for splicing. The snRNAs play a central role in this process by base-pairing with specific binding sites on the pre-mRNA and/or with each other. In a simplified splicing model, the transition from a coding (exon) to a non-coding (intron) part of the pre-mRNA, called the donor site, is identified by the U1 snRNP through base-pairing of the U1 snRNA with this donor site. The U2 snRNP recognizes a branch site located somewhere downstream of the donor site, through base-pairing of the U2 snRNA with the branch site. The U1 and U2 snRNPs come together while the other snRNPs associate to form the fully functional spliceosome. The pre-mRNA is cut at the donor site and the free intron end loops back to the branch site, forming a lariat. Only then is the transition from the intron to the next exon, called the acceptor site, recognized. This acceptor site is usually the first AG dinucleotide located downstream of the branch site.
In practice, an intron can therefore be defined as the part of the pre-mRNA located between a donor site and an acceptor site. Consequently, locating the non-coding parts of the pre-mRNA comes down to identifying donor and acceptor sites in pairs on the pre-mRNA.
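The pairing of donor and acceptor sites described above can be illustrated with a minimal Python sketch. It is not the prediction method of the chapter: it assumes the canonical GT...AG intron boundaries (the text above only mentions the AG acceptor dinucleotide; the GT donor dinucleotide is the widely known convention, written here in the DNA alphabet), and the function names and the `min_length` threshold are hypothetical choices made for illustration. It merely enumerates every candidate intron that a real predictor would then have to score.

```python
def candidate_sites(seq, dinucleotide):
    """Return the 0-based start positions of every occurrence of a dinucleotide."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinucleotide]


def candidate_introns(seq, min_length=4):
    """Pair each candidate donor (GT) with every downstream candidate acceptor (AG).

    Returns (start, end) tuples such that seq[start:end] begins with GT and
    ends with AG. min_length is a hypothetical lower bound on intron length;
    the default only prevents the two dinucleotides from overlapping.
    """
    donors = candidate_sites(seq, "GT")      # assumed canonical donor signal
    acceptors = candidate_sites(seq, "AG")   # acceptor signal named in the text
    pairs = []
    for d in donors:
        for a in acceptors:
            end = a + 2                      # end is exclusive, just past the AG
            if end - d >= min_length:
                pairs.append((d, end))
    return pairs


def splice(seq, intron):
    """Remove one candidate intron, given as a (start, end) pair, from seq."""
    start, end = intron
    return seq[:start] + seq[end:]
```

Even on a toy sequence such as `ATGGTAAGTCCAGGCT`, `candidate_introns` returns several overlapping candidates, which hints at why the dinucleotides alone are insufficient and why donor and acceptor sites need to be evaluated in their sequence context.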
Degroeve, S., Saeys, Y., De Baets, B., Van de Peer, Y., Rouzé, P. (2004) Splice site prediction in eukaryote genome sequences: the algorithmic issues. In: J. Seckbach (ed.) The New Avenues in Bioinformatics, a book of the Cellular Origin and Life in Extreme Habitats book series. Kluwer Academic Publishers, Dordrecht, The Netherlands.