Predicting splice sites from high-dimensional local context representations.

Degroeve, S., Saeys, Y., De Baets, B., Rouzé, P., Van de Peer, Y.

Corresponding author:

Abstract



Motivation
In this age of complete genome sequencing, finding the location and structure of genes is crucial for further molecular research. The accurate prediction of intron boundaries greatly facilitates the correct prediction of gene structure in nuclear genomes. Many tools for localizing these boundaries on DNA sequences have been developed and are available to researchers through the internet. Nevertheless, these tools still make many false positive predictions.

Results
This manuscript presents a novel, publicly available splice site prediction tool named SpliceMachine that (i) shows state-of-the-art prediction performance on Arabidopsis thaliana and human sequences, (ii) performs a computationally fast annotation and (iii) can be trained by users on their own data.

Availability
Results, figures and software are available at http://bioinformatics.psb.ugent.be/supplementary_data/svgro/splicemachine/

Contact
sven.degroeve@psb.ugent.be; yves.vandepeer@psb.ugent.be.

Supplementary Data

Methods and Data

Parameter Optimization

The following contour plots show the (p,q) optimization results for each feature set. A contour plot is a graphical technique for representing a 3-dimensional surface by plotting constant z slices, called contours, in a 2-dimensional format. That is, given a value for FN5% (see manuscript for an explanation), lines are drawn connecting the (p,q) coordinates where that FN5% value occurs. The plots visualize the area of optimal (p,q) values.
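The contour construction described above can be illustrated with a minimal sketch. It is not the code used to produce the plots: the surface `synthetic_fn5`, the grid ranges and every function name are hypothetical, chosen only to show how constant-value points are located on a (p,q) grid by linear interpolation along grid edges. Real FN5% values would come from the optimization runs described in the manuscript.

```python
from typing import Dict, List, Sequence, Tuple

Grid = Dict[Tuple[int, int], float]


def synthetic_fn5(p: int, q: int) -> float:
    """Hypothetical bowl-shaped surface; NOT measured SpliceMachine results."""
    return 0.10 + 0.0004 * (p - 60) ** 2 + 0.0009 * (q - 40) ** 2


def build_grid(p_values: Sequence[int], q_values: Sequence[int]) -> Grid:
    """Evaluate the stand-in surface at every (p, q) grid point."""
    return {(p, q): synthetic_fn5(p, q) for p in p_values for q in q_values}


def best_setting(grid: Grid) -> Tuple[int, int]:
    """Return the (p, q) grid point with the lowest value."""
    return min(grid, key=grid.get)


def contour_points(
    grid: Grid,
    p_values: Sequence[int],
    q_values: Sequence[int],
    level: float,
) -> List[Tuple[float, float]]:
    """Find where the surface crosses `level` along grid edges.

    For each pair of neighbouring grid points whose values lie on opposite
    sides of `level`, the crossing is placed by linear interpolation.
    Joining these points gives one contour line. Grid points that hit
    `level` exactly are ignored in this simplified sketch.
    """
    points = []
    for i, p in enumerate(p_values):
        for j, q in enumerate(q_values):
            z0 = grid[(p, q)]
            neighbours = []
            if i + 1 < len(p_values):
                neighbours.append((p_values[i + 1], q))  # edge along p
            if j + 1 < len(q_values):
                neighbours.append((p, q_values[j + 1]))  # edge along q
            for p1, q1 in neighbours:
                z1 = grid[(p1, q1)]
                if (z0 - level) * (z1 - level) < 0:
                    t = (level - z0) / (z1 - z0)
                    points.append((p + t * (p1 - p), q + t * (q1 - q)))
    return points


# Assumed grid ranges, for illustration only.
P_VALUES = list(range(20, 101, 10))
Q_VALUES = list(range(10, 81, 10))
GRID = build_grid(P_VALUES, Q_VALUES)

if __name__ == "__main__":
    print("lowest grid value at (p, q) =", best_setting(GRID))
    for point in contour_points(GRID, P_VALUES, Q_VALUES, level=0.2):
        print("contour point:", point)
```

Repeating `contour_points` for several levels gives nested contours. The innermost one encloses the (p,q) area with the lowest values, which is the area the plots are meant to reveal.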

Prediction Performance

The results on the Arabidopsis and human GeneSplicer data sets:

The results on AraSet:

Online Prediction Server

  • Run SpliceMachine online
  • Create your own training sets. (Download - Instructions inside.)
  • Send us your own training sets for use with the online prediction server

Contact:
VIB / UGent
Bioinformatics & Evolutionary Genomics
Technologiepark 927
B-9052 Gent
BELGIUM
+32 (0) 9 33 13807 (phone)
+32 (0) 9 33 13809 (fax)

Don't hesitate to contact us in case of problems with the website!