Abstract
Motivation: Promoter prediction is an important task in genome annotation projects, and during the past years many new promoter prediction programs (PPPs) have emerged. However, many of these programs are compared inadequately to other programs. In most cases, only a small portion of the genome is used to evaluate the program, which is not a realistic setting for whole genome annotation projects. In addition, a common evaluation design to properly compare PPPs is still lacking.
Results: We present a large-scale benchmarking study of 17 state-of-the-art PPPs. A multi-faceted evaluation strategy is proposed that can be used as a gold standard for promoter prediction evaluation, allowing authors of promoter prediction software to compare their method to existing methods in a proper way. This evaluation strategy is subsequently used to compare the chosen promoter predictors, and an in-depth analysis on predictive performance, promoter class specificity, overlap between predictors and positional bias of the predictions is conducted.
Reference
Abeel, T., Van de Peer, Y., Saeys, Y. 2009. Towards a gold standard for promoter prediction evaluation. Bioinformatics. 25(12):i313-i320.
PubMed - Bioinformatics
Important note for novel promoter prediction programs!
When using pppBenchmark to validate a newly developed program, it is important that you do not train your program on the same data that pppBenchmark uses for validation.
The proper procedure is to split the data into training and test sets, and to run your program and pppBenchmark on each split. The best way to split the data is at the level of complete chromosomes. For example, use 23 chromosomes for training and the remaining one for validation, and repeat this with each chromosome left out in turn.
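The leave-one-chromosome-out procedure described above can be sketched in a few lines of Python. This is only an illustration: the function name `leave_one_chromosome_out` and the `(chromosome, payload)` record layout are assumptions made for the example, not the pppBenchmark file format or API.

```python
from collections import defaultdict


def leave_one_chromosome_out(records):
    """Yield (held_out, training, validation) once per chromosome.

    `records` is an iterable of (chromosome, payload) pairs. This layout is
    hypothetical and chosen for illustration only.
    """
    # Group the records by chromosome so that a split never cuts through one.
    by_chrom = defaultdict(list)
    for chrom, payload in records:
        by_chrom[chrom].append((chrom, payload))

    chromosomes = sorted(by_chrom)
    for held_out in chromosomes:
        # Train on every chromosome except the held-out one.
        training = [
            rec
            for chrom in chromosomes
            if chrom != held_out
            for rec in by_chrom[chrom]
        ]
        # Validate on the held-out chromosome only.
        validation = list(by_chrom[held_out])
        yield held_out, training, validation


# Toy annotation: (chromosome, position) pairs, invented for this example.
records = [("chr1", 1000), ("chr1", 5000), ("chr2", 300), ("chrX", 42)]

for held_out, training, validation in leave_one_chromosome_out(records):
    print(held_out, len(training), len(validation))
```

With 24 chromosomes this produces 24 folds, each training on 23 chromosomes and validating on the remaining one. For every fold, the predictor would be trained on `training` only, and pppBenchmark would then be run with the `validation` part as the reference annotation.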
Sample for splits
Download the sample data.
Version 1.3 (Dec. 1, 2009)
Based on user feedback, we have made some changes to pppBenchmark. We added the --prefix option, which adds a prefix to the names of all output files. The recall and precision axes are now flipped, which allows a slightly more accurate calculation of the auPRC because no points have to be discarded to obtain a smooth curve.
Version 1.2 (Nov. 4, 2009)
Cleaned up the program output to make it more readable. Cleaned up the code to make it easier to maintain and update in the future.
Version 1.1 (Sep. 25, 2009)
Added two command line switches that let you set which reference annotation to use. This allows you to use training/validation folds of the original data.
Version 1.0
Initial version as described in the paper.