Evaluating large-scale text mining applications beyond the traditional numeric performance measures
Sofie Van Landeghem1,2, Suwisa Kaewphan3, Filip Ginter3, Yves Van de Peer1,2,*
1 Department of Plant Systems Biology, VIB, Technologiepark 927, 9052 Ghent, Belgium
2 Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052 Ghent, Belgium
3 Department of Information Technology, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland
*Corresponding author, E-mail: email@example.com
Text mining methods for the biomedical domain have matured substantially and are currently being applied on a large scale to support a variety of applications in systems biology, pathway curation, data integration and gene summarization. Community-wide challenges in the BioNLP research field provide gold-standard datasets and rigorous evaluation criteria, allowing for a meaningful comparison between techniques as well as measuring progress within the field. However, such evaluations are typically conducted on relatively small training and test datasets. On a larger scale, systematic erratic behaviour may occur that severely influences hundreds of thousands of predictions. In this work, we perform a critical assessment of a large-scale text mining resource, identifying systematic errors and determining their underlying causes through semi-automated analyses and manual evaluations.
In the first file, the term *Any_numeric_value stands for any number, such as -9 or 342. Whether a token is such a number can easily be checked in any programming language or with a regular expression.
The first file shows how three false-positive predictions could be detected in the winning submission of the BioNLP ST'13 GE challenge.
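As an illustration of the *Any_numeric_value check mentioned above, the following is a minimal sketch of a regular-expression test in Python. It is not taken from the supplementary files: the function name `is_numeric_value` and the exact pattern are assumptions, and the choice to accept decimals in addition to signed integers (the only forms the examples -9 and 342 confirm) is an assumption as well.

```python
import re

# Assumption: a "numeric value" is an optionally signed integer or decimal.
# The source only gives -9 and 342 as examples; decimals are included here
# for illustration and can be dropped by simplifying the pattern.
NUMERIC_RE = re.compile(r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)")


def is_numeric_value(token: str) -> bool:
    """Return True if the whole token is a number such as -9 or 342."""
    # fullmatch requires the entire token to match, so "p53" is rejected
    # even though it contains digits.
    return NUMERIC_RE.fullmatch(token.strip()) is not None


if __name__ == "__main__":
    for token in ["-9", "342", "3.14", "p53", "IL-2", ""]:
        print(repr(token), is_numeric_value(token))
```

Matching the whole token rather than searching within it matters here, because gene and protein symbols frequently contain digits without being numbers.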