Evaluating large-scale text mining applications beyond the traditional numeric performance measures

Sofie Van Landeghem1,2, Suwisa Kaewphan3, Filip Ginter3, Yves Van de Peer1,2,*

1 Department of Plant Systems Biology, VIB, Technologiepark 927, 9052 Ghent, Belgium

2 Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052 Ghent, Belgium

3 Department of Information Technology, University of Turku, Joukahaisenkatu 3-5, 20520 Turku, Finland

*Corresponding author, E-mail: yves.vandepeer@psb.vib-ugent.be

Abstract

Text mining methods for the biomedical domain have matured substantially and are currently being applied on a large scale to support a variety of applications in systems biology, pathway curation, data integration and gene summarization. Community-wide challenges in the BioNLP research field provide gold-standard datasets and rigorous evaluation criteria, allowing for a meaningful comparison between techniques as well as measuring progress within the field. However, such evaluations are typically conducted on relatively small training and test datasets. On a larger scale, systematic erratic behaviour may occur that severely influences hundreds of thousands of predictions. In this work, we perform a critical assessment of a large-scale text mining resource, identifying systematic errors and determining their underlying causes through semi-automated analyses and manual evaluations.

Supplementary data


Filtering lists

In the first file, the term *Any_numeric_value refers to any number, such as -9 or 342, which can easily be checked within any programming language or with a regular expression.
In the second file, *All refers to all possible event types, while *All_non_reg excludes the regulatory events, i.e. Regulation, Positive regulation, Negative regulation and Catalysis.
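
As an illustration only, the two kinds of placeholder terms described above could be interpreted along the following lines. This is a minimal sketch, not the code used to build the resource. The function names are hypothetical, the definition of "numeric" (an optionally signed integer or decimal) is an assumption, and so is the underscore spelling of the event type identifiers. The full list of event types is not hard-coded here and must be supplied by the caller.

```python
import re

# Assumption: a "numeric value" is an optionally signed integer or decimal,
# e.g. "-9", "342" or "3.5". The filtering list itself does not define this.
NUMERIC_PATTERN = re.compile(r"[+-]?\d+(\.\d+)?")

# Regulatory event types as named in the text; the underscore spelling
# of the identifiers is an assumption.
REGULATORY_TYPES = frozenset(
    {"Regulation", "Positive_regulation", "Negative_regulation", "Catalysis"}
)


def is_numeric_value(token):
    """Return True if the whole token is a number (hypothetical helper)."""
    return NUMERIC_PATTERN.fullmatch(token.strip()) is not None


def expand_event_types(term, all_types):
    """Expand *All and *All_non_reg into concrete event types.

    all_types is the complete collection of event types in the dataset,
    supplied by the caller. Any other term is returned unchanged.
    """
    if term == "*All":
        return set(all_types)
    if term == "*All_non_reg":
        return set(all_types) - REGULATORY_TYPES
    return {term}
```

For example, is_numeric_value("-9") holds while is_numeric_value("p53") does not, and expanding *All_non_reg over a list of event types drops the four regulatory types and keeps the rest.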


Log files

The first file shows how 3 false-positive predictions could be detected in the winning submission of the BioNLP ST'13 GE challenge.
The second file contains the log of all errors that could automatically be detected and resolved within the EVEX resource, covering around 1.2% of the 40 million events in this dataset, i.e. roughly 480,000 events.


Contact:
VIB / UGent
Bioinformatics & Evolutionary Genomics
Technologiepark 927
B-9052 Gent
BELGIUM
+32 (0) 9 33 13807 (phone)
+32 (0) 9 33 13809 (fax)
