Discriminative and informative features for biomolecular text mining with ensemble feature selection
In the field of biomolecular text mining, the black-box behavior of machine learning systems currently limits understanding of the true nature of the predictions. However, feature selection is capable of identifying the most relevant features in any supervised learning setting, providing insight into the specific properties of the classification algorithm. This allows us to build more accurate classifiers while at the same time bridging the gap between the black-box behavior and the end-user who has to interpret the results.

We show that our feature selection methodology successfully discards a large fraction of machine-generated features, improving classification performance of state-of-the-art text mining algorithms. Furthermore, we illustrate how feature selection can be applied to gain insight into the predictions of a framework for biomolecular event extraction from text. We include numerous examples of highly discriminative features that model either biological reality or common linguistic constructs.

Finally, we discuss a number of insights from our feature selection analyses that provide the opportunity to considerably improve upon current text mining tools.
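The general idea behind ensemble feature selection can be illustrated with a minimal sketch. It is not the method used in the paper: the class-mean scorer, the rank-sum aggregation, the toy data, and all function names below are assumptions chosen for illustration only. Each bootstrap resample of the training data produces its own feature ranking, and the rankings are then combined so that features ranked consistently high across resamples come out on top.

```python
import random
from statistics import mean


def score_features(X, y):
    """Toy scorer (an assumption): absolute difference of class means per feature."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        # A resample may contain a single class; no feature is informative then.
        scores.append(abs(mean(pos) - mean(neg)) if pos and neg else 0.0)
    return scores


def ranks_from_scores(scores):
    """Convert scores to ranks, where rank 0 is the highest-scoring feature."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0] * len(scores)
    for rank, j in enumerate(order):
        ranks[j] = rank
    return ranks


def ensemble_feature_ranking(X, y, n_bootstraps=25, seed=0):
    """Rank features by their summed rank over bootstrap resamples."""
    rng = random.Random(seed)
    n_samples = len(X)
    rank_sums = [0] * len(X[0])
    for _ in range(n_bootstraps):
        # Sample instances with replacement to build one bootstrap resample.
        idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        X_boot = [X[i] for i in idx]
        y_boot = [y[i] for i in idx]
        for j, rank in enumerate(ranks_from_scores(score_features(X_boot, y_boot))):
            rank_sums[j] += rank
    # The lowest summed rank is the most consistently discriminative feature.
    return sorted(range(len(rank_sums)), key=lambda j: rank_sums[j])


# Hypothetical toy data: feature 0 mirrors the label (for example, a trigger
# word being present), feature 1 is constant, feature 2 is unrelated noise.
X = [[1, 1, 0], [1, 1, 1], [1, 1, 0], [1, 1, 1],
     [0, 1, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

ranking = ensemble_feature_ranking(X, y)
print("Features, most to least discriminative:", ranking)
```

In a real text mining setting the scorer would typically be replaced by a classifier-derived feature weight, and features at the bottom of the aggregated ranking would be the candidates to discard.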
Van Landeghem, S.*, Abeel, T.*, Saeys, Y., Van de Peer, Y. (2010) Discriminative and informative features for biomolecular text mining with ensemble feature selection. Bioinformatics 26, 554-560. (*contributed equally)
VIB / UGent
Bioinformatics & Evolutionary Genomics