Research

Gene prediction and annotation

Due to recent developments in sequencing, complete genomes are being determined at an ever-increasing rate. However, the correct identification of many genes remains a major challenge. To identify and determine the structure of genes for which no experimental information is available yet, and for which spliced alignments between the transcript and the genomic sequence therefore cannot be produced, we need predictive in silico methods based on intrinsic approaches. Our lab is involved in many gene and genome annotation projects, mostly of plants, but also of animals, fungi, and unicellular organisms.

Apart from our extensive expertise in annotating genomes, we also have an interest in developing and improving tools to identify genes. Intrinsic gene finders aim to locate all the gene elements that occur in a genomic sequence, including possible partial gene structures at the border of the sequence, using intrinsic features, that is, features derived from the same genome. They are based on either content sensors (e.g. coding potential, compositional bias, codon usage) or signal sensors (e.g. splice sites, translation initiation codon). In short, features of known genes are used to find other, previously undetected genes within the same genome. Intrinsic gene prediction relies on combinatorial, statistical and/or artificial intelligence methods and preferably integrates several different approaches. Even better is to combine intrinsic approaches with extrinsic approaches, the latter using homology-based information from other genomes.
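
As an illustration of what a content sensor can look like, the sketch below scores a sequence window by comparing its codon usage with that of known genes. It is a minimal example under stated assumptions, not the method used in any particular gene finder: the function names `codon_frequencies` and `coding_score`, the add-one smoothing, and the toy training sequences are all hypothetical choices made for this example.

```python
import math
from collections import Counter
from itertools import product

# All 64 possible codons over the DNA alphabet.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(sequences, pseudocount=1.0):
    """Estimate codon frequencies from sequences read in frame 0.

    A pseudocount is added to every codon so that unseen codons
    never get a frequency of zero (assumption: add-one smoothing).
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if set(codon) <= set("ACGT"):  # skip codons with ambiguous bases
                counts[codon] += 1
    total = sum(counts.values()) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def coding_score(window, coding_freq, background_freq):
    """Log-likelihood ratio of a window read in frame 0.

    Positive values mean the codon usage looks more like the known
    genes than like the background; negative values mean the opposite.
    """
    window = window.upper()
    score = 0.0
    for i in range(0, len(window) - 2, 3):
        codon = window[i:i + 3]
        if codon in coding_freq:  # ignore codons with ambiguous bases
            score += math.log(coding_freq[codon] / background_freq[codon])
    return score


# Toy, made-up training sequences (not real genes), for illustration only.
known_genes = ["GCTGAAGCTGAA"]
non_coding = ["TTTAAATTTAAA"]
coding_freq = codon_frequencies(known_genes)
background_freq = codon_frequencies(non_coding)
print(coding_score("GCTGAA", coding_freq, background_freq))
```

In practice such a score would be computed in all reading frames and combined with signal sensors; the sketch only shows the principle of using features of known genes to score new sequence.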

However, since automatic procedures will never capture the complete set of all genes in a genome and/or their correct structure, let alone the functional description of those genes, we also invest in manual curation efforts. To that end we host and maintain an online curation portal for eukaryotic genomes (ORCAE), where researchers from all over the world have everything at hand to complement the automatically generated gene structures and functions with human curation.

Genome annotations need to be updated regularly in order to continually improve their quality. The improvements made to previous annotations are mainly due to the release of novel experimental data, in particular from the large-scale sequencing of ESTs and cDNAs and from NGS, and to the development of more efficient gene finding software. The vast amount of data, especially NGS data, that is nowadays being generated in the framework of a genome project is difficult to present visually to annotators. With this in mind we developed a software tool called GenomeView that is capable of displaying all these data in a user-friendly way.

See the Genomes section of our website for the genome annotation projects in which we are (or have been) involved.

Gene and genome duplications

Detailed analyses of the genomes of several model organisms revealed that gene duplications have played a prominent role in the evolutionary history of many eukaryotes. In addition, ancient whole-genome duplications (WGDs), also referred to as paleopolyploidizations, have been reported in most evolutionary lineages. Their evolutionary importance, however, remains a major topic of discussion, ranging from an evolutionary dead end to a road toward evolutionary success, with evidence supporting both fates.

Both angiosperm (flowering plants) and vertebrate ancestors have undergone at least two separate WGDs. In the vertebrate lineage, a third WGD occurred in the ancestor of teleost fish. In the angiosperm lineage, subsequent and sometimes repeated WGDs have been reported in all major clades. All flowering plants and vertebrates have thus descended from an ancestor that doubled its genome. Ancient WGDs have also been documented in other kingdoms. Nevertheless, paleopolyploidy events seem to be extremely rare, and the number of established ancient WGDs very small. The evolutionary success of land vertebrates, fishes, and flowering plants, however, would suggest that, although descendants of WGD events do not survive often, when they do survive their evolutionary lineage can be very successful. The observation that WGDs often seem to give rise to very species-rich groups of organisms suggests that polyploid species have outcompeted their diploid progenitors or diploid sister lineages, or alternatively, that polyploidy can facilitate diversification and speciation of organisms. Another question is whether these ancient WGDs have survived by coincidence or whether they could survive only because they occurred, or were selected for, at very specific times, for instance during major ecological upheavals and periods of extinction. To address these questions we study the patterns and effects of WGDs at different levels.

Using state-of-the-art phylogenetic dating methods we recently analyzed 41 plant genomes and found a strongly nonrandom pattern of genome duplications over time with many WGDs clustering around the Cretaceous–Paleogene (K–Pg) extinction event about 66 million years ago. This suggests that the environmental and ecological conditions during the time of polyploidization may be of crucial importance, and that the establishment of WGDs is potentially promoted during times of environmental stress.

We are also developing computational models of artificial gene regulatory networks and using population-based evolutionary simulations and evolutionary robotics to study the effects, evolutionary fate, and significance of small- and large-scale genome duplications. For example, we are investigating their consequences for genome and network evolution, and are examining whether, and under which specific conditions and scenarios, gene and genome duplications could be beneficial for adaptation and/or survival. Finally, we hope that results from our studies on WGDs using simulated evolutionary robots could in turn benefit the field of evolutionary robotics and lead to the development of better evolutionary robots.
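
To make the difference between small- and large-scale duplications concrete, the sketch below applies both to a toy regulatory network. It is a minimal illustration under stated assumptions and not the simulation model described above: the network is assumed to be a dictionary mapping every gene to the set of genes it regulates, duplicates are assumed to inherit all regulatory edges, and the function names, the `_dup` suffix, and the random gene loss step are hypothetical.

```python
import random


def single_gene_duplication(network, gene, suffix="_dup"):
    """Small-scale duplication of one gene.

    Assumption: the copy inherits the outgoing edges of the original,
    and every regulator of the original also regulates the copy.
    """
    copy = gene + suffix
    dup = {g: set(targets) for g, targets in network.items()}
    dup[copy] = set(network[gene])
    for g, targets in list(dup.items()):
        if gene in targets:
            dup[g].add(copy)
    return dup


def whole_genome_duplication(network, suffix="_dup"):
    """Duplicate every gene at once.

    Each edge a -> b becomes four edges:
    a -> b, a -> b_dup, a_dup -> b, a_dup -> b_dup.
    """
    dup = {}
    for gene, targets in network.items():
        both = set(targets) | {t + suffix for t in targets}
        dup[gene] = set(both)
        dup[gene + suffix] = set(both)
    return dup


def random_gene_loss(network, loss_prob, rng):
    """Remove each gene independently with probability loss_prob,
    together with all edges that point to a removed gene."""
    kept = {g for g in network if rng.random() >= loss_prob}
    return {g: {t for t in network[g] if t in kept}
            for g in network if g in kept}


# Toy network (hypothetical): A regulates B and C, B regulates C.
toy = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
after_wgd = whole_genome_duplication(toy)
after_loss = random_gene_loss(after_wgd, loss_prob=0.3, rng=random.Random(1))
print(len(toy), len(after_wgd), len(after_loss))
```

A whole-genome duplication doubles the number of genes but quadruples the number of regulatory edges, which is one reason its effect on network evolution differs from that of single-gene duplications.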

Data and text mining

The extraction of knowledge from large and often unstructured biological datasets calls for automated techniques that aim to identify novel connections between different entities. Machine learning (ML) techniques are ideally suited to extract such knowledge, as they are able to analyse complex datasets with a large number of features. The aim of our group is thus to apply existing ML techniques, or develop new ones, that efficiently extract new knowledge for specific bioinformatics tasks. For instance, our group has experience in applying ML techniques for gene prediction and text mining.

In the past, we have mainly focused on ab initio gene prediction, developing predictors that can identify the major structural and functional elements of a gene. Current high-throughput sequencing technologies, however, require novel approaches, and the huge amounts of available sequence data provide unique opportunities for the development of scalable machine learning techniques that can contribute to novel expert systems for genome annotation.

Another interesting area for the application of machine learning techniques is the automated extraction of knowledge from literature. With over 23 million citations in PubMed, the scientific literature contains a wealth of information that can be automatically extracted and structured through text mining technology. To this end, we have established a state-of-the-art text mining pipeline in close collaboration with the University of Turku, and have applied it to the whole of PubMed, resulting in the EVEX resource. We are further exploring novel ways of combining these text-mined data with a variety of other data types, including experimental data and records from authoritative resources. These data integration efforts are key to supporting our research in the field of systems biology.

Systems biology

Using a top-down approach, we mainly focus on developing and applying methods for identifying functional modules in integrated regulatory and protein interaction networks, as well as reverse-engineering regulatory networks from transcriptome data. Further, we aim to understand how physical interaction networks mediate dynamic responses to external stimuli, and we attempt to model the evolution of network modules after whole-genome duplications.

To learn module networks from gene expression data, we have previously developed the software package LeMoNe, which applies ensemble-based techniques to predict the condition-dependent expression levels of modules of coexpressed genes. To identify functional modules in integrated networks, we have introduced the CyClus3D Cytoscape plugin.
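
To give an idea of what a module of coexpressed genes is, the sketch below groups genes whose expression profiles are strongly correlated across conditions. It is deliberately much simpler than the ensemble-based approach of LeMoNe and should not be read as its algorithm: the correlation threshold, the use of connected components, the function names, and the toy expression values are assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def coexpression_modules(expression, threshold=0.9):
    """Group genes into modules.

    Two genes are linked when their profiles correlate at or above
    the threshold; modules are the connected components of that graph,
    found here with a small union-find structure.
    """
    genes = list(expression)
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            if pearson(expression[a], expression[b]) >= threshold:
                parent[find(a)] = find(b)

    modules = {}
    for g in genes:
        modules.setdefault(find(g), []).append(g)
    return sorted(sorted(m) for m in modules.values())


# Toy, made-up expression values over four conditions.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.0, 2.0],
}
print(coexpression_modules(expression))
```

With these toy values, g1 and g2 rise together and g3 and g4 fall together, so the sketch reports two modules. Real module network methods additionally infer which regulators and conditions explain each module's expression.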

We are further actively engaged in the development of novel systems biology tools that model the rewiring of physical networks when external stimuli, such as environmental stress, occur. On a much larger timescale, we are developing models to study the evolution of regulatory and protein interaction networks following whole-genome duplication events, as well as working on methods for identifying conserved modules between multiple species.

Our computational methods are validated on experimental data from a variety of model species, including S. cerevisiae, A. thaliana, C. elegans, and human.

Contact:
VIB / UGent
Bioinformatics & Evolutionary Genomics
Technologiepark 927
B-9052 Gent
BELGIUM
+32 (0) 9 33 13807 (phone)
+32 (0) 9 33 13809 (fax)