Gene prediction and annotation
Owing to recent developments in sequencing, complete genomes are being determined at an ever-increasing rate. However, the correct identification of many genes remains a major challenge. For genes for which no experimental information is available yet, and for which spliced alignments between a transcript and the genomic sequence cannot be produced, identifying the gene and determining its structure requires predictive in silico methods based on intrinsic approaches. Our lab is involved in many gene and genome annotation projects, mostly of plants, but also of animals and unicellular organisms.
Apart from annotating genomes, we also use machine learning techniques to develop and improve tools to identify genes. Intrinsic gene finders aim to locate all the gene elements that occur in a genomic sequence, including possible partial gene structures at the borders of the sequence, using intrinsic features, i.e. features derived from within the same genome. They are based on content sensors (e.g. coding potential, compositional bias, codon usage) and/or signal sensors (e.g. splice sites, translation initiation codon). In short, features of known genes are used to find other, previously undetected genes within the same genome. Intrinsic gene prediction relies on combinatorial, statistical and/or artificial intelligence methods and preferably integrates several different approaches. Even better is to combine intrinsic approaches with extrinsic approaches, the latter using homology-based information from other genomes.
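To make the distinction between content and signal sensors concrete, here is a minimal sketch in Python. It is not the code of any gene finder used in our projects. The function names, the uniform background model and the pseudocount are assumptions chosen for illustration. The content sensor scores a window by its codon usage relative to known coding sequences, and the signal sensor checks only for canonical GT...AG intron boundaries.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def codon_frequencies(coding_seqs, pseudocount=1.0):
    """Estimate in-frame codon frequencies from known coding sequences."""
    counts = Counter()
    for seq in coding_seqs:
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}

def coding_score(seq, freqs):
    """Content sensor: log-odds of seq (frame 0) versus a uniform codon background."""
    score = 0.0
    usable = len(seq) - len(seq) % 3
    for i in range(0, usable, 3):
        codon = seq[i:i + 3]
        if codon in freqs:  # skip codons containing ambiguous bases
            score += math.log(freqs[codon] * len(CODONS))
    return score

def has_gt_ag(intron):
    """Signal sensor in its crudest form: canonical GT...AG intron boundaries."""
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")

# Toy training set (hypothetical sequences, for illustration only)
freqs = codon_frequencies(["ATGGCTGCTGCTGCT", "ATGGCTGCAGCT"])
print(coding_score("GCTGCTGCT", freqs), coding_score("TTTTTTTTT", freqs))
```

A window rich in codons that are frequent in the training genes gets a positive score, and a window of rarely used codons gets a negative one. Real gene finders combine many such sensors in a probabilistic model rather than applying them one by one.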
Genome annotations need to be updated regularly in order to continually improve their quality. The improvements made to previous annotations are mainly due to the release of novel experimental data, in particular from the large-scale sequencing of ESTs and cDNAs and from NGS, and to the development of more efficient gene-finding software. At the moment, we are developing, applying, and evaluating novel classification techniques in order to further improve gene prediction.
Check the Genomes section of our website to see in which genome annotation projects we are involved.
We are interested in the development of methods for reverse-engineering transcription regulatory networks from transcriptome data, identifying functional modules in integrated regulatory and protein interaction networks, understanding how physical interaction networks mediate the condition-dependent response to external stimuli, and modeling the evolution of network modules after whole genome duplications. Our computational methods are validated on experimental data for Saccharomyces cerevisiae, Arabidopsis thaliana, Caenorhabditis elegans, and human.
Our main focus to date has been on reverse-engineering transcription regulatory networks from transcriptome data. We have developed a software package LeMoNe for learning module networks which uses ensemble-based techniques to infer a probabilistic model to predict the condition-dependent expression levels of modules of coexpressed genes, based on the combined expression levels of a set of regulators. LeMoNe has been benchmarked and compared with state-of-the-art methods using transcriptome data for yeast and E. coli. Past and ongoing projects have used LeMoNe to infer developmental regulatory modules in C. elegans, microRNA regulatory modules in human cancer cells, regulatory variants underlying heterosis in A. thaliana using SNP and diallel expression data, stress and cell cycle dependent regulatory modules in A. thaliana, and posttranscriptional regulatory modules in yeast.
Functional modules observed in transcriptome data are the result of physical interactions taking place at the protein-DNA or protein-protein level. We have developed a Network Motif Clustering Toolbox to identify modules in integrated networks, which provides a general data integration methodology. It is based on the presence of network motifs: small, frequently occurring subgraphs that represent functional relationships between heterogeneous data types. We have benchmarked the algorithm on an integrated network in yeast with more than 50,000 transcription factor binding, protein-protein and phosphorylation interactions.
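As an illustration of what a network motif in an integrated network looks like, the Python sketch below enumerates one simple motif type: a transcription factor that binds two genes whose protein products interact. It is a hypothetical example and not the algorithm implemented in the Network Motif Clustering Toolbox. The function name and the edge-list input format are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def coregulated_interaction_motifs(tf_binding, ppi):
    """Return (tf, a, b) triples where tf binds both a and b, and a and b interact.

    tf_binding: iterable of directed (tf, target) pairs
    ppi: iterable of undirected (protein, protein) pairs
    """
    targets = defaultdict(set)
    for tf, gene in tf_binding:
        targets[tf].add(gene)
    # Store interactions as frozensets so that (a, b) and (b, a) are equivalent
    interactions = {frozenset(edge) for edge in ppi}
    motifs = []
    for tf, genes in targets.items():
        for a, b in combinations(sorted(genes), 2):
            if frozenset((a, b)) in interactions:
                motifs.append((tf, a, b))
    return motifs

# Toy integrated network (hypothetical identifiers)
binding = [("T1", "A"), ("T1", "B"), ("T1", "C"), ("T2", "B"), ("T2", "C")]
ppi = [("B", "A"), ("C", "B")]
print(coregulated_interaction_motifs(binding, ppi))
```

Whether such a motif counts as frequently occurring would normally be judged against randomized networks, which this sketch leaves out.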
An important question is how physical networks mediate the condition-dependent response to external stimuli that is observed in transcriptome data. We have introduced the notion of regulatory path motifs: short paths in the physical network which occur significantly more often than expected by chance between transcription factors and their targets in perturbational expression data. A study in yeast has shown that these paths increase the number of explained perturbed targets five- to ten-fold or more compared with using direct transcriptional links only. These paths are organized into functional modules which can be identified using the Network Motif Clustering Toolbox.
The modular organization observed in all biological interaction networks has arisen during evolution by gene and genome duplications followed by interaction gain and loss. While simple duplication-divergence models have been able to explain some large-scale properties of biological networks, such as the appearance of a scale-free-like topology, very little is known about how specific modular subparts have evolved. We have recently started to develop models for the evolution of regulatory and protein interaction networks following whole genome duplication and we are also working on methods for identifying conserved modules between multiple species.
Gene and genome evolution
Detailed analyses of the genomes of several model organisms have revealed that gene duplication has played a prominent role in the evolutionary history of many eukaryotes. In addition, many genomes show traces of ancient whole genome duplications. Evidence for large-scale gene duplication or entire genome duplication events often comes from the detection of block or segmental duplications. However, the detection of segmental duplications in genomes is not straightforward. We have developed dedicated software that tries to find colinearity between and within genomes. Apart from looking for evidence for such large-scale gene duplication events, we also study the consequences of these events for biological evolution. Other areas of interest are genome evolution in general, the functional divergence of genes, (evolution of) regulation and evolutionary robotics.
Gene and genome duplication have played a prominent role in the evolutionary history of many eukaryotes. Evidence for large-scale gene duplication or entire genome duplication events often comes from the detection of block or segmental duplications. However, the detection of segmental duplications in genomes is often not straightforward. We have developed dedicated software that tries to find colinearity between and within genomes. We also try to study the consequences of these events for biological evolution.
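The idea behind colinearity detection can be sketched as follows. This is a simplified illustration in Python, not the algorithm used in our software. Each anchor is assumed to be a pair of gene-order positions of two homologous genes on two genomic segments, and the function name and the max_gap parameter are hypothetical. The sketch finds the longest chain of anchors whose positions increase on both segments without exceeding the allowed gap. Inverted blocks are ignored for brevity.

```python
def longest_collinear_chain(anchors, max_gap=3):
    """Longest chain of homologous gene pairs that keeps the same order on both segments.

    anchors: iterable of (i, j) gene-order positions of homologous genes
    max_gap: largest allowed step between consecutive anchors on either segment
    """
    anchors = sorted(set(anchors))
    if not anchors:
        return []
    best = [1] * len(anchors)   # best[k]: length of the longest chain ending at anchor k
    prev = [-1] * len(anchors)  # back-pointers used to recover the chain
    for k, (ik, jk) in enumerate(anchors):
        for m in range(k):
            im, jm = anchors[m]
            in_range = 0 < ik - im <= max_gap and 0 < jk - jm <= max_gap
            if in_range and best[m] + 1 > best[k]:
                best[k] = best[m] + 1
                prev[k] = m
    k = max(range(len(anchors)), key=best.__getitem__)
    chain = []
    while k != -1:
        chain.append(anchors[k])
        k = prev[k]
    return chain[::-1]

# Toy anchors: four collinear pairs plus two unrelated hits
print(longest_collinear_chain([(1, 10), (2, 11), (3, 13), (7, 2), (4, 14), (20, 30)]))
```

In practice the statistical significance of a candidate block also has to be assessed, since short chains can arise by chance in large genomes.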
Recently, we have also started to develop mathematical models that simulate the birth and death of genes based on observed age distributions of duplicated genes, considering both small-scale, continuously occurring local duplication events and duplication events affecting the whole genome. Application of our model shows that much of the genetic material in extant plants, i.e., about 60%, has been created by ancient genome duplication events. More importantly, it seems that a major fraction of those genes could have been retained only because they were created in large-scale gene duplication events. In particular, transcription factors, signal transducers, and developmental genes seem to have been retained subsequent to large-scale gene duplication events. Since the divergence of regulatory genes is considered necessary to bring about phenotypic variation and an increase in biological complexity, it is tempting to conclude that such large-scale gene duplication events have indeed been of major importance for evolution.
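The general shape of such a birth-death model can be sketched in a few lines of Python. This is a deliberately simplified, deterministic illustration and not the model described above. It assumes a constant number of small-scale duplicates created per time step, a constant per-step loss probability, and a single whole genome duplication that adds a burst of duplicates at one age. All names and parameter values are hypothetical.

```python
def expected_age_distribution(n_bins, birth, loss, wgd_age=None, wgd_births=0.0):
    """Expected number of surviving duplicates per age bin (0 = youngest).

    birth: duplicates created per time step by small-scale duplication
    loss: probability per time step that a duplicate is lost
    wgd_age: age bin at which a whole genome duplication occurred, or None
    wgd_births: extra duplicates created at once by that event
    """
    dist = []
    for age in range(n_bins):
        created = birth + (wgd_births if age == wgd_age else 0.0)
        # A duplicate of this age has survived `age` rounds of possible loss
        dist.append(created * (1.0 - loss) ** age)
    return dist

def wgd_fraction(n_bins, birth, loss, wgd_age, wgd_births):
    """Fraction of surviving duplicates that trace back to the genome duplication."""
    with_wgd = sum(expected_age_distribution(n_bins, birth, loss, wgd_age, wgd_births))
    background = sum(expected_age_distribution(n_bins, birth, loss))
    return (with_wgd - background) / with_wgd

# Toy parameters, chosen only to show the characteristic peak in the age distribution
print(expected_age_distribution(5, 100.0, 0.5, wgd_age=2, wgd_births=400.0))
print(wgd_fraction(5, 100.0, 0.5, wgd_age=2, wgd_births=400.0))
```

Under these assumptions the small-scale duplicates form an exponentially decaying background and the genome duplication appears as a peak on top of it. Fitting such a curve to an observed age distribution is what allows the contribution of each mode of duplication to be estimated.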
It is now generally accepted that the duplication of genes has been very important because it increases the genetic material on which evolution can work. Although this idea is not new, novel hypotheses have recently been put forward to explain the retention and functional divergence of duplicated genes. In our team, we investigate the structure and evolution of duplicated genes in order to elucidate how genes diverge in function. Genomes are also used to study genome evolution in general, as well as the functional divergence of genes, (evolution of) regulation and evolutionary robotics.
Although it is easy to understand that changes in the place or time of gene expression can create new or alternative molecular interactions, little is known about the organization and evolution of transcriptional regulation in plants. This knowledge is essential, however, because each gene is flanked by regulatory sequences which, together with the expression and activity of other proteins, determine the amount, place, and timing of expression. Characterizing these regulatory motifs is therefore required in order to understand the regulatory interactions between trans-acting proteins and the promoters of thousands of genes within a eukaryotic genome. This information is also essential when studying biological processes from a holistic point of view by integrating complementary functional data sets. We are studying the architecture of plant promoter sequences and trying to identify cis-regulatory elements, or transcription factor binding sites (TFBS), which play an important role in transcriptional regulation. Fully sequenced plant genomes, together with extensive expression data sets, are used to characterize the basic composition of plant promoters (e.g. TFBS, cis-regulatory modules) and to study their evolution within the green plant lineage.
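A common way to search promoter sequences for candidate TFBS is to scan them with a position weight matrix built from known binding sites. The Python sketch below illustrates that general technique under stated assumptions: a uniform background of 0.25 per base, a pseudocount of 0.5, and a score threshold picked by hand. It is not a description of our own pipeline. The example sites are variants of the G-box core CACGTG and serve only as toy input.

```python
import math

def build_pwm(sites, pseudocount=0.5, background=0.25):
    """Build a log-odds position weight matrix from aligned sites of equal length."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def scan(seq, pwm, threshold):
    """Return (position, window, score) for every window scoring at least threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases
        score = sum(column[base] for column, base in zip(pwm, window))
        if score >= threshold:
            hits.append((i, window, score))
    return hits

# Toy binding sites and promoter fragment (hypothetical data)
pwm = build_pwm(["CACGTG", "CACGTG", "CACGTT", "CACGTG"])
print(scan("TTTCACGTGAAA", pwm, threshold=5.0))
```

The sketch scans one strand only. A real analysis would also scan the reverse complement and would calibrate the threshold against background sequences.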
Data mining and machine learning
The application of machine learning techniques has become fundamental in extracting knowledge from biological data and in the process of automating important tasks. In biology, where structures or processes are described by a large number of features, and complex interactions exist between these features, techniques from machine learning are the only practical way to explore these datasets. Amongst other things, we are applying machine learning techniques to gene prediction and annotation and text mining.
The extraction of new knowledge from potentially large biological datasets often calls for automated techniques that aim to find new structures or important connections between different entities. Data mining and machine learning techniques are ideally suited to this task, and the aim of our group is to apply existing techniques, or develop new ones, that efficiently extract new knowledge for specific bioinformatics tasks. A recurring aspect of data mining in bioinformatics is the ability to gain more insight into the underlying processes by examining important features and their relations. Feature selection is therefore an important research topic in our group, and we have gained considerable expertise in this area.
In the past, much of our work was applied in the context of gene prediction and genome annotation, where we contributed substantially to expert systems for ab initio gene prediction by developing predictors that identify the major functional elements of a gene, such as the promoter region, transcription start site, translation initiation site and splice sites. Recent advances in genome sequencing, such as next-generation sequencing technologies, are now revolutionizing the field of genome annotation. These new techniques call for new approaches to gene prediction and new expert systems for genome annotation. Therefore, we now focus on scalable machine learning techniques that are able to deal with these huge amounts of sequence data.
More recently, data mining and machine learning techniques have been identified as key components in systems biology, where data mining is used both to identify single components and to connect different components through data integration. An interesting new development in the integration of different types of data comes from the application of text mining techniques to the scientific literature. The literature has emerged as a new and important source of knowledge, and we develop new text mining techniques to extract relevant information from it automatically.
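To illustrate the feature selection mentioned above, here is a minimal Python sketch of a filter approach. It ranks features by a simple signal-to-noise score between two classes. This is only one of many possible criteria and is not presented as the method used in our group. The function name, the scoring formula and the toy data are assumptions.

```python
from statistics import mean, pstdev

def rank_features(X, y):
    """Rank feature indices from most to least discriminative between classes 1 and 0.

    X: list of samples, each a list of numeric feature values
    y: list of class labels (1 or 0), one per sample
    """
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        p = [row[j] for row in pos]
        n = [row[j] for row in neg]
        spread = pstdev(p) + pstdev(n)
        # Difference of class means relative to within-class spread
        score = abs(mean(p) - mean(n)) / (spread + 1e-9)
        scores.append((score, j))
    return [j for score, j in sorted(scores, key=lambda t: (-t[0], t[1]))]

# Toy data: feature 0 separates the classes, feature 1 does not
X = [[0.9, 5.0], [0.8, 4.0], [0.1, 5.0], [0.2, 4.0]]
y = [1, 1, 0, 0]
print(rank_features(X, y))
```

A filter like this scores each feature independently, so it cannot detect features that are informative only in combination. Wrapper and embedded methods address that at a higher computational cost.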
VIB / UGent
Bioinformatics & Evolutionary Genomics
+32 (0) 9 33 13807 (phone)
+32 (0) 9 33 13809 (fax)