Publications

  1. De La Torre, A. R., Li, Z., Van de Peer, Y., & Ingvarsson, P. K. (2017). Contrasting rates of molecular evolution and patterns of selection among gymnosperms and flowering plants. MOLECULAR BIOLOGY AND EVOLUTION, 34(6), 1363–1377.
    The majority of variation in rates of molecular evolution among seed plants remains both unexplored and unexplained. Although some attention has been given to flowering plants, reports of molecular evolutionary rates for their sister plant clade (gymnosperms) are scarce, and to our knowledge differences in molecular evolution among seed plant clades have never been tested in a phylogenetic framework. Angiosperms and gymnosperms differ in a number of features, of which contrasting reproductive biology, life spans, and population sizes are the most prominent. The highly conserved morphology of gymnosperms evidenced by similarity of extant species to fossil records and the high levels of macrosynteny at the genomic level have led scientists to believe that gymnosperms are slow-evolving plants, although some studies have offered contradictory results. Here, we used 31,968 nucleotide sites obtained from orthologous genes across a wide taxonomic sampling that includes representatives of most conifers, cycads, ginkgo, and many angiosperms with a sequenced genome. Our results suggest that angiosperms and gymnosperms differ considerably in their rates of molecular evolution per unit time, with gymnosperm rates being, on average, seven times lower than angiosperm species. Longer generation times and larger genome sizes are some of the factors explaining the slow rates of molecular evolution found in gymnosperms. In contrast to their slow rates of molecular evolution, gymnosperms possess higher substitution rate ratios than angiosperm taxa. Finally, our study suggests stronger and more efficient purifying and diversifying selection in gymnosperm than in angiosperm species, probably in relation to larger effective population sizes.
  2. Li, Zhen, De La Torre, A. R., Sterck, L., Cánovas, F. M., Avila, C., Merino, I., Cabezas, J. A., et al. (2017). Single-copy genes as molecular markers for phylogenomic studies in seed plants. GENOME BIOLOGY AND EVOLUTION, 9(5), 1130–1147.
    Phylogenetic relationships among seed plant taxa, especially within the gymnosperms, remain contested. In contrast to angiosperms, for which several genomic, transcriptomic and phylogenetic resources are available, there are few, if any, molecular markers that allow broad comparisons among gymnosperm species. With few gymnosperm genomes available, recently obtained transcriptomes in gymnosperms are a great addition to identifying single-copy gene families as molecular markers for phylogenomic analysis in seed plants. Taking advantage of an increasing number of available genomes and transcriptomes, we identified single-copy genes in a broad collection of seed plants and used these to infer phylogenetic relationships between major seed plant taxa. This study aims at extending the current phylogenetic toolkit for seed plants, assessing its ability for resolving seed plant phylogeny, and discussing potential factors affecting phylogenetic reconstruction. In total, we identified 3,072 single-copy genes in 31 gymnosperms and 2,156 single-copy genes in 34 angiosperms. All studied seed plants shared 1,469 single-copy genes, which are generally involved in functions like DNA metabolism, cell cycle, and photosynthesis. A selected set of 106 single-copy genes provided good resolution for the seed plant phylogeny except for gnetophytes. Although some of our analyses support a sister relationship between gnetophytes and other gymnosperms, phylogenetic trees from concatenated alignments without 3rd codon positions and amino acid alignments under the CAT + GTR model support gnetophytes as a sister group to Pinaceae. Our phylogenomic analyses demonstrate that, in general, single-copy genes can uncover both recent and deep divergences of seed plant phylogeny.
  3. Ruprecht, C., Lohaus, R., Vanneste, K., Mutwil, M., Nikoloski, Z., Van de Peer, Y., & Persson, S. (2017). Revisiting ancestral polyploidy in plants. SCIENCE ADVANCES, 3(7).
    Whole-genome duplications (WGDs) or polyploidy events have been studied extensively in plants. In a now widely cited paper, Jiao et al. presented evidence for two ancient, ancestral plant WGDs predating the origin of flowering and seed plants, respectively. This finding was based primarily on a bimodal age distribution of gene duplication events obtained from molecular dating of almost 800 phylogenetic gene trees. We reanalyzed the phylogenomic data of Jiao et al. and found that the strong bimodality of the age distribution may be the result of technical and methodological issues and may hence not be a "true" signal of two WGD events. By using a state-of-the-art molecular dating algorithm, we demonstrate that the reported bimodal age distribution is not robust and should be interpreted with caution. Thus, there exists little evidence for two ancient WGDs in plants from phylogenomic dating.
  4. Mizrachi, E., Verbeke, L., Van de Peer, Y., Marchal, K., & Myburg, A. A. (2017). Network analysis of woody biomass. Cell Systems.
  5. Christie, Nanette, Myburg, A. A., Joubert, F., Murray, S. L., Carstens, M., Lin, Y.-C., Meyer, J., et al. (2017). Systems genetics reveals a transcriptional network associated with susceptibility in the maize-grey leaf spot pathosystem. PLANT JOURNAL, 89(4), 746–763.
    We used a systems genetics approach to elucidate the molecular mechanisms of the responses of maize to grey leaf spot (GLS) disease caused by Cercospora zeina, a threat to maize production globally. Expression analysis of earleaf samples in a subtropical maize recombinant inbred line population (CML444xSC Malawi) subjected in the field to C. zeina infection allowed detection of 20206 expression quantitative trait loci (eQTLs). Four trans-eQTL hotspots coincided with GLS disease QTLs mapped in the same field experiment. Co-expression network analysis identified three expression modules correlated with GLS disease scores. The module (GY-s) most highly correlated with susceptibility (r=0.71; 179 genes) was enriched for the glyoxylate pathway, lipid metabolism, diterpenoid biosynthesis and responses to pathogen molecules such as chitin. The GY-s module was enriched for genes with trans-eQTLs in hotspots on chromosomes 9 and 10, which also coincided with phenotypic QTLs for susceptibility to GLS. This transcriptional network has significant overlap with the GLS susceptibility response of maize line B73, and may reflect pathogen manipulation for nutrient acquisition and/or unsuccessful defence responses, such as kauralexin production by the diterpenoid biosynthesis pathway. The co-expression module that correlated best with resistance (TQ-r; 1498 genes) was enriched for genes with trans-eQTLs in hotspots coinciding with GLS resistance QTLs on chromosome 9. Jasmonate responses were implicated in resistance to GLS through co-expression of COI1 and enrichment of genes with the Gene Ontology term 'cullin-RING ubiquitin ligase complex' in the TQ-r module. Consistent with this, JAZ repressor expression was highly correlated with the severity of GLS disease in the GY-s susceptibility network.
  6. Heydari, M., Miclotte, G., Demeester, P., Van de Peer, Y., & Fostier, J. (2017). Evaluation of the impact of Illumina error correction tools on de novo genome assembly. BMC BIOINFORMATICS, 18.
    Background: Recently, many standalone applications have been proposed to correct sequencing errors in Illumina data. The key idea is that downstream analysis tools such as de novo genome assemblers benefit from a reduced error rate in the input data. Surprisingly, a systematic validation of this assumption using state-of-the-art assembly methods is lacking, even for recently published methods. Results: For twelve recent Illumina error correction tools (EC tools) we evaluated both their ability to correct sequencing errors and their ability to improve de novo genome assembly in terms of contig size and accuracy. Conclusions: We confirm that most EC tools reduce the number of errors in sequencing data without introducing many new errors. However, we found that many EC tools suffer from poor performance in certain sequence contexts such as regions with low coverage or regions that contain short repeated or low-complexity sequences. Reads overlapping such regions are often ill-corrected in an inconsistent manner, leading to breakpoints in the resulting assemblies that are not present in assemblies obtained from uncorrected data. Resolving this systematic flaw in future EC tools could greatly improve the applicability of such tools.
  7. De Tiège, A., Van de Peer, Y., Braeckman, J., & Tanghe, K. (2017). The sociobiology of genes : the gene’s eye view as a unifying behavioural-ecological framework for biological evolution. HISTORY AND PHILOSOPHY OF THE LIFE SCIENCES.
  8. Meysman, P., Saeys, Y., Sabaghian, E., Bittremieux, W., Van de Peer, Y., Goethals, B., & Laukens, K. (2017). Mining the enriched subgraphs for specific vertices in a biological graph. IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS.
    In this paper, we present a subgroup discovery method to find subgraphs in a graph that are associated with a given set of vertices. The association between a subgraph pattern and a set of vertices is defined by its significant enrichment based on a Bonferroni-corrected hypergeometric probability value. This interestingness measure requires a dedicated pruning procedure to limit the number of subgraph matches that must be calculated. The presented mining algorithm to find associated subgraph patterns in large graphs is therefore designed to efficiently traverse the search space. We demonstrate the operation of this method by applying it on three biological graph data sets and show that we can find associated subgraphs for a biologically relevant set of vertices and that the found subgraphs themselves are biologically interesting.
  9. Miclotte, G., Plaisance, S., Rombauts, S., Van de Peer, Y., Audenaert, P., & Fostier, J. (2017). OMSim : a simulator for optical map data. BIOINFORMATICS, 33(17), 2740–2742.
    Motivation: The Bionano Genomics platform allows for the optical detection of short sequence patterns in very long DNA molecules (up to 2.5 Mbp). Molecules with overlapping patterns can be assembled to generate a consensus optical map of the entire genome. In turn, these optical maps can be used to validate or improve de novo genome assembly projects or to detect large-scale structural variation in genomes. Simulated optical map data can assist in the development and benchmarking of tools that operate on those data, such as alignment and assembly software. Additionally, it can help to optimize the experimental setup for a genome of interest. Such a simulator is currently not available. Results: We have developed a simulator, OMSim, that produces synthetic optical map data that mimics real Bionano Genomics data. These simulated data have been tested for compatibility with the Bionano Genomics Irys software system and the Irys-scaffolding scripts. OMSim is capable of handling very large genomes (over 30 Gbp) with high throughput and low memory requirements.
  10. Yao, Yao, & Van de Peer, Y. (2017). Simulating biological complexity through artificial evolution. Cybernetics (CYBCONF), 2017 3rd IEEE international conference. Presented at the 3rd IEEE International conference on Cybernetics (CYBCONF-2017), New York, NY, USA: IEEE.
  11. Roodt, D., Lohaus, R., Sterck, L., Swanepoel, R. L., Van de Peer, Y., & Mizrachi, E. (2017). Evidence for an ancient whole genome duplication in the cycad lineage. PLOS ONE, 12(9).
    Contrary to the many whole genome duplication events recorded for angiosperms (flowering plants), whole genome duplications in gymnosperms (non-flowering seed plants) seem to be much rarer. Although ancient whole genome duplications have been reported for most gymnosperm lineages as well, some are still contested and need to be confirmed. For instance, data for ginkgo, but particularly cycads have remained inconclusive so far, likely due to the quality of the data available and flaws in the analysis. We extracted and sequenced RNA from both the cycad Encephalartos natalensis and Ginkgo biloba. This was followed by transcriptome assembly, after which these data were used to build paralog age distributions. Based on these distributions, we identified remnants of an ancient whole genome duplication in both cycads and ginkgo. The most parsimonious explanation would be that this whole genome duplication event was shared between both species and had occurred prior to their divergence, about 300 million years ago.
  12. Cormier, A., Avia, K., Sterck, L., Derrien, T., Wucher, V., Andres, G., Monsoor, M., et al. (2017). Re-annotation, improved large-scale assembly and establishment of a catalogue of noncoding loci for the genome of the model brown alga Ectocarpus. NEW PHYTOLOGIST, 214(1), 219–232.
    The genome of the filamentous brown alga Ectocarpus was the first to be completely sequenced from within the brown algal group and has served as a key reference genome both for this lineage and for the stramenopiles. We present a complete structural and functional reannotation of the Ectocarpus genome. The large-scale assembly of the Ectocarpus genome was significantly improved and genome-wide gene re-annotation using extensive RNA-seq data improved the structure of 11 108 existing protein-coding genes and added 2030 new loci. A genome-wide analysis of splicing isoforms identified an average of 1.6 transcripts per locus. A large number of previously undescribed noncoding genes were identified and annotated, including 717 loci that produce long noncoding RNAs. Conservation of lncRNAs between Ectocarpus and another brown alga, the kelp Saccharina japonica, suggests that at least a proportion of these loci serve a function. Finally, a large collection of single nucleotide polymorphism-based markers was developed for genetic analyses. These resources are available through an updated and improved genome database. This study significantly improves the utility of the Ectocarpus genome as a high-quality reference for the study of many important aspects of brown algal biology and as a reference for genomic analyses across the stramenopiles.
  13. Erez, Z., Sorek, R., Shao, S., Hegde, R. S., Schweppe, D. K., Chavez, J. D., Bruce, J. E., et al. (2017). Principles of systems biology, No. 14. CELL SYSTEMS.
    This month: sage advice from phage to their offspring; systematic analyses of protein quality control, mitochondrial respiration, and woody biomass; a continental-scale experiment; and engineered protein tools galore.
  14. Vlastaridis, P., Kyriakidou, P., Chaliotis, A., Van de Peer, Y., Oliver, S. G., & Amoutzias, G. D. (2017). Estimating the total number of phosphoproteins and phosphorylation sites in eukaryotic proteomes. GIGASCIENCE, 6(2), 1–11.
    Background: Phosphorylation is the most frequent post-translational modification made to proteins and may regulate protein activity as either a molecular digital switch or a rheostat. Despite the cornucopia of high-throughput (HTP) phosphoproteomic data in the last decade, it remains unclear how many proteins are phosphorylated and how many phosphorylation sites (p-sites) can exist in total within a eukaryotic proteome. We present the first reliable estimates of the total number of phosphoproteins and p-sites for four eukaryotes (human, mouse, Arabidopsis, and yeast). Results: In all, 187 HTP phosphoproteomic datasets were filtered, compiled, and studied along with two low-throughput (LTP) compendia. Estimates of the number of phosphoproteins and p-sites were inferred by two methods: Capture-Recapture, and fitting the saturation curve of cumulative redundant vs. cumulative non-redundant phosphoproteins/p-sites. Estimates were also adjusted for different levels of noise within the individual datasets and other confounding factors. We estimate that in total, 13 000, 11 000, and 3000 phosphoproteins and 230 000, 156 000, and 40 000 p-sites exist in human, mouse, and yeast, respectively, whereas estimates for Arabidopsis were not as reliable. Conclusions: Most of the phosphoproteins have been discovered for human, mouse, and yeast, while the dataset for Arabidopsis is still far from complete. The datasets for p-sites are not as close to saturation as those for phosphoproteins. Integration of the LTP data suggests that current HTP phosphoproteomics appears to be capable of capturing 70% to 95% of total phosphoproteins, but only 40% to 60% of total p-sites.
  15. Cañas, R. A., Li, Z., Pascual, M. B., Castro-Rodríguez, V., Ávila, C., Sterck, L., Van de Peer, Y., et al. (2017). The gene expression landscape of pine seedling tissues. PLANT JOURNAL, 91(6), 1064–1087.
    Conifers dominate vast regions of the Northern hemisphere. They are the main source of raw materials for the timber industry as well as a wide range of biomaterials. Despite their inherent difficulties as experimental models for classical plant biology research, the technological advances in genomics research are enabling fundamental studies on these plants. The use of laser capture microdissection followed by transcriptomic analysis is a powerful tool for unravelling the molecular and functional organization of conifer tissues and specialized cells. In the present work, 14 different tissues from 1-month-old maritime pine (Pinus pinaster) seedlings have been isolated and their transcriptomes analysed. The results increased the sequence information and number of full-length transcripts from a previous reference transcriptome and added 39 841 new transcripts. In total, 2376 transcripts were ubiquitously expressed in all of the examined tissues. These transcripts could be considered the core 'housekeeping genes' in pine. The genes have been clustered according to their expression profiles. This analysis reduced the number of profiles to 38, most of these defined by their expression in a unique tissue that is much higher than in the other tissues. The expression and localization data are accessible at ConGenIE.org (http://v22.popgenie.org/microdisection/). This study presents an overview of the gene expression distribution in different pine tissues, specifically highlighting the relationships between tissue gene expression and function. This transcriptome atlas is a valuable resource for functional genomics research in conifers.
  16. Zhang, G.-Q., Liu, K.-W., Li, Z., Lohaus, R., Hsiao, Y.-Y., Niu, S.-C., Wang, J.-Y., et al. (2017). The Apostasia genome and the evolution of orchids. NATURE, 549(7672), 379–383.
    Constituting approximately 10% of flowering plant species, orchids (Orchidaceae) display unique flower morphologies, possess an extraordinary diversity in lifestyle, and have successfully colonized almost every habitat on Earth(1-3). Here we report the draft genome sequence of Apostasia shenzhenica(4), a representative of one of two genera that form a sister lineage to the rest of the Orchidaceae, providing a reference for inferring the genome content and structure of the most recent common ancestor of all extant orchids and improving our understanding of their origins and evolution. In addition, we present transcriptome data for representatives of Vanilloideae, Cypripedioideae and Orchidoideae, and novel third-generation genome data for two species of Epidendroideae, covering all five orchid subfamilies. A. shenzhenica shows clear evidence of a whole-genome duplication, which is shared by all orchids and occurred shortly before their divergence. Comparisons between A. shenzhenica and other orchids and angiosperms also permitted the reconstruction of an ancestral orchid gene toolkit. We identify new gene families, gene family expansions and contractions, and changes within MADS-box gene classes, which control a diverse suite of developmental processes, during orchid evolution. This study sheds new light on the genetic mechanisms underpinning key orchid innovations, including the development of the labellum and gynostemium, pollinia, and seeds without endosperm, as well as the evolution of epiphytism; reveals relationships between the Orchidaceae subfamilies; and helps clarify the evolutionary history of orchids within the angiosperms.
  17. Mizrachi, E., Verbeke, L., Christie, N., Fierro Gutierrez, A. C. E., Mansfield, S. D., Davis, M. F., Gjersing, E., et al. (2017). Network-based integration of systems genetics data reveals pathways associated with lignocellulosic biomass accumulation and processing. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 114(5), 1195–1200.
    As a consequence of their remarkable adaptability, fast growth, and superior wood properties, eucalypt tree plantations have emerged as key renewable feedstocks (over 20 million ha globally) for the production of pulp, paper, bioenergy, and other lignocellulosic products. However, most biomass properties such as growth, wood density, and wood chemistry are complex traits that are hard to improve in long-lived perennials. Systems genetics, a process of harnessing multiple levels of component trait information (e.g., transcript, protein, and metabolite variation) in populations that vary in complex traits, has proven effective for dissecting the genetics and biology of such traits. We have applied a network-based data integration (NBDI) method for a systems-level analysis of genes, processes and pathways underlying biomass and bioenergy-related traits using a segregating Eucalyptus hybrid population. We show that the integrative approach can link biologically meaningful sets of genes to complex traits and at the same time reveal the molecular basis of trait variation. Gene sets identified for related woody biomass traits were found to share regulatory loci, cluster in network neighborhoods, and exhibit enrichment for molecular functions such as xylan metabolism and cell wall development. These findings offer a framework for identifying the molecular underpinnings of complex biomass and bioprocessing-related traits. A more thorough understanding of the molecular basis of plant biomass traits should provide additional opportunities for the establishment of a sustainable bio-based economy.
  18. Van de Peer, Y., Mizrachi, E., & Marchal, K. (2017). The evolutionary significance of polyploidy. NATURE REVIEWS GENETICS, 18(7), 411–424.
    Polyploidy, or the duplication of entire genomes, has been observed in prokaryotic and eukaryotic organisms, and in somatic and germ cells. The consequences of polyploidization are complex and variable, and they differ greatly between systems (clonal or non-clonal) and species, but the process has often been considered to be an evolutionary 'dead end'. Here, we review the accumulating evidence that correlates polyploidization with environmental change or stress, and that has led to an increased recognition of its short-term adaptive potential. In addition, we discuss how, once polyploidy has been established, the unique retention profile of duplicated genes following whole-genome duplication might explain key longer-term evolutionary transitions and a general increase in biological complexity.
  19. Van Parys, T., Melckenbeeck, I., Houbraken, M., Audenaert, P., Colle, D., Pickavet, M., Demeester, P., et al. (2017). A Cytoscape app for motif enumeration with ISMAGS. BIOINFORMATICS, 33(3), 461–463.
    We present a Cytoscape app for the ISMAGS algorithm, which can enumerate all instances of a motif in a graph, making optimal use of the motif's symmetries to make the search more efficient. The Cytoscape app provides a handy interface for this algorithm, which allows more efficient network analysis.
  20. Yao, Yao, Marchal, K., & Van de Peer, Y. (2016). Adaptive self-organizing organisms using a bio-inspired gene regulatory network controller: for the aggregation of evolutionary robots under a changing environment. In Ying Tan (Ed.), Handbook of research on design, control and modeling of swarm robotics (pp. 68–82). Hershey, PA, USA: IGI Global.
    This work has explored the adaptive potential of simulated swarm robots that contain a genomic encoding of a bio-inspired gene regulatory network (GRN). An artificial genome is combined with a flexible agent-based system, representing the activated part of the regulatory network that transduces environmental cues into phenotypic behavior. Using an Alife simulation framework that mimics a changing environment, we have shown that separating the static from the conditionally active part of the network contributes to a better adaptive behavior. This chapter describes the biologically inspired concept of GRNs to develop a distributed robot self-organizing approach. In particular, it shows that by using this approach, multiple swarm robots can aggregate to form a robotic organism that can adapt its configuration as a response to a dynamically changing environment. In addition, through the comparison of several different simulation experiments, the results illustrate the impact of evolutionary operators such as mutations and duplications on improving the adaptability of organisms.
  21. Miclotte, G., Heydari, M., Demeester, P., Rombauts, S., Van de Peer, Y., Audenaert, P., & Fostier, J. (2016). Jabba: hybrid error correction for long sequencing reads. ALGORITHMS FOR MOLECULAR BIOLOGY, 11, 10.
    Background: Third generation sequencing platforms produce longer reads with higher error rates than second generation technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads appear to be attractive to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from large runtimes. Recently, a new hybrid error correcting method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned. Results: In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data. Unique to our method is the use of a pseudo alignment approach with a seed-and-extend methodology, using maximal exact matches (MEMs) as seeds. In addition to benchmark results, certain theoretical results concerning the possibilities and limitations of the use of MEMs in the context of third generation reads are presented. Conclusion: Jabba produces highly reliable corrected reads: almost all corrected reads align to the reference, and these alignments have a very high identity. Many of the aligned reads are error-free. Additionally, Jabba corrects reads using a very low amount of CPU time. From this we conclude that pseudo alignment with MEMs is a fast and reliable method to map long highly erroneous sequences on a de Bruijn graph.
  22. Zhang, G.-Q., Xu, Q., Bian, C., Tsai, W.-C., Yeh, C.-M., Liu, K.-W., Yoshida, K., et al. (2016). The Dendrobium catenatum Lindl. genome sequence provides insights into polysaccharide synthase, floral development and adaptive evolution. SCIENTIFIC REPORTS, 6.
    Orchids make up about 10% of all seed plant species, have great economical value, and are of specific scientific interest because of their renowned flowers and ecological adaptations. Here, we report the first draft genome sequence of a lithophytic orchid, Dendrobium catenatum. We predict 28,910 protein-coding genes, and find evidence of a whole genome duplication shared with Phalaenopsis. We observed the expansion of many resistance-related genes, suggesting a powerful immune system responsible for adaptation to a wide range of ecological niches. We also discovered extensive duplication of genes involved in glucomannan synthase activities, likely related to the synthesis of medicinal polysaccharides. Expansion of MADS-box gene clades ANR1, StMADS11, and MIKC*, involved in the regulation of development and growth, suggests that these expansions are associated with the astonishing diversity of plant architecture in the genus Dendrobium. On the contrary, members of the type I MADS box gene family are missing, which might explain the loss of the endospermous seed. The findings reported here will be important for future studies into polysaccharide synthesis, adaptations to diverse environments and flower architecture of Orchidaceae.
  23. Yao, Yao, Storme, V., Marchal, K., & Van de Peer, Y. (2016). Emergent adaptive behaviour of GRN-controlled simulated robots in a changing environment. PEERJ, 4.
    We developed a bio-inspired robot controller combining an artificial genome with an agent-based control system. The genome encodes a gene regulatory network (GRN) that is switched on by environmental cues and, following the rules of transcriptional regulation, provides output signals to actuators. Whereas the genome represents the full encoding of the transcriptional network, the agent-based system mimics the active regulatory network and signal transduction system also present in naturally occurring biological systems. Using such a design that separates the static from the conditionally active part of the gene regulatory network contributes to a better general adaptive behaviour. Here, we have explored the potential of our platform with respect to the evolution of adaptive behaviour, such as preying when food becomes scarce, in a complex and changing environment and show through simulations of swarm robots in an A-life environment that evolution of collective behaviour likely can be attributed to bio-inspired evolutionary processes acting at different levels, from the gene and the genome to the individual robot and robot population.
  24. LE, P., Makhalanyane, T. P., Guerrero, L. D., Vikram, S., Van de Peer, Y., & Cowan, D. A. (2016). Comparative metagenomic analysis reveals mechanisms for stress response in hypoliths from extreme hyperarid deserts. GENOME BIOLOGY AND EVOLUTION, 8(9), 2737–2747.
    Understanding microbial adaptation to environmental stressors is crucial for interpreting broader ecological patterns. In the most extreme hot and cold deserts, cryptic niche communities are thought to play key roles in ecosystem processes and represent excellent model systems for investigating microbial responses to environmental stressors. However, relatively little is known about the genetic diversity underlying such functional processes in climatically extreme desert systems. This study presents the first comparative metagenome analysis of cyanobacteria-dominated hypolithic communities in hot (Namib Desert, Namibia) and cold (Miers Valley, Antarctica) hyperarid deserts. The most abundant phyla in both hypolith metagenomes were Actinobacteria, Proteobacteria, Cyanobacteria and Bacteroidetes with Cyanobacteria dominating in Antarctic hypoliths. However, no significant differences between the two metagenomes were identified. The Antarctic hypolithic metagenome displayed a high number of sequences assigned to sigma factors, replication, recombination and repair, translation, ribosomal structure, and biogenesis. In contrast, the Namib Desert metagenome showed a high abundance of sequences assigned to carbohydrate transport and metabolism. Metagenome data analysis also revealed significant divergence in the genetic determinants of amino acid and nucleotide metabolism between these two metagenomes and those of soil from other polar deserts, hot deserts, and non-desert soils. Our results suggest extensive niche differentiation in hypolithic microbial communities from these two extreme environments and a high genetic capacity for survival under environmental extremes.
  25. Cao, T. N. P., Greenhalgh, R., Dermauw, W., Rombauts, S., Bajda, S., Zhurov, V., Grbić, M., et al. (2016). Complex evolutionary dynamics of massively expanded chemosensory receptor families in an extreme generalist chelicerate herbivore. GENOME BIOLOGY AND EVOLUTION, 8(11), 3323–3339.
    While mechanisms to detoxify plant produced, anti-herbivore compounds have been associated with plant host use by herbivores, less is known about the role of chemosensory perception in their life histories. This is especially true for generalists, including chelicerate herbivores that evolved herbivory independently from the more studied insect lineages. To shed light on chemosensory perception in a generalist herbivore, we characterized the chemosensory receptors (CRs) of the chelicerate two-spotted spider mite, Tetranychus urticae, an extreme generalist. Strikingly, T. urticae has more CRs than reported in any other arthropod to date. Including pseudogenes, 689 gustatory receptors were identified, as were 136 degenerin/Epithelial Na+ Channels (ENaCs) that have also been implicated as CRs in insects. The genomic distribution of T. urticae gustatory receptors indicates recurring bursts of lineage-specific proliferations, with the extent of receptor clusters reminiscent of those observed in the CR-rich genomes of vertebrates or C. elegans. Although pseudogenization of many gustatory receptors within clusters suggests relaxed selection, a subset of receptors is expressed. Consistent with functions as CRs, the genomic distribution and expression of ENaCs in lineage-specific T. urticae expansions mirrors that observed for gustatory receptors. The expansion of ENaCs in T. urticae to > 3-fold that reported in other animals was unexpected, raising the possibility that ENaCs in T. urticae have been co-opted to fulfill a major role performed by unrelated CRs in other animals. More broadly, our findings suggest an elaborate role for chemosensory perception in generalist herbivores that are of key ecological and agricultural importance.
  26. Li, Zhen, Defoort, J., Tasdighian, S., Maere, S., Van de Peer, Y., & De Smet, R. (2016). Gene duplicability of core genes is highly consistent across all angiosperms. PLANT CELL, 28(2), 326–344.
    Gene duplication is an important mechanism for adding to genomic novelty. Hence, which genes undergo duplication and are preserved following duplication is an important question. It has been observed that gene duplicability, or the ability of genes to be retained following duplication, is a nonrandom process, with certain genes being more amenable to survive duplication events than others. Primarily, gene essentiality and the type of duplication (small-scale versus large-scale) have been shown in different species to influence the (long-term) survival of novel genes. However, an overarching view of "gene duplicability" is lacking, mainly due to the fact that previous studies usually focused on individual species and did not account for the influence of genomic context and the time of duplication. Here, we present a large-scale study in which we investigated duplicate retention for 9178 gene families shared between 37 flowering plant species, referred to as angiosperm core gene families. For most gene families, we observe a strikingly consistent pattern of gene duplicability across species, with gene families being either primarily single-copy or multicopy in all species. An intermediate class contains gene families that are often retained in duplicate for periods extending to tens of millions of years after whole-genome duplication, but ultimately appear to be largely restored to singleton status, suggesting that these genes may be dosage balance sensitive. The distinction between single-copy and multicopy gene families is reflected in their functional annotation, with single-copy genes being mainly involved in the maintenance of genome stability and organelle function and multicopy genes in signaling, transport, and metabolism. The intermediate class was overrepresented in regulatory genes, further suggesting that these represent putative dosage-balance-sensitive genes.
  27. Bolton, M. D., Ebert, M. K., Faino, L., Rivera-Varas, V., de Jonge, R., Van de Peer, Y., Thomma, B. P., et al. (2016). RNA-sequencing of Cercospora beticola DMI-sensitive and -resistant isolates after treatment with tetraconazole identifies common and contrasting pathway induction. FUNGAL GENETICS AND BIOLOGY, 92, 1–13.
    Cercospora beticola causes Cercospora leaf spot of sugar beet. Cercospora leaf spot management measures often include application of the sterol demethylation inhibitor (DMI) class of fungicides. The reliance on DMIs and the consequent selection pressures imposed by their widespread use have led to the emergence of resistance in C. beticola populations. Insight into the molecular basis of tetraconazole resistance may lead to molecular tools to identify DMI-resistant strains for fungicide resistance management programs. Previous work has shown that expression of the gene encoding the DMI target enzyme (CYP51) is generally higher and inducible in DMI-resistant C. beticola field strains. In this study, we extended the molecular basis of DMI resistance in this pathosystem by profiling the transcriptional response of two C. beticola strains contrasting for resistance to tetraconazole. A majority of the genes in the ergosterol biosynthesis pathway were induced to similar levels in both strains with the exception of CbCyp51, which was induced several-fold higher in the DMI-resistant strain. In contrast, a secondary metabolite gene cluster was induced in the resistant strain, but repressed in the sensitive strain. Genes encoding proteins with various cell membrane fortification processes were induced in the resistant strain. Site-directed and ectopic mutants of candidate DMI-resistance genes all resulted in significantly higher EC50 values than the wild type strain, suggesting that the cell wall and/or membrane modified as a result of the transformation process increased resistance to tetraconazole. Taken together, this study identifies important cell membrane components and provides insight into the molecular events underlying DMI resistance in C. beticola.
  28. Van de Peer, Y., & Pires, J. C. (2016). Editorial overview: Genome studies and molecular genetics: of plant genes, genomes, and genomics. CURRENT OPINION IN PLANT BIOLOGY.
  29. Van Landeghem, S., Van Parys, T., Dubois, M., Inzé, D., & Van de Peer, Y. (2016). Diffany: an ontology-driven framework to infer, visualise and analyse differential molecular networks. BMC BIOINFORMATICS, 17.
    Background: Differential networks have recently been introduced as a powerful way to study the dynamic rewiring capabilities of an interactome in response to changing environmental conditions or stimuli. Currently, such differential networks are generated and visualised using ad hoc methods, and are often limited to the analysis of only one condition-specific response or one interaction type at a time. Results: In this work, we present a generic, ontology-driven framework to infer, visualise and analyse an arbitrary set of condition-specific responses against one reference network. To this end, we have implemented novel ontology-based algorithms that can process highly heterogeneous networks, accounting for both physical interactions and regulatory associations, symmetric and directed edges, edge weights and negation. We propose this integrative framework as a standardised methodology that allows a unified view on differential networks and promotes comparability between differential network studies. As an illustrative application, we demonstrate its usefulness on a plant abiotic stress study and we experimentally confirmed a predicted regulator. Availability: Diffany is freely available as an open-source Java library and Cytoscape plugin from http://bioinformatics.psb.ugent.be/supplementary_data/solan/diffany/.
  30. Kerchev, P., Waszczak, C., Lewandowska, A., Willems, P., Shapiguzov, A., Li, Z., Alseekh, S., et al. (2016). Lack of GLYCOLATE OXIDASE1, but not GLYCOLATE OXIDASE2, attenuates the photorespiratory phenotype of CATALASE2-deficient Arabidopsis. PLANT PHYSIOLOGY, 171(3), 1704–1719.
    The genes coding for the core metabolic enzymes of the photorespiratory pathway that allows plants with C3-type photosynthesis to survive in an oxygen-rich atmosphere, have been largely discovered in genetic screens aimed to isolate mutants that are unviable under ambient air. As an exception, glycolate oxidase (GOX) mutants with a photorespiratory phenotype have not been described yet in C3 species. Using Arabidopsis (Arabidopsis thaliana) mutants lacking the peroxisomal CATALASE2 (cat2-2) that display stunted growth and cell death lesions under ambient air, we isolated a second-site loss-of-function mutation in GLYCOLATE OXIDASE1 (GOX1) that attenuated the photorespiratory phenotype of cat2-2. Interestingly, knocking out the nearly identical GOX2 in the cat2-2 background did not affect the photorespiratory phenotype, indicating that GOX1 and GOX2 play distinct metabolic roles. We further investigated their individual functions in single gox1-1 and gox2-1 mutants and revealed that their phenotypes can be modulated by environmental conditions that increase the metabolic flux through the photorespiratory pathway. High light negatively affected the photosynthetic performance and growth of both gox1-1 and gox2-1 mutants, but the negative consequences of severe photorespiration were more pronounced in the absence of GOX1, which was accompanied with lesser ability to process glycolate. Taken together, our results point toward divergent functions of the two photorespiratory GOX isoforms in Arabidopsis and contribute to a better understanding of the photorespiratory pathway.
  31. Xie, Qingjun, Tzfadia, O., Levy, M., Weithorn, E., Peled-Zehavi, H., Van Parys, T., Van de Peer, Y., et al. (2016). hfAIM: a reliable bioinformatics approach for in silico genome-wide identification of autophagy-associated Atg8-interacting motifs in various organisms. AUTOPHAGY, 12(5), 876–887.
    Most of the proteins that are specifically turned over by selective autophagy are recognized by the presence of short Atg8 interacting motifs (AIMs) that facilitate their association with the autophagy apparatus. Such AIMs can be identified by bioinformatics methods based on their defined degenerate consensus F/W/Y-X-X-L/I/V sequences in which X represents any amino acid. Achieving reliability and/or fidelity of the prediction of such AIMs on a genome-wide scale represents a major challenge. Here, we present a bioinformatics approach, high fidelity AIM (hfAIM), which uses additional sequence requirements (the presence of acidic amino acids and the absence of positively charged amino acids in certain positions) to reliably identify AIMs in proteins. We demonstrate that the use of the hfAIM method allows for in silico high fidelity prediction of AIMs in AIM-containing proteins (ACPs) on a genome-wide scale in various organisms. Furthermore, by using hfAIM to identify putative AIMs in the Arabidopsis proteome, we illustrate a potential contribution of selective autophagy to various biological processes. More specifically, we identified 9 peroxisomal PEX proteins that contain hfAIM motifs, among which AtPEX1, AtPEX6 and AtPEX10 possess evolutionarily conserved AIMs. Bimolecular fluorescence complementation (BiFC) results verified that AtPEX6 and AtPEX10 indeed interact with Atg8 in planta. In addition, we show that mutations occurring within or near hfAIMs in PEX1, PEX6 and PEX10 caused defects in the growth and development of various organisms. Taken together, the above results suggest that the hfAIM tool can be used to effectively perform genome-wide in silico screens of proteins that are potentially regulated by selective autophagy. The hfAIM system is a web tool that can be accessed at http://bioinformatics.psb.ugent.be/hfAIM/.
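    The degenerate core consensus mentioned in the abstract above (F/W/Y-X-X-L/I/V) can be illustrated with a short Python sketch. This covers only the core consensus, not the additional hfAIM requirements on acidic and positively charged residues, whose exact positions the abstract does not give. The function name `find_core_aims` and the toy sequence are hypothetical and are not taken from the hfAIM tool.

```python
import re

# Degenerate core AIM consensus from the abstract: [F/W/Y]-X-X-[L/I/V],
# where X is any amino acid (approximated here as any uppercase letter).
# A lookahead is used so that overlapping candidate motifs are all reported.
CORE_AIM = re.compile(r"(?=([FWY][A-Z]{2}[LIV]))")


def find_core_aims(protein_seq):
    """Return (1-based start position, motif) for every core AIM candidate.

    Hypothetical helper for illustration only; it does not apply the extra
    hfAIM filters and will therefore report many more candidates than hfAIM.
    """
    seq = protein_seq.upper()
    return [(m.start() + 1, m.group(1)) for m in CORE_AIM.finditer(seq)]


# Hypothetical toy sequence, not a real protein.
hits = find_core_aims("MSDDWEELAAFYVLK")
for position, motif in hits:
    print(position, motif)
```

    Because the four-residue consensus is so permissive, a scan like this is expected to return many candidates by chance, which is the reliability problem the abstract says the additional sequence requirements are meant to address.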
  32. Jelen, V., de Jonge, R., Van de Peer, Y., Javornik, B., & Jakše, J. (2016). Complete mitochondrial genome of the Verticillium-wilt causing plant pathogen Verticillium nonalfalfae. PLOS ONE, 11(2).
    Verticillium nonalfalfae is a fungal plant pathogen that causes wilt disease by colonizing the vascular tissues of host plants. The disease induced by hop isolates of V. nonalfalfae manifests in two different forms, ranging from mild symptoms to complete plant dieback, caused by mild and lethal pathotypes, respectively. Pathogenicity variations between the causal strains have been attributed to differences in genomic sequences and perhaps also to differences in their mitochondrial genomes. We used data from our recent Illumina NGS-based project of genome sequencing V. nonalfalfae to study the mitochondrial genomes of its different strains. The aim of the research was to prepare a V. nonalfalfae reference mitochondrial genome and to determine its phylogenetic placement in the fungal kingdom. The resulting 26,139 bp circular DNA molecule contains a full complement of the 14 "standard" fungal mitochondrial protein-coding genes of the electron transport chain and ATP synthase subunits, together with a small rRNA subunit, a large rRNA subunit, which contains ribosomal protein S3 encoded within a type IA-intron, and 26 tRNAs. Phylogenetic analysis of this mitochondrial genome placed it in the Verticillium spp. lineage in the Glomerellales group, which is also supported by previous phylogenetic studies based on nuclear markers. The clustering with the closely related Verticillium dahliae mitochondrial genome showed a very conserved synteny and a high sequence similarity. Two distinguishing mitochondrial genome features were also found: a potential long non-coding RNA (orf414) contained only in the Verticillium spp. of the fungal kingdom, and a specific fragment length polymorphism observed only in V. dahliae and V. nubilum of all the Verticillium spp., thus showing potential as a species-specific biomarker.
  33. Olsen, J. L., Rouzé, P., Verhelst, B., Lin, Y.-C., Bayer, T., Collen, J., Dattolo, E., et al. (2016). The genome of the seagrass Zostera marina reveals angiosperm adaptation to the sea. NATURE, 530(7590), 331–335.
    Seagrasses colonized the sea(1) on at least three independent occasions to form the basis of one of the most productive and widespread coastal ecosystems on the planet(2). Here we report the genome of Zostera marina (L.), the first, to our knowledge, marine angiosperm to be fully sequenced. This reveals unique insights into the genomic losses and gains involved in achieving the structural and physiological adaptations required for its marine lifestyle, arguably the most severe habitat shift ever accomplished by flowering plants. Key angiosperm innovations that were lost include the entire repertoire of stomatal genes(3), genes involved in the synthesis of terpenoids and ethylene signalling, and genes for ultraviolet protection and phytochromes for far-red sensing. Seagrasses have also regained functions enabling them to adjust to full salinity. Their cell walls contain all of the polysaccharides typical of land plants, but also contain polyanionic, low-methylated pectins and sulfated galactans, a feature shared with the cell walls of all macroalgae(4) and that is important for ion homoeostasis, nutrient uptake and O2/CO2 exchange through leaf epidermal cells. The Z. marina genome resource will markedly advance a wide range of functional ecological studies from adaptation of marine ecosystems under climate warming(5,6), to unravelling the mechanisms of osmoregulation under high salinities that may further inform our understanding of the evolution of salt tolerance in crop plants(7).
  34. Lohaus, R., & Van de Peer, Y. (2016). Of dups and dinos: evolution at the K/Pg boundary. (Y. Van de Peer & J. C. Pires, Eds.) CURRENT OPINION IN PLANT BIOLOGY, 30, 62–69.
    Fifteen years into sequencing entire plant genomes, more than 30 paleopolyploidy events could be mapped on the tree of flowering plants (and many more when also transcriptome data sets are considered). While some genome duplications are very old and have occurred early in the evolution of dicots and monocots, or even before, others are more recent and seem to have occurred independently in many different plant lineages. Strikingly, a majority of these duplications date somewhere between 55 and 75 million years ago (mya), and thus likely correlate with the K/Pg boundary. If true, this would suggest that plants that had their genome duplicated at that time, had an increased chance to survive the most recent mass extinction event, at 66 mya, which wiped out a majority of plant and animal life, including all non-avian dinosaurs. Here, we review several processes, both neutral and adaptive, that might explain the establishment of polyploid plants, following the K/Pg mass extinction.
  35. Kaewphan, S., Van Landeghem, S., Ohta, T., Van de Peer, Y., Ginter, F., & Pyysalo, S. (2016). Cell line name recognition in support of the identification of synthetic lethality in cancer from text. BIOINFORMATICS, 32(2), 276–282.
    Motivation: The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating for instance the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus. Results: We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers.
  36. Vlastaridis, P., Oliver, S. G., Van de Peer, Y., & Amoutzias, G. D. (2016). The challenges of interpreting phosphoproteomics data : a critical view through the bioinformatics lens. In C. Angelini, P. M. Rancoita, & S. Rovetta (Eds.), (Vol. 9874, pp. 196–204). Presented at the 12th International meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2015), Cham, Switzerland: Springer.
    During the last decade, there has been great progress in high-throughput (HTP) phosphoproteomics and hundreds or even thousands of phosphorylation sites (p-sites) can now be detected in a single experiment. This success is attributable to a combination of very sensitive Mass Spectrometry instruments, better phosphopeptide enrichment techniques and bioinformatics software that are capable of detecting peptides and localizing p-sites. These new technologies have opened up a whole new level of gene regulation to be studied, with great potential for therapeutics and synthetic biology. Nevertheless, many challenges remain to be resolved; these concern the biases and noise of these proteomic technologies, the biological noise that is present, as well as the incompleteness of the current datasets. Despite these problems, the datasets published so far appear to represent a good sample of a complete phosphoproteome of some organisms and are capable of revealing their major properties.
  37. Tzfadia, O., Diels, T., De Meyer, S., Vandepoele, K., Aharoni, A., & Van de Peer, Y. (2016). CoExpNetViz: comparative co-expression networks construction and visualization tool. FRONTIERS IN PLANT SCIENCE, 6.
    Motivation: Comparative transcriptomics is a common approach in functional gene discovery efforts. It allows for finding conserved co-expression patterns between orthologous genes in closely related plant species, suggesting that these genes potentially share similar function and regulation. Several efficient co-expression-based tools have been commonly used in plant research but most of these pipelines are limited to data from model systems, which greatly limits their utility. Moreover, none of the existing pipelines allow plant researchers to make use of their own unpublished gene expression data for performing a comparative co-expression analysis and generating multi-species co-expression networks. Results: We introduce CoExpNetViz, a computational tool that uses a set of query or "bait" genes as an input (chosen by the user) and a minimum of one pre-processed gene expression dataset. The CoExpNetViz algorithm proceeds in three main steps: (i) for every bait gene submitted, co-expression values are calculated using mutual information and Pearson correlation coefficients, (ii) non-bait (or target) genes are grouped based on cross-species orthology, and (iii) output files are generated and results can be visualized as network graphs in Cytoscape. Availability: The CoExpNetViz tool is freely available both as a PHP web server (http://bioinformatics.psb.ugent.be/webtools/coexpr/) (implemented in C++) and as a Cytoscape plugin (implemented in Java). Both versions of the CoExpNetViz tool support Linux and Windows platforms.
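    Step (i) of the algorithm described above can be illustrated with a minimal Python sketch that computes Pearson correlation coefficients between one bait gene and all other genes in a single expression dataset. This is not the CoExpNetViz implementation: the mutual information part is omitted, the function names, the gene names, the toy expression values, and the 0.9 cutoff are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles.

    Assumes neither profile is constant (a constant profile would give
    zero variance and a division by zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coexpressed_with_bait(expression, bait, cutoff=0.9):
    """Return {gene: r} for non-bait genes with |r| >= cutoff.

    `cutoff` is a hypothetical threshold chosen for this example.
    """
    bait_profile = expression[bait]
    result = {}
    for gene, profile in expression.items():
        if gene == bait:
            continue
        r = pearson(bait_profile, profile)
        if abs(r) >= cutoff:
            result[gene] = r
    return result


# Hypothetical toy dataset: gene -> expression across four samples.
expression = {
    "bait1": [1.0, 2.0, 3.0, 4.0],
    "geneA": [2.0, 4.0, 6.0, 8.0],  # rises with the bait
    "geneB": [4.0, 3.0, 2.0, 1.0],  # falls as the bait rises
    "geneC": [1.0, 3.0, 2.0, 1.0],  # weakly related to the bait
}
partners = coexpressed_with_bait(expression, "bait1")
print(sorted(partners))
```

    Keeping both strongly positive and strongly negative correlations is a design choice of this sketch; the abstract does not say how the tool treats anti-correlated genes.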
  38. Perazzolli, M., Herrero, N., Sterck, L., Lenzi, L., Pellegrini, A., Puopolo, G., Van de Peer, Y., et al. (2016). Transcriptomic responses of a simplified soil microcosm to a plant pathogen and its biocontrol agent reveal a complex reaction to harsh habitat. BMC GENOMICS, 17.
    Background: Soil microorganisms are key determinants of soil fertility and plant health. Soil phytopathogenic fungi are one of the most important causes of crop losses worldwide. Microbial biocontrol agents have been extensively studied as alternatives for controlling phytopathogenic soil microorganisms, but molecular interactions between them have mainly been characterised in dual cultures, without taking into account the soil microbial community. We used an RNA sequencing approach to elucidate the molecular interplay of a soil microbial community in response to a plant pathogen and its biocontrol agent, in order to examine the molecular patterns activated by the microorganisms. Results: A simplified soil microcosm containing 11 soil microorganisms was incubated with a plant root pathogen (Armillaria mellea) and its biocontrol agent (Trichoderma atroviride) for 24 h under controlled conditions. More than 46 million paired-end reads were obtained for each replicate and 28,309 differentially expressed genes were identified in total. Pathway analysis revealed complex adaptations of soil microorganisms to the harsh conditions of the soil matrix and to reciprocal microbial competition/cooperation relationships. Both the phytopathogen and its biocontrol agent were specifically recognised by the simplified soil microcosm: defence reaction mechanisms and neutral adaptation processes were activated in response to competitive (T. atroviride) or non-competitive (A. mellea) microorganisms, respectively. Moreover, activation of resistance mechanisms dominated in the simplified soil microcosm in the presence of both A. mellea and T. atroviride. Biocontrol processes of T. atroviride were already activated during incubation in the simplified soil microcosm, possibly to occupy niches in a competitive ecosystem, and they were not further enhanced by the introduction of A. mellea. 
    Conclusions: This work represents an additional step towards understanding molecular interactions between plant pathogens and biocontrol agents within a soil ecosystem. Global transcriptional analysis of the simplified soil microcosm revealed complex metabolic adaptation in the soil environment and specific responses to antagonistic or neutral intruders.
  39. Proost, Sebastian, Van Bel, M., Vaneechoutte, D., Van de Peer, Y., Inzé, D., Mueller-Roeber, B., & Vandepoele, K. (2015). PLAZA 3.0 : an access point for plant comparative genomics. NUCLEIC ACIDS RESEARCH, 43(D1), D974–D981.
    Comparative sequence analysis has significantly altered our view on the complexity of genome organization and gene functions in different kingdoms. PLAZA 3.0 is designed to make comparative genomics data for plants available through a user-friendly web interface. Structural and functional annotation, gene families, protein domains, phylogenetic trees and detailed information about genome organization can easily be queried and visualized. Compared with the first version released in 2009, which featured nine organisms, the number of integrated genomes is more than four times higher, and now covers 37 plant species. The new species provide a wider phylogenetic range as well as a more in-depth sampling of specific clades, and genomes of additional crop species are present. The functional annotation has been expanded and now comprises data from Gene Ontology, MapMan, UniProtKB/Swiss-Prot, PlnTFDB and PlantTFDB. Furthermore, we improved the algorithms to transfer functional annotation from well-characterized plant genomes to other species. The additional data and new features make PLAZA 3.0 (http://bioinformatics.psb.ugent.be/plaza/) a versatile and comprehensible resource for users wanting to explore genome information to study different aspects of plant biology, both in model and non-model organisms.
  40. Soltis, P. S., Marchant, D. B., Van de Peer, Y., & Soltis, D. E. (2015). Polyploidy and genome evolution in plants. CURRENT OPINION IN GENETICS & DEVELOPMENT, 35, 119–125.
    Plant genomes vary in size and complexity, fueled in part by processes of whole-genome duplication (WGD; polyploidy) and subsequent genome evolution. Despite repeated episodes of WGD throughout the evolutionary history of angiosperms in particular, the genomes are not uniformly large, and even plants with very small genomes carry the signatures of ancient duplication events. The processes governing the evolution of plant genomes following these ancient events are largely unknown. Here, we consider mechanisms of diploidization, evidence of genome reorganization in recently formed polyploid species, and macroevolutionary patterns of WGD in plant genomes and propose that the ongoing genomic changes observed in recent polyploids may illustrate the diploidization processes that result in ancient signatures of WGD over geological timescales.
  41. De La Torre, A. R., Lin, Y.-C., Van de Peer, Y., & Ingvarsson, P. K. (2015). Genome-wide analysis reveals diverged patterns of codon bias, gene expression, and rates of sequence evolution in Picea gene families. GENOME BIOLOGY AND EVOLUTION, 7(4), 1002–1015.
    The recent sequencing of several gymnosperm genomes has greatly facilitated studying the evolution of their genes and gene families. In this study, we examine the evidence for expression-mediated selection in the first two fully sequenced representatives of the gymnosperm plant clade (Picea abies and Picea glauca). We use genome-wide estimates of gene expression (> 50,000 expressed genes) to study the relationship between gene expression, codon bias, rates of sequence divergence, protein length, and gene duplication. We found that gene expression is correlated with rates of sequence divergence and codon bias, suggesting that natural selection is acting on Picea protein-coding genes for translational efficiency. Gene expression, rates of sequence divergence, and codon bias are correlated with the size of gene families, with large multicopy gene families having, on average, a lower expression level and breadth, lower codon bias, and higher rates of sequence divergence than single-copy gene families. Tissue-specific patterns of gene expression were more common in large gene families with large gene expression divergence than in single-copy families. Recent family expansions combined with large gene expression variation in paralogs and increased rates of sequence evolution suggest that some Picea gene families are rapidly evolving to cope with biotic and abiotic stress. Our study highlights the importance of gene expression and natural selection in shaping the evolution of protein-coding genes in Picea species, and sets the ground for further studies investigating the evolution of individual gene families in gymnosperms.
  42. Van den Berge, K., De Smet, R., Van de Peer, Y., & Clement, L. (2015). Quantifying expression divergence of duplicated genes with microarrays. Belgian Statistical Society, 23rd Annual meeting, Abstracts. Presented at the 23rd Annual meeting of the Belgian Statistical Society.
    Whole genome duplication (WGD) events are widespread among flowering plants. They result in two redundant genomes within the individual. Most duplicated genes derived from a WGD event (i.e. homeologous genes) will get lost during evolution. Nonetheless, they provide raw material for the evolution of genes with novel functions. Expression divergence is often used to assess the contribution of WGD in this respect. Microarray technology can be used for this purpose. With microarrays, the expression of a gene is measured by multiple 'probes', i.e. a probeset. Quantifying expression divergence involves differential expression analysis between two distinct genes, which is challenging as it involves different probesets, each having different characteristics. We show that standard analysis methods adopted in the evolutionary genomics literature typically lead to an excess of false positives, explaining the high number of reported significantly diverged genes. We propose a novel data analysis strategy to account for these probe effects. An empirical null distribution is established by adopting a test statistic on probes within a probeset. This null distribution can be incorporated in a local fdr estimate for every gene pair, which rigorously defines significant expression divergence. We illustrate our method in a case study on Arabidopsis thaliana.
  43. Cai, J., Liu, X., Vanneste, K., Proost, S., Tsai, W.-C., Liu, K.-W., Chen, L.-J., et al. (2015). The genome sequence of the orchid Phalaenopsis equestris. NATURE GENETICS, 47(1), 65–72.
    Orchidaceae, renowned for its spectacular flowers and other reproductive and ecological adaptations, is one of the most diverse plant families. Here we present the genome sequence of the tropical epiphytic orchid Phalaenopsis equestris, a frequently used parent species for orchid breeding. P. equestris is the first plant with crassulacean acid metabolism (CAM) for which the genome has been sequenced. Our assembled genome contains 29,431 predicted protein-coding genes. We find that contigs likely to be underassembled, owing to heterozygosity, are enriched for genes that might be involved in self-incompatibility pathways. We find evidence for an orchid-specific paleopolyploidy event that preceded the radiation of most orchid clades, and our results suggest that gene duplication might have contributed to the evolution of CAM photosynthesis in P. equestris. Finally, we find expanded and diversified families of MADS-box C/D-class, B-class AP3 and AGL6-class genes, which might contribute to the highly specialized morphology of orchid flowers.
  44. Zhang, Zhonghua, Mao, L., Chen, H., Bu, F., Li, G., Sun, J., Li, S., et al. (2015). Genome-wide mapping of structural variations reveals a copy number variant that determines reproductive morphology in cucumber. PLANT CELL, 27(6), 1595–1604.
    Structural variations (SVs) represent a major source of genetic diversity. However, the functional impact and formation mechanisms of SVs in plant genomes remain largely unexplored. Here, we report a nucleotide-resolution SV map of cucumber (Cucumis sativus) that comprises 26,788 SVs based on deep resequencing of 115 diverse accessions. The largest proportion of cucumber SVs was formed through nonhomologous end-joining rearrangements, and the occurrence of SVs is closely associated with regions of high nucleotide diversity. These SVs affect the coding regions of 1676 genes, some of which are associated with cucumber domestication. Based on the map, we discovered a copy number variation (CNV) involving four genes that defines the Female (F) locus and gives rise to gynoecious cucumber plants, which bear only female flowers and set fruit at almost every node. The CNV arose from a recent 30.2-kb duplication at a meiotically unstable region, likely via microhomology-mediated break-induced replication. The SV set provides a snapshot of structural variations in plants and will serve as an important resource for exploring genes underlying key traits and for facilitating practical breeding in cucumber.
  45. Crauwels, S., Van Assche, A., de Jonge, R., Borneman, A., Verreth, C., Troels, P., De Samblanx, G., et al. (2015). Comparative phenomics and targeted use of genomics reveals variation in carbon and nitrogen assimilation among different Brettanomyces bruxellensis strains. APPLIED MICROBIOLOGY AND BIOTECHNOLOGY, 99(21), 9123–9134.
    Recent studies have suggested a correlation between genotype groups of Brettanomyces bruxellensis and their source of isolation. To further explore this relationship, the objective of this study was to assess metabolic differences in carbon and nitrogen assimilation between different B. bruxellensis strains from three beverages, including beer, wine, and soft drink, using Biolog Phenotype Microarrays. While some similarities of physiology were noted, many traits were variable among strains. Interestingly, some phenotypes were found that could be linked to strain origin, especially for the assimilation of particular alpha- and beta-glycosides as well as alpha- and beta-substituted monosaccharides. Based upon gene presence or absence, an alpha-glucosidase and beta-glucosidase were found explaining the observed phenotypes. Further, using a PCR screen on a large number of isolates, we have been able to specifically link a genomic deletion to the beer strains, suggesting that this region may have a fitness cost for B. bruxellensis in certain fermentation systems such as brewing. More specifically, none of the beer strains were found to contain a beta-glucosidase, which may have direct impacts on the ability for these strains to compete with other microbes or on flavor production.
  46. Sundell, D., Mannapperuma, C., Netotea, S., Delhomme, N., Lin, Y.-C., Sjödin, A., Van de Peer, Y., et al. (2015). The plant genome integrative explorer resource: PlantGenIE.org. NEW PHYTOLOGIST, 208(4), 1149–1156.
    Accessing and exploring large-scale genomics data sets remains a significant challenge to researchers without specialist bioinformatics training. We present the integrated PlantGenIE.org platform for exploration of Populus, conifer and Arabidopsis genomics data, which includes expression networks and associated visualization tools. Standard features of a model organism database are provided, including genome browsers, gene list annotation, BLAST homology searches and gene information pages. Community annotation updating is supported via integration of WebApollo. We have produced an RNA-sequencing (RNA-Seq) expression atlas for Populus tremula and have integrated these data within the expression tools. An updated version of the COMPLEX resource for performing comparative plant expression analyses of gene coexpression network conservation between species has also been integrated. The PlantGenIE.org platform provides intuitive access to large-scale and genome-wide genomics data from model forest tree species, facilitating both community contributions to annotation improvement and tools supporting use of the included data resources to inform biological insight.
  47. Ranade, S. S., Lin, Y.-C., Van de Peer, Y., & García-Gil, M. R. (2015). Comparative in silico analysis of SSRs in coding regions of high confidence predicted genes in Norway spruce (Picea abies) and Loblolly pine (Pinus taeda). BMC GENETICS, 16.
    Background: Microsatellites or simple sequence repeats (SSRs) are DNA sequences consisting of 1-6 bp tandem repeat motifs present in the genome. SSRs are considered to be one of the most powerful tools in genetic studies. We carried out a comparative study of perfect SSR loci belonging to class I (>= 20) and class II (>= 12 and < 20 bp) types located in coding regions of high confidence genes in Picea abies and Pinus taeda. SSRLocator was used to retrieve SSRs from the full length CDS of predicted genes in both species. Results: Trimers were the most abundant motifs in class I followed by hexamers in Picea abies, while trimers and hexamers were equally abundant in Pinus taeda class I SSRs. Hexamers were most frequent within class II SSRs followed by trimers, in both species. Although the frequency of genes containing SSRs was slightly higher in Pinus taeda, SSR counts per Mbp for class I was similar in both species (P-value = 0.22); while for class II SSRs, it was significantly higher in Picea abies (P-value = 0.00009). AT-rich motifs were higher in abundance than the GC-rich motifs, within class II SSRs in both species (P-values = 10^-9 and 0). With reference to class I SSRs, AT-rich and GC-rich motifs were detected with equal frequency in Pinus taeda (P-value = 0.24); while in Picea abies, GC-rich motifs were detected with higher frequency than the AT-rich motifs (P-value = 0.0005). Conclusions: Our study gives a comparative overview of the genome SSRs composition based on high confidence genes in the two recently sequenced and economically important conifers and also provides information on functional molecular markers that can be applied in genetic studies in Pinus and Picea species.
  48. Szakonyi, D., Van Landeghem, S., Baerenfaller, K., Baeyens, L., Blomme, J., Casanova-Sáez, R., De Bodt, S., et al. (2015). The KnownLeaf literature curation system captures knowledge about Arabidopsis leaf growth and development and facilitates integrated data mining. CURRENT PLANT BIOLOGY, 2, 1–11.
    The information that connects genotypes and phenotypes is essentially embedded in research articles written in natural language. To facilitate access to this knowledge, we constructed a framework for the curation of the scientific literature studying the molecular mechanisms that control leaf growth and development in Arabidopsis thaliana (Arabidopsis). Standard structured statements, called relations, were designed to capture diverse data types, including phenotypes and gene expression linked to genotype description, growth conditions, genetic and molecular interactions, and details about molecular entities. Relations were then annotated from the literature, defining the relevant terms according to standard biomedical ontologies. This curation process was supported by a dedicated graphical user interface, called Leaf Knowtator. A total of 283 primary research articles were curated by a community of annotators, yielding 9947 relations monitored for consistency and over 12,500 references to Arabidopsis genes. This information was converted into a relational database (KnownLeaf) and merged with other public Arabidopsis resources relative to transcriptional networks, protein–protein interaction, gene co-expression, and additional molecular annotations. Within KnownLeaf, leaf phenotype data can be searched together with molecular data originating either from this curation initiative or from external public resources. Finally, we built a network (LeafNet) with a portion of the KnownLeaf database content to graphically represent the leaf phenotype relations in a molecular context, offering an intuitive starting point for knowledge mining. Literature curation efforts such as ours provide high quality structured information accessible to computational analysis, and thereby to a wide range of applications. 
DATA: The presented work was performed in the framework of the AGRON-OMICS project (Arabidopsis GROwth Network integrating OMICS technologies) supported by the European Commission 6th Framework Programme (Grant number LSHG-CT-2006-037704). This is a data integration and data sharing portal collecting all the major results from the consortium. All data presented in our paper are available at https://agronomics.ethz.ch/.
  49. Vanneste, Kevin, Sterck, L., Myburg, A. A., Van de Peer, Y., & Mizrachi, E. (2015). Horsetails are ancient polyploids: evidence from Equisetum giganteum. PLANT CELL, 27(6), 1567–1578.
    Horsetails represent an enigmatic clade within the land plants. Despite consisting only of one genus (Equisetum) that contains 15 species, they are thought to represent the oldest extant genus within the vascular plants dating back possibly as far as the Triassic. Horsetails have retained several ancient features and are also characterized by a particularly high chromosome count (n = 108). Whole-genome duplications (WGDs) have been uncovered in many angiosperm clades and have been associated with the success of angiosperms, both in terms of species richness and biomass dominance, but remain understudied in nonangiosperm clades. Here, we report unambiguous evidence of an ancient WGD in the fern lineage, based on sequencing and de novo assembly of an expressed gene catalog (transcriptome) from the giant horsetail (Equisetum giganteum). We demonstrate that horsetails underwent an independent paleopolyploidy during the Late Cretaceous prior to the diversification of the genus but did not experience any recent polyploidizations that could account for their high chromosome number. We also discuss the specific retention of genes following the WGD and how this may be linked to their long-term survival.
  50. Delhomme, N., Sundstrom, G., Zamani, N., Lantz, H., Lin, Y.-C., Hvidsten, T. R., Hoppner, M. P., et al. (2015). Serendipitous meta-transcriptomics: the fungal community of Norway spruce (Picea abies). PLOS ONE, 10(9).
    After performing de novo transcript assembly of >1 billion RNA-Sequencing reads obtained from 22 samples of different Norway spruce (Picea abies) tissues that were not surface sterilized, we found that assembled sequences captured a mix of plant, lichen, and fungal transcripts. The latter were likely expressed by endophytic and epiphytic symbionts, indicating that these organisms were present, alive, and metabolically active. Here, we show that these serendipitously sequenced transcripts need not be considered merely as contamination, as is common, but that they provide insight into the plant's phyllosphere. Notably, we could classify these transcripts as originating predominantly from Dothideomycetes and Leotiomycetes species, with functional annotation of gene families indicating active growth and metabolism, with particular regards to glucose intake and processing, as well as gene regulation.
  51. Potenza, E., Racchi, M. L., Sterck, L., Coller, E., Asquini, E., Tosatto, S. C., Velasco, R., et al. (2015). Exploration of alternative splicing events in ten different grapevine cultivars. BMC GENOMICS, 16.
    Background: The complex dynamics of gene regulation in plants are still far from being fully understood. Among many factors involved, alternative splicing (AS) in particular is one of the least well documented. For many years, AS has been considered less relevant in plants, especially when compared to animals; however, since the introduction of next generation sequencing techniques, the number of plant genes believed to be alternatively spliced has increased exponentially. Results: Here, we performed a comprehensive high-throughput transcript sequencing of ten different grapevine cultivars, which resulted in the first high coverage atlas of the grape berry transcriptome. We also developed findAS, a software tool for the analysis of alternatively spliced junctions. We demonstrate that at least 44 % of multi-exonic genes undergo AS and a large number of low abundance splice variants is present within the 131,622 splice junctions we have annotated from Pinot noir. Conclusions: Our analysis shows that approximately 70 % of AS events have relatively low expression levels; furthermore, alternative splice sites seem to be enriched near the constitutive ones to some extent, showing the noise of the splicing mechanisms. However, AS seems to be extensively conserved among the 10 cultivars.
  52. Saltykova, A., Pulido-Tamayo, S., Pazoutova, M., Rensing, S. A., Nishiyama, T., Van de Peer, Y., Marchal, K., et al. (2015). Identifying prokaryotic consortia that live in close interactions with algae. EUROPEAN JOURNAL OF PHYCOLOGY (Vol. 50, pp. 145–146). Presented at the 6th European Phycological Congress.
  53. Ghorbani, S., Lin, Y.-C., Parizot, B., Fernandez Salina, A., Njo, M., Van de Peer, Y., Beeckman, T., et al. (2015). Expanding the repertoire of secretory peptides controlling root development with comparative genome analysis and functional assays. JOURNAL OF EXPERIMENTAL BOTANY, 66(17), 5257–5269.
    Plant genomes encode numerous small secretory peptides (SSPs) whose functions have yet to be explored. Based on structural features that characterize SSP families known to take part in postembryonic development, this comparative genome analysis resulted in the identification of genes coding for oligopeptides potentially involved in cell-to-cell communication. Because genome annotation based on short sequence homology is difficult, the criteria for the de novo identification and aggregation of conserved SSP sequences were first benchmarked across five reference plant species. The resulting gene families were then extended to 32 genome sequences, including major crops. The global phylogenetic pattern common to the functionally characterized SSP families suggests that their apparition and expansion coincide with that of the land plants. The SSP families can be searched online for members, sequences and consensus (http://bioinformatics.psb.ugent.be/webtools/PlantSSP/). Looking for putative regulators of root development, Arabidopsis thaliana SSP genes were further selected through transcriptome meta-analysis based on their expression at specific stages and in specific cell types in the course of the lateral root formation. As an additional indication that formerly uncharacterized SSPs may control development, this study showed that root growth and branching were altered by the application of synthetic peptides matching conserved SSP motifs, sometimes in very specific ways. The strategy used in the study, combining comparative genomics, transcriptome meta-analysis and peptide functional assays in planta, pinpoints factors potentially involved in non-cell-autonomous regulatory mechanisms. A similar approach can be implemented in different species for the study of a wide range of developmental programmes.
  54. De Tiège, A., Tanghe, K., Braeckman, J., & Van de Peer, Y. (2015). Life’s dual nature: a way out of the impasse of the gene-centred “versus” complex systems controversy on life. In P. Pontarotti (Ed.), Evolutionary biology : biodiversification from genotype to phenotype (pp. 35–52). Berlin, Germany: Springer.
    Living cells and organisms are complex physical systems. Does their organization or complexity primarily rely on the intra-molecular crystalline structure of genetic nucleic acid sequences? Or is it, as critics of the ‘gene-centred’ perspective claim, predominantly a result of the inter- and supra-molecular – thus ‘holistic’ – network dynamics of genetic and various extra-genetic factors? The twentieth-century successes in several branches of genetics caused intensive focus on the causal role of genes in the biochemistry, development and evolution of living organisms, resulting in a relative abstraction or even neglect of life’s complex systems dynamics. Today, however, partly due to the success of systems biology, a number of authors defend life’s systems complexity while criticizing the gene-centred approach. Here, we offer a way out of the impasse of the gene-centred ‘versus’ complex systems perspective to arrive at a more balanced and complete understanding of life’s multifaceted nature. After sketching the conceptual and historical background of the controversy, we show how the present state of knowledge in biology vindicates both the holistically complex and gene-centred nature of life on Earth, but decisively falsifies extreme genetic ‘determinism’ and ‘reductionism’ as well as extreme ‘gene-de-centrism’. Contrary to what is often claimed, the fact that genes are one among many extra-genetic causal factors contributing to the biochemistry and development of cells and organisms, only undermines or falsifies genetic determinism and reductionism but not necessarily gene-centrism. Some implications for evolutionary theory, i.e., for the controversy between the Modern Synthesis and an ‘Extended Synthesis’, are outlined.
  55. Morel, G., Sterck, L., Swennen, D., Marcet-Houben, M., Onesime, D., Levasseur, A., Jacques, N., et al. (2015). Differential gene retention as an evolutionary mechanism to generate biodiversity and adaptation in yeasts. SCIENTIFIC REPORTS, 5.
    The evolutionary history of the characters underlying the adaptation of microorganisms to food and biotechnological uses is poorly understood. We undertook comparative genomics to investigate evolutionary relationships of the dairy yeast Geotrichum candidum within Saccharomycotina. Surprisingly, a remarkable proportion of genes showed discordant phylogenies, clustering with the filamentous fungus subphylum (Pezizomycotina), rather than the yeast subphylum (Saccharomycotina), of the Ascomycota. These genes appear not to be the result of Horizontal Gene Transfer (HGT), but to have been specifically retained by G. candidum after the filamentous fungi-yeasts split concomitant with the yeasts' genome contraction. We refer to these genes as SRAGs (Specifically Retained Ancestral Genes), having been lost by all or nearly all other yeasts, and thus contributing to the phenotypic specificity of lineages. SRAG functions include lipases consistent with a role in cheese making and novel endoglucanases associated with degradation of plant material. Similar gene retention was observed in three other distantly related yeasts representative of this ecologically diverse subphylum. The phenomenon thus appears to be widespread in the Saccharomycotina and argues that, alongside neo-functionalization following gene duplication and HGT, specific gene retention must be recognized as an important mechanism for generation of biodiversity and adaptation in yeasts.
  56. Pajoro, A., Biewers, S., Dougali, E., Valentim, F. L., Mendes, M. A., Porri, A., Coupland, G., et al. (2014). The (r)evolution of gene regulatory networks controlling Arabidopsis plant reproduction: a two-decade history. JOURNAL OF EXPERIMENTAL BOTANY, 65(17), 4731–4745.
    Successful plant reproduction relies on the perfect orchestration of singular processes that culminate in the product of reproduction: the seed. The floral transition, floral organ development, and fertilization are well-studied processes and the genetic regulation of the various steps is being increasingly unveiled. Initially, based predominantly on genetic studies, the regulatory pathways were considered to be linear, but recent genome-wide analyses, using high-throughput technologies, have begun to reveal a different scenario. Complex gene regulatory networks underlie these processes, including transcription factors, microRNAs, movable factors, hormones, and chromatin-modifying proteins. Here we review recent progress in understanding the networks that control the major steps in plant reproduction, showing how new advances in experimental and computational technologies have been instrumental. As these recent discoveries were obtained using the model species Arabidopsis thaliana, we will restrict this review to regulatory networks in this important model species. However, more fragmentary information obtained from other species reveals that both the developmental processes and the underlying regulatory networks are largely conserved, making this review also of interest to those studying other plant species.
  57. Yao, Yao, Marchal, K., & Van de Peer, Y. (2014). Improving the adaptability of simulated evolutionary swarm robots in dynamically changing environments. PLOS ONE, 9(3).
    One of the important challenges in the field of evolutionary robotics is the development of systems that can adapt to a changing environment. However, the ability to adapt to unknown and fluctuating environments is not straightforward. Here, we explore the adaptive potential of simulated swarm robots that contain a genomic encoding of a bio-inspired gene regulatory network (GRN). An artificial genome is combined with a flexible agent-based system, representing the activated part of the regulatory network that transduces environmental cues into phenotypic behaviour. Using an artificial life simulation framework that mimics a dynamically changing environment, we show that separating the static from the conditionally active part of the network contributes to a better adaptive behaviour. Furthermore, in contrast with most hitherto developed ANN-based systems that need to re-optimize their complete controller network from scratch each time they are subjected to novel conditions, our system uses its genome to store GRNs whose performance was optimized under a particular environmental condition for a sufficiently long time. When subjected to a new environment, the previous condition-specific GRN might become inactivated, but remains present. This ability to store 'good behaviour' and to disconnect it from the novel rewiring that is essential under a new condition allows faster re-adaptation if any of the previously observed environmental conditions is reencountered. As we show here, applying these evolutionary-based principles leads to accelerated and improved adaptive evolution in a non-stable environment.
  58. Lin, Y.-C., Boone, M., Meuris, L., Lemmens, I., Van Roy, N., Soete, A., Reumers, J., et al. (2014). Genome dynamics of the human embryonic kidney 293 lineage in response to cell biology manipulations. NATURE COMMUNICATIONS, 5.
    The HEK293 human cell lineage is widely used in cell biology and biotechnology. Here we use whole-genome resequencing of six 293 cell lines to study the dynamics of this aneuploid genome in response to the manipulations used to generate common 293 cell derivatives, such as transformation and stable clone generation (293T); suspension growth adaptation (293S); and cytotoxic lectin selection (293SG). Remarkably, we observe that copy number alteration detection could identify the genomic region that enabled cell survival under selective conditions (i.c. ricin selection). Furthermore, we present methods to detect human/vector genome breakpoints and a user-friendly visualization tool for the 293 genome data. We also establish that the genome structure composition is in steady state for most of these cell lines when standard cell culturing conditions are used. This resource enables novel and more informed studies with 293 cells, and we will distribute the sequenced cell lines to this effect.
  59. Mushthofa, M., Torres Torres, G., Van de Peer, Y., Marchal, K., & De Cock, M. (2014). ASP-G: an ASP-based method for finding attractors in genetic regulatory networks. BIOINFORMATICS, 30(21), 3086–3092.
    Motivation: Boolean network models are suitable to simulate GRNs in the absence of detailed kinetic information. However, reducing the biological reality implies making assumptions on how genes interact (interaction rules) and how their state is updated during the simulation (update scheme). The exact choice of the assumptions largely determines the outcome of the simulations. In most cases, however, the biologically correct assumptions are unknown. An ideal simulation thus implies testing different rules and schemes to determine those that best capture an observed biological phenomenon. This is not trivial because most current methods to simulate Boolean network models of GRNs and to compute their attractors impose specific assumptions that cannot be easily altered, as they are built into the system. Results: To allow for a more flexible simulation framework, we developed ASP-G. We show the correctness of ASP-G in simulating Boolean network models and obtaining attractors under different assumptions by successfully recapitulating the detection of attractors of previously published studies. We also provide an example of how performing simulation of network models under different settings helps determine the assumptions under which a certain conclusion holds. The main added value of ASP-G is in its modularity and declarativity, making it more flexible and less error-prone than traditional approaches. The declarative nature of ASP-G comes at the expense of being slower than the more dedicated systems but still achieves a good efficiency with respect to computational time. Availability and implementation: The source code of ASP-G is available at http://bioinformatics.intec.ugent.be/kmarchal/Supplementary_Information_Musthofa_2014/asp-g.zip.
  60. Chaves, I., Lin, Y.-C., Pinto-Ricardo, C., Van de Peer, Y., & Miguel, C. (2014). miRNA profiling in leaf and cork tissues of Quercus suber reveals novel miRNAs and tissue-specific expression patterns. TREE GENETICS & GENOMES, 10(3), 721–737.
    The differentiation of cork (phellem) cells from the phellogen (cork cambium) is a secondary growth process observed in the cork oak tree conferring a unique ability to produce a thick layer of cork. At present, the molecular regulators of phellem differentiation are unknown. The previously documented involvement of microRNAs (miRNAs) in the regulation of developmental processes, including secondary growth, motivated the search for these regulators in cork oak tissues. We performed deep sequencing of the small RNA fraction obtained from cork oak leaves and differentiating phellem. RNA sequences with lengths of 19-25 nt derived from the two libraries were analysed, leading to the identification of 41 families of conserved miRNAs, of which the most abundant were miR167, miR165/166, miR396 and miR159. Thirty novel miRNA candidates were also unveiled, 11 of which were unique to leaves and 13 to phellem. Northern blot detection of a set of conserved and novel miRNAs confirmed their differential expression profile. Prediction and analysis of putative miRNA target genes provided clues regarding processes taking place in leaf and phellem tissues, but further experimental work will be needed for functional characterization. In conclusion, we here provide a first characterization of the miRNA population in a Fagaceae species, and the comparative analysis of miRNA expression in leaf and phellem libraries represents an important step to uncovering specific regulatory networks controlling phellem differentiation.
  61. Ahmed, S., Cock, J. M., Pessia, E., Luthringer, R., Cormier, A., Robuchon, M., Sterck, L., et al. (2014). A haploid system of sex determination in the brown alga Ectocarpus sp. CURRENT BIOLOGY, 24(17), 1945–1957.
    Background: A common feature of most genetic sex-determination systems studied so far is that sex is determined by nonrecombining genomic regions, which can be of various sizes depending on the species. These regions have evolved independently and repeatedly across diverse groups. A number of such sex-determining regions (SDRs) have been studied in animals, plants, and fungi, but very little is known about the evolution of sexes in other eukaryotic lineages. Results: We report here the sequencing and genomic analysis of the SDR of Ectocarpus, a brown alga that has been evolving independently from plants, animals, and fungi for over one giga-annum. In Ectocarpus, sex is expressed during the haploid phase of the life cycle, and both the female (U) and the male (V) sex chromosomes contain nonrecombining regions. The U and V of this species have been diverging for more than 70 mega-annum, yet gene degeneration has been modest, and the SDR is relatively small, with no evidence for evolutionary strata. These features may be explained by the occurrence of strong purifying selection during the haploid phase of the life cycle and the low level of sexual dimorphism. V is dominant over U, suggesting that femaleness may be the default state, adopted when the male haplotype is absent. Conclusions: The Ectocarpus UV system has clearly had a distinct evolutionary trajectory not only to the well-studied XY and ZW systems but also to the UV systems described so far. Nonetheless, some striking similarities exist, indicating remarkable universality of the underlying processes shaping sex chromosome evolution across distant lineages.
  62. De Tiège, A., Tanghe, K., Braeckman, J., & Van de Peer, Y. (2014). From DNA- to NA-centrism and the conditions for gene-centrism revisited. BIOLOGY & PHILOSOPHY, 29(1), 55–69.
    First the 'Weismann barrier' and later on Francis Crick's 'central dogma' of molecular biology nourished the gene-centric paradigm of life, i.e., the conception of the gene/genome as a 'central source' from which hereditary specificity unidirectionally flows or radiates into cellular biochemistry and development. Today, due to advances in molecular genetics and epigenetics, such as the discovery of complex post-genomic and epigenetic processes in which genes are causally integrated, many theorists argue that a gene-centric conception of the organism has become problematic. Here, we first explore the causal implications of the following two central dogma-related issues: (1) widespread reverse transcription, arguing for an extension from 'DNA-genome' to RNA-encompassing 'NA-genome' and, thus, from traditional DNA-centrism to a broader 'NA-centrism'; and (2) the absence of a mechanism of reverse translation, arguing for the 'structural primacy' of NA-sequence over protein in cellular biochemistry. Secondly, we explore whether this latter conclusion can be extended to a 'functional primacy' of NA-sequence over protein in cellular biochemistry, which would imply a limited kind of 'gene/NA-centrism' confined to the subcellular level of NA/protein-based biochemistry. Finally, we explore the conditions, and their (non)fulfilment, for a more generalised form of gene-centrism extendable to higher levels of biological organisation. We conclude that the higher we go in the biological hierarchy, the more dubious gene-centric claims become.
  63. Zhurov, V., Navarro, M., Bruinsma, K. A., Arbona, V., Santamaria, M. E., Cazaux, M., Wybouw, N., et al. (2014). Reciprocal responses in the interaction between Arabidopsis and the cell-content feeding chelicerate herbivore spider mite. PLANT PHYSIOLOGY, 164(1), 384–399.
    Most molecular-genetic studies of plant defense responses to arthropod herbivores have focused on insects. However, plant-feeding mites are also pests of diverse plants, and mites induce different patterns of damage to plant tissues than do well-studied insects (e.g. lepidopteran larvae or aphids). The two-spotted spider mite (Tetranychus urticae) is among the most significant mite pests in agriculture, feeding on a staggering number of plant hosts. To understand the interactions between spider mite and a plant at the molecular level, we examined reciprocal genome-wide responses of mites and its host Arabidopsis (Arabidopsis thaliana). Despite differences in feeding guilds, we found that transcriptional responses of Arabidopsis to mite herbivory resembled those observed for lepidopteran herbivores. Mutant analysis of induced plant defense pathways showed functionally that only a subset of induced programs, including jasmonic acid signaling and biosynthesis of indole glucosinolates, are central to Arabidopsis's defense to mite herbivory. On the herbivore side, indole glucosinolates dramatically increased mite mortality and development times. We identified an indole glucosinolate dose-dependent increase in the number of differentially expressed mite genes belonging to pathways associated with detoxification of xenobiotics. This demonstrates that spider mite is sensitive to Arabidopsis defenses that have also been associated with the deterrence of insect herbivores that are very distantly related to chelicerates. Our findings provide molecular insights into the nature of, and response to, herbivory for a representative of a major class of arthropod herbivores.
  64. Bolton, M. D., de Jonge, R., Inderbitzin, P., Liu, Z., Birla, K., Van de Peer, Y., Subbarao, K. V., et al. (2014). The heterothallic sugarbeet pathogen Cercospora beticola contains exon fragments of both MAT genes that are homogenized by concerted evolution. FUNGAL GENETICS AND BIOLOGY, 62, 43–54.
    Dothideomycetes is one of the most ecologically diverse and economically important classes of fungi. Sexual reproduction in this group is governed by mating type (MAT) genes at the MAT1 locus. Self-sterile (heterothallic) species contain one of two genes at MAT1 (MAT1-1-1 or MAT1-2-1) and only isolates of opposite mating type are sexually compatible. In contrast, self-fertile (homothallic) species contain both MAT genes at MAT1. Knowledge of the reproductive capacities of plant pathogens is of particular interest because recombining populations tend to be more difficult to manage in agricultural settings. In this study, we sequenced MAT1 in the heterothallic Dothideomycete fungus Cercospora beticola to gain insight into the reproductive capabilities of this important plant pathogen. In addition to the expected MAT gene at MAT1, each isolate contained fragments of both MAT1-1-1 and MAT1-2-1 at ostensibly random loci across the genome. When MAT fragments from each locus were manually assembled, they reconstituted MAT1-1-1 and MAT1-2-1 exons with high identity, suggesting a retroposition event occurred in a homothallic ancestor in which both MAT genes were fused. The genome sequences of related taxa revealed that the MAT gene fragment pattern of Cercospora zeae-maydis was analogous to that of C. beticola. In contrast, the genome of more distantly related Mycosphaerella graminicola did not contain MAT fragments. Although fragments occurred in syntenic regions of the C. beticola and C. zeae-maydis genomes, each MAT fragment was more closely related to the intact MAT gene of the same species. Taken together, these data suggest MAT genes fragmented after divergence of M. graminicola from the remaining taxa, and concerted evolution functioned to homogenize MAT fragments and MAT genes in each species.
  65. Morreel, K., Saeys, Y., Dima, O., Lu, F., Van de Peer, Y., Vanholme, R., Ralph, J., et al. (2014). Systematic structural characterization of metabolites in Arabidopsis via candidate substrate-product pair networks. PLANT CELL, 26(3), 929–945.
    Plant metabolomics is increasingly used for pathway discovery and to elucidate gene function. However, the main bottleneck is the identification of the detected compounds. This is more pronounced for secondary metabolites as many of their pathways are still underexplored. Here, an algorithm is presented in which liquid chromatography-mass spectrometry profiles are searched for pairs of peaks that have mass and retention time differences corresponding with those of substrates and products from well-known enzymatic reactions. Concatenating the latter peak pairs, called candidate substrate-product pairs (CSPP), into a network displays tentative (bio)synthetic routes. Starting from known peaks, propagating the network along these routes allows the characterization of adjacent peaks leading to their structure prediction. As a proof-of-principle, this high-throughput cheminformatics procedure was applied to the Arabidopsis thaliana leaf metabolome where it allowed the characterization of the structures of 60% of the profiled compounds. Moreover, based on searches in the Chemical Abstract Service database, the algorithm led to the characterization of 61 compounds that had never been described in plants before. The CSPP-based annotation was confirmed by independent MSn experiments. In addition to being high throughput, this method allows the annotation of low-abundance compounds that are otherwise not amenable to isolation and purification. This method will greatly advance the value of metabolomics in systems biology.
  66. Myburg, A. A., Grattapaglia, D., Tuskan, G. A., Hellsten, U., Hayes, R. D., Grimwood, J., Jenkins, J., et al. (2014). The genome of Eucalyptus grandis. NATURE, 510(7505), 356–362.
    Eucalypts are the world's most widely planted hardwood trees. Their outstanding diversity, adaptability and growth have made them a global renewable resource of fibre and energy. We sequenced and assembled >94% of the 640-megabase genome of Eucalyptus grandis. Of 36,376 predicted protein-coding genes, 34% occur in tandem duplications, the largest proportion thus far in plant genomes. Eucalyptus also shows the highest diversity of genes for specialized metabolites such as terpenes that act as chemical defence and provide unique pharmaceutical oils. Genome sequencing of the E. grandis sister species E. globulus and a set of inbred E. grandis tree genomes reveals dynamic genome evolution and hotspots of inbreeding depression. The E. grandis genome is the first reference for the eudicot order Myrtales and is placed here sister to the eurosids. This resource expands our understanding of the unique biology of large woody perennials and provides a powerful tool to accelerate comparative biology, breeding and biotechnology.
  67. Ranade, S. S., Lin, Y.-C., Zuccolo, A., Van de Peer, Y., & Garcia-Gil, M. del R. (2014). Comparative in silico analysis of EST-SSRs in angiosperm and gymnosperm tree genera. BMC PLANT BIOLOGY, 14.
    Background: Simple Sequence Repeats (SSRs) derived from Expressed Sequence Tags (ESTs) belong to the expressed fraction of the genome and are important for gene regulation, recombination, DNA replication, cell cycle and mismatch repair. Here, we present a comparative analysis of the SSR motif distribution in the 5'UTR, ORF and 3'UTR fractions of ESTs across selected genera of woody trees representing gymnosperms (17 species from seven genera) and angiosperms (40 species from eight genera). Results: Our analysis supports a modest contribution of EST-SSR length to genome size in gymnosperms, while EST-SSR density was not associated with genome size in either angiosperms or gymnosperms. Multiple factors seem to have contributed to the lower abundance of EST-SSRs in gymnosperms that has resulted in a non-linear relationship with genome size diversity. The AG/CT motif was found to be the most abundant in SSRs of both angiosperms and gymnosperms, with a relative increase in AT/AT in the latter. Our data also reveal a higher abundance of hexamers across the gymnosperm genera. Conclusions: Our analysis provides the foundation for future comparative studies at the species level to unravel the evolutionary processes that control the SSR genesis and divergence between angiosperm and gymnosperm tree species.
  68. Blanc-Mathieu, R., Verhelst, B., Derelle, E., Rombauts, S., Bouget, F.-Y., Carre, I., Chateau, A., et al. (2014). An improved genome of the model marine alga Ostreococcus tauri unfolds by assessing Illumina de novo assemblies. BMC GENOMICS, 15.
    Background: Cost effective next generation sequencing technologies now enable the production of genomic datasets for many novel planktonic eukaryotes, representing an understudied reservoir of genetic diversity. O. tauri is the smallest free-living photosynthetic eukaryote known to date, a coccoid green alga that was first isolated in 1995 in a lagoon by the Mediterranean Sea. Its simple features, ease of culture and the sequencing of its 13 Mb haploid nuclear genome have promoted this microalga as a new model organism for cell biology. Here, we investigated the quality of genome assemblies of Illumina GAIIx 75 bp paired end reads from Ostreococcus tauri, thereby also improving the existing assembly and showing the genome to be stably maintained in culture. Results: The 3 assemblers used, ABySS, CLCBio and Velvet, produced 95% complete genomes in 1402 to 2080 scaffolds with a very low rate of misassembly. Reciprocally, these assemblies improved the original genome assembly by filling in 930 gaps. Combined with additional analysis of raw reads and PCR sequencing effort, 1194 gaps have been solved in total adding up to 460 kb of sequence. Mapping of RNAseq Illumina data on this updated genome led to a twofold reduction in the proportion of multi-exon protein coding genes, representing 19% of the total 7699 protein coding genes. The comparison of the DNA extracted in 2001 and 2009 revealed the fixation of 8 single nucleotide substitutions and 2 deletions during the approximately 6000 generations in the lab. The deletions either knocked out or truncated two predicted transmembrane proteins, including a glutamate receptor like gene. Conclusion: High coverage (>80 fold) paired end Illumina sequencing enables a high quality 95% complete genome assembly of a compact 13 Mb haploid eukaryote. This genome sequence has remained stable for 6000 generations of lab culture.
  69. Vanneste, Kevin, Baele, G., Maere, S., & Van de Peer, Y. (2014). Analysis of 41 plant genomes supports a wave of successful genome duplications in association with the Cretaceous-Paleogene boundary. GENOME RESEARCH, 24(8), 1334–1347.
    Ancient whole-genome duplications (WGDs), also referred to as paleopolyploidizations, have been reported in most evolutionary lineages. Their attributed role remains a major topic of discussion, ranging from an evolutionary dead end to a road toward evolutionary success, with evidence supporting both fates. Previously, based on dating WGDs in a limited number of plant species, we found a clustering of angiosperm paleopolyploidizations around the Cretaceous-Paleogene (K-Pg) extinction event about 66 million years ago. Here we revisit this finding, which has proven controversial, by combining genome sequence information for many more plant lineages and using more sophisticated analyses. We include 38 full genome sequences and three transcriptome assemblies in a Bayesian evolutionary analysis framework that incorporates uncorrelated relaxed clock methods and fossil uncertainty. In accordance with earlier findings, we demonstrate a strongly nonrandom pattern of genome duplications over time with many WGDs clustering around the K-Pg boundary. We interpret these results in the context of recent studies on invasive polyploid plant species, and suggest that polyploid establishment is promoted during times of environmental stress. We argue that considering the evolutionary potential of polyploids in light of the environmental and ecological conditions present around the time of polyploidization could mitigate the stark contrast in the proposed evolutionary fates of polyploids.
  70. Ciesielska, K., Van Bogaert, I., Chevineau, S., Li, B., Groeneboer, S., Soetaert, W., Van de Peer, Y., et al. (2014). Exoproteome analysis of Starmerella bombicola results in the discovery of an esterase required for lactonization of sophorolipids. JOURNAL OF PROTEOMICS, 98, 159–174.
  71. Bracken-Grissom, H., Collins, A. G., Collins, T., Crandall, K., Distel, D., Dunn, C., Giribet, G., et al. (2014). The Global Invertebrate Genomics Alliance (GIGA): developing community resources to study diverse invertebrate genomes. JOURNAL OF HEREDITY, 105(1), 1–18.
    Over 95% of all metazoan (animal) species comprise the invertebrates, but very few genomes from these organisms have been sequenced. We have, therefore, formed a Global Invertebrate Genomics Alliance (GIGA). Our intent is to build a collaborative network of diverse scientists to tackle major challenges (e.g., species selection, sample collection and storage, sequence assembly, annotation, analytical tools) associated with genome/transcriptome sequencing across a large taxonomic spectrum. We aim to promote standards that will facilitate comparative approaches to invertebrate genomics and collaborations across the international scientific community. Candidate study taxa include species from Porifera, Ctenophora, Cnidaria, Placozoa, Mollusca, Arthropoda, Echinodermata, Annelida, Bryozoa, and Platyhelminthes, among others. GIGA will target 7000 noninsect/nonnematode species, with an emphasis on marine taxa because of the unrivaled phyletic diversity in the oceans. Priorities for selecting invertebrates for sequencing will include, but are not restricted to, their phylogenetic placement; relevance to organismal, ecological, and conservation research; and their importance to fisheries and human health. We highlight benefits of sequencing both whole genomes (DNA) and transcriptomes and also suggest policies for genomic-level data access and sharing based on transparency and inclusiveness. The GIGA Web site has been launched to facilitate this collaborative venture.
  72. Vermeirssen, Vanessa, De Clercq, I., Van Parys, T., Van Breusegem, F., & Van de Peer, Y. (2014). Arabidopsis ensemble reverse-engineered gene regulatory network discloses interconnected transcription factors in oxidative stress. PLANT CELL, 26(12), 4656–4679.
    The abiotic stress response in plants is complex and tightly controlled by gene regulation. We present an abiotic stress gene regulatory network of 200,014 interactions for 11,938 target genes by integrating four complementary reverse-engineering solutions through average rank aggregation on an Arabidopsis thaliana microarray expression compendium. This ensemble performed the most robustly in benchmarking and greatly expands upon the availability of interactions currently reported. Besides recovering 1182 known regulatory interactions, cis-regulatory motifs and coherent functionalities of target genes corresponded with the predicted transcription factors. We provide a valuable resource of 572 abiotic stress modules of coregulated genes with functional and regulatory information, from which we deduced functional relationships for 1966 uncharacterized genes and many regulators. Using gain- and loss-of-function mutants of seven transcription factors grown under control and salt stress conditions, we experimentally validated 141 out of 271 predictions (52% precision) for 102 selected genes and mapped 148 additional transcription factor-gene regulatory interactions (49% recall). We identified an intricate core oxidative stress regulatory network where NAC13, NAC053, ERF6, WRKY6, and NAC032 transcription factors interconnect and function in detoxification. Our work shows that ensemble reverse-engineering can generate robust biological hypotheses of gene regulation in a multicellular eukaryote that can be tested by medium-throughput experimental validation.
  73. Vanneste, Kevin, Maere, S., & Van de Peer, Y. (2014). Tangled up in two: a burst of genome duplications at the end of the Cretaceous and the consequences for plant evolution. PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY B-BIOLOGICAL SCIENCES, 369(1648).
    Genome sequencing has demonstrated that besides frequent small-scale duplications, large-scale duplication events such as whole genome duplications (WGDs) are found on many branches of the evolutionary tree of life. Especially in the plant lineage, there is evidence for recurrent WGDs, and the ancestor of all angiosperms was in fact most likely a polyploid species. The number of WGDs found in sequenced plant genomes allows us to investigate questions about the roles of WGDs that were hitherto impossible to address. An intriguing observation is that many plant WGDs seem associated with periods of increased environmental stress and/or fluctuations, a trend that is evident for both present-day polyploids and palaeopolyploids formed around the Cretaceous-Palaeogene (K-Pg) extinction at 66 Ma. Here, we revisit the WGDs in plants that mark the K-Pg boundary, and discuss some specific examples of biological innovations and/or diversifications that may be linked to these WGDs. We review evidence for the processes that could have contributed to increased polyploid establishment at the K-Pg boundary, and discuss the implications on subsequent plant evolution in the Cenozoic.
  74. Van Landeghem, S., De Bodt, S., Drebert, Z., Inzé, D., & Van de Peer, Y. (2013). The potential of text mining in data integration and network biology for plant research: a case study on Arabidopsis. PLANT CELL, 25(3), 794–807.
    Despite the availability of various data repositories for plant research, a wealth of information currently remains hidden within the biomolecular literature. Text mining provides the necessary means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology research in general and for network biology in particular using a state-of-the-art text mining system applied to all PubMed abstracts and PubMed Central full texts. We present extensive evaluation of the textual data for Arabidopsis thaliana, assessing the overall accuracy of this new resource for usage in plant network analyses. Furthermore, we combine text mining information with both protein-protein and regulatory interactions from experimental databases. Clusters of tightly connected genes are delineated from the resulting network, illustrating how such an integrative approach is essential to grasp the current knowledge available for Arabidopsis and to uncover gene information through guilt by association. All large-scale data sets, as well as the manually curated textual data, are made publicly available, thereby stimulating the application of text mining data in future plant biology studies.
  75. Ruttink, T., Sterck, L., Rohde, A., Bendixen, C., Rouzé, P., Asp, T., Van de Peer, Y., et al. (2013). Orthology Guided Assembly in highly heterozygous crops: creating a reference transcriptome to uncover genetic diversity in Lolium perenne. PLANT BIOTECHNOLOGY JOURNAL, 11(5), 605–617.
    Despite current advances in next-generation sequencing data analysis procedures, de novo assembly of a reference sequence required for SNP discovery and expression analysis is still a major challenge in genetically uncharacterized, highly heterozygous species. High levels of polymorphism inherent to outbreeding crop species hamper De Bruijn Graph-based de novo assembly algorithms, causing transcript fragmentation and the redundant assembly of allelic contigs. If multiple genotypes are sequenced to study genetic diversity, primary de novo assembly is best performed per genotype to limit the level of polymorphism and avoid transcript fragmentation. Here, we propose an Orthology Guided Assembly procedure that first uses sequence similarity (tBLASTn) to proteins of a model species to select allelic and fragmented contigs from all genotypes and then performs CAP3 clustering on a gene-by-gene basis. Thus, we simultaneously annotate putative orthologues for each protein of the model species, resolve allelic redundancy and fragmentation and create a de novo transcript sequence representing the consensus of all alleles present in the sequenced genotypes. We demonstrate the procedure using RNA-seq data from 14 genotypes of Lolium perenne to generate a reference transcriptome for gene discovery and translational research, to reveal the transcriptome-wide distribution and density of SNPs in an outbreeding crop and to illustrate the effect of polymorphisms on the assembly procedure. The results presented here illustrate that constructing a non-redundant reference sequence is essential for comparative genomics, orthology-based annotation and candidate gene selection but also for read mapping and subsequent polymorphism discovery and/or read count-based gene expression analysis.
  76. Ciesielska, K., Li, B., Groeneboer, S., Van Bogaert, I., Lin, Y.-C., Soetaert, W., Van de Peer, Y., et al. (2013). SILAC-based proteome analysis of Starmerella bombicola sophorolipid production. JOURNAL OF PROTEOME RESEARCH, 12(10), 4376–4392.
    Starmerella (Candida) bombicola is the biosurfactant-producing species that has attracted the most attention in the academic and industrial world due to its ability to produce large amounts of sophorolipids. Despite its high economic potential, the biochemistry behind the sophorolipid biosynthesis is still poorly understood. Here we present the first proteomic characterization of S. bombicola for which we created a lys1Δ mutant to allow the use of SILAC for quantitative analysis. To characterize the processes behind the production of these biosurfactants, we compared the proteome of sophorolipid producing (early stationary phase) and nonproducing cells (exponential phase). We report the simultaneous production of all known enzymes involved in sophorolipid biosynthesis including a predicted sophorolipid transporter. In addition, we identified the heme binding protein Dap1 as a possible regulator for Cyp52M1. Our results further indicate that ammonium and phosphate limitation are not the sole limiting factors inducing sophorolipid biosynthesis.
  77. Zimmer, A. D., Lang, D., Buchta, K., Rombauts, S., Nishiyama, T., Hasebe, M., Van de Peer, Y., et al. (2013). Reannotation and extended community resources for the genome of the non-seed plant Physcomitrella patens provide insights into the evolution of plant gene structures and functions. BMC GENOMICS, 14.
    Background: The moss Physcomitrella patens as a model species provides an important reference for early-diverging lineages of plants and the release of the genome in 2008 opened the doors to genome-wide studies. The usability of a reference genome greatly depends on the quality of the annotation and the availability of centralized community resources. Therefore, in the light of accumulating evidence for missing genes, fragmentary gene structures, false annotations and a low rate of functional annotations on the original release, we decided to improve the moss genome annotation. Results: Here, we report the complete moss genome re-annotation (designated V1.6) incorporating the increased transcript availability from a multitude of developmental stages and tissue types. We demonstrate the utility of the improved P. patens genome annotation for comparative genomics and new extensions to the cosmoss.org resource as a central repository for this plant "flagship" genome. The structural annotation of 32,275 protein-coding genes results in 8387 additional loci including 1456 loci with known protein domains or homologs in Plantae. This is the first release to include information on transcript isoforms, suggesting alternative splicing events for at least 10.8% of the loci. Furthermore, this release now also provides information on non-protein-coding loci. Functional annotations were improved regarding quality and coverage, resulting in 58% annotated loci (previously: 41%) that comprise also 7200 additional loci with GO annotations. Access and manual curation of the functional and structural genome annotation is provided via the www.cosmoss.org model organism database. Conclusions: Comparative analysis of gene structure evolution along the green plant lineage provides novel insights, such as a comparatively high number of loci with 5'-UTR introns in the moss. Comparative analysis of functional annotations reveals expansions of moss house-keeping and metabolic genes and further possibly adaptive, lineage-specific expansions and gains including at least 13% orphan genes.
  78. Van Bel, M., Proost, S., Van Neste, C., Deforce, D., Van de Peer, Y., & Vandepoele, K. (2013). TRAPID: an efficient online tool for the functional and comparative analysis of de novo RNA-Seq transcriptomes. GENOME BIOLOGY, 14(12).
    Transcriptome analysis through next-generation sequencing technologies allows the generation of detailed gene catalogs for non-model species, at the cost of new challenges with regards to computational requirements and bioinformatics expertise. Here, we present TRAPID, an online tool for the fast and efficient processing of assembled RNA-Seq transcriptome data, developed to mitigate these challenges. TRAPID offers high-throughput open reading frame detection, frameshift correction and includes a functional, comparative and phylogenetic toolbox, making use of 175 reference proteomes. Benchmarking and comparison against state-of-the-art transcript analysis tools reveals the efficiency and unique features of the TRAPID system.
  79. Read, B. A., Kegel, J., Klute, M. J., Kuo, A., Lefebvre, S. C., Maumus, F., Mayer, C., et al. (2013). Pan genome of the phytoplankton Emiliania underpins its global distribution. NATURE, 499(7457), 209–213.
    Coccolithophores have influenced the global climate for over 200 million years. These marine phytoplankton can account for 20 per cent of total carbon fixation in some systems. They form blooms that can occupy hundreds of thousands of square kilometres and are distinguished by their elegantly sculpted calcium carbonate exoskeletons (coccoliths), rendering them visible from space. Although coccolithophores export carbon in the form of organic matter and calcite to the sea floor, they also release CO2 in the calcification process. Hence, they have a complex influence on the carbon cycle, driving either CO2 production or uptake, sequestration and export to the deep ocean. Here we report the first haptophyte reference genome, from the coccolithophore Emiliania huxleyi strain CCMP1516, and sequences from 13 additional isolates. Our analyses reveal a pan genome (core genes plus genes distributed variably between strains) probably supported by an atypical complement of repetitive sequence in the genome. Comparisons across strains demonstrate that E. huxleyi, which has long been considered a single species, harbours extensive genome variability reflected in different metabolic repertoires. Genome variability within this species complex seems to underpin its capacity both to thrive in habitats ranging from the equator to the subarctic and to form large-scale episodic blooms under a wide variety of environmental conditions.
  80. Vandepoele, K., Van Bel, M., Richard, G., Van Landeghem, S., Verhelst, B., Moreau, H., Van de Peer, Y., et al. (2013). pico-PLAZA, a genome database of microbial photosynthetic eukaryotes. ENVIRONMENTAL MICROBIOLOGY, 15(8), 2147–2153.
    With the advent of next generation genome sequencing, the number of sequenced algal genomes and transcriptomes is rapidly growing. Although a few genome portals exist to browse individual genome sequences, exploring complete genome information from multiple species for the analysis of user-defined sequences or gene lists remains a major challenge. pico-PLAZA is a web-based resource (http://bioinformatics.psb.ugent.be/pico-plaza/) for algal genomics that combines different data types with intuitive tools to explore genomic diversity, perform integrative evolutionary sequence analysis and study gene functions. Apart from homologous gene families, multiple sequence alignments, phylogenetic trees, Gene Ontology, InterPro and text-mining functional annotations, different interactive viewers are available to study genome organization using gene collinearity and synteny information. Different search functions, documentation pages, export functions and an extensive glossary are available to guide non-expert scientists. PLAZA can be used to functionally characterize large-scale EST/RNA-Seq data sets and to perform environmental genomics. Functional enrichment analysis of 16 Phaeodactylum tricornutum transcriptome libraries offers a molecular view on diatom adaptation to different environments of ecological relevance. Furthermore, we show how complementary genomic data sources can easily be combined to identify marker genes to study the diversity and distribution of algal species, for example in metagenomes, or to quantify intraspecific diversity from environmental strains.
  81. De Clercq, I., Vermeirssen, V., Van Aken, O., Vandepoele, K., Murcha, M. W., Law, S. R., Inzé, A., et al. (2013). The membrane-bound NAC transcription factor ANAC013 functions in mitochondrial retrograde regulation of the oxidative stress response in Arabidopsis. PLANT CELL, 25(9), 3472–3490.
    Upon disturbance of their function by stress, mitochondria can signal to the nucleus to steer the expression of responsive genes. This mitochondria-to-nucleus communication is often referred to as mitochondrial retrograde regulation (MRR). Although reactive oxygen species and calcium are likely candidate signaling molecules for MRR, the protein signaling components in plants remain largely unknown. Through meta-analysis of transcriptome data, we detected a set of genes that are common and robust targets of MRR and used them as a bait to identify its transcriptional regulators. In the upstream regions of these mitochondrial dysfunction stimulon (MDS) genes, we found a cis-regulatory element, the mitochondrial dysfunction motif (MDM), which is necessary and sufficient for gene expression under various mitochondrial perturbation conditions. Yeast one-hybrid analysis and electrophoretic mobility shift assays revealed that the transmembrane domain-containing NO APICAL MERISTEM/ARABIDOPSIS TRANSCRIPTION ACTIVATION FACTOR/CUP-SHAPED COTYLEDON transcription factors (ANAC013, ANAC016, ANAC017, ANAC053, and ANAC078) bound to the MDM cis-regulatory element. We demonstrate that ANAC013 mediates MRR-induced expression of the MDS genes by direct interaction with the MDM cis-regulatory element and triggers increased oxidative stress tolerance. In conclusion, we characterized ANAC013 as a regulator of MRR upon stress in Arabidopsis thaliana.
  82. Andolfo, G., Sanseverino, W., Rombauts, S., Van de Peer, Y., Bradeen, J., Carputo, D., Frusciante, L., et al. (2013). Overview of tomato (Solanum lycopersicum) candidate pathogen recognition genes reveals important Solanum R locus dynamics. NEW PHYTOLOGIST, 197(1), 223–237.
    To investigate the genome-wide spatial arrangement of R loci, a complete catalogue of tomato (Solanum lycopersicum) and potato (Solanum tuberosum) nucleotide-binding site (NBS), receptor-like protein (RLP) and receptor-like kinase (RLK) gene repertories was generated. Candidate pathogen recognition genes were characterized with respect to structural diversity, phylogenetic relationships and chromosomal distribution. NBS genes frequently occur in clusters of related gene copies that also include RLP or RLK genes. This scenario is compatible with the existence of selective pressures optimizing coordinated transcription. A number of duplication events associated with lineage-specific evolution were discovered. These findings suggest that different evolutionary mechanisms shaped pathogen recognition gene cluster architecture to expand and to modulate the defence repertoire. Analysis of pathogen recognition gene clusters associated with documented resistance function allowed the identification of adaptive divergence events and the reconstruction of the evolution history of these loci. Differences in candidate pathogen recognition gene number and organization were found between tomato and potato. Most candidate pathogen recognition gene orthologues were distributed at less than perfectly matching positions, suggesting an ongoing lineage-specific rearrangement. Indeed, a local expansion of Toll/Interleukin-1 receptor (TIR)-NBS-leucine-rich repeat (LRR) (TNL) genes in the potato genome was evident. Taken together, these findings have implications for improved understanding of the mechanisms of molecular adaptive selection at Solanum R loci.
  83. De Smet, Riet, Adams, K. L., Vandepoele, K., Van Montagu, M., Maere, S., & Van de Peer, Y. (2013). Convergent gene loss following gene and genome duplications creates single-copy families in flowering plants. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 110(8), 2898–2903.
    The importance of gene gain through duplication has long been appreciated. In contrast, the importance of gene loss has only recently attracted attention. Indeed, studies in organisms ranging from plants to worms and humans suggest that duplication of some genes might be better tolerated than that of others. Here we have undertaken a large-scale study to investigate the existence of duplication-resistant genes in the sequenced genomes of 20 flowering plants. We demonstrate that there is a large set of genes that is convergently restored to single-copy status following multiple genome-wide and smaller scale duplication events. We rule out the possibility that such a pattern could be explained by random gene loss only and therefore propose that there is selection pressure to preserve such genes as singletons. This is further substantiated by the observation that angiosperm single-copy genes do not comprise a random fraction of the genome, but instead are often involved in essential housekeeping functions that are highly conserved across all eukaryotes. Furthermore, single-copy genes are generally expressed more highly and in more tissues than non-single-copy genes, and they exhibit higher sequence conservation. Finally, we propose different hypotheses to explain their resistance against duplication.
  84. Van Landeghem, S., Björne, J., Wei, C.-H., Hakala, K., Pyysalo, S., Ananiadou, S., Kao, H.-Y., et al. (2013). Large-scale event extraction from literature with multi-level gene normalization. PLOS ONE, 8(4).
    Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique gene and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/). 
    Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons Attribution-ShareAlike (CC BY-SA) license.
  85. Roelants, S., Saerens, K., Derycke, T., Li, B., Lin, Y.-C., Van de Peer, Y., De Maeseneire, S., et al. (2013). Candida bombicola as a platform organism for the production of tailor-made biomolecules. BIOTECHNOLOGY AND BIOENGINEERING, 110(9), 2494–2503.
  86. Galagan, J. E., Minch, K., Peterson, M., Lyubetskaya, A., Azizi, E., Sweet, L., Gomes, A., et al. (2013). The Mycobacterium tuberculosis regulatory network and hypoxia. NATURE, 499(7457), 178–183.
    We have taken the first steps towards a complete reconstruction of the Mycobacterium tuberculosis regulatory network based on ChIP-Seq and combined this reconstruction with system-wide profiling of messenger RNAs, proteins, metabolites and lipids during hypoxia and re-aeration. Adaptations to hypoxia are thought to have a prominent role in M. tuberculosis pathogenesis. Using ChIP-Seq combined with expression data from the induction of the same factors, we have reconstructed a draft regulatory network based on 50 transcription factors. This network model revealed a direct interconnection between the hypoxic response, lipid catabolism, lipid anabolism and the production of cell wall lipids. As a validation of this model, in response to oxygen availability we observe substantial alterations in lipid content and changes in gene expression and metabolites in corresponding metabolic pathways. The regulatory network reveals transcription factors underlying these changes, allows us to computationally predict expression changes, and indicates that Rv0081 is a regulatory hub.
  87. Fawcett, J., Van de Peer, Y., & Maere, S. (2013). Significance and biological consequences of polyploidization in land plant evolution. In J. Greilhuber, J. Doležel, & J. F. Wendel (Eds.), Physical structure, behaviour and evolution of plant genomes (Vol. 2, pp. 277–293). Vienna, Austria: Springer.
  88. Vanneste, Kevin, Van de Peer, Y., & Maere, S. (2013). Inference of genome duplications from age distributions revisited. MOLECULAR BIOLOGY AND EVOLUTION, 30(1), 177–190.
    Whole-genome duplications (WGDs), thought to facilitate evolutionary innovations and adaptations, have been uncovered in many phylogenetic lineages. WGDs are frequently inferred from duplicate age distributions, where they manifest themselves as peaks against a small-scale duplication background. However, the interpretation of duplicate age distributions is complicated by the use of Ks, the number of synonymous substitutions per synonymous site, as a proxy for the age of paralogs. Two particular concerns are the stochastic nature of synonymous substitutions leading to increasing uncertainty in Ks with increasing age since duplication and Ks saturation caused by the inability of evolutionary models to fully correct for the occurrence of multiple substitutions at the same site. Ks stochasticity is expected to erode the signal of older WGDs, whereas Ks saturation may lead to artificial peaks in the distribution. Here, we investigate the consequences of these effects on Ks-based age distributions and WGD inference by simulating the evolution of duplicated sequences according to predefined real age distributions and re-estimating the corresponding Ks distributions. We show that, although Ks estimates can be used for WGD inference far beyond the commonly accepted Ks threshold of 1, Ks saturation effects can cause artificial peaks at higher ages. Moreover, Ks stochasticity and saturation may lead to confounded peaks encompassing multiple WGD events and/or saturation artifacts. We argue that Ks effects need to be properly accounted for when inferring WGDs from age distributions and that the failure to do so could lead to false inferences.
  89. Van Bogaert, Inge, Holvoet, K., Roelants, S., Li, B., Lin, Y.-C., Van de Peer, Y., & Soetaert, W. (2013). The biosynthetic gene cluster for sophorolipids: a biotechnological interesting biosurfactant produced by Starmerella bombicola. MOLECULAR MICROBIOLOGY, 88(3), 501–509.
    Sophorolipids are promising biologically derived surfactants or detergents which find application in household cleaning, personal care and cosmetics. They are produced by specific yeast species and among those, Starmerella bombicola (former Candida bombicola) is the most widely used and studied one. Despite the commercial interest in sophorolipids, the biosynthetic pathway of these secondary metabolites remained hitherto partially unsolved. In this manuscript we present the sophorolipid gene cluster consisting of five genes directly involved in sophorolipid synthesis: a cytochrome P450 monooxygenase, two glucosyltransferases, an acetyltransferase and a transporter. It was demonstrated that disabling the first step of the pathway (cytochrome P450 monooxygenase mediated terminal or subterminal hydroxylation of a common fatty acid) results in complete abolishment of sophorolipid production. This phenotype could be complemented by supplying the yeast with hydroxylated fatty acids. On the other hand, knocking out the transporter gene yields mutants still able to secrete sophorolipids, though only at levels of 10% as compared with the wild type, suggesting alternative routes for secretion. Finally, it was proved that hampering sophorolipid production does not affect cell growth or cell viability in laboratory conditions, as can be expected for secondary metabolites.
  90. Verhelst, Bram, Van de Peer, Y., & Rouzé, P. (2013). The complex intron landscape and massive intron invasion in a picoeukaryote provides insights into intron evolution. GENOME BIOLOGY AND EVOLUTION, 5(12), 2393–2401.
    Genes in pieces and spliceosomal introns are a landmark of eukaryotes, with intron invasion usually assumed to have happened early on in evolution. Here, we analyse the intron landscape of Micromonas, a unicellular green alga in the Mamiellophyceae lineage, demonstrating the co-existence of several classes of introns and the occurrence of recent massive intron invasion. This study focuses on two strains, CCMP1545 and RCC299, and their related individuals from ocean samplings, showing that they not only harbour different classes of introns depending on their location in the genome, as for other Mamiellophyceae, but uniquely carry several classes of repeat introns. These introns, dubbed introner elements (IEs), are found at novel positions in genes and have conserved sequences, contrary to canonical introns. This IE invasion has a huge impact on the genome, doubling the number of introns in the CCMP1545 strain. We hypothesize that each IE class originated from a single ancestral IE that has been colonizing the genome after strain divergence by inserting copies of itself into genes by intron transposition, likely involving reverse splicing. Along with similar cases recently observed in other organisms, our observations in Micromonas strains shed a new light on the evolution of introns, suggesting that intron gain is more widespread than previously thought.
  91. Nystedt, B., Street, N. R., Wetterbom, A., Zuccolo, A., Lin, Y.-C., Scofield, D. G., Vezzi, F., et al. (2013). The Norway spruce genome sequence and conifer genome evolution. NATURE, 497(7451), 579–584.
    Conifers have dominated forests for more than 200 million years and are of huge ecological and economic importance. Here we present the draft assembly of the 20-gigabase genome of Norway spruce (Picea abies), the first available for any gymnosperm. The number of well-supported genes (28,354) is similar to the >100 times smaller genome of Arabidopsis thaliana, and there is no evidence of a recent whole-genome duplication in the gymnosperm lineage. Instead, the large genome size seems to result from the slow and steady accumulation of a diverse set of long-terminal repeat transposable elements, possibly owing to the lack of an efficient elimination mechanism. Comparative sequencing of Pinus sylvestris, Abies sibirica, Juniperus communis, Taxus baccata and Gnetum gnemon reveals that the transposable element diversity is shared among extant conifers. Expression of 24-nucleotide small RNAs, previously implicated in transposable element silencing, is tissue-specific and much lower than in other plants. We further identify numerous long (>10,000 base pairs) introns, gene-like fragments, uncharacterized long non-coding RNAs and short RNAs. This opens up new genomic avenues for conifer forestry and breeding.
  92. De Smet, Riet, & Van de Peer, Y. (2012). Redundancy and rewiring of genetic networks following genome-wide duplication events. CURRENT OPINION IN PLANT BIOLOGY, 15(2), 168–176.
    Polyploidy or whole-genome duplication is a frequent phenomenon within the plant kingdom and has been associated with the occurrence of evolutionary novelty and increase in biological complexity. Because genome-wide duplication events duplicate whole molecular networks it is of interest to investigate how these networks evolve subsequent to such events. Although genome duplications are generally followed by massive gene loss, at least part of the network is usually retained in duplicate and can rewire to execute novel functions. Alternatively, the network can remain largely redundant and as such confer robustness against mutations. The increasing availability of high-throughput data makes it possible to study evolution following whole genome duplication events at the network level. Here we discuss how the use of 'omics' data in network analysis can provide novel insights on network redundancy and rewiring and conclude with some directions for future research.
  93. Van de Peer, Y., & Pires, J. C. (2012). Getting up to speed. CURRENT OPINION IN PLANT BIOLOGY.
  94. Fawcett, J., Rouzé, P., & Van de Peer, Y. (2012). Higher intron loss rate in Arabidopsis thaliana than A. lyrata is consistent with stronger selection for a smaller genome. MOLECULAR BIOLOGY AND EVOLUTION, 29(2), 849–859.
    The number of introns varies considerably among different organisms. This can be explained by the differences in the rates of intron gain and loss. Two factors that are likely to influence these rates are selection for or against introns and the mutation rate that generates the novel intron or the intronless copy. Although it has been speculated that stronger selection for a compact genome might result in a higher rate of intron loss and a lower rate of intron gain, clear evidence is lacking, and the role of selection in determining these rates has not been established. Here, we studied the gain and loss of introns in the two closely related species Arabidopsis thaliana and A. lyrata as it was recently shown that A. thaliana has been undergoing a faster genome reduction driven by selection. We found that A. thaliana has lost six times more introns than A. lyrata since the divergence of the two species but gained very few introns. We suggest that stronger selection for genome reduction probably resulted in the much higher intron loss rate in A. thaliana, although further analysis is required as we could not find evidence that the loss rate increased in A. thaliana as opposed to having decreased in A. lyrata compared with the rate in the common ancestor. We also examined the pattern of the intron gains and losses to better understand the mechanisms by which they occur. Microsimilarity was detected between the splice sites of several gained and lost introns, suggesting that nonhomologous end joining repair of double-strand breaks might be a common pathway not only for intron gain but also for intron loss.
  95. Hacquard, S., Joly, D. L., Lin, Y.-C., Tisserant, E., Feau, N., Delaruelle, C., Legué, V., et al. (2012). A comprehensive analysis of genes encoding small secreted proteins identifies candidate effectors in Melampsora larici-populina (poplar leaf rust). MOLECULAR PLANT-MICROBE INTERACTIONS, 25(3), 279–293.
    The obligate biotrophic rust fungus Melampsora larici-populina is the most devastating and widespread pathogen of poplars. Studies over recent years have identified various small secreted proteins (SSP) from plant biotrophic filamentous pathogens and have highlighted their role as effectors in host-pathogen interactions. The recent analysis of the M. larici-populina genome sequence has revealed the presence of 1,184 SSP-encoding genes in this rust fungus. In the present study, the expression and evolutionary dynamics of these SSP were investigated to pinpoint the arsenal of putative effectors that could be involved in the interaction between the rust fungus and poplar. Similarity with effectors previously described in Melampsora spp., richness in cysteines, and organization in large families were extensively detailed and discussed. Positive selection analyses conducted over clusters of paralogous genes revealed fast-evolving candidate effectors. Transcript profiling of selected M. larici-populina SSP showed a timely coordinated expression during leaf infection, and the accumulation of four candidate effectors in distinct rust infection structures was demonstrated by immunolocalization. This integrated and multifaceted approach helps to prioritize candidate effector genes for functional studies.
  96. Van Bel, M., Proost, S., Wischnitzki, E., Movahedi, S., Scheerlinck, C., Van de Peer, Y., & Vandepoele, K. (2012). Dissecting plant genomes with the PLAZA comparative genomics platform. PLANT PHYSIOLOGY, 158(2), 590–600.
    With the arrival of low-cost, next-generation sequencing, a multitude of new plant genomes are being publicly released, providing unseen opportunities and challenges for comparative genomics studies. Here, we present PLAZA 2.5, a user-friendly online research environment to explore genomic information from different plants. This new release features updates to previous genome annotations and a substantial number of newly available plant genomes as well as various new interactive tools and visualizations. Currently, PLAZA hosts 25 organisms covering a broad taxonomic range, including 13 eudicots, five monocots, one lycopod, one moss, and five algae. The available data consist of structural and functional gene annotations, homologous gene families, multiple sequence alignments, phylogenetic trees, and colinear regions within and between species. A new Integrative Orthology Viewer, combining information from different orthology prediction methodologies, was developed to efficiently investigate complex orthology relationships. Cross-species expression analysis revealed that the integration of complementary data types extended the scope of complex orthology relationships, especially between more distantly related species. Finally, based on phylogenetic profiling, we propose a set of core gene families within the green plant lineage that will be instrumental to assess the gene space of draft or newly sequenced plant genomes during the assembly or annotation phase.
  97. Van Landeghem, S., Hakala, K., Rönnqvist, S., Salakoski, T., Van de Peer, Y., & Ginter, F. (2012). Exploring biomolecular literature with EVEX: connecting genes through events, homology, and indirect associations. ADVANCES IN BIOINFORMATICS, 2012.
    Technological advancements in the field of genetics have led not only to an abundance of experimental data, but also caused an exponential increase of the number of published biomolecular studies. Text mining is widely accepted as a promising technique to help researchers in the life sciences deal with the amount of available literature. This paper presents a freely available web application built on top of 21.3 million detailed biomolecular events extracted from all PubMed abstracts. These text mining results were generated by a state-of-the-art event extraction system and enriched with gene family associations and abstract generalizations, accounting for lexical variants and synonymy. The EVEX resource locates relevant literature on phosphorylation, regulation targets, binding partners, and several other biomolecular events and assigns confidence values to these events. The search function accepts official gene/protein symbols as well as common names from all species. Finally, the web application is a powerful tool for generating homology-based hypotheses as well as novel, indirect associations between genes and proteins such as coregulators.
  98. Amoutzias, G. D., He, Y., Lilley, K. S., Van de Peer, Y., & Oliver, S. G. (2012). Evaluation and properties of the budding yeast phosphoproteome. MOLECULAR & CELLULAR PROTEOMICS, 11(6).
    We have assembled a reliable phosphoproteomic data set for budding yeast Saccharomyces cerevisiae and have investigated its properties. Twelve publicly available phosphoproteome data sets were triaged to obtain a subset of high-confidence phosphorylation sites (p-sites), free of "noisy" phosphorylations. Analysis of this combined data set suggests that the inventory of phosphoproteins in yeast is close to completion, but that these proteins may have many undiscovered p-sites. Proteins involved in budding and protein kinase activity have high numbers of p-sites and are highly over-represented in the vast majority of the yeast phosphoproteome data sets. The yeast phosphoproteome is characterized by a few proteins with many p-sites and many proteins with a few p-sites. We confirm a tendency for p-sites to cluster together and find evidence that kinases may phosphorylate off-target amino acids that are within one or two residues of their cognate target. This suggests that the precise position of the phosphorylated amino acid is not a stringent requirement for regulatory fidelity. Compared with nonphosphorylated proteins, phosphoproteins are more ancient, more abundant, have longer unstructured regions, have more genetic interactions, more protein interactions, and are under tighter post-translational regulation. It appears that phosphoproteins constitute the raw material for pathway rewiring and adaptation at various evolutionary rates.
  99. Murat, F., Van de Peer, Y., & Salse, J. (2012). Decoding plant and animal genome plasticity from differential paleo-evolutionary patterns and processes. GENOME BIOLOGY AND EVOLUTION, 4(9), 917–928.
    Continuing advances in genome sequencing technologies and computational methods for comparative genomics currently allow inferring the evolutionary history of entire plant and animal genomes. Based on the comparison of the plant and animal genome paleohistory, major differences are unveiled in 1) evolutionary mechanisms (i.e., polyploidization versus diploidization processes), 2) genome conservation (i.e., coding versus noncoding sequence maintenance), and 3) modern genome architecture (i.e., genome organization including repeats expansion versus contraction phenomena). This article discusses how extant animal and plant genomes are the result of inherently different rates and modes of genome evolution resulting in relatively stable animal and much more dynamic and plastic plant genomes.
  100. Torres Torres, G., Marchal, K., Van de Peer, Y., & De Cock, M. (2012). An ASP-based simulation method for finding all synchronous and asynchronous attractors in genetic regulatory networks. ISCB Student Council, 8th Symposium, Abstracts. Presented at the 8th ISCB Student Council Symposium, International Society for Computational Biology (ISCB).
  101. Torres Torres, G., Marchal, K., Van de Peer, Y., & De Cock, M. (2012). Predicting long term behavior of genetic regulatory networks with answer set programming. In Bernard De Baets, B. Manderick, M. Rademaker, & W. Waegeman (Eds.), Proceedings of the 21st Belgian-Dutch conference on machine learning. Presented at the 21st Annual Belgian-Dutch conference on Machine Learning (BeNeLearn & PMLS 2012).
  102. Sterck, L., Billiau, K., Abeel, T., Rouzé, P., & Van de Peer, Y. (2012). ORCAE: online resource for community annotation of eukaryotes. NATURE METHODS, 9(11), 1041–1041.
  103. Björne, J., Van Landeghem, S., Pyysalo, S., Ohta, T., Ginter, F., Van de Peer, Y., Ananiadou, S., et al. (2012). PubMed-scale event extraction for post-translational modifications, epigenetics and protein structural relations. Proceedings of the 2012 workshop on biomedical natural language processing (pp. 82–90). Presented at the 2012 Workshop on Biomedical Natural Language Processing (BioNLP 2012), Association for Computational Linguistics (ACL).
    Recent efforts in biomolecular event extraction have mainly focused on core event types involving genes and proteins, such as gene expression, protein-protein interactions, and protein catabolism. The BioNLP’11 Shared Task extended the event extraction approach to sub-protein events and relations in the Epigenetics and Post-translational Modifications (EPI) and Protein Relations (REL) tasks. In this study, we apply the Turku Event Extraction System, the best-performing system for these tasks, to all PubMed abstracts and all available PMC full-text articles, extracting 1.4M EPI events and 2.2M REL relations from 21M abstracts and 372K articles. We introduce several entity normalization algorithms for genes, proteins, protein complexes and protein components, aiming to uniquely identify these biological entities. This normalization effort allows direct mapping of the extracted events and relations with posttranslational modifications from UniProt, epigenetics from PubMeth, functional domains from InterPro and macromolecular structures from PDB. The extraction of such detailed protein information provides a unique text mining dataset, offering the opportunity to further deepen the information provided by existing PubMed-scale event extraction efforts. The methods and data introduced in this study are freely available from bionlp.utu.fi
  104. Abeel, T., Van Parys, T., Saeys, Y., Galagan, J., & Van de Peer, Y. (2012). GenomeView: a next-generation genome browser. NUCLEIC ACIDS RESEARCH, 40(2).
    Due to ongoing advances in sequencing technologies, billions of nucleotide sequences are now produced on a daily basis. A major challenge is to visualize these data for further downstream analysis. To this end, we present GenomeView, a stand-alone genome browser specifically designed to visualize and manipulate a multitude of genomics data. GenomeView enables users to dynamically browse high volumes of aligned short-read data, with dynamic navigation and semantic zooming, from the whole genome level to the single nucleotide. At the same time, the tool enables visualization of whole genome alignments of dozens of genomes relative to a reference sequence. GenomeView is unique in its capability to interactively handle huge data sets consisting of tens of aligned genomes, thousands of annotation features and millions of mapped short reads both as viewer and editor. GenomeView is freely available as an open source software package.
  105. Malacarne, G., Perazzolli, M., Cestaro, A., Sterck, L., Fontana, P., Van de Peer, Y., Viola, R., et al. (2012). Deconstruction of the (paleo)polyploid grapevine genome based on the analysis of transposition events involving NBS resistance genes. PLOS ONE, 7(1).
    Plants have followed a reticulate type of evolution and taxa have frequently merged via allopolyploidization. A polyploid structure of sequenced genomes has often been proposed, but the chromosomes belonging to putative component genomes are difficult to identify. The 19 grapevine chromosomes are evolutionary stable structures: their homologous triplets have strongly conserved gene order, interrupted by rare translocations. The aim of this study is to examine how the grapevine nucleotide-binding site (NBS)-encoding resistance (NBS-R) genes have evolved in the genomic context and to understand mechanisms for the genome evolution. We show that, in grapevine, i) helitrons have significantly contributed to transposition of NBS-R genes, and ii) NBS-R gene cluster similarity indicates the existence of two groups of chromosomes (named as Va and Vc) that may have evolved independently. Chromosome triplets consist of two Va and one Vc chromosomes, as expected from the tetraploid and diploid conditions of the two component genomes. The hexaploid state could have been derived from either allopolyploidy or the separation of the Va and Vc component genomes in the same nucleus before fusion, as known for Rosaceae species. Time estimation indicates that grapevine component genomes may have fused about 60 mya, having had at least 40-60 mya to evolve independently. Chromosome number variation in the Vitaceae and related families, and the gap between the time of eudicot radiation and the age of Vitaceae fossils, are accounted for by our hypothesis.
  106. Klochendler, A., Weinberg-Corem, N., Moran, M., Swisa, A., Pochet, N., Savova, V., Vikeså, J., et al. (2012). A transgenic mouse marking live replicating cells reveals in vivo transcriptional program of proliferation. DEVELOPMENTAL CELL, 23(4), 681–690.
    Most adult mammalian tissues are quiescent, with rare cell divisions serving to maintain homeostasis. At present, the isolation and study of replicating cells from their in vivo niche typically involves immunostaining for intracellular markers of proliferation, causing the loss of sensitive biological material. We describe a transgenic mouse strain, expressing a CyclinB1-GFP fusion reporter, that marks replicating cells in the S/G2/M phases of the cell cycle. Using flow cytometry, we isolate live replicating cells from the liver and compare their transcriptome to that of quiescent cells to reveal gene expression programs associated with cell proliferation in vivo. We find that replicating hepatocytes have reduced expression of genes characteristic of liver differentiation. This reporter system provides a powerful platform for gene expression and metabolic and functional studies of replicating cells in their in vivo niche.
  107. Moreau, H., Verhelst, B., Couloux, A., Derelle, E., Rombauts, S., Grimsley, N., Van Bel, M., et al. (2012). Gene functionalities and genome structure in Bathycoccus prasinos reflect cellular specializations at the base of the green lineage. GENOME BIOLOGY, 13(8).
    Background: Bathycoccus prasinos is an extremely small cosmopolitan marine green alga whose cells are covered with intricate spider's web patterned scales that develop within the Golgi cisternae before their transport to the cell surface. The objective of this work is to sequence and analyze its genome, and to present a comparative analysis with other known genomes of the green lineage. Research: Its small genome of 15 Mb consists of 19 chromosomes and lacks transposons. Although 70% of all B. prasinos genes share similarities with other Viridiplantae genes, up to 428 genes were probably acquired by horizontal gene transfer, mainly from other eukaryotes. Two chromosomes, one big and one small, are atypical, an unusual synapomorphic feature within the Mamiellales. Genes on these atypical outlier chromosomes show lower GC content and a significant fraction of putative horizontal gene transfer genes. Whereas the small outlier chromosome lacks colinearity with other Mamiellales and contains many unknown genes without homologs in other species, the big outlier shows a higher intron content, increased expression levels and a unique clustering pattern of housekeeping functionalities. Four gene families are highly expanded in B. prasinos, including sialyltransferases, sialidases, ankyrin repeats and zinc ion-binding genes, and we hypothesize that these genes are associated with the process of scale biogenesis. Conclusion: The minimal genomes of the Mamiellophyceae provide a baseline for evolutionary and functional analyses of metabolic processes in green plants.
  108. Cock, J. M., Sterck, L., Ahmed, S., Allen, A. E., Amoutzias, G., Anthouard, V., Artiguenave, F., et al. (2012). The Ectocarpus genome and brown algal genomics: the Ectocarpus Genome Consortium. (G. Piganeau, Ed.) Advances in Botanical Research, 64, 141–184.
    Brown algae are important organisms both because of their key ecological roles in coastal ecosystems and because of the remarkable biological features that they have acquired during their unusual evolutionary history. The recent sequencing of the complete genome of the filamentous brown alga Ectocarpus has provided unprecedented access to the molecular processes that underlie brown algal biology. Analysis of the genome sequence, which exhibits several unusual structural features, identified genes that are predicted to play key roles in several aspects of brown algal metabolism, in the construction of the multicellular bodyplan and in resistance to biotic and abiotic stresses. Information from the genome sequence is currently being used in combination with other genomic, genetic and biochemical tools to further investigate these and other aspects of brown algal biology at the molecular level. Here, we review some of the major discoveries that emerged from the analysis of the Ectocarpus genome sequence, with a particular focus on the unusual genome structure, inferences about brown algal evolution and novel aspects of brown algal metabolism.
  109. Milner, D. A., Jr, Pochet, N., Krupka, M., Williams, C., Seydel, K., Taylor, T., Van de Peer, Y., et al. (2012). Transcriptional profiling of Plasmodium falciparum parasites from patients with severe malaria identifies distinct low vs. high parasitemic clusters. PLOS ONE, 7(7).
    Background: In the past decade, estimates of malaria infections have dropped from 500 million to 225 million per year; likewise, mortality rates have dropped from 3 million to 791,000 per year. However, approximately 90% of these deaths continue to occur in sub-Saharan Africa, and 85% involve children less than 5 years of age. Malaria mortality in children generally results from one or more of the following clinical syndromes: severe anemia, acidosis, and cerebral malaria. Although much is known about the clinical and pathological manifestations of CM, insights into the biology of the malaria parasite, specifically transcription during this manifestation of severe infection, are lacking. Methods and Findings: We collected peripheral blood from children meeting the clinical case definition of cerebral malaria from a cohort in Malawi, examined the patients for the presence or absence of malaria retinopathy, and performed whole genome transcriptional profiling for Plasmodium falciparum using a custom designed Affymetrix array. We identified two distinct physiological states that showed highly significant association with the level of parasitemia. We compared both groups of Malawi expression profiles with our previously acquired ex vivo expression profiles of parasites derived from infected patients with mild disease; a large collection of in vitro Plasmodium falciparum life cycle gene expression profiles; and an extensively annotated compendium of expression data from Saccharomyces cerevisiae. The high parasitemia patient group demonstrated a unique biology with elevated expression of Hrd1, a member of endoplasmic reticulum-associated protein degradation system. Conclusions: The presence of a unique high parasitemia state may be indicative of the parasite biology of the clinically recognized hyperparasitemic severe disease syndrome.
  110. Brown, J. R., Hanna, M., Tesar, B., Werner, L., Pochet, N., Asara, J. M., Wang, Y. E., et al. (2012). Integrative genomic analysis implicates gain of PIK3CA at 3q26 and MYC at 8q24 in chronic lymphocytic leukemia. CLINICAL CANCER RESEARCH, 18(14), 3791–3802.
    Purpose: The disease course of chronic lymphocytic leukemia (CLL) varies significantly within cytogenetic groups. We hypothesized that high-resolution genomic analysis of CLL would identify additional recurrent abnormalities associated with short time-to-first therapy (TTFT). Experimental Design: We undertook high-resolution genomic analysis of 161 prospectively enrolled CLLs using Affymetrix 6.0 SNP arrays, and integrated analysis of this data set with gene expression profiles. Results: Copy number analysis (CNA) of nonprogressive CLL reveals a stable genotype, with a median of only 1 somatic CNA per sample. Progressive CLL with 13q deletion was associated with additional somatic CNAs, and a greater number of CNAs was predictive of TTFT. We identified other recurrent CNAs associated with short TTFT: 8q24 amplification focused on the cancer susceptibility locus near MYC in 3.7%; 3q26 amplifications focused on PIK3CA in 5.6%; and 8p deletions in 5% of patients. Sequencing of MYC further identified somatic mutations in two CLLs. We determined which catalytic subunits of phosphoinositide 3-kinase (PI3K) were in active complex with the p85 regulatory subunit and showed enrichment for the α subunit in three CLLs carrying PIK3CA amplification. Conclusions: Our findings implicate amplifications of 3q26 focused on PIK3CA and 8q24 focused on MYC in CLL.
  111. Vekemans, D., Proost, S., Vanneste, K., Coenen, H., Viaene, T., Ruelens, P., Maere, S., et al. (2012). Gamma paleohexaploidy in the stem lineage of core eudicots: significance for MADS-box gene and species diversification. MOLECULAR BIOLOGY AND EVOLUTION, 29(12), 3793–3806.
    Comparative genome biology has unveiled the polyploid origin of all angiosperms and the role of recurrent polyploidization in the amplification of gene families and the structuring of genomes. Which species share certain ancient polyploidy events, and which do not, is ill defined because of the limited number of sequenced genomes and transcriptomes and their uneven phylogenetic distribution. Previously, it has been suggested that most, but probably not all, of the eudicots have shared an ancient hexaploidy event, referred to as the gamma triplication. In this study, detailed phylogenies of subfamilies of MADS-box genes suggest that the gamma triplication has occurred before the divergence of Gunnerales but after the divergence of Buxales and Trochodendrales. Large-scale phylogenetic and Ks-based approaches on the inflorescence transcriptomes of Gunnera manicata (Gunnerales) and Pachysandra terminalis (Buxales) provide further support for this placement, enabling us to position the gamma triplication in the stem lineage of the core eudicots. This triplication likely initiated the functional diversification of key regulators of reproductive development in the core eudicots, comprising 75% of flowering plants. Although it is possible that the gamma event triggered early core eudicot diversification, our dating estimates suggest that the event occurred early in the stem lineage, well before the rapid speciation of the earliest core eudicot lineages. The evolutionary significance of this paleopolyploidy event may thus rather lie in establishing a species lineage that was resilient to extinction, but with the genomic potential for later diversification. We consider that the traits generated from this potential characterize extant core eudicots both chemically and morphologically.
  112. Sato, S., Tabata, S., Hirakawa, H., Asamizu, E., Shirasawa, K., Isobe, S., Kaneko, T., et al. (2012). The tomato genome sequence provides insights into fleshy fruit evolution. NATURE, 485(7400), 635–641.
    Tomato (Solanum lycopersicum) is a major crop plant and a model system for fruit development. Solanum is one of the largest angiosperm genera(1) and includes annual and perennial plants from diverse habitats. Here we present a high-quality genome sequence of domesticated tomato, a draft sequence of its closest wild relative, Solanum pimpinellifolium(2), and compare them to each other and to the potato genome (Solanum tuberosum). The two tomato genomes show only 0.6% nucleotide divergence and signs of recent admixture, but show more than 8% divergence from potato, with nine large and several smaller inversions. In contrast to Arabidopsis, but similar to soybean, tomato and potato small RNAs map predominantly to gene-rich chromosomal regions, including gene promoters. The Solanum lineage has experienced two consecutive genome triplications: one that is ancient and shared with rosids, and a more recent one. These triplications set the stage for the neofunctionalization of genes controlling fruit characteristics, such as colour and fleshiness.
  113. Van Landeghem, S., Björne, J., Abeel, T., De Baets, B., Salakoski, T., & Van de Peer, Y. (2012). Semantically linking molecular entities in literature through entity relationships. BMC BIOINFORMATICS, 13. Presented at the Conference on BioNLP Shared Task.
    Background: Text mining tools have gained popularity to process the vast amount of available research articles in the biomedical literature. It is crucial that such tools extract information with a sufficient level of detail to be applicable in real life scenarios. Studies of mining non-causal molecular relations contribute to this goal by formally identifying the relations between genes, promoters, complexes and various other molecular entities found in text. More importantly, these studies help to enhance integration of text mining results with database facts. Results: We describe, compare and evaluate two frameworks developed for the prediction of non-causal or 'entity' relations (REL) between gene symbols and domain terms. For the corresponding REL challenge of the BioNLP Shared Task of 2011, these systems ranked first (57.7% F-score) and second (41.6% F-score). In this paper, we investigate the performance discrepancy of 16 percentage points by benchmarking on a related and more extensive dataset, analysing the contribution of both the term detection and relation extraction modules. We further construct a hybrid system combining the two frameworks and experiment with intersection and union combinations, achieving respectively high-precision and high-recall results. Finally, we highlight extremely high-performance results (F-score >90%) obtained for the specific subclass of embedded entity relations that are essential for integrating text mining predictions with database facts. Conclusions: The results from this study will enable us in the near future to annotate semantic relations between molecular entities in the entire scientific literature available through PubMed. The recent release of the EVEX dataset, containing biomolecular event predictions for millions of PubMed articles, is an interesting and exciting opportunity to overlay these entity relations with event predictions on a literature-wide scale.
  114. Brown, J. R., Hanna, M., Tesar, B., Pochet, N., Vartanov, A., Fernandes, S., Werner, L., et al. (2012). Germline copy number variation associated with Mendelian inheritance of CLL in two families. LEUKEMIA, 26(7), 1710–1713.
  115. Proost, S., Fostier, J., De Witte, D., Dhoedt, B., Demeester, P., Van de Peer, Y., & Vandepoele, K. (2012). i-ADHoRe 3.0: fast and sensitive detection of genomic homology in extremely large data sets. NUCLEIC ACIDS RESEARCH, 40(2).
  116. Whitford, R., Fernandez Salina, A., Tejos Ulloa, R., Cuéllar Pérez, A., Kleine-Vehn, J., Vanneste, S., Drozdzecki, A., et al. (2012). GOLVEN secretory peptides regulate auxin carrier turnover during plant gravitropic responses. DEVELOPMENTAL CELL, 22(3), 678–685.
  117. Kano, Y., Björne, J., Ginter, F., Salakoski, T., Buyko, E., Hahn, U., Cohen, K., et al. (2011). U-Compare bio-event meta-service: compatible BioNLP event extraction services. BMC BIOINFORMATICS, 12.
    Background: Bio-molecular event extraction from literature is recognized as an important task of bio text mining and, as such, many relevant systems have been developed and made available during the last decade. While such systems provide useful services individually, there is a need for a meta-service to enable comparison and ensemble of such services, offering optimal solutions for various purposes. Results: We have integrated nine event extraction systems in the U-Compare framework, making them inter-compatible and interoperable with other U-Compare components. The U-Compare event meta-service provides various meta-level features for comparison and ensemble of multiple event extraction systems. Experimental results show that the performance improvements achieved by the ensemble are significant. Conclusions: While individual event extraction systems themselves provide useful features for bio text mining, the U-Compare meta-service is expected to improve the accessibility to the individual systems, and to enable meta-level uses over multiple event extraction systems such as comparison and ensemble.
  118. Proost, S., Pattyn, P., Gerats, T., & Van de Peer, Y. (2011). Journey through the past: 150 million years of plant genome evolution. PLANT JOURNAL, 66(1), 58–65.
  119. Van de Peer, Y. (2011). Genomes: the truth is in there. EMBO REPORTS.
  120. Fostier, J., Proost, S., Dhoedt, B., Saeys, Y., Demeester, P., Van de Peer, Y., & Vandepoele, K. (2011). A greedy, graph-based algorithm for the alignment of multiple homologous gene lists. BIOINFORMATICS, 27(6), 749–756.
  121. Van de Peer, Y. (2011). A mystery unveiled. GENOME BIOLOGY.
    A recent phylogenomic study has provided new evidence for two ancient whole genome duplications in plants, with potential importance for the evolution of seed and flowering plants.
  122. Van Landeghem, S., Ginter, F., Van de Peer, Y., & Salakoski, T. (2011). EVEX: a PubMed-scale resource for homology-based generalization of text mining predictions. Proceedings of the 2011 workshop on biomedical natural language processing (pp. 28–37). Presented at the Workshop on Biomedical Natural Language Processing (ACL-HLT 2011), Association for Computational Linguistics (ACL).
    In comparative genomics, functional annotations are transferred from one organism to another relying on sequence similarity. With more than 20 million citations in PubMed, text mining provides the ideal tool for generating additional large-scale homology-based predictions. To this end, we have refined a recent dataset of biomolecular events extracted from text, and integrated these predictions with records from public gene databases. Accounting for lexical variation of gene symbols, we have implemented a disambiguation algorithm that uniquely links the arguments of 11.2 million biomolecular events to well-defined gene families, providing interesting opportunities for query expansion and hypothesis generation. The resulting MySQL database, including all 19.2 million original events as well as their homology-based variants, is publicly available at http://bionlp.utu.fi/.
  123. Joshi, A., Van de Peer, Y., & Michoel, T. (2011). Structural and functional organization of RNA regulons in the post-transcriptional regulatory network of yeast. NUCLEIC ACIDS RESEARCH, 39(21), 9108–9117.
    Post-transcriptional control of mRNA transcript processing by RNA binding proteins (RBPs) is an important step in the regulation of gene expression and protein production. The post-transcriptional regulatory network is similar in complexity to the transcriptional regulatory network and is thought to be organized in RNA regulons, coherent sets of functionally related mRNAs combinatorially regulated by common RBPs. We integrated genome-wide transcriptional and translational expression data in yeast with large-scale regulatory networks of transcription factor and RBP binding interactions to analyze the functional organization of post-transcriptional regulation and RNA regulons at a system level. We found that post-transcriptional feedback loops and mixed bifan motifs are overrepresented in the integrated regulatory network and control the coordinated translation of RNA regulons, manifested as clusters of functionally related mRNAs which are strongly coexpressed in the translatome data. These translatome clusters are more functionally coherent than transcriptome clusters and are expressed with higher mRNA and protein levels and less noise. Our results show how the post-transcriptional network is intertwined with the transcriptional network to regulate gene expression in a coordinated way and that the integration of heterogeneous genome-wide datasets makes it possible to relate structure to function in regulatory networks at a system level.
  124. Duplessis, S., Cuomo, C. A., Lin, Y.-C., Aerts, A., Tisserant, E., Veneault-Fourrey, C., Joly, D. L., et al. (2011). Obligate biotrophy features unraveled by the genomic analysis of rust fungi. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 108(22), 9166–9171.
    Rust fungi are some of the most devastating pathogens of crop plants. They are obligate biotrophs, which extract nutrients only from living plant tissues and cannot grow apart from their hosts. Their lifestyle has slowed the dissection of molecular mechanisms underlying host invasion and avoidance or suppression of plant innate immunity. We sequenced the 101-Mb genome of Melampsora larici-populina, the causal agent of poplar leaf rust, and the 89-Mb genome of Puccinia graminis f. sp. tritici, the causal agent of wheat and barley stem rust. We then compared the 16,399 predicted proteins of M. larici-populina with the 17,773 predicted proteins of P. graminis f. sp. tritici. Genomic features related to their obligate biotrophic lifestyle include expanded lineage-specific gene families, a large repertoire of effector-like small secreted proteins, impaired nitrogen and sulfur assimilation pathways, and expanded families of amino acid and oligopeptide membrane transporters. The dramatic up-regulation of transcripts coding for small secreted proteins, secreted hydrolytic enzymes, and transporters in planta suggests that they play a role in host infection and nutrient acquisition. Some of these genomic hallmarks are mirrored in the genomes of other microbial eukaryotes that have independently evolved to infect plants, indicating convergent adaptation to a biotrophic existence inside plant cells.
  125. Grbić, M., Van Leeuwen, T., Clark, R. M., Rombauts, S., Rouzé, P., Grbić, V., Osborne, E. J., et al. (2011). The genome of Tetranychus urticae reveals herbivorous pest adaptations. NATURE, 479(7374), 487–492.
    The spider mite Tetranychus urticae is a cosmopolitan agricultural pest with an extensive host plant range and an extreme record of pesticide resistance. Here we present the completely sequenced and annotated spider mite genome, representing the first complete chelicerate genome. At 90 megabases T. urticae has the smallest sequenced arthropod genome. Compared with other arthropods, the spider mite genome shows unique changes in the hormonal environment and organization of the Hox complex, and also reveals evolutionary innovation of silk production. We find strong signatures of polyphagy and detoxification in gene families associated with feeding on different hosts and in new gene families acquired by lateral gene transfer. Deep transcriptome analysis of mites feeding on different plants shows how this pest responds to a changing host environment. The T. urticae genome thus offers new insights into arthropod evolution and plant-herbivore interactions, and provides unique opportunities for developing novel plant protection strategies.
  126. Baele, G., Van de Peer, Y., & Vansteelandt, S. (2011). Context-dependent codon partition models provide significant increases in model fit in atpB and rbcL protein-coding genes. BMC EVOLUTIONARY BIOLOGY, 11.
    Background: Accurate modelling of substitution processes in protein-coding sequences is often hampered by the computational burdens associated with full codon models. Lately, codon partition models have been proposed as a viable alternative, mimicking the substitution behaviour of codon models at a low computational cost. Such codon partition models however impose independent evolution of the different codon positions, which is overly restrictive from a biological point of view. Given that empirical research has provided indications of context-dependent substitution patterns at four-fold degenerate sites, we take those indications into account in this paper. Results: We present so-called context-dependent codon partition models to assess previous empirical claims that the evolution of four-fold degenerate sites is strongly dependent on the composition of its two flanking bases. To this end, we have estimated and compared various existing independent models, codon models, codon partition models and context-dependent codon partition models for the atpB and rbcL genes of the chloroplast genome, which are frequently used in plant systematics. Such context-dependent codon partition models employ a full dependency scheme for four-fold degenerate sites, whilst maintaining the independence assumption for the first and second codon positions. Conclusions: We show that, both in the atpB and rbcL alignments of a collection of land plants, these context-dependent codon partition models significantly improve model fit over existing codon partition models. Using Bayes factors based on thermodynamic integration, we show that in both datasets the same context-dependent codon partition model yields the largest increase in model fit compared to an independent evolutionary model. Context-dependent codon partition models hence perform closer to codon models, which remain the best performing models at a drastically increased computational cost, compared to codon partition models, but remain computationally interesting alternatives to codon models. Finally, we observe that the substitution patterns in both datasets are drastically different, leading to the conclusion that combined analysis of these two genes using a single model may not be advisable from a context-dependent point of view.
  127. Hu, T. T., Pattyn, P., Bakker, E. G., Cao, J., Cheng, J.-F., Clark, R. M., Fahlgren, N., et al. (2011). The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. NATURE GENETICS, 43(5), 476–481.
    We report the 207-Mb genome sequence of the North American Arabidopsis lyrata strain MN47 based on 8.3x dideoxy sequence coverage. We predict 32,670 genes in this outcrossing species compared to the 27,025 genes in the selfing species Arabidopsis thaliana. The much smaller 125-Mb genome of A. thaliana, which diverged from A. lyrata 10 million years ago, likely constitutes the derived state for the family. We found evidence for DNA loss from large-scale rearrangements, but most of the difference in genome size can be attributed to hundreds of thousands of small deletions, mostly in noncoding DNA and transposons. Analysis of deletions and insertions still segregating in A. thaliana indicates that the process of DNA loss is ongoing, suggesting pervasive selection for a smaller genome. The high-quality reference genome sequence for A. lyrata will be an important resource for functional, evolutionary and ecological studies in the genus Arabidopsis.
  128. Chancerel, E., Lepoittevin, C., Le Provost, G., Lin, Y.-C., Jaramillo-Correa, J. P., Eckert, A. J., Wegrzyn, J. L., et al. (2011). Development and implementation of a highly-multiplexed SNP array for genetic mapping in maritime pine and comparative mapping with loblolly pine. BMC GENOMICS, 12.
    Background: Single nucleotide polymorphisms (SNPs) are the most abundant source of genetic variation among individuals of a species. New genotyping technologies allow examining hundreds to thousands of SNPs in a single reaction for a wide range of applications such as genetic diversity analysis, linkage mapping, fine QTL mapping, association studies, marker-assisted or genome-wide selection. In this paper, we evaluated the potential of highly-multiplexed SNP genotyping for genetic mapping in maritime pine (Pinus pinaster Ait.), the main conifer used for commercial plantation in southwestern Europe. Results: We designed a custom GoldenGate assay for 1,536 SNPs detected through the resequencing of gene fragments (707 in vitro SNPs/Indels) and from Sanger-derived Expressed Sequence Tags assembled into a unigene set (829 in silico SNPs/Indels). Offspring from three-generation outbred (G2) and inbred (F2) pedigrees were genotyped. The success rate of the assay was 63.6% and 74.8% for in silico and in vitro SNPs, respectively. A genotyping error rate of 0.4% was further estimated from segregating data of SNPs belonging to the same gene. Overall, 394 SNPs were available for mapping. A total of 287 SNPs were integrated with previously mapped markers in the G2 parental maps, while 179 SNPs were localized on the map generated from the analysis of the F2 progeny. Based on 98 markers segregating in both pedigrees, we were able to generate a consensus map comprising 357 SNPs from 292 different loci. Finally, the analysis of sequence homology between mapped markers and their orthologs in a Pinus taeda linkage map made it possible to align the 12 linkage groups of both species. Conclusions: Our results show that the GoldenGate assay can be used successfully for high-throughput SNP genotyping in maritime pine, a conifer species that has a genome seven times the size of the human genome. This SNP-array will be extended thanks to recent sequencing effort using new generation sequencing technologies and will include SNPs from comparative orthologous sequences that were identified in the present study, providing a wider collection of anchor points for comparative genomics among the conifers.
  129. Young, N. D., Debellé, F., Oldroyd, G. E., Geurts, R., Cannon, S. B., Udvardi, M. K., Benedito, V. A., et al. (2011). The Medicago genome provides insight into the evolution of rhizobial symbioses. NATURE, 480(7378), 520–524.
    Legumes (Fabaceae or Leguminosae) are unique among cultivated plants for their ability to carry out endosymbiotic nitrogen fixation with rhizobial bacteria, a process that takes place in a specialized structure known as the nodule. Legumes belong to one of the two main groups of eurosids, the Fabidae, which includes most species capable of endosymbiotic nitrogen fixation(1). Legumes comprise several evolutionary lineages derived from a common ancestor 60 million years ago (Myr ago). Papilionoids are the largest clade, dating nearly to the origin of legumes and containing most cultivated species(2). Medicago truncatula is a long-established model for the study of legume biology. Here we describe the draft sequence of the M. truncatula euchromatin based on a recently completed BAC assembly supplemented with Illumina shotgun sequence, together capturing ~94% of all M. truncatula genes. A whole-genome duplication (WGD) approximately 58 Myr ago had a major role in shaping the M. truncatula genome and thereby contributed to the evolution of endosymbiotic nitrogen fixation. Subsequent to the WGD, the M. truncatula genome experienced higher levels of rearrangement than two other sequenced legumes, Glycine max and Lotus japonicus. M. truncatula is a close relative of alfalfa (Medicago sativa), a widely cultivated crop with limited genomics tools and complex autotetraploid genetics. As such, the M. truncatula genome sequence provides significant opportunities to expand alfalfa's genomic toolbox.
  130. Audenaert, P., Van Parys, T., Brondel, F., Pickavet, M., Demeester, P., Van de Peer, Y., & Michoel, T. (2011). CyClus3D: a Cytoscape plugin for clustering network motifs in integrated networks. BIOINFORMATICS, 27(11), 1587–1588.
    Network motifs in integrated molecular networks represent functional relationships between distinct data types. They aggregate to form dense topological structures corresponding to functional modules which cannot be detected by traditional graph clustering algorithms. We developed CyClus3D, a Cytoscape plugin for clustering composite three-node network motifs using a 3D spectral clustering algorithm.
  131. Armananzas, R., Saeys, Y., Inza, I., Garcia-Torres, M., Bielza, C., Van de Peer, Y., & Larranaga, P. (2011). Peakbin selection in mass spectrometry data using a consensus approach with estimation of distribution algorithms. IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 8(3), 760–774.
    Progress is continuously being made in the quest for stable biomarkers linked to complex diseases. Mass spectrometers are one of the devices for tackling this problem. The data profiles they produce are noisy and unstable. In these profiles, biomarkers are detected as signal regions (peaks), where control and disease samples behave differently. Mass spectrometry (MS) data generally contain a limited number of samples described by a high number of features. In this work, we present a novel class of evolutionary algorithms, estimation of distribution algorithms (EDA), as an efficient peak selector in this MS domain. There is a trade-off between the reliability of the detected biomarkers and the low number of samples for analysis. For this reason, we introduce a consensus approach, built upon the classical EDA scheme, that improves stability and robustness of the final set of relevant peaks. An entire data workflow is designed to yield unbiased results. Four publicly available MS data sets (two MALDI-TOF and another two SELDI-TOF) are analyzed. The results are compared to the original works, and a new plot (peak frequential plot) for graphically inspecting the relevant peaks is introduced. A complete online supplementary page, which can be found at http://www.sc.ehu.es/ccwbayes/members/ruben/ms, includes extended info and results, in addition to Matlab scripts and references.
  132. Coyne, R. S., Hannick, L., Shanmugam, D., Hostetler, J. B., Brami, D., Joardar, V. S., Johnson, J., et al. (2011). Comparative genomics of the pathogenic ciliate Ichthyophthirius multifiliis, its free-living relatives and a host species provide insights into adoption of a parasitic lifestyle and prospects for disease control. GENOME BIOLOGY, 12(10).
    BACKGROUND: Ichthyophthirius multifiliis, commonly known as Ich, is a highly pathogenic ciliate responsible for 'white spot', a disease causing significant economic losses to the global aquaculture industry. Options for disease control are extremely limited, and Ich's obligate parasitic lifestyle makes experimental studies challenging. Unlike most well-studied protozoan parasites, Ich belongs to a phylum composed primarily of free-living members. Indeed, it is closely related to the model organism Tetrahymena thermophila. Genomic studies represent a promising strategy to reduce the impact of this disease and to understand the evolutionary transition to parasitism. RESULTS: We report the sequencing, assembly and annotation of the Ich macronuclear genome. Compared with its free-living relative T. thermophila, the Ich genome is reduced approximately two-fold in length and gene density and three-fold in gene content. We analyzed in detail several gene classes with diverse functions in behavior, cellular function and host immunogenicity, including protein kinases, membrane transporters, proteases, surface antigens and cytoskeletal components and regulators. We also mapped by orthology Ich's metabolic pathways in comparison with other ciliates and a potential host organism, the zebrafish Danio rerio. CONCLUSIONS: Knowledge of the complete protein-coding and metabolic potential of Ich opens avenues for rational testing of therapeutic drugs that target functions essential to this parasite but not to its fish hosts. Also, a catalog of surface protein-encoding genes will facilitate development of more effective vaccines. The potential to use T. thermophila as a surrogate model offers promise toward controlling 'white spot' disease and understanding the adaptation to a parasitic lifestyle.
  133. Movahedi, S., Van de Peer, Y., & Vandepoele, K. (2011). Comparative network analysis reveals that tissue specificity and gene function are important factors influencing the mode of expression evolution in Arabidopsis and rice. PLANT PHYSIOLOGY, 156(3), 1316–1330.
    Microarray experiments have yielded massive amounts of expression information measured under various conditions for the model species Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa). Expression compendia grouping multiple experiments make it possible to define correlated gene expression patterns within one species and to study how expression has evolved between species. We developed a robust framework to measure expression context conservation (ECC) and found, by analyzing 4,630 pairs of orthologous Arabidopsis and rice genes, that 77% showed conserved coexpression. Examples of nonconserved ECC categories suggested a link between regulatory evolution and environmental adaptations and included genes involved in signal transduction, response to different abiotic stresses, and hormone stimuli. To identify genomic features that influence expression evolution, we analyzed the relationship between ECC, tissue specificity, and protein evolution. Tissue-specific genes showed higher expression conservation compared with broadly expressed genes but were fast evolving at the protein level. No significant correlation was found between protein and expression evolution, implying that both modes of gene evolution are not strongly coupled in plants. By integration of cis-regulatory elements, many ECC conserved genes were significantly enriched for shared DNA motifs, hinting at the conservation of ancestral regulatory interactions in both model species. Surprisingly, for several tissue-specific genes, patterns of concerted network evolution were observed, unveiling conserved coexpression in the absence of conservation of tissue specificity. These findings demonstrate that orthologs inferred through sequence similarity in many cases do not share similar biological functions and highlight the importance of incorporating expression information when comparing genes across species.
  134. Michoel, T., Joshi, A., Nachtergaele, B., & Van de Peer, Y. (2011). Enrichment and aggregation of topological motifs are independent organizational principles of integrated interaction networks. MOLECULAR BIOSYSTEMS, 7(10), 2769–2778.
    Topological network motifs represent functional relationships within and between regulatory and protein-protein interaction networks. Enriched motifs often aggregate into self-contained units forming functional modules. Theoretical models for network evolution by duplication-divergence mechanisms and for network topology by hierarchical scale-free networks have suggested a one-to-one relation between network motif enrichment and aggregation, but this relation has never been tested quantitatively in real biological interaction networks. Here we introduce a novel method for assessing the statistical significance of network motif aggregation and for identifying clusters of overlapping network motifs. Using an integrated network of transcriptional, posttranslational and protein-protein interactions in yeast we show that network motif aggregation reflects a local modularity property which is independent of network motif enrichment. In particular our method identified novel functional network themes for a set of motifs which are not enriched yet aggregate significantly and challenges the conventional view that network motif enrichment is the most basic organizational principle of complex networks.
  135. Van Landeghem, S., De Baets, B., Van de Peer, Y., & Saeys, Y. (2011). High-precision bio-molecular event extraction from text using parallel binary classifiers. COMPUTATIONAL INTELLIGENCE, 27(4), 645–664.
    We have developed a machine learning framework to accurately extract complex genetic interactions from text. Employing type-specific classifiers, this framework processes research articles to extract various biological events. Subsequently, the algorithm identifies regulation events that take other events as arguments, allowing a nested structure of predictions. All predictions are merged into an integrated network, useful for visualization and for deduction of new biological knowledge. In this paper, we discuss several design choices for an event-based extraction framework. These detailed studies help improve on existing systems, which is illustrated by the relative performance gain of 10% of our system compared to the official results in the recent BioNLP'09 Shared Task. Our framework now achieves state-of-the-art performance with 37.43 recall, 54.81 precision and 44.48 F-score. We further present the first study of feature selection for bio-molecular event extraction from text. While producing more cost-effective models, feature selection can also lead to a better insight into the complexity of the challenge. Finally, this paper tries to bridge the gap between theoretical relation extraction from text and experimental work on bio-molecular interactions by discussing interesting opportunities to employ event-based text mining tools for real-life tasks such as hypothesis generation, database curation and knowledge discovery.
  136. Maere, S., & Van de Peer, Y. (2010). Duplicate retention after small- and large-scale duplications. In K. Dittmar & D. Liberles (Eds.), Evolution after gene duplication (pp. 31–56). Hoboken, NJ, USA: John Wiley & Sons.
  137. Van Landeghem, S., Abeel, T., Saeys, Y., & Van de Peer, Y. (2010). Discriminative and informative features for biomolecular text mining with ensemble feature selection. BIOINFORMATICS, 26(18), i554–i560. Presented at the 9th European Conference on Computational Biology.
    Motivation: In the field of biomolecular text mining, black box behavior of machine learning systems currently limits understanding of the true nature of the predictions. However, feature selection (FS) is capable of identifying the most relevant features in any supervised learning setting, providing insight into the specific properties of the classification algorithm. This allows us to build more accurate classifiers while at the same time bridging the gap between the black box behavior and the end-user who has to interpret the results. Results: We show that our FS methodology successfully discards a large fraction of machine-generated features, improving classification performance of state-of-the-art text mining algorithms. Furthermore, we illustrate how FS can be applied to gain understanding in the predictions of a framework for biomolecular event extraction from text. We include numerous examples of highly discriminative features that model either biological reality or common linguistic constructs. Finally, we discuss a number of insights from our FS analyses that will provide the opportunity to considerably improve upon current text mining tools. Availability: The FS algorithms and classifiers are available in Java-ML (http://java-ml.sf.net). The datasets are publicly available from the BioNLP'09 Shared Task web site (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/).
  138. Van Leene, J., Hollunder, J., Eeckhout, D., Persiau, G., Van De Slijke, E., Stals, H., Van Isterdael, G., et al. (2010). Targeted interactomics reveals a complex core cell cycle machinery in Arabidopsis thaliana. MOLECULAR SYSTEMS BIOLOGY, 6.
  139. Abeel, T., Helleputte, T., Van de Peer, Y., Dupont, P., & Saeys, Y. (2010). Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. BIOINFORMATICS, 26(3), 392–398.
    Motivation: Biomarker discovery is an important topic in biomedical applications of computational biology, including applications such as gene and SNP selection from high-dimensional data. Surprisingly, the stability with respect to sampling variation or robustness of such selection processes has received attention only recently. However, robustness of biomarkers is an important issue, as it may greatly influence subsequent biological validations. In addition, a more robust set of markers may strengthen the confidence of an expert in the results of a selection method. Results: Our first contribution is a general framework for the analysis of the robustness of a biomarker selection algorithm. Secondly, we conducted a large-scale analysis of the recently introduced concept of ensemble feature selection, where multiple feature selections are combined in order to increase the robustness of the final set of selected features. We focus on selection methods that are embedded in the estimation of support vector machines (SVMs). SVMs are powerful classification models that have shown state-of-the-art performance on several diagnosis and prognosis tasks on biological data. Their feature selection extensions also offered good results for gene selection tasks. We show that the robustness of SVMs for biomarker discovery can be substantially increased by using ensemble feature selection techniques, while at the same time improving upon classification performances. The proposed methodology is evaluated on four microarray datasets showing increases of up to almost 30% in robustness of the selected biomarkers, along with an improvement of ~15% in classification performance. The stability improvement with ensemble methods is particularly noticeable for small signature sizes (a few tens of genes), which is most relevant for the design of a diagnosis or prognosis model from a gene signature.
  140. Sanchez-Rodriguez, A., Martens, C., Engelen, K., Van de Peer, Y., & Marchal, K. (2010). The potential for pathogenicity was present in the ancestor of the Ascomycete subphylum Pezizomycotina. BMC EVOLUTIONARY BIOLOGY, 10.
  141. Baele, G., Van de Peer, Y., & Vansteelandt, S. (2010). Modelling the ancestral sequence distribution and model frequencies in context-dependent models for primate non-coding sequences. BMC EVOLUTIONARY BIOLOGY, 10.
    Background: Recent approaches for context-dependent evolutionary modelling assume that the evolution of a given site depends upon its ancestor and that ancestor's immediate flanking sites. Because such a dependency pattern cannot be imposed on the root sequence, we consider the use of different orders of Markov chains to model dependence at the ancestral root sequence. Root distributions which are coupled to the context-dependent model across the underlying phylogenetic tree are deemed more realistic than decoupled Markov chain models, as the evolutionary process is responsible for shaping the composition of the ancestral root sequence. Results: We find strong support, in terms of Bayes Factors, for using a second-order Markov chain at the ancestral root sequence along with a context-dependent model throughout the remainder of the phylogenetic tree in an ancestral repeats dataset, and for using a first-order Markov chain at the ancestral root sequence in a pseudogene dataset. Relaxing the assumption of a single context-independent set of independent model frequencies as presented in previous work yields a further drastic increase in model fit. We show that the substitution rates associated with the CpG-methylation-deamination process can be modelled through context-dependent model frequencies and that their accuracy depends on the (order of the) Markov chain imposed at the ancestral root sequence. In addition, we provide evidence that this approach (which assumes that root distribution and evolutionary model are decoupled) outperforms an approach inspired by the work of Arndt et al., where the root distribution is coupled to the evolutionary model. We show that the continuous-time approximation of Hwang and Green has stronger support in terms of Bayes Factors, but the parameter estimates show minimal differences. Conclusions: We show that the combination of a dependency scheme at the ancestral root sequence and a context-dependent evolutionary model across the remainder of the tree allows for accurate estimation of the model's parameters. The different assumptions tested in this manuscript clearly show that designing accurate context-dependent models is a complex process, with many different assumptions that require validation. Further, these assumptions are shown to change across different datasets, making the search for an adequate model for a given dataset quite challenging.
  142. Amoutzias, G., & Van de Peer, Y. (2010). Single-gene and whole-genome duplications and the evolution of protein-protein interaction networks. In G. Caetano-Anollés (Ed.), Evolutionary genomics and systems biology (pp. 413–429). Hoboken, NJ, USA: Wiley-Blackwell.
  143. Huysman, M., Martens, C., Vandepoele, K., Gillard, J., Rayko, E., Heijde, M., Bowler, C., et al. (2010). Genome-wide analysis of the diatom cell cycle unveils a novel type of cyclins involved in environmental signaling. GENOME BIOLOGY, 11(2).
    Background: Despite the enormous importance of diatoms in aquatic ecosystems and their broad industrial potential, little is known about their life cycle control. Diatoms typically inhabit rapidly changing and unstable environments, suggesting that cell cycle regulation in diatoms must have evolved to adequately integrate various environmental signals. The recent genome sequencing of Thalassiosira pseudonana and Phaeodactylum tricornutum allows us to explore the molecular conservation of cell cycle regulation in diatoms. Results: By profile-based annotation of cell cycle genes, counterparts of conserved as well as new regulators were identified in T. pseudonana and P. tricornutum. In particular, the cyclin gene family was found to be expanded extensively compared to that of other eukaryotes and a novel type of cyclins was discovered, the diatom-specific cyclins. We established a synchronization method for P. tricornutum that enabled assignment of the different annotated genes to specific cell cycle phase transitions. The diatom-specific cyclins are predominantly expressed at the G1-to-S transition and some respond to phosphate availability, hinting at a role in connecting cell division to environmental stimuli. Conclusion: The discovery of highly conserved and new cell cycle regulators suggests the evolution of unique control mechanisms for diatom cell division, probably contributing to their ability to adapt and survive under highly fluctuating environmental conditions.
  144. Van de Peer, Y., Maere, S., & Meyer, A. (2010). 2R or not 2R is not the question anymore. NATURE REVIEWS GENETICS, 11(2), 166–166.
  145. Abeel, T., Van Landeghem, S., Morante, R., Van Asch, V., Van de Peer, Y., Daelemans, W., & Saeys, Y. (2010). Highlights of the BioTM 2010 workshop on advances in bio text mining. BMC BIOINFORMATICS.
    This meeting report gives an overview of the keynote lectures, the panel discussion and a selection of the contributed presentations. The workshop was held in Gent, Belgium on May 10-11. It featured a tutorial aimed towards a broad audience of (computational) biologists, (computational) linguists and researchers working purely on text mining.
  146. Bonnet, E., Michoel, T., & Van de Peer, Y. (2010). Prediction of a gene regulatory network linked to prostate cancer from gene expression, microRNA and clinical data. BIOINFORMATICS, 26(18), i638–i644. Presented at the 9th European Conference on Computational Biology.
    Motivation: Cancer is a complex disease, triggered by mutations in multiple genes and pathways. There is a growing interest in the application of systems biology approaches to analyze various types of cancer-related data to understand the overwhelming complexity of changes induced by the disease. Results: We reconstructed a regulatory module network using gene expression, microRNA expression and a clinical parameter, all measured in lymphoblastoid cell lines derived from patients having aggressive or non-aggressive forms of prostate cancer. Our analysis identified several modules enriched in cell cycle-related genes as well as novel functional categories that might be linked to prostate cancer. Almost one-third of the regulators predicted to control the expression levels of the modules are microRNAs. Several of them have already been characterized as causal in various diseases, including cancer. We also predicted novel microRNAs that have never been associated with this type of tumor. Furthermore, the condition-dependent expression of several modules could be linked to the value of a clinical parameter characterizing the aggressiveness of the prostate cancer. Taken together, our results help to shed light on the consequences of aggressive and non-aggressive forms of prostate cancer.
  147. Velasco, R., Zharkikh, A., Affourtit, J., Dhingra, A., Cestaro, A., Kalyanaraman, A., Fontana, P., et al. (2010). The genome of the domesticated apple (Malus x domestica Borkh.). NATURE GENETICS, 42(10), 833–839.
    We report a high-quality draft genome sequence of the domesticated apple (Malus x domestica). We show that a relatively recent (> 50 million years ago) genome-wide duplication (GWD) has resulted in the transition from nine ancestral chromosomes to 17 chromosomes in the Pyreae. Traces of older GWDs partly support the monophyly of the ancestral paleohexaploidy of eudicots. Phylogenetic reconstruction of Pyreae and the genus Malus, relative to major Rosaceae taxa, identified the progenitor of the cultivated apple as M. sieversii. Expansion of gene families reported to be involved in fruit development may explain formation of the pome, a Pyreae-specific false fruit that develops by proliferation of the basal part of the sepals, the receptacle. In apple, a subclade of MADS-box genes, normally involved in flower and fruit development, is expanded to include 15 members, as are other gene families involved in Rosaceae-specific metabolism, such as transport and assimilation of sorbitol.
  148. Bonnet, E., He, Y., Billiau, K., & Van de Peer, Y. (2010). TAPIR, a web server for the prediction of plant microRNA targets, including target mimics. BIOINFORMATICS, 26(12), 1566–1568.
    We present a new web server called TAPIR, designed for the prediction of plant microRNA targets. The server offers the possibility to search for plant miRNA targets using a fast and a precise algorithm. The precise option is much slower but guarantees to find less perfectly paired miRNA-target duplexes. Furthermore, the precise option allows the prediction of target mimics, which are characterized by a miRNA-target duplex having a large loop, making them undetectable by traditional tools.
  149. Saeys, Y., Van Landeghem, S., & Van de Peer, Y. (2010). Event based text mining for integrated network construction. In S. Džeroski, P. Geurts, & J. Rousu (Eds.), JMLR Workshop and Conference Proceedings (Vol. 8, pp. 112–121). Presented at the 3rd International workshop on Machine Learning in Systems Biology (MLSB 2009), Brookline, MA, USA: Microtome Publishing.
    The scientific literature is a rich and challenging data source for research in systems biology, providing numerous interactions between biological entities. Text mining techniques have been increasingly useful to extract such information from the literature in an automatic way, but up to now the main focus of text mining in the systems biology field has been restricted mostly to the discovery of protein-protein interactions. Here, we take this approach one step further, and use machine learning techniques combined with text mining to extract a much wider variety of interactions between biological entities. Each particular interaction type gives rise to a separate network, represented as a graph, all of which can be subsequently combined to yield a so-called integrated network representation. This provides a much broader view on the biological system as a whole, which can then be used in further investigations to analyse specific properties of the network.
  150. Cock, J. M., Sterck, L., Rouzé, P., Scornet, D., Allen, A. E., Amoutzias, G., Anthouard, V., et al. (2010). The Ectocarpus genome and the independent evolution of multicellularity in brown algae. NATURE, 465(7298), 617–621.
    Brown algae (Phaeophyceae) are complex photosynthetic organisms with a very different evolutionary history to green plants, to which they are only distantly related. These seaweeds are the dominant species in rocky coastal ecosystems and they exhibit many interesting adaptations to these, often harsh, environments. Brown algae are also one of only a small number of eukaryotic lineages that have evolved complex multicellularity. We report the 214 million base pair (Mbp) genome sequence of the filamentous seaweed Ectocarpus siliculosus (Dillwyn) Lyngbye, a model organism for brown algae, closely related to the kelps. Genome features such as the presence of an extended set of light-harvesting and pigment biosynthesis genes and new metabolic processes such as halide metabolism help explain the ability of this organism to cope with the highly variable tidal environment. The evolution of multicellularity in this lineage is correlated with the presence of a rich array of signal transduction genes. Of particular interest is the presence of a family of receptor kinases, as the independent evolution of related molecules has been linked with the emergence of multicellularity in both the animal and green plant lineages. The Ectocarpus genome sequence represents an important step towards developing this organism as a model species, providing the possibility to combine genomic and genetic approaches to explore these and other aspects of brown algal biology further.
  151. Bonnet, E., Tatari, M., Joshi, A. M., Michoel, T., Marchal, K., Berx, G., & Van de Peer, Y. (2010). Module network inference from a cancer gene expression data set identifies microRNA regulated modules. PLOS ONE, 5(4).
    Background: MicroRNAs (miRNAs) are small RNAs that recognize and regulate mRNA target genes. Multiple lines of evidence indicate that they are key regulators of numerous critical functions in development and disease, including cancer. However, defining the place and function of miRNAs in complex regulatory networks is not straightforward. Systems approaches, like the inference of a module network from expression data, can help to achieve this goal. Methodology/Principal Findings: During the last decade, much progress has been made in the development of robust and powerful module network inference algorithms. In this study, we analyze and assess experimentally a module network inferred from both miRNA and mRNA expression data, using our recently developed module network inference algorithm based on probabilistic optimization techniques. We show that several miRNAs are predicted as statistically significant regulators for various modules of tightly co-expressed genes. A detailed analysis of three of those modules demonstrates that the specific assignment of miRNAs is functionally coherent and supported by literature. We further designed a set of experiments to test the assignment of miR-200a as the top regulator of a small module of nine genes. The results strongly suggest that miR-200a is regulating the module genes via the transcription factor ZEB1. Interestingly, this module is most likely involved in epithelial homeostasis and its dysregulation might contribute to the malignant process in cancer cells. Conclusions/Significance: Our results show that a robust module network analysis of expression data can provide novel insights into miRNA function in important cellular processes. Such a computational approach, starting from expression data alone, can be helpful in the process of identifying the function of miRNAs by suggesting modules of co-expressed genes in which they play a regulatory role. As shown in this study, those modules can then be tested experimentally to further investigate and refine the function of the miRNA in the regulatory network.
  152. Amoutzias, G., He, Y., Gordon, J., Mossialos, D., Oliver, S. G., & Van de Peer, Y. (2010). Posttranslational regulation impacts the fate of duplicated genes. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 107(7), 2967–2971.
    Gene and genome duplications create novel genetic material on which evolution can work and have therefore been recognized as a major source of innovation for many eukaryotic lineages. Following duplication, the most likely fate is gene loss; however, a considerable fraction of duplicated genes survive. Not all genes have the same probability of survival, but it is not fully understood what evolutionary forces determine the pattern of gene retention. Here, we use genome sequence data as well as large-scale phosphoproteomics data from the baker's yeast Saccharomyces cerevisiae, which underwent a whole-genome duplication ~100 mya, and show that the number of phosphorylation sites on the proteins they encode is a major determinant of gene retention. Protein phosphorylation motifs are short amino acid sequences that are usually embedded within unstructured and rapidly evolving protein regions. Reciprocal loss of those ancestral sites and the gain of new ones are major drivers in the retention of the two surviving duplicates and in their acquisition of distinct functions. This way, small changes in the sequences of unstructured regions in proteins can contribute to the rapid rewiring and adaptation of regulatory networks.
  153. Rehrauer, H., Aquino, C., Gruissem, W., Henz, S. R., Hilson, P., Laubinger, S., Naouar, N., et al. (2010). AGRONOMICS1: A New Resource for Arabidopsis Transcriptome Profiling. PLANT PHYSIOLOGY, 152(2), 487–499.
  154. Martens, C., & Van de Peer, Y. (2010). The hidden duplication past of the plant pathogen Phytophthora and its consequences for infection. BMC GENOMICS, 11.
    Background: Oomycetes of the genus Phytophthora are pathogens that infect a wide range of plant species. For dicot hosts such as tomato, potato and soybean, Phytophthora is even the most important pathogen. Previous analyses of Phytophthora genomes uncovered many genes, large gene families and large genome sizes that can partially be explained by significant repeat expansion patterns. Results: Analysis of the complete genomes of three different Phytophthora species, using a newly developed approach, unveiled a large number of small duplicated blocks, mainly consisting of two or three consecutive genes. Further analysis of these duplicated genes and comparison with the known gene and genome duplication history of ten other eukaryotes including parasites, algae, plants, fungi, vertebrates and invertebrates, suggests that the ancestor of P. infestans, P. sojae and P. ramorum most likely underwent a whole genome duplication (WGD). Genes that have survived in duplicate are mainly genes that are known to be preferentially retained following WGDs, but also genes important for pathogenicity and infection of the different hosts seem to have been retained in excess. As a result, the WGD might have contributed to the evolutionary and pathogenic success of Phytophthora. Conclusions: The fact that we find many small blocks of duplicated genes indicates that the genomes of Phytophthora species have been heavily rearranged following the WGD. Most likely, the high repeat content in these genomes has played an important role in this rearrangement process. As a consequence, the paucity of retained larger duplicated blocks has greatly complicated previous attempts to detect remnants of a large-scale duplication event in Phytophthora. However, as we show here, our newly developed strategy to identify very small duplicated blocks might be a useful approach to uncover ancient polyploidy events, in particular for heavily rearranged genomes.
  155. Baele, G., Van de Peer, Y., & Vansteelandt, S. (2010). Using non-reversible context-dependent evolutionary models to study substitution patterns in primate non-coding sequences. JOURNAL OF MOLECULAR EVOLUTION, 71(1), 34–50.
    We discuss the importance of non-reversible evolutionary models when analyzing context-dependence. Given the inherent non-reversible nature of the well-known CpG-methylation-deamination process in mammalian evolution, non-reversible context-dependent evolutionary models may be well able to accurately model such a process. In particular, the lack of constraints on non-reversible substitution models might allow for more accurate estimation of context-dependent substitution parameters. To demonstrate this, we have developed different time-homogeneous context-dependent evolutionary models to analyze a large genomic dataset of primate ancestral repeats based on existing independent evolutionary models. We have calculated the difference in model fit for each of these models using Bayes Factors obtained via thermodynamic integration. We find that non-reversible context-dependent models can drastically increase model fit when compared to independent models, on two primate non-coding datasets. We also show that further improvements are possible by clustering similar parameters across contexts.
  156. Fawcett, J., & Van de Peer, Y. (2010). Angiosperm polyploids and their road to evolutionary success. TRENDS IN EVOLUTIONARY BIOLOGY, 2(1), 16–21.
    The abundance of polyploidy among flowering plants has long been recognized, and recent studies have uncovered multiple ancient polyploidization events in the evolutionary history of several angiosperm lineages. Once polyploids are formed they must get locally established and then propagate and survive while adapting to different environments and avoiding extinction. This might ultimately lead to their long-term evolutionary success, where their descendant lineages survive for tens of millions of years. Along this road to evolutionary success, polyploids must overcome several obstacles, to which several genetic and ecological factors are likely to contribute. One recurrent observation, based on present-day polyploids, has been the high frequency of polyploids in harsh environments. Also, recent studies proposed that the success of certain ancient polyploids might be linked to periods of climatic change. Although we are still in the early stages of unraveling the factors that resulted in the long-term evolutionary success of ancient polyploids, the advances in genomic sequencing and molecular dating methods promise to enhance our understanding. It, therefore, seems timely to review our current knowledge of what determines the success of polyploids. Here, we discuss especially how harsh conditions or periods of climatic change might affect the rate of formation, establishment, persistence and long-term evolutionary success of polyploids in angiosperms.
  157. Van Landeghem, S., Pyysalo, S., Ohta, T., & Van de Peer, Y. (2010). Integration of static relations to enhance event extraction from text. Proceedings of the 2010 workshop on biomedical natural language processing (pp. 144–152). Presented at the 2010 Workshop on Biomedical Natural Language Processing (ACL 2010), Association for Computational Linguistics (ACL).
  158. Joshi, A. M., Van Parys, T., Van de Peer, Y., & Michoel, T. (2010). Characterizing regulatory path motifs in integrated networks using perturbational data. GENOME BIOLOGY, 11(3).
    We introduce Pathicular (http://bioinformatics.psb.ugent.be/software/details/Pathicular), a Cytoscape plugin for studying the cellular response to perturbations of transcription factors by integrating perturbational expression data with transcriptional, protein-protein and phosphorylation networks. Pathicular searches for 'regulatory path motifs', short paths in the integrated physical networks which occur significantly more often than expected between transcription factors and their targets in the perturbational data. A case study in Saccharomyces cerevisiae identifies eight regulatory path motifs and demonstrates their biological significance.
  159. Fawcett, J., Maere, S., & Van de Peer, Y. (2009). Plants with double genomes might have had a better chance to survive the Cretaceous-Tertiary extinction event. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 106(14), 5737–5742.
    Most flowering plants have been shown to be ancient polyploids that have undergone one or more whole genome duplications early in their evolution. Furthermore, many different plant lineages seem to have experienced an additional, more recent genome duplication. Starting from paralogous genes lying in duplicated segments or identified in large expressed sequence tag collections, we dated these youngest duplication events through penalized likelihood phylogenetic tree inference. We show that a majority of these independent genome duplications are clustered in time and seem to coincide with the Cretaceous-Tertiary (KT) boundary. The KT extinction event is the most recent mass extinction caused by one or more catastrophic events such as a massive asteroid impact and/or increased volcanic activity. These events are believed to have generated global wildfires and dust clouds that cut off sunlight during long periods of time resulting in the extinction of approximately 60% of plant species, as well as a majority of animals, including dinosaurs. Recent studies suggest that polyploid species can have a higher adaptability and increased tolerance to different environmental conditions. We propose that polyploidization may have contributed to the survival and propagation of several plant lineages during or following the KT extinction event. Due to advantages such as altered gene expression leading to hybrid vigor and an increased set of genes and alleles available for selection, polyploid plants might have been better able to adapt to the drastically changed environment 65 million years ago.
  160. Worden, A. Z., Lee, J.-H., Mock, T., Rouzé, P., Simmons, M. P., Aerts, A. L., Allen, A. E., et al. (2009). Green evolution and dynamic adaptations revealed by genomes of the marine picoeukaryotes Micromonas. SCIENCE, 324(5924), 268–272.
    Picoeukaryotes are a taxonomically diverse group of organisms less than 2 micrometers in diameter. Photosynthetic marine picoeukaryotes in the genus Micromonas thrive in ecosystems ranging from tropical to polar and could serve as sentinel organisms for biogeochemical fluxes of modern oceans during climate change. These broadly distributed primary producers belong to an anciently diverged sister clade to land plants. Although Micromonas isolates have high 18S ribosomal RNA gene identity, we found that genomes from two isolates shared only 90% of their predicted genes. Their independent evolutionary paths were emphasized by distinct riboswitch arrangements as well as the discovery of intronic repeat elements in one isolate, and in metagenomic data, but not in other genomes. Divergence appears to have been facilitated by selection and acquisition processes that actively shape the repertoire of genes that are mutually exclusive between the two isolates differently than the core genes. Analyses of the Micromonas genomes offer valuable insights into ecological differentiation and the dynamic nature of early plant evolution.
  161. Van Landeghem, S., Saeys, Y., De Baets, B., & Van de Peer, Y. (2009). Analyzing text in search of bio-molecular events: a high-precision machine learning framework. Proceedings of the workshop on BioNLP : shared task (pp. 128–136). Presented at the Natural Language Processing in Biomedicine (BioNLP) NAACL 2009 Workshop, Association for Computational Linguistics (ACL).
    The BioNLP'09 Shared Task on Event Extraction is a challenge which concerns the detection of bio-molecular events from text. In this paper, we present a detailed account of the challenges encountered during the construction of a machine learning framework for participation in this task. We have focused our work mainly around the filtering of false positives, creating a high-precision extraction method. We have tested techniques such as SVMs, feature selection and various filters for data pre- and post-processing, and report on the influence on performance for each of them. To detect negation and speculation in text, we describe a custom-made rule-based system which is simple in design, but effective in performance.
  162. Vermeirssen, Vanessa, Joshi, A. M., Michoel, T., Bonnet, E., Casneuf, T., & Van de Peer, Y. (2009). Transcription regulatory networks in Caenorhabditis elegans inferred through reverse-engineering of gene expression profiles constitute biological hypotheses for metazoan development. MOLECULAR BIOSYSTEMS, 5(12), 1817–1830.
    Differential gene expression governs the development, function and pathology of multicellular organisms. Transcription regulatory networks study differential gene expression at a systems level by mapping the interactions between regulatory proteins and target genes. While microarray transcription profiles are the most abundant data for gene expression, it remains challenging to correctly infer the underlying transcription regulatory networks. The reverse-engineering algorithm LeMoNe (learning module networks) uses gene expression profiles to extract ensemble transcription regulatory networks of coexpression modules and their prioritized regulators. Here we apply LeMoNe to a compendium of microarray studies of the worm Caenorhabditis elegans. We obtain 248 modules with a regulation program for 5020 genes and 426 regulators and a total of 24 012 predicted transcription regulatory interactions. Through GO enrichment analysis, comparison with the gene-gene association network WormNet and integration of other biological data, we show that LeMoNe identifies functionally coherent coexpression modules and prioritizes regulators that relate to similar biological processes as the module genes. Furthermore, we can predict new functional relationships for uncharacterized genes and regulators. Based on modules involved in molting, meiosis and oogenesis, ciliated sensory neurons and mitochondrial metabolism, we illustrate the value of LeMoNe as a biological hypothesis generator for differential gene expression in greater detail. In conclusion, through reverse-engineering of C. elegans expression data, we obtained transcription regulatory networks that can provide further insight into metazoan development.
  163. Van de Peer, Y., Maere, S., & Meyer, A. (2009). The evolutionary significance of ancient genome duplications. Nature Reviews Genetics, 10(10), 725–732.
    Many organisms are currently polyploid, or have a polyploid ancestry and now have secondarily 'diploidized' genomes. This finding is surprising because retained whole-genome duplications (WGDs) are exceedingly rare, suggesting that polyploidy is usually an evolutionary dead end. We argue that ancient genome doublings could probably have survived only under very specific conditions, but that, whenever established, they might have had a pronounced impact on species diversification, and led to an increase in biological complexity and the origin of evolutionary novelties.
  164. Dittami, S., Scornet, D., Petit, J.-L., Ségurens, B., Da Silva, C., Corre, E., Dondrup, M., et al. (2009). Global expression analysis of the brown alga Ectocarpus siliculosus (Phaeophyceae) reveals large-scale reprogramming of the transcriptome in response to abiotic stress. Genome Biology, 10, R66.1–R66.20.
    Background: Brown algae (Phaeophyceae) are phylogenetically distant from red and green algae and an important component of the coastal ecosystem. They have developed unique mechanisms that allow them to inhabit the intertidal zone, an environment with high levels of abiotic stress. Ectocarpus siliculosus is being established as a genetic and genomic model for the brown algal lineage, but little is known about its response to abiotic stress. Results: Here we examine the transcriptomic changes that occur during the short term acclimation of E. siliculosus to three different abiotic stress conditions (hyposaline, hypersaline and oxidative stress). Our results show that almost 70% of the expressed genes are regulated in response to at least one of these stressors. Although there are several common elements with terrestrial plants, such as repression of growth-related genes, switching from primary production to protein and nutrient recycling processes, and induction of genes involved in vesicular trafficking, many of the stress-regulated genes are either not known to respond to stress in other organisms or have been found exclusively in E. siliculosus. Conclusions: This first large-scale transcriptomic study of a brown alga demonstrates that, unlike terrestrial plants, E. siliculosus undergoes extensive reprogramming of its transcriptome during the acclimation to mild abiotic stress. We identify several new genes and pathways with a putative function in the stress response and thus pave the way for more detailed investigations of the mechanisms underlying the stress tolerance of brown algae.
  165. Piganeau, Gwenael, Vandepoele, K., Gourbière, S., Van de Peer, Y., & Moreau, H. (2009). Unravelling cis-Regulatory Elements in the Genome of the Smallest Photosynthetic Eukaryote: Phylogenetic Footprinting in Ostreococcus. Journal of Molecular Evolution, 69(3), 249–259.
    We used a phylogenetic footprinting approach, adapted to high levels of divergence, to estimate the level of constraint in intergenic regions of the extremely gene dense Ostreococcus algae genomes (Chlorophyta, Prasinophyceae). We first benchmarked our method against the Saccharomyces sensu stricto genome data and found that the proportion of conserved non-coding sites was consistent with those obtained with methods using calibration by the neutral substitution rate. We then applied our method to the complete genomes of Ostreococcus tauri and O. lucimarinus, which are the most divergent species from the same genus sequenced so far. We found that 77% of intergenic regions in Ostreococcus still contain some phylogenetic footprints, as compared to 88% for Saccharomyces, corresponding to an average rate of constraint on intergenic regions of 17% and 30%, respectively. A comparison with some known functional cis-regulatory elements enabled us to investigate whether some transcriptional regulatory pathways were conserved throughout the green lineage. Strikingly, the size of the phylogenetic footprints depends on the orientation of neighboring genes, and appears to be genus-specific. In Ostreococcus, 5' intergenic regions contain four times more conserved sites than 3' intergenic regions, whereas in yeast a higher frequency of constrained sites in intergenic regions between genes on the same DNA strand suggests a higher frequency of bidirectional regulatory elements. The phylogenetic footprinting approach can be used despite high levels of divergence in the ultrasmall Ostreococcus algae, to decipher the structure of constrained regulatory motifs, and identify putative regulatory pathways conserved within the green lineage.
  166. Abeel, T., Van de Peer, Y., & Saeys, Y. (2009). Java-ML: a machine learning library. JOURNAL OF MACHINE LEARNING RESEARCH, 10, 931–934.
    Java-ML is a collection of machine learning and data mining algorithms, which aims to be a readily usable and easily extensible API for both software developers and research scientists. The interfaces for each type of algorithm are kept simple and algorithms strictly follow their respective interface. Comparing different classifiers or clustering algorithms is therefore straightforward, and implementing new algorithms is also easy. The implementations of the algorithms are clearly written, properly documented and can thus be used as a reference. The library is written in Java and is available from http://java-ml.sourceforge.net/ under the GNU GPL license.
  167. De Bodt, Stefanie, Proost, S., Vandepoele, K., Rouzé, P., & Van de Peer, Y. (2009). Predicting protein-protein interactions in Arabidopsis thaliana through integration of orthology, gene ontology and co-expression. BMC Genomics, 10(288), 1–15.
    Background: Large-scale identification of the interrelationships between different components of the cell, such as the interactions between proteins, has recently gained great interest. However, unraveling large-scale protein-protein interaction maps is laborious and expensive. Moreover, assessing the reliability of the interactions can be cumbersome. Results: In this study, we have developed a computational method that exploits the existing knowledge on protein-protein interactions in diverse species through orthologous relations on the one hand, and functional association data on the other hand to predict and filter protein-protein interactions in Arabidopsis thaliana. A highly reliable set of protein-protein interactions is predicted through this integrative approach making use of existing protein-protein interaction data from yeast, human, C. elegans and D. melanogaster. Localization, biological process, and co-expression data are used as powerful indicators for protein-protein interactions. The functional repertoire of the identified interactome reveals interactions between proteins functioning in well-conserved as well as plant-specific biological processes. We observe that although common mechanisms (e.g. actin polymerization) and components (e.g. ARPs, actin-related proteins) exist between different lineages, they are active in specific processes such as growth, cancer metastasis and trichome development in yeast, human and Arabidopsis, respectively. Conclusion: We conclude that the integration of orthology with functional association data is adequate to predict protein-protein interactions. Through this approach, a high number of novel protein-protein interactions with diverse biological roles is discovered. Overall, we have predicted a reliable set of protein-protein interactions suitable for further computational as well as experimental analyses.
  168. Kernbach, S., Hamann, H., Stradner, J., Thenius, R., Schmickl, T., Crailsheim, K., van Rossum, A. C., et al. (2009). On adaptive self-organization in artificial robot organisms. 2009 Computation world : future computing, service computation, cognitive, adaptive, content, patterns conference (pp. 33–43). Presented at the 2009 Computation World : Future computing, service computation, cognitive, adaptive, content, patterns conference, New York, NY, USA: IEEE.
    Self-organization in natural systems demonstrates very reliable and scalable collective behavior without using any central elements. When providing collective robotic systems with self-organizing principles, we are facing new problems of making self-organization purposeful, self-adapting to changing environments and faster, in order to meet requirements from a technical perspective. This paper describes on-going work of creating such an artificial self-organization within artificial robot organisms, performed in the framework of several European projects.
  169. Joshi, A. M., De Smet, R., Marchal, K., Van de Peer, Y., & Michoel, T. (2009). Module networks revisited: computational assessment and prioritization of model predictions. BIOINFORMATICS, 25(4), 490–496.
    Motivation: The solution of high-dimensional inference and prediction problems in computational biology is almost always a compromise between mathematical theory and practical constraints, such as limited computational resources. As time progresses, computational power increases but well-established inference methods often remain locked in their initial suboptimal solution. Results: We revisit the approach of Segal et al. to infer regulatory modules and their condition-specific regulators from gene expression data. In contrast to their direct optimization-based solution, we use a more representative centroid-like solution extracted from an ensemble of possible statistical models to explain the data. The ensemble method automatically selects a subset of most informative genes and builds a quantitatively better model for them. Genes which cluster together in the majority of models produce functionally more coherent modules. Regulators which are consistently assigned to a module are more often supported by literature, but a single model always contains many regulator assignments not supported by the ensemble. Reliably detecting condition-specific or combinatorial regulation is particularly hard in a single optimum but can be achieved using ensemble averaging.
  170. Van de Peer, Y., Fawcett, J., Proost, S., Sterck, L., & Vandepoele, K. (2009). The flowering world: a tale of duplications. TRENDS IN PLANT SCIENCE, 14(12), 680–688.
    Flowering plants contain many genes, most of which were created during the past 200 or so million years through small- and large-scale duplications. Paleo-polyploidy events, in particular, have been the subject of much recent research. There is a growing consensus that one or more genome doubling or merging events occurred early during the evolution of the flowering plants, and that many lineages have since undergone additional, independent and more recent duplication events. Here, we review the difficulties in determining the number of genome duplications and discuss how the completion of some additional genome sequences of species occupying key phylogenetic positions has led to a better understanding of the timing of certain duplication events. This is important if we want to demonstrate the significance of genome duplications for the evolution and radiation of (different groups of) flowering plants.
  171. Abeel, T., Van de Peer, Y., & Saeys, Y. (2009). Toward a gold standard for promoter prediction evaluation. BIOINFORMATICS, 25(12), I313–I320. Presented at the Joint conference of Intelligent Systems for Molecular Biology and the European conference on Computational Biology.
    Motivation: Promoter prediction is an important task in genome annotation projects, and during the past years many new promoter prediction programs (PPPs) have emerged. However, many of these programs are compared inadequately to other programs. In most cases, only a small portion of the genome is used to evaluate the program, which is not a realistic setting for whole genome annotation projects. In addition, a common evaluation design to properly compare PPPs is still lacking. Results: We present a large-scale benchmarking study of 17 state-of-the-art PPPs. A multi-faceted evaluation strategy is proposed that can be used as a gold standard for promoter prediction evaluation, allowing authors of promoter prediction software to compare their method to existing methods in a proper way. This evaluation strategy is subsequently used to compare the chosen promoter predictors, and an in-depth analysis on predictive performance, promoter class specificity, overlap between predictors and positional bias of the predictions is conducted.
  172. Baele, Guy, Van de Peer, Y., & Vansteelandt, S. (2009). Efficient context-dependent model building based on clustering posterior distributions for non-coding sequences. BMC Evolutionary Biology, 9, 87.1–87.23.
    Background: Many recent studies that relax the assumption of independent evolution of sites have done so at the expense of a drastic increase in the number of substitution parameters. While additional parameters cannot be avoided to model context-dependent evolution, a large increase in model dimensionality is only justified when accompanied with careful model-building strategies that guard against overfitting. An increased dimensionality leads to increases in numerical computations of the models, increased convergence times in Bayesian Markov chain Monte Carlo algorithms and even more tedious Bayes Factor calculations. Results: We have developed two model-search algorithms which reduce the number of Bayes Factor calculations by clustering posterior densities to decide on the equality of substitution behavior in different contexts. The selected model's fit is evaluated using a Bayes Factor, which we calculate via model-switch thermodynamic integration. To reduce computation time and to increase the precision of this integration, we propose to split the calculations over different computers and to appropriately calibrate the individual runs. Using the proposed strategies, we find, in a dataset of primate Ancestral Repeats, that careful modeling of context-dependent evolution may increase model fit considerably and that the combination of a context-dependent model with the assumption of varying rates across sites offers even larger improvements in terms of model fit. Using a smaller nuclear SSU rRNA dataset, we show that context-dependence may only become detectable upon applying model-building strategies. Conclusion: While context-dependent evolutionary models can increase the model fit over traditional independent evolutionary models, such complex models will often contain too many parameters. Justification for the added parameters is thus required so that only those parameters that model evolutionary processes previously unaccounted for are added to the evolutionary model. To obtain an optimal balance between the number of parameters in a context-dependent model and the performance in terms of model fit, we have designed two parameter-reduction strategies and we have shown that model fit can be greatly improved by reducing the number of parameters in a context-dependent evolutionary model.
  173. Vandepoele, K., Quimbaya Gomez, M. A., Casneuf, T., De Veylder, L., & Van de Peer, Y. (2009). Unraveling Transcriptional Control in Arabidopsis Using cis-Regulatory Elements and Coexpression Networks. Plant Physiology, 150(2), 535–546.
    Analysis of gene expression data generated by high-throughput microarray transcript profiling experiments has demonstrated that genes with an overall similar expression pattern are often enriched for similar functions. This guilt-by-association principle can be applied to define modular gene programs, identify cis-regulatory elements, or predict gene functions for unknown genes based on their coexpression neighborhood. We evaluated the potential to use Gene Ontology (GO) enrichment of a gene's coexpression neighborhood as a tool to predict its function but found overall low sensitivity scores (13%-34%). This indicates that for many functional categories, coexpression alone performs poorly to infer known biological gene functions. However, integration of cis-regulatory elements shows that 46% of the gene coexpression neighborhoods are enriched for one or more motifs, providing a valuable complementary source to functionally annotate genes. Through the integration of coexpression data, GO annotations, and a set of known cis-regulatory elements combined with a novel set of evolutionarily conserved plant motifs, we could link many genes and motifs to specific biological functions. Application of our coexpression framework extended with cis-regulatory element analysis on transcriptome data from the cell cycle-related transcription factor OBP1 yielded several coexpressed modules associated with specific cis-regulatory elements. Moreover, our analysis strongly suggests a feed-forward regulatory interaction between OBP1 and the E2F pathway. The ATCOECIS resource (http://bioinformatics.psb.ugent.be/ATCOECIS/) makes it possible to query coexpression data and GO and cis-regulatory element annotations and to submit user-defined gene sets for motif analysis, providing an access point to unravel the regulatory code underlying transcriptional control in Arabidopsis (Arabidopsis thaliana).
  174. Mueller, Lukas, Klein Lankhorst, R., Tanksley, S. D., Giovannoni, J. J., White, R., Vrebalov, J., Fei, Z., et al. (2009). A snapshot of the emerging tomato genome sequence. PLANT GENOME, 2(1), 78–92.
    The genome of tomato (Solanum lycopersicum L.) is being sequenced by an international consortium of 10 countries (Korea, China, the United Kingdom, India, the Netherlands, France, Japan, Spain, Italy, and the United States) as part of the larger “International Solanaceae Genome Project (SOL): Systems Approach to Diversity and Adaptation” initiative. The tomato genome sequencing project uses an ordered bacterial artificial chromosome (BAC) approach to generate a high-quality tomato euchromatic genome sequence for use as a reference genome for the Solanaceae and euasterids. Sequence is deposited at GenBank and at the SOL Genomics Network (SGN). Currently, there are around 1000 BACs finished or in progress, representing more than a third of the projected euchromatic portion of the genome. An annotation effort is also underway by the International Tomato Annotation Group. The expected number of genes in the euchromatin is ∼40,000, based on an estimate from a preliminary annotation of 11% of finished sequence. Here, we present this first snapshot of the emerging tomato genome and its annotation, a short comparison with potato (Solanum tuberosum L.) sequence data, and the tools available for researchers to exploit this new resource. In the future, whole-genome shotgun techniques will be combined with the BAC-by-BAC approach to cover the entire tomato genome. The high-quality reference euchromatic tomato sequence is expected to be near completion by 2010.
  175. Van de Peer, Y. (2009). Phylogenetic inference based on distance methods: theory. In P. Lemey, M. Salemi, & A.-M. Vandamme (Eds.), The phylogenetic handbook : a practical approach to phylogenetic analysis and hypothesis testing (pp. 142–160). Cambridge, UK: Cambridge University Press.
  176. Michoel, T., De Smet, R., Joshi, A. M., Marchal, K., & Van de Peer, Y. (2009). Reverse-engineering transcriptional modules from gene expression data. (Gustavo Stolovitzky, P. Kahlem, & A. Califano, Eds.) Annals of the New York Academy of Sciences, 1158, 36–43. Presented at the ENFIN-DREAM Conference on the Assessment of Computational Methods in Systems Biology (DREAM2 Conference).
    "Module networks" are a framework to learn gene regulatory networks from expression data using a probabilistic model in which coregulated genes share the same parameters and conditional distributions. We present a method to infer ensembles of such networks and an averaging procedure to extract the statistically most significant modules and their regulators. We show that the inferred probabilistic models extend beyond the dataset used to learn the models.
  177. Baele, Guy, Bredeche, N., Haasdijk, E., Maere, S., Michiels, N., Van de Peer, Y., Schmickl, T., et al. (2009). Open-ended on-board evolutionary robotics for robot swarms. IEEE Congress on Evolutionary Computation (pp. 1123–1130). Presented at the 2009 IEEE Congress on Evolutionary Computation (CEC 2009), New York, NY, USA: IEEE.
    The SYMBRION project stands at the crossroads of artificial life and evolutionary robotics: a swarm of real robots undergoes online evolution by exchanging information in a decentralized Evolutionary Robotics Scheme. The diffusion of each individual's genotype depends both on its ability to survive in an unknown environment and on its ability to maximize mating opportunities during its lifetime, which suggests an implicit fitness. This paper presents early research and prospective ideas in the context of large-scale swarm robotics projects, focusing on the open-ended evolutionary approach in the SYMBRION project. One key issue of this work is to perform on-board evolution in a spatially distributed population of robots. A real-world experiment is also described which yields important considerations regarding open-ended evolution with real autonomous robots.
  178. Proost, Sebastian, Van Bel, M., Sterck, L., Billiau, K., Van Parys, T., Van de Peer, Y., & Vandepoele, K. (2009). PLAZA: a comparative genomics resource to study gene and genome evolution in plants. PLANT CELL, 21(12), 3718–3731.
    The number of sequenced genomes of representatives within the green lineage is rapidly increasing. Consequently, comparative sequence analysis has significantly altered our view on the complexity of genome organization, gene function, and regulatory pathways. To explore all this genome information, a centralized infrastructure is required where all data generated by different sequencing initiatives is integrated and combined with advanced methods for data mining. Here, we describe PLAZA, an online platform for plant comparative genomics (http://bioinformatics.psb.ugent.be/plaza/). This resource integrates structural and functional annotation of published plant genomes together with a large set of interactive tools to study gene function and gene and genome evolution. Precomputed data sets cover homologous gene families, multiple sequence alignments, phylogenetic trees, intraspecies whole-genome dot plots, and genomic colinearity between species. Through the integration of high confidence Gene Ontology annotations and tree-based orthology between related species, thousands of genes lacking any functional description are functionally annotated. Advanced query systems, as well as multiple interactive visualization tools, are available through a user-friendly and intuitive Web interface. In addition, detailed documentation and tutorials introduce the different tools, while the workbench provides an efficient means to analyze user-defined gene sets through PLAZA's interface. In conclusion, PLAZA provides a comprehensible and up-to-date research environment to aid researchers in the exploration of genome information within the green plant lineage.
  179. De Schutter, Kristof, Lin, Y.-C., Tiels, P., Van Hecke, A., Glinka, S., Weber-Lehmann, J., Rouzé, P., et al. (2009). Genome sequence of the recombinant protein production host Pichia pastoris. NATURE BIOTECHNOLOGY, 27(6), 561–U104.
    The methylotrophic yeast Pichia pastoris is widely used for the production of proteins and as a model organism for studying peroxisomal biogenesis and methanol assimilation. P. pastoris strains capable of human-type N-glycosylation are now available, which increases the utility of this organism for biopharmaceutical production. Despite its biotechnological importance, relatively few genetic tools or engineered strains have been generated for P. pastoris. To facilitate progress in these areas, we present the 9.43 Mbp genomic sequence of the GS115 strain of P. pastoris. We also provide manually curated annotation for its 5,313 protein-coding genes.
  180. Michoel, T., De Smet, R., Joshi, A. M., Van de Peer, Y., & Marchal, K. (2009). Comparative analysis of module-based versus direct methods for reverse-engineering transcriptional regulatory networks. BMC Systems Biology, 3(49), 1–13.
    Background: A myriad of methods to reverse-engineer transcriptional regulatory networks have been developed in recent years. Direct methods directly reconstruct a network of pairwise regulatory interactions while module-based methods predict a set of regulators for modules of coexpressed genes treated as a single unit. To date, there has been no systematic comparison of the relative strengths and weaknesses of both types of methods. Results: We have compared a recently developed module-based algorithm, LeMoNe (Learning Module Networks), to a mutual information based direct algorithm, CLR (Context Likelihood of Relatedness), using benchmark expression data and databases of known transcriptional regulatory interactions for Escherichia coli and Saccharomyces cerevisiae. A global comparison using recall versus precision curves hides the topologically distinct nature of the inferred networks and is not informative about the specific subtasks for which each method is most suited. Analysis of the degree distributions and a regulator specific comparison show that CLR is 'regulator-centric', making true predictions for a higher number of regulators, while LeMoNe is 'target-centric', recovering a higher number of known targets for fewer regulators, with limited overlap in the predicted interactions between both methods. Detailed biological examples in E. coli and S. cerevisiae are used to illustrate these differences and to prove that each method is able to infer parts of the network where the other fails. Biological validation of the inferred networks cautions against over-interpreting recall and precision values computed using incomplete reference networks. Conclusion: Our results indicate that module-based and direct methods retrieve largely distinct parts of the underlying transcriptional regulatory networks. The choice of algorithm should therefore be based on the particular biological problem of interest and not on global metrics which cannot be transferred between organisms. The development of sound statistical methods for integrating the predictions of different reverse-engineering strategies emerges as an important challenge for future research.
  181. Tzika, A. C., Helaers, R., Van de Peer, Y., & Milinkovitch, M. C. (2008). MANTIS: a phylogenetic framework for multi-species genome comparisons. BIOINFORMATICS, 24(2), 151–157.
    Motivation: Practitioners of comparative genomics face huge analytical challenges as whole genome sequences and functional/expression data accumulate. Furthermore, the field would greatly benefit from a better integration of this wealth of data with evolutionary concepts. Results: Here, we present MANTIS, a relational database for the analysis of (i) gains and losses of genes on specific branches of the metazoan phylogeny, (ii) reconstructed genome content of ancestral species and (iii) over- or under-representation of functions/processes and tissue specificity of gained, duplicated and lost genes. MANTIS estimates the most likely positions of gene losses on the true phylogeny using a maximum-likelihood function. A user-friendly interface and an extensive query system allow users to investigate questions pertaining to gene identity, phylogenetic mapping and function/expression parameters.
  182. Amoutzias, G., Van de Peer, Y., & Mossialos, D. (2008). Evolution and taxonomic distribution of nonribosomal peptide and polyketide synthases. FUTURE MICROBIOLOGY, 3(3), 361–370.
    The majority of nonribosomal peptide synthases and type I polyketide synthases are multimodular megasynthases of oligopeptide and polyketide secondary metabolites, respectively. Owing to their multimodular architecture, they synthesize their metabolites in assembly line logic. The ongoing genomic revolution together with the application of computational tools has provided the opportunity to mine the various genomes for these enzymes and identify those organisms that produce many oligopeptide and polyketide metabolites. In addition, scientists have started to comprehend the molecular mechanisms of megasynthase evolution, by duplication, recombination, point mutation and module skipping. This knowledge and computational analyses have been implemented towards predicting the specificity of these megasynthases and the structure of their end products. It is an exciting field, both for gaining deeper insight into their basic molecular mechanisms and exploiting them biotechnologically.
  183. Van Bel, M., Saeys, Y., & Van de Peer, Y. (2008). FunSiP: a modular and extensible classifier for the prediction of functional sites in DNA. BIOINFORMATICS, 24(13), 1532–1533.
    Motivation: Many problems in genome annotation are tackled by using a classification model to predict functional sites such as splice sites, translation start sites or stop codons. Locating the correct position of these sites remains one of the most important but also one of the most difficult issues in the structural annotation of genomes. Most of the software currently in use is written for a very specific problem, thereby limiting the possibilities for reuse. Summary: We developed a software platform that uses a very general approach towards the classification of functional sites in DNA sequences. The program uses an ab initio approach towards the identification of these sites, and extends SpliceMachine, a previously developed splice site predictor that shows state-of-the-art performance for both donor and acceptor splice site recognition in the human and Arabidopsis thaliana genomes.
  184. Van Landeghem, S., Saeys, Y., De Baets, B., & Van de Peer, Y. (2008). Extracting protein-protein interactions from text using rich feature vectors and feature selection. In T. Salakoski, D. Rebholz-Schuhmann, & S. Pyysalo (Eds.), SMBM  ’08 : proceedings of the third symposium on semantic mining in biomedicine (pp. 77–84). Presented at the 3rd International symposium on Semantic Mining in Biomedicine (SMBM 2008), Turku, Finland: Turku Centre for Computer Sciences (TUCS).
    Because of the intrinsic complexity of natural language, automatically extracting accurate information from text remains a challenge. We have applied rich feature vectors derived from dependency graphs to predict protein-protein interactions using machine learning techniques. We present the first extensive analysis of applying feature selection in this domain, and show that it can produce more cost-effective models. For the first time, our technique was also evaluated on several large-scale cross-dataset experiments, which offers a more realistic view on model performance. During benchmarking, we encountered several fundamental problems hindering comparability with other methods. We present a set of practical guidelines to set up a meaningful evaluation. Finally, we have analysed the feature sets from our experiments before and after feature selection, and evaluated the contribution of both lexical and syntactic information to our method. The gained insight will be useful to develop better performing methods in this domain.
  185. Martin, F., Aerts, A., Ahrén, D., Brun, A., Danchin, E., Duchaussoy, F., Gibon, J., et al. (2008). The genome of Laccaria bicolor provides insights into mycorrhizal symbiosis. NATURE, 452(7183), 88–92.
    Mycorrhizal symbioses - the union of roots and soil fungi - are universal in terrestrial ecosystems and may have been fundamental to land colonization by plants(1,2). Boreal, temperate and montane forests all depend on ectomycorrhizae(1). Identification of the primary factors that regulate symbiotic development and metabolic activity will therefore open the door to understanding the role of ectomycorrhizae in plant development and physiology, allowing the full ecological significance of this symbiosis to be explored. Here we report the genome sequence of the ectomycorrhizal basidiomycete Laccaria bicolor (Fig. 1) and highlight gene sets involved in rhizosphere colonization and symbiosis. This 65-megabase genome assembly contains 20,000 predicted protein-encoding genes and a very large number of transposons and repeated sequences. We detected unexpected genomic features, most notably a battery of effector-type small secreted proteins (SSPs) with unknown function, several of which are only expressed in symbiotic tissues. The most highly expressed SSP accumulates in the proliferating hyphae colonizing the host root. The ectomycorrhizae-specific SSPs probably have a decisive role in the establishment of the symbiosis. The unexpected observation that the genome of L. bicolor lacks carbohydrate-active enzymes involved in degradation of plant cell walls, but maintains the ability to degrade non-plant cell wall polysaccharides, reveals the dual saprotrophic and biotrophic lifestyle of the mycorrhizal fungus that enables it to grow within both soil and living plant roots. The predicted gene inventory of the L. bicolor genome, therefore, points to previously unknown mechanisms of symbiosis operating in biotrophic mycorrhizal fungi. The availability of this genome provides an unparalleled opportunity to develop a deeper understanding of the processes by which symbionts interact with plants within their ecosystem to perform vital functions in the carbon and nitrogen cycles that are fundamental to sustainable plant productivity.
  186. John, U., Beszteri, B., Derelle, E., Van de Peer, Y., Read, B., Moreau, H., & Cembella, A. (2008). Novel insights into evolution of protistan polyketide synthases through phylogenomic analysis. PROTIST, 159(1), 21–30.
  187. Robbens, S., Rouzé, P., Cock, J. M., Spring, J., Worden, A. Z., & Van de Peer, Y. (2008). The FTO gene, implicated in human obesity, is found only in vertebrates and marine algae. JOURNAL OF MOLECULAR EVOLUTION, 66(1), 80–84.
    Human obesity is a main cause of morbidity and mortality. Recently, several studies have demonstrated an association between the FTO gene locus and early onset and severe obesity. To date, the FTO gene has only been discovered in vertebrates. We identified FTO homologs in the complete genome sequences of various evolutionarily diverse marine eukaryotic algae, ranging from unicellular photosynthetic picoplankton to a multicellular seaweed. However, FTO homologs appear to be absent from all other completely sequenced genomes of plants, fungi, and invertebrate animals. Although the biological roles of these marine algal FTO homologs are still unknown, these genes will be useful for exploring basic protein features and could hence help unravel the function of the FTO gene in vertebrates and its inferred link with obesity in humans.
  188. Abeel, T., Saeys, Y., & Van de Peer, Y. (2008). ProSOM: core promoter identification in the human genome. In L. Wehenkel, P. Geurts, & R. Marée (Eds.), Benelearn 08 : the annual Belgian-Dutch machine learning conference (pp. 77–78). Presented at the 18th Annual Belgian-Dutch Machine Learning Conference (Benelearn 2008), Liège, Belgium: Université de Liège.
    More and more genomes are being sequenced, and to keep up with the pace of sequencing projects, automated annotation techniques are required. One of the most challenging problems in genome annotation is the identification of the core promoter. Better core promoter prediction can improve genome annotation and can be used to guide experimental work. Comparing the average structural profile of transcribed, promoter and intergenic sequences demonstrates that the core promoter has unique features that cannot be found in other sequences. We show that unsupervised clustering by using self-organizing maps can clearly distinguish between the structural profiles of promoter sequences and other genomic sequences. An implementation of this promoter prediction program, called ProSOM, is available and has been compared with the state-of-the-art.
  189. Vandenbroucke, Korneel, Robbens, S., Vandepoele, K., Inzé, D., Van de Peer, Y., & Van Breusegem, F. (2008). Hydrogen peroxide-induced gene expression across kingdoms: a comparative analysis. MOLECULAR BIOLOGY AND EVOLUTION, 25(3), 507–516.
    Cells react to oxidative stress conditions by launching a defense response through the induction of nuclear gene expression. The advent of microarray technologies allowed monitoring of oxidative stress-dependent changes of transcript levels at a comprehensive and genome-wide scale, resulting in a series of inventories of differentially expressed genes in different organisms. We performed a meta-analysis on hydrogen peroxide (H2O2)-induced gene expression in the cyanobacterium Synechocystis PCC 6803, the yeasts Saccharomyces cerevisiae and Schizosaccharomyces pombe, the land plant Arabidopsis thaliana, and the human HeLa cell line. The H2O2-induced gene expression in both yeast species was highly conserved and more similar to the A. thaliana response than that of the human cell line. Based on the expression characteristics of genuine antioxidant genes, we show that the antioxidant capacity of microorganisms and higher eukaryotes is differentially regulated. Four families of evolutionarily conserved eukaryotic proteins could be identified that were H2O2 responsive across kingdoms: DNAJ domain-containing heat shock proteins, small guanine triphosphate-binding proteins, Ca2+-dependent protein kinases, and ubiquitin-conjugating enzymes.
  190. Martens, Cindy, Vandepoele, K., & Van de Peer, Y. (2008). Whole-genome analysis reveals molecular innovations and evolutionary transitions in chromalveolate species. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 105(9), 3427–3432.
    The chromalveolates form a highly diverse and fascinating assemblage of organisms, ranging from obligatory parasites such as Plasmodium to free-living ciliates and algae such as kelps, diatoms, and dinoflagellates. Many of the species in this monophyletic grouping are of major medical, ecological, and economical importance. Nevertheless, their genome evolution is much less well studied than that of higher plants, animals, or fungi. In the current study, we have analyzed and compared 12 chromalveolate species for which whole-sequence information is available and provide a detailed picture of gene loss and gene gain in the different lineages. As expected, many gene loss and gain events can be directly correlated with the lifestyle and specific adaptations of the organisms studied. For instance, in the obligate intracellular Apicomplexa we observed massive loss of genes that play a role in general basic processes such as amino acid, carbohydrate, and lipid metabolism, reflecting the transition from a free-living to an obligate intracellular lifestyle. In contrast, many gene families show species-specific expansions, such as those in the plant pathogen oomycete Phytophthora that are involved in degrading the plant cell wall polysaccharides to facilitate the pathogen invasion process. In general, chromalveolates show a tremendous difference in genome structure and evolution and in the number of genes they have lost or gained either through duplication or horizontal gene transfer.
  191. Bowler, Chris, Allen, A. E., Badger, J. H., Grimwood, J., Jabbari, K., Kuo, A., Maheswari, U., et al. (2008). The Phaeodactylum genome reveals the evolutionary history of diatom genomes. NATURE, 456(7219), 239–244.
    Diatoms are photosynthetic secondary endosymbionts found throughout marine and freshwater environments, and are believed to be responsible for around one-fifth of the primary productivity on Earth(1,2). The genome sequence of the marine centric diatom Thalassiosira pseudonana was recently reported, revealing a wealth of information about diatom biology(3-5). Here we report the complete genome sequence of the pennate diatom Phaeodactylum tricornutum and compare it with that of T. pseudonana to clarify evolutionary origins, functional significance and ubiquity of these features throughout diatoms. In spite of the fact that the pennate and centric lineages have only been diverging for 90 million years, their genome structures are dramatically different and a substantial fraction of genes (~40%) are not shared by these representatives of the two lineages. Analysis of molecular divergence compared with yeasts and metazoans reveals rapid rates of gene diversification in diatoms. Contributing factors include selective gene family expansions, differential losses and gains of genes and introns, and differential mobilization of transposable elements. Most significantly, we document the presence of hundreds of genes from bacteria. More than 300 of these gene transfers are found in both diatoms, attesting to their ancient origins, and many are likely to provide novel possibilities for metabolite management and for perception of environmental signals. These findings go a long way towards explaining the incredible diversity and success of the diatoms in contemporary oceans.
  192. Van Landeghem, S., Saeys, Y., De Baets, B., & Van de Peer, Y. (2008). Benchmarking machine learning techniques for the extraction of protein-protein interactions from text. In L. Wehenkel, P. Geurts, & R. Marée (Eds.), Benelearn 08 : the annual Belgian-Dutch machine learning conference (pp. 79–80). Presented at the 18th Annual Belgian-Dutch Machine Learning Conference (Benelearn 2008), Liège, Belgium: Université de Liège.
  193. Rensing, S. A., Lang, D., Zimmer, A. D., Terry, A., Salamov, A., Shapiro, H., Nishiyama, T., et al. (2008). The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants. SCIENCE, 319(5859), 64–69.
    We report the draft genome sequence of the model moss Physcomitrella patens and compare its features with those of flowering plants, from which it is separated by more than 400 million years, and unicellular aquatic algae. This comparison reveals genomic changes concomitant with the evolutionary movement to land, including a general increase in gene family complexity; loss of genes associated with aquatic environments (e.g., flagellar arms); acquisition of genes for tolerating terrestrial stresses (e.g., variation in temperature and water availability); and the development of the auxin and abscisic acid signaling pathways for coordinating multicellular growth and dehydration response. The Physcomitrella genome provides a resource for phylogenetic inferences about gene function and for experimental analysis of plant processes through this plant's unique facility for reverse genetics.
  194. Armañanzas, R., Inza, I., Santana, R., Saeys, Y., Flores, J. L., Lozano, J. A., Van de Peer, Y., et al. (2008). A review of estimation of distribution algorithms in bioinformatics. BIODATA MINING, 1.
    Evolutionary search algorithms have become an essential asset in the algorithmic toolbox for solving high-dimensional optimization problems across a broad range of bioinformatics problems. Genetic algorithms, the most well-known and representative evolutionary search technique, have been the subject of the major part of such applications. Estimation of distribution algorithms (EDAs) offer a novel evolutionary paradigm that constitutes a natural and attractive alternative to genetic algorithms. They make use of a probabilistic model, learnt from the promising solutions, to guide the search process. In this paper, we set out a basic taxonomy of EDA techniques, underlining the nature and complexity of the probabilistic model of each EDA variant. We review a set of innovative works that make use of EDA techniques to solve challenging bioinformatics problems, emphasizing the EDA paradigm's potential for further research in this domain.
  195. Baele, Guy, Van de Peer, Y., & Vansteelandt, S. (2008). A model-based approach to study nearest-neighbor influences reveals complex substitution patterns in non-coding sequences. SYSTEMATIC BIOLOGY, 57(5), 675–692.
    In this article, we present a likelihood-based framework for modeling site dependencies. Our approach builds upon standard evolutionary models but incorporates site dependencies across the entire tree by letting the evolutionary parameters in these models depend upon the ancestral states at the neighboring sites. It thus avoids the need for introducing new and high-dimensional evolutionary models for site-dependent evolution. We propose a Markov chain Monte Carlo approach with data augmentation to infer the evolutionary parameters under our model. Although our approach allows for wide-ranging site dependencies, we illustrate its use, in two non-coding datasets, in the case of nearest-neighbor dependencies (i.e., evolution directly depending only upon the immediate flanking sites). The results reveal that the general time-reversible model with nearest-neighbor dependencies substantially improves the fit to the data as compared to the corresponding model with site independence. Using the parameter estimates from our model, we elaborate on the importance of the 5-methylcytosine deamination process (i.e., the CpG effect) and show that this process also depends upon the 5' neighboring base identity. We hint at the possibility of a so-called TpA effect and show that the observed substitution behavior is very complex in the light of dinucleotide estimates. We also discuss the presence of CpG effects in a nuclear small subunit dataset and find significant evidence that evolutionary models incorporating context-dependent effects perform substantially better than independent-site models and in some cases even outperform models that incorporate varying rates across sites.
  196. Saeys, Yvan, Abeel, T., & Van de Peer, Y. (2008). Towards robust feature selection techniques. In L. Wehenkel, P. Geurts, & R. Marée (Eds.), Benelearn 08 : the annual Belgian-Dutch machine learning conference (pp. 45–46). Presented at the 18th Annual Belgian-Dutch Machine Learning Conference (Benelearn 2008), Liège, Belgium: Université de Liège.
  197. Simillion, C., Janssens, K., Sterck, L., & Van de Peer, Y. (2008). i-ADHoRe 2.0: an improved tool to detect degenerated genomic homology using genomic profiles. BIOINFORMATICS, 24(1), 127–128.
    i-ADHoRe is a software tool that combines gene content and gene order information of homologous genomic segments into profiles to detect highly degenerated homology relations within and between genomes. The new version offers, besides a significant increase in performance, several optimizations to the algorithm, most importantly to the profile alignment routine. As a result, the annotations of multiple genomes, or parts thereof, can be fed simultaneously into the program, after which it will report all regions of homology, both within and between genomes.
  198. Foissac, S., Gouzy, J., Rombauts, S., Mathé, C., Amselem, J., Sterck, L., Van de Peer, Y., et al. (2008). Genome annotation in plants and fungi: EuGène as a model platform. CURRENT BIOINFORMATICS, 3(2), 87–97.
    In this era of whole genome sequencing, reliable genome annotations (identification of functional regions) are the cornerstones for many subsequent analyses. Not only is careful annotation important for studying the gene and gene family content of a genome and its host, but also for wide-scale transcriptome and proteome analyses attempting to describe a certain biological process or to get a global picture of a cell's behavior. Although the number of sequenced genomes is increasing thanks to the application of new technologies, genome-wide analyses will critically depend on the quality of the genome annotations. However, the annotation process is more complicated in the plant field than in the animal field because of the limited funding that leads to far fewer experimental data and less annotation expertise. This situation calls for highly automated annotation platforms that can make the best use of all available data, experimental or not. We discuss how the gene prediction (the process of predicting protein gene structures in genomic sequences) research field increasingly shifts from methods that typically exploited one or two types of data to more integrative approaches that simultaneously deal with various experimental, statistical, or other in silico evidence. We illustrate the importance of integrative approaches for producing high-quality automatic annotations of genomes of plants and algae as well as of fungi that live in close association with plants, using the platform EuGène as an example.
  199. Abeel, T., Saeys, Y., Bonnet, E., Rouzé, P., & Van de Peer, Y. (2008). Generic eukaryotic core promoter prediction using structural features of DNA. GENOME RESEARCH, 18(2), 310–323.
    Despite many recent efforts, in silico identification of promoter regions is still in its infancy. However, the accurate identification and delineation of promoter regions is important for several reasons, such as improving genome annotation and devising experiments to study and understand transcriptional regulation. Current methods to identify the core region of promoters require large amounts of high-quality training data and often behave like black box models that output predictions that are difficult to interpret. Here, we present a novel approach for predicting promoters in whole-genome sequences by using large-scale structural properties of DNA. Our technique requires no training, is applicable to many eukaryotic genomes, and performs extremely well in comparison with the best available promoter prediction programs. Moreover, it is fast, simple in design, and has no size constraints, and the results are easily interpretable. We compared our approach with 14 current state-of-the-art implementations using human gene and transcription start site data and analyzed the ENCODE region in more detail. We also validated our method on 12 additional eukaryotic genomes, including vertebrates, invertebrates, plants, fungi, and protists.
  200. Fierro, Ana Carolina, Vandenbussche, F., Engelen, K., Van de Peer, Y., & Marchal, K. (2008). Meta analysis of gene expression data within and across species. CURRENT GENOMICS, 9(8), 525–534.
    Since the second half of the 1990s, a large number of genome-wide analyses have been described that study gene expression at the transcript level. To this end, two major strategies have been adopted, a first one relying on hybridization techniques such as microarrays, and a second one based on sequencing techniques such as serial analysis of gene expression (SAGE), cDNA-AFLP, and analysis based on expressed sequence tags (ESTs). Despite both types of profiling experiments becoming routine techniques in many research groups, their application remains costly and laborious. As a result, the number of conditions profiled in individual studies is still relatively small and usually varies from only two to a few hundred samples for the largest experiments. More and more, scientific journals require the deposit of these high throughput experiments in public databases upon publication. Mining the information present in these databases offers molecular biologists the possibility to view their own small-scale analysis in the light of what is already available. However, so far, the richness of the public information remains largely unexploited. Several obstacles such as the correct association between ESTs and microarray probes with the corresponding gene transcript, the incompleteness and inconsistency in the annotation of experimental conditions, and the lack of standardized experimental protocols to generate gene expression data, all impede the successful mining of these data. Here, we review the potential and difficulties of combining publicly available expression data from EST analyses and microarray experiments, respectively. With examples from literature, we show how meta-analysis of expression profiling experiments can be used to study expression behavior in a single organism or between organisms, across a wide range of experimental conditions. We also provide an overview of the methods and tools that can aid molecular biologists in exploiting these public data.
  201. Amoutzias, G., & Van de Peer, Y. (2008). Together we stand: genes cluster to coordinate regulation. DEVELOPMENTAL CELL.
    Although most eukaryotic genomes lack operons, occasionally clusters of genes are discovered that are related in function. Now, a metabolic operon-like gene cluster has been described in Arabidopsis thaliana that is needed for triterpene synthesis.
  202. Amoutzias, G., Robertson, D. L., Van de Peer, Y., & Oliver, S. G. (2008). Choose your partners: dimerization in eukaryotic transcription factors. TRENDS IN BIOCHEMICAL SCIENCES, 33(5), 220–229.
    In many eukaryotic transcription factor gene families, proteins require a physical interaction with an identical molecule or with another molecule within the same family to form a functional dimer and bind DNA. Depending on the choice of partner and the cellular context, each dimer triggers a sequence of regulatory events that lead to a particular cellular fate, for example, proliferation or differentiation. Recent syntheses of genomic and functional data reveal that partner choice is not random; instead, dimerization specificities, which are strongly linked to the evolution of the protein family, apply. Our focus is on understanding these interaction specificities, their functional consequences and how they evolved. This knowledge is essential for understanding gene regulation and designing a new generation of drugs.
  203. Abeel, T., Saeys, Y., Rouzé, P., & Van de Peer, Y. (2008). ProSOM: core promoter prediction based on unsupervised clustering of DNA physical profiles. BIOINFORMATICS, 24(13), I24–I31. Presented at the 16th ISMB Conference on Intelligent Systems for Molecular Biology.
    Motivation: More and more genomes are being sequenced, and to keep up with the pace of sequencing projects, automated annotation techniques are required. One of the most challenging problems in genome annotation is the identification of the core promoter. Because the identification of the transcription initiation region is such a challenging problem, it is not yet a common practice to integrate transcription start site prediction in genome annotation projects. Nevertheless, better core promoter prediction can improve genome annotation and can be used to guide experimental work. Results: Comparing the average structural profile based on base stacking energy of transcribed, promoter and intergenic sequences demonstrates that the core promoter has unique features that cannot be found in other sequences. We show that unsupervised clustering by using self-organizing maps can clearly distinguish between the structural profiles of promoter sequences and other genomic sequences. An implementation of this promoter prediction program, called ProSOM, is available and has been compared with the state-of-the-art. We propose an objective, accurate and biologically sound validation scheme for core promoter predictors. ProSOM performs at least as well as the software currently available, but our technique is more balanced in terms of the number of predicted sites and the number of false predictions, resulting in a better all-round performance. Additional tests on the ENCODE regions of the human genome show that 98% of all predictions made by ProSOM can be associated with transcriptionally active regions, which demonstrates the high precision.
  204. Joshi, A. M., Van de Peer, Y., & Michoel, T. (2008). Analysis of a Gibbs sampler method for model-based clustering of gene expression data. BIOINFORMATICS, 24(2), 176–183.
    Motivation: Over the last decade, a large variety of clustering algorithms have been developed to detect coregulatory relationships among genes from microarray gene expression data. Model-based clustering approaches have emerged as statistically well-grounded methods, but the properties of these algorithms when applied to large-scale data sets are not always well understood. An in-depth analysis can reveal important insights about the performance of the algorithm, the expected quality of the output clusters, and the possibilities for extracting more relevant information out of a particular data set. Results: We have extended an existing algorithm for model-based clustering of genes to simultaneously cluster genes and conditions, and used three large compendia of gene expression data for Saccharomyces cerevisiae to analyze its properties. The algorithm uses a Bayesian approach and a Gibbs sampling procedure to iteratively update the cluster assignment of each gene and condition. For large-scale data sets, the posterior distribution is strongly peaked on a limited number of equiprobable clusterings. A GO annotation analysis shows that these local maxima are all biologically equally significant, and that simultaneously clustering genes and conditions performs better than only clustering genes and assuming independent conditions. A collection of distinct equivalent clusterings can be summarized as a weighted graph on the set of genes, from which we extract fuzzy, overlapping clusters using a graph spectral method. The cores of these fuzzy clusters contain tight sets of strongly coexpressed genes, while the overlaps exhibit relations between genes showing only partial coexpression.
  205. Saeys, Yvan, Abeel, T., & Van de Peer, Y. (2008). Robust feature selection using ensemble feature selection techniques. In W. Daelemans, B. Goethals, & K. Morik (Eds.), Lecture Notes in Artificial Intelligence (Vol. 5212, pp. 313–325). Presented at the European conference on Principles of Data Mining and Knowledge Discovery, Berlin, Germany: Springer.
    Robustness or stability of feature selection techniques is a topic of recent interest, and is an important issue when selected feature subsets are subsequently analysed by domain experts to gain more insight into the problem modelled. In this work, we investigate the use of ensemble feature selection techniques, where multiple feature selection methods are combined to yield more robust results. We show that these techniques show great promise for high-dimensional domains with small sample sizes, and provide more robust feature subsets than a single feature selection technique. In addition, we also investigate the effect of ensemble feature selection techniques on classification performance, giving rise to a new model selection strategy.
  206. Robbens, S., Derelle, E., Ferraz, C., Wuyts, J., Moreau, H., & Van de Peer, Y. (2007). The complete chloroplast and mitochondrial DNA sequence of Ostreococcus tauri: organelle genomes of the smallest eukaryote are examples of compaction. MOLECULAR BIOLOGY AND EVOLUTION, 24(4), 956–968.
    The complete nucleotide sequence of the mt (mitochondrial) and cp (chloroplast) genomes of the unicellular green alga Ostreococcus tauri has been determined. The mt genome assembles as a circle of 44,237 bp and contains 65 genes. With an overall average length of only 42 bp for the intergenic regions, this is the most gene-dense mt genome of all Chlorophyta. Furthermore, it is characterized by a unique segmental duplication, encompassing 22 genes and covering 44% of the genome. Such a duplication has not been observed before in green algae, although it is also present in the mt genomes of higher plants. The quadripartite cp genome forms a circle of 71,666 bp, containing 86 genes divided over a larger and a smaller single-copy region, separated by 2 inverted repeat sequences. Based on genome size and number of genes, the Ostreococcus cp genome is the smallest known among the green algae. Phylogenetic analyses based on a concatenated alignment of cp, mt, and nuclear genes confirm the position of O. tauri within the Prasinophyceae, an early branch of the Chlorophyta.
  207. Robbens, S., Petersen, J., Brinkmann, H., Rouzé, P., & Van de Peer, Y. (2007). Unique regulation of the Calvin cycle in the ultrasmall green alga Ostreococcus. JOURNAL OF MOLECULAR EVOLUTION, 64(5), 601–604.
    Glyceraldehyde-3-phosphate dehydrogenase (GapAB) and CP12 are two major players in controlling the inactivation of the Calvin cycle in land plants at night. GapB originated from a GapA gene duplication and differs from GapA by the presence of a specific C-terminal extension that was recruited from CP12. While GapA and CP12 are assumed to be generally present in the Plantae (glaucophytes, red and green algae, and plants), up to now GapB was exclusively found in Streptophyta, including the enigmatic green alga Mesostigma viride. However, here we show that two closely related prasinophycean green algae, Ostreococcus tauri and Ostreococcus lucimarinus, also possess a GapB gene, while CP12 is missing. This remarkable finding either antedates the GapA/B gene duplication or indicates a lateral recruitment. Moreover, Ostreococcus is the first case where the crucial CP12 function may be completely replaced by GapB-mediated GapA/B aggregation.
  208. Saeys, Yvan, & Van de Peer, Y. (2007). Enhancing coding potential prediction for short sequences using complementary sequence features and feature selection. In K. Tuyts, R. Westra, Y. Saeys, & A. Nowé (Eds.), Lecture Notes in Bioinformatics (Vol. 4366, pp. 107–118). Presented at the 1st International workshop on Knowledge Discovery and Emergent Complexity in Bioinformatics (KDECB 2006), Berlin, Germany: Springer.
    The identification of coding potential in DNA sequences is of major importance in bioinformatics, where it is often used to assist expert systems that automatically try to recognize genes in genomes. For longer sequences, the identification of coding potential tends to be easier due to a better signal-to-noise ratio, whereas for very short sequences the issue becomes more problematic. In this paper, we present new methods that specifically aim at a better prediction of coding potential in short sequences. To this end, we combine different, complementary sequence features together with a feature selection strategy. Results comparing the new classifiers to state-of-the-art models show that our new approach significantly outperforms the existing methods when applied to short sequences.
  209. Rensing, S. A., Ick, J., Fawcett, J., Lang, D., Zimmer, A., Van de Peer, Y., & Reski, R. (2007). An ancient genome duplication contributed to the abundance of metabolic genes in the moss Physcomitrella patens. BMC EVOLUTIONARY BIOLOGY, 7.
    Background: Analyses of complete genomes and large collections of gene transcripts have shown that most, if not all seed plants have undergone one or more genome duplications in their evolutionary past. Results: In this study, based on a large collection of EST sequences, we provide evidence that the haploid moss Physcomitrella patens is a paleopolyploid as well. Based on the construction of linearized phylogenetic trees we infer the genome duplication to have occurred between 30 and 60 million years ago. Gene Ontology and pathway association of the duplicated genes in P. patens reveal different biases of gene retention compared with seed plants. Conclusion: Metabolic genes seem to have been retained in excess following the genome duplication in P. patens. This might, at least partly, explain the versatility of metabolism, as described for P. patens and other mosses, in comparison to other land plants.
  210. Palenik, B., Grimwood, J., Aerts, A., Rouzé, P., Salamov, A., Putnam, N., Dupont, C., et al. (2007). The tiny eukaryote Ostreococcus provides genomic insights into the paradox of plankton speciation. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 104(18), 7705–7710.
    The smallest known eukaryotes, at ≈1-μm diameter, are Ostreococcus tauri and related species of marine phytoplankton. The genome of Ostreococcus lucimarinus has been completed and compared with that of O. tauri. This comparison reveals surprising differences across orthologous chromosomes in the two species from highly syntenic chromosomes in most cases to chromosomes with almost no similarity. Species divergence in these phytoplankton is occurring through multiple mechanisms acting differently on different chromosomes and likely including acquisition of new genes through horizontal gene transfer. We speculate that this latter process may be involved in altering the cell-surface characteristics of each species. In addition, the genome of O. lucimarinus provides insights into the unique metal metabolism of these organisms, which are predicted to have a large number of selenocysteine-containing proteins. Selenoenzymes are more catalytically active than similar enzymes lacking selenium, and thus the cell may require less of that protein. As reported here, selenoenzymes, novel fusion proteins, and loss of some major protein families including ones associated with chromatin are likely important adaptations for achieving a small cell size.
  211. Michoel, T., Maere, S., Bonnet, E., Joshi, A. M., Saeys, Y., Van den Bulcke, T., Van Leemput, K., et al. (2007). Validating module network learning algorithms using simulated data. BMC BIOINFORMATICS, 8(suppl. 2).
    Background: In recent years, several authors have used probabilistic graphical models to learn expression modules and their regulatory programs from gene expression data. Despite the demonstrated success of such algorithms in uncovering biologically relevant regulatory relations, further developments in the area are hampered by a lack of tools to compare the performance of alternative module network learning strategies. Here, we demonstrate the use of the synthetic data generator SynTReN for the purpose of testing and comparing module network learning algorithms. We introduce a software package for learning module networks, called LeMoNe, which incorporates a novel strategy for learning regulatory programs. Novelties include the use of a bottom-up Bayesian hierarchical clustering to construct the regulatory programs, and the use of a conditional entropy measure to assign regulators to the regulation program nodes. Using SynTReN data, we test the performance of LeMoNe in a completely controlled situation and assess the effect of the methodological changes we made with respect to an existing software package, namely Genomica. Additionally, we assess the effect of various parameters, such as the size of the data set and the amount of noise, on the inference performance. Results: Overall, application of Genomica and LeMoNe to simulated data sets gave comparable results. However, LeMoNe offers some advantages, one of them being that the learning process is considerably faster for larger data sets. Additionally, we show that the location of the regulators in the LeMoNe regulation programs and their conditional entropy may be used to prioritize regulators for functional validation, and that the combination of the bottom-up clustering strategy with the conditional entropy-based assignment of regulators improves the handling of missing or hidden regulators. 
    Conclusion: We show that data simulators such as SynTReN are very well suited for the purpose of developing, testing and improving module network algorithms. We used SynTReN data to develop and test an alternative module network learning strategy, which is incorporated in the software package LeMoNe, and we provide evidence that this alternative strategy has several advantages with respect to existing methods.
  212. Sterck, L., Rombauts, S., Vandepoele, K., Rouzé, P., & Van de Peer, Y. (2007). How many genes are there in plants (... and why are they there)? CURRENT OPINION IN PLANT BIOLOGY, 10(2), 199–203.
    Annotation of the first few complete plant genomes has revealed that plants have many genes. For Arabidopsis, over 26 500 gene loci have been predicted, whereas for rice, the number adds up to 41 000. Recent analysis of the poplar genome suggests more than 45 000 genes, and partial sequence data from Medicago and Lotus also suggest that these plants contain more than 40 000 genes. Nevertheless, estimations suggest that ancestral angiosperms had no more than 12 000-14 000 genes. One explanation for the large increase in gene number during angiosperm evolution is gene duplication. It has been shown previously that the retention of duplicates following small- and large-scale duplication events in plants is substantial. Taking into account the function of genes that have been duplicated, we are now beginning to understand why many plant genes might have been retained, and how their retention might be linked to the typical lifestyle of plants.
  213. Casneuf, Tineke, Van de Peer, Y., & Huber, W. (2007). In situ analysis of cross-hybridisation on microarrays and the inference of expression correlation. BMC BIOINFORMATICS, 8.
    Background: Microarray co-expression signatures are an important tool for studying gene function and relations between genes. In addition to genuine biological co-expression, correlated signals can result from technical deficiencies like hybridization of reporters with off-target transcripts. An approach that is able to distinguish these factors permits the detection of more biologically relevant co-expression signatures. Results: We demonstrate a positive relation between off-target reporter alignment strength and expression correlation in data from oligonucleotide genechips. Furthermore, we describe a method that allows the identification, from their expression data, of individual probe sets affected by off target hybridization. Conclusion: The effects of off-target hybridization on expression correlation coefficients can be substantial, and can be alleviated by more accurate mapping between microarray reporters and the target transcriptome. We recommend attention to the mapping for any microarray analysis of gene expression patterns.
  214. Carlton, J. M., Hirt, R. P., Silva, J. C., Delcher, A. L., Schatz, M., Zhao, Q., Wortman, J. R., et al. (2007). Draft genome sequence of the sexually transmitted pathogen Trichomonas vaginalis. SCIENCE, 315(5809), 207–212.
    We describe the genome sequence of the protist Trichomonas vaginalis, a sexually transmitted human pathogen. Repeats and transposable elements comprise about two-thirds of the ~160-megabase genome, reflecting a recent massive expansion of genetic material. This expansion, in conjunction with the shaping of metabolic pathways that likely transpired through lateral gene transfer from bacteria, and amplification of specific gene families implicated in pathogenesis and phagocytosis of host proteins may exemplify adaptations of the parasite during its transition to a urogenital environment. The genome sequence predicts previously unknown functions for the hydrogenosome, which support a common evolutionary origin of this unusual organelle with mitochondria.
  215. Saeys, Yvan, Abeel, T., Degroeve, S., & Van de Peer, Y. (2007). Translation initiation site prediction on a genomic scale: beauty in simplicity. BIOINFORMATICS, 23(13), i418–i423.
    Motivation: The correct identification of translation initiation sites (TIS) remains a challenging problem for computational methods that automatically try to solve this problem. Furthermore, the lion's share of these computational techniques focuses on the identification of TIS in transcript data. However, in the gene prediction context the identification of TIS occurs on the genomic level, which makes things even harder because at the genome level many more pseudo-TIS occur, resulting in models that achieve a higher number of false positive predictions. Results: In this article, we evaluate the performance of several 'simple' TIS recognition methods at the genomic level, and compare them to state-of-the-art models for TIS prediction in transcript data. We conclude that the simple methods largely outperform the complex ones at the genomic scale, and we propose a new model for TIS recognition at the genome level that combines the strengths of these simple models. The new model obtains a false positive rate of 0.125 at a sensitivity of 0.80 on a well annotated human chromosome (chromosome 21). Detailed analyses show that the model is useful, both on its own and in a simple gene prediction setting.
  216. Velasco, R., Zharkikh, A., Troggio, M., Cartwright, D. A., Cestaro, A., Pruss, D., Pindo, M., et al. (2007). A high quality draft consensus sequence of the genome of a heterozygous grapevine variety. PLOS ONE, 2(12).
    Background. Worldwide, grapes and their derived products have a large market. The cultivated grape species Vitis vinifera has potential to become a model for fruit trees genetics. Like many plant species, it is highly heterozygous, which is an additional challenge to modern whole genome shotgun sequencing. In this paper a high quality draft genome sequence of a cultivated clone of V. vinifera Pinot Noir is presented. Principal Findings. We estimate the genome size of V. vinifera to be 504.6 Mb. Genomic sequences corresponding to 477.1 Mb were assembled in 2,093 metacontigs and 435.1 Mb were anchored to the 19 linkage groups (LGs). The number of predicted genes is 29,585, of which 96.1% were assigned to LGs. This assembly of the grape genome provides candidate genes implicated in traits relevant to grapevine cultivation, such as those influencing wine quality, via secondary metabolites, and those connected with the extreme susceptibility of grape to pathogens. Single nucleotide polymorphism (SNP) distribution was consistent with a diffuse haplotype structure across the genome. Of around 2,000,000 SNPs, 1,751,176 were mapped to chromosomes and one or more of them were identified in 86.7% of anchored genes. The relative age of grape duplicated genes was estimated and this made possible to reveal a relatively recent Vitis-specific large scale duplication event concerning at least 10 chromosomes (duplication not reported before). Conclusions. Sanger shotgun sequencing and highly efficient sequencing by synthesis (SBS), together with dedicated assembly programs, resolved a complex heterozygous genome. A consensus sequence of the genome and a set of mapped marker loci were generated. Homologous chromosomes of Pinot Noir differ by 11.2% of their DNA (hemizygous DNA plus chromosomal gaps). SNP markers are offered as a tool with the potential of introducing a new era in the molecular breeding of grape.
  217. Van de Peer, Y. (2007). The future for plants and plants for the future. GENOME BIOLOGY.
    A report of the 2007 EMBO Conference Series on Plant Molecular Biology ‘From basic genomics to systems biology’, Ghent, Belgium, 2-4 May 2007
  218. Bonet, Isis, García, M. M., Saeys, Y., Van de Peer, Y., & Grau, R. (2007). Predicting human immunodeficiency virus (HIV) drug resistance using recurrent neural networks. In J. Mira & J. Álvarez (Eds.), Lecture Notes in Computer Science (Vol. 4527, pp. 234–243). Presented at the 2nd International work-conference on the Interplay Between Natural and Artificial Computation (IWINAC 2007), Berlin, Germany: Springer.
    Predicting HIV resistance to drugs is one of many problems for which bioinformaticians have implemented and trained machine learning methods, such as neural networks. Predicting HIV resistance would be much easier if we could directly use the three-dimensional (3D) structure of the targeted protein sequences, but unfortunately we rarely have enough structural information available to train a neural network. Furthermore, prediction of the 3D structure of a protein is not straightforward. However, characteristics related to the 3D structure can be used to train a machine learning algorithm as an alternative to take into account the information of the protein folding in the 3D space. Here, starting from this philosophy, we select the amino acid energies as features to predict HIV drug resistance, using a specific topology of a neural network. In this paper, we demonstrate that the amino acid energies are good features to represent the HIV genotype. In addition, it was shown that Bidirectional Recurrent Neural Networks can be used as an efficient classification method for this problem. The prediction performance that was obtained was greater than or at least comparable to results obtained previously. The accuracies vary between 81.3% and 94.7%.
  219. Van Hellemont, R, Blomme, T., Van de Peer, Y., & Marchal, K. (2007). Divergence of regulatory sequences in duplicated fish genes. In J.-N. Volff (Ed.), Gene and protein evolution (Vol. 3, pp. 81–100). Basel, Switzerland: Karger.
  220. Merks, R., Van de Peer, Y., Inzé, D., & Beemster, G. (2007). Canalization without flux sensors: a traveling-wave hypothesis. TRENDS IN PLANT SCIENCE, 12(9), 384–390.
    In 1969, Tsvi Sachs published his seminal hypothesis of vascular development in plants: the canalization hypothesis. A positive feedback loop between the flux of the phytohormone auxin and the cells' auxin transport capacity would canalize auxin progressively into discrete channels, which would then differentiate into vascular tissues. Recent experimental studies confirm the central role of polar auxin flux in plant vasculogenesis, but it is unclear if and by which mechanism plant cells could respond to auxin flux. In this Opinion article, we review auxin perception mechanisms and argue that these respond more likely to auxin concentrations than to auxin flux. We propose an alternative mechanism for polar auxin channeling, which is more consistent with recent molecular observations.
  221. Saeys, Yvan, Rouzé, P., & Van de Peer, Y. (2007). In search of the small ones: improved prediction of short exons in vertebrates, plants, fungi and protists. BIOINFORMATICS, 23(4), 414–420.
    Motivation: Prediction of the coding potential for stretches of DNA is crucial in gene calling and genome annotation, where it is used to identify potential exons and to position their boundaries in conjunction with functional sites, such as splice sites and translation initiation sites. The ability to discriminate between coding and non-coding sequences relates to the structure of coding sequences, which are organized in codons, and by their biased usage. For statistical reasons, the longer the sequences, the easier it is to detect this codon bias. However, in many eukaryotic genomes, where genes harbour many introns, both introns and exons might be small and hard to distinguish based on coding potential. Results: Here, we present novel approaches that specifically aim at a better detection of coding potential in short sequences. The methods use complementary sequence features, combined with identification of which features are relevant in discriminating between coding and non-coding sequences. These newly developed methods are evaluated on different species, representative of four major eukaryotic kingdoms, and extensively compared to state-of-the-art Markov models, which are often used for predicting coding potential. The main conclusions drawn from our analyses are that (1) combining complementary sequence features clearly outperforms current Markov models for coding potential prediction in short sequence fragments, (2) coding potential prediction benefits from length-specific models, and these models are not necessarily the same for different sequence lengths and (3) comparing the results across several species indicates that, although our combined method consistently performs extremely well, there are important differences across genomes.
  222. Bonet, Isis, Saeys, Y., Ábalo, R. G., García, M. M., Sanchez, R., & Van de Peer, Y. (2006). Feature extraction using clustering of protein. (J. F. Martínez Trinidad, J. A. Carrasco Ochoa, & J. Kittler, Eds.) Lecture Notes in Computer Science, 4225, 614–623. Presented at the 11th Iberoamerican conference in Pattern Recognition (CIARP 2006).
    In this paper we investigate the usage of a clustering algorithm as a feature extraction technique to find new features to represent the protein sequence. In particular, our work focuses on the prediction of HIV protease resistance to drugs. We use a biologically motivated similarity function based on the contact energy of the amino acid and the position in the sequence. The performance measure was computed taking into account the clustering reliability and the classification validity. An SVM using 10-fold crossvalidation and the k-means algorithm were used for classification and clustering respectively. The best results were obtained by reducing an initial set of 99 features to a lower dimensional feature set of 36-66 features.
  223. Saeys, Yvan, & Van de Peer, Y. (2006). Enhancing coding potential prediction for short sequences using complementary sequence features and feature selection. Proceedings of the 15th Dutch Belgian machine learning conference (Benelearn 2006) (pp. 105–112). Presented at the 15th Annual Machine Learning Conference of Belgium and the Netherlands (Benelearn 2006), Ghent, Belgium: Ghent University. Faculty of Sciences.
  224. Bonet, Isis, García, M. M., Salazar, S., Sanchez, R., Saeys, Y., Van de Peer, Y., & Grau, R. (2006). Predicting human immunodeficiency virus (HIV) drug resistance using recurrent neural networks. In J. A. Seijas, S.-K. Lin, & M. P. Vázquez Tato (Eds.), Proceedings : November 1-30, 2006. Presented at the 10th International electronic conference on Synthetic Organic Chemistry (ECSOC-10), Basel, Switzerland: MDPI.
  225. Baele, Guy, Raes, J., Van de Peer, Y., & Vansteelandt, S. (2006). An improved statistical method for detecting heterotachy in nucleotide sequences. MOLECULAR BIOLOGY AND EVOLUTION, 23(7), 1397–1405.
    The principle of heterotachy states that the substitution rate of sites in a gene can change through time. In this article, we propose a powerful statistical test to detect sites that evolve according to the process of heterotachy. We apply this test to an alignment of 1289 eukaryotic rRNA molecules to 1) determine how widespread the phenomenon of heterotachy is in ribosomal RNA, 2) to test whether these heterotachous sites are nonrandomly distributed, that is, linked to secondary structure features of ribosomal RNA, and 3) to determine the impact of heterotachous sites on the bootstrap support of monophyletic groupings. Our study revealed that with 21 monophyletic taxa, approximately two-thirds of the sites in the considered set of sequences is heterotachous. Although the detected heterotachous sites do not appear bound to specific structural features of the small subunit rRNA, their presence is shown to have a large beneficial influence on the bootstrap support of monophyletic groups. Using extensive testing, we show that this may not be due to heterotachy itself but merely due to the increased substitution rate at the detected heterotachous sites.
  226. Bonnet, E., Van de Peer, Y., & Rouzé, P. (2006). The small RNA world of plants. NEW PHYTOLOGIST, 171(3), 451–468.
    RNA has many functions in addition to being a simple messenger between the genome and the proteome. Over two decades, several classes of small noncoding RNAs c. 21 nucleotides (nt) long have been uncovered in eukaryotic genomes, which appear to play a central role in diverse and fundamental processes. In plants, small RNA-based mechanisms are involved in genome stability, gene expression and defense. Many of the discoveries in this new 'small RNA world' were made by plant biologists. Here, we discuss the three major classes of small RNAs that are found in the plant kingdom, namely small interfering RNAs, microRNAs, and the recently discovered trans-acting small interfering RNAs. Recent results shed light on the identification, integration and specialization of the different components (Dicer-like, Argonaute, and others) involved in the biogenesis of the different classes of small RNAs in plants. Owing to the development of better experimental and computational methods, an ever increasing number of small noncoding RNAs are uncovered in different plant genomes. In particular the well-studied microRNAs seem to act as key regulators in several different developmental pathways, with a marked preference for transcription factors as targets. In addition, an increasing amount of data suggest that they also play an important role in other mechanisms, such as response to stress or environmental changes.
  227. Saeys, Yvan, & Van de Peer, Y. (2006). Combining signal processing and machine learning techniques for coding potential prediction. Proceedings of the First International Workshop on Bioinformatics Cuba-Flanders 2006.
  228. Saeys, Yvan, Tsiporkova, E., De Baets, B., & Van de Peer, Y. (Eds.). (2006). Annual Machine Learning Conference of Belgium and The Netherlands.
  229. Blomme, T., Vandepoele, K., De Bodt, S., Simillion, C., Maere, S., & Van de Peer, Y. (2006). The gain and loss of genes during 600 million years of vertebrate evolution. GENOME BIOLOGY, 7(5).
    Background: Gene duplication is assumed to have played a crucial role in the evolution of vertebrate organisms. Apart from a continuous mode of duplication, two or three whole genome duplication events have been proposed during the evolution of vertebrates, one or two at the dawn of vertebrate evolution, and an additional one in the fish lineage, not shared with land vertebrates. Here, we have studied gene gain and loss in seven different vertebrate genomes, spanning an evolutionary period of about 600 million years. Results: We show that: first, the majority of duplicated genes in extant vertebrate genomes are ancient and were created at times that coincide with proposed whole genome duplication events; second, there exist significant differences in gene retention for different functional categories of genes between fishes and land vertebrates; third, there seems to be a considerable bias in gene retention of regulatory genes towards the mode of gene duplication (whole genome duplication events compared to smaller-scale events), which is in accordance with the so-called gene balance hypothesis; and fourth, that ancient duplicates that have survived for many hundreds of millions of years can still be lost. Conclusion: Based on phylogenetic analyses, we show that both the mode of duplication and the functional class the duplicated genes belong to have been of major importance for the evolution of the vertebrates. In particular, we provide evidence that massive gene duplication (probably as a consequence of entire genome duplications) at the dawn of vertebrate evolution might have been particularly important for the evolution of complex vertebrates.
  230. Casneuf, Tineke, De Bodt, S., Raes, J., Maere, S., & Van de Peer, Y. (2006). Nonrandom divergence of gene expression following gene and genome duplications in the flowering plant Arabidopsis thaliana. GENOME BIOLOGY, 7(2).
    Background: Genome analyses have revealed that gene duplication in plants is rampant. Furthermore, many of the duplicated genes seem to have been created through ancient genome-wide duplication events. Recently, we have shown that gene loss is strikingly different for large- and small-scale duplication events and highly biased towards the functional class to which a gene belongs. Here, we study the expression divergence of genes that were created during large- and small-scale gene duplication events by means of microarray data and investigate both the influence of the origin (mode of duplication) and the function of the duplicated genes on expression divergence. Results: Duplicates that have been created by large-scale duplication events and that can still be found in duplicated segments have expression patterns that are more correlated than those that were created by small-scale duplications or those that no longer lie in duplicated segments. Moreover, the former tend to have highly redundant or overlapping expression patterns and are mostly expressed in the same tissues, while the latter show asymmetric divergence. In addition, a strong bias in divergence of gene expression was observed towards gene function and the biological process genes are involved in. Conclusion: By using microarray expression data for Arabidopsis thaliana, we show that the mode of duplication, the function of the genes involved, and the time since duplication play important roles in the divergence of gene expression and, therefore, in the functional divergence of genes after duplication.
  231. Saeys, Yvan, Degroeve, S., & Van de Peer, Y. (2006). Feature ranking using an EDA-based wrapper approach. In J. A. Lozano, P. Larrañaga, I. Inza, & E. Bengoetxea (Eds.), Towards a new evolutionary computation : advances in estimation of distribution algorithms (Vol. 192, pp. 243–257). Berlin, Germany: Springer.
  232. Michoel, T., & Van de Peer, Y. (2006). Helicoidal transfer matrix model for inhomogeneous DNA melting. PHYSICAL REVIEW E, 73(1).
    An inhomogeneous helicoidal nearest-neighbor model with continuous degrees of freedom is shown to predict the same DNA melting properties as traditional long-range Ising models, for free DNA molecules in solution, as well as superhelically stressed DNA with a fixed linking number constraint. Without loss of accuracy, the continuous degrees of freedom can be discretized using a minimal number of discretization points, yielding an effective transfer matrix model of modest dimension (d=36). The resulting algorithms to compute DNA melting profiles are both simple and efficient.
  233. Gevers, D., & Van de Peer, Y. (2006). Gene duplicates in vibrios genomes. The biology of vibrios (pp. 76–83). ASM Press.
  234. Derelle, E., Ferraz, C., Rombauts, S., Rouzé, P., Worden, A. Z., Robbens, S., Partensky, F., et al. (2006). Genome analysis of the smallest free-living eukaryote Ostreococcus tauri unveils many unique features. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 103(31), 11647–11652.
    The green lineage is reportedly 1,500 million years old, evolving shortly after the endosymbiosis event that gave rise to early photosynthetic eukaryotes. In this study, we unveil the complete genome sequence of an ancient member of this lineage, the unicellular green alga Ostreococcus tauri (Prasinophyceae). This cosmopolitan marine primary producer is the world's smallest free-living eukaryote known to date. Features likely reflecting optimization of environmentally relevant pathways, including resource acquisition, unusual photosynthesis apparatus, and genes potentially involved in C-4 photosynthesis, were observed, as was downsizing of many gene families. Overall, the 12.56-Mb nuclear genome has an extremely high gene density, in part because of extensive reduction of intergenic regions and other forms of compaction such as gene fusion. However, the genome is structurally complex. It exhibits previously unobserved levels of heterogeneity for a eukaryote. Two chromosomes differ structurally from the other eighteen. Both have a significantly biased G+C content, and, remarkably, they contain the majority of transposable elements. Many chromosome 2 genes also have unique codon usage and splicing, but phylogenetic analysis and composition do not support alien gene origin. In contrast, most chromosome 19 genes show no similarity to green lineage genes and a large number of them are specialized in cell surface processes. Taken together, the complete genome sequence, unusual features, and downsized gene families, make O. tauri an ideal model system for research on eukaryotic genome evolution, including chromosome specialization and green lineage ancestry.
  235. Tuskan, G., DiFazio, S., Jansson, S., Bohlmann, J., Grigoriev, I., Hellsten, U., Putnam, N., et al. (2006). The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). SCIENCE, 313(5793), 1596–1604.
    We report the draft genome of the black cottonwood tree, Populus trichocarpa. Integration of shotgun sequence assembly with genetic mapping enabled chromosome-scale reconstruction of the genome. More than 45,000 putative protein-coding genes were identified. Analysis of the assembled genome revealed a whole-genome duplication event; about 8000 pairs of duplicated genes from that event survived in the Populus genome. A second, older duplication event is indistinguishably coincident with the divergence of the Populus and Arabidopsis lineages. Nucleotide substitution, tandem gene duplication, and gross chromosomal rearrangement appear to proceed substantially more slowly in Populus than in Arabidopsis. Populus has more protein-coding genes than Arabidopsis, ranging on average from 1.4 to 1.6 putative Populus homologs for each Arabidopsis gene. However, the relative frequency of protein domains in the two genomes is similar. Overrepresented exceptions in Populus include genes associated with lignocellulosic wall biosynthesis, meristem development, disease resistance, and metabolite transport.
  236. Vandepoele, K., Casneuf, T., & Van de Peer, Y. (2006). Identification of novel regulatory modules in dicotyledonous plants using expression data and comparative genomics. GENOME BIOLOGY, 7(11).
    Background: Transcriptional regulation plays an important role in the control of many biological processes. Transcription factor binding sites (TFBSs) are the functional elements that determine transcriptional activity and are organized into separable cis-regulatory modules, each defining the cooperation of several transcription factors required for a specific spatio-temporal expression pattern. Consequently, the discovery of novel TFBSs in promoter sequences is an important step to improve our understanding of gene regulation. Results: Here, we applied a detection strategy that combines features of classic motif overrepresentation approaches in co-regulated genes with general comparative footprinting principles for the identification of biologically relevant regulatory elements and modules in Arabidopsis thaliana, a model system for plant biology. In total, we identified 80 TFBSs and 139 regulatory modules, most of which are novel, and primarily consist of two or three regulatory elements that could be linked to different important biological processes, such as protein biosynthesis, cell cycle control, photosynthesis and embryonic development. Moreover, studying the physical properties of some specific regulatory modules revealed that Arabidopsis promoters have a compact nature, with cooperative TFBSs located in close proximity of each other. Conclusion: These results create a starting point to unravel regulatory networks in plants and to study the regulation of biological processes from a systems biology point of view.
  237. De Bodt, Stefanie, Theissen, G., & Van de Peer, Y. (2006). Promoter analysis of MADS-box genes in eudicots through phylogenetic footprinting. MOLECULAR BIOLOGY AND EVOLUTION, 23(6), 1293–1303.
    The MIKC MADS-box gene family has been shaped by extensive gene duplications giving rise to subfamilies of genes with distinct functions and expression patterns. However, within these subfamilies the functional assignment is not that clear-cut, and considerable functional redundancy exists. One way to investigate the diversity in regulation present in these subfamilies is promoter sequence analysis. With the advent of genome sequencing projects, we are now able to exert a comparative analysis of Arabidopsis and poplar promoters of MADS-box genes belonging to the same subfamily. Based on the principle of phylogenetic footprinting, sequences conserved between the promoters of homologous genes are thought to be functional. Here, we have investigated the evolution of MADS-box genes at the promoter level and show that many genes have diverged in their regulatory sequences after duplication and/or speciation. Furthermore, using phylogenetic footprinting, a distinction can be made between redundancy, neo/nonfunctionalization, and subfunctionalization.
  238. Cannon, S. B., Sterck, L., Rombauts, S., Sato, S., Cheung, F., Gouzy, J., Wang, X., et al. (2006). Legume genome evolution viewed through the Medicago truncatula and Lotus japonicus genomes. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 103(40), 14959–14964.
    Genome sequencing of the model legumes, Medicago truncatula and Lotus japonicus, provides an opportunity for large-scale sequence-based comparison of two genomes in the same plant family. Here we report synteny comparisons between these species, including details about chromosome relationships, large-scale synteny blocks, microsynteny within blocks, and genome regions lacking clear correspondence. The Lotus and Medicago genomes share a minimum of 10 large-scale synteny blocks, each with substantial collinearity and frequently extending the length of whole chromosome arms. The proportion of genes syntenic and collinear within each synteny block is relatively homogeneous. Medicago-Lotus comparisons also indicate similar and largely homogeneous gene densities, although gene-containing regions in Mt occupy 20-30% more space than Lj counterparts, primarily because of larger numbers of Mt retrotransposons. Because the interpretation of genome comparisons is complicated by large-scale genome duplications, we describe synteny, synonymous substitutions and phylogenetic analyses to identify and date a probable whole-genome duplication event. There is no direct evidence for any recent large-scale genome duplication in either Medicago or Lotus but instead a duplication predating speciation. Phylogenetic comparisons place this duplication within the Rosid I clade, clearly after the split between legumes and Salicaceae (poplar).
  239. Van de Peer, Y. (2006). When duplicated genes don’t stick to the rules. HEREDITY.
  240. Van den Bulcke, T., Lemmens, K., Van de Peer, Y., & Marchal, K. (2006). Inferring transcriptional networks by mining “omics” data. CURRENT BIOINFORMATICS, 1(3), 301–313.
    Inferring comprehensive regulatory networks from high-throughput data is one of the foremost challenges of modern computational biology. As high-throughput expression profiling experiments have gained common ground in many laboratories, different techniques have been proposed to infer transcriptional regulatory networks from them. Furthermore, with the advent of diverse types of high-throughput data, the research in network inference has received a new impulse. The use of diverse types of data, together with the increasing tendency of building the inference on biologically plausible simplifications, allows a more reliable and more complete description of networks. Here, we discuss how the research focus in the field of network inference is increasingly shifting from methods trying to reconstruct networks from a single data type towards integrative approaches dealing with several data sources simultaneously to infer regulatory modules.
  241. Faes, P., Minnaert, B., Christiaens, M., Bonnet, E., Saeys, Y., Stroobandt, D., & Van de Peer, Y. (2006). A Scalable Hardware Accelerator for Comparing Protein Sequences. Proceedings of the First International Conference on Scalable Information Systems. Hong Kong.
  242. Abeel, T., Saeys, Y., & Van de Peer, Y. (2006). Improved core promoter prediction using ensembles of Support Vector Machines. Proceedings of the 15th Dutch Belgian Machine Learning Conference (Benelearn 2006) (pp. 180–181).
  243. Robbens, S., Rombauts, S., Rouzé, P., Wuyts, J., Saeys, Y., Moreau, H., & Van de Peer, Y. (2005). Genome analysis of the world’s smallest free-living eukaryote Ostreococcus tauri unveils unique genome heterogeneity. Proceedings of the Molecular Biology and Evolution Conference (MBE) 2005.
  244. Florquin, K., Saeys, Y., Degroeve, S., Rouzé, P., & Van de Peer, Y. (2005). Large-scale structural analysis of the core promoter in mammalian and plant genomes. NUCLEIC ACIDS RESEARCH, 33(13), 4255–4264.
    DNA encodes at least two independent levels of functional information. The first level is for encoding proteins and sequence targets for DNA-binding factors, while the second one is contained in the physical and structural properties of the DNA molecule itself. Although the physical and structural properties are ultimately determined by the nucleotide sequence itself, the cell exploits these properties in a way in which the sequence itself plays no role other than to support or facilitate certain spatial structures. In this work, we focus on these structural properties, comparing them between different organisms and assessing their ability to describe the core promoter. We prove the existence of distinct types of core promoters, based on a clustering of their structural profiles. These results indicate that the structural profiles are much conserved within plants (Arabidopsis and rice) and animals (human and mouse), but differ considerably between plants and animals. Furthermore, we demonstrate that these structural profiles can be an alternative way of describing the core promoter, in addition to more classical motif or IUPAC-based approaches. Using the structural profiles as discriminatory elements to separate promoter regions from non-promoter regions, reliable models can be built to identify core-promoter regions using a strictly computational approach.
  245. Florquin, K., Saeys, Y., Degroeve, S., & Van de Peer, Y. (2005). Large-scale structural analysis of the core promoter in mammalian and plant genomes. Proceedings of the 7th International EMBL PhD Symposium, Heidelberg, Germany. Presented at the 7th International EMBL PhD Symposium.
  246. Robbens, S., Khadaroo, B., Camasses, A., Derelle, E., Ferraz, C., Inzé, D., Van de Peer, Y., et al. (2005). Genome-wide analysis of core cell cycle genes in the unicellular green alga Ostreococcus tauri. MOLECULAR BIOLOGY AND EVOLUTION, 22(3), 589–597.
    The cell cycle has been extensively studied in various organisms, and the recent access to an overwhelming amount of genomic data has given birth to a new integrated approach called comparative genomics. Comparing the cell cycle across species shows that its regulation is evolutionarily conserved; the best-known example is the pivotal role of cyclin-dependent kinases in all the eukaryotic lineages hitherto investigated. Interestingly, the molecular network associated with the activity of the CDK-cyclin complexes is also evolutionarily conserved, thus defining a core cell cycle set of genes together with lineage-specific adaptations. In this paper, we describe the core cell cycle genes of Ostreococcus tauri, the smallest free-living eukaryotic cell having a minimal cellular organization with a nucleus, a single chloroplast, and only one mitochondrion. This unicellular marine green alga, which has diverged at the base of the green lineage, shows the minimal yet complete set of core cell cycle genes described to date. It has only one homolog of CDKA, CDKB, CDKD, cyclin A, cyclin B, cyclin D, cyclin H, Cks, Rb, E2F, DP, DEL, Cdc25, and Wee1. We have also added the APC and SCF E3 ligases to the core cell cycle gene set. We discuss the potential of genome-wide analysis in the identification of divergent orthologs of cell cycle genes in different lineages by mining the genomes of evolutionarily important and strategic organisms.
  247. Van de Peer, Y., & Meyer, A. (2005). Large-scale gene and ancient genome duplications. The evolution of the genome (pp. 329–368). Elsevier Academic Press.
  248. Van Hellemont, Ruth, Monsieurs, P., Thijs, G., De Moor, B., Van de Peer, Y., & Marchal, K. (2005). A novel approach to identifying regulatory motifs in distantly related genomes. GENOME BIOLOGY, 6(13).
    Although proven successful in the identification of regulatory motifs, phylogenetic footprinting methods still show some shortcomings. To assess these difficulties, most apparent when applying phylogenetic footprinting to distantly related organisms, we developed a two-step procedure that combines the advantages of sequence alignment and motif detection approaches. The results on well-studied benchmark datasets indicate that the presented method outperforms other methods when the sequences become either too long or too heterogeneous in size.
  249. Vandepoele, K., Vlieghe, K., Florquin, K., Hennig, L., Beemster, G., Gruissem, W., Van de Peer, Y., et al. (2005). Genome-wide identification of potential plant E2F target genes. PLANT PHYSIOLOGY, 139(1), 316–328.
    Entry into the S phase of the cell cycle is controlled by E2F transcription factors that induce the transcription of genes required for cell cycle progression and DNA replication. Although the E2F pathway is highly conserved in higher eukaryotes, only a few E2F target genes have been experimentally validated in plants. We have combined microarray analysis and bioinformatics tools to identify plant E2F-responsive genes. Promoter regions of genes that were induced at the transcriptional level in Arabidopsis (Arabidopsis thaliana) seedlings ectopically expressing genes for the E2Fa and DPa transcription factors were searched for the presence of E2F-binding sites, resulting in the identification of 181 putative E2F target genes. In most cases, the E2F-binding element was located close to the transcription start site, but occasionally could also be localized in the 5' untranslated region. Comparison of our results with available microarray data sets from synchronized cell suspensions revealed that the E2F target genes were expressed almost exclusively during G1 and S phases and activated upon reentry of quiescent cells into the cell cycle. To test the robustness of the data for the Arabidopsis E2F target genes, we also searched for the presence of E2F-cis-acting elements in the promoters of the putative orthologous rice (Oryza sativa) genes. Using this approach, we identified 70 potential conserved plant E2F target genes. These genes encode proteins involved in cell cycle regulation, DNA replication, and chromatin dynamics. In addition, we identified several genes for potentially novel S phase regulatory proteins.
  250. Meyer, Axel, & Van de Peer, Y. (2005). From 2R to 3R: evidence for a fish-specific genome duplication (FSGD). BIOESSAYS, 27(9), 937–945.
  251. Paterson, A. H., Bowers, J. E., Van de Peer, Y., & Vandepoele, K. (2005). Ancient duplication of cereal genomes. NEW PHYTOLOGIST.
  252. Gevers, D., Cohan, F. M., Lawrence, J. G., Spratt, B. G., Coenye, T., Feil, E. J., Stackebrandt, E., et al. (2005). Re-evaluating prokaryotic species. NATURE REVIEWS MICROBIOLOGY, 3(9), 733–739.
    There is no widely accepted concept of species for prokaryotes, and assignment of isolates to species is based on measures of phenotypic or genome similarity. The current methods for defining prokaryotic species are inadequate and incapable of keeping pace with the levels of diversity that are being uncovered in nature. Prokaryotic taxonomy is being influenced by advances in microbial population genetics, ecology and genomics, and by the ease with which sequence data can be obtained. Here, we review the classical approaches to prokaryotic species definition and discuss the current and future impact of multilocus nucleotide-sequence-based approaches to prokaryotic systematics. We also consider the potential, and difficulties, of assigning species status to biologically or ecologically meaningful sequence clusters.
  253. Degroeve, S., Saeys, Y., De Baets, B., Rouzé, P., & Van de Peer, Y. (2005). SpliceMachine: predicting splice sites from high-dimensional local context representations. BIOINFORMATICS, 21(8), 1332–1338.
    Motivation: In this age of complete genome sequencing, finding the location and structure of genes is crucial for further molecular research. The accurate prediction of intron boundaries largely facilitates the correct prediction of gene structure in nuclear genomes. Many tools for localizing these boundaries on DNA sequences have been developed and are available to researchers through the internet. Nevertheless, these tools still make many false positive predictions. Results: This manuscript presents a novel publicly available splice site prediction tool named SpliceMachine that (i) shows state-of-the-art prediction performance on Arabidopsis thaliana and human sequences, (ii) performs a computationally fast annotation and (iii) can be trained by the user on its own data.
  254. Beysen, D., Raes, J., Leroy, B., Lucassen, A., Yates, J., Clayton-Smith, J., Ilyina, H., et al. (2005). Deletions involving long-range conserved nongenic sequences upstream and downstream of FOXL2 as a novel disease-causing mechanism in Blepharophimosis syndrome. AMERICAN JOURNAL OF HUMAN GENETICS, 77(2), 205–218.
    The expression of a gene requires not only a normal coding sequence but also intact regulatory regions, which can be located at large distances from the target genes, as demonstrated for an increasing number of developmental genes. In previous mutation studies of the role of FOXL2 in blepharophimosis syndrome (BPES), we identified intragenic mutations in 70% of our patients. Three translocation breakpoints upstream of FOXL2 in patients with BPES suggested a position effect. Here, we identified novel microdeletions outside of FOXL2 in cases of sporadic and familial BPES. Specifically, four rearrangements, with an overlap of 126 kb, are located 230 kb upstream of FOXL2, telomeric to the reported translocation breakpoints. Moreover, the shortest region of deletion overlap (SRO) contains several conserved nongenic sequences (CNGs) harboring putative transcription-factor binding sites and representing potential long-range cis-regulatory elements. Interestingly, the human region orthologous to the 12-kb sequence deleted in the polled intersex syndrome in goat, which is an animal model for BPES, is contained in this SRO, providing evidence of human-goat conservation of FOXL2 expression and of the mutational mechanism. Surprisingly, in a fifth family with BPES, one rearrangement was found downstream of FOXL2. In addition, we report nine novel rearrangements encompassing FOXL2 that range from partial gene deletions to submicroscopic deletions. Overall, genomic rearrangements encompassing or outside of FOXL2 account for 16% of all molecular defects found in our families with BPES. In summary, this is the first report of extragenic deletions in BPES, providing further evidence of potential long-range cis-regulatory elements regulating FOXL2 expression. It contributes to the enlarging group of developmental diseases caused by defective distant regulation of gene expression. Finally, we demonstrate that CNGs are candidate regions for genomic rearrangements in developmental genes.
  255. De Bodt, Stefanie, Maere, S., & Van de Peer, Y. (2005). Genome duplication and the origin of angiosperms. TRENDS IN ECOLOGY & EVOLUTION.
    Despite intensive research, little is known about the origin of the angiosperms and their rise to ecological dominance during the Early Cretaceous. Based on whole-genome analyses of Arabidopsis thaliana, there is compelling evidence that angiosperms underwent two whole-genome duplication events early during their evolutionary history. Recent studies have shown that these events were crucial for the creation of many important developmental and regulatory genes found in extant angiosperm genomes. Here, we argue that these ancient polyploidy events might have also had an important role in the origin and diversification of the angiosperms.
  256. Vandepoele, K., & Van de Peer, Y. (2005). Exploring the plant transcriptome through phylogenetic profiling. PLANT PHYSIOLOGY, 137(1), 31–42.
    Publicly available protein sequences represent only a small fraction of the full catalog of genes encoded by the genomes of different plants, such as green algae, mosses, gymnosperms, and angiosperms. By contrast, an enormous amount of expressed sequence tags (ESTs) exists for a wide variety of plant species, representing a substantial part of all transcribed plant genes. Integrating protein and EST sequences in comparative and evolutionary analyses is not straightforward because of the heterogeneous nature of both types of sequence data. By combining information from publicly available EST and protein sequences for 32 different plant species, we identified more than 250,000 plant proteins organized in more than 12,000 gene families. Approximately 60% of the proteins are absent from current sequence databases but provide important new information about plant gene families. Analysis of the distribution of gene families over different plant species through phylogenetic profiling reveals interesting insights into plant gene evolution, and identifies species- and lineage-specific gene families, orphan genes, and conserved core genes across the green plant lineage. We counted a similar number of approximately 9,500 gene families in monocotyledonous and eudicotyledonous plants and found strong evidence for the existence of at least 33,700 genes in rice (Oryza sativa). Interestingly, the larger number of genes in rice compared to Arabidopsis (Arabidopsis thaliana) can partially be explained by a larger amount of species-specific single-copy genes and species-specific gene families. In addition, a majority of large gene families, typically containing more than 50 genes, are bigger in rice than Arabidopsis, whereas the opposite seems true for small gene families.
  257. Maere, S., De Bodt, S., Raes, J., Casneuf, T., Van Montagu, M., Kuiper, M., & Van de Peer, Y. (2005). Modeling gene and genome duplications in eukaryotes. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 102(15), 5454–5459.
    Recent analysis of complete eukaryotic genome sequences has revealed that gene duplication has been rampant. Moreover, next to a continuous mode of gene duplication, in many eukaryotic organisms the complete genome has been duplicated in their evolutionary past. Such large-scale gene duplication events have been associated with important evolutionary transitions or major leaps in development and adaptive radiations of species. Here, we present an evolutionary model that simulates the duplication dynamics of genes, considering genome-wide duplication events and a continuous mode of gene duplication. Modeling the evolution of the different functional categories of genes assesses the importance of different duplication events for gene families involved in specific functions or processes. By applying our model to the Arabidopsis genome, for which there is compelling evidence for three whole-genome duplications, we show that gene loss is strikingly different for large-scale and small-scale duplication events and highly biased toward certain functional classes. We provide evidence that some categories of genes were almost exclusively expanded through large-scale gene duplication events. In particular, we show that the three whole-genome duplications in Arabidopsis have been directly responsible for >90% of the increase in transcription factors, signal transducers, and developmental genes in the last 350 million years. Our evolutionary model is widely applicable and can be used to evaluate different assumptions regarding small- or large-scale gene duplication events in eukaryotic genomes.
  258. Sterck, L., Rombauts, S., Jansson, S., Sterky, F., Rouzé, P., & Van de Peer, Y. (2005). EST data suggest that poplar is an ancient polyploid. NEW PHYTOLOGIST, 167(1), 165–170.
    We analysed the publicly available expressed sequence tag (EST) collections for the genus Populus to examine whether evidence can be found for large-scale gene-duplication events in the evolutionary past of this genus. The ESTs were clustered into unigenes for each poplar species examined. Gene families were constructed for all proteins deduced from these unigenes, and K-S dating was performed on all paralogs within a gene family. The fraction of paralogs was then plotted against the K-S values, which resulted in a distribution reflecting the age of duplicated genes in poplar. Sufficient EST data were available for seven different poplar species spanning four of the six sections of the genus Populus. For all these species, there was evidence that a large-scale gene-duplication event had occurred. From our analysis it is clear that all poplar species have shared the same large-scale gene-duplication event, suggesting that this event must have occurred in the ancestor of poplar, or at least very early in the evolution of the Populus genus.
  259. Raes, J., & Van de Peer, Y. (2005). Functional divergence of proteins through frameshift mutations. TRENDS IN GENETICS, 21(8), 428–431.
    Frameshift mutations are generally considered to be deleterious and of little importance for the evolution of novel gene functions. However, by screening an exhaustive set of vertebrate gene families, we found that, when a second transcript encoding the original gene product compensates for this mutation, frameshift mutations can be retained for millions of years and enable new gene functions to be acquired.
  260. Coenye, T., Gevers, D., Van de Peer, Y., Vandamme, P., & Swings, J. (2005). Towards a prokaryotic genomic taxonomy. FEMS MICROBIOLOGY REVIEWS, 29(2), 147–167.
  261. Simillion, C., Vandepoele, K., & Van de Peer, Y. (2004). Recent developments in computational approaches for uncovering genomic homology. BIOESSAYS, 26(11), 1225–1235.
  262. Bonnet, E., Wuyts, J., Rouzé, P., & Van de Peer, Y. (2004). Detection of 91 potential conserved plant microRNAs in Arabidopsis thaliana and Oryza sativa identifies important target genes. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 101(31), 11511–11516.
    MicroRNAs (miRNAs) are an extensive class of tiny RNA molecules that regulate the expression of target genes by means of complementary base pair interactions. Although the first miRNAs were discovered in Caenorhabditis elegans, >300 miRNAs were recently documented in animals and plants, both by cloning methods and computational predictions. We present a genome-wide computational approach to detect miRNA genes in the Arabidopsis thaliana genome. Our method is based on the conservation of short sequences between the genomes of Arabidopsis and rice (Oryza sativa) and on properties of the secondary structure of the miRNA precursor. The method was fine-tuned to take into account plant-specific properties, such as the variable length of the miRNA precursor sequences. In total, 91 potential miRNA genes were identified, of which 58 had at least one nearly perfect match with an Arabidopsis mRNA, constituting the potential targets of those miRNAs. In addition to already known transcription factors involved in plant development, the targets also comprised genes involved in several other cellular processes, such as sulfur assimilation and ubiquitin-dependent protein degradation. These findings considerably broaden the scope of miRNA functions in plants.
  263. Van de Peer, Y. (2004). Tetraodon genome confirms Takifugu findings: most fish are ancient polyploids. GENOME BIOLOGY, 5(12).
    An evolutionary hypothesis suggested by studies of the genome of the tiger pufferfish Takifugu rubripes has now been confirmed by comparison with the genome of a close relative, the spotted green pufferfish Tetraodon nigroviridis. Ray-finned fish underwent a whole-genome duplication some 350 million years ago that might explain their evolutionary success.
  264. Simillion, C., Vandepoele, K., Saeys, Y., & Van de Peer, Y. (2004). Building genomic profiles for uncovering segmental homology in the twilight zone. Belgian Bioinformatics Conference, 4th, Abstracts. Presented at the 4th Belgian Bioinformatics Conference (BBC 2004).
  265. Van de Peer, Y. (2004). Computational approaches to unveiling ancient genome duplications. NATURE REVIEWS GENETICS, 5(10), 752–763.
    Recent analyses of complete genome sequences have revealed that many genomes have been duplicated in their evolutionary past. Such events have been associated with important biological transitions, major leaps in evolution and adaptive radiations of species. Here, we consider recently developed computational methods to detect such ancient large-scale gene duplication events. Several new approaches have been used to show that large-scale gene duplications are more common than previously thought.
  266. Van de Peer, Y. (2004). “Horizontal” plant biology on the rise. GENOME BIOLOGY.
    A report on the Plant Genomics European Meeting (Plant-GEMS2004), Lyon, France, 22-25 September 2004
  267. Saeys, Yvan, Degroeve, S., & Van de Peer, Y. (2004). Digging into acceptor splice site prediction: an iterative feature selection approach. (J.-F. Boulicaut, F. Esposito, F. Giannotti, & D. Pedreschi, Eds.) LECTURE NOTES IN ARTIFICIAL INTELLIGENCE, 3202, 386–397. Presented at the 8th European conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2004).
    Feature selection techniques are often used to reduce data dimensionality, increase classification performance, and gain insight into the processes that generated the data. In this paper, we describe an iterative procedure of feature selection and feature construction steps, improving the classification of acceptor splice sites, an important subtask of gene prediction. We show that acceptor prediction can benefit from feature selection, and describe how feature selection techniques can be used to gain new insights in the classification of acceptor sites. This is illustrated by the identification of a new, biologically motivated feature: the AG-scanning feature. The results described in this paper contribute both to the domain of gene prediction, and to research in feature selection techniques, describing a new wrapper based feature weighting method that aids in knowledge discovery when dealing with complex datasets.
  268. Bonnet, E., Wuyts, J., Rouzé, P., & Van de Peer, Y. (2004). Evidence that microRNA precursors, unlike other non-coding RNAs, have lower folding free energies than random sequences. BIOINFORMATICS, 20(17), 2911–2917.
    Motivation: Most non-coding RNAs are characterized by a specific secondary and tertiary structure that determines their function. Here, we investigate the folding energy of the secondary structure of non-coding RNA sequences, such as microRNA precursors, transfer RNAs and ribosomal RNAs in several eukaryotic taxa. Statistical biases are assessed by a randomization test, in which the predicted minimum free energy of folding is compared with values obtained for structures inferred from randomly shuffling the original sequences. Results: In contrast with transfer RNAs and ribosomal RNAs, the majority of the microRNA sequences clearly exhibit a folding free energy that is considerably lower than that for shuffled sequences, indicating a high tendency in the sequence towards a stable secondary structure. A possible usage of this statistical test in the framework of the detection of genuine miRNA sequences is discussed.
  269. Gevers, D., Vandepoele, K., Simillion, C., & Van de Peer, Y. (2004). Gene duplication and biased functional retention of paralogs in bacterial genomes. TRENDS IN MICROBIOLOGY, 12(4), 148–154.
    Gene duplication is considered an important prerequisite for gene innovation that can facilitate adaptation to changing environments. The analysis of 106 bacterial genome sequences has revealed the existence of a significant number of paralogs. Analysis of the functional classification of these paralogs reveals a preferential enrichment in functional classes that are involved in transcription, metabolism and defense mechanisms. From the organization of paralogs in the genome we can conclude that duplicated genes in bacteria appear to have been mainly created by small-scale duplication events, such as tandem and operon duplications.
  270. Wuyts, Jan, Perrière, G., & Van de Peer, Y. (2004). The European ribosomal RNA database. NUCLEIC ACIDS RESEARCH, 32, D101–D103.
    The European ribosomal RNA database aims to compile all complete or nearly complete ribosomal RNA sequences from both the small (SSU) and large (LSU) ribosomal subunits. All sequences are available in aligned format. Sequence alignment is based on the secondary structure of the molecules, as determined by comparative sequence analysis. Additional information about the sequences, such as taxonomic classification of the organism from which they have been obtained, and literature references are also provided. In order to identify the closest relatives to newly determined sequences, BLAST searches can be performed, after which the best matching sequences are aligned and a phylogenetic tree is inferred. As of 2003, the European ribosomal RNA database is maintained at Ghent University (Belgium). The database can be consulted at http://www.psb.ugent.be/rRNA/.
  271. Simillion, C., Vandepoele, K., Saeys, Y., & Van de Peer, Y. (2004). Building genomic profiles for uncovering segmental homology in the twilight zone. GENOME RESEARCH, 14(6), 1095–1106.
    The identification of homologous regions within and between genomes is an essential prerequisite for studying genome structure and evolution. Different methods already exist that allow detecting homologous regions in an automated manner. These methods are based either on finding sequence similarities at the DNA level or on identifying chromosomal regions showing conservation of gene order and content. Especially the latter approach has proven useful for detecting homology between highly divergent chromosomal regions. However, until now, such map-based approaches required that candidate homologous regions show significant collinearity with other segments to be considered as being homologous. Here, we present a novel method that creates profiles combining the gene order and content information of multiple mutually homologous genomic segments. These profiles can be used to scan one or more genomes to detect segments that show significant collinearity with the entire profile but not necessarily with individual segments. When applying this new method to the combined genomes of Arabidopsis and rice, we find additional evidence for ancient duplication events in the rice genome.
  272. Vandepoele, K., Simillion, C., & Van de Peer, Y. (2004). The quest for genomic homology. CURRENT GENOMICS, 5(4), 299–308.
  273. Alvares, L. E., Wuyts, J., Van de Peer, Y., Silva, E. P., Coutinho, L. L., Brison, O., & Ruiz, I. R. (2004). The 18S rRNA from Odontophrynus americanus 2n and 4n (Amphibia, Anura) reveals unusual extra sequences in the variable region V2. GENOME, 47(3), 421–428.
    The nucleotide sequence of the rDNA 18S region isolated from diploid and tetraploid species of the amphibian Odontophrynus americanus was determined and used to predict the secondary structure of the corresponding 18S rRNA molecules. Comparison of the primary and secondary structures for the 2n and 4n species confirmed that these species are very closely related. Only three nucleotide substitutions were observed, accounting for 99% identity between the 18S sequences, whereas several changes were detected by comparison with the Xenopus laevis 18S sequence (96% identity). Most changes were located in highly variable regions of the molecule. A noticeable feature of the Odontophrynus 18S rRNA was the presence of unusual extra sequences in the V2 region, between helices 9 and 11. These extra sequences do not fit the model for secondary structure predicted for vertebrate 18S rRNA.
  274. Vandepoele, K., De Vos, W., Taylor, J. S., Meyer, A., & Van de Peer, Y. (2004). Major events in the genome evolution of vertebrates: paranome age and size differ considerably between ray-finned fishes and land vertebrates. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 101(6), 1638–1643.
    It has been suggested that fish have more genes than humans. Whether most of these additional genes originated through a complete (fish-specific) genome duplication or through many lineage-specific tandem gene or smaller block duplications and family expansions continues to be debated. We analyzed the complete genome of the pufferfish Takifugu rubripes (Fugu) and compared it with the paranome of humans. We show that most paralogous genes of Fugu are the result of three complete genome duplications. Both relative and absolute dating of the complete predicted set of protein-coding genes suggest that initial genome duplications, estimated to have occurred at least 600 million years ago, shaped the genome of all vertebrates. In addition, analysis of >150 block duplications in the Fugu genome clearly supports a fish-specific genome duplication (approximately 320 million years ago) that coincided with the vast radiation of most modern ray-finned fishes. Unlike the human genome, Fugu contains very few recently duplicated genes; hence, many human genes are much younger than fish genes. This lack of recent gene duplication, or, alternatively, the accelerated rate of gene loss, is possibly one reason for the drastic reduction of the genome size of Fugu observed during the past 100 million years or so, subsequent to the additional genome duplication that ray-finned fishes but not land vertebrates experienced.
  275. Saeys, Yvan, Degroeve, S., Aeyels, D., Rouzé, P., & Van de Peer, Y. (2004). Feature selection for splice site prediction: A new method using EDA-based feature ranking. BMC BIOINFORMATICS, 5, 64.–64.11.
    Background: The identification of relevant biological features in large and complex datasets is an important step towards gaining insight in the processes underlying the data. Other advantages of feature selection include the ability of the classification system to attain good or even better solutions using a restricted subset of features, and a faster classification. Thus, robust methods for fast feature selection are of key importance in extracting knowledge from complex biological data. Results: In this paper we present a novel method for feature subset selection applied to splice site prediction, based on estimation of distribution algorithms, a more general framework of genetic algorithms. From the estimated distribution of the algorithm, a feature ranking is derived. Afterwards this ranking is used to iteratively discard features. We apply this technique to the problem of splice site prediction, and show how it can be used to gain insight into the underlying biological process of splicing. Conclusion: We show that this technique proves to be more robust than the traditional use of estimation of distribution algorithms for feature selection: instead of returning a single best subset of features (as they normally do) this method provides a dynamical view of the feature selection process, like the traditional sequential wrapper methods. However, the method is faster than the traditional techniques, and scales better to datasets described by a large number of features.
  276. Saeys, Yvan, Degroeve, S., Aeyels, D., Rouzé, P., & Van de Peer, Y. (2004). Selecting relevant features for gene structure prediction. In A. Nowé, T. Lenaerts, & K. Steenhaut (Eds.), Proceedings of Benelearn 2004 (pp. 103–109). VUB Press.
  277. Degroeve, S., Saeys, Y., De Baets, B., Van de Peer, Y., & Rouzé, P. (2004). Splice site prediction in eukaryote genome sequences: the algorithmic issues. In J. Seckbach & E. Rubin (Eds.), The new avenues in bioinformatics (pp. 99–111). Dordrecht, The Netherlands: Kluwer Academic.
  278. Trindade, G. S., da Fonseca, F. G., Marques, J. T., Diniz, S., Leite, J. A., De Bodt, S., Van de Peer, Y., et al. (2004). Belo Horizonte virus: a vaccinia-like virus lacking the A-type inclusion body gene isolated from infected mice. JOURNAL OF GENERAL VIROLOGY, 85(7), 2015–2021.
    Here is described the isolation of a naturally occurring A-type inclusion body (ATI)-negative vaccinia-like virus, Belo Horizonte virus (VBH), obtained from a mousepox-like outbreak in Brazil. The isolated virus was identified and characterized as an orthopoxvirus by conventional methods. Molecular characterization of the virus was done by DNA cross-hybridization using Vaccinia virus (VACV) DNA. In addition, conserved orthopoxvirus genes such as vaccinia growth factor, thymidine kinase and haemagglutinin were amplified by PCR and sequenced. All sequences presented high similarity to VACV genes. Based on the sequences, phenograms were constructed for comparison with other poxviruses; VBH clustered consistently with VACV strains. Attempts to amplify the ATI gene (ati) by PCR, currently used to identify orthopoxviruses, were unsuccessful. Results presented here suggest that most of the ati gene is deleted in the VBH genome.
  279. Khadaroo, B., Robbens, S., Ferraz, C., Derelle, E., Eychenié, S., Cooke, R., Peaucellier, G., et al. (2004). The first green lineage cdc25 dual-specificity phosphatase. CELL CYCLE, 3(4), 513–518.
    The Cdc25 protein phosphatase is a key enzyme involved in the regulation of the G2/M transition in metazoans and yeast. However, no Cdc25 ortholog has so far been identified in plants, although functional studies have shown that an activating dephosphorylation of the CDK-cyclin complex regulates the G2/M transition. In this paper, the first green lineage Cdc25 ortholog is described in the unicellular alga Ostreococcus tauri. It encodes a protein which is able to rescue the yeast S. pombe cdc25-22 conditional mutant. Furthermore, microinjection of GST-tagged O. tauri Cdc25 specifically activates prophase-arrested starfish oocytes. In vitro histone H1 kinase assays and anti-phosphotyrosine Western Blotting confirmed the in vivo activating dephosphorylation of starfish CDK1-cyclinB by recombinant O. tauri Cdc25. We propose that there has been coevolution of the regulatory proteins involved in the control of M-phase entry in the metazoan, yeast and green lineages.
  280. Florquin, K., Degroeve, S., Saeys, Y., & Van de Peer, Y. (2004). The role of non-linear DNA structures in transcription. Belgian Bioinformatics Conference, 4th, Abstracts. Presented at the 4th Belgian Bioinformatics Conference (BBC 2004).
  281. Vlieghe, Kobe, Vuylsteke, M., Florquin, K., Rombauts, S., Maes, S., Ormenese, S., Van Hummelen, P., et al. (2003). Microarray analysis of E2Fa-DPa-overexpressing plants uncovers a cross-talking genetic network between DNA replication and nitrogen assimilation. JOURNAL OF CELL SCIENCE, 116(20), 4249–4259.
    Previously we have shown that overexpression of the heterodimeric E2Fa-DPa transcription factor in Arabidopsis thaliana results in ectopic cell division, increased endoreduplication, and an early arrest in development. To gain a better insight into the phenotypic behavior of E2Fa-DPa transgenic plants and to identify E2Fa-DPa target genes, a transcriptomic microarray analysis was performed. Out of 4,390 unique genes, a total of 188 had a twofold or more up- (84) or down-regulated (104) expression level in E2Fa-DPa transgenic plants compared to wild-type lines. Detailed promoter analysis allowed the identification of novel E2Fa-DPa target genes, mainly involved in DNA replication. Secondarily induced genes encoded proteins involved in cell wall biosynthesis, transcription and signal transduction or had an unknown function. A large number of metabolic genes were modified as well, among which, surprisingly, many genes were involved in nitrate assimilation. Our data suggest that the growth arrest observed upon E2Fa-DPa overexpression results at least partly from a nitrogen drain to the nucleotide synthesis pathway, causing decreased synthesis of other nitrogen compounds, such as amino acids and storage proteins.
  282. Rombauts, S., Florquin, K., Lescot, M., Marchal, K., Rouzé, P., & Van de Peer, Y. (2003). Computational approaches to identify promoters and cis-regulatory elements in plant genomes. PLANT PHYSIOLOGY, 132(3), 1162–1176.
    The identification of promoters and their regulatory elements is one of the major challenges in bioinformatics and integrates comparative, structural, and functional genomics. Many different approaches have been developed to detect conserved motifs in a set of genes that are either coregulated or orthologous. However, although recent approaches seem promising, in general, unambiguous identification of regulatory elements is not straightforward. The delineation of promoters is even harder, due to its complex nature, and in silico promoter prediction is still in its infancy. Here, we review the different approaches that have been developed for identifying promoters and their regulatory elements. We discuss the detection of cis-acting regulatory elements using word-counting or probabilistic methods (so-called "search by signal" methods) and the delineation of promoters by considering both sequence content and structural features ("search by content" methods). As an example of search by content, we explored in greater detail the association of promoters with CpG islands. However, due to differences in sequence content, the parameters used to detect CpG islands in humans and other vertebrates cannot be used for plants. Therefore, a preliminary attempt was made to define parameters that could possibly define CpG and CpNpG islands in Arabidopsis, by exploring the compositional landscape around the transcriptional start site. To this end, a data set of more than 5,000 gene sequences was built, including the promoter region, the 5'-untranslated region, and the first introns and coding exons. Preliminary analysis shows that promoter location based on the detection of potential CpG/CpNpG islands in the Arabidopsis genome is not straightforward. Nevertheless, because the landscape of CpG/CpNpG islands differs considerably between promoters and introns on the one side and exons (whether coding or not) on the other, more sophisticated approaches can probably be developed for the successful detection of "putative" CpG and CpNpG islands in plants.
  283. De Bodt, Stefanie, Raes, J., Florquin, K., Rombauts, S., Rouzé, P., Theißen, G., & Van de Peer, Y. (2003). Genomewide structural annotation and evolutionary analysis of the type I MADS-box genes in plants. JOURNAL OF MOLECULAR EVOLUTION, 56(5), 573–586.
    The type I MADS-box genes constitute a largely unexplored subfamily of the extensively studied MADS-box gene family, well known for its role in flower development. Genes of the type I MADS-box subfamily possess the characteristic MADS box but are distinguished from type II MADS-box genes by the absence of the keratin-like box. In this in silico study, we have structurally annotated all 47 members of the type I MADS-box gene family in Arabidopsis thaliana and exerted a thorough analysis of the C-terminal regions of the translated proteins. On the basis of conserved motifs in the C-terminal region, we could classify the gene family into three main groups, two of which could be further subdivided. Phylogenetic trees were inferred to study the evolutionary relationships within this large MADS-box gene subfamily. These suggest for plant type I genes a dynamic of evolution that is significantly different from the mode of both animal type I (SRF) and plant type II (MIKC-type) gene phylogeny. The presence of conserved motifs in the majority of these genes, the identification of Oryza sativa MADS-box type I homologues, and the detection of expressed sequence tags for Arabidopsis thaliana and other plant type I genes suggest that these genes are indeed of functional importance to plants. It is therefore even more intriguing that, from an experimental point of view, almost nothing is known about the function of these MADS-box type I genes.
  284. Saeys, Yvan, Degroeve, S., Aeyels, D., Van de Peer, Y., & Rouzé, P. (2003). Fast feature selection using a simple estimation of distribution algorithm: a case study on splice site prediction. BIOINFORMATICS, 19(suppl. 2), ii179–ii188.
    Motivation: Feature subset selection is an important preprocessing step for classification. In biology, where structures or processes are described by a large number of features, the elimination of irrelevant and redundant information in a reasonable amount of time has a number of advantages. It enables the classification system to achieve good or even better solutions with a restricted subset of features, allows for a faster classification, and it helps the human expert focus on a relevant subset of features, hence providing useful biological knowledge. Results: We present a heuristic method based on Estimation of Distribution Algorithms to select relevant subsets of features for splice site prediction in Arabidopsis thaliana. We show that this method performs a fast detection of relevant feature subsets using the technique of constrained feature subsets. Compared to the traditional greedy methods the gain in speed can be up to one order of magnitude, with results being comparable or even better than the greedy methods. This makes it a very practical solution for classification tasks that can be solved using a relatively small amount of discriminative features (or feature dependencies), but where the initial set of potential discriminative features is rather large.
  285. Meyer, Axel, & Van de Peer, Y. (2003). “Natural selection merely modified while redundancy created”: Susumu Ohno’s idea of the evolutionary importance of gene and genome duplications. JOURNAL OF STRUCTURAL AND FUNCTIONAL GENOMICS, 3(1-4), VII–IX.
  286. Raes, J., Vandepoele, K., Simillion, C., Saeys, Y., & Van de Peer, Y. (2003). Investigating ancient duplication events in the Arabidopsis genome. JOURNAL OF STRUCTURAL AND FUNCTIONAL GENOMICS, 3(1-4), 117–129.
  287. Van de Peer, Y., Taylor, J. S., & Meyer, A. (2003). Are all fish ancient tetraploids? In Axel Meyer & Y. Van de Peer (Eds.), Genome evolution : gene and genome duplications and the origin of novel gene functions (pp. 65–73). Dordrecht, The Netherlands: Kluwer Academic.
  288. Meyer, A., & Van de Peer, Y. (Eds.). (2003). Genome Evolution: Gene and Genome Duplications and the Origin of Novel Gene Functions. Kluwer Academic.
  289. Vandenbussche, Michiel, Theißen, G., Van de Peer, Y., & Gerats, T. (2003). Structural diversification and neo-functionalization during floral MADS-box gene evolution by C-terminal frameshift mutations. NUCLEIC ACIDS RESEARCH, 31(15), 4401–4409.
    Frameshift mutations generally result in loss-of-function changes since they drastically alter the protein sequence downstream of the frameshift site, besides creating premature stop codons. Here we present data suggesting that frameshift mutations in the C-terminal domain of specific ancestral MADS-box genes may have contributed to the structural and functional divergence of the MADS-box gene family. We have identified putative frameshift mutations in the conserved C-terminal motifs of the B-function DEF/AP3 subfamily, the A-function SQUA/AP1 subfamily and the E-function AGL2 subfamily, which are all involved in the specification of organ identity during flower development. The newly evolved C-terminal motifs are highly conserved, suggesting a de novo generation of functionality. Interestingly, since the new C-terminal motifs in the A- and B-function subfamilies are only found in higher eudicotyledonous flowering plants, the emergence of these two C-terminal changes coincides with the origin of a highly standardized floral structure. We speculate that the frameshift mutations described here are examples of co-evolution of the different components of a single transcription factor complex. 3' terminal frameshift mutations might provide an important but so far unrecognized mechanism to generate novel functional C-terminal motifs instrumental to the functional diversification of transcription factor families.
  290. Paraskevis, D., Lemey, P., Salemi, M., Suchard, M., Van de Peer, Y., & Vandamme, A.-M. (2003). Analysis of the evolutionary relationships of HIV-1 and SIVcpz sequences using Bayesian inference: implications for the origin of HIV-1. MOLECULAR BIOLOGY AND EVOLUTION, 20(12), 1986–1996.
    The most plausible origin of HIV-1 group M is an SIV lineage currently represented by SIVcpz isolated from the chimpanzee subspecies Pan troglodytes troglodytes. The origin of HIV-1 group O is less clear. Putative recombination between any of the HIV-1 and SIVcpz sequences was tested using bootscanning and Bayesian-scanning plots, as well as a new method using a Bayesian multiple change-point (BMCP) model to infer parental sequences and crossing-over points. We found that in the case of highly divergent sequences, such as HIV-1/SIVcpz, Bayesian scanning and BMCP methods are more appropriate than bootscanning analysis to investigate spatial phylogenetic variation, including estimating the boundaries of the regions with discordant evolutionary relationships and the levels of support of the phylogenetic clusters under study. According to the Bayesian scanning plots and BMCP method, there was strong evidence for discordant phylogenetic clustering throughout the genome: (1) HIV-1 group O clustered with SIVcpzANT/TAN in middle pol, and partial vif/env; (2) SIVcpzGab1 clustered with SIVcpzANT/TAN in 3'pol/vif, and middle env; (3) HIV-1 group O grouped with SIVcpzCamUS and SIVcpzGab1 in p17/p24; (4) HIV-1 group M was more closely related to SIVcpzCamUS in 3'gag/pol and in middle pol, whereas in partial gp120 group M clustered with group O. Conditionally independent phylogenetic analysis inferred by maximum likelihood (ML) and Bayesian methods further confirmed these findings. The discordant phylogenetic relationships between the HIV-1/SIVcpz sequences may have been caused by ancient recombination events, but they are also due, at least in part, to altered rates of evolution between parental SIVcpz lineages.
  291. Raes, J., Vandepoele, K., Simillion, C., Saeys, Y., & Van de Peer, Y. (2003). Investigating ancient duplication events in the Arabidopsis genome. In Axel Meyer & Y. Van de Peer (Eds.), Genome evolution : gene and genome duplications and the origin of novel gene functions (pp. 117–129). Dordrecht, The Netherlands: Kluwer Academic.
  292. Rombauts, S., Van de Peer, Y., & Rouzé, P. (2003). AFLPinSilico, simulating AFLP fingerprints. BIOINFORMATICS, 19(6), 776–777.
    A drawback of the Amplified Fragment Length Polymorphism (AFLP) fingerprinting method is the difficulty to correlate the different fragments with their DNA sequence. The AFLPinSilico application presented here simulates AFLP experiments run on either cDNA or genomic sequences, producing virtual fingerprints that allow high throughput identification of AFLP fragments. The program also enables biologists to manage experiments through simulations done beforehand, thereby reducing the number of experiments that have to be run. AFLPinSilico is available through the www or as a stand-alone version, through a command line executable (available upon request, for any platform running PERL).
  293. De Bodt, Stefanie, Raes, J., Van de Peer, Y., & Theißen, G. (2003). And then there were many: MADS goes genomic. TRENDS IN PLANT SCIENCE, 8(10), 475–483.
    During the past decade, MADS-box genes have become known as key regulators in both reproductive and vegetative plant development. Traditional genetics and functional genomics tools are now available to elucidate the expression and function of this complex gene family on a much larger scale. Moreover, comparative analysis of the MADS-box genes in diverse flowering and non-flowering plants, boosted by bioinformatics, contributes to our understanding of how this important gene family has expanded during the evolution of land plants. Therefore, the recent advances in comparative and functional genomics should enable researchers to identify the full range of MADS-box gene functions, which should help us significantly in developing a better understanding of plant development and evolution.
  294. Van de Peer, Y. (2003). Analysis of nucleotide sequences using TREECON. In M. Salemi & A.-M. Vandamme (Eds.), The phylogenetic handbook : a practical approach to DNA and protein phylogeny (pp. 236–255). Cambridge, UK: Cambridge University Press.
  295. Raes, J., & Van de Peer, Y. (2003). Gene duplication, the evolution of novel gene functions, and detecting functional divergence of duplicates in silico. APPLIED BIOINFORMATICS, 2(2), 91–101.
  296. Taylor, J. S., Braasch, I., Frickey, T., Meyer, A., & Van de Peer, Y. (2003). Genome duplication, a trait shared by 22,000 species of ray-finned fish. GENOME RESEARCH, 13(3), 382–390.
    Through phylogeny reconstruction we identified 49 genes with a single copy in man, mouse, and chicken, one or two copies in the tetraploid frog Xenopus laevis, and two copies in zebrafish (Danio rerio). For 22 of these genes, both zebrafish duplicates had orthologs in the pufferfish (Takifugu rubripes). For another 20 of these genes, we found only one pufferfish ortholog but in each case it was more closely related to one of the zebrafish duplicates than to the other. Forty-three pairs of duplicated genes map to 24 of the 25 zebrafish linkage groups but they are not randomly distributed; we identified 10 duplicated regions of the zebrafish genome that each contain between two and five sets of paralogous genes. These phylogeny and synteny data suggest that the common ancestor of zebrafish and pufferfish, a fish that gave rise to approximately 22,000 species, experienced a large-scale gene or complete genome duplication event and that the pufferfish has lost many duplicates that the zebrafish has retained.
  297. Saeys, Yvan, Degroeve, S., Aeyels, D., Van de Peer, Y., & Rouzé, P. (2003). Feature selection using a simple estimation of distribution algorithm: a case study on splice site prediction. Belgian Bioinformatics Conference, 3rd, Abstracts. Presented at the 3rd Belgian Bioinformatics Conference (BBC 2003).
  298. Van de Peer, Y. (2003). Phylogeny inference based on distance methods: theory. In M. Salemi & A.-M. Vandamme (Eds.), The phylogenetic handbook : a practical approach to DNA and protein phylogeny (pp. 101–119). Cambridge, UK: Cambridge University Press.
  299. Vandepoele, K., Simillion, C., & Van de Peer, Y. (2003). Evidence that rice and other cereals are ancient aneuploids. PLANT CELL, 15(9), 2192–2202.
    Detailed analyses of the genomes of several model organisms revealed that large-scale gene or even entire-genome duplications have played prominent roles in the evolutionary history of many eukaryotes. Recently, strong evidence has been presented that the genomic structure of the dicotyledonous model plant species Arabidopsis is the result of multiple rounds of entire-genome duplications. Here, we analyze the genome of the monocotyledonous model plant species rice, for which a draft of the genomic sequence was published recently. We show that a substantial fraction of all rice genes (approximately 15%) are found in duplicated segments. Dating of these block duplications, their nonuniform distribution over the different rice chromosomes, and comparison with the duplication history of Arabidopsis suggest that rice is not an ancient polyploid, as suggested previously, but an ancient aneuploid that has experienced the duplication of one, or a large part of one, chromosome in its evolutionary past, approximately 70 million years ago. This date predates the divergence of most of the cereals, and relative dating by phylogenetic analysis shows that this duplication event is shared by most if not all of them.
  300. Van de Peer, Y., Taylor, J. S., & Meyer, A. (2003). Are all fishes ancient polyploids? JOURNAL OF STRUCTURAL AND FUNCTIONAL GENOMICS, 3(1-4), 65–73.
  301. Raes, J., Rohde, A., Christensen, J. H., Van de Peer, Y., & Boerjan, W. (2003). Genome-wide characterization of the lignification toolbox in Arabidopsis. PLANT PHYSIOLOGY, 133(3), 1051–1071.
    Lignin, one of the most abundant terrestrial biopolymers, is indispensable for plant structure and defense. With the availability of the full genome sequence, large collections of insertion mutants, and functional genomics tools, Arabidopsis constitutes an excellent model system to profoundly unravel the monolignol biosynthetic pathway. In a genome-wide bioinformatics survey of the Arabidopsis genome, 34 candidate genes were annotated that encode genes homologous to the 10 presently known enzymes of the monolignol biosynthesis pathway, nine of which have not been described before. By combining evolutionary analysis of these 10 gene families with in silico promoter analysis and expression data (from a reverse transcription-polymerase chain reaction analysis on an extensive tissue panel, mining of expressed sequence tags from publicly available resources, and assembling expression data from literature), 12 genes could be pinpointed as the most likely candidates for a role in vascular lignification. Furthermore, a possible novel link was detected between the presence of the AC regulatory promoter element and the biosynthesis of G lignin during vascular development. Together, these data describe the full complement of monolignol biosynthesis genes in Arabidopsis, provide a unified nomenclature, and serve as a basis for further functional studies.
  302. Wuyts, Jan, Van de Peer, Y., Winkelmans, T., & De Wachter, R. (2002). The European database on small subunit ribosomal RNA. NUCLEIC ACIDS RESEARCH, 30(1), 183–185.
    The European database on SSU rRNA can be consulted via the World Wide Web at http://rrna.uia.ac.be/ssu/ and compiles all complete or nearly complete small subunit ribosomal RNA sequences. Sequences are provided in aligned format. The alignment takes into account the secondary structure information derived by comparative sequence analysis of thousands of sequences. Additional information such as literature references, taxonomy, secondary structure models and nucleotide variability maps, is also available.
  303. Saeys, Yvan, Aeyels, D., Stanssens, P., Van de Peer, Y., & Zabeau, M. (2002). Retrieving DNA sequence information from mass spectra of nucleic acids: application to the detection and identification of SNPs. Belgian Bioinformatics Conference, 2nd, Abstracts. Presented at the 2nd Belgian Bioinformatics Conference (BBC 2002).
  304. Saeys, Yvan, Degroeve, S., Aeyels, D., Van de Peer, Y., & Rouzé, P. (2002). Feature subset selection for splice site prediction by estimation of distribution algorithms. Computational Biology, European conference, Abstracts. Presented at the European conference on Computational Biology 2002 (ECCB 2002).
  305. Rensing, S. A., Rombauts, S., Van de Peer, Y., & Reski, R. (2002). Moss transcriptome and beyond. TRENDS IN PLANT SCIENCE.
    The ancient land plant Physcomitrella patens is a model system that is becoming increasingly important for plant functional genomics because gene knockouts can be produced with relative ease. Recently, several EST-sequencing projects have been launched as a first step towards a thorough functional characterization of the moss. However, for careful comparison with other plant model systems, the complete genomic sequence is needed as well as the transcriptome.
  306. Vandepoele, K., Saeys, Y., Simillion, C., Raes, J., & Van de Peer, Y. (2002). The automatic detection of homologous regions (ADHoRe) and its application to microcolinearity between Arabidopsis and rice. GENOME RESEARCH, 12(11), 1792–1801.
    It is expected that one of the merits of comparative genomics lies in the transfer of structural and functional information from one genome to another. This is based on the observation that, although the number of chromosomal rearrangements that occur in genomes is extensive, different species still exhibit a certain degree of conservation regarding gene content and gene order. It is in this respect that we have developed a new software tool for the Automatic Detection of Homologous Regions (ADHoRe). ADHoRe was primarily developed to find large regions of microcolinearity, taking into account different types of microrearrangements such as tandem duplications, gene loss and translocations, and inversions. Such rearrangements often complicate the detection of colinearity, in particular when comparing more anciently diverged species. Application of ADHoRe to the complete genome of Arabidopsis and a large collection of concatenated rice BACs yields more than 20 regions showing statistically significant microcolinearity between both plant species. These regions comprise from 4 up to 11 conserved homologous gene pairs. We predict the number of homologous regions and the extent of microcolinearity to increase significantly once better annotations of the rice genome become available.
  307. Ben Ali, A., De Baere, R., De Wachter, R., & Van de Peer, Y. (2002). Evolutionary relationships among heterokont algae (the autotrophic stramenopiles) based on combined analyses of small and large subunit ribosomal RNA. PROTIST, 153(2), 123–132.
    In order to study the phylogenetic relationships within the stramenopiles, and particularly among the heterokont algae, we have determined complete or nearly complete large-subunit ribosomal RNA sequences for different species of raphidophytes, phaeophytes, xanthophytes, chrysophytes, synurophytes and pinguiophytes. With the small- and large-subunit ribosomal RNA sequences of representatives for nearly all known groups of heterokont algae, phylogenetic trees were constructed from a concatenated alignment of both ribosomal RNAs, including more than 5,000 positions. By using different tree construction methods, inferred phylogenies showed phaeophytes and xanthophytes as sister taxa, as well as the pelagophytes and dictyochophytes, and the chrysophytes/synurophytes and eustigmatophytes. All these relationships are highly supported by bootstrap analysis. However, apart from these sister group relationships, very few other internodes are well resolved and most groups of heterokont algae seem to have diverged within a relatively short time frame.
  308. Simillion, C., Vandepoele, K., Van Montagu, M., Zabeau, M., & Van de Peer, Y. (2002). The hidden duplication past of Arabidopsis thaliana. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 99(21), 13627–13632.
    Analysis of the genome sequence of Arabidopsis thaliana shows that this genome, like that of many other eukaryotic organisms, has undergone large-scale gene duplications or even duplications of the entire genome. However, the high frequency of gene loss after duplication events reduces colinearity and therefore the chance of finding duplicated regions that, at the extreme, no longer share homologous genes. In this study we show that heavily degenerated block duplications that can no longer be recognized by directly comparing two segments because of differential gene loss, can still be detected through indirect comparison with other segments. When these so-called hidden duplications in Arabidopsis are taken into account, many homologous genomic regions can be found in five to eight copies. This finding strongly implies that Arabidopsis has undergone three, but probably no more, rounds of genome duplications. Therefore, adding such hidden blocks to the duplication landscape of Arabidopsis sheds light on the number of polyploidy events that this model plant genome has undergone in its evolutionary past.
  309. Van de Peer, Y., Frickey, T., Taylor, J. S., & Meyer, A. (2002). Dealing with saturation at the amino acid level: a case study based on anciently duplicated zebrafish genes. GENE, 295(2), 205–211. Presented at the 3rd Anton Dohrn Workshop.
  310. Saeys, Yvan, Degroeve, S., Aeyels, D., Van de Peer, Y., & Rouzé, P. (2002). Selecting Relevant Features for Splice Site Prediction by Estimation of Distribution Algorithms. Proceedings of Benelearn 2002 (pp. 64–71).
  311. Rombauts, S., Lescot, M., Thijs, G., Marchal, K., Moreau, Y., Déhais, P., Van de Peer, Y., et al. (2002). The PlantCARE database and tools for in silico search of plant cis-acting regulatory elements. JOBIM 2002 : journées ouvertes biologie, informatique, mathématique (pp. 183–184). Presented at the Journées Ouvertes Biologie, Informatique, Mathématique 2002 (JOBIM 2002).
  312. Lescot, M., Déhais, P., Thijs, G., Marchal, K., Moreau, Y., Van de Peer, Y., Rouzé, P., et al. (2002). PlantCARE, a database of plant cis-acting regulatory elements and a portal to tools for in silico analysis of promoter sequences. NUCLEIC ACIDS RESEARCH, 30(1), 325–327.
    PlantCARE is a database of plant cis-acting regulatory elements, enhancers and repressors. Regulatory elements are represented by positional matrices, consensus sequences and individual sites on particular promoter sequences. Links to the EMBL, TRANSFAC and MEDLINE databases are provided when available. Data about the transcription sites are extracted mainly from the literature, supplemented with an increasing number of in silico predicted data. Apart from a general description for specific transcription factor sites, levels of confidence for the experimental evidence, functional information and the position on the promoter are given as well. New features have been implemented to search for plant cis-acting regulatory elements in a query sequence. Furthermore, links are now provided to a new clustering and motif search method to investigate clusters of co-expressed genes. New regulatory elements can be sent automatically and will be added to the database after curation.
  313. Vandepoele, K., Simillion, C., & Van de Peer, Y. (2002). Detecting the undetectable: uncovering duplicated segments in Arabidopsis by comparison with rice. TRENDS IN GENETICS, 18(12), 606–608.
  314. Vandepoele, K., Saeys, Y., Simillion, C., Raes, J., & Van de Peer, Y. (2002). Detecting microcolinearity between Arabidopsis and Rice. Proceedings of the 6th Gatersleben Research Conference (2002), “Plant Genetic Resources in the Genomic Era: Genetic Diversity, Genome Evolution and New Applications”.
  315. Oborník, M., Van de Peer, Y., Hypša, V., Frickey, T., Šlapeta, J. R., Meyer, A., & Lukeš, J. (2002). Phylogenetic analyses suggest lateral gene transfer from the mitochondrion to the apicoplast. GENE, 285(1-2), 109–118.
  316. Van de Peer, Y., Taylor, J. S., Joseph, J., & Meyer, A. (2002). Wanda: a database of duplicated fish genes. NUCLEIC ACIDS RESEARCH, 30(1), 109–112.
    Comparative genomics has shown that ray-finned fish (Actinopterygii) contain more copies of many genes than other vertebrates. A large number of these additional genes appear to have been produced during a genome duplication event that occurred early during the evolution of Actinopterygii (i.e. before the teleost radiation). In addition to this ancient genome duplication event, many lineages within Actinopterygii have experienced more recent genome duplications. Here we introduce a curated database named Wanda that lists groups of orthologous genes with one copy from man, mouse and chicken, one or two from tetraploid Xenopus and two or more ancient copies (i.e. paralogs) from ray-finned fish. The database also contains the sequence alignments and phylogenetic trees that were necessary for determining the correct orthologous and paralogous relationships among genes. Where available, map positions and functional data are also reported. The Wanda database should be of particular use to evolutionary and developmental biologists who are interested in the evolutionary and functional divergence of genes after duplication. Wanda is available at http://www.evolutionsbiologie.uni-konstanz.de/Wanda/.
  317. Degroeve, S., De Baets, B., Van de Peer, Y., & Rouzé, P. (2002). Feature subset selection for splice site prediction. BIOINFORMATICS, 18(suppl. 2), S75–S83. Presented at the European Conference on Computational Biology 2002 (ECCB 2002).
    Motivation: The large amount of available annotated Arabidopsis thaliana sequences allows the induction of splice site prediction models with supervised learning algorithms (see Haussler (1998) for a review and references). These algorithms need information sources or features from which the models can be computed. For splice site prediction, the features we consider in this study are the presence or absence of certain nucleotides in close proximity to the splice site. Since it is not known how many and which nucleotides are relevant for splice site prediction, the set of features is chosen large enough such that the probability that all relevant information sources are in the set is very high. Using only those features that are relevant for constructing a splice site prediction system might improve the system and might also provide us with useful biological knowledge. Using fewer features will of course also improve the prediction speed of the system. Results: A wrapper-based feature subset selection algorithm using a support vector machine or a naive Bayes prediction method was evaluated against the traditional method for selecting features relevant for splice site prediction. Our results show that this wrapper approach selects features that improve the performance against the use of all features and against the use of the features selected by the traditional method.
  318. Van de Peer, Y. (2001). Phylogeny branches out. NATURE.
  319. Taylor, J. S., Van de Peer, Y., & Meyer, A. (2001). Revisiting recent challenges to the ancient fish-specific genome duplication hypothesis. CURRENT BIOLOGY, 11(24), R1005–R1007.
  320. Van de Peer, Y., Taylor, J. S., Braasch, I., & Meyer, A. (2001). The ghost of selection past: rates of evolution and functional divergence of anciently duplicated genes. JOURNAL OF MOLECULAR EVOLUTION, 53(4-5), 436–446.
    The duplication of genes and even complete genomes may be a prerequisite for major evolutionary transitions and the origin of evolutionary novelties. However, the evolutionary mechanisms of gene evolution and the origin of novel gene functions after gene duplication have been a subject of many debates. Recently, we compiled 26 groups of orthologous genes, which included one gene from human, mouse, and chicken, one or two genes from the tetraploid Xenopus and two genes from zebrafish. Comparative analysis and mapping data showed that these pairs of zebrafish genes were probably produced during a fish-specific genome duplication that occurred between 300 and 450 Mya, before the teleost radiation (Taylor et al. 2001). As discussed here, many of these retained duplicated genes code for DNA binding proteins. Different models have been developed to explain the retention of duplicated genes and in particular the subfunctionalization model of Force et al. (1999) could explain why so many developmental control genes have been retained. Other models are harder to reconcile with this particular set of duplicated genes. Most genes seem to have been subjected to strong purifying selection, keeping properties such as charge and polarity the same in both duplicates, although some evidence was found for positive Darwinian selection, in particular for Hox genes. However, since only the cumulative pattern of nucleotide substitutions can be studied, clear indications of positive Darwinian selection or neutrality may be hard to find for such anciently duplicated genes. Nevertheless, an increase in evolutionary rate in about half of the duplicated genes seems to suggest that either positive Darwinian selection has occurred or that functional constraints have been relaxed at one point in time during functional divergence.
  321. Taylor, J. S., Van de Peer, Y., Braasch, I., & Meyer, A. (2001). Comparative genomics provides evidence for an ancient genome duplication event in fish. PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY B-BIOLOGICAL SCIENCES, 356(1414), 1661–1679.
    There are approximately 25 000 species in the division Teleostei and most are believed to have arisen during a relatively short period of time ca. 200 Myr ago. The discovery of 'extra' Hox gene clusters in zebrafish (Danio rerio), medaka (Oryzias latipes), and pufferfish (Fugu rubripes), has led to the hypothesis that genome duplication provided the genetic raw material necessary for the teleost radiation. We identified 27 groups of orthologous genes which included one gene from man, mouse and chicken, one or two genes from tetraploid Xenopus and two genes from zebrafish. A genome duplication in the ancestor of teleost fishes is the most parsimonious explanation for the observations that for 15 of these genes, the two zebrafish orthologues are sister sequences in phylogenies that otherwise match the expected organismal tree, the zebrafish gene pairs appear to have been formed at approximately the same time, and are unlinked. Phylogenies of nine genes differ a little from the tree predicted by the fish-specific genome duplication hypothesis: one tree shows a sister sequence relationship for the zebrafish genes but differs slightly from the expected organismal tree and in eight trees, one zebrafish gene is the sister sequence to a clade which includes the second zebrafish gene and orthologues from Xenopus, chicken, mouse and man. For these nine gene trees, deviations from the predictions of the fish-specific genome duplication hypothesis are poorly supported. The two zebrafish orthologues for each of the three remaining genes are tightly linked and are, therefore, unlikely to have been formed during a genome duplication event. We estimated that the unlinked duplicated zebrafish genes are between 300 and 450 Myr. Thus, genome duplication could have provided the genetic raw material for teleost radiation. Alternatively, the loss of different duplicates in different populations (i.e. 'divergent resolution') may have promoted speciation in ancient teleost populations.
  322. Ben Ali, A., De Baere, R., Van der Auwera, G., De Wachter, R., & Van de Peer, Y. (2001). Phylogenetic relationships among algae based on complete large-subunit rRNA sequences. INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY, 51(3), 737–749.
    The complete or nearly complete large-subunit rRNA (LSU rRNA) sequences were determined for representatives of several algal groups such as the chlorarachniophytes, cryptomonads, haptophytes, bacillariophytes, dictyochophytes and pelagophytes. Our aim was to study the phylogenetic position and relationships of the different groups of algae, and in particular to study the relationships among the different classes of heterokont algae. In LSU rRNA phylogenies, the chlorarachniophytes, cryptomonads and haptophytes seem to form independent evolutionary lineages, for which a specific relationship with any of the other eukaryotic taxa cannot be demonstrated. This is in accordance with phylogenies inferred on the basis of the small-subunit rRNA (SSU rRNA). Regarding the heterokont algae, which form a well-supported monophyletic lineage on the basis of LSU rRNA, resolution between the different classes could be improved by combining the SSU and LSU rRNA data. Based on a concatenated alignment of both molecules, the phaeophytes and the xanthophytes are sister taxa, as well as the pelagophytes and the dictyochophytes, and the chrysophytes and the eustigmatophytes. All these sister group relationships are highly supported by bootstrap analysis and by different methods of tree construction.
  323. Wuyts, Jan, Van de Peer, Y., & De Wachter, R. (2001). Distribution of substitution rates and location of insertion sites in the tertiary structure of ribosomal RNA. NUCLEIC ACIDS RESEARCH, 29(24), 5017–5028.
    The relative substitution rate of each nucleotide site in bacterial small subunit rRNA, large subunit rRNA and 5S rRNA was calculated from sequence alignments for each molecule. Two-dimensional and three-dimensional variability maps of the rRNAs were obtained by plotting the substitution rates on secondary structure models and on the tertiary structure of the rRNAs available from X-ray diffraction results. This showed that the substitution rates are generally low near the centre of the ribosome, where the nucleotides essential for its function are situated, and that they increase towards the surface. An inventory was made of insertions characteristic of the Archaea, Bacteria and Eucarya domains, and for additional insertions present in specific eukaryotic taxa. All these insertions occur at the ribosome surface. The taxon-specific insertions seem to arise randomly in the eukaryotic evolutionary tree, without any phylogenetic relatedness between the taxa possessing them.
  324. Wenderoth, K., Marquardt, J., Fraunholz, M., Van de Peer, Y., Wastl, J., & Maier, U.-G. (1999). The taxonomic position of Chlamydomyxa labyrinthuloides. EUROPEAN JOURNAL OF PHYCOLOGY, 34(2), 97–108.
    Chlamydomyxa labyrinthuloides is a heterokont alga known since the last century. It lives on Sphagnum and other water plants as aplanospores or plasmodia. We have investigated the taxonomic position of Chlamydomyxa labyrinthuloides by combining results from morphological studies, pigment analyses and a molecular phylogenetic analysis of the small subunit (SSU) rRNA gene. Chlamydomyxa labyrinthuloides shares morphological features with xanthophytes and chrysophytes, whereas pigment composition indicates a grouping with the phaeophytes, raphidophytes and chrysophytes. The sequence of the SSU rRNA gene and its phylogenetic reconstruction unambiguously demonstrate that Chlamydomyxa labyrinthuloides is related to the chrysophytes.
  325. Vandamme, P., Segers, P., Ryll, M., Hommez, J., Vancanneyt, M., Coopman, R., De Baere, R., et al. (1998). Pelistega europaea gen. nov., sp. nov., a bacterium associated with respiratory disease in pigeons: taxonomic structure and phylogenetic allocation. INTERNATIONAL JOURNAL OF SYSTEMATIC BACTERIOLOGY, 48(2), 431–440.
    Twenty-four strains isolated mainly from infected respiratory tracts of pigeons were characterized by an integrated genotypic and phenotypic approach. An extensive biochemical examination using conventional tests and several API microtest systems indicated that all isolates formed a phenotypically homogeneous taxon with a DNA G+C content between 42 and 43 mol%. Whole-cell protein and fatty acid analysis revealed an unexpected heterogeneity which was confirmed by DNA-DNA hybridizations. Four main genotypic sub-groups (genomovars) were delineated. 16S rDNA sequence analysis of a representative strain indicated that this taxon belongs to the beta-subclass of the Proteobacteria with Taylorella equigenitalis as its closest neighbour (about 94.8 % similarity). A comparison of phenotypic and genotypic characteristics of both taxa suggested that the pigeon isolates represented a novel genus for which the name Pelistega is proposed. In the absence of differential phenotypic characteristics between the genomovars, it was preferred to include all of the isolates into a single species, Pelistega europaea, and strain LMG 10982 was selected as the type strain. The latter strain belongs to fatty acid cluster I and protein electrophoretic sub-group 1, which comprise 13 and 5 isolates, respectively. It is not unlikely that the name P. europaea will be restricted in the future to organisms belonging to fatty acid cluster I, or even to protein electrophoretic sub-group 1, upon discovery of differential diagnostic features.
  326. Zwart, Gabriël, Huismans, R., van Agterveld, M. P., Van de Peer, Y., De Rijk, P., Eenhoorn, H., Muyzer, G., et al. (1998). Divergent members of the bacterial division Verrucomicrobiales in a temperate freshwater lake. FEMS MICROBIOLOGY ECOLOGY, 25(2), 159–169.
  327. Van de Peer, Y., Vancanneyt, M., & De Wachter, R. (1996). Compilation of pseudomonad sequences present in a database on the structure of ribosomal RNA. SYSTEMATIC AND APPLIED MICROBIOLOGY, 19(4), 493–500.
    The ribosomal RNA database in Antwerp (Belgium) offers extensive alignments of both small and large ribosomal subunit RNA (SSU/LSU rRNA) sequences. In July 1996, the SSU rRNA and LSU rRNA sequence alignments comprised respectively about 6400 and 350 sequences. The alignments are based on the secondary structure models adopted for both molecules, which are corroborated by the observation of compensating mutations. Literature references, accession numbers, and detailed taxonomic information are also compiled. Since part of this issue of Systematic and Applied Microbiology is dedicated to the pseudomonads in particular, all SSU rRNA and LSU rRNA sequences determined for these bacteria and available in the Antwerp databases are listed. The complete databases are accessible to the scientific community through anonymous ftp and World Wide Web. Our server also provides software for sequence alignment and phylogenetic tree construction.
  328. Moens, L., Vanfleteren, J., Van de Peer, Y., Peeters, K., Kapp, O., Czeluzniak, J., Goodman, M., et al. (1996). Globins in nonvertebrate species: dispersal by horizontal gene transfer and evolution of the structure-function relationships. MOLECULAR BIOLOGY AND EVOLUTION, 13(2), 324–333.
    Using a new template based on an alignment of 145 nonvertebrate globins we examined several recently determined sequences of putative globins and globin-like hemeproteins. We propose that all globins have evolved from a family of ancestral, approx. 17-kDa hemeproteins, which displayed the globin fold and functioned as redox proteins. Once atmospheric O2 became available the acquisition of oxygen-binding properties was initiated, culminating in the various highly specialized functions known at present. During this evolutionary process, we suggest that (1) high oxygen affinity may have been acquired repeatedly and (2) the formation of chimeric proteins containing both a globin and a flavin binding domain was an additional and distinct evolutionary trend. Furthermore, globin-like hemeproteins encompass hemeproteins produced through convergent evolution from nonglobin ancestral proteins to carry out O2-binding functions as well as hemeproteins whose sequences exhibit the loss of some or all of the structural determinants of the globin fold. We also propose that there occurred two cases of horizontal globin gene transfer, one from an ancestor common to the ciliates Paramecium and Tetrahymena and the green alga Chlamydomonas to a cyanobacterium ancestor and the other, from a eukaryote ancestor of the yeasts Saccharomyces and Candida to a bacterial ancestor of the proteobacterial genera Escherichia, Alcaligenes, and Vitreoscilla.
  329. Vanfleteren, Jacques, Van de Peer, Y., Blaxter, M. L., Tweedie, S. A., Trotman, C., Lu, L., Van Hauwaert, M.-L., et al. (1994). Molecular genealogy of some nematode taxa as based on cytochrome c and globin amino acid sequences. MOLECULAR PHYLOGENETICS AND EVOLUTION, 3(2), 92–101.
  330. Van de Peer, Y., Neefs, J.-M., De Rijk, P., De Vos, P., & De Wachter, R. (1994). About the order of divergence of the major bacterial taxa during evolution. SYSTEMATIC AND APPLIED MICROBIOLOGY, 17(1), 32–38.
    An evolutionary tree, reconstructed from 1232 bacterial small ribosomal subunit RNA sequences by a distance method, reflects the existence of 11 divisions and a number of subdivisions originally recognized by Woese and collaborators. However, the order of divergence that gave rise to these taxa remains indeterminate and the division of Gram positives and relatives does not behave as a monophyletic taxon. Analysis of the data by a novel approach led to a preferred order of divergence for 10 out of 16 tree nodes, but the Gram positives still behaved as biphyletic.
  331. Van Camp, G., Van de Peer, Y., Nicolai, S., Neefs, J., Vandamme, P., & De Wachter, R. (1993). Structure of 16S and 23S ribosomal RNA genes in Campylobacter species: phylogenetic analysis of the genus Campylobacter and presence of internal transcribed spacers. SYSTEMATIC AND APPLIED MICROBIOLOGY, 16(3), 361–368.
  332. Hendriks, L., Goris, A., Van de Peer, Y., Neefs, J., Vancanneyt, M., Kersters, K., Berny, J., et al. (1992). Phylogenetic relationships among Ascomycetes and Ascomycete-like yeasts as deduced from small ribosomal subunit RNA sequences. SYSTEMATIC AND APPLIED MICROBIOLOGY, 15(1), 98–104.
  333. Van de Peer, Y., Hendriks, L., Goris, A., Neefs, J., Vancanneyt, M., Kersters, K., Berny, J., et al. (1992). Evolution of basidiomycetous yeasts as deduced from small ribosomal subunit RNA sequences. SYSTEMATIC AND APPLIED MICROBIOLOGY, 15(2), 250–258.
  334. Hendriks, L., Goris, A., Van de Peer, Y., Neefs, J.-M., Vancanneyt, M., Kersters, K., Hennebert, G. L., et al. (1991). Phylogenetic analysis of five medically important Candida species as deduced on the basis of small ribosomal subunit RNA sequences. JOURNAL OF GENERAL MICROBIOLOGY, 137(5), 1223–1230.
    The classification of species belonging to the genus Candida Berkhout is problematic. Therefore, we have determined the small ribosomal subunit RNA (srRNA) sequences of the type strains of three human pathogenic Candida species: Candida krusei, C. lusitaniae and C. tropicalis. The srRNA sequences were aligned with published eukaryotic srRNA sequences and evolutionary trees were inferred using a matrix optimization method. An evolutionary tree comprising all available eukaryotic srRNA sequences, including two other pathogenic Candida species, C. albicans and C. glabrata, showed that the yeasts diverge rather late in the course of eukaryote evolution, namely at the same depth as green plants, ciliates and some smaller taxa. The cluster of the higher fungi consists of 10 ascomycetes and ascomycete-like species with the first branches leading to Neurospora crassa, Pneumocystis carinii, Candida lusitaniae and C. krusei, in that order. Next there is a dichotomous divergence leading to a group consisting of Torulaspora delbrueckii, Saccharomyces cerevisiae, C. glabrata and Kluyveromyces lactis and a smaller group comprising C. tropicalis and C. albicans. The divergence pattern obtained on the basis of srRNA sequence data is also compared to various other chemotaxonomic data.
  335. Moens, L., Van Hauwaert, M.-L., De Smet, K., Ver Donck, K., Van de Peer, Y., Van Beeumen, J., Wodak, S., et al. (1990). Structural interpretation of the amino acid sequence of a second domain from the Artemia covalent polymer globin. JOURNAL OF BIOLOGICAL CHEMISTRY, 265(24), 14285–14291.