Computational approaches to identify promoters and cis-regulatory elements in plant genomes.
Rombauts, S., Florquin, K., Lescot, M., Marchal, K., Rouzé, P., Van de Peer, Y.
Abstract
The identification of promoters and their regulatory elements is one of the major challenges in bioinformatics and integrates comparative, structural, and functional genomics. Many different approaches have been developed to detect conserved motifs in a set of genes that are either coregulated or orthologous. However, although recent approaches seem promising, unambiguous identification of regulatory elements is generally not straightforward. The delineation of promoters is even harder because of their complex nature, and in silico promoter prediction is still in its infancy. Here, we review the different approaches that have been developed for identifying promoters and their regulatory elements. We discuss the detection of cis-acting regulatory elements using word-counting or probabilistic methods (so-called "search by signal" methods) and the delineation of promoters by considering both sequence content and structural features ("search by content" methods). As an example of search by content, we explored in greater detail the association of promoters with CpG islands. However, due to differences in sequence content, the parameters used to detect CpG islands in humans and other vertebrates cannot be used for plants. Therefore, a preliminary attempt was made to define parameters that could possibly define CpG and CpNpG islands in Arabidopsis, by exploring the compositional landscape around the transcriptional start site. To this end, a data set of more than 5,000 gene sequences was built, including the promoter region, the 5'-untranslated region, and the first introns and coding exons. Preliminary analysis shows that promoter location based on the detection of potential CpG/CpNpG islands in the Arabidopsis genome is not straightforward. Nevertheless, because the landscape of CpG/CpNpG islands differs considerably between promoters and introns on the one side and exons (whether coding or not) on the other, more sophisticated approaches can probably be developed for the successful detection of "putative" CpG and CpNpG islands in plants.
Table I. Promoter prediction programs
Table II. Motif search programs
Table III. Motif prediction programs
As an example of search by content, we explored in greater detail the association of promoters with CpG islands. However, due to differences in sequence content, the parameters used to detect CpG islands in humans and other vertebrates cannot be used for plants. The original pragmatic definition of a CpG island in human sequences requires a GC% higher than 50 and a ratio between observed and expected (o/e) occurrence of CG dinucleotides of 0.6 over a window of 200 bp (Gardiner-Garden and Frommer, 1987). More recently, these parameters were tightened to a GC% >55, an o/e CpG >0.65, and a window size of 500 bp, because the previous parameters were found to overestimate (50-fold) the number of potential CpG islands (Takai and Jones, 2002). In animals, approximately 40% of genes are expected to be associated with CpG islands (Gardiner-Garden and Frommer, 1987; Antequera and Bird, 1999). This percentage might in fact be too low, because a total of 29,000 CpG islands was estimated after the completion of the human genome sequence (Venter et al., 2001).
With the above-mentioned parameters, no CpG islands are discovered in plants (Takai and Jones, 2002; our results). However, DNA methylation occurs in plants, and DNA methylases are even more numerous and diverse in plants than in animals (Finnegan and Kovac, 2000; Cao and Jacobsen, 2002). Therefore, we attempted to define the parameters that could possibly specify CpG and CpNpG islands in Arabidopsis by exploring the compositional landscape around the TSS. To this end, we built a data set of 5,025 gene sequences, designated ARAPROM, by aligning the full-length cDNA sequences generated by Seki et al. (2002) against the genomic sequence (Arabidopsis Genome Initiative, 2000). [Generally, these sequences are 2.5 kb long, of which 2 kb represent intergenic sequence upstream of the translation start codon and 500 bp are taken downstream. However, when the upstream neighboring gene lies closer than 2 kb, only the intergenic sequence is kept, up to the predicted coding boundary of the upstream gene. The genomic sequences in the ARAPROM data set include the promoter region, the 5'-UTR, and the first introns and coding exons of each individual gene.]
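The extraction rule described above (2 kb upstream of the translation start codon, 500 bp downstream, truncated where a neighboring gene lies closer than 2 kb) can be illustrated with a minimal sketch. It is not the procedure actually used to build ARAPROM: the function name `region_bounds`, the 0-based half-open coordinates, and the restriction to plus-strand genes are all assumptions made for illustration.

```python
def region_bounds(start_codon, upstream_gene_end=None,
                  upstream=2000, downstream=500):
    """Return (begin, end) of the region to extract around a start codon.

    Assumptions (not from the source): coordinates are 0-based and
    half-open, the gene lies on the plus strand, and `upstream_gene_end`
    is the first position after the predicted coding boundary of the
    upstream neighboring gene (None if that gene is unknown).
    """
    begin = max(start_codon - upstream, 0)
    if upstream_gene_end is not None:
        # Neighbor closer than `upstream` bp: keep only intergenic sequence.
        begin = max(begin, upstream_gene_end)
    end = start_codon + downstream
    return begin, end


# Full-length case: 2 kb upstream plus 500 bp downstream.
full = region_bounds(10_000)
# Truncated case: upstream gene ends only 800 bp before the start codon.
clipped = region_bounds(10_000, upstream_gene_end=9_200)
```

Under these assumptions a full-length entry spans 2.5 kb, whereas a gene with a close upstream neighbor yields a shorter entry, which is why the sequences are only "generally" 2.5 kb long.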
Preliminary analysis shows that promoter location based on the detection of potential CpG/CpNpG islands in the Arabidopsis genome is not straightforward. Nevertheless, because the landscape of CpG/CpNpG islands differs considerably between promoters and introns on the one side and exons (whether coding or not) on the other, more sophisticated approaches can probably be developed for the successful detection of "putative" CpG and CpNpG islands in plants.
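To make the window criteria concrete, the following Python sketch scans a sequence with a sliding window and reports windows that pass a GC% threshold and an o/e threshold. It uses the standard o/e ratio, (number of CpG × window length) / (number of C × number of G). Extending the same ratio to CpNpG through a `gap` parameter is an assumption made by analogy, not the authors' stated method, and all function names are hypothetical. Both thresholds are applied as strict inequalities, following the wording of the text.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def obs_exp(seq, gap=0):
    """Observed/expected ratio for C followed by G after `gap` bases.

    gap=0 gives the usual CpG o/e; gap=1 is an assumed analogue for CpNpG.
    """
    seq = seq.upper()
    n, c, g = len(seq), seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    observed = sum(
        1 for i in range(n - gap - 1)
        if seq[i] == "C" and seq[i + gap + 1] == "G"
    )
    return observed * n / (c * g)


def island_windows(seq, window=200, min_gc=0.50, min_oe=0.60, gap=0, step=1):
    """Start positions of windows exceeding both thresholds.

    Defaults follow the 200-bp, GC% > 50, o/e > 0.6 definition quoted in
    the text; pass window=500, min_gc=0.55, min_oe=0.65 for the stricter set.
    """
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        if gc_fraction(w) > min_gc and obs_exp(w, gap) > min_oe:
            hits.append(start)
    return hits


cpg_rich = "CG" * 100   # artificial 200-bp sequence, every C followed by G
at_only = "AT" * 100    # artificial 200-bp sequence without C or G
```

Varying `window`, `min_gc`, `min_oe`, and `gap` in such a scan is one simple way to explore alternative parameter settings; it is meant only to illustrate the definitions, not to reproduce the reported analysis.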
Table II. Percentage of genes (out of 5,025) containing CpG (top) or CpNpG (bottom) islands, for a few different parameter settings
Graphical Representation of the results
VIB / UGent
Bioinformatics & Evolutionary Genomics