A higher order background model improves the detection by Gibbs sampling of potential promoter regulatory elements in DNA sequences.

Thijs, G., Lescot, M., Rombauts, S., Marchal, K., De Moor, B., Moreau, Y., Rouzé, P.


Transcriptome analysis allows detection and clustering of genes that are coexpressed under various biological circumstances. Under the assumption that coregulated genes share cis-acting regulatory elements, it is important to investigate the upstream sequences controlling the transcription of these genes. To make the Gibbs sampling algorithm for motif finding more robust to noisy data sets, we propose an extension of this algorithm that uses a higher-order background model.

Simulated data and real biological data sets with well-described regulatory elements are used to test the influence of the different background models on the performance of the motif detection algorithm. We show that the use of a higher-order model considerably enhances the performance of our motif finding algorithm in the presence of noisy data. For Arabidopsis thaliana, a reliable background model based on a set of carefully selected intergenic sequences was constructed.
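As an illustration of what a higher-order background model is, the following minimal sketch estimates an order-m Markov model from a set of training sequences (for example, selected intergenic sequences) and uses it to score a sequence. It is not the Motif Sampler implementation: the function names, the default order of 3, the pseudocount, and the toy training sequences are all assumptions made for this example.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"


def train_background(sequences, order=3, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from training sequences.

    Hypothetical helper: `order` and `pseudocount` are illustrative choices.
    """
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0.0))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip windows that contain ambiguous symbols such as N.
            if base in ALPHABET and all(c in ALPHABET for c in context):
                counts[context][base] += 1

    model = {}
    # Enumerate every possible context so unseen contexts still get a row.
    for context in map("".join, product(ALPHABET, repeat=order)):
        row = counts[context]
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        model[context] = {b: (row[b] + pseudocount) / total for b in ALPHABET}
    return model


def background_log_prob(seq, model, order=3):
    """Log-probability of seq[order:] given its first `order` bases.

    Assumes `seq` contains only A, C, G and T.
    """
    seq = seq.upper()
    return sum(
        math.log(model[seq[i - order:i]][seq[i]])
        for i in range(order, len(seq))
    )


# Toy usage with made-up sequences (not data from the paper).
training = ["ACGTACGTACGT", "TTGACGTCAA"]
model = train_background(training, order=2)
print(background_log_prob("ACGTAC", model, order=2))
```

Setting `order=0` reduces the model to single-nucleotide frequencies, which makes it easy to compare a zero-order background with a higher-order one on the same sequences.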

Our implementation of the Gibbs sampler, called the Motif Sampler, can be used through a web interface: http://www.esat.kuleuven.ac.be/~thijs/Work/MotifSampler.html

Supplementary Data

  • G-boxes data set:
    • This set of sequences was extracted from PlantCARE (Rombauts et al., 1999).
    • It contains the upstream regions of genes that have a G-box, a well-conserved, ubiquitous cis-acting regulatory element found in plant genomes and bound by the G-box binding factor (GBF) family of bZIP proteins.
    • The consensus of the G-box is CACGTG.
    • The position of the G-box is well defined in this data set.
    • The set contains 33 sequences of 500 bp each.
  • Light-induced sequences:
    • This data set contains the upstream regions of 28 coexpressed Arabidopsis thaliana genes; coexpression was determined from a microarray experiment (Desprez et al., 1998).
  • Random sequences:
    • This set consists of Arabidopsis thaliana upstream sequences of at least 150 bp.
    • None of these sequences has been described as involved in light regulation, and none contains a known G-box or I-box.
  • Figures
  • Tables

