Annotation of a 95-kb Populus deltoides genomic sequence reveals a disease resistance gene cluster and novel class I and class II transposable elements.

Rombauts, S., Lescot, M., Zhang, J., Aubourg, S., Mathé, C., Jansson, S., Rouzé, P., Boerjan, W.

Corresponding author:

Abstract

Poplar has become a model system for functional genomics in woody plants. Here, we report the sequencing and annotation of the first large contiguous stretch of genomic sequence (95 kb) of poplar, corresponding to a bacterial artificial chromosome clone mapped 0.6 centiMorgan from the Melampsora larici-populina resistance locus. The annotation revealed 15 putative genetic objects, of which five were classified as hypothetical genes that were similar only to expressed sequence tags from poplar. Ten putative objects showed similarity to known genes, of which one was similar to a kinase. Three other objects corresponded to the toll/interleukin-1 receptor/nucleotide-binding site/leucine-rich repeat class of plant disease resistance genes, of which two were predicted to encode an amino-terminal nuclear localization signal. Four objects were homologous to the Ty1/copia family of class I transposable elements, one of which was designated Retropop and interrupted one of the disease resistance genes. Two other objects constituted a novel Spm-like class II transposable element, which we designated Magali.

Supplementary Data

The test data set
The sequences were retrieved from the EMBL database with SRS (Etzold et al. 1996) by searching annotated sequences of Populus. In order to test the prediction quality of the first and last exons, only sequences containing complete genes were retrieved. Duplicated genes present on one and the same sequence were removed. A total of 24 genes were present in 22 sequences (AB049200, AF016893, AF052570, AF057708, AJ132262, AJ223620, AJ295838, D11102, D30656, D43802, D49710, D82812, D83225, D83227, L11233, U01661, U13171, U93196, X15516, X59995, X70064, and Y18218).

Be aware that the number of sequences is too small to produce statistically significant results. The results of the evaluation of the prediction software are merely indicative and should not be read as anything more.

Evaluation of the prediction programs
In order to select the best prediction program for the annotation of Populus sequences, different programs were first evaluated on poplar genomic sequences that are present in the public databases and contain a complete CDS. Here, CDS corresponds to a gene having an exon-intron structure. Only a limited number of Populus genes, belonging to different poplar species, was available. The data set contains 22 genomic sequences with a total of 24 genes, representing 91 exons and 67 introns. The average sizes of genes, exons, and introns were 2.0 kb, 273 bp, and 361 bp, respectively. In this data set, intergenic sequences represented 45.6%. The programs GeneMark.hmm, EuGene, FgenesH, and GlimmerM were run on this test set of Populus sequences. For FgenesH, GeneMark.hmm, and GlimmerM, both the dicot and the monocot versions were evaluated. Thus, a total of seven different gene prediction results were compared. The evaluation tools and methods used were those described by Pavy et al. (1999), which can be applied to any eukaryotic genome and to any prediction program by using a test set as described above. The accuracy of the prediction programs was studied at the nucleotide, the exon, and the gene levels. For all the programs tested, the basic accuracy measures were calculated according to Burset and Guigó (1996). Surprisingly, FgenesH for monocots appeared to be the most efficient, as can be seen in the analysis reported below.
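The data set figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check, not part of the original evaluation; it assumes that "gene size" means the summed length of a gene's exons and introns, which the text does not state explicitly.

```python
# Figures quoted in the description of the test set.
n_genes = 24
n_exons = 91
n_introns = 67
mean_exon_bp = 273
mean_intron_bp = 361

# A gene with k exons has k - 1 introns, so over the whole set
# the intron count should equal exons minus genes.
expected_introns = n_exons - n_genes

# Assumption: gene size = summed exon length + summed intron length.
total_genic_bp = n_exons * mean_exon_bp + n_introns * mean_intron_bp
mean_gene_kb = total_genic_bp / n_genes / 1000

print(expected_introns)
print(round(mean_gene_kb, 1))
```

Under that assumption the script prints 67 and 2.0, in agreement with the 67 introns and the 2.0 kb average gene size reported for the test set.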

Evaluation at the nucleotide level
The sensitivity and specificity of the different prediction programs are presented in Table 1. FgenesH for dicots appeared to be the most sensitive program (0.98), whereas GeneMark.hmm for rice showed the best specificity (0.98). Considering the correlation coefficient, FgenesH for monocots performed best, with FgenesH for dicots, GeneMark.hmm for Arabidopsis, and EuGene performing only slightly worse.
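For readers unfamiliar with the nucleotide-level measures, the sketch below shows how sensitivity, specificity, and the correlation coefficient are commonly computed following the definitions of Burset and Guigó (1996). It is an illustration only and not the evaluation tool of Pavy et al. (1999); the function name and the toy coding masks are hypothetical. Note that in this literature "specificity" is TP / (TP + FP), the fraction of predicted coding nucleotides that are truly coding.

```python
from math import sqrt


def nucleotide_accuracy(annotated, predicted):
    """Return (Sn, Sp, CC) for two equal-length coding masks.

    annotated, predicted: sequences of 0/1 flags, one per nucleotide,
    where 1 means the position is coding. (Hypothetical input format.)
    """
    if len(annotated) != len(predicted):
        raise ValueError("masks must have the same length")

    tp = fp = tn = fn = 0
    for a, p in zip(annotated, predicted):
        if a and p:
            tp += 1      # coding, predicted coding
        elif p:
            fp += 1      # non-coding, predicted coding
        elif a:
            fn += 1      # coding, predicted non-coding
        else:
            tn += 1      # non-coding, predicted non-coding

    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    denom = sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom if denom else 0.0
    return sn, sp, cc


# Toy example: the prediction is shifted by one base relative to the annotation.
annotated = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
predicted = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
sn, sp, cc = nucleotide_accuracy(annotated, predicted)
print(sn, sp, round(cc, 3))
```

In the toy example three of the four coding bases are recovered and three of the four predicted bases are correct, so both Sn and Sp are 0.75, while the correlation coefficient (about 0.583) is lower because it also accounts for the non-coding positions.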

Evaluation at the exon structure level
The results of the evaluation revealed that FgenesH for dicots and FgenesH for monocots were the most accurate programs, with sensitivities of 0.86 and 0.82, respectively. The specificity was highest for FgenesH for monocots. In addition, like GeneMark.hmm (Arabidopsis and rice) and EuGene, FgenesH for monocots never merges exons. EuGene and GeneMark.hmm for Arabidopsis are close behind FgenesH as exon predictors.
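The exon-level measures can be sketched in the same way. In the usual Burset and Guigó (1996) scheme, a predicted exon counts as correct only when both of its boundaries match an annotated exon exactly; sensitivity is the fraction of annotated exons that are correctly predicted and specificity the fraction of predicted exons that are correct. The code below is an illustration under these definitions, not the tool actually used here. The function names, the inclusive (start, end) coordinate convention, and the rule that a "merged" prediction is one overlapping two or more annotated exons are assumptions made for this example.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_accuracy(annotated, predicted):
    """Return (Sn, Sp, merged) for lists of inclusive (start, end) exons.

    merged counts predicted exons that overlap two or more annotated exons
    (an assumed working definition of exon merging).
    """
    annotated_set = set(annotated)
    predicted_set = set(predicted)

    # Exact match of both boundaries is required for a correct exon.
    correct = annotated_set & predicted_set

    sn = len(correct) / len(annotated_set) if annotated_set else 0.0
    sp = len(correct) / len(predicted_set) if predicted_set else 0.0

    merged = sum(
        1
        for p in predicted_set
        if sum(1 for a in annotated_set if overlaps(p, a)) >= 2
    )
    return sn, sp, merged


# Toy example: the second and third annotated exons are fused by the predictor.
annotated_exons = [(100, 200), (300, 400), (500, 600)]
predicted_exons = [(100, 200), (300, 600)]
exon_sn, exon_sp, n_merged = exon_accuracy(annotated_exons, predicted_exons)
print(round(exon_sn, 2), exon_sp, n_merged)
```

Here one of three annotated exons is predicted exactly (Sn about 0.33), one of two predictions is correct (Sp 0.5), and one prediction is flagged as a merge.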

Contact:
VIB / UGent
Bioinformatics & Evolutionary Genomics
Technologiepark 927
B-9052 Gent
BELGIUM
+32 (0) 9 33 13807 (phone)
+32 (0) 9 33 13809 (fax)