Gene families were constructed as follows:

Homolog filtering was done with one of these three methods:

E-value cut-off (parameter: < 1e-5, < 1e-10, < 1e-20, or < 1e-30)
Rost criterion (Rost, 1999): two proteins are considered homologous when they share > 30% identical residues over an alignable region of ≥ 150 aa. If the alignable region is < 150 aa, a cut-off curve based on homology-derived secondary structure prediction identity is used to determine whether the two sequences are homologous.
Li-Rost criterion (Li, 2001): the same as the Rost criterion, except that the percent identity is recalculated so that it reflects similarity along the entire amino acid sequence rather than over the alignable region only.
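The three filters can be illustrated with a minimal Python sketch. It is not the pipeline that was actually used. The `Hit` record, the function names and the `short_curve` argument are hypothetical. The short-alignment cut-off curve of Rost (1999) is deliberately left as a caller-supplied function, and the rescaling of identity by the fraction of the longer sequence that is aligned is only one plausible reading of the Li-Rost recalculation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hit:
    """One pairwise alignment between two proteins (hypothetical record)."""
    evalue: float
    identical: int     # identical residues within the alignable region
    aligned_len: int   # length of the alignable region, in aa
    len_q: int         # full length of the query protein, in aa
    len_s: int         # full length of the subject protein, in aa


def passes_evalue(hit: Hit, cutoff: float = 1e-10) -> bool:
    """E-value filter: keep the pair if the E-value is below the cut-off."""
    return hit.evalue < cutoff


def region_identity(hit: Hit) -> float:
    """Percent identity over the alignable region only."""
    return 100.0 * hit.identical / hit.aligned_len


def full_length_identity(hit: Hit) -> float:
    """Percent identity rescaled to the entire sequence.

    Assumption: the regional identity is multiplied by the fraction of
    the longer protein that is covered by the alignable region.
    """
    coverage = hit.aligned_len / max(hit.len_q, hit.len_s)
    return region_identity(hit) * coverage


def _rost_rule(identity: float, aligned_len: int,
               short_curve: Callable[[int], float],
               min_identity: float, long_region: int) -> bool:
    # Regions of at least `long_region` aa use the fixed identity threshold;
    # shorter regions use the length-dependent curve supplied by the caller.
    if aligned_len >= long_region:
        return identity > min_identity
    return identity > short_curve(aligned_len)


def passes_rost(hit: Hit, short_curve: Callable[[int], float],
                min_identity: float = 30.0, long_region: int = 150) -> bool:
    """Rost criterion applied to the identity over the alignable region."""
    return _rost_rule(region_identity(hit), hit.aligned_len,
                      short_curve, min_identity, long_region)


def passes_li_rost(hit: Hit, short_curve: Callable[[int], float],
                   min_identity: float = 30.0, long_region: int = 150) -> bool:
    """Li-Rost criterion: same rule, but on the full-length identity."""
    return _rost_rule(full_length_identity(hit), hit.aligned_len,
                      short_curve, min_identity, long_region)
```

Under these assumptions, a 200 aa alignable region with 40% identity between a 200 aa and an 800 aa protein passes the Rost criterion but fails the Li-Rost criterion, because the identity rescaled to the full length drops to 10%. This shows why the Li-Rost criterion is the more stringent of the two.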

The effect of these different methods (with their different parameters) on the distribution of the proteome over a gene family size range is shown in the graph below.
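A distribution of this kind could be tabulated as in the following minimal sketch. The function name and the bin edges are illustrative assumptions, not the binning used for the graph. The input is simply a list with the number of members of each gene family, where singleton proteins count as families of size 1.

```python
from typing import Dict, Sequence


def proteome_fraction_by_family_size(
    family_sizes: Sequence[int],
    bin_edges: Sequence[int] = (1, 2, 6, 21, 101),  # hypothetical lower edges
) -> Dict[str, float]:
    """Fraction of all proteins that fall in families of each size class."""
    total = sum(family_sizes)
    if total == 0:
        raise ValueError("family_sizes must contain at least one protein")

    fractions: Dict[str, float] = {}
    for i, lo in enumerate(bin_edges):
        # The upper bound is one below the next edge; the last bin is open-ended.
        hi = bin_edges[i + 1] - 1 if i + 1 < len(bin_edges) else None
        if hi is None:
            label = f">={lo}"
        elif hi == lo:
            label = str(lo)
        else:
            label = f"{lo}-{hi}"
        # Count proteins (not families) whose family size lies in this bin.
        proteins = sum(s for s in family_sizes
                       if s >= lo and (hi is None or s <= hi))
        fractions[label] = proteins / total
    return fractions
```

Running this once per filtering method and parameter setting gives comparable profiles. A lenient setting that merges many proteins into one very large family shows up as a high fraction in the open-ended last bin.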

Stringent conditions (i.e. the Li-Rost criterion) were found to be necessary to construct reliable gene families across genomes: an E-value cut-off of 1e-10, for example, resulted in one gene family of more than 80,000 members, i.e. 30% of our data set. We chose the Li-Rost criterion (with the parameter set to > 30% identical residues) based on this distribution graph and an additional manual check of a number of specific gene families. In the combined bacterial proteome, excluding phage- and transposon-related ORFs, a total of 27,914 gene families were found, encompassing 243,910 proteins (81%).

References:

Li, W.H. et al. (2001) Evolutionary analyses of the human genome. Nature 409, 847-849

Rost, B. (1999) Twilight zone of protein sequence alignments. Protein Eng 12, 85-94


Contact:
VIB / UGent
Bioinformatics & Evolutionary Genomics
Technologiepark 927
B-9052 Gent
BELGIUM
+32 (0) 9 33 13807 (phone)
+32 (0) 9 33 13809 (fax)