TREECON for Windows user manual

SUBSTITUTION RATE CALIBRATION

General introduction

However, differences in substitution rate among sites is very common in almost all genes. In the case of ribosomal RNA, the substitution rate was estimated (Van de Peer et al. 1993b, 1996a, c) to vary by a factor of more than 1000 from the least variable to the most variable site. Olsen (1987) already demonstrated that application of the Jukes and Cantor correction to sequences composed of sites with unequal rates can lead to artefacts in tree topology, caused by the underestimation of the distances, and proposed a different evolutionary model that assumed a log-normal distribution of substitution rates over the sequence positions. Jin and Nei (1990) followed a similar approach but assumed that substitution rates were gamma distributed (see above). On the basis of this gamma distribution, which involves a parameter a that describes the extent of the rate variation, they derived several equations to compute the evolutionary distance from the observed sequence dissimilarities. However, the problem with applying gamma distances is the estimation of the gamma parameter a, the accuracy of which depends on the estimation method, tree topology, and on the number of sequences used in the estimation (Rzhetsky and Nei, 1994; Yang, 1996; Sullivan et al., 1996; Tourasse and Gouy, 1997).

One of the main problems in considering among-site rate variation is the quantitative estimation of the substitution rates or variabilities of nucleotide sites. Maximum parsimony estimates can be heavily biased (Wakeley, 1993; Yang, 1996; Tourasse and Gouy, 1997) while maximum likelihood estimates may experience computational difficulties (Yang, 1996). Recently, we developed a method for measuring the relative substitution rate of individual sites in a nucleotide sequence alignment on the basis of a distance approach (Van de Peer et al., 1993b; Van de Peer et al., 1996c). The main advantage of this approach is that nucleotide variability estimates do not depend upon a given tree topology, contrary to estimates inferred from maximum parsimony or maximum likelihood methods (Sullivan et al., 1996; Yang, 1996), and that they can be based on an very large number of sequences. This is important since the more sequences taken into account, the more accurate the substitution rates can be estimated (see also Tourasse and Gouy, 1997). When the substitution rates of the individual nucleotides of the molecule are computed as described previously (Van de Peer et al., 1993b, 1996c), an equation can be derived that better describes the actual relationship between sequence dissimilarity and evolutionary distance:

Substitution rate calibration has shown to be particularly useful in eliminating artificial clustering caused by increased evolutionary rates. In many cases, higher evolutionary rates tend to pull sequences closer to the base of the tree. Moreover, due to the serious underestimation of large evolutionary distances (see figure 2), distant species seem closer to one another than they actually are, which often result in artificial clustering of long branches, the so called ‘long branch attraction phenomenon’.

How does substitution rate calibration works?

For an alignment of N sequences, N · (N-1)/2 pairwise evolutionary distances d are computed according to the equation of Jukes and Cantor (see above).

When all pairwise distances are computed, they are classified into a number of distance intervals, e.g. distances from 0 to 0.005, distances from 0.005 to 0.010, and so on. Next, the fraction of sequence pairs falling within a certain distance interval and characterized by a nucleotide change, is computed. This is done for every nucleotide position. This fraction is then plotted against the mean of the distance interval (Van de Peer et al., 1993). A curve obeying equation

is then fitted to these points by nonlinear regression. This equation expresses the probability pi that an alignment position i contains a different nucleotide in two sequences, as a function of the evolutionary distance d separating these sequences.

see Fig. 3

Actually, the estimated nucleotide substitution rates are not yet optimal, because they are derived on the basis of a distance matrix computed by means of equation of Jukes and Cantor. This equation only gives a first approximation of the relation between dissimilarity and distance, since it starts from the unrealistic assumption that all nucleotides have the same substitution rate. Therefore, after estimation of all v_ivalues, alignment positions are grouped into sets of similar variability. A spectrum of relative nucleotide substitution rates is thus obtained (see Fig. 4). Once the shape of the spectrum is known, it is possible to derive the following equation for the dissimilarity, f, as a function of the evolutionary distance d (Van de Peer et al., 1996 (see appendix)):

The value of parameter p depends on the shape of the substitution rate spectrum. The inverse of this equation,

This iterative process is repeated several times until the nucleotide substitution rates v_ido not change anymore (see Fig. 5). In general, no more than 3 iterations are needed for the changes to become imperceptible.

Nucleotide substitution rate calibration can be rather time-consuming. Rate calibration of a sequence alignment of 100 sequences of 1000 nucleotides and derivation of the final value for the parameter p took about 15 minutes on a Pentium computer running at 90 MHz (with 3 iterations). However, for a specific alignment, the p parameter has to be estimated only once.

A practical example on rate calibration

Drosophila melanogaster

Acetabularia major

Plasmodium

Here’s how to proceed:

Start TREECON and select ‘Distance estimation’ from the main menu.
Choose ‘Substitution rate calibration’ from the distance estimation menu.
Select the number of iterations you want. The default number is 3: this is usually enough for the changes in nucleotide variabilities to become imperceptible.

In the substitution rate calibration method, a new function to convert dissimilarity into distance is derived considering only positions for which a variability can be estimated. Absolutely conserved positions (for which the variability = 0) are thus not considered. Therefore, by default, these positions are also left out of consideration during the actual computation of evolutionary distances. Positions that are deleted in > 50% of the sequences are not considered either.

Next, select the input file. Select the sequences and press the OK button.
During run-time, three small windows will be displayed at the bottom of the screen. One, titled ‘Function display’ shows the graphic representation of the Jukes and Cantor function, and the new functions that are computed during each iteration, taking into account individual nucleotide substitution rates. The second window, titled ‘Variability distribution’ shows the substitution rate spectrum that is used to derive the new function. The third window, titled ‘Sequence variability’ shows the relative substitution rates of the individual nucleotides of the sequence alignment. Nucleotides with a variability or substitution rate of 1 mutate at the average substitution rate of all nucleotides considered. By clicking on the ‘caption bar’ of these small windows, they are activated and enlarged. You can then make them even larger by clicking and dragging the borders.
When rate calibration is finished, the ‘Function display’ window shows the value of parameter p (see above). The program will also give you the name of the new sequence file you should use to construct a tree with. If you have used the default values, this file will be named ‘TEST_CAL.RES’ (the extension of the file name is changed to ‘res’). This file contains the same sequences as the original file, but without absolutely conserved positions and positions that are deleted in > 50% of the sequences (see above).
Trees can now be constructed on the basis of the new function that takes into account the rate distribution inferred from the specific sequence alignment. Select ‘Start distance estimation’ from the ‘Distance estimation’ menu, select the input file (in this case ‘TEST_CAL.RES’), select all the sequences, or a particular selection, and choose ‘Van de Peer & De Wachter (1996)’, as the option to convert dissimilarities into evolutionary distances. The program then asks you to give the specific value for parameter ‘p’. For the ‘TEST_CAL.RES’ file, this will most probably be 0.58.
The rest is exactly the same as in constructing a Jukes and Cantor tree.

Now compare this tree with a tree constructed by using the Jukes and Cantor (or Kimura) correction on the basis of the file ‘TEST_CAL.SEQ’. As you can see, the position of the fastly evolving sequences (long branches) is quite different in both trees.

Files that may contain valuable information are:

‘VARIABIL.OUT’: lists all the variabilities from position 1 to x
‘VARIABIL.ORD’: lists all the variabilities from the smallest to the largest
‘HISTO.OUT’: lists the number of positions within each variability class (= rate spectrum)

Substitution rate calibration should only be applied when sufficient sequences are available, and of sufficient length. Rate calibration based on 20 sequences of about 500 nucleotides is not very reliable. Use at least 35-40 sequences, and with > 1000 positions that are variable. If you want to construct a tree of only 10 sequences with substitution rate calibration, try to estimate nucleotide variabilities on a larger set of sequences. Once the rate spectrum is inferred and the function derived (parameter "p" has been computed), these can be used on a subset of sequences.

Of course, substitution rate calibration is not a ‘miracle method’ that solves all artifacts caused by fastly evolving species. However, in several cases it does give an improved tree topology. When clear differences are observed between trees constructed with rate calibration and trees constructed by standard corrections for multiple mutations, increased evolutionary rates in certain lineages are probably at play.

Applications of the ‘substitution rate calibration’ method can be found in:

Van de Peer, Y., Van der Auwera, G., De Wachter, R. (1996) The evolution of stramenopiles and alveolates as derived by "substitution rate calibration" of small ribosomal subunit RNA. J. Mol. Evol. 42, 201-210.
Van de Peer, Y., Rensing, S., Maier, U.-G., De Wachter, R. (1996) Substitution rate calibration of small ribosomal subunit RNA identifies chlorarachniophyte endosymbionts as remnants of green algae. Proc. Natl. Acad. Sci. USA 93, 7732-7736.
Van de Peer, Y., Chapelle, S., De Wachter, R. (1996) A quantitative map of nucleotide substitution rates in bacterial ribosomal subunit RNA. Nucleic Acids Res. 24, 3381-3391.
Van de Peer, Y., De Wachter, R (1997) Evolutionary relationships among the eukaryotic crown taxa taking into account site to site rate variation in 18S rRNA. J. Mol. Evol. 45, 619-630.

Return