TREECON for Windows user manual

DISTANCE ESTIMATION FOR NUCLEIC ACID SEQUENCES

General introduction and Jukes and Cantor model

Jukes and Cantor

Fig. 1. f_AB: dissimilarity (fraction of observed differences); d_AB: estimated evolutionary distance (fraction of expected substitutions).

An important drawback of most of these models is that they do not consider differences in substitution rate among the sites of a molecule (see further).

The substitution model of Jukes and Cantor, also called the one-parameter model, is the simplest one available for estimating the number of nucleotide substitutions per site and is probably still the most widely used.

When the fraction of differences f between two sequences exceeds 0.75, the distance cannot be computed, because the Jukes and Cantor correction would then require the logarithm of a negative number. In that case TREECON crashes with a log-domain error. When this happens, check your alignment and/or construct the tree from the more conserved parts of the alignment only.
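The limit at 0.75 can be illustrated with a minimal Python sketch. It assumes the usual textbook form of the Jukes and Cantor correction, d = -(3/4) ln(1 - (4/3) f); the function name is hypothetical and this is not TREECON's own code.

```python
import math

def jukes_cantor_distance(f):
    """Jukes and Cantor distance from the fraction of observed differences f.

    Assumes the textbook form d = -(3/4) * ln(1 - (4/3) * f).
    """
    arg = 1.0 - 4.0 * f / 3.0
    if arg <= 0.0:
        # f >= 0.75: the logarithm is undefined, so no distance can be estimated
        raise ValueError("fraction of differences too large for Jukes and Cantor")
    return -0.75 * math.log(arg)

print(round(jukes_cantor_distance(0.10), 4))
```

Raising an explicit error for large f is a design choice of this sketch; it replaces the log-domain crash with a message that points at the alignment.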

Kimura - two parameter model

Kimura (1980) provided a method for inferring evolutionary distance in which transitions and transversions are treated separately:

where P is the fraction of sequence positions differing by a transition and Q is the fraction of sequence positions differing by a transversion.
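As an illustration, Kimura's two-parameter distance is commonly written as d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q). The Python sketch below assumes that form; the function name is hypothetical and the code is not taken from TREECON.

```python
import math

def kimura_2p_distance(p, q):
    """Kimura two-parameter distance.

    p: fraction of positions differing by a transition
    q: fraction of positions differing by a transversion
    Assumes d = -(1/2) ln(1 - 2p - q) - (1/4) ln(1 - 2q).
    """
    a = 1.0 - 2.0 * p - q
    b = 1.0 - 2.0 * q
    if a <= 0.0 or b <= 0.0:
        # the logarithms are undefined for such divergent sequences
        raise ValueError("sequences too divergent for the Kimura correction")
    return -0.5 * math.log(a) - 0.25 * math.log(b)

print(round(kimura_2p_distance(0.10, 0.05), 4))
```

When transitions make up exactly one third of the differences (P = f/3, Q = 2f/3), this expression reduces to the Jukes and Cantor distance for f = P + Q.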

Tajima & Nei

In the general correction of Tajima and Nei (1984), the evolutionary distance is estimated by:

where

and f_i is the frequency of the i-th type of nucleotide in the set of possible nucleotide types N (A, G, C, and U or T) in the sequences being compared. This equation holds for a model with equal substitution rates between different nucleotides; it does NOT take into account unequal rates of substitution among different nucleotide pairs (Tajima and Nei, 1984). In TREECON, the base composition used is the average over all the sequences analyzed (as suggested in Swofford et al., 1996). If the frequencies are 0.25 for all four nucleotides, this equation reduces to that of Jukes and Cantor.
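The link with Jukes and Cantor can be checked numerically. The sketch below assumes the equal-rates form d = -b ln(1 - f/b) with b = 1 - sum of f_i squared, which matches the description above; the function name is hypothetical.

```python
import math

def tajima_nei_distance(f, base_freqs):
    """Equal-rates Tajima and Nei distance (assumed form).

    f: fraction of observed differences
    base_freqs: average frequencies of the four nucleotides
    Assumes d = -b * ln(1 - f / b), with b = 1 - sum(f_i ** 2).
    """
    b = 1.0 - sum(x * x for x in base_freqs)
    arg = 1.0 - f / b
    if arg <= 0.0:
        raise ValueError("fraction of differences too large for this correction")
    return -b * math.log(arg)

# With equal base frequencies b = 3/4, which is the Jukes and Cantor case.
equal = tajima_nei_distance(0.10, [0.25, 0.25, 0.25, 0.25])
skewed = tajima_nei_distance(0.10, [0.40, 0.30, 0.20, 0.10])
print(round(equal, 4), round(skewed, 4))
```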

Gamma distances

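For the one-parameter model of Jukes and Cantor, with substitution rates that vary among sites according to a gamma distribution with shape parameter a, the distance is computed as follows (Jin and Nei, 1990):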

for a=1:

for a=2:

for a=1/2:

A more general equation has been given by Rzhetsky and Nei
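For illustration, the gamma-corrected Jukes and Cantor distance is usually written for any a as d = (3/4) a [(1 - (4/3) f)^(-1/a) - 1]. The Python sketch below assumes that general form; it is not TREECON's own code, and the function name is hypothetical.

```python
def gamma_jukes_cantor_distance(f, a):
    """Gamma-corrected Jukes and Cantor distance (assumed general form).

    f: fraction of observed differences
    a: shape parameter of the gamma distribution (a > 0)
    Assumes d = (3/4) * a * ((1 - 4f/3) ** (-1/a) - 1).
    """
    base = 1.0 - 4.0 * f / 3.0
    if base <= 0.0 or a <= 0.0:
        raise ValueError("distance undefined for these values of f and a")
    return 0.75 * a * (base ** (-1.0 / a) - 1.0)

# The three shape parameters listed in this section.
for a in (1.0, 2.0, 0.5):
    print(a, round(gamma_jukes_cantor_distance(0.30, a), 4))
```

Smaller values of a mean stronger rate variation among sites and give a larger estimated distance for the same f.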

For Kimura's two-parameter model, the distance is computed as follows:

for a=1:

for a=2:

for a=1/2:

A more general equation has been given by Nei
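As a sketch, the gamma-corrected Kimura distance is commonly written for any a as d = (a/2) [(1 - 2P - Q)^(-1/a) + (1/2)(1 - 2Q)^(-1/a) - 3/2]. The code below assumes that form; the function name is hypothetical.

```python
def gamma_kimura_2p_distance(p, q, a):
    """Gamma-corrected Kimura two-parameter distance (assumed general form).

    p: fraction of transitional differences
    q: fraction of transversional differences
    a: shape parameter of the gamma distribution (a > 0)
    """
    x = 1.0 - 2.0 * p - q
    y = 1.0 - 2.0 * q
    if x <= 0.0 or y <= 0.0 or a <= 0.0:
        raise ValueError("distance undefined for these values of p, q and a")
    return 0.5 * a * (x ** (-1.0 / a) + 0.5 * y ** (-1.0 / a) - 1.5)

for a in (1.0, 2.0, 0.5):
    print(a, round(gamma_kimura_2p_distance(0.10, 0.20, a), 4))
```

With P = 0.10 and Q = 0.20 both terms in brackets share the same base (0.6), so the result coincides with the gamma-corrected Jukes and Cantor distance for f = 0.30.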

Galtier & Gouy

The transition/transversion ratio is assumed to be the same in all lineages. It is estimated once, from the whole dataset. In the substitution model of Galtier and Gouy, the ratio between the sums of transition and of transversion rates equals a/2. This ratio is estimated for each sequence pair (A, B):

The estimate of parameter a is the mean of the a(A,B) values over all sequence pairs, and this estimate is used for all pairwise distance computations. In TREECON, the transition/transversion ratio is estimated before the actual computation of the evolutionary distances starts. Since this value is based on all pairwise comparisons, this in fact doubles the time needed to estimate the distances between sequences. In the bootstrap analysis, the transition/transversion ratio is computed only once, from the actual (non-resampled) set of sequences.

The evolutionary distance is estimated as follows:

with

and

where theta1 is the G+C content of sequence 1 and theta2 is the G+C content of sequence 2.

Transversion analysis

Sometimes, it can be interesting to estimate the evolutionary distance on the basis of transversions only (see e.g. Woese et al., 1991; Van de Peer et al., 1996b). The evolutionary distance is then estimated by (Tajima and Nei, 1984; Swofford et al., 1996):

where Q is the fraction of transversions and

No correction

It is also possible not to correct for superimposed mutations; the uncorrected fraction of observed differences is the so-called p-distance.
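A minimal sketch of the p-distance for two aligned sequences follows. Skipping every column that contains a gap is an assumption of this sketch, not a documented TREECON rule, and the function name is hypothetical.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Uncorrected fraction of differing positions between two aligned sequences.

    Columns with a gap in either sequence are ignored (an assumption).
    """
    compared = 0
    differing = 0
    for x, y in zip(seq_a, seq_b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return differing / compared

print(p_distance("AGCUGG", "AGCUAG"))  # one difference in six positions
```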

DISTANCE ESTIMATION FOR AMINO ACID SEQUENCES

Overall, the computation of evolutionary distances for amino acid sequences is similar to the computation of distances for nucleic acid sequences.

Poisson correction

The distance between two amino acid sequences is computed starting from the assumption that the rate of amino acid substitution at each site follows the Poisson distribution (e.g. Zuckerkandl and Pauling, 1965; Dickerson, 1971):

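d_AB = -ln(1 - f_AB), where f_AB is the fraction of amino acid positions that differ between sequences A and B.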

Kimura

This formula (Kimura, 1983) should be a good approximation to the Dayhoff model (Hasegawa and Fujiwara, 1993):
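For comparison, the sketch below places the Poisson correction next to Kimura's empirical formula. It assumes the usual forms d = -ln(1 - p) and d = -ln(1 - p - 0.2 p^2), where p is the fraction of differing amino acid positions; the function names are hypothetical.

```python
import math

def poisson_distance(p):
    """Poisson-corrected amino acid distance, assuming d = -ln(1 - p)."""
    if p >= 1.0:
        raise ValueError("p must be below 1")
    return -math.log(1.0 - p)

def kimura_protein_distance(p):
    """Kimura (1983) amino acid distance, assuming d = -ln(1 - p - 0.2 * p**2)."""
    arg = 1.0 - p - 0.2 * p * p
    if arg <= 0.0:
        raise ValueError("p too large for the Kimura formula")
    return -math.log(arg)

for p in (0.1, 0.2, 0.5):
    print(p, round(poisson_distance(p), 4), round(kimura_protein_distance(p), 4))
```

The extra quadratic term makes the Kimura estimate larger than the Poisson estimate, and the gap widens as p grows.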

Tajima & Nei

The evolutionary distance is calculated as:

where b=0.95 (i.e. 19/20) for amino acid replacements (Tajima and Nei, 1984).

Gamma distances

When one starts from the assumption that the rate of amino acid substitutions varies from site to site and follows the gamma distribution, the evolutionary distance is computed as (Kumar et al., 1993; Ota and Nei, 1994):
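A short sketch of this gamma distance follows, assuming the commonly cited form d = a [(1 - p)^(-1/a) - 1], where p is the fraction of differing amino acid positions and a is the gamma shape parameter. The function name is hypothetical.

```python
def gamma_protein_distance(p, a):
    """Gamma-corrected amino acid distance (assumed form).

    p: fraction of differing amino acid positions (p < 1)
    a: shape parameter of the gamma distribution (a > 0)
    Assumes d = a * ((1 - p) ** (-1/a) - 1).
    """
    if p >= 1.0 or a <= 0.0:
        raise ValueError("distance undefined for these values of p and a")
    return a * ((1.0 - p) ** (-1.0 / a) - 1.0)

for a in (0.5, 1.0, 2.0):
    print(a, round(gamma_protein_distance(0.2, a), 4))
```

For very large a the rate variation vanishes and the value approaches the Poisson-corrected distance -ln(1 - p).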

GENETIC DISTANCE ESTIMATION FOR PATTERN ANALYSIS

At the moment, two methods to compute the genetic distance for pattern analyses such as RFLP, RAPD, and AFLP are implemented.

Method 1 (Nei and Li, 1979)

The genetic distance is computed as follows:

GDxy = 1 - (2*Nxy)/(Nx+Ny)

where Nxy is the number of fragments (bands) shared in lines x and y, and Nx is the number of fragments in line x, and Ny is the number of fragments in line y.

Example
sample 1 1010100011             sample 1 1110011000
sample 2 1010111100             sample 2 1110000001
Nx=5                            Nx=5
Ny=6                            Ny=4
Nxy=3                           Nxy=3
GDxy=1-(2*3)/(5+6)=0.455        GDyx=1-(2*3)/(5+4)=0.333
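The two examples above can be reproduced with a few lines of Python. The formula is the one used in the examples, GD = 1 - 2*Nxy/(Nx+Ny); the function name and the representation of band patterns as strings of 0 and 1 are assumptions of this sketch.

```python
def nei_li_distance(x, y):
    """Nei and Li genetic distance between two band patterns.

    x, y: strings of '0' and '1' of equal length, '1' meaning band present.
    """
    nx = x.count("1")                      # bands in line x
    ny = y.count("1")                      # bands in line y
    nxy = sum(1 for a, b in zip(x, y) if a == "1" and b == "1")  # shared bands
    if nx + ny == 0:
        raise ValueError("neither pattern contains a band")
    return 1.0 - 2.0 * nxy / (nx + ny)

print(round(nei_li_distance("1010100011", "1010111100"), 3))  # first example
print(round(nei_li_distance("1110011000", "1110000001"), 3))  # second example
```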

Method 2 (Link et al., 1995)

The genetic distance is computed as follows:

GDxy = (Nx+Ny)/(Nx+Ny+Nxy)

where Nx is the number of bands in line x and not in line y, Ny is the number of bands in line y and not in line x, and Nxy is the number of bands shared in lines x and y.

Example
sample 1 1010100011             sample 1 1110011000
sample 2 1010111100             sample 2 1110000001
Nx=2                            Nx=2
Ny=3                            Ny=1
Nxy=3                           Nxy=3
GDxy=(2+3)/(2+3+3)=5/8=0.625    GDyx=(2+1)/(2+1+3)=3/6=0.5
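The same two band patterns can be run through a small Python sketch of this second method. The formula follows the examples above, GD = (Nx+Ny)/(Nx+Ny+Nxy); the function name and the string representation of band patterns are assumptions.

```python
def link_distance(x, y):
    """Link et al. genetic distance between two band patterns.

    x, y: strings of '0' and '1' of equal length, '1' meaning band present.
    """
    pairs = list(zip(x, y))
    only_x = sum(1 for a, b in pairs if a == "1" and b == "0")  # bands in x, not in y
    only_y = sum(1 for a, b in pairs if a == "0" and b == "1")  # bands in y, not in x
    shared = sum(1 for a, b in pairs if a == "1" and b == "1")  # bands in both
    total = only_x + only_y + shared
    if total == 0:
        raise ValueError("neither pattern contains a band")
    return (only_x + only_y) / total

print(link_distance("1010100011", "1010111100"))  # first example
print(link_distance("1110011000", "1110000001"))  # second example
```

Unlike the method of Nei and Li, shared bands are counted once rather than twice here, so this measure gives a larger distance for the same pair of patterns.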

INSERTIONS AND DELETIONS

In TREECON, the user can choose whether to take insertions and deletions (indels) into account or not. If they are, they are counted separately since the correction for superimposed events does not hold for insertions and deletions. For example, in TREECON, the evolutionary distance according to the Jukes and Cantor model is then computed as (Van de Peer et al., 1990):

A G C U G G - - - G C N U A
- - C U A G A A A G A C U A

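would give the values I=6 (identical positions), S=2 (substitutions), G=2 (gaps, each counted once whatever its length), and T=10 (the sum of I, S, and G). The position containing N is not counted.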
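The counts for this alignment can be reproduced with the sketch below. It assumes that a run of adjacent gap symbols in one sequence counts as a single gap, that positions with an ambiguous symbol such as N are skipped, and that T is the sum of the other three counts. These rules are inferred from the example, and the function name is hypothetical.

```python
def count_alignment_events(seq_a, seq_b, gap="-", bases="ACGTU"):
    """Return (identical, substitutions, gaps, total) for two aligned sequences."""
    identical = substitutions = gaps = 0
    prev_gap_owner = None  # which sequence carried the gap in the previous column
    for x, y in zip(seq_a, seq_b):
        if x == gap and y == gap:
            prev_gap_owner = None
            continue
        if x == gap or y == gap:
            owner = "a" if x == gap else "b"
            if owner != prev_gap_owner:
                gaps += 1          # a new gap starts; its length is ignored
            prev_gap_owner = owner
            continue
        prev_gap_owner = None
        if x not in bases or y not in bases:
            continue               # ambiguous symbol, position not counted
        if x == y:
            identical += 1
        else:
            substitutions += 1
    return identical, substitutions, gaps, identical + substitutions + gaps

a = "A G C U G G - - - G C N U A".split()
b = "- - C U A G A A A G A C U A".split()
print(count_alignment_events(a, b))
```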

HOW TO CHECK DISTANCE ESTIMATES

If one wants to know the evolutionary distance between two sequences, this is how to interpret the distance matrix (5 sequences were compared):

Tree construction                            (title)
   5    0    0    2    0                     (first number = number of sequences; the rest is not important)
(5f8.3)                                      (not important)
   0.000  19.534  24.304  24.593  25.280     (sequence 1)
  19.534   0.000  26.837  25.016  24.728     (sequence 2)
  24.304  26.837   0.000   9.079  20.176     (sequence 3)
  24.593  25.016   9.079   0.000  19.053     (sequence 4)
  25.280  24.728  20.176  19.053   0.000     (sequence 5)
 (seq 1) (seq 2) (seq 3) (seq 4) (seq 5)
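Looking up one value in such a matrix can be sketched in Python as follows. The layout assumed here (a title line, a line whose first number is the number of sequences, a format line, then one row per sequence) is inferred from the example; the row for sequence 1 follows from the first column by symmetry, and the function names are hypothetical.

```python
MATRIX_TEXT = """\
Tree construction
   5    0    0    2    0
(5f8.3)
   0.000  19.534  24.304  24.593  25.280
  19.534   0.000  26.837  25.016  24.728
  24.304  26.837   0.000   9.079  20.176
  24.593  25.016   9.079   0.000  19.053
  25.280  24.728  20.176  19.053   0.000
"""

def read_distance_matrix(text):
    """Parse the assumed layout into a list of rows of floats."""
    lines = text.splitlines()
    n = int(lines[1].split()[0])           # first number: number of sequences
    return [[float(v) for v in line.split()] for line in lines[3:3 + n]]

def distance(matrix, i, j):
    """Distance between sequences i and j, numbered from 1 as in the manual."""
    return matrix[i - 1][j - 1]

matrix = read_distance_matrix(MATRIX_TEXT)
print(distance(matrix, 3, 4))  # row of sequence 3, column of sequence 4
```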

In future, I will make it possible to display the distance estimates in a more user-friendly way.
