Expected percentage differences
An expected percentage difference is the percentage of base mismatches between two aligned sequences, within a sampling window (w) of specified size, that one would expect if both sequences are free of errors (that is, not anomalous). It is, in effect, the expected evolutionary distance between the two sequences within w.
If w is a sliding window that moves a fixed number of bases at a time along alignment Sqs (formed from sequences Sq and Ss), and n is the total number of sampling positions, the set of expected percentage differences Eqs = {ei : e1, e2, ..., en} can be viewed as a summary of the local fluctuations in base mismatches between the two sequences that we would expect along the alignment. In contrast, the observed percentage differences Oqs = {oi : o1, o2, ..., on} summarise the local fluctuations that are actually observed. Comparing expected with observed percentage differences, through generation of the Deviation from Expectation statistic, enables a decision to be made on whether both Sq and Ss are error-free or whether one (or both) is anomalous.
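The sliding-window sampling described above can be sketched in Python. This is only an illustration under stated assumptions, not the implementation behind these pages: the function name is hypothetical, the two sequences are assumed to be already aligned and of equal length, and every alignment column (including gap characters) is assumed to count towards the window.

```python
def observed_percentage_differences(sq, ss, w, b):
    """Percentage of mismatched columns in each window of size w, stepped by b.

    Assumption: sq and ss are rows of the same alignment (equal length), and
    gap characters are compared like any other character.
    """
    if len(sq) != len(ss):
        raise ValueError("aligned sequences must have equal length")
    if w <= 0 or b <= 0 or w > len(sq):
        raise ValueError("need 0 < w <= alignment length and b > 0")

    diffs = []
    # Each start position is one sampling position; there are n of them.
    for start in range(0, len(sq) - w + 1, b):
        window_q = sq[start:start + w]
        window_s = ss[start:start + w]
        mismatches = sum(1 for x, y in zip(window_q, window_s) if x != y)
        diffs.append(100.0 * mismatches / w)
    return diffs


# Toy example: one mismatch at column 5 of an 8-column alignment.
print(observed_percentage_differences("AAAAAAAA", "AAAATAAA", w=4, b=2))
# → [0.0, 25.0, 25.0]
```

With w = 4 and b = 2 the 8-column toy alignment yields n = 3 sampling positions, and the single mismatch shows up only in the two windows that cover it.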
Generating expected percentage differences
To generate expected percentage differences for Sq and Ss one needs to know (i) the window size w and step size b used to generate the observed percentage differences Oqs, (ii) the overall evolutionary distance between Sq and Ss, as represented by the mean of the observed percentage differences, and (iii) the location of the hypervariable regions within the 16S rRNA gene, as mapped by probability distribution Q. This information is used as follows:
- Slide a window of size w with step b along the probability distribution Q and determine the average probability ai for each window wi. The resulting data set Qav = {ai : a1, a2, ..., an} is a set of average probabilities that can be related directly to Oqs.
- Calculate fitting coefficient α as the mean of Oqs divided by the mean of Qav.
- Convert Qav to Eqs by multiplying each element of Qav by α (that is, ei = ai * α). This scaling gives the resulting data set Eqs the same mean as Oqs.
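The three steps above can be sketched as follows. This is a minimal sketch rather than a definitive implementation: the function names are hypothetical, Q is assumed to be available as a list with one probability per alignment position, and the observed percentage differences are assumed to have been generated with the same w and b.

```python
from statistics import fmean


def window_averages(q, w, b):
    """Qav: average of Q within each window of size w, stepped by b."""
    return [fmean(q[start:start + w]) for start in range(0, len(q) - w + 1, b)]


def expected_percentage_differences(q, observed, w, b):
    """Scale the window averages of Q so that their mean matches the mean of Oqs."""
    q_av = window_averages(q, w, b)
    if len(q_av) != len(observed):
        raise ValueError("Q and Oqs must yield the same number of windows")

    mean_q_av = fmean(q_av)
    if mean_q_av == 0:
        raise ValueError("mean of Qav is zero; alpha is undefined")

    alpha = fmean(observed) / mean_q_av   # fitting coefficient
    return [a * alpha for a in q_av]      # ei = ai * alpha


# Toy values (made up for illustration only).
q = [0.1, 0.1, 0.3, 0.3, 0.5, 0.5, 0.1, 0.1]
observed = [2.0, 6.0, 4.0]
expected = expected_percentage_differences(q, observed, w=4, b=2)
```

Because every element of Qav is multiplied by the same α, the shape of Eqs follows the variability profile in Q, while its mean follows the overall distance between the two sequences.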