Bioinformatics toolkit
www.cardiff.ac.uk/biosi/research/biosoft/

Observed Percentage Difference


In the context of the Pintail algorithm, an observed percentage difference is the number of base mismatches between two aligned sequences, counted within a sampling window at position i of the alignment and expressed as a percentage of the window length (Fig. 1).
Figure 1. Calculating observed percentage difference.  At position i within a sequence alignment, the observed percentage difference o_i will be

o_i = (m / w) × 100

where m = number of mismatches within the sampling window, and w = length of the sampling window.
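The calculation in Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the Pintail implementation: the function name and arguments are hypothetical, the two sequences are assumed to be already aligned to the same length, and every column where the characters differ (including a gap against a base) is assumed to count as a mismatch, which the text above does not specify.

```python
def observed_percentage_difference(seq_q, seq_s, start, window):
    """Return the percentage of mismatching columns in one sampling window.

    seq_q, seq_s : aligned sequences of equal length (hypothetical inputs)
    start        : 0-based index of the first column of the window
    window       : window length w, in alignment columns
    """
    if len(seq_q) != len(seq_s):
        raise ValueError("sequences must be aligned to the same length")
    if window <= 0 or start < 0 or start + window > len(seq_q):
        raise ValueError("window must lie entirely within the alignment")

    win_q = seq_q[start:start + window]
    win_s = seq_s[start:start + window]

    # m: number of columns where the two sequences differ.
    # Assumption: a gap character opposite a base is counted as a mismatch.
    mismatches = sum(1 for a, b in zip(win_q, win_s) if a != b)

    # o_i = (m / w) * 100
    return 100.0 * mismatches / window


# Two mismatches in a 10-column window give 20%.
print(observed_percentage_difference("ACGTACGTAC", "ACGTTCGTAA", 0, 10))  # 20.0
```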



Observed percentage difference is, in effect, a crude estimate of evolutionary distance, sometimes referred to as uncorrected evolutionary distance.  Attempts to estimate evolutionary distance more accurately, by taking into account the effect of multiple nucleotide substitution events, have led to the development of correction methods such as those of Jukes and Cantor (1969), Kimura (1980), and Jin and Nei (1990).  However, for the purposes of anomaly detection, a simple uncorrected measure of evolutionary distance is sufficient.

If a sampling window of length w is slid along the alignment of two sequences S_q and S_s, moving a set number of bases at a time, and n is the total number of sampling positions, the set of observed percentage differences O_qs = {o_1, o_2, ..., o_n} is generated.  This array summarises local fluctuations in base mismatches between the two sequences along the length of their alignment and, if plotted, gives a visual representation of these changes:
Figure 2. Observed percentage differences generated from a pair of aligned sequences.  In this example, Escherichia coli ATCC11775T (X80725) was aligned with Pseudomonas aeruginosa LMG 1242T (Z76651).  The resulting alignment was then sampled with a 100-base sliding window, moving 25 bases at a time along the alignment.
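The sliding-window sampling described above might be sketched as follows. The function name is hypothetical, and the defaults of a 100-base window and 25-base step are taken from the Figure 2 example. Two further assumptions are not stated in the text: the first window starts at the first alignment column, and only complete windows are sampled, so any trailing columns that do not fill a window are ignored.

```python
def sliding_observed_differences(seq_q, seq_s, window=100, step=25):
    """Return the list of observed percentage differences along an alignment.

    seq_q, seq_s : aligned sequences of equal length (hypothetical inputs)
    window       : sampling window length w
    step         : number of columns the window moves each time
    """
    if len(seq_q) != len(seq_s):
        raise ValueError("sequences must be aligned to the same length")
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")

    differences = []
    # Assumption: only windows lying wholly inside the alignment are used.
    last_start = len(seq_q) - window
    for start in range(0, last_start + 1, step):
        win_q = seq_q[start:start + window]
        win_s = seq_s[start:start + window]
        mismatches = sum(1 for a, b in zip(win_q, win_s) if a != b)
        differences.append(100.0 * mismatches / window)
    return differences


# Toy alignment sampled with a 4-base window moving 2 bases at a time.
print(sliding_observed_differences("ACGTACGTAC", "ACGTTCGTAA", window=4, step=2))
# [0.0, 25.0, 25.0, 25.0]
```

Plotting the returned list against sampling position would give a profile of the kind shown in Figure 2.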


The mean of O_qs will be approximately the same as the overall uncorrected evolutionary distance between the two sequences.
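A small numerical check illustrates why the agreement is approximate rather than exact. One reason is that, with overlapping windows, columns near the ends of the alignment fall inside fewer windows than columns in the middle, so they carry less weight in the mean. The helper names and the toy sequences below are hypothetical, and only complete windows are sampled.

```python
from statistics import mean


def percent_diff(a, b):
    """Uncorrected percentage difference between two equal-length strings."""
    return 100.0 * sum(x != y for x, y in zip(a, b)) / len(a)


def windowed(seq_q, seq_s, window, step):
    """Observed percentage differences for each complete window."""
    starts = range(0, len(seq_q) - window + 1, step)
    return [percent_diff(seq_q[i:i + window], seq_s[i:i + window]) for i in starts]


seq_q = "ACGTACGTAC"
seq_s = "ACGTTCGTAA"

overall = percent_diff(seq_q, seq_s)              # whole alignment at once
overlapping = mean(windowed(seq_q, seq_s, 4, 2))  # 4-base window, step of 2
tiled = mean(windowed(seq_q, seq_s, 5, 5))        # non-overlapping 5-base windows

print(overall, overlapping, tiled)  # 20.0 18.75 20.0
```

In this toy case the overlapping windows give a mean of 18.75% against an overall difference of 20%, whereas windows that tile the alignment exactly reproduce the overall value.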

Observed percentage differences reflect those localised 'evolutionary distances' actually recorded between two sequences.  In contrast, expected percentage differences summarise evolutionary distances one would expect to see between two sequences, assuming that both sequences are error-free (that is, not anomalous).

Relating observed percentage differences to absolute base positions

In theory, S_q and S_s need only be aligned with each other in order to generate observed percentage difference values.  In practice, a reference sequence S_r is also included in the alignment in order to relate calculated observed percentage differences to absolute base positions within the 16S rRNA gene sequence.



Dr K.E. Ashelford. © 2006, Cardiff University