Bioinformatics toolkit
www.cardiff.ac.uk/biosi/research/biosoft/

Pintail: Algorithm


The Pintail algorithm is a technique for determining whether a 16S rDNA sequence is anomalous.  It is based on the idea that the extent of local base differences between two aligned 16S rDNA sequences should be roughly the same along the length of the alignment (having allowed for the underlying pattern of hypervariable and conserved regions known to exist within the 16S rRNA gene).  In other words, evolutionary distance between two reliable sequences should be constant along the length of the gene. 

In contrast, if an error-free sequence is compared with an anomalous sequence, evolutionary distance along the alignment is unlikely to be constant, especially if the anomaly in question is a chimera and formed from phylogenetically different parental sequences. 

The Pintail algorithm is designed to detect and quantify such local variations and in doing so generates the Deviation from Expectation (DE) statistic.  The higher the DE value, the greater the likelihood that the query is anomalous.

The algorithm works as follows:

The sequence to be checked (the query) is first globally aligned with a phylogenetically similar sequence known to be error-free (the subject).  At regular intervals along the resulting alignment, the local evolutionary distance between query and subject is estimated by recording the percentage of mismatched bases within a sampling window of fixed length.  The resulting array of observed percentage differences reflects how the evolutionary distance between query and subject varies along the length of the 16S rRNA gene.  Subtracting from each observed value the corresponding expected percentage difference (the value predicted for error-free sequences) gives a set of deviations, the standard deviation of which (Deviation from Expectation, DE) summarises the disagreement between the observed and expected datasets.  The greater the DE value, the greater the disparity between observed and expected percentage differences, and the more likely it is that the query sequence is anomalous.
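
As a rough illustration of the final step, the Python sketch below computes a DE value from two ready-made arrays of percentages.  The function name and the numbers are invented for illustration, and the use of the population standard deviation is an assumption, since the text does not say which form is used.

```python
from statistics import pstdev

def deviation_from_expectation(observed, expected):
    """Standard deviation of the per-window deviations (observed minus expected).

    Assumption: population standard deviation (pstdev); the description
    does not specify population versus sample form.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected need one value per window")
    deviations = [o - e for o, e in zip(observed, expected)]
    return pstdev(deviations)

# Made-up percentages for five windows (illustrative only, not real data).
expected = [2.0, 6.0, 3.0, 8.0, 4.0]
consistent = [3.0, 7.0, 4.0, 9.0, 5.0]  # uniformly one point above expectation
patchy = [2.0, 6.0, 3.0, 18.0, 14.0]    # matches at first, then diverges

print(deviation_from_expectation(consistent, expected))
print(deviation_from_expectation(patchy, expected))
```

A query that differs from expectation by the same amount in every window gives a DE of zero, whereas one that agrees in some windows and diverges in others gives a larger DE, which is the behaviour the statistic is meant to capture.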

More formally...

  1. Input Sq - the query sequence to be checked for anomalies.
  2. Input Ss - the subject sequence, a reliable, error-free sequence*.
  3. Globally align Sq with Ss to generate alignment Sqs.
  4. Move a sampling window, of size w, b bases at a time along Sqs and at each position i determine the percentage of mismatched bases oi within window wi, where 1 ≤ i ≤ n and n is the total number of windows.
  5. Oqs = {oi: o1, o2, ..., on} is the set of observed percentage differences detected between Sq and Ss.  The corresponding expected percentage differences Eqs = {ei: e1, e2, ..., en} are calculated from the mean of Oqs.
  6. Subtracting ei from oi for each position i generates a series of deviations, the standard deviation of which quantifies the overall deviation of Oqs from Eqs.  This is the Deviation from Expectation (DE) statistic.
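
Steps 3 to 6 can be sketched in Python as follows, starting from two sequences that have already been globally aligned.  This is a minimal sketch under stated assumptions, not the published implementation: the function names are hypothetical, w and b are left as required arguments because no values are given here, a gap opposite a base is counted as a mismatch, and the population standard deviation is used.  In particular, the text says only that the expected values are calculated from the mean of Oqs; the sketch assumes a hypothetical per-window variability profile that is rescaled so its mean equals the mean of Oqs.

```python
from statistics import mean, pstdev

def observed_differences(aligned_query, aligned_subject, w, b):
    """Step 4: percentage of mismatched columns per window of size w, step b."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    observed = []
    for start in range(0, len(aligned_query) - w + 1, b):
        q = aligned_query[start:start + w]
        s = aligned_subject[start:start + w]
        # Assumption: any differing column, including gap versus base, is a mismatch.
        mismatches = sum(1 for x, y in zip(q, s) if x != y)
        observed.append(100.0 * mismatches / w)
    return observed

def expected_differences(observed, profile):
    """Step 5 (assumed form): rescale a hypothetical variability profile
    so that the expected values have the same mean as the observed ones."""
    scale = mean(observed) / mean(profile)
    return [scale * p for p in profile]

def pintail_de(aligned_query, aligned_subject, profile, w, b):
    """Step 6: standard deviation of (oi - ei) over all windows."""
    observed = observed_differences(aligned_query, aligned_subject, w, b)
    if len(profile) != len(observed):
        raise ValueError("profile needs one weight per window")
    expected = expected_differences(observed, profile)
    return pstdev([o - e for o, e in zip(observed, expected)])

# Toy alignment (not real 16S data): the last third of the subject differs.
query = "ACGTACGTACGT"
subject = "ACGTACGTTTTT"
flat_profile = [1.0, 1.0, 1.0]  # hypothetical: equal variability in every window

print(observed_differences(query, subject, w=4, b=4))
print(pintail_de(query, subject, flat_profile, w=4, b=4))
```

With a flat profile every expected value is simply the mean of Oqs, so the DE reduces to the spread of the observed percentages around their own mean.  A real profile would need to reflect the hypervariable and conserved regions of the 16S rRNA gene.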

*Note that the accuracy of the algorithm depends on the choice of subject sequence.  A good subject is both error-free and as evolutionarily close to the query as possible.  Anomalies become progressively harder to detect as the overall evolutionary distance between query and subject increases.

The Pintail algorithm is described in Ashelford et al. (2005).



Dr K.E. Ashelford. © 2006, Cardiff University