Bioinformatics toolkit
www.cardiff.ac.uk/biosi/research/biosoft/

  Probability distribution Q


The 16S rRNA gene has an alternating pattern of conserved and hypervariable regions.  Predicting where these regions lie, in any given sequence, is an important step towards distinguishing between natural intergene variation and a genuine sequence anomaly, when comparing two or more sequences.  What is needed is a probability distribution for the 16S rRNA gene (with one probability allocated to each base position) which maps the position of the various variable regions.

Defining Q

Consider two, randomly selected, error-free sequences Sq and Ss and their resulting global alignment Sqs: let qi be the probability that alignment position i contains a different base residue in sequence Sq to that in sequence Ss.  Consider further Q = {qi : q1, q2, ..., qn}, a set of these probabilities, one for each position in the alignment.  A knowledge of Q allows us to predict where in the alignment Sqs the two sequences should most vary from one another, and conversely where they should be most similar, assuming both sequences are error-free.  Thus, if Sq and Ss are found to deviate from each other in a manner outside of that predicted by Q, then we have good evidence that one, or other (or both) of the sequences is in some way anomalous. 

Calculating probability distribution Q

Q can be generated for any dataset of homologous gene sequences as follows:
  1. Amass a collection of error-free, full-length gene sequences representing the taxon to be investigated.
  2. Include a reference sequence to serve as a position guide*.
  3. Align the sequences to create a multiple sequence alignment.
  4. At each alignment position that coincides with a base in the reference sequence, determine the frequency of occurrence of the most common residue f, ignoring gap characters (Fig. 1), so generating the frequency distribution F = {fi : f1, f2, ..., fn}, where n is the total number of residues in the reference sequence. 
  5. Convert F to probability distribution P = {pi: p1, p2, ..., pn}, where pi = (fi - 0.25)/0.75§ and is the probability that reference position i contains the same residue in each sequence of the alignment (that is, i is a conserved position).
  6. Finally, convert P to Q = {qi : q1, q2, ..., qn}, where qi = 1 - pi and is the probability that reference position i does not contain the same residue in each sequence of the alignment (that is, i is a hypervariable position).
Figure 1. Calculating Q values.  In this example, residue T occurs most often at reference sequence position i, so the relative frequency of this base (0.6) is used to calculate probability pi and hence qi (0.43).  Similarly at position j, G is most common with a relative frequency of 0.89, ignoring gaps, hence qj is 0.15.

Probability distribution Q as used by Pintail algorithm

The Pintail algorithm, as implemented in the Pintail and Mallard programs, uses the probability distribution Q generated from an aligned dataset of 4,383 full-length Bacteria type-strain sequences, as described in Ashelford et al. (2005).

A plot of Q from this data gives a visual representation of the conserved and hypervariable regions Q describes:
Figure 2.  Plot of probability distribution Q generated from 4,383 full-length Bacteria type-strain sequences.

Smoothing the data enables a clear picture of the various regions.  The peaks and troughs coincide closely with the known positions of the variable regions within the 16S rRNA gene.
Figure 3.  Plot of probability distribution Q, after smoothing data by taking means with a sliding window of 50 bases, moving 1 base at a time. Peaks represent hypervariable regions whilst troughs represent conserved regions.

 Note...

* In order to relate probability distribution Q to any other sequence alignment (e.g. during running of the Pintail algorithm) it is necessary to include a reference sequence to serve as a suitable position guide.  Q probabilities are calculated with reference to this sequence, hence the size of Q corresponds to the length of the reference.  The implementation of the Pintail algorithm used by the Pintail and Mallard programs uses a Q distribution generated with Escherichia coli K12 U00096 as reference.

§ At any particular position i, the values for fi will range from 0.25 to 1, inclusive; fi can never be lower than 0.25 since the lowest possible frequency of the most common base occurs when all four bases (A, C, G, and T/U) occur in equal numbers (that is, are equally common).  Consequently, to convert frequencies to probabilities (which should range from 0 to 1, inclusive), each frequency is converted with the equation pi = (fi - 0.25)/0.75.


Index | Toolkit website

Dr K.E. Ashelford. © 2006, Cardiff University