Bioinformatics toolkit
www.cardiff.ac.uk/biosi/research/biosoft/

Mallard: Worked Example


In this worked example, the 16S rRNA partial gene sequence clone library described in O’Sullivan et al. (2004) is examined. The 156 sequences in question have been deposited under accession numbers AY354711 to AY354866.

1. Get sequences

First, the necessary sequences are downloaded, in FASTA format, from the NCBI web site using the search phrase AY354711[ACCN]:AY354866[ACCN].  Click here for the resulting file.
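
For readers who prefer to script the download, the following is a minimal sketch that builds the list of accession numbers and a matching request URL. It assumes the NCBI E-utilities efetch service as an alternative to the web search described above; the helper name efetch_url is hypothetical, and the sketch only constructs the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Accession range quoted in the text: AY354711 to AY354866 inclusive.
FIRST, LAST = 354711, 354866
accessions = [f"AY{n}" for n in range(FIRST, LAST + 1)]

# NCBI E-utilities efetch endpoint (assumed here as a scripted
# alternative to the web search; not part of the original procedure).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(ids):
    """Build a URL requesting the given nucleotide records as FASTA text."""
    query = urlencode({
        "db": "nuccore",
        "id": ",".join(ids),
        "rettype": "fasta",
        "retmode": "text",
    })
    return f"{EFETCH}?{query}"


url = efetch_url(accessions)
print(len(accessions), "accessions")
```

The count printed should match the 156 sequences stated in the introduction, which is a quick check that the range has been entered correctly.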

2. Check sequences

As with any comparative sequence analysis, it is important that all sequences being compared share the same orientation and cover the same region of the 16S rRNA gene. In this worked example, the OrientationChecker tool is used to check orientation and coverage. All sequences have the required sense orientation. However, of the 156 sequences, 82 are found to be partial sequences located at the 5' end of the 16S rRNA gene, 70 are partial sequences located at the 3' end, and 4 are near-complete sequences (Fig. 1). Because the sequences will need to be aligned, these three groups must be treated separately. OrientationChecker is therefore used to create three separate files: FivePrime.fas, ThreePrime.fas, and Complete.fas.
Figure 1.  Screenshot of the OrientationChecker tool (running on Mac OS X), with 42 of the 156 Plymouth sequences displayed.  Note that most of the sequences are partial records covering either the 5' end or the 3' end of the 16S rRNA gene.  A few sequences are near-complete.  Clearly, it will not be possible to carry out a comparative analysis of all the sequences together; the data set needs to be re-packaged into three smaller files, and these analysed separately.
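
The grouping step can be illustrated with a short sketch. This is not OrientationChecker's actual algorithm, which is not described here; it simply assumes that each sequence's start and end positions on the reference gene are already known, and it uses an assumed gene length (1542 bases) and an assumed tolerance (100 bases) to decide whether a sequence reaches either end. All names and values in it are hypothetical.

```python
# Hypothetical sketch: not OrientationChecker's actual algorithm.
GENE_LENGTH = 1542   # assumed length of the E. coli 16S rRNA gene
MARGIN = 100         # assumed tolerance (in bases) for "reaches the end"


def classify_coverage(start, end, gene_length=GENE_LENGTH, margin=MARGIN):
    """Classify a sequence by the reference positions (1-based) it spans."""
    reaches_5p = start <= margin
    reaches_3p = end >= gene_length - margin
    if reaches_5p and reaches_3p:
        return "Complete"
    if reaches_5p:
        return "FivePrime"
    if reaches_3p:
        return "ThreePrime"
    return "Unassigned"


def group_by_coverage(spans):
    """Map each group name to the list of sequence names placed in it."""
    groups = {}
    for name, (start, end) in spans.items():
        groups.setdefault(classify_coverage(start, end), []).append(name)
    return groups


# Made-up spans for three hypothetical clones, one per group.
example = {"cloneA": (20, 1030), "cloneB": (520, 1500), "cloneC": (10, 1510)}
groups = group_by_coverage(example)
```

Each resulting group would then be written to its own file (FivePrime.fas, ThreePrime.fas, or Complete.fas) so that only sequences covering the same region are aligned together.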

3. Include reference sequence  

Next, a full-length Escherichia coli K12 sequence (accession U00096) is added to each of the three sequence files. U00096 acts as the reference sequence for the program. Click here for further information on why a reference sequence is needed.
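
Adding the reference can be done in any text editor, but a small script avoids copy-and-paste slips when the same record must go into three files. The sketch below is one possible approach; the function name append_reference and the 60-character line width are assumptions, and the reference sequence itself must be supplied by the reader.

```python
from pathlib import Path


def append_reference(fasta_path, ref_name, ref_seq, width=60):
    """Append one FASTA record to an existing file, wrapping the sequence."""
    path = Path(fasta_path)
    text = path.read_text()
    if text and not text.endswith("\n"):
        text += "\n"  # make sure the new header starts on its own line
    # Split the sequence into fixed-width lines, as is usual for FASTA.
    lines = [ref_seq[i:i + width] for i in range(0, len(ref_seq), width)]
    record = f">{ref_name}\n" + "\n".join(lines) + "\n"
    path.write_text(text + record)
```

Calling append_reference once for each of FivePrime.fas, ThreePrime.fas, and Complete.fas, with the name U00096 and the reference sequence string, would reproduce the step described above.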

4. Align sequences

Next, each file is aligned using ClustalW, to produce the following files: FivePrime.aln, ThreePrime.aln, and Complete.aln.
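
Where ClustalW is installed as a command-line program, the three alignments can be run in one go. The sketch below assumes the executable is called clustalw (on some systems it is clustalw2) and uses the -INFILE, -ALIGN, and -OUTFILE options; the function names are hypothetical, and default alignment parameters are assumed since the text does not specify any.

```python
import subprocess


def clustalw_command(fasta_path, aln_path, executable="clustalw"):
    """Return the argument list for a ClustalW multiple alignment."""
    return [
        executable,
        f"-INFILE={fasta_path}",
        "-ALIGN",
        f"-OUTFILE={aln_path}",
    ]


def align_all(stems=("FivePrime", "ThreePrime", "Complete")):
    """Align each .fas file in turn, writing a matching .aln file."""
    for stem in stems:
        command = clustalw_command(f"{stem}.fas", f"{stem}.aln")
        # check=True raises an error if ClustalW exits unsuccessfully.
        subprocess.run(command, check=True)
```

Building the argument list in a separate function makes it easy to print and inspect the command before anything is executed.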

5. Run Mallard

Each file is now analysed, in turn, with Mallard.  The procedure for each is as follows:
  1. The input file in question is loaded into the Mallard program. 
  2. The Run button is selected and a plot of DE values is generated.  Each DE value summarises a single pairwise comparison; the higher the DE value, the more likely it is that one (or both) of the sequences involved is anomalous.
  3. Clicking the Identify Outliers button identifies those outlier DE values judged to be too high to be caused by error-free sequences. All DE values above the resulting default cut-off line (superimposed over the plot) are deemed to be outliers.  The program identifies the sequences responsible for the identified outliers and lists them in the Bad Sequences panel (Fig. 2).
  4. Selecting a specific sequence from the Bad Sequences panel causes all DE values generated by that record to be highlighted in red (Fig. 2). Clicking on any one of these red DE data points reveals the underlying Pintail plot responsible for that value (Fig. 3). The profile of this plot gives an indication of the type of anomaly present in the identified sequence.
Figure 2. Screenshot of Mallard (running on Mac OS X) after analysis of FivePrime.aln. Listed are five potentially anomalous sequences (upper right panel), responsible for the outlier DE values observed within the DE plot (left panel). Record AY354817 has been selected (by mouse-clicking the record), and shown in red are the DE values generated by this record.

Figure 3. Clicking on an individual DE value (right panel) causes the corresponding Pintail plot to be displayed (left panel). In this example, a DE value of 11.37, generated by a comparison between AY354817 and AY354813, is selected. The profile of this Pintail plot is characteristic of a chimeric sequence, suggesting that AY354817 is a chimera.
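
The logic of steps 3 and 4 can be sketched in a few lines. This is an illustration only, not Mallard's implementation: it treats the cut-off as a single fixed number (an assumption), uses toy DE values for four hypothetical sequences, and ranks sequences simply by how many outlier comparisons they take part in.

```python
from collections import Counter


def find_outliers(de_values, cutoff):
    """Return the pairwise comparisons whose DE value exceeds the cut-off."""
    return {pair: de for pair, de in de_values.items() if de > cutoff}


def rank_suspects(outliers):
    """Count how many outlier comparisons each sequence takes part in."""
    counts = Counter()
    for seq_a, seq_b in outliers:
        counts[seq_a] += 1
        counts[seq_b] += 1
    return counts.most_common()


# Toy DE values for four hypothetical sequences (not real Mallard output).
toy_de = {
    ("s1", "s2"): 1.2,
    ("s1", "s3"): 0.9,
    ("s1", "s4"): 8.5,
    ("s2", "s3"): 1.1,
    ("s2", "s4"): 9.7,
    ("s3", "s4"): 7.9,
}
suspects = rank_suspects(find_outliers(toy_de, cutoff=5.0))
print(suspects[0])  # ('s4', 3)
```

In the toy data, every comparison involving s4 exceeds the cut-off while comparisons among the other sequences do not, so s4 is the sequence most plausibly responsible for the outliers. This mirrors the way a single anomalous record generates a band of high DE values in Fig. 2.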

6. Summarising results

With the default settings, five putatively anomalous sequences are identified within FivePrime.aln, and a further five within ThreePrime.aln. No anomalies are detected in Complete.aln. These are listed in Table 1.
Table 1.  Putative anomalies, as identified with (i) the default cut-off line of 99.9%, and (ii) a less stringent cut-off line of 99%.

Input file
FivePrime.aln    ThreePrime.aln    Complete.aln
AY354817         AY354789          (none)
AY354824         AY354794
AY354826         AY354776
AY354718         AY354851
AY354749         AY354852

7. Checking results

By following a standardised procedure, each identified record is checked to see whether it is indeed anomalous.  Eleven records are confirmed to be true anomalies (Table 2). Further checking reveals all eleven to be chimeric.
Table 2.  Outcome of confirmatory checks on putative anomalies. Click on individual accession numbers for specific reports.

Accession    Size    Anomaly?    Comments
AY354817     1013    Yes         Two fragment chimera.
AY354824     1055    Yes         Two fragment chimera.
AY354826     1030    Yes         Duplicate of AY354824.
AY354718     1037    Yes         Three fragment chimera.
AY354749     982     Yes         Two fragment chimera.
AY354776     846     Yes         Two fragment chimera.
AY354794     963     Yes         Two fragment chimera.
AY354789     942     Yes         Two fragment chimera.
AY354851     942     No          False positive - no anomaly detected.
AY354852     945     Yes         Two fragment chimera.

8. Conclusion and Discussion

Of the 156 records within the clone library, 9 are shown to be clear chimeras. One sequence (AY354851) has been falsely identified as anomalous; however, as the program had marked this identification as questionable to begin with, this result is not surprising.

Have all possible anomalies been detected? Not quite. Reducing the level of the cut-off line (by going to the Options menu and selecting Identify Outliers - options) from 99.9% (the default setting) to 99% does successfully identify a further two chimeras (AY354811 and AY354804). However, doing so also identifies a further five false positives, clearly indicating that further exploration is unlikely to reveal any further anomalies.
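
The trade-off between the two cut-off settings can be made concrete with a little arithmetic. The sketch below uses only the counts reported in this worked example (9 confirmed chimeras and 1 false positive at 99.9%; a further 2 chimeras and 5 false positives at 99%) and computes the fraction of flagged sequences that turn out to be genuine. The function name precision and the variable names are introduced here for illustration.

```python
def precision(true_positives, false_positives):
    """Fraction of flagged sequences that are genuine anomalies."""
    return true_positives / (true_positives + false_positives)


# Counts reported in the text for the default 99.9% cut-off.
default_tp, default_fp = 9, 1
# Extra sequences flagged when the cut-off is lowered to 99%.
extra_tp, extra_fp = 2, 5

p_default = precision(default_tp, default_fp)
p_relaxed = precision(default_tp + extra_tp, default_fp + extra_fp)
p_extra = precision(extra_tp, extra_fp)
print(f"{p_default:.2f} {p_relaxed:.2f} {p_extra:.2f}")  # 0.90 0.65 0.29
```

At the default setting 9 of 10 flagged sequences are genuine; at 99% the figure falls to 11 of 17, and only 2 of the 7 newly flagged sequences are genuine. The extra checking effort therefore yields sharply diminishing returns.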



Dr K.E. Ashelford. © 2006, Cardiff University