Bioinformatics toolkit
www.cardiff.ac.uk/biosi/research/biosoft/

Mallard: Cut-Off Line


The cut-off line identifies which DE values, when plotted against mean percentage differences, are likely to be due to anomalous sequences, such as chimeras.  The cut-off line defines that region of the plot-space in which only DE values resulting from error-free comparisons should occur, based on previous comparisons of error-free sequences (Fig. 1).  Any DE values plotted outside of this plot-space are considered by the program to be outlier values.
Figure 1. DE plot with cut-off line, in red, superimposed.  Data points below the line are judged to represent comparisons between error-free sequences.  Data points above the line are likely to result from comparisons involving anomalous sequences.  In this example, a cut-off line, defining the location of 99.9% of error-free comparisons, is used.  See main text for further explanation.

Default cut-off line

The default cut-off line used by the program aims to define a plot-space in which 99.9% of all error-free comparisons should occur, based on previous comparisons.  Put another way, the probability of a pair of error-free sequences, selected at random, giving rise to a DE value outside the area defined by this 99.9% cut-off line, should be P = 0.001. 
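
As a rough illustration of what P = 0.001 implies, the sketch below estimates how many error-free comparisons would be expected to fall outside a cut-off line by chance alone. It assumes the comparisons are independent and distributed like the reference data; the function name and its parameters are hypothetical and are not part of Mallard.

```python
def expected_false_positives(n_comparisons: int, quantile_percent: float = 99.9) -> float:
    """Expected number of error-free comparisons lying outside the cut-off line.

    Assumes (hypothetically) that each comparison independently falls outside
    the line with probability 1 - quantile_percent / 100.
    """
    if not 0.0 <= quantile_percent <= 100.0:
        raise ValueError("quantile_percent must lie between 0 and 100")
    p_outside = 1.0 - quantile_percent / 100.0
    return n_comparisons * p_outside


# With the default 99.9% line, about 1 in every 1,000 error-free
# comparisons would be expected to plot above the line.
print(round(expected_false_positives(1000), 3))
```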

Prior experimentation (data not shown) indicates that a 99.9% cut-off line usually detects the greatest number of anomalous sequences with the fewest false positives, and hence this is the default cut-off line used by the program.

 Note...

By selecting the Identify Outliers - options menu item from the Options menu, the user can select alternative cut-off lines representing 75%, 95%, 99% or 100% of reliable comparisons.  The higher the percentage represented by the cut-off line, the more stringent the identification of potential outliers.

How the cut-off lines are calculated

A total of 2,007 reliable type-strain sequences, as catalogued by the Ribosome Database Project (RDP), were compared with each other to produce 2,013,021 separate DE values.  These were plotted against their corresponding mean percentage difference values (Fig. 2).
Figure 2. Plot of DE values generated from comparisons among 2,007 reliable type-strain sequences.
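
The figure of 2,013,021 DE values is the number of distinct pairs that can be drawn from 2,007 sequences. The short check below confirms the arithmetic; the function name is illustrative only.

```python
from math import comb


def pairwise_comparisons(n_sequences: int) -> int:
    """Number of distinct unordered pairs among n_sequences sequences."""
    return comb(n_sequences, 2)


print(pairwise_comparisons(2007))  # 2013021
```
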
The median, upper quartile, and 95, 99, 99.9 and 100% quantiles of the plotted data were then determined at each 1% interval along the x axis of the plot.  Plotting these quantile data sets produces the plots shown in Fig. 3, which summarise the distribution of the DE values.
Figure 3. Quantile lines generated from data presented in Fig. 2.
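
The quantile step can be sketched as follows. This is a minimal illustration, not the program's own code: it assumes that "each 1% interval" means grouping points into bins 1% wide on the x axis, and it uses linear interpolation between ranked values, neither of which is stated in the original description. All function and parameter names are hypothetical.

```python
import math
from collections import defaultdict


def quantile(sorted_values, q):
    """Quantile of an ascending list by linear interpolation, q in [0, 1]."""
    if not sorted_values:
        raise ValueError("cannot take a quantile of an empty list")
    pos = q * (len(sorted_values) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def quantile_lines(points, quantiles=(0.50, 0.75, 0.95, 0.99, 0.999, 1.0),
                   bin_width=1.0):
    """Summarise (mean % difference, DE) points as quantiles per x-axis bin.

    Returns {bin_start: {q: DE value at quantile q}}.
    Binning by floor(x / bin_width) is an assumption of this sketch.
    """
    bins = defaultdict(list)
    for x, de in points:
        bins[math.floor(x / bin_width)].append(de)
    return {
        b * bin_width: {q: quantile(sorted(values), q) for q in quantiles}
        for b, values in sorted(bins.items())
    }


# Tiny made-up data set: two points in the 0-1% bin, one in the 1-2% bin.
example = [(0.5, 1.0), (0.7, 3.0), (1.2, 2.0)]
for bin_start, summary in quantile_lines(example).items():
    print(bin_start, summary[0.50], summary[1.0])
```
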
When plotted against a Log10 x axis, these 'raw' quantile plots give roughly straight lines; consequently, they can be simplified to a series of linear equations (Table 1) which, when plotted, give a simplified representation of the original type-strain data (Fig. 4).
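
One way to reduce a roughly straight quantile plot to a linear equation is an ordinary least-squares fit of DE against Log10 x. The original description does not say which fitting method was used, so least squares is an assumption here, and the function name is hypothetical.

```python
import math


def fit_log10_line(xs, ys):
    """Least-squares fit of y = slope * log10(x) + intercept.

    xs must all be positive. The choice of ordinary least squares is an
    assumption of this sketch, not a documented detail of the program.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    logs = [math.log10(x) for x in xs]
    n = len(logs)
    mean_log = sum(logs) / n
    mean_y = sum(ys) / n
    sxx = sum((l - mean_log) ** 2 for l in logs)
    sxy = sum((l - mean_log) * (y - mean_y) for l, y in zip(logs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_log
    return slope, intercept


# Synthetic points lying exactly on a known line, to show the fit recovers it.
xs = [1.0, 2.0, 5.0, 10.0, 20.0]
ys = [3.27 * math.log10(x) + 2.07 for x in xs]
slope, intercept = fit_log10_line(xs, ys)
print(round(slope, 2), round(intercept, 2))
```
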

Table 1.  Equations for cut-off lines as used by the Mallard program, where y is the DE value and x is the mean percentage difference.

Quantile line            Equation
50% (median)             y = 1.98 Log10(x) + 0.746
75% (upper quartile)     y = 2.28 Log10(x) + 1.00
95%                      y = 2.64 Log10(x) + 1.46
99%                      y = 3.12 Log10(x) + 1.66
99.9%                    y = 3.27 Log10(x) + 2.07
100% (maximum)           y = 4.37 Log10(x) + 1.81

Figure 4. Quantile lines as generated from the equations listed in Table 1.
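
The equations in Table 1 can be applied to a single comparison as in the sketch below. It assumes that x is expressed in percentage units (for example, 10 for 10%) and that a point lying exactly on the line is not counted as an outlier; neither detail is confirmed by the original description. The names CUTOFF_LINES, cutoff_de and is_outlier are hypothetical and are not taken from the program.

```python
import math

# (slope, intercept) for y = slope * log10(x) + intercept, copied from Table 1.
CUTOFF_LINES = {
    50.0: (1.98, 0.746),
    75.0: (2.28, 1.00),
    95.0: (2.64, 1.46),
    99.0: (3.12, 1.66),
    99.9: (3.27, 2.07),
    100.0: (4.37, 1.81),
}


def cutoff_de(mean_pct_diff: float, quantile: float = 99.9) -> float:
    """DE value of the chosen cut-off line at a given mean percentage difference."""
    if mean_pct_diff <= 0:
        raise ValueError("mean percentage difference must be positive")
    slope, intercept = CUTOFF_LINES[quantile]
    return slope * math.log10(mean_pct_diff) + intercept


def is_outlier(de: float, mean_pct_diff: float, quantile: float = 99.9) -> bool:
    """True if the DE value plots above the chosen cut-off line.

    Treating points exactly on the line as non-outliers is an assumption.
    """
    return de > cutoff_de(mean_pct_diff, quantile)


# At a mean difference of 10%, Log10(x) = 1, so the default line sits at
# 3.27 + 2.07 = 5.34.
print(is_outlier(6.0, 10.0), is_outlier(5.0, 10.0))
```

Keeping the coefficients in a single lookup table means the alternative cut-off lines can be selected simply by passing a different quantile value.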

 Note...

By selecting the Identify Outliers - options menu item from the Options menu, the user can choose to display the cut-off lines in the simplified form presented in Fig. 4 (default setting) or as the raw quantile data as originally calculated from the type-strain data (Fig. 3).



Dr K.E. Ashelford. © 2006, Cardiff University