Useful comparative analysis of sequence classification systems w/ a few questionable bits

There is a useful new publication just out in BMC Bioinformatics: "A comparative evaluation of sequence classification programs" by Adam L. Bazinet and Michael P. Cummings.  In the paper the authors attempt a systematic comparison of tools for classifying DNA sequences according to the taxonomy of the organism from which they come.

I have been interested in such activities since, well, since 1989 when I started working in Colleen Cavanaugh's lab at Harvard sequencing rRNA genes to do classification.  And I have known one of the authors, Michael Cummings for almost as long.

Their abstract does a decent job of summing up what they did:

A fundamental problem in modern genomics is to taxonomically or functionally classify DNA sequence fragments derived from environmental sampling (i.e., metagenomics). Several different methods have been proposed for doing this effectively and efficiently, and many have been implemented in software. In addition to varying their basic algorithmic approach to classification, some methods screen sequence reads for "barcoding genes" like 16S rRNA, or various types of protein-coding genes. Due to the sheer number and complexity of methods, it can be difficult for a researcher to choose one that is well-suited for a particular analysis. 
We divided the very large number of programs that have been released in recent years for solving the sequence classification problem into three main categories based on the general algorithm they use to compare a query sequence against a database of sequences. We also evaluated the performance of the leading programs in each category on data sets whose taxonomic and functional composition is known. 
We found significant variability in classification accuracy, precision, and resource consumption of sequence classification programs when used to analyze various metagenomics data sets. However, we observe some general trends and patterns that will be useful to researchers who use sequence classification programs.

The three main categories of methods they identified are:

  • Programs that primarily utilize sequence similarity search
  • Programs that primarily utilize sequence composition models (like CompostBin from my lab)
  • Programs that primarily utilize phylogenetic methods (like AMPHORA & STAP from my lab)
The paper has some detailed discussion and comparison of some of the methods in each category.  They even made a tree of the methods

Figure 1. Program clustering. A neighbor-joining tree that clusters the classification programs based on their similar attributes. From here.
In some ways I love this figure - since, well, I love trees.  But in other ways I really, really do not like it.  I don't like it because they use an explicitly phylogenetic method (neighbor joining, which is designed to infer phylogenetic trees, not simply to cluster entities by their similarity) to cluster entities that do not have a phylogenetic history.  Why use neighbor joining here?  What is the basis for using this method to cluster methods?  It is cute, sure.  But I don't get it.  What do deep branches represent in this case?  It drives me a bit crazy when people throw a method designed to represent branching history at a situation where clustering by similarity is needed.  Similarly, it drives me crazy when similarity-based clustering methods are used when history is needed.

Not to take away from the paper too much - it is definitely worth a read for those working on methods to classify sequences as well as for those using such methods.  They even go so far as to test various web servers (e.g., MG-RAST) and discuss the time it takes to get results.  They also test the methods for their precision and sensitivity.  Very useful bits of information here.

So - overall I like the paper.  But one other thing in here sticks in my craw: the discussion of "marker genes."  Below is some of the introductory text on the topic.  I have labelled some bits I do not like too much:

It is important to note that some supervised learning methods will only classify sequences that contain “marker genes”. Marker genes are ideally present in all organisms, and have a relatively high mutation rate that produces significant variation between species. The use of marker genes to classify organisms is commonly known as DNA barcoding. The 16S rRNA gene has been used to greatest effect for this purpose in the microbial world (green genes [6], RDP [7]). For animals, the mitochondrial COI gene is popular [8], and for plants the chloroplast genes rbcL and matK have been used [9]. Other strategies have been proposed, such as the use of protein-coding genes that are universal, occur only once per genome (as opposed to 16S rRNA genes that can vary in copy number), and are rarely horizontally transferred [10]. Marker gene databases and their constitutive multiple alignments and phylogenies are usually carefully curated, so taxonomic and functional assignments based on marker genes are likely to show gains in both accuracy and speed over methods that analyze input sequences less discriminately. However, if the sequencing was not specially targeted [11], reads that contain marker genes may only account for a small percentage of a metagenomic sample.  
I think I will just leave these highlighted sections uncommented upon and let people imagine what I don't like about them ... for now.

Anyway - again - the paper is worth checking out.  And if you want to know more about methods used for classifying sequences see this Mendeley collection, which focuses on metagenomic analysis but has many additional papers on top of the ones discussed in this paper.


  1. " I don't like it because they use an explicitly phylogenetic method (neighbor joining, which is designed to infer phylogenetic trees and not to simply cluster entities by their similarity) to cluster entities that do not have a phylogenetic history."

    Erm... I was under the impression, and was always taught, that neighbor joining is a phenetic (distance) similarity-based approach, so its usage here seems fair play to me. It does seem, though, that a lot of authors in the literature, now and in the past, use NJ phenograms to represent phylogeny, so perhaps there isn't a consensus agreement on this.

  2. Umm ... it is a distance method. And it was designed explicitly for phylogenetic analysis. The key feature is that branch lengths are allowed to vary and are in essence optimized under the minimum evolution criterion. As far as I know this is in essence unique to phylogenetic reconstruction and not something that would make sense to use to cluster objects that do not have a history.

  3. Plenty of more modern papers explicitly state, when using neighbor joining, that it's a phenetic method:

    (too many to list them individually)

    Of course I freely admit you'll also find many papers referring to NJ as a phylogenetic method. Seems like there's no consensus view to me. Of course one can use phenetic methods to try and infer phylogeny - whether you *should* is another matter, and perhaps irrelevant in this case.

    This paper may however summarise some of NJ's problems.

    de Queiroz & Good (1997) Phenetic clustering in biology: a critique. Quarterly Review of Biology 72(1)

    Anyway - this might perhaps explain why Bazinet & Cummings used an NJ phenogram to cluster the different methods by similarity. It makes sense to me; I see nothing wrong with it if NJ is explicitly used as a phenetic classification method.

    1. Phenetics is fine. I have no objection to phenetics. But it, too, is for classification of organisms or genes or other objects with a history. What concept suggests that such methods should be used for classifying computational methods? It makes no sense to me. As for the benefits/drawbacks of NJ - I was not defending it as the best phylogenetic method. I was saying it should not be used for NON-phylogenetics.

      Show me ONE reasonable paper that has used NJ for clustering that is not phylogenetics-driven (e.g., as used here).

  4. NJ just assumes that the distances are additive (i.e. the distance between two leaves summed along the path between them on the tree's branches is the same as their pairwise distance), which is a weaker constraint than UPGMA, which assumes that distances are not only additive but also ultrametric (tree is rooted, with every leaf equidistant from the root). So NJ includes the UPGMA tree as a special case -- if the data were ultrametric, NJ and UPGMA give you the same tree. It's nonsensical to object to using NJ on nonphylogenetic data, while demanding that standard and even more formally restrictive UPGMA (i.e. the usual hierarchical clustering algorithm) be used instead. There's nothing about NJ that's necessarily "phylogenetics-driven" other than the fact that we believe evolutionary distances are roughly additive, thus NJ can be reasonably applied. I doubt that the "distances" between software implementations are additive, but I'm even more sure they're not ultrametric.
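Both conditions are easy to check directly. Here is a minimal pure-Python sketch (the function names and the toy matrices are mine, not from the paper or the comments) that tests the three-point (ultrametric) and four-point (additive) conditions on a symmetric distance matrix:

```python
from itertools import combinations

def is_ultrametric(d, tol=1e-9):
    """Three-point condition: in every triple of leaves, the two
    largest pairwise distances are equal (every leaf equidistant
    from a root)."""
    for i, j, k in combinations(list(d), 3):
        a, b, c = sorted((d[i][j], d[i][k], d[j][k]))
        if abs(b - c) > tol:
            return False
    return True

def is_additive(d, tol=1e-9):
    """Four-point condition: in every quartet, the two largest of the
    three pairwise sums are equal (distances fit some unrooted tree)."""
    for i, j, k, l in combinations(list(d), 4):
        s = sorted((d[i][j] + d[k][l],
                    d[i][k] + d[j][l],
                    d[i][l] + d[j][k]))
        if abs(s[1] - s[2]) > tol:
            return False
    return True

def matrix(entries):
    """Build a symmetric dict-of-dicts from {(i, j): distance}."""
    taxa = sorted({t for pair in entries for t in pair})
    d = {i: {j: 0.0 for j in taxa} for i in taxa}
    for (i, j), v in entries.items():
        d[i][j] = d[j][i] = float(v)
    return d

# Clock-like tree: additive AND ultrametric
clock = matrix({("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
                ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6})
# Unequal rates of change: still additive, no longer ultrametric
rates = matrix({("A", "B"): 5, ("A", "C"): 4, ("A", "D"): 5,
                ("B", "C"): 7, ("B", "D"): 8, ("C", "D"): 5})
```

Every ultrametric matrix passes both tests, which is the sense in which UPGMA's assumption is strictly stronger than NJ's.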

  5. Thanks Sean ... you are right here (in part) ... I was overly focused on phylogenetics. But I was not trying to defend UPGMA as a method to analyze non-biological data either. I agree with you that the distances here are unlikely to be additive or ultrametric and thus neither method should be used for this type of data. What I (generally) object to is throwing a method at something without justifying why that particular model/method is being used. NJ or UPGMA seem unwise here.

  6. Well, I dunno, maybe that's a little too dogmatic; clustering (either by NJ or UPGMA) is a useful visualization and you don't have to take branch lengths all that seriously when you're just visualizing. If we took your view then we'd have to reject pretty much every use of hierarchical clustering in the literature, where the underlying distances are unlikely to be ultrametric. Hrmmm. Well, OK, maybe I could be up for that, now that I think about it. Let's start with that gene expression clustering paper from that other Eisen guy...

    1. Yes - I am with you. Clustering is a very nice tool for visualization but it can lead to misleading results if the distances are not right for the method.
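For the record, the usual "clustering for visualization" tool being discussed is plain average-linkage (UPGMA); a minimal sketch (entirely my own illustration, not code from the paper) shows how little machinery it involves:

```python
from itertools import combinations

def upgma(d):
    """Minimal UPGMA (average-linkage) clustering, for illustration only.
    d: symmetric dict-of-dicts of pairwise distances.
    Returns [(cluster, height), ...] in merge order; clusters are
    nested tuples, heights are half the merge distance."""
    d = {i: dict(row) for i, row in d.items()}   # work on a copy
    sizes = {i: 1 for i in d}
    merges = []
    while len(d) > 1:
        # join the closest pair of clusters
        i, j = min(combinations(list(d), 2), key=lambda p: d[p[0]][p[1]])
        new = (i, j)
        merges.append((new, d[i][j] / 2))        # ultrametric height
        ni, nj = sizes.pop(i), sizes.pop(j)
        sizes[new] = ni + nj
        # size-weighted average of distances to the two merged clusters
        row = {k: (ni * d[i][k] + nj * d[j][k]) / (ni + nj)
               for k in d if k not in (i, j)}
        del d[i], d[j]
        for k in d:
            del d[k][i], d[k][j]
            d[k][new] = row[k]
        row[new] = 0.0
        d[new] = row
    return merges

# Clock-like toy matrix (invented for illustration)
taxa = "ABCD"
clock = {i: {j: 0.0 for j in taxa} for i in taxa}
for (i, j), v in {("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
                  ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6}.items():
    clock[i][j] = clock[j][i] = float(v)

merges = upgma(clock)
```

The merge heights are only meaningful as branch lengths if the input really is ultrametric - which is exactly the caveat in the comments above.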

  7. If anyone is further interested in continuing the 'is NJ phenetic or phylogenetic?' debate, we're having an open discussion of it here over on Google+

    There's some diversity of opinion already.

    I probably won't engage with the matter further here. But thanks all for the comments.

  8. Ross - I commented over on G+. For others - here is what I wrote.

    I think there is some diversity in the use of "phylogenetic" out there but here are my thoughts on some of the terminology. Phylogenetics is really the study of the relationships among organisms (or genes, genomes, or other entities). And phylogenetic methods are methods for inferring phylogeny.

    Phylogenetic methods come in many flavors. Some people divide them into two classes - as Joseph Brown did above - distance-based methods and discrete data methods. Other people divide up methods into distance, parsimony, and likelihood categories, or distance, parsimony, likelihood, and Bayesian categories - in essence treating parsimony-based methods as distinct from likelihood/Bayesian methods even though both deal with analyzing discrete data/characters.

    In regard to phenetics: as far as I am aware, the term has been used to describe methods that group organisms by their similarity to each other while generally ignoring evolutionary history. To group organisms by similarity one generally uses distance matrix methods, so there is some overlap between distance-matrix phylogenetic methods and phenetic methods.

    There is some disagreement as to which distance-based methods should be called phenetic and just what neighbor joining is actually doing. In essence NJ attempts to infer a phylogenetic tree from a distance matrix by minimizing the total branch length in the tree, in essence assuming that the distances are additive. It is certainly true that one could feed ANY distance matrix into NJ. But given its methodology, I think it is only really suitable for distances that are the result of a bifurcating evolutionary history. I have yet to see a case where other types of data can reasonably be analyzed using NJ. In addition, because NJ allows rates of evolution to vary between taxa - so organisms can be monophyletic yet be more similar to things outside their clade - it is not the standard phenetic approach of grouping organisms by similarity per se.
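That last property - NJ joining true sisters even when raw similarity points elsewhere - can be shown numerically. Below is a small sketch: the Q-criterion is the standard NJ pair-selection step, while the toy distance matrix (from a tree with one long branch) is my own illustration:

```python
def nj_q(d):
    """Q-criterion of neighbor joining (Studier & Keppler form):
    Q(i, j) = (n - 2) * d(i, j) - r(i) - r(j), with r(i) = sum_k d(i, k).
    NJ joins the pair with the smallest Q, not the smallest distance."""
    taxa = list(d)
    n = len(taxa)
    r = {i: sum(d[i][k] for k in taxa) for i in taxa}
    return {(i, j): (n - 2) * d[i][j] - r[i] - r[j]
            for i in taxa for j in taxa if i < j}

# Additive distances from an unrooted tree with unequal rates:
# A and B are sisters, but B sits on a long branch (leaf branches
# A:1, B:4, C:2, D:3; internal branch:1), so A is *closer* to C (4)
# than to its true sister B (5).
taxa = "ABCD"
dm = {i: {j: 0.0 for j in taxa} for i in taxa}
for (i, j), v in {("A", "B"): 5, ("A", "C"): 4, ("A", "D"): 5,
                  ("B", "C"): 7, ("B", "D"): 8, ("C", "D"): 5}.items():
    dm[i][j] = dm[j][i] = float(v)

q = nj_q(dm)
best = min(q, key=q.get)
```

Here d(A, C) = 4 is smaller than d(A, B) = 5, yet the smallest Q (a tie at -24) belongs to the two true sister pairs (A, B) and (C, D); similarity grouping would have paired A with C.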

    Just as distance methods can be used for non-evolutionary purposes (e.g., standard clustering), so too can discrete character methods. For example, parsimony analysis can be applied to any data matrix. And one can infer "changes" between states even for objects that are not homologous and share no common ancestry. This does not mean parsimony methods SHOULD be used in such cases, but they can. And similarly, just because one can use a distance-based phylogenetic method to analyze data that does not have a phylogenetic history, this does not mean one should. The issue in both cases is whether the model/algorithm is appropriate for the type of data. Since NJ in essence assumes additive distances, it does not seem valid for most cases other than phylogenetic history (note - I am not saying it is ideal for phylogenetic inference, and in fact I do not use it anymore, but that is not the point).

    It does not matter what we call the methods - phenetic or phylogenetic. What matters is the nuts and bolts of how they work. And NJ seems like a bad idea for clustering most objects.

  9. There is also the conflating issue that hard-line cladists of the Willi Hennig school tend to use "phenetic" to just mean "bad; not the one true holy method of maximum parsimony".

  10. Jonathan, thanks for pointing this out - I didn't know about it even though I was Director of the Center where the two authors work until a year ago. (They never discussed it with me, even though PhymmBL from my lab is included in the study.) I think the paper may have seriously erred in the way it evaluated the composition-based classifiers (including PhymmBL). The problem is that they trained all the classifiers on everything in RefSeq, and then proceeded to test them on their three data sets. But those data sets are largely drawn from RefSeq genomes! So this introduces a very serious and fundamental bias - the training data include the test data. Therefore a method like naive Bayes is expected to do well, since it has many parameters and can basically memorize (sort of) what it's seen before.
    This should not have gotten through review without them fixing this design flaw. The training sets should have been carefully designed for each test to exclude the test data. Oh well.
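The holdout discipline Steven describes can be sketched in a few lines. In the stronger, clade-level variant one excludes every training genome that shares a clade with a test genome, so the classifier cannot simply memorize near-identical relatives of the test data (all ids and the lineage mapping below are hypothetical):

```python
def clade_exclusion(train_ids, test_ids, lineage, rank="genus"):
    """Drop from the training set every genome sharing a clade at
    `rank` with any test genome.
    lineage: genome_id -> {rank_name: taxon_name} (hypothetical)."""
    excluded = {lineage[t][rank] for t in test_ids}
    return [g for g in train_ids if lineage[g][rank] not in excluded]

# Toy example: two genomes in one genus, one in another
lineage = {"g1": {"genus": "Escherichia"},
           "g2": {"genus": "Escherichia"},
           "g3": {"genus": "Bacillus"}}
train = clade_exclusion(["g1", "g2", "g3"], ["g1"], lineage)
```

Exact-genome exclusion is the weaker variant; excluding at, say, the genus level tests whether a classifier generalizes rather than memorizes.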

    1. Someone else on Twitter (Rob Beiko) also pointed out this same issue - a big flaw.

  11. For anyone who stumbles upon this thread (as I just did), I thought I would mention that the final (non-provisional) version of this paper is now online:

    It contains some corrections to the previous version, most notably to some PhymmBL values. Thanks to Arthur Brady and Steven for helping resolve those issues.

    To the concern Steven mentions in the last post: this was simply a design decision, primarily in the interest of time. We chose to evaluate the case where the query read is represented in the database, which reflects the use case where people are interested in previously described organisms within their sample. We realize that this experimental design surely yields different relative performance among programs compared to a design based on clade-level exclusion of sequences (in which PhymmBL may very well out-compete other programs). However, we consider our design valid for the use case described. In the updated version of the manuscript, there is a note about the clade-level exclusion technique thanks to Steven's feedback.