UTAX accurately predicts taxonomy of marker gene sequences

Robert C. Edgar
Independent Investigator, Tiburon, California, USA.

The UTAX algorithm accurately predicts the taxonomy of 16S ribosomal RNA and other marker gene sequences targeted by next-generation metagenomics experiments. UTAX has comparable sensitivity but much lower error rates compared to most existing methods, predicting dramatically fewer false positives for novel taxa.

Recent studies using next-generation sequencing of marker gene segments include the Human Microbiome Project (HMP) [1] and a survey of the Arabidopsis root microbiome [2]. A fundamental step in such studies is to predict the taxonomy of sequences in the reads, which are typically clustered into Operational Taxonomic Units (OTUs). Computational taxonomy prediction is complicated by the fact that only a small minority of microbial species have authoritative classifications and reference databases have sparse coverage, so that in practice an OTU often does not have an exact match in the database (Supp. Note 3).

With the goal of improving taxonomy prediction accuracy, I developed a new algorithm, UTAX, that accounts for sparseness in the database and for varying correlations between rank and sequence identity in different groups. UTAX calculates a novel score combining k-mer distances to the top hit and to the nearest neighbor at each rank, i.e. the most similar sequence with a different name at that rank. For each rank, the probability that the query belongs to the same group as the top hit is calculated from the distribution of scores over all pairs in the reference database.

Available reference databases for the 16S ribosomal RNA gene (16S) include SILVA [3], Greengenes [4] and the RDP Classifier [5] (RDP) training set. The current RDP training set (v14, here called RDP14) contains 10,679 sequences. Greengenes and SILVA are larger, giving better coverage than RDP14 but not as much as might be expected from the numbers of sequences (Supp. Note 3). Most of the annotations in SILVA and Greengenes are not authoritative classifications but predictions generated by a combination of automated and manual methods [6,7], which I estimated to have error rates of ~6% and ~18% respectively for genus (Supp. Note 4). Also, SILVA and Greengenes are not compatible with some programs because many sequences lack species and genus names (Supp. Note 5). I therefore chose to use RDP14 for comparative validation on 16S and the RDP Warcup training set [8] version 4 (War4) for the fungal internal transcribed spacer (ITS) region.

Given sequences from a biological sample (here called OTUs without necessarily implying clustering) and a reference database, I defined coverage at each taxonomic rank to be the fraction of OTUs that belong to a known group. Here, known means that the group is present in the reference database, regardless of whether the group has been named, and novel means that the group is not present. I defined the lowest known rank (LKR) of an OTU as its lowest rank having at least one reference sequence, and the LKR frequency $\lambda_r$ as the fraction of OTUs having LKR = r. For example, if $\lambda_{\mathrm{genus}} = 0.4$, then 40% of the OTUs belong to a novel species in a known genus. LKR frequencies can be interpreted as a profile summarizing taxonomic novelty in the OTUs with respect to the database.

I estimated LKR frequencies for soil, human gut and mouse gut reads of the 16S V4 region from a recent study [9] using sequence identity thresholds determined by Yarza et al. [10]: 95% for genus, 86% for family, etc. (Supp. Note 7). While identity gives only an approximate indication of rank, averaging over OTUs for a typical sample should give frequencies that are realistic even if not accurate for that particular sample. Using RDP14 as a reference and OTUs constructed by UPARSE [11], I estimated the fraction of OTUs with novel genera to be 83% for soil, 63% for mouse gut and 57% for human gut, showing that coverage is sparse in practice (Fig. 1 and Supp. Notes 7 and 16).
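The threshold-based LKR estimate described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of Supp. Note 7: only the genus (95%) and family (86%) cutoffs are quoted in the text, so the order, class and phylum cutoffs below are placeholder assumptions, and the function names and the `novel_phylum` label are hypothetical.

```python
from collections import Counter

# Identity cutoffs in percent, ordered from the lowest rank to the highest.
# Genus and family are the values quoted in the text; the other three are
# ASSUMED placeholders for illustration only.
THRESHOLDS = [
    ("genus", 95.0),
    ("family", 86.0),
    ("order", 82.0),   # assumption
    ("class", 78.0),   # assumption
    ("phylum", 75.0),  # assumption
]


def estimate_lkr(top_hit_identity, thresholds=THRESHOLDS):
    """Return the lowest rank whose cutoff is met by the top-hit identity."""
    for rank, cutoff in thresholds:
        if top_hit_identity >= cutoff:
            return rank
    return "novel_phylum"  # hypothetical label: no rank is estimated as known


def lkr_frequencies(identities, thresholds=THRESHOLDS):
    """Fraction of OTUs assigned to each estimated LKR."""
    counts = Counter(estimate_lkr(i, thresholds) for i in identities)
    total = len(identities)
    return {rank: n / total for rank, n in counts.items()}


# Top-hit identities (percent) for five hypothetical OTUs.
freqs = lkr_frequencies([99.0, 96.0, 90.0, 80.0, 70.0])
```

With these made-up identities, two of the five OTUs meet the genus cutoff, so the estimated genus frequency is 0.4.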

At high identities and high ranks, an OTU almost certainly belongs to the same group as the top database hit, and at low ranks and low identities, an OTU almost certainly belongs to a different group. The most challenging cases occur when identity is close to the average for the rank, for example attempting to predict genus when identity is ~95%. This is a twilight zone for taxonomy prediction (Fig. 1) analogous to the twilight zone for protein homology prediction [12]. In principle, it might be possible to identify genus-specific sequence features, but not when reference data is too sparse. For example, almost half (913 / 1,948) of the genera in RDP14 have only one reference sequence, and in these cases it is impossible to predict whether a human expert would assign another species to the same group from its sequence alone. Thus, in the twilight zone, predictions of known genera will often be false positives while non-predictions will often be false negatives (see also Supp. Note 15). Identity distributions for typical samples (Fig. 1 and Supp. Fig. SN6.2) show that twilight zone OTUs are common in practice, underscoring the difficulty of accurate taxonomy prediction and the importance of providing a confidence estimate.

With this in mind, I designed UTAX to predict the mean number of errors per query (EPQ) for each rank (see Methods). For testing, I set a threshold of $P = (1 - \mathrm{EPQ}) \geq 0.9$ on the assumption that ~10% is an acceptable error rate for a typical study.

The RDP authors measured accuracy using leave-one-out validation [5], which I believe is inappropriate in this context (Supp. Note 6). I used a different strategy that has been applied to validation of shotgun metagenomics taxonomy prediction [13] by constructing datasets where LKRs are known from trusted annotations, as follows. For k = genus, family, ..., phylum I divided RDP14 into two subsets (rank splits) $X_k$ and $Y_k$ such that the LKR between the subsets is k.
For example, with LKR = family, I discarded families with only one genus and randomly assigned the remaining genera to $X_{\mathrm{family}}$ or $Y_{\mathrm{family}}$ with the constraint that at least one genus from every family must be present in both (Supp. Fig. SN13.1). For each k and for each region of interest (full-length gene, V4 etc.), I measured prediction performance for all ranks using $X_k$ as the query and $Y_k$ as the reference and vice versa. I included a null split $X_N = Y_N$ = RDP14 to measure performance when the sequence is known. I followed the same procedure for War4.

For every split at rank k I calculated the following accuracy metrics for each rank r (see Supp. Note 8 for discussion). Sensitivity ($S_{rk}$) is the fraction of known names at rank r that were correctly predicted. The misclassification error rate ($M_{rk}$) is the fraction of known names at rank r that were incorrectly predicted. The overclassification error rate ($O_{rk}$) is the fraction of novel r's that were incorrectly predicted to be known. Given the LKR frequencies $\lambda_k$, the total sensitivity $\mathrm{Sens}_r$ and errors per query $\mathrm{EPQ}_r$ at rank r for a set of OTUs can be estimated by assuming that the sensitivities and error rates at each rank are approximately the same as those measured on the rank splits:

$\mathrm{Sens}_r = \sum_k \lambda_k S_{rk}$, (Eq. 1)

$\mathrm{EPQ}_r = \sum_k \lambda_k (O_{rk} + M_{rk})$. (Eq. 2)

To obtain sensitivities and error rates for typical data, I used Eqs. 1 and 2 with the estimated LKR frequencies for the soil, human gut and mouse gut OTUs. While the frequencies may be inaccurate for those samples, and the sensitivities and error rates for each LKR in a given set of OTUs may differ somewhat from those measured on the rank splits, this procedure should nevertheless give good estimates in the sense that they fall comfortably within the range of true values for typical data in practice, giving a far more realistic indication of algorithm accuracy than leave-one-out testing (Supp. Note 6).

Using this method, I compared the accuracy of UTAX with GAST [14], RDP and methods supported by mothur [15] and QIIME [16] (see Supp. Note 11 for method name abbreviations, software versions and command lines). Representative results are given in Table 1; the underlying performance metrics are given in the Supplementary Files and Supp. Note 12. Mothur-rdp gave very similar results to RDP (Supp. Note 1). The only method to consistently achieve an estimated $\mathrm{EPQ}_{\mathrm{genus}}$ below 10% was mothur-knn, but its sensitivity was also much lower than the other methods ($\mathrm{Sens}_{\mathrm{genus}}$ < 40% on all samples). The estimated $\mathrm{EPQ}_{\mathrm{genus}}$ of UTAX was ~10% on all three samples, remarkably close to the rate predicted by the $P \geq 0.9$ threshold given that P is calculated by an independent method that does not use identity thresholds or rank splits (Methods). All other algorithms had substantially higher $\mathrm{EPQ}_{\mathrm{genus}}$, ranging from ~17% for RDP at 80% bootstrap to QIIME-blast, which consistently had the highest error rate ($\mathrm{EPQ}_{\mathrm{genus}}$ 62% to 78%). The default QIIME method, QIIME-uc, had $\mathrm{EPQ}_{\mathrm{genus}}$ = 39% to 45% and QIIME-rdp, which sets the bootstrap cutoff at 50% by default, had $\mathrm{EPQ}_{\mathrm{genus}}$ = 36% to 40%. $\mathrm{Sens}_{\mathrm{phylum}}$ was >90% for all methods except QIIME-uc (78% on soil, 87% on mouse gut) and QIIME-sm (79% on soil, 87% on mouse gut).

Methods

Given a pair of sequences Q and R, I defined the lowest common rank (LCR) of Q and R to be the lowest rank where Q and R have the same name. Given a similarity measure d(Q, R, k), $P(\mathrm{LCR}=k \mid d)$ is the probability that the LCR is k. For example, if d is sequence identity then $P(\mathrm{LCR}=\mathrm{phylum} \mid d=93\%)$ will be close to one but $P(\mathrm{LCR}=\mathrm{genus} \mid d=93\%)$ will be lower. To obtain a discrete range, UTAX converts a real-valued similarity d taking values zero to one to an integer percentage $D = 100d$. Considering all pairs of sequences in a reference database B, let the number of pairs with a given D be $H_D$ and the number of those pairs with LCR = k be $h_{D,k}$. UTAX calculates an a-posteriori estimate for $P(\mathrm{LCR}=k \mid D)$ from B as the fraction of pairs having similarity D which also have LCR = k, i.e.

$P(\mathrm{LCR}=k \mid D) \approx h_{D,k} / H_D$. (Eq. 3)

For motivation and visualization of Eq. 3 see Supp. Note 9. UTAX calculates the matrix $C_{D,k} = h_{D,k} / H_D$ from B and stores it for use in run-time prediction.

Let $P(\mathrm{CR}(k) \mid D)$ be the probability that two sequences have a common rank at level k, i.e. have the same name at that rank. Let taxon(Q, k) be the name of Q at rank k. Q and R have the same name at rank k if their LCR is not > k, hence

$P(\mathrm{CR}(k) \mid D) = P(\mathrm{taxon}(Q, k) = \mathrm{taxon}(R, k) \mid D) = 1 - P(\mathrm{LCR}(Q, R) > k \mid D) = 1 - \sum_{r > k} C_{D,r}$. (Eq. 4)

Thus, given a reference sequence R and an integer similarity D, Eq. 4 gives the probability that the taxon name of Q is the same as R at rank k. This gives a framework for constructing a taxonomy prediction algorithm based on a similarity measure d. Natural choices for d include identity calculated from an alignment or a word-counting distance. However, these would not take into account that the correlation between similarity and rank varies in different groups due to differing evolutionary rates and lumping or splitting by taxonomists. I therefore also considered the similarity of a reference sequence R with its nearest neighbor $\mathrm{NN}_k(R)$ for each k, i.e. the sequence in B with highest similarity to R and a different name at rank k. If $\mathrm{NN}_k(R)$ is close to R, then the confidence that taxon(Q, k) = taxon(R, k) should be reduced because of the increased likelihood that taxon(Q, k) = taxon($\mathrm{NN}_k(R)$, k).

I chose to use similarities calculated from the set $w_8(Q)$ of 8-mers in Q. I defined the unique word similarity (U) of a pair of sequences Q and R as

$U(Q, R) = |w_8(Q) \cap w_8(R)| / \min(|w_8(Q)|, |w_8(R)|)$. (Eq. 5)

I designed a similarity measure ($d_{\mathrm{utax}}$) that increases with higher similarity between Q and R, decreases with higher similarity between R and $\mathrm{NN}_k(R)$, and takes real values between zero and one,

$d_{\mathrm{utax}}(Q, R, k) = (2\,U(Q, R) - U(R, \mathrm{NN}_k(R)))/2$. (Eq. 6)

(See Supp. Note 14 for comparison with other measures.) Given a query sequence Q, UTAX identifies the top hit T by unique word similarity, i.e. $T = \arg\max_R \{ U(Q, R) : R \in B \}$. The rank names of Q are predicted to be the same as those of T with probabilities calculated by Eq. 4 using the $d_{\mathrm{utax}}$ similarity measure.
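The quantities defined in the Methods can be sketched in Python as follows. This is an illustration under stated assumptions, not the UTAX implementation: all function names are hypothetical, the `root` pseudo-rank (used for pairs that share no name at any rank) is an assumption, D is obtained by rounding, and Eq. 6 is read as (2 U(Q, R) - U(R, NN_k(R)))/2.

```python
from collections import defaultdict

# Ranks ordered from lowest to highest. "root" is an ASSUMED pseudo-rank for
# pairs of sequences that do not share a name at any real rank.
RANKS = ["genus", "family", "order", "class", "phylum", "root"]


def words(seq, k=8):
    """Set of distinct k-mers in a sequence (w8 in the text when k = 8)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_word_similarity(q, r, k=8):
    """Eq. 5: shared distinct words divided by the smaller word-set size."""
    wq, wr = words(q, k), words(r, k)
    if not wq or not wr:
        return 0.0
    return len(wq & wr) / min(len(wq), len(wr))


def d_utax(u_qr, u_r_nn):
    """Eq. 6 as read here: (2 * U(Q, R) - U(R, NN_k(R))) / 2."""
    return (2.0 * u_qr - u_r_nn) / 2.0


def lowest_common_rank(tax_a, tax_b):
    """Lowest rank at which two {rank: name} annotations agree."""
    for rank in RANKS[:-1]:
        if tax_a[rank] == tax_b[rank]:
            return rank
    return "root"


def lcr_matrix(observations):
    """Eq. 3: C[D][k] = h_{D,k} / H_D from (similarity, LCR) pairs."""
    h = defaultdict(lambda: defaultdict(int))
    for similarity, rank in observations:
        h[int(round(100 * similarity))][rank] += 1
    return {D: {k: n / sum(row.values()) for k, n in row.items()}
            for D, row in h.items()}


def p_common_rank(C, D, k):
    """Eq. 4: one minus the probability that the LCR is above rank k."""
    higher = RANKS[RANKS.index(k) + 1:]
    return 1.0 - sum(C[D].get(r, 0.0) for r in higher)


def top_hit(query, references):
    """Reference sequence with the highest unique word similarity to the query."""
    return max(references, key=lambda r: unique_word_similarity(query, r))


# Four hypothetical reference pairs, all with similarity 0.95.
C = lcr_matrix([(0.95, "genus"), (0.95, "family"),
                (0.95, "family"), (0.95, "order")])
p_genus = p_common_rank(C, 95, "genus")    # 1 - (0.5 + 0.25)
p_family = p_common_rank(C, 95, "family")  # 1 - 0.25
```

In this toy matrix, a query whose top hit scores D = 95 would be assigned the hit's family with probability 0.75 but its genus with probability only 0.25, which is the kind of rank-dependent confidence that Eq. 4 is meant to provide.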

Figures and tables

Fig. 1. Estimated Lowest Known Ranks (LKRs) for soil OTUs. The upper graph shows lowest common rank (LCR) probabilities as a function of sequence identity calculated for the V4 region of RDP14 (using Eq. 3, see also Supp. Note 9). The lower histogram shows frequencies of integer-rounded sequence identities of top hits of OTUs to the RDP14 database. Histogram bars are colored to indicate estimated LKRs according to the Yarza thresholds. While identity thresholds are not reliable indicators of rank, the fraction of OTUs in a Yarza identity range nevertheless gives a realistic indication of how many OTUs with the corresponding LKR might be found in a similar sample. The "twilight zone" is a region around 95% identity where high sensitivity for genus prediction cannot be achieved without high false positive rates. This is because if the closest reference sequence has ~95% identity, then it is unlikely that there are enough training examples to identify genus-specific sequence features, and identity correlates only approximately with taxonomic rank, noting e.g. that $P(\mathrm{LCR}=\mathrm{genus} \mid 95\%) = 0.34$, $P(\mathrm{LCR}=\mathrm{family} \mid 95\%) = 0.33$ and $P(\mathrm{LCR}=\mathrm{order} \mid 95\%) = 0.23$.

Table 1. Estimated accuracy for soil, mouse gut and human gut OTUs. The table shows estimated sensitivity and errors per query (EPQ) for genus and phylum predictions, expressed as percentages. Error rates >10% are highlighted yellow and >30% magenta. Genus sensitivities <50% are highlighted magenta and phylum sensitivities <90% yellow. Results for UTAX are shown for threshold $P \geq 0.9$. Results for RDP are shown with 80% bootstrap cutoff (recommended by the authors) and 50% bootstrap (the default for QIIME-rdp).
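The estimates reported in Table 1 combine LKR frequencies with per-split rates according to Eqs. 1 and 2. A minimal Python sketch of that weighted sum is given below; the function name, the `known` key for the null split and every number are hypothetical values chosen for illustration, not results from this work.

```python
def estimate_sens_epq(lkr_freq, sens, over, mis):
    """Apply Eqs. 1 and 2 for a single predicted rank r.

    lkr_freq[k] is the LKR frequency for split k; sens[k], over[k] and mis[k]
    are the sensitivity, overclassification rate and misclassification rate
    at rank r measured on the split with LKR = k.
    """
    sens_r = sum(lkr_freq[k] * sens[k] for k in lkr_freq)
    epq_r = sum(lkr_freq[k] * (over[k] + mis[k]) for k in lkr_freq)
    return sens_r, epq_r


# HYPOTHETICAL inputs for r = genus. When the LKR is family, the genus is
# novel, so it cannot be predicted correctly and every genus prediction is an
# overclassification error.
lkr_freq = {"known": 0.25, "genus": 0.25, "family": 0.5}
sens = {"known": 0.8, "genus": 0.4, "family": 0.0}
over = {"known": 0.0, "genus": 0.0, "family": 0.2}
mis = {"known": 0.04, "genus": 0.08, "family": 0.0}

sens_genus, epq_genus = estimate_sens_epq(lkr_freq, sens, over, mis)
```

With these inputs the estimated genus sensitivity is 0.30 and the estimated genus EPQ is 0.13, with most of the errors contributed by overclassification of OTUs whose genus is novel.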

References

1. HMP Consortium. Structure, function and diversity of the healthy human microbiome. Nature 486, (2012).
2. Lundberg, D. S. et al. Defining the core Arabidopsis thaliana root microbiome. Nature 488, (2012).
3. Pruesse, E. et al. SILVA: A comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 35, (2007).
4. DeSantis, T. Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol. 72, (2006).
5. Wang, Q., Garrity, G. M., Tiedje, J. M. & Cole, J. R. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl. Environ. Microbiol. 73, (2007).
6. Yilmaz, P. et al. The SILVA and all-species Living Tree Project (LTP) taxonomic frameworks. Nucleic Acids Res. 42, (2014).
7. McDonald, D. et al. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J 6, (2012).
8. Deshpande, V. et al. Fungal identification using a Bayesian classifier and the Warcup training set of internal transcribed spacer sequences. Mycologia (2015). doi: /
9. Kozich, J. J., Westcott, S. L., Baxter, N. T., Highlander, S. K. & Schloss, P. D. Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Appl. Environ. Microbiol. 79, (2013).
10. Yarza, P. et al. Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nat. Rev. Microbiol. 12, (2014).
11. Edgar, R. C. UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nat. Methods 10, (2013).
12. Rost, B. Twilight zone of protein sequence alignments. Protein Eng. 12, (1999).
13. Patil, K. R. et al. Taxonomic metagenome sequence assignment with structured output models. Nat. Methods 8, (2011).
14. Huse, S. M. et al. Exploring microbial diversity and taxonomy using SSU rRNA hypervariable tag sequencing. PLoS Genet. 4, e (2008).
15. Schloss, P. D. et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol. 75, (2009).
16. Caporaso, J. G. et al. QIIME allows analysis of high-throughput community sequencing data. Nat. Methods 7, (2010).

Author contributions

R.C.E. conceived of the study, performed the analysis and wrote the manuscript.

Supplementary Notes

Note 1. Mothur-rdp is effectively equivalent to RDP.
Note 2. Genus predictions for the Soil86 set.
Note 3. Coverage of SILVA and Greengenes.
Note 4. Error rates of SILVA and Greengenes taxonomy annotations.
Note 5. Reference database compatibility.
Note 6. Leave-one-out and leave-10%-out validation.
Note 7. Estimated LKR frequencies for in vivo samples.
Note 8. Accuracy metrics for taxonomy prediction.
Note 9. Calculation of LCR probabilities and S/E.
Note 10. Compute time and memory use of the tested methods.
Note 11. Software versions and command lines.
Note 12. Performance metrics on RDP14 and War4.
Note 13. Construction of a rank split.
Note 14. Sensitivity/EPQ plots for similarity measures.
Note 15. Non-predictions and blank names.
Note 16. LKR estimates and OTU error rates.
Supplementary References

Note 1. Mothur-rdp is effectively equivalent to RDP.

I compared the predictions of mothur-rdp and RDP for all rank splits of the RDP14 V4 region. At a bootstrap cutoff of 80%, 241,585 taxon names were predicted by one or both algorithms. Of these, 234,877 (97%) were identical. At 50% bootstrap, 280,334 / 291,705 (96%) were identical. A rate of disagreement of 3 to 4% is consistent with differences due to the use of random numbers in the bootstrapping procedure. I concluded that mothur-rdp and RDP are effectively equivalent implementations of the same algorithm and did not consider mothur-rdp separately for the rest of this work.

Note 2. Genus predictions for the Soil86 dataset.

Method                 Genus predictions
UTAX                   0
QIIME-uc               3 (0.1%)
QIIME-sm               19 (0.5%)
mothur-knn             283 (8%)
RDP (80% bootstrap)    561 (15%)
RDP (50% bootstrap)    942 (26%)
GAST                   3,048 (84%)
QIIME-blast            3,637 (89%)

Table SN2.1. Genus predictions on the Soil86 dataset. Soil86 contains 3,637 UPARSE OTU sequences from the soil sample of Kozich et al. [1] with 86% identity to the RDP14 reference database, suggesting a lowest known rank of order or higher. Few genus predictions would be expected for this set considering that the Yarza threshold is 95% for genus, but some methods predicted many genera. QIIME-blast predicted the most, assigning genera to 89% of the sequences.

Note 3. Coverage of the Greengenes and SILVA databases.

The Greengenes [2] and SILVA [3] reference databases are larger than RDP14: Greengenes v.13.5 has sequences with taxonomy annotations and SILVA v123 has , compared to for RDP14. The default database for the QIIME methods is a subset of Greengenes (GG-QIIME, 99,322 sequences) obtained by clustering at 97% identity, while one of several suggested databases for use with mothur is a subset of SILVA (SILVA-mothur, 172,418 sequences) [ retrieved 12th Dec 2015]. The GG-QIIME and SILVA-mothur databases thus have an order of magnitude more annotated sequences than RDP14 and a priori might provide a better reference set for compatible algorithms, noting that RDP is not compatible because it requires names for the lowest rank for all training sequences (Note 5) while most sequences in GG-QIIME and SILVA-mothur lack species and genus names.

Fig. SN3.1 shows the identity distributions for soil OTUs against GG-QIIME and SILVA-mothur compared with RDP14, showing that GG-QIIME and SILVA-mothur have less sparse coverage than RDP14 though there are still many OTUs with estimated LKR > genus. Coverage is less sparse in the sense that there are more OTUs with high identities and fewer with low identities, and this gives the appearance of more known ranks. However, while almost all ranks are named in RDP14 (Note 11), the interpretation of lowest known rank is different for GG-QIIME and SILVA-mothur where most sequences lack names for low ranks, so "known" (present in the database) does not necessarily imply "named". Most annotations in those databases were predicted using sequence analysis methods, so "named" does not imply "authoritatively named" by conventional standards. I estimate the genus annotation error rate to be ~6% for SILVA-mothur and ~18% for GG-QIIME (Note 4).
It is therefore difficult to assess whether using one of the larger databases improves or degrades prediction accuracy for a given algorithm compared to using RDP14, but especially in the case of GG-QIIME it appears that the annotation error rate of the database may be high enough to substantially degrade prediction performance, noting that annotation errors of the database will be compounded by the inherent error rate of a prediction algorithm, and confidence will be systematically overestimated because the database error rate is not considered.

Fig. SN3.1. Identity distributions of soil OTUs. Histograms show frequencies of integer-rounded sequence identities of top hits of OTUs to RDP14, GG-QIIME (the subset of Greengenes which is the default reference in QIIME) and SILVA-mothur, one of the reference databases provided for use by mothur. Colors indicate estimated lowest known ranks according to the Yarza thresholds (see main text for methods).

Note 4. Error rates of Greengenes and SILVA taxonomy annotations.

Henri Poincaré famously described mathematics as the art of giving the same name to different things [4]. In taxonomy, this is a bad idea. Most taxonomy annotations in the Greengenes and SILVA databases were predicted for uncultured sequences using a combination of automated and manual methods [5,6]. I don't fully understand their guiding principles or exactly how they were implemented, but presumably they work something like the following. The starting point is a set of sequences obtained from authoritatively classified organisms (gold-standard sequences). Other annotations are made using a predicted phylogenetic tree. If a non-gold sequence is in the same subtree as a gold sequence at a given rank, the name at that rank is inferred to be the same. To the best of my knowledge, neither Greengenes nor SILVA documents which sequences were used as gold standards or the evidence supporting a given annotation (is it a gold-standard sequence? an automated prediction? an automated prediction which was manually adjusted, and if so why?), making the reliability of any given annotation difficult to evaluate or verify independently.

There are several differences in taxonomic nomenclatures and procedures for reconciling conflicts between taxonomy and sequence evidence. Greengenes is based on the NCBI taxonomy, RDP14 on Bergey's [7] and SILVA on LSPN [8]. While RDP14 strictly adheres to Bergey's to the best of my knowledge, Greengenes and SILVA modify their base taxonomies to address inconsistencies with phylogenies determined from sequence. For example, Greengenes deletes the genera Escherichia and Shigella, which are believed to overlap [9], leaving their sequences classified to family level only (Enterobacteriaceae).
SILVA deals with this issue in a different way by defining a combined genus (Escherichia-Shigella) and retaining well-known species names such as Escherichia coli, while Greengenes leaves their species names blank.

Both databases maintain large multiple alignments of 16S sequences, many of which have incorrect and ambiguous bases and some of which are undetected chimeras [10]. The Greengenes alignment is fixed at 7,682 columns using the NAST approach [2] which intentionally introduces misalignments (i.e., errors) to avoid increasing the number of columns. Construction of RNA alignments is challenging, especially for large and diverse datasets, and the best current alignment algorithms have substantial error rates when challenged with highly diverged sequences [11]. Perfect tree inference from a sequence alignment is generally not possible due to alignment errors and information loss [12]. Tree construction error rates are difficult to estimate but can be substantial on large datasets [13].

Given these issues, it is plausible that the Greengenes and SILVA trees could have substantial error rates, raising the question whether these, perhaps together with other imperfections in their annotation methods, have caused substantial numbers of taxonomy annotation errors. This cannot be assessed directly because the ground truth is not known. Instead, I identified errors by noting that annotations for identical sequences should agree, so if two databases have different annotations for the same sequence then one or both of them must be wrong. Implementing this analysis is complicated by the fact that the databases use taxonomic systems with different sets of names. Another complication is the interpretation of blank names. Does a blank name indicate assignment to a sub-tree that has not been named, that a name cannot be assigned due to overlapping named groups (like Escherichia-Shigella), or low confidence in a prediction (i.e., the name might be known, or there are two candidate known names which do not overlap but which are hard to distinguish)? (See also Note 15.)
In consideration of these issues, I counted only names used by both systems (common names), excluding names which do not correspond to clades such as unclassified, uncultured, candidatus and incertae sedis. If one or both names were blank, the pair was not counted.
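The counting rule described above can be sketched in Python as follows. This is an illustration under assumptions rather than the script used for this analysis: the data layout (sequence mapped to a dictionary of rank names), the function name and the substring test for non-clade labels are all hypothetical.

```python
# Labels that do not correspond to clades, as listed in the text.
NON_CLADE = ("unclassified", "uncultured", "candidatus", "incertae sedis")


def is_clade_name(name):
    """True if the name contains none of the non-clade labels (ASSUMED test)."""
    lowered = name.lower()
    return not any(label in lowered for label in NON_CLADE)


def compare_annotations(db_a, db_b, names_a, names_b, rank):
    """Count same and different names at one rank for identical sequences.

    db_a, db_b: dict mapping sequence -> {rank: name}
    names_a, names_b: all names used at this rank by each taxonomy
    """
    common = {n for n in names_a & names_b if is_clade_name(n)}
    same = different = 0
    for seq in db_a.keys() & db_b.keys():
        a = db_a[seq].get(rank, "")
        b = db_b[seq].get(rank, "")
        if not a or not b:
            continue  # one or both names blank: pair not counted
        if a not in common or b not in common:
            continue  # name is not used by both systems
        if a == b:
            same += 1
        else:
            different += 1
    return same, different


# Tiny HYPOTHETICAL example with four sequences present in both databases.
names = {"Bacillus", "Escherichia", "Shigella", "Vibrio", "uncultured"}
db_a = {"AAA": {"genus": "Bacillus"}, "CCC": {"genus": "Escherichia"},
        "GGG": {"genus": ""}, "TTT": {"genus": "uncultured"}}
db_b = {"AAA": {"genus": "Bacillus"}, "CCC": {"genus": "Shigella"},
        "GGG": {"genus": "Vibrio"}, "TTT": {"genus": "uncultured"}}
same, different = compare_annotations(db_a, db_b, names, names, "genus")
```

Here the blank name and the "uncultured" label are skipped, leaving one agreement and one disagreement.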

Results are summarized in Table SN4.1, which shows that SILVA-mothur and GG-QIIME disagree on 24% of genus annotations and 2% of phylum annotations for identical sequences. This provides a lower bound on the sum of the annotation error rates for both databases. The lower bound is achieved when every incorrect annotation is correct in the other database. It should be rare for annotations to be wrong in both databases by chance (if errors are random at a rate of ~10%, then ~1% will be wrong in both). Given that distinctly different methods are used for alignment and tree construction, I would guess that the errors have low correlation between the databases and the true combined rate is close to this lower bound.

A pair-wise comparison measures the combined error rate without indicating the relative rate, i.e. whether one database has a higher or lower error rate than the other. This can be investigated using pair-wise comparisons with a third database, RDP14. Genus annotation disagreement rates with RDP14 are 11% for GG-QIIME and 3% for SILVA-mothur. This indicates that GG-QIIME has a higher error rate than SILVA-mothur because the error rate of RDP14 should be roughly the same in both pair-wise comparisons, adding approximately the same term to both combined rates. Also, all RDP14 sequences have genus annotations and its much smaller size is more amenable to curation, suggesting that it has a high frequency of gold-standard sequences and is likely to have a much lower error rate. This hypothesis is supported by the lower pair-wise disagreements of RDP with the other two databases.

If we assume that the error rate of RDP14 is smaller than that of the other two databases, then we can infer that the error rate of GG-QIIME is roughly 11% / 3% ≈ 3 to 4 times larger than that of SILVA-mothur. Assuming a factor of three implies that the total error rates are 24% × 3/4 = 18% for GG-QIIME and 24% × 1/4 = 6% for SILVA-mothur. While these estimates are uncertain, the combined rate of 24% is robust and it is reasonable to conclude that the minimum plausible genus annotation error rates are 5% for SILVA-mothur (minimum determined by assuming a maximum of 4 times more errors in GG-QIIME) and 12% for GG-QIIME (minimum determined as half of the 24% combined rate, given that the comparison with RDP14 indicates a higher rate for GG-QIIME).
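The arithmetic behind these estimates can be checked with a short Python sketch. The function name is hypothetical, and the calculation assumes, as in the text, that the combined disagreement rate is the sum of the two error rates and that one database has `ratio` times the error rate of the other.

```python
def split_combined_rate(combined, ratio):
    """Split a combined error rate between two databases.

    Assumes combined = high + low and high = ratio * low.
    Returns (high, low) in the same units as `combined`.
    """
    low = combined / (1.0 + ratio)
    high = combined - low
    return high, low


# Combined genus disagreement of 24% with GG-QIIME assumed to have three
# times the error rate of SILVA-mothur: 18% and 6%.
gg_rate, silva_rate = split_combined_rate(24.0, 3.0)

# A ratio of four gives the lower end for SILVA-mothur: 24% / 5 = 4.8%,
# quoted as ~5% in the text.
_, silva_min = split_combined_rate(24.0, 4.0)
```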

1. GG-QIIME and SILVA-mothur

Rank      Common Names    Same Name    Different Name
Phylum                    (98.3%)      481 (1.7%)
Class                     (88.2%)      1201 (4.9%)
Order                     (78.1%)      2804 (12.8%)
Family                    (83.1%)      1428 (9.0%)
Genus                     (69.2%)      1868 (24.1%)

2. GG-QIIME and RDP14

Rank      Common Names    Same Name    Different Name
Phylum                    (99.6%)      2 (0.4%)
Class                     (95.3%)      27 (1.5%)
Order                     (88.6%)      79 (4.4%)
Family                    (92.1%)      78 (5.0%)
Genus                     (89.2%)      151 (10.8%)

3. SILVA-mothur and RDP14

Rank      Common Names    Same Name    Different Name
Phylum                    (99.8%)      2 (0.2%)
Class                     (99.4%)      17 (0.4%)
Order                     (93.7%)      57 (1.7%)
Family                    (94.8%)      141 (3.3%)
Genus                     (97.3%)      124 (2.7%)

Table SN4.1. Pair-wise comparisons of taxonomy annotations. The table shows the rate of agreement and disagreement between taxonomy annotations for identical sequences found in each pair of reference databases. Common Names is the number of identical sequences having a common name for the given rank in one or both databases, Same Name is the number of these sequences for which the name was the same and Different Name is the number for which the name was different. A common name is a taxon name found in the taxonomy systems for both databases.

Note 5. Reference database compatibility.

The tested programs place different constraints on taxonomy annotations. Mothur does not allow a species name, which ruled out testing at species rank on War4. RDP requires that the lowest rank is named for all reference sequences, which ruled out testing on Greengenes or SILVA where genus and species names are often omitted. The mothur reimplementation of the RDP algorithm does allow missing genus names. RDP14 includes reference sequences with optional ranks (suborder and subclass) and missing ranks (e.g., sometimes only phylum and genus are specified with no intermediate ranks). These variations are supported by RDP but not by some other programs. UTAX requires that names correspond to clades so that the LCR can be determined for all pairs of sequences. This means that names such as unclassified, uncultured, candidatus and incertae sedis should be excluded for training.

I therefore constructed subsets of the reference databases with taxonomies that were compatible with all programs to enable testing on the same reference data. This was done by filtering out special cases such as "uncultured", deleting optional ranks (suborder, subclass) and discarding annotations with any missing or blank names for required ranks (genus, family, class, order and phylum for RDP14 and species, family, class, order and phylum for War4). This required discarding 506 / 10,049 sequences (5%) from RDP14 and 9,546 / 24,500 (40%) from War4. The compatible versions of the reference databases are included in the Supplementary Files.
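The filtering steps above can be sketched in Python as follows. This is a hedged illustration, not the script used to build the Supplementary Files: the annotation layout (a dictionary from rank to name), the function name, and the choice to discard a whole annotation when a required rank carries a special-case label are assumptions.

```python
SPECIAL = ("unclassified", "uncultured", "candidatus", "incertae sedis")
OPTIONAL_RANKS = {"suborder", "subclass"}
REQUIRED_RDP14 = ("phylum", "class", "order", "family", "genus")


def make_compatible(taxonomy, required=REQUIRED_RDP14):
    """Return a cleaned {rank: name} dict, or None if it must be discarded.

    Optional ranks are deleted; the annotation is discarded when a required
    rank is missing, blank, or carries a special-case label (ASSUMED policy).
    """
    cleaned = {rank: name.strip() for rank, name in taxonomy.items()
               if rank not in OPTIONAL_RANKS}
    for rank in required:
        name = cleaned.get(rank, "")
        if not name or any(tag in name.lower() for tag in SPECIAL):
            return None
    return {rank: cleaned[rank] for rank in required}


# HYPOTHETICAL annotation with an optional rank that is dropped.
example = make_compatible({
    "phylum": "Firmicutes", "class": "Bacilli", "subclass": "Bacillidae",
    "order": "Bacillales", "family": "Bacillaceae", "genus": "Bacillus",
})
```

Passing a different `required` tuple (for example with species in place of genus) would give the War4 variant of the same filter.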

Note 6. Leave-one-out and leave-10%-out validation.

In their 2007 paper describing the RDP Naive Bayesian Classifier [14], Wang et al. state in the Abstract that "results from leave-one-out testing show that the overall accuracies at all levels of confidence for near-full-length and 400-base segments were 89% or above down to the genus level". In my opinion, this approach is not appropriate for microbial taxonomy prediction because an informative leave-one-out validation requires that all categories are known and training data is dense (Fig. SN6.1). With microbial taxonomy, training data is sparse and many microbial genera and higher ranks are novel in typical data (Fig. SN6.2). In addition, accuracy was measured using a bootstrap cutoff of zero rather than the authors' recommended cutoff of 80%. Roughly half of the genera in RDP14 have only a single sequence (913/1,948, Table SN6.1) and therefore cannot be predicted if left out of the training set, but this is not taken into account. Accuracy as measured by this test is thus the maximum possible sensitivity in a scenario where a large majority of query sequences have identity >97% (Fig. SN6.2), which is unrealistic, and where the maximum achievable accuracy is not 100% as would be expected by convention. At RDP14 genus level, RDP and UTAX have 86% accuracy by this definition, close to the maximum possible of 91% (Table SN6.1), as would be expected for sequences with >97% identity. The observation that accuracy is less than 100% is mostly explained by classifications that are impossible due to singletons (9%) with a smaller contribution by misclassification errors (5%). It is therefore clear that accuracy as measured by the RDP leave-one-out test methodology is not predictive of sensitivity or error rates on typical biological data. Leave-one-out accuracies for RDP and UTAX are reported in Table SN6.2.

In a recent preprint [ Bokulich et al.
describe a taxonomy prediction validation framework designed to enable reproducible results. I was unable to install the framework or download the test data. The framework has several dependencies on third-party code including Python packages which failed to install. One of the described tests uses leave-10%-out validation where 10% of sequences are extracted from Greengenes for use as a query set with the remaining sequences used as a reference. I followed the methodology described in the preprint by extracting the V4

22 region of Greengenes v13.5 using the 515F/806R primers and extracting 10% subsets chosen at random. I found the identity distribution shown in Fig SN6.2 (lower-right) which shows that a large majority of sequences in the query sets have 99% identity with their corresponding reference sets. This distribution is even more strongly skewed towards 100% identity than the RDP leave-one-out test, which is explained by stronger sampling biases; for example, the most abundant genus in Greengenes v13.5 is Staphylococcus with 135,711 sequences, comprising more than 10% of the database. Therefore, this test is not predictive of sensitivity and error rates on typical biological data.
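The ceiling that singletons place on leave-one-out accuracy can be illustrated with a short calculation. This is a sketch under the assumption that each reference sequence carries one label at the rank of interest; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def max_leave_one_out_accuracy(labels):
    """Upper bound on leave-one-out accuracy for one rank.

    A sequence whose label occurs only once cannot be predicted when it is
    left out, so the bound is 1 - (singleton sequences / all sequences).
    """
    counts = Counter(labels)
    singletons = sum(1 for label in labels if counts[label] == 1)
    return 1.0 - singletons / len(labels)


# Toy genus labels (hypothetical): genera C and D are singletons.
toy_labels = ["A", "A", "A", "B", "B", "C", "D"]
print(f"max accuracy = {max_leave_one_out_accuracy(toy_labels):.3f}")
```

In this toy set two of seven sequences are singletons, so even a perfect classifier cannot exceed 5/7 by the leave-one-out definition.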

Fig. SN6.1. Microbial taxonomy prediction is not a textbook problem. In a textbook classification problem (left), all categories are known (handwritten digits, in this example) and have many training examples. Leave-one-out and leave-10%-out validation are informative in a textbook case because they are realistic models of classification in practice. With microbial taxonomy, reference data is sparse (right). In this analogy, the task of an algorithm is to predict handwritten characters when the full alphabet is not known and training data is sparse. If leave-one-out validation is used, the algorithm is not challenged by realistic amounts of novel data (9, A, B, ...). Characters with only one training example (4 through 8) cannot be predicted when they are left out. If accuracy is measured as the fraction of characters that are correctly predicted in a leave-one-out test, the highest possible accuracy is less than 100% due to the singletons. Taxonomy has additional complications. There is strong sampling bias in the reference data, e.g., human pathogens are overrepresented (like digits 0, 1 and 2 on the right). Some training examples have multiple labels because multiple genera can have the same V4 sequence, analogous to the problem that 0 and I can be digits or letters. Even if only one genus is known for a given V4 sequence, a novel genus in the same family might have the same sequence, so a prediction of genus for that sequence should have <100% confidence.

Fig. SN6.2. LKRs for in vivo samples, leave-one-out and leave-10%-out test data. This figure compares the identity distribution of soil, mouse gut and human gut OTUs (left) with the identity distribution of query-reference pairs used in the RDP leave-one-out test and the Bokulich et al. leave-10%-out test on the 16S V4 region (right). Colors show lowest known ranks (LKRs) estimated using Yarza identities as described in the main text. In the distributions for the validation tests, a large majority of query sequences have >97% identity to the reference set (right), while in practice most sequences belong to novel genera (left).

[Table SN6.1 layout: rows are Phylum, Class, Order, Family, Genus and Species; columns give Names, Singletons and Max. acc. for War4 and for RDP14. Most cell values are missing from this copy; the only complete surviving entry is 7,390 names in the Species row.]

Table SN6.1. Leave-one-out maximum accuracy. The table shows the maximum possible accuracy of leave-one-out tests on the War4 (ITS) and RDP14 (16S) training sets, which are the defaults currently used by RDP. Names is the number of taxon names in the training set. Singletons is the number of names having exactly one training sequence, which therefore cannot be predicted when left out. Max. acc. is the maximum possible accuracy by the RDP definition, which is <100% when there are singletons in the training set. Since there are singletons at all ranks, the maximum accuracy is always <100% but appears as 100% in some cases because values are shown to three significant figures.

[Table SN6.2 layout: columns are Reference, Method, Phylum, Class, Order, Family, Genus and Species; the references are War4 (ITS1), War4 (ITS2), War4 (full-length), RDP14 (V4) and RDP14 (full-length), each with one row for RDP and one for UTAX. The cell values are missing from this copy.]

Table SN6.2. Leave-one-out results for War4 and RDP14. The table shows accuracy as defined by the RDP leave-one-out methodology, i.e. the fraction of query sequences for which the rank is correctly predicted at >0% bootstrap confidence for RDP and P>0 for UTAX. The maximum possible accuracy by this definition is <100% when there are singleton taxa (i.e., those having only one reference sequence). At RDP14 genus level, RDP and UTAX have 86% accuracy, close to the maximum possible of 91% (Table SN6.1). Singletons in the reference database thus reduce accuracy below 100% more than misclassification errors by the algorithms.

Note 7. Estimated LKR frequencies for in vivo samples. Prediction error rates for known and novel taxa respectively were measured using data for which LKRs are inferred from authoritative annotations. However, these rates do not directly indicate overall error rates for typical biological samples. For example, if most genera in a given sample are known, then most errors will be due to misclassifications and the overclassification rate for genus will be largely irrelevant, but if novel genera are common, then the genus overclassification rate is important (see Note 8 for definitions). Thus, in order to estimate realistic error rates for typical data, we also need to determine realistic rates of novelty, i.e. realistic LKR frequencies. Once we have LKR frequencies, overall sensitivity and error rates can be estimated by summing over all ranks (Eqs. 1 and 2 in the main text). I estimated LKR frequencies for soil, human gut and mouse gut samples from a recent study by Kozich et al. 1 The goal of this step was to obtain realistic frequencies, i.e. rates of novel taxa at each rank that are representative for biological samples in practice, not to make an accurate determination of the frequencies on those particular samples. LKR frequencies were estimated using identity thresholds, as described in detail below. This method is not expected to be very accurate, but this does not matter because the frequencies will be realistic even if they are under- or over-estimated by quite large factors. For example, I estimate that 37% of the genera in the soil sample are known. This number could be quite far off: perhaps the true number is 20% or 50%, but it is surely not 1% or 99%. As long as the estimate is in the right ballpark, a sample with 37% known genera is not exceptional, and this rate is reasonable for summarizing the performance of a taxonomy prediction algorithm.
To avoid any misunderstanding on this central point, it is also important to note that my methodology does not use identity to determine LKRs of individual sequences; when required, they are obtained using authoritative annotations. Identity thresholds were used only to obtain realistic LKR frequencies for three representative samples.

Identity thresholds are commonly used to determine approximate taxonomic relationships. For example, it is commonly assumed that ≥97% identity for two full-length 16S sequences indicates that the species is probably the same and conversely, if the identity is <97%, then the species is probably different. This gives us a method for estimating the frequency of known species in a sample: it is the fraction of sequences with ≥97% identity with the reference database. This approach can be generalized to other ranks, as in the work of Yarza et al. 15 who determined the number of novel taxa in large databases of full-length 16S sequences. Their method was based on finding appropriate clustering thresholds for ranks from species to phylum. Sequence identity correlates only approximately with taxonomic rank, so clusters will not correspond one-to-one with names: some clusters will contain more than one name (lumping) and some names will be found in several clusters (splitting). Yarza et al. tuned their thresholds so that the number of clusters containing known taxa agreed with the number of distinct taxon names. In other words, the tuning balanced splitting and lumping so that (number of clusters) = (number of distinct names) at the given rank. In this framework, the number of clusters which do not contain known names is an operational definition of the number of unnamed taxa. At genus rank, Yarza et al. found that the clustering threshold which balanced splitting and lumping was 95%. Using this threshold, I estimated the number of known genera as the number of sequences having ≥95% identity with the reference database. This test is not reliable in any given case: some sequences with known genera will have <95% identity and some novel genera will have ≥95% identity, but these will tend to balance each other out (analogous to lumping and splitting of clusters). LKR frequencies at higher ranks were estimated in the same way. The Yarza identity thresholds were determined for full-length 16S sequences, which raises the question of whether they are optimal for shorter gene segments such as the V4 region used in this work.
The thresholds are probably not optimal, but they are surely good enough to give realistic frequencies. From Fig. 1 in the main text we can see that the genus threshold (95%) appears to be too low because P(LCR=genus | 95%) = 0.34, so at 95% identity the LKR is more likely to be family or higher. A better V4 threshold for genus appears to be 96% or 97% with P(LCR=genus) = 0.51 and 0.61 respectively. Using a higher identity would increase the estimated frequency of novel genera, so using the thresholds determined on full-length sequences gives a conservative estimate of novel genus frequency.

Rank       Id.    Sample      LKR%
Phylum     75%    Soil        5%
                  Mouse gut   3%
                  Human gut   0.2%
Class      79%    Soil        7%
                  Mouse gut   3%
                  Human gut   0%
Order      82%    Soil        21%
                  Mouse gut   10%
                  Human gut   6%
Family     86%    Soil        45%
                  Mouse gut   42%
                  Human gut   49%
Genus      95%    Soil        16%
                  Mouse gut   18%
                  Human gut   8%
Species    98%    Soil        4%
                  Mouse gut   12%
                  Human gut   8%
Sequence   100%   Soil        2%
                  Mouse gut   10%
                  Human gut   17%

[The Known, Novel and Novel% columns are missing from this copy and are omitted above.]

Table SN7.1. Estimated LKR frequencies for in vivo samples vs. RDP14. LKR frequencies estimated for UPARSE OTUs constructed from the Kozich et al. samples of soil (7,564 OTUs), mouse gut (757 OTUs) and human gut (452 OTUs). Column headings are: Id., the Yarza et al. cutoff identity threshold for the rank; LKR%, the fraction of OTUs having an LKR at this rank according to the thresholds; Known, the number of known OTUs; Novel, the number of novel OTUs; Novel%, the fraction of novel OTUs. Novel frequencies >20% are highlighted.
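The threshold-based estimate of LKR frequencies can be sketched as below. The thresholds are the Id. values listed in Table SN7.1; the use of "greater than or equal", the handling of identities below 75% (labelled "none" here), and the function names are assumptions made for illustration.

```python
from collections import Counter

# (rank, minimum top-hit identity in percent), ordered from lowest rank upward.
YARZA_THRESHOLDS = [
    ("sequence", 100.0),
    ("species", 98.0),
    ("genus", 95.0),
    ("family", 86.0),
    ("order", 82.0),
    ("class", 79.0),
    ("phylum", 75.0),
]


def lkr_from_identity(identity):
    """Estimate the lowest known rank from the top-hit identity of one OTU."""
    for rank, threshold in YARZA_THRESHOLDS:
        if identity >= threshold:
            return rank
    return "none"  # assumption: below the phylum threshold no rank is known


def lkr_frequencies(identities):
    """Fraction of OTUs assigned to each estimated LKR."""
    counts = Counter(lkr_from_identity(i) for i in identities)
    return {rank: n / len(identities) for rank, n in counts.items()}
```

Given the top-hit identity of every OTU against the reference database, `lkr_frequencies` returns a profile of the kind summarized in the LKR% column. As the note stresses, this is only meant to give realistic frequencies for a sample, not reliable LKRs for individual sequences.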

Note 8. Accuracy metrics for taxonomy validation. Algorithm predictions are often characterized as true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). Prediction accuracy is conventionally summarized using measures calculated from totals for given types of prediction, e.g. Bokulich et al. (reference in Note 6) use the textbook metrics precision = TP/(TP+FP) and recall = TP/(TP+FN). However, this is not a textbook case (Note 6), and I used different metrics which I found to correspond better with intuitive concepts of accuracy relevant for taxonomy. UTAX and the other algorithms considered in this work do not predict novelty (Note 15). The concept of a true negative therefore does not apply because predictions are never negative in the sense that they are for a binary classifier. To characterize false positive rates, I defined a misclassification as a false positive when the rank is known (FPmis), and an overclassification as a false positive when the rank is novel (FPover). An overclassification error occurs when the algorithm predicts too many ranks; it should have climbed higher up the taxonomic tree. In this spirit, a false negative could be described as an underclassification error because too few ranks are predicted, but this is true of all FNs so there is no need for a new category. To characterize the rate of true positives, I defined sensitivity = TP / Nknown where Nknown = TP+FN+FPmis is the number of queries with known names. Sensitivity by my definition has a maximum of 100% which could be achieved by an ideal algorithm, while the RDP accuracy measure is necessarily <100% if there are novel query sequences (Note 6). My definition of sensitivity captures the intuitive idea of "fraction of achievable predictions which are correct".
Precision and recall cannot do this because misclassification errors (where an ideal algorithm could make a TP prediction) and overclassification errors (impossible because there are no training examples) are not distinguished.

As a summary statistic for errors I chose to use errors per query (EPQ) = FP / NQ where NQ is the total number of query sequences. False negatives are not counted as errors for calculating EPQ because they are already accounted for in sensitivity. When precision and recall are used, false positives are indicated by precision < 100%. As errors increase, precision gets lower. The divisor for precision is (TP+FP) = number of predictions, while the divisor for EPQ is the total number of queries. Some, or many, queries may not get a prediction (which is not the same as a prediction that the rank is novel as noted above; see also Note 15). Both precision and EPQ capture the FP rate, and can readily be converted given the number of queries and number of predictions. In a prediction task with dense reference data they capture a similar intuitive notion because all FPs are misclassifications. However, with sparse reference data / novel query data there is an important difference. If you continue to add novel queries, all new predictions are overclassification errors and the precision is reduced indefinitely and approaches zero for very high novelty, even if the algorithm has a low overclassification rate. In other words, precision reflects a property of the query set as well as a property of the algorithm. For low ranks, novelty may be high enough that overclassifications swamp misclassifications even if the algorithm has low rates for both types of error, making precision hard to interpret. By contrast, when there is high novelty EPQ will converge on the overclassification rate, an intrinsic property of the algorithm.
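The contrast between precision and EPQ under increasing novelty can be made concrete with a small calculation. The counts below are hypothetical and chosen only to illustrate the definitions in Note 8; `taxonomy_metrics` is a name introduced here.

```python
def taxonomy_metrics(tp, fn, fp_mis, fp_over, n_queries):
    """Return (sensitivity, EPQ, precision) using the definitions in Note 8."""
    n_known = tp + fn + fp_mis   # queries whose name is present in the reference
    fp = fp_mis + fp_over        # misclassifications plus overclassifications
    sensitivity = tp / n_known
    epq = fp / n_queries
    precision = tp / (tp + fp)
    return sensitivity, epq, precision


# Hypothetical algorithm: on 100 known queries it makes 80 TP, 15 FN and
# 5 misclassifications; on novel queries it overclassifies 10% of the time.
for n_novel in (0, 100, 10_000):
    fp_over = n_novel // 10
    sens, epq, prec = taxonomy_metrics(80, 15, 5, fp_over, 100 + n_novel)
    print(f"novel={n_novel:>6}  sensitivity={sens:.2f}  "
          f"EPQ={epq:.3f}  precision={prec:.3f}")
```

Sensitivity stays fixed because it depends only on known queries. As novel queries are added, EPQ approaches the assumed 10% overclassification rate while precision keeps falling, although the algorithm itself has not changed.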

Note 9. Calculation of LCR probabilities and S/E. If a married couple has a height difference of 2cm, what is the probability that the taller spouse is male? To answer this, collect information about a large number of couples, extract the subset where the height difference is 2cm, and calculate the fraction where the taller spouse is a man. If 80% of those couples have a taller man, we conclude that the probability is 0.8. Implicitly, this procedure assumes we have observed events generated by a hidden stochastic process, and the best estimate we can make of the underlying probability distribution (given some reasonable assumptions) is the observed frequency in those samples. This is called an a-posteriori estimate. If a pair of sequences has 90% identity, what is the probability that their lowest common rank is family? To answer this, collect a large number of pairs of sequences, extract the subset with 90% identity and calculate the fraction with LCR=family. Fig. SN9.1 shows schematically how UTAX calculates LCR probabilities from a reference database, using sequence identity as the similarity measure for this example. (In practice, UTAX uses dutax defined by Eq. 6 in the main text.) An all-pairs triangular matrix (a) is constructed containing pair-wise sequence identities, indicated by colors (green=100%, yellow=95% and orange=90%). The lowest common rank (LCR) is determined for each pair by comparing taxonomy annotations and marked as s (species), g (genus) or f (family). For each identity, the corresponding pairs are identified: (b) for 100%, (c) for 95% and (d) for 90%. For a given identity, the fraction of pairs having each LCR is calculated, i.e. the LCR frequencies. For example, in (c) there are nine pairs with 95% identity. Of these, one has LCR=species, five have LCR=genus and three have LCR=family.
The LCR probabilities are estimated to be the observed frequencies, so P(LCR=species | 95%) ~ 1/9, P(LCR=genus | 95%) ~ 5/9 and P(LCR=family | 95%) ~ 3/9 (the symbol ~ means "is estimated to be"). Using integer-rounded percent identities ensures that the set of pairs for a given identity is usually large enough to make a good estimate of its LCR probabilities. Missing values are filled in by interpolation, e.g. if there are no pairs with 76% identity then P(LCR | 76%) ~ (P(LCR | 75%) + P(LCR | 77%))/2.
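The frequency-counting procedure in Note 9 can be sketched as follows, using identity as the similarity measure as in the worked example. This is an illustration only, not the UTAX implementation: the function names are hypothetical, and the interpolation handles only the single-gap case described in the text.

```python
from collections import Counter, defaultdict


def lcr_probabilities(pairs):
    """Estimate P(LCR | identity) from (identity_percent, lcr) pairs.

    Identities are rounded to integers so that each bin holds enough pairs.
    """
    counts = defaultdict(Counter)
    for identity, lcr in pairs:
        counts[int(round(identity))][lcr] += 1
    probs = {}
    for identity, lcr_counts in counts.items():
        total = sum(lcr_counts.values())
        probs[identity] = {lcr: n / total for lcr, n in lcr_counts.items()}
    return probs


def interpolate_missing(probs, identity, ranks):
    """Fill one missing identity bin with the mean of its two neighbors."""
    lower, upper = probs[identity - 1], probs[identity + 1]
    return {r: (lower.get(r, 0.0) + upper.get(r, 0.0)) / 2 for r in ranks}


# The nine pairs at 95% identity from the worked example in Note 9.
PAIRS_95 = [(95, "species")] + [(95, "genus")] * 5 + [(95, "family")] * 3
print(lcr_probabilities(PAIRS_95)[95])
```

For the nine pairs at 95% identity this reproduces the estimates 1/9, 5/9 and 3/9 given in the text.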

Fig. SN9.1. Calculation of LCR probabilities from a reference database.

Fig. SN9.2. Calculation of sensitivity vs. EPQ from a reference database. This figure shows how a sensitivity vs. error plot for common rank (CR) is calculated for genus, using the toy example from Fig. SN9.1. Pairs are considered in order of decreasing identity. If LCR=s or LCR=g, the pair is a true positive CR prediction because the genus is the same, or if LCR=f this is a false positive because the genus is different. At each identity, the number of true positives and false positives (f, red outlines) are counted. There are 14 pairs with common genera (LCR=s or g) and there are 21 queries (the total number of pairs), so the CR sensitivity at a given cutoff is TP/14 and EPQ is FP/21 (see Note 8 for definitions and discussion of sensitivity and EPQ). Here, there are three possible thresholds at identities 100%, 95% and 90% which incrementally include queries from pairs in groups (b), (c) and (d) respectively.
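The cumulative counting behind Fig. SN9.2 can be sketched as below. The totals (21 pairs, 14 with a common genus) and the 95% group (one s, five g, three f) are taken from Notes 9 and the caption above; the split of the remaining pairs between the 100% and 90% groups is not given in the text and is invented here for illustration, as is the function name.

```python
def cr_curve(pairs, same_group=("s", "g")):
    """Return [(cutoff, sensitivity, EPQ)] for decreasing identity cutoffs.

    A pair counts as a true positive if its LCR implies a common genus
    (s or g), and as a false positive otherwise (f).
    """
    n_queries = len(pairs)
    n_common = sum(1 for _, lcr in pairs if lcr in same_group)
    cutoffs = sorted({identity for identity, _ in pairs}, reverse=True)
    curve = []
    tp = fp = 0
    for cutoff in cutoffs:
        for identity, lcr in pairs:
            if identity == cutoff:
                if lcr in same_group:
                    tp += 1
                else:
                    fp += 1
        curve.append((cutoff, tp / n_common, fp / n_queries))
    return curve


TOY_PAIRS = (
    [(100, "s")] * 4 + [(100, "g")] * 2                    # hypothetical split
    + [(95, "s")] * 1 + [(95, "g")] * 5 + [(95, "f")] * 3  # from Note 9
    + [(90, "g")] * 2 + [(90, "f")] * 4                    # hypothetical split
)

for cutoff, sens, epq in cr_curve(TOY_PAIRS):
    print(f"cutoff={cutoff}%  sensitivity={sens:.2f}  EPQ={epq:.2f}")
```

Each cutoff adds the pairs of one more identity group, so sensitivity and EPQ both rise as the threshold is lowered, giving one point on the curve per threshold.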

Note 10. Software versions and command lines.

UTAX: version 1.0. Source code and Linux binary are in the Supplementary Files.

RDP: Stand-alone classifier (version number missing in this copy).
  RDP training:
    java -Xmx8g -cp /sw/rdp_classifier_2.11/rdp_classifier-2.11.jar edu/msu/cme/rdp/classifier/train/classifiertraineemaker treefile dbfile 1 version1 name_not_used traindir/
  RDP classification:
    java -Xmx1g -jar /sw/rdp_classifier_2.11/rdp_classifier-2.11.jar -t traindir/rrnaclassifier.properties -q query.fa -o output.txt

QIIME: (version number missing in this copy).
  QIIME-uc:
    assign_taxonomy.py -i query.fa -m uclust -r db.fa -t taxonomy.txt
  QIIME-sm:
    assign_taxonomy.py -i query.fa -m sortmerna -r db.fa -t taxonomy.txt
  QIIME-blast:
    assign_taxonomy.py -i query.fa -m blast -r db.fa -t taxonomy.txt

mothur-knn:
    classify.seqs(fasta=query.fa, template=db.fa, taxonomy=taxonomy.txt, method=knn, processors=6)

GAST: Source dated 25 Feb 2011 (no version number given).
    gast -in query_fa -ref ref_fa -rtax taxonomy.txt -out output.txt

Note 11. Compute time and memory use.

[Table SN11.1 layout: columns are Method, Elapsed time (secs.) and Maximum memory (Mb); rows are UTAX, RDP, QIIME-uc, QIIME-sm, QIIME-blast, GAST and mothur-knn. The cell values are missing from this copy; only a truncated "25," survives for the QIIME-blast elapsed time.]

Table SN11.1. Execution time and maximum memory use of the tested methods. The table reports elapsed time in seconds and maximum memory in megabytes for the tested methods using the 9,364 sequences in the V4 reference database extracted from RDP14 as both query and reference. Programs were run under Ubuntu Linux on an Intel Core i7-3930k CPU running at 3.20GHz with 64Gb RAM.

Note 12. Performance metrics on RDP14 and War4. Method sensitivities for RDP14 and War4 are given in Supp. Table SN12.1. UTAX has relatively low sensitivity, but I considered its performance acceptable and comparable to the best alternative, with one exception: genus predictions on War4 (~39% sensitivity compared to ~60-70% for RDP_80). I interpreted this anomaly as an underestimate of EPQ by the algorithm, which I was not able to explain, but I found that it could be addressed by setting P ≥ 0.7, which gave 71% sensitivity and EPQ ~5%. Error rates are shown in Table SN12.2, which shows that UTAX consistently achieves lower error rates than most other methods, dramatically so in many cases, with the exception of mothur-knn, which has much lower sensitivity. Genus overclassification rates with LKR=genus increased from V4 to full-length for all methods except UTAX, for which the overclassification rate was lower (19% V4, 13% full-length; Supplementary Files). Notably, the overclassification rate of RDP_80 jumped from 31% on V4 to 50% on full-length sequences, and that of RDP_50 to 81%.

[Table SN12.1 layout: one block per predicted rank (Genus, Family, Order, Class, Phylum); columns are 16S (V5), 16S (V4), 16S (V3V5), 16S (FL), ITS1, ITS2 and ITS (FL); rows are UTAX, the two RDP bootstrap cutoffs (RDP_50 and RDP_80 in Note 12), QIIME-uc, QIIME-sm, QIIME-blast, GAST and mothur-knn. The cell values are missing from this copy.]

Table SN12.1. Sensitivity with LKR=genus on RDP14 (16S) and War4 (ITS). The table shows sensitivity (defined in Note 8) as a percentage with LKR=genus for predicted ranks from genus to phylum. LKR=genus was chosen as representative of the in vivo samples (Note 7). Sensitivities <75% are highlighted in yellow, <50% in orange. The complete matrices for sensitivity, overclassification and misclassification for all pairs (prediction rank, LKR) are included in the Supplementary Files. The V5 region of 16S was truncated to 120nt to simulate reads obtained by older NGS machines. The V4 region (~250nt) is popular with current sequencing technologies. The V3V5 region (~520nt) was sequenced on older 454 machines and models the longer reads which will be achieved by NGS machines in the near future.

[Table SN12.2 layout: one block per predicted rank, each with columns LKR=genus, LKR=family, LKR=order and LKR=class and rows UTAX, the two RDP bootstrap cutoffs (RDP_50 and RDP_80 in Note 12), QIIME-uc, QIIME-sm, QIIME-blast, GAST and mothur-knn. The cell values are missing from this copy. The error type reported in each column is:]

Predicted rank   LKR=genus   LKR=family   LKR=order   LKR=class
Genus            Mis.        Over.        Over.       Over.
Family           Mis.        Mis.         Over.       Over.
Order            Mis.        Mis.         Mis.        Over.
Class            Mis.        Mis.         Mis.        Mis.
Phylum           Mis.        Mis.         Mis.        Mis.

Table SN12.2. Error rates measured on the RDP14 V4 region. The table shows misclassification (Mis.) and overclassification (Over.) error rates as percentages for predicted ranks from genus to phylum as defined in Note 8. LKRs from genus to class are shown as novel phyla are rare in practice. When the predicted rank is <LKR, errors are overclassifications, and when the predicted rank is ≥LKR the errors are misclassifications (Note 8). Error rates ≥10% are highlighted in yellow, ≥20% in orange and ≥50% in red.
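The rule in the caption of Table SN12.2 for labelling an incorrect prediction can be written as a short function. This is a sketch: the rank ordering list and the function name are introduced here for illustration.

```python
# Ranks ordered from lowest to highest.
RANKS = ["species", "genus", "family", "order", "class", "phylum"]


def error_type(predicted_rank, lkr):
    """Label an incorrect prediction at `predicted_rank` for a query with `lkr`.

    Below the lowest known rank no correct name exists in the reference, so
    any prediction there is an overclassification; at or above the LKR a
    correct name was available, so the error is a misclassification.
    """
    if RANKS.index(predicted_rank) < RANKS.index(lkr):
        return "overclassification"
    return "misclassification"


print(error_type("genus", "family"))   # a genus predicted for a novel genus
print(error_type("family", "genus"))   # a wrong family although it is known
```

Applied to each (predicted rank, LKR) pair from genus to class, this yields the Mis./Over. pattern in the column headers of the table.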

Note 13. Construction of a rank split. Fig. SN13.1. Rank split with LKR=family. The reference database is divided into two subsets X and Y, colored gold and blue respectively in the figure, such that LKR=family. For each family, its genera are assigned at random to X or to Y. At least one genus from each family must always be present in both X and Y. No genus is present in both. Families containing only one genus are discarded. With LKR=family, ranks of family and above are always known (i.e., present in both X and Y) while ranks of genus and below are always novel (i.e., not present in both).
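A rank split with LKR=family can be sketched along the following lines. The record format (sequence id, family, genus), the seeded random generator and the function name are assumptions made for this illustration; the constraints are those stated in Note 13.

```python
import random
from collections import defaultdict


def family_rank_split(records, seed=0):
    """Split (seq_id, family, genus) records into X and Y with LKR=family.

    Every retained family has at least one genus on each side, no genus is
    on both sides, and families with a single genus are discarded.
    """
    rng = random.Random(seed)
    genera_by_family = defaultdict(set)
    for _, family, genus in records:
        genera_by_family[family].add(genus)

    side = {}
    for family, genera in genera_by_family.items():
        if len(genera) < 2:
            continue  # cannot put one genus on each side
        shuffled = sorted(genera)
        rng.shuffle(shuffled)
        side[(family, shuffled[0])] = "X"  # guarantee one genus in X
        side[(family, shuffled[1])] = "Y"  # and one genus in Y
        for genus in shuffled[2:]:
            side[(family, genus)] = rng.choice("XY")

    x = [s for s, f, g in records if side.get((f, g)) == "X"]
    y = [s for s, f, g in records if side.get((f, g)) == "Y"]
    return x, y
```

Using one subset as the query set and the other as the reference then gives queries whose family is always present in the reference but whose genus never is.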
