UTAX accurately predicts taxonomy of marker gene sequences


Robert C. Edgar
Independent Investigator, Tiburon, California, USA.

The UTAX algorithm accurately predicts the taxonomy of 16S ribosomal RNA and other marker gene sequences targeted by next-generation metagenomics experiments. UTAX has comparable sensitivity but much lower error rates compared to most existing methods, predicting dramatically fewer false positives for novel taxa.

Recent studies using next-generation sequencing of marker gene segments include the Human Microbiome Project (HMP) [1] and a survey of the Arabidopsis root microbiome [2]. A fundamental step in such studies is to predict the taxonomy of sequences in the reads, which are typically clustered into Operational Taxonomic Units (OTUs). Computational taxonomy prediction is complicated by the fact that only a small minority of microbial species have authoritative classifications and reference databases have sparse coverage, so that in practice an OTU often does not have an exact match in the database (Supp. Note 3).

With the goal of improving taxonomy prediction accuracy, I developed a new algorithm, UTAX, that accounts for sparseness in the database and for varying correlations between rank and sequence identity in different groups. UTAX calculates a novel score combining k-mer distances to the top hit and to the nearest neighbor at each rank, i.e. the most similar sequence with a different name at that rank. For each rank, the probability that the query belongs to the same group as the top hit is calculated from the distribution of scores over all pairs in the reference database.

Available reference databases for the 16S ribosomal RNA gene (16S) include SILVA [3], Greengenes [4] and the RDP Classifier [5] (RDP) training set. The current RDP training set (v14, here called RDP14) contains 10,679 sequences. Greengenes and SILVA are larger, giving better coverage than RDP14 but not as much as might be expected from the numbers of sequences (Supp. Note 3). Most of the annotations in SILVA and Greengenes are not authoritative classifications but predictions generated by a combination of automated and manual methods [6,7], which I estimated to have error rates of ~6% and ~18% respectively for genus (Supp. Note 4). Also, SILVA and Greengenes are not compatible with some programs because many sequences lack species and genus names (Supp. Note 5). I therefore chose to use RDP14 for comparative validation on 16S and the RDP Warcup training set [8] version 4 (War4) for the fungal internal transcribed spacer (ITS) region.

Given sequences from a biological sample (here called OTUs without necessarily implying clustering) and a reference database, I defined coverage at each taxonomic rank to be the fraction of OTUs that belong to a known group. Here, known means that the group is present in the reference database, regardless of whether the group has been named, and novel means that the group is not present. I defined the lowest known rank (LKR) of an OTU as its lowest rank having at least one reference sequence, and the LKR frequency $\lambda_r$ as the fraction of OTUs having LKR = $r$. For example, if $\lambda_{\text{genus}} = 0.4$, then 40% of the OTUs belong to a novel species in a known genus. LKR frequencies can be interpreted as a profile summarizing taxonomic novelty in the OTUs with respect to the database.

I estimated LKR frequencies for soil, human gut and mouse gut reads of the 16S V4 region from a recent study [9] using sequence identity thresholds determined by Yarza et al. [10]: 95% for genus, 86% for family, etc. (Supp. Note 7).
While identity gives only an approximate indication of rank, averaging over OTUs for a typical sample should give frequencies that are realistic even if not accurate for that particular sample. Using RDP14 as a reference and OTUs constructed by UPARSE 11, I estimated the fraction of OTUs with novel genera to be 83% for soil, 63% for mouse gut and 57% for human gut, showing that coverage is sparse in practice (Fig.1 and Supp. Notes 7 and 16).
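The threshold-based estimate described above can be sketched in a few lines. This is a minimal illustration, not the procedure of Supp. Note 7: the function names are hypothetical, and only the genus (95%) and family (86%) thresholds come from the text; the values for order, class and phylum are placeholders chosen for the example.

```python
from collections import Counter

# Minimum top-hit identity (%) at which a rank is treated as known.
# Genus and family are the values quoted in the text; the remaining
# three entries are illustrative placeholders, NOT taken from the text.
THRESHOLDS = [
    ("genus", 95.0),
    ("family", 86.0),
    ("order", 82.0),   # placeholder
    ("class", 78.0),   # placeholder
    ("phylum", 75.0),  # placeholder
]


def lowest_known_rank(identity_pct, thresholds=THRESHOLDS):
    """Return the lowest rank whose threshold the top-hit identity reaches."""
    for rank, min_identity in thresholds:
        if identity_pct >= min_identity:
            return rank
    return "none"  # below every threshold: no rank estimated as known


def lkr_frequencies(identities, thresholds=THRESHOLDS):
    """Fraction of OTUs assigned to each estimated LKR (the lambda_r profile)."""
    counts = Counter(lowest_known_rank(x, thresholds) for x in identities)
    total = len(identities)
    return {rank: n / total for rank, n in counts.items()}


if __name__ == "__main__":
    # Made-up top-hit identities for six OTUs.
    print(lkr_frequencies([99.0, 96.0, 90.0, 87.0, 80.0, 60.0]))
```

With a reference such as RDP14, where genus is the lowest annotated rank, the estimated fraction of OTUs with novel genera is one minus the genus frequency.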

At high identities and high ranks, an OTU almost certainly belongs to the same group as the top database hit, and at low ranks and low identities, an OTU almost certainly belongs to a different group. The most challenging cases occur when identity is close to the average for the rank, for example attempting to predict genus when identity is ~95%. This is a twilight zone for taxonomy prediction (Fig. 1) analogous to the twilight zone for protein homology prediction [12]. In principle, it might be possible to identify genus-specific sequence features, but not when reference data is too sparse. For example, almost half (913 / 1,948) of the genera in RDP14 have only one reference sequence, and in these cases it is impossible to predict whether a human expert would assign another species to the same group from its sequence alone. Thus, in the twilight zone, predictions of known genera will often be false positives while non-predictions will often be false negatives (see also Supp. Note 15). Identity distributions for typical samples (Fig. 1 and Supp. Fig. SN6.2) show that twilight zone OTUs are common in practice, underscoring the difficulty of accurate taxonomy prediction and the importance of providing a confidence estimate.

With this in mind, I designed UTAX to predict the mean number of errors per query (EPQ) for each rank (see Methods). For testing, I set a threshold of $P = (1 - \mathrm{EPQ}) \ge 0.9$ on the assumption that ~10% is an acceptable error rate for a typical study.

The RDP authors measured accuracy using leave-one-out validation [5], which I believe is inappropriate in this context (Supp. Note 6). I used a different strategy that has been applied to validation of shotgun metagenomics taxonomy prediction [13] by constructing datasets where LKRs are known from trusted annotations, as follows. For $k$ = genus, family, ..., phylum I divided RDP14 into two subsets (rank splits) $X_k$ and $Y_k$ such that the LKR between the subsets is $k$.
For example, with LKR = family, I discarded families with only one genus and randomly assigned the remaining genera to $X_{\text{family}}$ or $Y_{\text{family}}$ with the constraint that at least one genus from every family must be present in both (Supp. Fig. SN13.1). For each $k$ and for each region of interest (full-length gene, V4 etc.), I measured prediction performance for all ranks using $X_k$ as the query and $Y_k$ as the reference and vice versa. I included a null split $X_N = Y_N$ = RDP14 to measure performance when the sequence is known. I followed the same procedure for War4.

For every split at rank $k$ I calculated the following accuracy metrics for each rank $r$ (see Supp. Note 8 for discussion). Sensitivity ($S_{rk}$) is the fraction of known names at rank $r$ that were correctly predicted. The misclassification error rate ($M_{rk}$) is the fraction of known names at rank $r$ that were incorrectly predicted. The overclassification error rate ($O_{rk}$) is the fraction of novel taxa at rank $r$ that were incorrectly predicted to be known. Given the LKR frequencies $\lambda_k$, the total sensitivity $\mathrm{Sens}_r$ and errors per query $\mathrm{EPQ}_r$ at rank $r$ for a set of OTUs can be estimated by assuming that the sensitivities and error rates at each rank are approximately the same as those measured on the rank splits:

$$\mathrm{Sens}_r = \sum_k \lambda_k S_{rk}, \qquad \text{(Eq. 1)}$$

$$\mathrm{EPQ}_r = \sum_k \lambda_k (O_{rk} + M_{rk}). \qquad \text{(Eq. 2)}$$

To obtain sensitivities and error rates for typical data, I used Eqs. 1 and 2 with the estimated LKR frequencies for the soil, human gut and mouse gut OTUs. While the frequencies may be inaccurate for those samples, and the sensitivities and error rates for each LKR in a given set of OTUs may differ somewhat from those measured on the rank splits, this procedure should nevertheless give good estimates in the sense that they fall comfortably within the range of true values for typical data in practice, giving a far more realistic indication of algorithm accuracy than leave-one-out testing (Supp. Note 6).

Using this method, I compared the accuracy of UTAX with GAST [14], RDP and methods supported by mothur [15] and QIIME [16] (see Supp. Note 11 for method name abbreviations, software versions and command lines). Representative results are given in Table 1; the underlying performance metrics are given in the Supplementary Files and Supp. Note 12. Mothur-rdp gave very similar results to RDP (Supp. Note 1). The only method to consistently achieve an estimated $\mathrm{EPQ}_{\text{genus}}$ below 10% was mothur-knn, but its sensitivity was also much lower than the other methods ($\mathrm{Sens}_{\text{genus}}$ < 40% on all samples). The estimated $\mathrm{EPQ}_{\text{genus}}$ of UTAX was ~10% on all three samples, remarkably close to the rate predicted by the $P \ge 0.9$ threshold given that $P$ is calculated by an independent method that does not use identity thresholds or rank splits (Methods). All other algorithms had substantially higher $\mathrm{EPQ}_{\text{genus}}$, ranging from ~17% for RDP at 80% bootstrap to QIIME-blast, which consistently had the highest error rate ($\mathrm{EPQ}_{\text{genus}}$ of 62% to 78%). The default QIIME method, QIIME-uc, had $\mathrm{EPQ}_{\text{genus}}$ = 39% to 45% and QIIME-rdp, which sets the bootstrap cutoff at 50% by default, had $\mathrm{EPQ}_{\text{genus}}$ = 36% to 40%. $\mathrm{Sens}_{\text{phylum}}$ was >90% for all methods except QIIME-uc (78% on soil, 87% on mouse gut) and QIIME-sm (79% on soil, 87% on mouse gut).

Methods

Given a pair of sequences $Q$ and $R$, I defined the lowest common rank (LCR) of $Q$ and $R$ to be the lowest rank where $Q$ and $R$ have the same name. Given a similarity measure $d(Q, R, k)$, $P(\mathrm{LCR}=k \mid d)$ is the probability that the LCR is $k$. For example, if $d$ is sequence identity then $P(\mathrm{LCR}=\text{phylum} \mid d=93\%)$ will be close to one but $P(\mathrm{LCR}=\text{genus} \mid d=93\%)$ will be lower. To obtain a discrete range, UTAX converts a real-valued similarity $d$ taking values zero to one to an integer percentage $D = 100d$. Considering all pairs of sequences in a reference database $B$, let the number of pairs with a given $D$ be $H_D$ and the number of those pairs with LCR = $k$ be $h_{D,k}$. UTAX calculates an a-posteriori estimate for $P(\mathrm{LCR}=k \mid D)$ from $B$ as the fraction of pairs having distance $D$ which also have LCR = $k$, i.e.

$$P(\mathrm{LCR}=k \mid D) \approx h_{D,k} / H_D. \qquad \text{(Eq. 3)}$$

For motivation and visualization of Eq. 3 see Supp. Note 9. UTAX calculates the matrix $C_{D,k} = h_{D,k} / H_D$ from $B$ and stores it for use in run-time prediction.

Let $P(\mathrm{CR}(k) \mid D)$ be the probability that two sequences have a common rank at level $k$, i.e. have the same name at that rank. Let $\mathrm{taxon}(Q, k)$ be the name of $Q$ at rank $k$. $Q$ and $R$ have the same name at rank $k$ if their LCR is not > $k$, hence

$$P(\mathrm{CR}(k) \mid D) = P(\mathrm{taxon}(Q, k) = \mathrm{taxon}(R, k) \mid D) = 1 - P(\mathrm{LCR}(Q, R) > k \mid D) = 1 - \sum_{r > k} C_{D,r}. \qquad \text{(Eq. 4)}$$

Thus, given a reference sequence $R$ and an integer similarity $D$, Eq. 4 gives the probability that the taxon name of $Q$ is the same as $R$ at rank $k$. This gives a framework for constructing a taxonomy prediction algorithm based on a similarity measure $d$. Natural choices for $d$ include identity calculated from an alignment or a word-counting distance. However, these would not take into account that the correlation between similarity and rank varies in different groups due to differing evolutionary rates and lumping or splitting by taxonomists. I therefore also considered the similarity of a reference sequence $R$ with its nearest neighbor $NN_k(R)$ for each $k$, i.e. the sequence in $B$ with highest similarity to $R$ and a different name at rank $k$. If $NN_k(R)$ is close to $R$, then the confidence that $\mathrm{taxon}(Q, k) = \mathrm{taxon}(R, k)$ should be reduced because of the increased likelihood that $\mathrm{taxon}(Q, k) = \mathrm{taxon}(NN_k(R), k)$.

I chose to use similarities calculated from the set $w_8(Q)$ of 8-mers in $Q$. I defined the unique word similarity ($U$) of a pair of sequences $Q$ and $R$ as

$$U(Q, R) = \frac{|w_8(Q) \cap w_8(R)|}{\min(|w_8(Q)|, |w_8(R)|)}. \qquad \text{(Eq. 5)}$$

I designed a similarity measure ($d_{\text{utax}}$) that increases with higher similarity between $Q$ and $R$, decreases with higher similarity between $R$ and $NN_k(R)$, and takes real values between zero and one,

dutax(Q, R, k) = (2 U(Q, R) U(R, NNk(R))/2. (Eq. 6)

(See Supp. Note 14 for comparison with other measures.) Given a query sequence $Q$, UTAX identifies the top hit $T$ by unique word similarity, i.e. $T = \operatorname{argmax}_R \{ U(Q, R),\ R \in B \}$. The rank names of $Q$ are predicted to be the same as those of $T$ with probabilities calculated by Eq. 4 using the $d_{\text{utax}}$ similarity measure.
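The unique word similarity of Eq. 5, the top-hit search and the nearest-neighbor lookup can be sketched as follows. This is a minimal illustration under stated assumptions, not the UTAX implementation: the function names and the dictionary-based reference layout are hypothetical, the search is brute force, and the combined measure of Eq. 6 is deliberately not implemented here.

```python
def words(seq, word_len=8):
    """Set of distinct words (8-mers by default) found in a sequence."""
    return {seq[i:i + word_len] for i in range(len(seq) - word_len + 1)}


def unique_word_similarity(q, r, word_len=8):
    """Eq. 5: shared distinct words divided by the smaller word-set size."""
    wq, wr = words(q, word_len), words(r, word_len)
    smaller = min(len(wq), len(wr))
    if smaller == 0:
        return 0.0  # a sequence shorter than one word has no words
    return len(wq & wr) / smaller


def top_hit(query, reference, word_len=8):
    """Label of the reference sequence with the highest U(Q, R).

    `reference` maps a label to its sequence (hypothetical layout).
    """
    return max(
        reference,
        key=lambda label: unique_word_similarity(query, reference[label], word_len),
    )


def nearest_neighbor(label, reference, names, word_len=8):
    """NN_k(R): the most similar reference with a different name at rank k.

    `names` maps each label to its taxon name at the rank of interest.
    Returns None when every reference shares the same name.
    """
    others = [x for x in reference if names[x] != names[label]]
    if not others:
        return None
    return max(
        others,
        key=lambda x: unique_word_similarity(reference[label], reference[x], word_len),
    )


if __name__ == "__main__":
    # Toy data with 4-mers so the example stays readable.
    ref = {"a": "ACGTACGTAC", "b": "TTTTTTTTTT", "c": "ACGTACGTAA"}
    genus = {"a": "G1", "b": "G2", "c": "G1"}
    print(top_hit("ACGTACGTAA", {"a": ref["a"], "b": ref["b"]}, word_len=4))
    print(nearest_neighbor("a", ref, genus, word_len=4))
```

Dividing by the smaller word set means a short query fully contained in a longer reference scores 1.0, which suits comparisons between a sequenced region and full-length references.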

Figures and tables

Fig. 1. Estimated Lowest Known Ranks (LKRs) for soil OTUs. The upper graph shows lowest common rank (LCR) probabilities as a function of sequence identity calculated for the V4 region of RDP14 (using Eq. 3, see also Supp. Note 9). The lower histogram shows frequencies of integer-rounded sequence identities of top hits of OTUs to the RDP14 database. Histogram bars are colored to indicate estimated LKRs according to the Yarza thresholds. While identity thresholds are not reliable indicators of rank, the fraction of OTUs in a Yarza identity range nevertheless gives a realistic indication of how many OTUs with the corresponding LKR might be found in a similar sample. The "twilight zone" is a region around 95% identity where high sensitivity for genus prediction cannot be achieved without high false positive rates because if the closest reference sequence has ~95% identity, then it is unlikely that there are enough training examples to identify genus-specific sequence features, and identity correlates only approximately with taxonomic rank, noting e.g. that $P(\mathrm{LCR}=\text{genus} \mid 95\%) = 0.34$, $P(\mathrm{LCR}=\text{family} \mid 95\%) = 0.33$ and $P(\mathrm{LCR}=\text{order} \mid 95\%) = 0.23$.
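The LCR probabilities plotted in Fig. 1 follow from counting reference pairs as in Eq. 3, and Eq. 4 then turns one row of that table into a common-rank probability. The sketch below is illustrative only. It assumes a hypothetical record layout (a sequence paired with a dictionary of rank names), rounds 100d to the nearest integer (the rounding rule is an assumption), and assumes every pair shares a name at the highest rank so that each row sums to one.

```python
from collections import defaultdict
from itertools import combinations

RANKS = ["genus", "family", "order", "class", "phylum"]  # lowest to highest


def lowest_common_rank(tax_a, tax_b, ranks=RANKS):
    """Lowest rank at which two annotations carry the same name, else None."""
    for rank in ranks:
        if tax_a[rank] == tax_b[rank]:
            return rank
    return None


def lcr_probabilities(records, similarity, ranks=RANKS):
    """Eq. 3: C[D][k] = h_{D,k} / H_D over all pairs of reference records.

    `records` is a list of (sequence, taxonomy dict) tuples and
    `similarity` returns a value between zero and one for two sequences.
    """
    pair_counts = defaultdict(int)                      # H_D
    lcr_counts = defaultdict(lambda: defaultdict(int))  # h_{D,k}
    for (seq_a, tax_a), (seq_b, tax_b) in combinations(records, 2):
        d_int = int(round(100 * similarity(seq_a, seq_b)))
        pair_counts[d_int] += 1
        k = lowest_common_rank(tax_a, tax_b, ranks)
        if k is not None:
            lcr_counts[d_int][k] += 1
    return {
        d_int: {k: lcr_counts[d_int][k] / n for k in ranks}
        for d_int, n in pair_counts.items()
    }


def common_rank_probability(c_row, rank, ranks=RANKS):
    """Eq. 4: one minus the summed probability that the LCR is above `rank`."""
    higher = ranks[ranks.index(rank) + 1:]
    return 1.0 - sum(c_row[r] for r in higher)
```

For a toy database in which half of the pairs at D = 95 share a genus and the other half share only a family, `common_rank_probability(C[95], "genus")` evaluates to 0.5.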

Table 1. Estimated accuracy for soil, mouse gut and human gut OTUs. The table shows estimated sensitivity and errors per query (EPQ) for genus and phylum predictions, expressed as percentages. Error rates >10% are highlighted yellow and >30% magenta. Genus sensitivities <50% are highlighted magenta and phylum sensitivities <90% yellow. Results for UTAX are shown for threshold $P \ge 0.9$. Results for RDP are shown with 80% bootstrap cutoff (recommended by the authors) and 50% bootstrap (the default for QIIME-rdp).
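The estimates reported in Table 1 combine the LKR frequencies of a sample with the per-split metrics through Eqs. 1 and 2. A minimal sketch of that weighted sum for one predicted rank is given below; the function name is hypothetical and the numbers in the usage example are made up for illustration, not taken from the results.

```python
def estimate_accuracy(lkr_freq, sensitivity, overclass, misclass):
    """Apply Eq. 1 and Eq. 2 for a single predicted rank r.

    Each argument maps a split rank k to a value between zero and one:
    lkr_freq[k] is lambda_k, sensitivity[k] is S_rk, overclass[k] is O_rk
    and misclass[k] is M_rk. Returns (Sens_r, EPQ_r).
    """
    sens_r = sum(lam * sensitivity[k] for k, lam in lkr_freq.items())
    epq_r = sum(lam * (overclass[k] + misclass[k]) for k, lam in lkr_freq.items())
    return sens_r, epq_r


if __name__ == "__main__":
    # Made-up genus-level metrics for two splits.
    lam = {"genus": 0.4, "family": 0.6}
    sens, epq = estimate_accuracy(
        lam,
        sensitivity={"genus": 0.80, "family": 0.0},
        overclass={"genus": 0.0, "family": 0.10},
        misclass={"genus": 0.05, "family": 0.0},
    )
    print(f"Sens = {sens:.2f}, EPQ = {epq:.2f}")
```

In this toy case the OTUs whose genus is novel contribute nothing to sensitivity but most of the errors, which is why sparse coverage dominates the estimated error rate.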

References

1. HMP Consortium. Structure, function and diversity of the healthy human microbiome. Nature 486, (2012).
2. Lundberg, D. S. et al. Defining the core Arabidopsis thaliana root microbiome. Nature 488, (2012).
3. Pruesse, E. et al. SILVA: A comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 35, (2007).
4. DeSantis, T. Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol. 72, (2006).
5. Wang, Q., Garrity, G. M., Tiedje, J. M. & Cole, J. R. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl. Environ. Microbiol. 73, (2007).
6. Yilmaz, P. et al. The SILVA and all-species Living Tree Project (LTP) taxonomic frameworks. Nucleic Acids Res. 42, (2014).
7. McDonald, D. et al. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J 6, (2012).
8. Deshpande, V. et al. Fungal identification using a Bayesian classifier and the Warcup training set of internal transcribed spacer sequences. Mycologia (2015). doi: /
9. Kozich, J. J., Westcott, S. L., Baxter, N. T., Highlander, S. K. & Schloss, P. D. Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Appl. Environ. Microbiol. 79, (2013).
10. Yarza, P. et al. Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nat. Rev. Microbiol. 12, (2014).
11. Edgar, R. C. UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nat. Methods 10, (2013).
12. Rost, B. Twilight zone of protein sequence alignments. Protein Eng. 12, (1999).
13. Patil, K. R. et al. Taxonomic metagenome sequence assignment with structured output models. Nat. Methods 8, (2011).
14. Huse, S. M. et al. Exploring microbial diversity and taxonomy using SSU rRNA hypervariable tag sequencing. PLoS Genet. 4, e (2008).
15. Schloss, P. D. et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol. 75, (2009).
16. Caporaso, J. G. et al. QIIME allows analysis of high-throughput community sequencing data. Nat. Methods 7, (2010).

Author contributions

R.C.E. conceived of the study, performed the analysis and wrote the manuscript.

Supplementary Notes

Note 1. Mothur-rdp is effectively equivalent to RDP.
Note 2. Genus predictions for the Soil86 set.
Note 3. Coverage of SILVA and Greengenes.
Note 4. Error rates of SILVA and Greengenes taxonomy annotations.
Note 5. Reference database compatibility.
Note 6. Leave-one-out and leave-10%-out validation.
Note 7. Estimated LKR frequencies for in vivo samples.
Note 8. Accuracy metrics for taxonomy prediction.
Note 9. Calculation of LCR probabilities and S/E.
Note 10. Compute time and memory use of the tested methods.
Note 11. Software versions and command lines.
Note 12. Performance metrics on RDP14 and War4.
Note 13. Construction of a rank split.
Note 14. Sensitivity/EPQ plots for similarity measures.
Note 15. Non-predictions and blank names.
Note 16. LKR estimates and OTU error rates.
Supplementary References

Note 1. Mothur-rdp is effectively equivalent to RDP.

I compared the predictions of mothur-rdp and RDP for all rank splits of the RDP14 V4 region. At a bootstrap cutoff of 80%, 241,585 taxon names were predicted by one or both algorithms. Of these, 234,877 (97%) were identical. At 50% bootstrap, 280,334 / 291,705 (96%) were identical. A rate of disagreement of 3 to 4% is consistent with differences due to the use of random numbers in the bootstrapping procedure. I concluded that mothur-rdp and RDP are effectively equivalent implementations of the same algorithm and did not consider mothur-rdp separately for the rest of this work.

Note 2. Genus predictions for the Soil86 dataset.

Method                 Genus predictions
UTAX                   0
QIIME-uc               3 (0.1%)
QIIME-sm               19 (0.5%)
mothur-knn             283 (8%)
RDP (80% bootstrap)    561 (15%)
RDP (50% bootstrap)    942 (26%)
GAST                   3,048 (84%)
QIIME-blast            3,637 (89%)

Table SN2.1. Genus predictions on the Soil86 dataset. Soil86 contains 3,637 UPARSE OTU sequences from the soil sample of Kozich et al. [1] with 86% identity to the RDP14 reference database, suggesting a lowest known rank of order or higher. Few genus predictions would be expected for this set considering that the Yarza threshold is 95% for genus, but some methods predicted many genera, the most by QIIME-blast, which predicted genera for 89% of the sequences.

Note 3. Coverage of the Greengenes and SILVA databases.

The Greengenes [2] and SILVA [3] reference databases are larger than RDP14: Greengenes v.13.5 has sequences with taxonomy annotations and SILVA v123 has , compared to for RDP14. The default database for the QIIME methods is a subset of Greengenes (GG-QIIME, 99,322 sequences) obtained by clustering at 97% identity, while one of several suggested databases for use with mothur is a subset of SILVA (SILVA-mothur, 172,418 sequences) [ retrieved 12th Dec 2015]. The GG-QIIME and SILVA-mothur databases thus have an order of magnitude more annotated sequences than RDP14 and a priori might provide a better reference set for compatible algorithms, noting that RDP is not compatible because it requires names for the lowest rank for all training sequences (Note 5) while most sequences in GG-QIIME and SILVA-mothur lack species and genus names.

Fig. SN3.1 shows the identity distributions for soil OTUs against GG-QIIME and SILVA-mothur compared with RDP14, showing that GG-QIIME and SILVA-mothur have less sparse coverage than RDP14 though there are still many OTUs with estimated LKR > genus. Coverage is less sparse in the sense that there are more OTUs with high identities and fewer with low identities, and this gives the appearance of more known ranks. However, while almost all ranks are named in RDP14 (Note 11), the interpretation of lowest known rank is different for GG-QIIME and SILVA-mothur where most sequences lack names for low ranks, so "known" (present in the database) does not necessarily imply "named". Most annotations in those databases were predicted using sequence analysis methods, so "named" does not imply "authoritatively named" by conventional standards. I estimate the genus annotation error rate to be ~6% for SILVA-mothur and ~18% for GG-QIIME (Note 4).

It is therefore difficult to assess whether using one of the larger databases improves or degrades prediction accuracy for a given algorithm compared to using RDP14. Especially in the case of GG-QIIME, it appears that the annotation error rate of the database may be high enough to substantially degrade prediction performance, noting that annotation errors of the database will be compounded by the inherent error rate of a prediction algorithm, and confidence will be systematically overestimated because the database error rate is not considered.

Fig. SN3.1. Identity distributions of soil OTUs. Histograms show frequencies of integer-rounded sequence identities of top hits of OTUs to RDP14, GG-QIIME (the subset of Greengenes which is the default reference in QIIME) and SILVA-mothur, one of the reference databases provided for use by mothur. Colors indicate estimated lowest known ranks according to the Yarza thresholds (see main text for methods).

Note 4. Error rates of Greengenes and SILVA taxonomy annotations.

Henri Poincaré famously described mathematics as the art of giving the same name to different things [4]. In taxonomy, this is a bad idea. Most taxonomy annotations in the Greengenes and SILVA databases were predicted for uncultured sequences using a combination of automated and manual methods [5,6]. I don't fully understand their guiding principles or exactly how they were implemented, but presumably they work something like the following. The starting point is a set of sequences obtained from authoritatively classified organisms (gold-standard sequences). Other annotations are made using a predicted phylogenetic tree. If a non-gold sequence is in the same subtree as a gold sequence at a given rank, the name at that rank is inferred to be the same. To the best of my knowledge, neither Greengenes nor SILVA documents which sequences were used as gold standards or the evidence supporting a given annotation (is it a gold standard sequence? an automated prediction? an automated prediction which was manually adjusted, and if so why?), making the reliability of any given annotation difficult to evaluate or verify independently.

There are several differences in taxonomic nomenclatures and procedures for reconciling conflicts between taxonomy and sequence evidence. Greengenes is based on the NCBI taxonomy, RDP14 on Bergey's [7] and SILVA on LSPN [8]. While RDP14 strictly adheres to Bergey's to the best of my knowledge, Greengenes and SILVA modify their base taxonomies to address inconsistencies with phylogenies determined from sequence. For example, Greengenes deletes the genera Escherichia and Shigella, which are believed to overlap [9], leaving their sequences classified to family level only (Enterobacteriaceae).
SILVA deals with this issue in a different way by defining a combined genus (Escherichia-Shigella) and retaining well-known species names such as Escherichia coli, while Greengenes leaves their species names blank.

Both databases maintain large multiple alignments of 16S sequences, many of which have incorrect and ambiguous bases and some of which are undetected chimeras [10]. The Greengenes alignment is fixed at 7,682 columns using the NAST approach [2] which intentionally introduces misalignments (i.e., errors) to avoid increasing the number of columns. Construction of RNA alignments is challenging, especially for large and diverse datasets, and the best current alignment algorithms have substantial error rates when challenged with highly diverged sequences [11]. Perfect tree inference from a sequence alignment is generally not possible due to alignment errors and information loss [12]. Tree construction error rates are difficult to estimate but can be substantial on large datasets [13]. Given these issues, it is plausible that the Greengenes and SILVA trees could have substantial error rates, raising the question whether these, perhaps together with other imperfections in their annotation methods, have caused substantial numbers of taxonomy annotation errors.

This cannot be assessed directly because the ground truth is not known. Instead, I identified errors by noting that annotations for identical sequences should agree, so if two databases have different annotations for the same sequence then one or both of them must be wrong. Implementing this analysis is complicated by the fact that the databases use taxonomic systems with different sets of names. Another complication is the interpretation of blank names. Does a blank name indicate assignment to a sub-tree that has not been named, that a name cannot be assigned due to overlapping named groups (like Escherichia-Shigella), or low confidence in a prediction (i.e., the name might be known, or there are two candidate known names which do not overlap but which are hard to distinguish)? (See also Note 15.)
In consideration of these issues, I counted only names used by both systems (common names), excluding names which do not correspond to clades such as unclassified, uncultured, candidatus and incertae sedis. If one or both names were blank, the pair was not counted.
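The counting rule described above can be sketched as follows. This is an illustrative reading rather than the code used for the analysis: the function names and the dictionary layout are hypothetical, "names used by both systems" is approximated by names that occur in the annotations of both databases, non-clade labels are detected by substring match, and a pair is counted only when both of its names are common names. Each of these choices is an assumption.

```python
# Labels that do not correspond to clades, as listed in the text.
NON_CLADE_TAGS = ("unclassified", "uncultured", "candidatus", "incertae sedis")


def is_clade_name(name):
    """True for a non-blank name that carries none of the non-clade labels."""
    lowered = name.lower()
    return bool(name) and not any(tag in lowered for tag in NON_CLADE_TAGS)


def compare_annotations(db_a, db_b, rank):
    """Count agreeing and disagreeing names for identical sequences.

    `db_a` and `db_b` map a sequence string to a dict of rank -> name.
    Returns (same, different).
    """
    names_a = {tax.get(rank, "") for tax in db_a.values()}
    names_b = {tax.get(rank, "") for tax in db_b.values()}
    common_names = {n for n in names_a & names_b if is_clade_name(n)}

    same = different = 0
    for seq in db_a.keys() & db_b.keys():  # sequences present in both
        a = db_a[seq].get(rank, "")
        b = db_b[seq].get(rank, "")
        # Blank names are never common names, so blank pairs are skipped here.
        if a not in common_names or b not in common_names:
            continue
        if a == b:
            same += 1
        else:
            different += 1
    return same, different
```

The disagreement rate for a rank is then `different / (same + different)`, which bounds the summed error rates of the two databases from below.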

Results are summarized in Table SN4.1, which shows that SILVA-mothur and GG-QIIME disagree on 24% of genus annotations and 2% of phylum annotations for identical sequences. This provides a lower bound on the sum of the annotation error rates for both databases. The lower bound is achieved when every incorrect annotation is correct in the other database. It should be rare for annotations to be wrong in both databases by chance (if errors are random at a rate of ~10%, then ~1% will be wrong in both). Given that distinctly different methods are used for alignment and tree construction, I would guess that the errors have low correlation between the databases and the true combined rate is close to this lower bound.

A pair-wise comparison measures the combined error rate without indicating the relative rate, i.e. whether one database has a higher or lower error rate than the other. This can be investigated using pair-wise comparisons with a third database, RDP14. Genus annotation disagreement rates with RDP14 are 11% for GG-QIIME and 3% for SILVA-mothur. This indicates that GG-QIIME has a higher error rate than SILVA-mothur because the error rate of RDP14 should be roughly the same in both pair-wise comparisons, adding approximately the same term to both combined rates. Also, all RDP14 sequences have genus annotations and its much smaller size is more amenable to curation, suggesting that it has a high frequency of gold-standard sequences and is likely to have a much lower error rate. This hypothesis is supported by the lower pair-wise disagreements of RDP with the other two databases. If we assume that the error rate of RDP14 is smaller than the other two databases, then we can infer that the error rate of GG-QIIME is roughly $11\% / 3\% \approx 3$ to $4\times$ larger than SILVA-mothur. Assuming a factor of three implies that the total error rates are $24\% \times 3/4 = 18\%$ for GG-QIIME and $24\% \times 1/4 = 6\%$ for SILVA-mothur.
While these estimates are uncertain, the combined rate of 24% is robust and it is reasonable to conclude that the minimum plausible genus annotation error rates are 5% for SILVA-mothur (minimum determined by assuming a maximum of $4\times$ more errors in GG-QIIME) and 12% for GG-QIIME (minimum determined as half of the 24% combined rate, given that the comparison with RDP14 indicates a higher rate for GG-QIIME).
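The arithmetic behind these estimates can be checked with a short sketch. It assumes, as the argument above does, that the combined disagreement rate is the sum of the two databases' error rates and that one rate is a fixed multiple of the other; the function name is hypothetical.

```python
def split_combined_rate(combined, ratio):
    """Split a combined error rate between two databases.

    `combined` is the sum of the two error rates and `ratio` is how many
    times larger the higher rate is than the lower one.
    Returns (higher_rate, lower_rate).
    """
    lower = combined / (ratio + 1)
    return combined - lower, lower


if __name__ == "__main__":
    print(split_combined_rate(24.0, 3))  # factor of three: 18% and 6%
    print(split_combined_rate(24.0, 4))  # factor of four: 19.2% and 4.8%
```

A factor of three reproduces the 18% and 6% estimates, and a factor of four gives 4.8% for the lower rate, which rounds to the 5% minimum quoted for SILVA-mothur.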

1. GG-QIIME and SILVA-mothur
Rank Common Names Same Name Different Name
Phylum (98.3%) 481 (1.7%)
Class (88.2%) 1201 (4.9%)
Order (78.1%) 2804 (12.8%)
Family (83.1%) 1428 (9.0%)
Genus (69.2%) 1868 (24.1%)

2. GG-QIIME and RDP14
Rank Common Names Same Name Different Name
Phylum (99.6%) 2 (0.4%)
Class (95.3%) 27 (1.5%)
Order (88.6%) 79 (4.4%)
Family (92.1%) 78 (5.0%)
Genus (89.2%) 151 (10.8%)

3. SILVA-mothur and RDP14
Rank Common Names Same Name Different Name
Phylum (99.8%) 2 (0.2%)
Class (99.4%) 17 (0.4%)
Order (93.7%) 57 (1.7%)
Family (94.8%) 141 (3.3%)
Genus (97.3%) 124 (2.7%)

Table SN4.1. Pair-wise comparisons of taxonomy annotations. The table shows the rate of agreement and disagreement between taxonomy annotations for identical sequences found in each pair of reference databases. Common Names is the number of identical sequences having a common name for the given rank in one or both databases, Same Name is the number of these sequences for which the name was the same and Different Name is the number for which the name was different. A common name is a taxon name found in the taxonomy systems for both databases.

Note 5. Reference database compatibility.

The tested programs place different constraints on taxonomy annotations. Mothur does not allow a species name, which ruled out testing at species rank on War4. RDP requires that the lowest rank is named for all reference sequences, which ruled out testing on Greengenes or SILVA where genus and species names are often omitted. The mothur reimplementation of the RDP algorithm does allow missing genus names. RDP14 includes reference sequences with optional ranks (suborder and subclass) and missing ranks (e.g., sometimes only phylum and genus are specified with no intermediate ranks). These variations are supported by RDP but not by some other programs. UTAX requires that names correspond to clades so that the LCR can be determined for all pairs of sequences. This means that names such as unclassified, uncultured, candidatus and incertae sedis should be excluded for training.

I therefore constructed subsets of the reference databases with taxonomies that were compatible with all programs to enable testing on the same reference data. This was done by filtering out special cases such as "uncultured", deleting optional ranks (suborder, subclass) and discarding annotations with any missing or blank names for required ranks (genus, family, class, order and phylum for RDP14 and species, family, class, order and phylum for War4). This required discarding 506 / 10,049 sequences (5%) from RDP14 and 9,546 / 24,500 (40%) from War4. The compatible versions of the reference databases are included in the Supplementary Files.

Note 6. Leave-one-out and leave-10%-out validation.

In their 2007 paper describing the RDP Naive Bayesian Classifier [14], Wang et al. state in the Abstract that "results from leave-one-out testing show that the overall accuracies at all levels of confidence for near-full-length and 400-base segments were 89% or above down to the genus level". In my opinion, this approach is not appropriate for microbial taxonomy prediction because an informative leave-one-out validation requires that all categories are known and training data is dense (Fig. SN6.1). With microbial taxonomy, training data is sparse and many microbial genera and higher ranks are novel in typical data (Fig. SN6.2). In addition, accuracy was measured using a bootstrap cutoff of zero rather than the authors' recommended cutoff of 80%. Roughly half of the genera in RDP14 have only a single sequence (913/1,948, Table SN6.1) and therefore cannot be predicted if left out of the training set, but this is not taken into account. Accuracy as measured by this test is thus the maximum possible sensitivity in a scenario where a large majority of query sequences have identity >97% (Fig. SN6.2), which is unrealistic, and where the maximum achievable accuracy is not 100% as would be expected by convention. At RDP14 genus level, RDP and UTAX have 86% accuracy by this definition, close to the maximum possible of 91% (Table SN6.1), as would be expected for sequences with >97% identity. The observation that accuracy is less than 100% is mostly explained by classifications that are impossible due to singletons (9%) with a smaller contribution by misclassification errors (5%). It is therefore clear that accuracy as measured by the RDP leave-one-out test methodology is not predictive of sensitivity or error rates on typical biological data. Leave-one-out accuracies for RDP and UTAX are reported in Table SN6.2.

In a recent preprint, Bokulich et al. describe a taxonomy prediction validation framework designed to enable reproducible results. I was unable to install the framework or download the test data. The framework has several dependencies on third-party code including Python packages which failed to install. One of the described tests uses leave-10%-out validation where 10% of sequences are extracted from Greengenes for use as a query set with the remaining sequences used as a reference. I followed the methodology described in the preprint by extracting the V4 region of Greengenes v13.5 using the 515F/806R primers and extracting 10% subsets chosen at random. I found the identity distribution shown in Fig. SN6.2 (lower-right), which shows that a large majority of sequences in the query sets have 99% identity with their corresponding reference sets. This distribution is even more strongly skewed towards 100% identity than the RDP leave-one-out test, which is explained by stronger sampling biases; for example, the most abundant genus in Greengenes v13.5 is Staphylococcus with 135,711 sequences, comprising more than 10% of the database. Therefore, this test is not predictive of sensitivity and error rates on typical biological data.

Fig. SN6.1. Microbial taxonomy prediction is not a textbook problem. In a textbook classification problem (left), all categories are known (handwritten digits, in this example) and have many training examples. Leave-one-out and leave-10%-out validation are informative in a textbook case because they are realistic models of classification in practice. With microbial taxonomy, reference data is sparse (right). In this analogy, the task of an algorithm is to predict handwritten characters when the full alphabet is not known and training data is sparse. If leave-one-out validation is used, the algorithm is not challenged by realistic amounts of novel data (9, A, B). Characters with only one training example (4 through 8) cannot be predicted when they are left out. If accuracy is measured as the fraction of characters that are correctly predicted in a leave-one-out test, the highest possible accuracy is less than 100% due to the singletons. Taxonomy has additional complications. There is strong sampling bias in the reference data, e.g., human pathogens are overrepresented (like digits 0, 1 and 2 on the right). Some training examples have multiple labels because multiple genera can have the same V4 sequence, analogous to the problem that 0 and I can be digits or letters. Even if only one genus is known for a given V4 sequence, a novel genus in the same family might have the same sequence, so a prediction of genus for that sequence should have <100% confidence.

Fig. SN6.2. LKRs for in vivo samples, leave-one-out and leave-10%-out test data. This figure compares the identity distribution of soil, mouse gut and human gut OTUs (left) with the identity distribution of query-reference pairs used in the RDP leave-one-out test and the Bokulich et al. leave-10%-out test on the 16S V4 region (right). Colors show lowest known ranks (LKRs) estimated using Yarza identities as described in the main text. In the distributions for the validation tests, a large majority of query sequences have >97% identity to the reference set (right), while in practice most sequences belong to novel genera (left).

Table SN6.1. Leave-one-out maximum accuracy. Columns: Rank, followed by Names, Singletons and Max. acc. for War4 and then for RDP14. Rows: Phylum, Class, Order, Family, Genus and Species (War4 only: 7,390 names). The table shows the maximum possible accuracy of leave-one-out tests on the War4 (ITS) and RDP14 (16S) training sets, which are the defaults currently used by RDP. Names is the number of taxon names in the training set. Singletons is the number of names having exactly one training sequence, which therefore cannot be predicted when left out. Max. acc. is the maximum possible accuracy by the RDP definition, which is <100% when there are singletons in the training set. Since there are singletons at all ranks, the maximum accuracy is always <100% but appears as 100% in some cases because values are shown to three significant figures.
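The quantities in this table can be illustrated with a short sketch. It assumes, as my reading of the caption, that the maximum accuracy is the fraction of training sequences whose name has at least one other training sequence; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def max_leave_one_out_accuracy(labels):
    """Names, singletons and maximum leave-one-out accuracy for one rank.

    labels: one taxon name per training sequence at the rank of interest.
    A sequence whose name occurs only once cannot be predicted when it is
    left out, so it can never count as correct.
    """
    counts = Counter(labels)
    singletons = sum(1 for n in counts.values() if n == 1)
    predictable = sum(n for n in counts.values() if n > 1)
    return {
        "names": len(counts),
        "singletons": singletons,
        "max_accuracy": predictable / len(labels),
    }


# Toy data: four genus names, two of which are singletons.
toy_labels = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
summary = max_leave_one_out_accuracy(toy_labels)
```

In this toy case two of ten sequences are singletons, so no algorithm could exceed 80% accuracy by this definition.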

Table SN6.2. Leave-one-out results for War4 and RDP14. Columns: Reference, Method, Phylum, Class, Order, Family, Genus, Species. Rows: War4 (ITS1), War4 (ITS2), War4 (full-length), RDP14 (V4) and RDP14 (full-length), each with one row for RDP and one for UTAX. The table shows accuracy as defined by the RDP leave-one-out methodology, i.e. the fraction of query sequences for which the rank is correctly predicted at >0% bootstrap confidence for RDP and P>0 for UTAX. The maximum possible accuracy by this definition is <100% when there are singleton taxa (i.e., those having only one reference sequence). At RDP14 genus level, RDP and UTAX have 86% accuracy, close to the maximum possible of 91% (Table SN6.1). Singletons in the reference database thus reduce accuracy below 100% more than misclassification errors by the algorithms.

Note 7. Estimated LKR frequencies for in vivo samples. Prediction error rates for known and novel taxa respectively were measured using data for which LKRs are inferred from authoritative annotations. However, these rates do not directly indicate overall error rates for typical biological samples. For example, if most genera in a given sample are known, then most errors will be due to misclassifications and the overclassification rate for genus will be largely irrelevant, but if novel genera are common, then the genus overclassification rate is important (see Note 8 for definitions). Thus, in order to estimate realistic error rates for typical data, we also need to determine realistic rates of novelty, i.e. realistic LKR frequencies. Once we have LKR frequencies, overall sensitivity and error rates can be estimated by summing over all ranks (Eqs. 1 and 2 in the main text). I estimated LKR frequencies for soil, human gut and mouse gut samples from a recent study by Kozich et al. 1 The goal of this step was to obtain realistic frequencies, i.e. rates of novel taxa at each rank that are representative of biological samples in practice, not to make an accurate determination of the frequencies on those particular samples. LKR frequencies were estimated using identity thresholds, as described in detail below. This method is not expected to be very accurate, but this does not matter because the frequencies will be realistic even if they are under- or over-estimated by quite large factors. For example, I estimate that 37% of the genera in the soil sample are known. This number could be quite far off; perhaps the true number is 20% or 50%, but it is surely not 1% or 99%. As long as the estimate is in the right ballpark, a sample with 37% known genera is not exceptional, and this rate is reasonable for summarizing the performance of a taxonomy prediction algorithm.
To avoid any misunderstanding on this central point, it is also important to note that my methodology does not use identity to determine LKRs of individual sequences; when required, they are obtained using authoritative annotations. Identity thresholds were used only to obtain realistic LKR frequencies for three representative samples. Identity thresholds are commonly used to determine approximate taxonomic relationships. For example, it is commonly assumed that ≥97% identity for two full-length 16S sequences indicates that the species is probably the same and conversely, if the identity is <97%, then the species is probably different. This gives us a method for estimating the frequency of known species in a sample: it is the fraction of sequences with ≥97% identity with the reference database. This approach can be generalized to other ranks, as in the work of Yarza et al. 15, who determined the number of novel taxa in large databases of full-length 16S sequences. Their method was based on finding appropriate clustering thresholds for ranks from species to phylum. Sequence identity correlates only approximately with taxonomic rank, so clusters will not correspond one-to-one with names: some clusters will contain more than one name (lumping) and some names will be found in several clusters (splitting). Yarza et al. tuned their thresholds so that the number of clusters containing known taxa agreed with the number of distinct taxon names. In other words, the tuning balanced splitting and lumping so that (number of clusters) = (number of distinct names) at the given rank. In this framework, the number of clusters which do not contain known names is an operational definition of the number of unnamed taxa. At genus rank, Yarza et al. found that the clustering threshold which balanced splitting and lumping was 95%. Using this threshold, I estimated the number of known genera as the number of sequences having ≥95% identity with the reference database. This test is not reliable in any given case: some sequences with known genera will have <95% identity and some novel genera will have ≥95% identity, but these will tend to balance each other out (analogous to lumping and splitting of clusters). LKR frequencies at higher ranks were estimated in the same way. The Yarza identity thresholds were determined for full-length 16S sequences, which raises the question of whether they are optimal for shorter gene segments such as the V4 region used in this work.
The thresholds are probably not optimal, but they are surely good enough to give realistic frequencies. From Fig. 1 in the main text we can see that the genus threshold (95%) appears to be too low because P(LCR=genus | 95%) = 0.34, so at 95% identity the LKR is more likely to be family or higher. A better V4 threshold for genus appears to be 96% or 97%, with P(LCR=genus) = 0.51 and 0.61 respectively. Using a higher identity would increase the estimated frequency of novel genera, so using the thresholds determined on full-length sequences gives a conservative estimate of novel genus frequency.

Rank      Id.    Sample      LKR%
Phylum    75%    Soil        5%
                 Mouse gut   3%
                 Human gut   0.2%
Class     79%    Soil        7%
                 Mouse gut   3%
                 Human gut   0%
Order     82%    Soil        21%
                 Mouse gut   10%
                 Human gut   6%
Family    86%    Soil        45%
                 Mouse gut   42%
                 Human gut   49%
Genus     95%    Soil        16%
                 Mouse gut   18%
                 Human gut   8%
Species   98%    Soil        4%
                 Mouse gut   12%
                 Human gut   8%
Sequence  100%   Soil        2%
                 Mouse gut   10%
                 Human gut   17%

Table SN7.1. Estimated LKR frequencies for in vivo samples vs. RDP14. LKR frequencies estimated for UPARSE OTUs constructed from the Kozich et al. samples of soil (7,564 OTUs), mouse gut (757 OTUs) and human gut (452 OTUs). Column headings are: Id., the Yarza et al. cutoff identity threshold for the rank; LKR%, the fraction of OTUs having an LKR at this rank according to the thresholds; Known, the number of known OTUs; Novel, the number of novel OTUs; Novel%, the fraction of novel OTUs. Novel frequencies >20% are highlighted.
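The threshold-based estimate described in Note 7 can be sketched as below. The thresholds are the Id. values listed in Table SN7.1; the function names, the input format (one top-hit percent identity per OTU) and the "novel_phylum" label for identities below the phylum threshold are assumptions made for illustration.

```python
from collections import Counter

# Identity cutoffs (percent) from the Id. column of Table SN7.1,
# ordered from the lowest rank to the highest.
THRESHOLDS = [
    ("sequence", 100.0), ("species", 98.0), ("genus", 95.0),
    ("family", 86.0), ("order", 82.0), ("class", 79.0), ("phylum", 75.0),
]


def estimate_lkr(identity):
    """Lowest rank whose threshold is met by the top-hit identity."""
    for rank, cutoff in THRESHOLDS:
        if identity >= cutoff:
            return rank
    return "novel_phylum"  # assumption: nothing known even at phylum rank


def lkr_frequencies(identities):
    """Fraction of OTUs assigned to each estimated LKR."""
    counts = Counter(estimate_lkr(i) for i in identities)
    return {rank: n / len(identities) for rank, n in counts.items()}


# Hypothetical top-hit identities for four OTUs.
freqs = lkr_frequencies([100.0, 97.0, 96.0, 90.0])
```

With these four hypothetical OTUs, half are assigned LKR=genus, one quarter LKR=sequence and one quarter LKR=family.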

Note 8. Accuracy metrics for taxonomy validation. Algorithm predictions are often characterized as true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). Prediction accuracy is conventionally summarized using measures calculated from totals for given types of prediction, e.g. Bokulich et al. (reference in Note 6) use the textbook metrics precision = TP/(TP+FP) and recall = TP/(TP+FN). However, this is not a textbook case (Note 6), and I used different metrics which I found to correspond better with intuitive concepts of accuracy relevant for taxonomy. UTAX and the other algorithms considered in this work do not predict novelty (Note 15). The concept of a true negative therefore does not apply because predictions are never negative in the sense that they are for a binary classifier. To characterize false positive rates, I defined a misclassification as a false positive when the rank is known (FPmis), and an overclassification as a false positive when the rank is novel (FPover). An overclassification error occurs when the algorithm predicts too many ranks; it should have climbed higher up the taxonomic tree. In this spirit, a false negative could be described as an underclassification error because too few ranks are predicted, but this is true of all FNs so there is no need for a new category. To characterize the rate of true positives, I defined sensitivity = TP / Nknown, where Nknown = TP+FN+FPmis is the number of queries with known names. Sensitivity by my definition has a maximum of 100%, which could be achieved by an ideal algorithm, while the RDP accuracy measure is necessarily <100% if there are novel query sequences (Note 6). My definition of sensitivity captures the intuitive idea of "fraction of achievable predictions which are correct". Precision and recall cannot do this because misclassification errors (where an ideal algorithm could make a TP prediction) and overclassification errors (where a TP is impossible because there are no training examples) are not distinguished.

As a summary statistic for errors I chose to use errors per query (EPQ) = FP / NQ where NQ is the total number of query sequences. False negatives are not counted as errors for calculating EPQ because they are already accounted for in sensitivity. When precision and recall are used, false positives are indicated by precision < 100%. As errors increase, precision gets lower. The divisor for precision is (TP+FP) = number of predictions, while the divisor for EPQ is the total number of queries. Some, or many, queries may not get a prediction (which is not the same as a prediction that the rank is novel as noted above; see also Note 15). Both precision and EPQ capture the FP rate, and can readily be converted given the number of queries and number of predictions. In a prediction task with dense reference data they capture a similar intuitive notion because all FPs are misclassifications. However, with sparse reference data / novel query data there is an important difference. If you continue to add novel queries, all new predictions are overclassification errors and the precision is reduced indefinitely and approaches zero for very high novelty, even if the algorithm has a low overclassification rate. In other words, precision reflects a property of the query set as well as a property of the algorithm. For low ranks, novelty may be high enough that overclassifications swamp misclassifications even if the algorithm has low rates for both types of error, making precision hard to interpret. By contrast, when there is high novelty EPQ will converge on the overclassification rate, an intrinsic property of the algorithm.
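The definitions in Note 8 can be written out as a short sketch. The function name and all counts below are hypothetical; they are chosen only to illustrate how precision falls towards zero as novel queries are added while EPQ approaches the overclassification rate (10% in this made-up example).

```python
def taxonomy_metrics(tp, fn, fp_mis, fp_over, n_queries):
    """Sensitivity, EPQ and precision as defined in Note 8.

    tp, fn:    true positives and false negatives
    fp_mis:    false positives where the rank is known (misclassifications)
    fp_over:   false positives where the rank is novel (overclassifications)
    n_queries: total number of query sequences (NQ)
    """
    n_known = tp + fn + fp_mis          # queries with known names
    fp = fp_mis + fp_over               # all false positives
    return {
        "sensitivity": tp / n_known,
        "epq": fp / n_queries,
        "precision": tp / (tp + fp),
    }


# 100 known queries only: 80 TP, 15 FN, 5 misclassifications.
few_novel = taxonomy_metrics(tp=80, fn=15, fp_mis=5, fp_over=0, n_queries=100)

# Same known queries plus 10,000 novel queries, of which a hypothetical
# 10% are overclassified.
many_novel = taxonomy_metrics(tp=80, fn=15, fp_mis=5, fp_over=1000,
                              n_queries=10100)
```

Sensitivity is identical in both cases because it depends only on queries with known names, whereas precision depends strongly on how many novel queries happen to be in the set.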

Note 9. Calculation of LCR probabilities and S/E. If a married couple has a height difference of 2cm, what is the probability that the taller spouse is male? To answer this, collect information about a large number of couples, extract the subset where the height difference is 2cm, and calculate the fraction where the taller spouse is a man. If 80% of those couples have a taller man, we conclude that the probability is 0.8. Implicitly, this procedure assumes we have observed events generated by a hidden stochastic process, and the best estimate we can make of the underlying probability distribution (given some reasonable assumptions) is the observed frequency in those samples. This is called an a-posteriori estimate. If a pair of sequences has 90% identity, what is the probability that their lowest common rank is family? To answer this, collect a large number of pairs of sequences, extract the subset with 90% identity and calculate the fraction with LCR=family. Fig. SN9.1 shows schematically how UTAX calculates LCR probabilities from a reference database, using sequence identity as the similarity measure for this example. (In practice, UTAX uses dutax defined by Eq. 6 in the main text.) An all-pairs triangular matrix (a) is constructed containing pair-wise sequence identities, indicated by colors (green=100%, yellow=95% and orange=90%). The lowest common rank (LCR) is determined for each pair by comparing taxonomy annotations and marked as s (species), g (genus) or f (family). For each identity, the corresponding pairs are identified: (b) for 100%, (c) for 95% and (d) for 90%. For a given identity, the fraction of pairs having each LCR is calculated, i.e. the LCR frequencies. For example, in (c) there are nine pairs with 95% identity. Of these, one has LCR=species, five have LCR=genus and three have LCR=family. The LCR probabilities are estimated to be the observed frequencies, so P(LCR=species | 95%) ~ 1/9, P(LCR=genus | 95%) ~ 5/9 and P(LCR=family | 95%) ~ 3/9 (the symbol ~ means "is estimated to be"). Using integer-rounded percent identities ensures that the set of pairs for a given identity is usually large enough to make a good estimate of its LCR probabilities. Missing values are filled in by interpolation, e.g. if there are no pairs with 76% identity then P(LCR | 76%) ~ (P(LCR | 75%) + P(LCR | 77%))/2.
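The frequency estimate and the interpolation rule described in this note can be sketched as follows. This uses plain sequence identity, as in the toy example, rather than the score UTAX uses in practice; the function names and the input format (one `(identity, LCR)` tuple per pair) are assumptions.

```python
from collections import Counter, defaultdict


def lcr_probabilities(pairs):
    """Estimate P(LCR | identity) as observed frequencies.

    pairs: iterable of (percent_identity, lcr) for all reference pairs.
    Identities are rounded to the nearest integer percent (half rounds up).
    """
    counts = defaultdict(Counter)
    for identity, lcr in pairs:
        counts[int(identity + 0.5)][lcr] += 1
    return {ident: {lcr: n / sum(c.values()) for lcr, n in c.items()}
            for ident, c in counts.items()}


def interpolate(probs, identity):
    """Fill a missing identity by averaging its two integer neighbours."""
    if identity in probs:
        return probs[identity]
    lo, hi = probs[identity - 1], probs[identity + 1]
    return {lcr: (lo.get(lcr, 0.0) + hi.get(lcr, 0.0)) / 2
            for lcr in set(lo) | set(hi)}


# Panel (c) of the toy example: nine pairs with 95% identity.
panel_c = [(95, "species")] + [(95, "genus")] * 5 + [(95, "family")] * 3
probs_c = lcr_probabilities(panel_c)

# Hypothetical neighbouring values used to fill in a missing 76% entry.
neighbours = {75: {"family": 0.2, "order": 0.8},
              77: {"family": 0.4, "order": 0.6}}
filled = interpolate(neighbours, 76)
```

The sketch only handles a single missing identity flanked by two observed ones, which is the case described in the text.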

Fig. SN9.1. Calculation of LCR probabilities from a reference database.

Fig. SN9.2. Calculation of sensitivity vs. EPQ from a reference database. This figure shows how a sensitivity vs. error plot for common rank (CR) is calculated for genus, using the toy example from Fig. SN9.1. Pairs are considered in order of decreasing identity. If LCR=s or LCR=g, the pair is a true positive CR prediction because the genus is the same; if LCR=f, the pair is a false positive because the genus is different. At each identity, the number of true positives and false positives (f, red outlines) are counted. There are 14 pairs with common genera (LCR=s or g) and there are 21 queries (the total number of pairs), so the CR sensitivity at a given cutoff is TP/14 and EPQ is FP/21 (see Note 8 for definitions and discussion of sensitivity and EPQ). Here, there are three possible thresholds at identities 100%, 95% and 90%, which incrementally include queries from pairs in groups (b), (c) and (d) respectively.
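The cumulative calculation described in this caption can be sketched as below. The counts for group (c) are those given in Note 9; the per-group counts for (b) and (d) are not stated in the text, so the values here are hypothetical, chosen only to agree with the stated totals of 21 pairs and 14 pairs with a common genus.

```python
# (identity, counts of pairs by LCR). Group (c) is from Note 9;
# groups (b) and (d) are hypothetical but consistent with the totals.
GROUPS = [
    (100, {"s": 3, "g": 1, "f": 0}),   # (b), assumed counts
    (95, {"s": 1, "g": 5, "f": 3}),    # (c), from the text
    (90, {"s": 0, "g": 4, "f": 4}),    # (d), assumed counts
]


def sensitivity_epq_curve(groups):
    """Sensitivity and EPQ for genus at each identity cutoff.

    Pairs are added in order of decreasing identity. Pairs with LCR=s or
    LCR=g share a genus (true positives); pairs with LCR=f do not (false
    positives).
    """
    n_common = sum(c["s"] + c["g"] for _, c in groups)
    n_queries = sum(sum(c.values()) for _, c in groups)
    tp = fp = 0
    curve = []
    for identity, c in sorted(groups, key=lambda g: g[0], reverse=True):
        tp += c["s"] + c["g"]
        fp += c["f"]
        curve.append((identity, tp / n_common, fp / n_queries))
    return curve


curve = sensitivity_epq_curve(GROUPS)
```

Each tuple in `curve` is one point on the sensitivity vs. EPQ plot, one per possible identity threshold.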

Note 10. Software versions and command lines.

UTAX version 1.0. Source code and Linux binary are in the Supplementary Files.

RDP: Stand-alone classifier version

RDP training:
    java -Xmx8g -cp /sw/rdp_classifier_2.11/rdp_classifier-2.11.jar edu/msu/cme/rdp/classifier/train/ClassifierTraineeMaker treefile dbfile 1 version1 name_not_used traindir/

RDP classification:
    java -Xmx1g -jar /sw/rdp_classifier_2.11/rdp_classifier-2.11.jar -t traindir/rRNAClassifier.properties -q query.fa -o output.txt

QIIME: version

QIIME-uc:
    assign_taxonomy.py -i query.fa -m uclust -r db.fa -t taxonomy.txt

QIIME-sm:
    assign_taxonomy.py -i query.fa -m sortmerna -r db.fa -t taxonomy.txt

QIIME-blast:
    assign_taxonomy.py -i query.fa -m blast -r db.fa -t taxonomy.txt

mothur-knn:
    classify.seqs(fasta=query.fa, template=db.fa, taxonomy=taxonomy.txt, method=knn, processors=6)

GAST: Source dated 25 Feb 2011 (no version number given).
    gast -in query_fa -ref ref_fa -rtax taxonomy.txt -out output.txt

Note 11. Compute time and memory use.

Table SN11.1. Execution time and maximum memory use of the tested methods. Columns: Method, Elapsed time (secs.), Maximum memory (Mb). Rows: UTAX, RDP, QIIME-uc, QIIME-sm, QIIME-blast, GAST, mothur-knn. The table reports elapsed time in seconds and maximum memory in megabytes for the tested methods using the 9,364 sequences in the V4 reference database extracted from RDP14 as both query and reference. Programs were run under Ubuntu Linux on an Intel Core i7-3930k CPU running at 3.20GHz with 64Gb RAM.

Note 12. Performance metrics on RDP14 and War4. Method sensitivities for RDP14 and War4 are given in Supp. Table SN12.1. UTAX is seen to have a relatively low sensitivity yet maintains performance which I considered to be acceptable and comparable to the best alternative, with one exception: genus predictions on War4 (~39% sensitivity compared to ~60-70% for RDP_80). I interpreted this anomaly as an underestimate of EPQ by the algorithm, which I was not able to explain, but I found that it could be addressed by setting P ≥ 0.7, which gave 71% sensitivity and EPQ ~5%. Error rates are shown in Table SN12.2, which shows that UTAX consistently achieves lower error rates than most other methods, dramatically so in many cases, with the exception of mothur-knn, which has much lower sensitivity. Genus overclassification rates with LKR=genus increased from V4 to full-length for all methods except UTAX, for which the overclassification rate was lower (19% V4, 13% full-length; Supplementary Files). Notably, the overclassification rate of RDP_80 jumped from 31% on V4 to 50% on full-length sequences and that of RDP_50 to 81%.

Table SN12.1. Sensitivity with LKR=genus on RDP14 (16S) and War4 (ITS). Panels: Genus, Family, Order, Class, Phylum. Columns in each panel: 16S (V5), 16S (V4), 16S (V3V5), 16S (FL), ITS1, ITS2, ITS (FL). Rows in each panel: UTAX, RDP_50, RDP_80, QIIME-uc, QIIME-sm, QIIME-blast, GAST, mothur-knn. The table shows sensitivity (defined in Note 8) as a percentage with LKR=genus for predicted ranks from genus to phylum. LKR=genus was chosen as representative of the in vivo samples (Note 7). Sensitivities <75% are highlighted in yellow, <50% in orange. The complete matrices for sensitivity, overclassification and misclassification for all pairs (prediction rank, LKR) are included in the Supplementary Files. The V5 region of 16S was truncated to 120nt to simulate reads obtained by older NGS machines. The V4 region (~250nt) is popular with current sequencing technologies. The V3V5 region (~520nt) was sequenced on older 454 machines and models the longer reads which will be achieved by NGS machines in the near future.

Table SN12.2. Error rates measured on the RDP14 V4 region. Panels: Predicted genus, Predicted family, Predicted order, Predicted class, Predicted phylum. Columns in each panel: LKR=genus, LKR=family, LKR=order, LKR=class, each marked as misclassification (Mis.) or overclassification (Over.). Rows in each panel: UTAX, RDP_50, RDP_80, QIIME-uc, QIIME-sm, QIIME-blast, GAST, mothur-knn. The table shows misclassification (Mis.) and overclassification (Over.) error rates as percentages for predicted ranks from genus to phylum as defined in Note 8. LKRs from genus to class are shown because novel phyla are rare in practice. When the predicted rank is <LKR, errors are overclassifications, and when the predicted rank is ≥LKR, errors are misclassifications (Note 8). Error rates ≥10% are highlighted in yellow, ≥20% in orange and ≥50% in red.

Supplementary Note 13. Construction of a rank split. Fig. SN13.1. Rank split with LKR=family. The reference database is divided into two subsets X and Y, colored gold and blue respectively in the figure, such that LKR=family. For each family, its genera are assigned at random to X or to Y. Each family must contribute at least one genus to X and at least one genus to Y, and no genus is present in both. Families containing only one genus are discarded. With LKR=family, ranks of family and above are always known (i.e., present in both X and Y) while ranks of genus and below are always novel (i.e., not present in both).
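A rank split with LKR=family could be constructed along these lines. This is a minimal sketch, not the code used for the paper: the record format (sequence id, family, genus), the function name and the way the first two genera are forced onto opposite sides are assumptions.

```python
import random
from collections import defaultdict


def rank_split_family(records, seed=0):
    """Split sequence ids into X and Y so that LKR=family.

    records: list of (seq_id, family, genus) tuples.
    Every retained family has at least one genus in X and one in Y,
    no genus appears on both sides, and single-genus families are dropped.
    """
    rng = random.Random(seed)
    genera_by_family = defaultdict(set)
    for _, family, genus in records:
        genera_by_family[family].add(genus)

    side = {}  # (family, genus) -> "X" or "Y"
    for family, genera in genera_by_family.items():
        if len(genera) < 2:
            continue  # family with only one genus is discarded
        shuffled = sorted(genera)
        rng.shuffle(shuffled)
        side[(family, shuffled[0])] = "X"   # guarantee one genus in X
        side[(family, shuffled[1])] = "Y"   # guarantee one genus in Y
        for genus in shuffled[2:]:
            side[(family, genus)] = rng.choice("XY")

    x = [sid for sid, fam, gen in records if side.get((fam, gen)) == "X"]
    y = [sid for sid, fam, gen in records if side.get((fam, gen)) == "Y"]
    return x, y


toy = [("s1", "F1", "G1"), ("s2", "F1", "G1"), ("s3", "F1", "G2"),
       ("s4", "F1", "G3"), ("s5", "F2", "G4")]
x_ids, y_ids = rank_split_family(toy)
```

In the toy data, family F2 has a single genus, so sequence s5 is left out of both subsets.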


More information

Chapter 5 Evaluating Classification & Predictive Performance

Chapter 5 Evaluating Classification & Predictive Performance Chapter 5 Evaluating Classification & Predictive Performance Data Mining for Business Intelligence Shmueli, Patel & Bruce Galit Shmueli and Peter Bruce 2010 Why Evaluate? Multiple methods are available

More information

Chang Xu Mohammad R Nezami Ranjbar Zhong Wu John DiCarlo Yexun Wang

Chang Xu Mohammad R Nezami Ranjbar Zhong Wu John DiCarlo Yexun Wang Supplementary Materials for: Detecting very low allele fraction variants using targeted DNA sequencing and a novel molecular barcode-aware variant caller Chang Xu Mohammad R Nezami Ranjbar Zhong Wu John

More information

Database Searching and BLAST Dannie Durand

Database Searching and BLAST Dannie Durand Computational Genomics and Molecular Biology, Fall 2013 1 Database Searching and BLAST Dannie Durand Tuesday, October 8th Review: Karlin-Altschul Statistics Recall that a Maximal Segment Pair (MSP) is

More information

Evaluating the accuracy of amplicon-based microbiome computational pipelines on simulated human gut microbial communities

Evaluating the accuracy of amplicon-based microbiome computational pipelines on simulated human gut microbial communities Golob et al. BMC Bioinformatics (2017) 18:283 DOI 10.1186/s12859-017-1690-0 RESEARCH ARTICLE Evaluating the accuracy of amplicon-based microbiome computational pipelines on simulated human gut microbial

More information
