At Least 1 in 20 16S rrna Sequence Records Currently Held in Public Repositories Is Estimated To Contain Substantial Anomalies

Size: px
Start display at page:

Download "At Least 1 in 20 16S rrna Sequence Records Currently Held in Public Repositories Is Estimated To Contain Substantial Anomalies"

Transcription

1 APPLIED AND ENVIRONMENTAL MICROBIOLOGY, Dec. 2005, p Vol. 71, No /05/$ doi: /aem Copyright 2005, American Society for Microbiology. All Rights Reserved. At Least 1 in 20 16S rrna Sequence Records Currently Held in Public Repositories Is Estimated To Contain Substantial Anomalies Kevin E. Ashelford, 1 * Nadia A. Chuzhanova, 3 John C. Fry, 1 Antonia J. Jones, 2 and Andrew J. Weightman 1 Cardiff School of Biosciences, Cardiff University, Main Building, Park Place, P.O. Box 915, Cardiff CF10 3TL, United Kingdom 1 ; Cardiff School of Computer Science, Cardiff University, Queen s Buildings, 5 The Parade, Roath, Cardiff CF24 3AA, United Kingdom 2 ; and Biostatistics and Bioinformatics Unit and Institute of Medical Genetics, Cardiff School of Medicine, Cardiff University, Heath Park, Cardiff CF14 4XN, United Kingdom 3 Received 7 April 2005/Accepted 28 July 2005 A new method for detecting chimeras and other anomalies within 16S rrna sequence records is presented. Using this method, we screened 1,399 sequences from 19 phyla, as defined by the Ribosomal Database Project, release 9, update 22, and found 5.0% to harbor substantial errors. Of these, 64.3% were obvious chimeras, 14.3% were unidentified sequencing errors, and 21.4% were highly degenerate. In all, 11 phyla contained obvious chimeras, accounting for 0.8 to 11% of the records for these phyla. Many chimeras (43.1%) were formed from parental sequences belonging to different phyla. While most comprised two fragments, 13.7% were composed of at least three fragments, often from three different sources. A separate analysis of the Bacteroidetes phylum (2,739 sequences) also revealed 5.8% records to be anomalous, of which 65.4% were apparently chimeric. Overall, we conclude that, as a conservative estimate, 1 in every 20 public database records is likely to be corrupt. Our results support concerns recently expressed over the quality of the public repositories. With 16S rrna sequence data increasingly playing a dominant role in bacterial systematics and environmental biodiversity studies, it is vital that steps be taken to improve screening of sequences prior to submission. To this end, we have implemented our method as a program with a simple-to-use graphic user interface that is capable of running on a range of computer platforms. The program is called Pintail, is released under the terms of the GNU General Public License open source license, and is freely available from our website at Downloaded from Analysis of the 16S rrna gene is currently fundamental to an understanding of bacterial taxonomy, phylogeny, and diversity (3, 5). Sequence anomalies, if undetected, can generate misleading impressions of environmental diversity and complicate attempts to reconstruct bacterial evolutionary trees. It is vital, therefore, that public repositories such as those managed by EMBL (9), GenBank (2), and the Ribosomal Database Project (RDP) (3) contain reliable sequences if correct conclusions are to be made within studies that rely on 16S rrna sequence analysis. Unfortunately, corrupt sequences, such as chimeras formed during PCR amplification (12, 14, 15, 20, 21) or anomalies produced by other steps in the sequencing process, have long been present in the public databases. Poor sequencing methodology often produces highly degenerate sequences; these are easy to spot. More insidious are other sequencing errors that cannot be detected by a visual inspection of the sequence alone. Chimeras, sometimes referred to as jumping PCR products, shuffle genes, or in vitro recombination products have been a recognized PCR amplification problem for some time (17), with damage or degradation to the DNA template and * Corresponding author. Mailing address: Cardiff School of Biosciences, Cardiff University, Main Building, Park Place, P.O. Box 915, Cardiff CF10 3TL, United Kingdom. Phone: 44 (0) Fax: 44 (0) ashelford@cardiff.ac.uk. contamination with other templates being likely causes of their formation (14). Chimeras have been shown to occur in PCR-amplified gene libraries with frequencies of up to 30% or more (12, 20, 21) and therefore pose a potentially significant problem. Chimeric anomalies have long been recognized, and several computational methods have been developed over the years to detect and analyze suspect sequences (6, 7, 10, 11, 13, 16). Historically, the RDP s Chimera_Check program (13) has been used most widely, although the more recent Bellerophon program (7) appears to be gaining in popularity. However, existing tools for chimera detection, although often effective, have limitations (8, 11, 16, 21). Also, most of these tools have not been developed into sufficiently accessible computer programs that can be used easily by researchers regardless of computing background. One reason for the widespread use of RDP s Chimera_Check program is that it has a user-friendly interface and is available to anyone with a web browser. Most importantly, the problem of chimeras and other sequence anomalies is still underestimated by the research community. Despite recent papers highlighting the problem, some very obvious anomalies continue to be submitted to sequence repositories. Until the extent of this problem is known, the impetus to improve screening procedures prior to submission and to better curate those that have been submitted is unlikely to come. on April 30, 2018 by guest 7724

2 VOL. 71, 2005 SUBSTANTIAL ANOMALIES IN PUBLIC 16S rrna REPOSITORIES 7725 The aim of the current study was twofold: (i) to develop a 16S rrna sequence anomaly-detecting method currently used in our laboratory into a new software tool that is sufficiently user friendly and reliable to be used easily by as many researchers as possible, and (ii) to use this tool to estimate the true level of sequence corruption within public repositories. To this end, we present our software to the wider community and detail the results from a survey of selected bacterial taxa, as defined by the RDP database. MATERIALS AND METHODS Developing detection method. All software was written in the Java computer language, using Sun s Java software development kit, J2SE SDK (Java Technology [ The final program, called Pintail, was tested on RedHat 9.0 Linux, Microsoft Windows XP, and Apple Mac OS X, version Pintail, along with its source code and help files, is freely available from and is released under the terms of the GNU General Public License ( The program uses ClustalW (19) to generate sequence alignments. Our method works by aligning a query sequence (S q ) with a trusted subject sequence (S s ) and then analyzing differences between query and subject over the entire length of the 16S rrna gene, by employing a sliding window of specified size w progressing a fixed number of bases l at a time along the resulting alignment S qs of length n. The total number of windows will be m n w 1/l, where signifies the ceiling of the enclosed expression, i.e., the smallest whole number greater than or equal to the value of the expression. At the ith window w i (1 i m), the percentage of mismatched bases is calculated, giving rise to an observed percentage difference o i that can be thought of as an uncorrected measure of evolutionary distance between query and subject within w i. The resulting set of observed percentage differences O qs {o i : o 1, o 2,...,o m } when plotted provide a visual representation of the variation in evolutionary distance between S q and S s over the length of the 16S rrna gene. The core algorithm for generating O qs can be summarized as follows. Algorithm 1. (i) Input query sequence S q, the sequence to be checked for anomalies. (ii) Input subject sequence S s, a reliable sequence closely related to the query. (iii) Globally align S q with S s using ClustalW to generate alignment S qs of length n. (iv) By sliding a window of size w with step l along S qs, determine the percentage of mismatched bases o i within window w i as described above and compute the resulting data set O qs {o i : o 1, o 2,..., o m } of the observed percentage differences detected between S q and S s. (v) Plot O qs against base position i to display graphically the changes in evolutionary distance between S q and S s over their mutual length n. Note that the mean of the observed percentage differences, expressed as ( i o i )/m, is essentially a measure of the overall uncorrected evolutionary distance between the two sequences. Although this value will not be exactly the same as that derived by a simple global alignment, for simplicity we will use the term overall evolutionary distance to refer to this mean, as the distinction between the two concepts is irrelevant as far as the rest of the paper is concerned. Expected percentage differences. To assess whether the observed percentage difference plot indicates an anomalous query, a method was developed for predicting expected percentage differences that one might expect if both query and subject were reliable. To generate expected percentage differences E qs {e i : e 1, e 2,..., e m } for any pair of sequences S q and S s, it was necessary to map accurately the hypervariable regions within the 16S rrna gene sequence. This was done as follows. All type strain sequences of 1,200 nucleotides were downloaded from the RDP web site (3) as a single aligned file, with Escherichia coli U00096 included as a reference sequence. At the time of this study, RDP release 9, update 22 (September 2004), was current, with 4,383 full-length type strain sequences available for downloading. We totalled the number of each nucleotide residue r{r: A, C, G, T/U} at each base position j (1 j 1,542) within the RDP aligned type strain sequences, using E. coli U00096 as a reference (hence, 1,542 base positions). From these raw counts, we identified the frequency f j r of the most common residue r at each base position j within the alignment (ignoring gap characters). Note that when position j is most variable, each of the four possible residues is equally likely to occur. By a simple correction, p j ( f j r 0.25)/0.75 relative frequencies were converted into probabilities, and so the entire type strain data set was described by the probability profile P { p j : p 1, p 2,...,p 1,542 }, which reflects the probability of a 16S rrna sequence being conserved at any particular residue position. If p j describes residue conservation at position j, then q j 1 p j describes residue variability at that position. In other words, Q {q j : q 1, q 2,...,q 1564 } is a probability profile that reflects the variability of a 16S rrna sequence at any particular residue position. Thus, profile Q can be used to map accurately the hypervariable regions within the 16S rrna gene. The expected percentage differences E qs can be generated from Q by applying the following algorithm. Algorithm 2. (i) By sliding a window of size w with step l along the probability profile Q, determine the average probability a i for each window w i such that the resulting data set Q av {a i : a 1, a 2,...,a m } is a set of average probabilities that can be related directly to the observed percentage differences data set O qs generated by Algorithm 1. (ii) Define a fitting coefficient as the overall evolutionary distance between query and subject, as defined by ( i o i )/m, divided by the mean of data set Q av. Thus, io i /m i a i /m. (iii) Multiply each element of Q av by to generate the expected percentage differences E qs (i.e., e i a i ). (iv) Plot E qs alongside O qs. Algorithm 2 generates expected percentage differences for any query and subject pair. By plotting the expected values E qs against their observed values O qs generated by algorithm 1, a visual assessment of the quality of sequence S q with respect to sequence S s can be made. In addition, subtracting e i from o i for each position i generates a series of deviations, the standard deviation of which quantifies the overall deviation of O qs from E qs. We refer to this standard deviation as the deviation from expectation (DE) statistic. Thus, DE 1 m o i e i 2. m 1 Calibrating the method. Of the 4,383 type strain sequences from the RDP, 2,361 contained at least one degenerate base. As a means of discarding potentially unreliable records, these degenerate sequences were removed, leaving an RDP aligned data set of 2,022 sequences, plus the E. coli reference. The type strains were then analyzed by applying the following two procedures. Procedure 1. (i) Applying algorithms 1 and 2, each sequence in the data set was compared to each other, resulting in a DE value for each comparison. (ii) All DE values were plotted against their corresponding overall evolutionary distances. (iii) Obvious outlier DE values were identified from the plot. (iv) Sequences responsible for the outlier DE values were then identified. Since each DE value was generated by a pair of sequences, the sequence responsible for the high DE value was identified by using a ranking system that scored sequences according to the number of times they were involved the generation of a DE outlier. Identified sequences were then investigated by applying procedure 2. Procedure 2. (i) A National Center for Biotechnology Information (NCBI) BlastN search (1) was undertaken with each query sequence to identify its nearest neighbors within the public database. (ii) A suitable nearest neighbor was chosen for comparison (labeled the first subject). Sequences originating from different research groups, and hence a different 16S rrna gene library from that which had generated the query, were preferred. (iii) The first subject was compared to the query using the Pintail program, and the output was assessed for evidence of any sequence anomaly. (iv) To confirm the reliability of the first subject, and hence the conclusions drawn, a second nearest neighbor was selected again from a separate study. This second subject was compared to the first subject by using Pintail, and output was checked. (v) Finally, as a final check, the query was compared to the second subject. It can be seen that, ideally, only three comparisons are necessary per query sequence to unambiguously identify an anomaly. In practice, this was not always possible, either because a lack of suitable database entries meant that the only nearest neighbors available were those generated by the same author(s) and thus were probably from the same gene library or because the best available nearest neighbor was only distantly related to the query. Under such circumstances, up to nine nearest neighbors were compared to the query sequence and each other, and the final conclusion was made after assessing the overall trend in the resulting matrix of pairwise comparisons. Where necessary, the NCBI s BLAST 2 SEQUENCES program (bl2seq) (18) was used to resolve uncertainties. Procedures 1 and 2 were applied to the type strain data, and outlier DE values found to be generated by anomalous sequences were excluded from subsequent analysis. The median, upper quartile, and 95, 99, 99.9, and 100% quantiles of the corrected DE plot were then determined for each 1% interval along the x axis of the plot. In this way, the corrected DE plot could be described in terms of a series of quantile plots and could be included within the final Pintail program. Thus, a DE value subsequently generated by Pintail could be compared to DE values previously generated from the type strain comparisons, and conclusions could be drawn as to the likelihood of the new DE value being generated by a pair of nonanomalous sequences.

3 7726 ASHELFORD ET AL. APPL. ENVIRON. MICROBIOL. Downloaded from FIG. 1. Program screenshot illustrating a typical analysis. In this example, query AY (top left) is compared with subject AJ (bottom left), generating a plot of evolutionary distances that demonstrate high similarity between these two sequences at the 5 end only. AY693838, introduced into the NCBI on 30 August 2004, is classified by the RDP as belonging to the proposed new OP11 phylum. AJ551147, in contrast, belongs to the -Proteobacteria genus Janthinobacterium. on April 30, 2018 by guest Testing Pintail with known chimeras. The Pintail program was tested with 50 known bacterial chimeric sequences originally identified by Hugenholtz and Huber (8) and listed in the RDP database, release 9, update 22. A further five archaeal sequences listed by Hugenholtz and Huber (8) but not included on the RDP website were also tested. Each chimera was analyzed by following procedure 2. Screening selected bacterial phyla. Using the RDP s online hierarchy browser, all bacterial phyla containing up to 200 sequence records were downloaded as separate aligned files. For each aligned data set, procedure 1 was applied to identify putatively anomalous sequences. In this screening, outlier DE values were defined as those falling above the 99.9% quantile line calculated from the type strain data. Anomalous sequences identified in this way were checked by procedure 2. Procedure 1 was also applied to the 2,739 almost-complete ( 1,200-nucleotide) sequence records making up the Bacteroidetes phylum as defined by RDP, release 9, update 22. In this much larger single analysis, potentially anomalous sequences were confirmed by application of a simplified version of procedure 2 (i.e., steps i to iii only). RESULTS Implementation of methodology. The development of the methodology described in this paper culminated in the computer program Pintail, the operation of which is now described. Figure 1 shows a screenshot of Pintail, showing the outcome of a typical analysis. The query sequence S q (in this instance, a chimera) was entered into the top-left text box, and the subject sequence S s (a reliable sequence, identified by BlastN as closely related to the query) was entered into the bottom-left

4 VOL. 71, 2005 SUBSTANTIAL ANOMALIES IN PUBLIC 16S rrna REPOSITORIES 7727 FIG. 2. Typical 16S rrna gene sequence comparison plots generated by Pintail (all graphs generated with window size 300 and step size 25). (A to C) Plots between pairs of trusted sequences of increasing evolutionary distance, while D to F show examples where the query sequence is a chimera. Observed percentage differences between sequences are plotted as black lines. Gray lines show the expected percentage differences for the sequence pairs. Light gray shading indicates expected percentage differences 5%. Escherichia coli ATCC 11775T (X80725) is compared to Escherichia vulneris ATCC 33821T (X80734) (A), Pseudomonas aeruginosa LMG 1242T (Z76651) (B), and Aquifex pyrophilus (T) Kol5a (M83548) (C). (D to F) Three typical chimeric patterns. (D) The three-fragment Nitrospira chimeric sequence AY (estimated breakpoints, 340 and 740) is compared to its BLAST identified nearest neighbor, X (E) The three fragment chimeric record U10877 generated from Riemerella anatipestifer (T) ATCC is shown to diverge from the sequence of its nearest neighbor, R. anatipestifer strain 115/02 (AY856450) around E. coli positions 790 to (F) The two-fragment Fusobacteria chimeric sequence AY (estimated breakpoint, 800) is compared to the sequence from its nearest neighbor, AY text box. The results of the analysis are displayed in the panel on the right and show graphically that the query is indeed a chimera with its 3 end phylogenetically more distant from the subject sequence than its 5 end. Figure 2 illustrates in more detail typical graphs generated by the program, with panels A to C showing the output from a reliable query sequence being compared with equally reliable subject sequences of various evolutionary distances. Conversely, panels D to F show typical plots obtained when the query sequence is chimeric. The trends shown in panels D to F are very characteristic of chimeras. Other anomalies, such as missing sequence data or blocks of degenerate bases, are easily recognized from much sharper plot variations, which are particularly noticeable when smaller sampling window sizes are employed. Each graph generated by the program consists of four plots. The plot of observed percentage differences (O qs, shown as a black line in Fig. 2) shows the change in percentage difference between query and subject as the sampling window moves along the alignment. In all examples shown in Fig. 2, a window size w of 300 nucleotides was used, moving along the alignment l for 25 bases at a time. This combination was found to be most suitable for displaying overall trends. Reducing window size to 100 bases supplies more detail and is useful for estimating chimeric breakpoints. The mean of the observed percentage differences displayed by the program is roughly equivalent to the uncorrected evolutionary distance between query and subject. From this mean, the expected percentage differences (E qs ) which might be expected for sequences of this evolutionary distance were calculated. These expected percentage differences are displayed as a second plot line within the program s output graph (Fig. 1) and as gray lines in Fig. 2. Similarly, two further expected lines were

5 7728 ASHELFORD ET AL. APPL. ENVIRON. MICROBIOL. Downloaded from FIG. 3. Illustrating variable regions within the 16S rrna gene and location of chimeric breakpoints. (A) The frequency of occurrence of the most common nucleotide residue at each base position within the 16S rrna gene, as determined from RDP-listed 4,383 type strains, with E. coli U00096 as a reference. These frequencies are measures of variability within the gene. (B) Smoothing the data, by taking the mean frequency within a window of 50 bases, moving one base at a time along the gene, creates the plot shown in panel B. The locations of the hypervariable regions are labeled, with gray bars on the x axis defining these regions as V1 to V9 (the Comparative RNA Web Site [ (C) Histogram of all chimera breakpoints identified in this study and that of Hugenholtz and Huber (8). plotted based on the mean observed percentage differences 5% and represent graphically this level of variation around the expected line as an area shaded light gray (Fig. 2). The expected line (E qs plot) helps to indicate if and where the observed line deviates from what might be expected from reliable sequences with the same overall evolutionary distance as the query and subject. The DE statistic calculated by the program quantifies this deviation. The higher the DE value, the greater will be the departure of the observed data from that expected of trusted sequences. To aid interpretation, the DE statistic is best viewed in the context of reliable query-versussubject comparisons sharing similar evolutionary distances. So the program summarizes the DE values obtained between type strains of the same evolutionary distance as exhibited between query and subject; from this information, the probability that the observed DE value is likely to have been generated by two reliable sequences is inferred (Fig. 1). Development of methodology and testing the underlying assumption. The assumption underlying the method implemented in Pintail is that two reliable (i.e., nonanomalous) 16S rrna sequences of known overall evolutionary distance will vary by roughly the same amount over the length of the gene, allowing for the effects of the hypervariable regions when homologous bases are compared. Given the empirical nature of the methodology, it was necessary to test this assumption. One test was to select pairs of reliable sequences at random, apply the method, and assess the output for any contradiction of our assumption. Figure 2A to C illustrates typical results obtained this way. However, this approach was inevitably limited in scope. To test the assumption more thoroughly and at the same time calibrate our method, we needed to consider a much larger data set of reliable sequences. To do this necessitated finding a way of quantifying our observations so that a more automated checking procedure could be employed. This led to the concept of expected percentage differences and the deviation from expectation statistic, described in Materials and Methods and now considered in more detail below. (i) Expected percentage differences. To generate expected percentage differences for any two sequences, it was necessary to take account of the regions of conservation and variability on April 30, 2018 by guest

6 VOL. 71, 2005 SUBSTANTIAL ANOMALIES IN PUBLIC 16S rrna REPOSITORIES 7729 FIG. 4. DE values generated from type strain data set containing 2,022 16S rrna gene sequences without any degenerate base positions (see text). DE value was generated for each of the 2,043,231 pairwise sequence comparisons and plotted against evolutionary distance between sequences. (A) The data set prior to the removal of the 15 anomalous sequences (see text); (B) the plot after removal; (C) the quantile values used to describe these data and incorporated into the Pintail program as a means of calibration. inherent in the 16S rrna gene and the evolutionary distance represented by sequence dissimilarity between the two sequences. As Fig. 2A to C illustrates, the character of the observed percentage difference plot was informed by both of these concepts. Therefore, we needed to model 16S rrna intragene variability and then use this model to predict expected percentage differences from overall evolutionary distance (as represented by the mean of the observed percentage differences). Type strain sequences, a priori, can be considered reliable in that they will normally have been generated from pure cultures and therefore will have been less prone to the errors common to environmental samples, due to quality and purity of the template. RDP release 9, update 22, contains 4,383 type strain sequences with a length of 1,200 nucleotides. We downloaded all 4,383 records from the RDP website retaining the RDP s alignment, along with a reliable Escherichia coli record (U00096) as a reference sequence. From this, we were able to allocate to each base position in the E. coli reference sequence a frequency for the most common nucleotide residue (A, C, G, or T/U) (Fig. 3A). For example, a position that is occupied by an adenine in all type strain sequences would have a frequency of 1. Conversely, a position where all four bases are equiprobable would have a frequency of Smoothing these data revealed peaks and troughs which corresponded to the known hypervariable and conserved regions for the 16S rrna gene (Fig. 3B), matching peaks and troughs in observed percentage difference plots. Converting these frequencies to a probability profile allocating a probability to each 16S rrna base position created a profile of 16S rrna intragene variability for use in the final program. Expected percentage differences for any two sequences were generated from this profile by multiplying each probability by the fitting coefficient to ensure the resulting data set had the same mean as the observed data. (ii) DE statistic. Subtracting a set of expected values from corresponding observed data points generated a set of error values, the standard deviation of which summarized the extent to which observation deviated from expectation. This is how the DE statistic was derived and used in this study as a way of summarizing any analysis of sequence pairs as a single value. We were now in a position to automate our method and consider a much larger data set of reliable sequences. The 4,383 type strain sequences initially served as the data set; however, since our method detects any sequence anomaly, it quickly became apparent that high levels of type strain degeneracy were hampering our survey and needed to be discounted. Only 2,022 of 4,383 type strain sequences were completely without degenerate base characters. Of the remaining 2,361 sequences, levels of degeneracy as high as 483 bases were detected, although 2,173 had 50 degenerate characters. Further analysis concentrated on the 2,022 degeneracy-free sequences, since these were considered to be least likely to have anomalies. (iii) Calibration. Pairwise comparisons of the 2,022 sequences without degeneracies generated 2,043,231 DE values. Plotting all these against the mean of the observed percentage differences for each comparison (Fig. 4) revealed that most DE values, and hence most comparisons, clustered together. However, a number of outlier clusters quite distinct from the main cluster were also observed (Fig. 4A), and investigation showed the same 15 sequences were responsible for these outliers (Table 1). Application of procedure 2 (Fig. 5) showed 2 of these 15 sequences to be chimeric. Record AJ (classified as Lactobacillus psittaci) is a two-fragment chimera with a 5 end practically identical to that of Lactobacillus jensenii (AF243159) and a 3 end similarly close to that of Lactobacillus vaginalis (AF243154). Record U10877 (classified as Riemerella anatipestifer ATCC 11845T) is a three-fragment chimera with fragments 1 and 3 deriving from a member of the Bacteroidetes and fragment 2 of -Proteobacteria origin (Fig. 2E). It is worth noting here that ATCC 11845T has subsequently been resequenced as record U60101 and that analysis of this record

7 7730 ASHELFORD ET AL. APPL. ENVIRON. MICROBIOL. TABLE 1. Anomalous Bacteria 16S rrna gene sequence records from type strains Accession no. Name shows no anomaly. The remaining 13 sequences contained anomalies most likely to be sequencing errors. Eight originated from the same research group, and all contained some sort of sequencing error in the first 220 to 240 bases at the 5 end. Intriguingly, two of these anomalies were observed when the original 2,022-type strain RDP alignment was used but not when checked with ClustalW. Further investigation by eye confirmed these anomalies to be real, confirming the RDP alignment to be the more accurate than the ClustalW alignment. When the 15 anomalous sequences were removed from the data set, the plotted DE values clustered together as one group (Fig. 4B). Figure 4C shows the same data reduced to a series of quantile plots, which were used to estimate the probability of the query sequence being anomalous, as indicated in Fig. 1. Testing program with known chimeras. We tested our approach with 39 chimeric 16S rrna sequences identified by Hugenholtz and Huber (8) and applied procedure 2 as summarized in Fig. 5. All were confirmed as chimeric by our method. In addition, we found that Hugenholtz and Huber had incorrectly characterized record AF as a two-fragment chimera, whereas our method reveals it to be a three-fragment chimera (Fig. 6). AF sequence up to E. coli position 340 is of Firmicutes origin (closely matching AF323775). Bases from 341 to 1,080 come from an unknown source, the closest match being AF323760, previously identified as from the OP9 phylum (8) but remaining unclassified by the RDP. The remainder of AF derives from the Spirochaetes phylum and closely matches M We also tested an additional 15 chimeras identified by Hugenholtz and Huber and listed within the RDP hierarchy browser but not included in their paper (8). We confirmed 12 to be chimeric. However, we could not find evidence that X84498, AF333535, or AY were chimeric (although Location of anomaly relative to E. coli (bases) Description D17751 Leucobacter komagatae IFO15245T Anomaly near 5 end; likely sequencing error D21342 Microbacterium imperiale IFO 12610T 230 Anomaly at 5 end; likely sequencing error D21344 Microbacterium laevaniformans IFO 14471T Anomaly near 5 end; likely sequencing error AJ Arthrobacter flavus CMS-19Y 1,130 1,420 Anomaly near 3 end; likely sequencing error AJ Nannocystis exedens Na e Anomaly near middle; likely sequencing error D21245 Luteococcus japonicus IFO , Anomaly at 5 end and in middle; likely sequencing errors AF Thermoanaerobacter subterraneus SEBR 7858; LA Anomaly near middle; likely sequencing error D21343 Microbacterium lacticum IFO 14135T Anomaly near 5 end; likely sequencing error Z49116 Halanaerobium saccharolyticum subsp. senegalense 1,320 1,450 Anomaly near 3 end; likely sequencing error DSM 7379 D21339 Microbacterium arborescens IFO 3750T 230 Practically identical to D21342 D21341 Microbacterium dextranolyticum IFO 14592T Anomaly near 5 end; likely sequencing error (only visible with RDP alignment) AB Vibrio rumoiensis S-1 500? Anomaly near 5 end; likely sequencing error (only visible with RDP alignment) D17527 Kineococcus aurantiacus IFO Anomaly near 5 end; likely sequencing error AJ Lactobacillus psittaci 790 Two-fragment chimera with 5 end practically identical to Lactobacillus jensenii (AF243159) and 3 end practically identical to Lactobacillus vaginalis (AF243154) U10877 Riemerella anatipestifer ATCC , 1,130 Three-fragment chimera with middle fragment of -Proteobacteria origin; fragments one and three derive from the same Bacteroidetes origin with AY there was evidence of a possible sequencing anomaly at the extreme 5 end), and a series of comparisons using bl2seq (18) under a range of parameter settings failed to contradict this analysis. Database analysis. The RDP website hierarchy browser (3) classifies 16S rrna sequence records according to the current Bergey s 16S rrna-based classification system (5). We used this facility to obtain aligned sequence files for 19 phyla, amounting to 1,399 records in all. Phyla were selected purely by size, with any phylum containing 200 sequences chosen. Thus, all were selected without prior knowledge of any sequence anomalies. Initial screening by DE value, as detailed in procedure 1, identified 73 putatively anomalous sequences. Application of procedure 2 showed 70 of these 73 to be unambiguously anomalous and distributed within 16 of the 19 phyla (Fig. 7; Table 2). The three false positives all occurred within the Aquificae and were caused by the absence of sufficiently closely related subject sequences for comparison with the query sequences concerned. Of the 70 confirmed anomalies, 45 were clearly chimeric. A further 15 anomalies were highly degenerate. The remaining 10 contained other sequence anomalies, such as that found within the Aquificae record AY268103, the 5 end of which, up to E. coli position 560, was the reverse complement of 16S rrna. The Pintail program identified 22 of the 45 chimeras as derived from parents belonging to different phyla. For example, sequence AF is part Acidobacteria and part Actinobacteria. A further 16 chimeras contained one parent of either unknown (no close record in current database) or unclassified (RDP was unable to classify according to Bergey s

8 VOL. 71, 2005 SUBSTANTIAL ANOMALIES IN PUBLIC 16S rrna REPOSITORIES 7731 FIG. 5. Illustrating procedure 2 for unambiguously confirming a chimeric sequence (all graphs were generated with window size 300 and step size 25). (A) In this example, the query, an Acidobacteria sp. (AF523990), is compared to its nearest neighbor (AF523976) identified by BlastN search, and an anomaly at the 5 end is identified. (B) AF is next compared to its nearest neighbor, AY234512, to confirm that it is reliable. No anomaly is detected. (C) As a final check, AF is compared to AY234512; as expected, the 5 end anomalous feature is seen. (D) To determine whether this anomaly is chimeric, the identified 5 region is excised, a BLAST search is undertaken, and the identified nearest neighbor (in this case Actinobacteria X68459) is compared to AF Again, an anomaly is detected, but this time the reverse of that seen in panel A, clearly indicating our query to be a chimera. (E) Comparing X68459 with its neighbor, AF498683, confirms its reliability, and as expected, (F) comparing the original query with AF generates the same profile as that seen in panel D. The chimeric breakpoint can be estimated by superimposing A on D. classification) origin. Thirteen out of 45 were formed from parents belonging to the same phylum. While most chimeras were composed of two fragments from unrelated source sequences, nine three-fragment chimeras were also detected. A striking example of this is the Fusobacteria sequence AJ with its 5 end originating from a Fusobacterium, the middle region being of Spirochaetes origin, and the 3 end belonging to a member of the Bacteroidetes. Table 2 lists a further 10 anomalous sequences discovered during our investigations but not included in our original 19- phylum data set. All but two are obvious chimeras. One is another example of the 5 end being a reverse complement of the correct sequence. Three of these records were submitted to the public repositories during our study. The Bacteroidetes phylum, as identified by RDP release 9, update 22, was also screened by applying procedure 1 and steps i to iii of procedure 2. Of the 2,739 near-complete sequences checked, 159 (5.8%) were identified as likely anomalies. Of these, 12 were highly degenerate, 104 appeared to be chimeric, 21 contained missing sequence blocks due to assembly errors, and the remainder were miscellaneous anomalies. Chimera breakpoints. Approximate breakpoints for chimeras in this study were determined by analyzing the plots produced by Pintail. Reducing window size to 50 to 100 was most effective in providing sufficient visual detail to make this assessment. Breakpoints were most easily assessed when both parent sequences were identified (e.g., Fig. 5), since their corresponding observed percentage difference plots could easily be superimposed on one another and breakpoints could be identified where the lines crossed. Identified breakpoint positions were combined with values identified by Hugenholtz and Huber (8) and plotted alongside the known hypervariable regions within the 16S rrna gene (Fig. 3C). Most were found to fall between hypervariable re-

9 7732 ASHELFORD ET AL. APPL. ENVIRON. MICROBIOL. FIG. 6. Analysis of the three-fragment chimera AF (all graphs were generated with window size 100 and step size 25). The query is shown compared to AF (A), AF (B), and M88719 (C). gions. Given that variability of each 16S rrna base position can be described in terms of the frequency of the most common residue at that position (Fig. 3A), the overall median and 95% confidence interval notches of these frequencies were In contrast, the median of those frequencies corresponding to breakpoint positions was significantly higher at DISCUSSION It has long been recognized that corrupt sequences are present within the public repositories. What has not been known is how many there may be. Of the 19 phyla studied, 5% of records were found to be corrupt; most of these (78.6%) were chimeras or similarly insidious sequencing errors. Eleven Downloaded from on April 30, 2018 by guest FIG. 7. Distribution of sequence anomalies with the nineteen Bacteria phyla, as defined by the Ribosomal Database Project (3). Numbers in brackets after the phylum (or candidate division) name are the total number of sequences within that phylum present in RDP release 9, update 22, of September 2004.

10 VOL. 71, 2005 SUBSTANTIAL ANOMALIES IN PUBLIC 16S rrna REPOSITORIES 7733 TABLE 2. Anomalous sequences identified by this study Accession no. Phylum Approx. break position relative to E. coli (bases) AY Aquificae 560 PCR or sequencing error with first 465 bases the reverse complement of what they should be AB Aquificae 425 Anomaly at 5 end, though origin unknown; either chimera or sequencing error AF Aquificae 1,080 Two-fragment chimera, with both fragments Aquificae in origin AJ Thermotogae 930 Two-fragment chimera, with 3 end Firmicutes in origin L10662 Thermodesulfobacteria Degenerate sequence; several large blocks of N bases Z15060 Deinococcus-Thermus Degenerate sequence; one large block of N bases X58340 Deinococcus-Thermus Degenerate sequence; several large blocks of N bases AF Nitrospira Degenerate sequence; one large block of N bases AF Nitrospira Degenerate sequence; one large block of N bases L14619 Nitrospira Degenerate sequence; several large blocks of N bases AY Nitrospira 320, 540 Two- or possibly three-fragment chimera, with 3 end of unknown origin AF Nitrospira 250 Sequencing anomaly only visible when RDP alignment used AY Nitrospira 340, 740 Three-fragment chimera, with 5 end -Proteobacteria, middle -Proteobacteria, and 3 end unknown in origin AY Nitrospira 370 Two-fragment chimera, with 3 end unclassified (candidate division OP5 according to NCBI) AF Nitrospira 1,080 Two-fragment chimera, with 3 end unclassified; record now replaced in database AY Nitrospira 700 Two-fragment chimera, with 5 end Firmicutes in origin; record already marked as chimeric in database AY Nitrospira a 790 Two-fragment chimera with 5 end -Proteobacteria in origin AY Nitrospira a 660, 940 Three-fragment chimera derived from two parents, with middle fragment of unclassified origin X86774 Nitrospira 790, 1,220 Three-fragment chimera derived from two parents, with middle fragment -Proteobacteria in origin; already identified as chimera AF Nitrospira 500, 790 Three-fragment chimera, with middle fragment also Nitrospira in origin AB Nitrospira 500 Two-fragment chimera, with 5 end of unknown origin AF Nitrospira 540 Two-fragment chimera, with both fragments Nitrospira in origin AF Nitrospira 760 Two-fragment chimera, with both fragments Nitrospira in origin L22045 Nitrospira Degenerate sequence; one large block of N bases M79383 Nitrospira Degenerate sequence; one large block of N bases Y10652 Chlorobi Degenerate with Ns clustered at 5 and 3 ends, giving superficial appearance of a chimera Y10643 Chlorobi Degenerate with Ns clustered at 5 and 3 ends, giving superficial appearance of a chimera Y10651 Chlorobi Degenerate with Ns clustered at 5 and 3 ends, giving superficial appearance of a chimera Y10647 Chlorobi Degenerate sequence; one large block of N bases and numerous other Ns Y10640 Chlorobi Degenerate with Ns clustered at 5 and 3 ends, giving superficial appearance of a chimera AY Chlamydiae Sequencing anomaly only visible when RDP alignment used AY Chlamydiae Sequencing anomaly only visible when RDP alignment used AB Acidobacteria 940 1,100 Two-fragment chimera, with 3 end of unclassified origin; lack of clear break due to number of degenerate bases AY Acidobacteria 600 1,000 Two-fragment chimera, with 3 end of unclassified origin; no obvious reason for lack of clear break AF Acidobacteria 370 Two-fragment chimera, with 5 end of Actinobacteria origin AJ Acidobacteria 280 Two-fragment chimera, with 5 end of unclassified origin Y07575 Acidobacteria 560 Two-fragment chimera, with 3 end of unknown origin AY Fusobacteria 800 Two-fragment chimera, with 3 end -Proteobacteria in origin. AJ Fusobacteria 930, 1,210 Three-fragment chimera, with 5 end Fusobacteria, middle Spirochaetes, and 3 end Bacteroidetes in origin AJ Fusobacteria 580 Two-fragment chimera, with 3 end of unclassified origin; exact position of break unclear due to lack of full-length subjects AY Fusobacteria 280, 790 Three-fragment chimera, with 5 end ε-proteobacteria, middle Fusobacteria, and 3 end ε-proteobacteria in origin AJ Fusobacteria 150, 930 Two-, possibly three-fragment chimera with 5 end unknown, middle unclassified, and 3 end ε-proteobacteria in origin AY Fusobacteria 1,140 Two-fragment chimera, with 3 end of Spirochaetes origin AF Fusobacteria 350 Two-fragment chimera, with both fragments of Fusobacteria origin AF Fusobacteria 920 Two-fragment chimera, with both fragments of Fusobacteria origin AF Fusobacteria 1,160 Two-fragment chimera, with both fragments of Fusobacteria origin AF Fusobacteria 350 Very similar to AF Z94005 Verrucomicrobia 1,025, 1,150 Region 1,025 1,150 is alien to sequence but no close match found within database; unusual nature of plot suggests sequencing error Details Continued on following page

11 7734 ASHELFORD ET AL. APPL. ENVIRON. MICROBIOL. TABLE 2 Continued Accession no. Phylum Approx. break position relative to E. coli (bases) AJ Verrucomicrobia 550 Two-fragment chimera, with both fragments of Verrucomicrobia origin AJ Verrucomicrobia 920 Two-fragment chimera, with 5 end of unknown origin AF Verrucomicrobia 300 Two-fragment chimera, with 5 end of unclassified origin AJ Verrucomicrobia 590 Two-fragment chimera, with both fragments of Verrucomicrobia origin AB Verrucomicrobia 570 Two-fragment chimera, with 3 end of unknown origin AF Verrucomicrobia 1,080 Two-fragment chimera, with 3 end of -Proteobacteria origin AJ Verrucomicrobia a 1,080 Two-fragment chimera, with 3 end of -Proteobacteria origin AF Gemmatimonadetes Degenerate sequence; one large block of N bases AF Gemmatimonadetes Degenerate sequence; two large blocks of N bases AY Gemmatimonadetes 700 Two-fragment chimera, with both fragments of Gemmatimonadetes origin AY Gemmatimonadetes 900 Two-fragment chimera, with both fragments of Gemmatimonadetes origin; break point uncertain due to quality of available subject sequences AJ Gemmatimonadetes 600, 950 Likely sequencing error AY Gemmatimonadetes 275 Likely sequencing error at 5 end AF OP10 1,100 Two-fragment chimera, with 3 end of probable Bacteroidetes origin AF OP10 260, 970 Likely three fragment chimera with 5 and 3 ends originating from some unknown source AF OP10 260, 970 Same as AF AY OP Two-fragment chimera, with 5 end of -Proteobacteria origin AY OP Two-fragment chimera, with 5 end of ε-proteobacteria origin AJ OP Likely two-fragment chimera, with 3 end of unknown origin AF TM7 900 Two-fragment chimera, with 3 end of unclassified origin AJ TM7 1,090 Two-fragment chimera, with 3 end of Actinobacteria origin AY WS3 380 Two-fragment chimera, with 5 end of Actinobacteria origin AY Dehalococcoides 500, 675 Likely sequencing error AY Dehalococcoides 1,400 Likely sequencing error AY ε-proteobacteria b 320 Two-fragment chimera, with 5 of -Proteobacteria origin; 3 of ε-proteobacteria origin AJ Unclassified Bacteria b 380 Two-fragment chimera, with 5 end of -Proteobacteria origin, 3 end of Chloroflexi origin AY Proteobacteria b 780 Two-fragment chimera, with both fragments of -Proteobacteria origin AY Proteobacteria b 780 Practically identical to AY AJ Unclassified Bacteria b 540 Two-fragment chimera, with 5 of Firmicutes origin and end 3 of Gemmatimonadetes origin AY Unclassified Bacteria b 630 Sequencing error in which first 152 bases are the reverse complement AY Unclassified Bacteria b 520 Two-fragment chimera, with 5 of Bacteroidetes origin, and 3 end of WS3 origin AB Actinobacteria b Likely sequencing error a Added after September release (not included in calculations). b Uncovered during analysis (not included in calculations). of the 19 phyla investigated contained obvious chimeras with chimeric content, ranging from 0.8 to 11.8% of the total. Six phyla contained sequence anomalies presumably generated during sequencing. Five phyla contained records with highly degenerate sequences. In total, 16 of the 19 phyla considered contained some sort of substantial sequence anomaly. Since the 19 selected phyla might not be representative of the full database, a separate analysis of the entire Bacteroidetes phylum was carried out. With 2,739 near-complete 16S rdna sequences, this well-characterized taxon is the fourth-largest phylum currently within the RDP, with half of these records (50.1%) derived from uncultured sources. In all, 5.8% of the Bacteroidetes sequences were anomalous. Excluding degeneracy (7.5%), these anomalies were likely either chimeras (65.4%), assembly errors (13.2%), or other miscellaneous anomalies (13.8%). Extrapolating these results to the public database as a whole this would suggest, at a conservative estimate, 1 in 20 sequences have substantial errors. We believe these figures underestimate the true number of anomalous records, given that Details we concentrated our efforts on uncovering the more obvious sequence anomalies. This study confirms that anomalous sequences continue to be added to the public databases; of the chimeras identified in this study, 27.7% were submitted to the NCBI during 2004 alone (Fig. 8), and 91.5% of these were submitted in the last 5 years. These figures reflect recent interest in many of the phyla considered in this study and the steady yearly increase in sequence submissions generally. They also highlight the ongoing nature of the problem. Indeed, we noted five chimeric additions to the RDP database while our study progressed (two were added to Nitrospira, one was added to Verrucomicrobia, and two were added to the -Proteobacteria, a taxon not otherwise investigated in this study). It is fair to say that many researchers have been insufficiently cognizant of the problem of sequence anomalies within the public databases. This situation is changing, however, as evidenced by the renewed burst of activity in generating software tools for recognizing chimeras. Within the last year or so, three new tools have been introduced (6, 7, 10), presumably driven

12 VOL. 71, 2005 SUBSTANTIAL ANOMALIES IN PUBLIC 16S rrna REPOSITORIES 7735 FIG. 8. First appearance in the NCBI database of the anomalous records identified by this study. by these authors desire, like ours, to screen sequences generated through their own researches. Certainly, our experiences with chimeric sequences within 16S rrna clone libraries led us to develop Pintail. It is important that the extent of sequence anomalies within public repositories is fully realized. The research community s phylogenetic view of the bacterial world is increasingly informed by 16S rrna information (3, 5, 15). At least half of the 53 phyla named in 2003 are currently known only from 16S rrna gene sequences amplified from the environment by PCR (15), and this number is growing (4). It is notable that, of the six proposed new taxa analyzed in this study, four harbored chimeras, some of which were extreme. For example, a third of the OP11 sequence AY derives from a -proteobacterium. Another OP11 sequence, AY218572, is almost half an ε-proteobacterial. The 5 end of WS3 bacterium AY is from the Actinobacteria. In all, 48.9% of identified chimeras were derived from bacteria belonging to different phyla (a particularly striking example being AJ289180, a jumble of Fusobacteria, Spirochaetes, and Bacteroidetes). This figure is undoubtedly an underestimate as, for a further 35.6%, either we could not identify the source (no suitable subject record in the database) or the source was as yet unclassified. Some of these chimeras were so extreme that it is surprising that they have not been detected before. We find this worrying, as our concern is that there are far more subtle chimeras in the database, constructed from close phylogenetic neighbors, that have less chance of being spotted and that could give rise to all sorts of spurious intrataxon clustering errors. Our study also shows that a significant proportion of chimeras were generated from three fragments, often from three separate sources (consider AJ289180, above). Chimeras with more than three fragments may also be possible, since the positions of chimeric breakpoints in conserved regions suggested that there are several areas within the 16S rrna gene where splicing may occur (Fig. 3C). The methodology presented here depends on the type strain 16S rrna database used. Clearly, current type strain sequences are not representative of all members of the Bacteria; our RDP-derived type strain database reflects past cultivation successes and there is a definite slant towards members of the Bacteria of medical interest. Furthermore, as this study shows, the quality of some type strain sequences is not good. Nevertheless, our method was effective over a wide phylogenetic range and could even be applied to Archaea sequences, as analysis of those archaeal chimeras listed in Hugenholtz and Huber s paper (8) proved. Since we used sequence alignments from the RDP database that currently only lists members of the Bacteria, our model and calibration data were constructed from members of this domain only. However, there is no theoretical reason why a more comprehensive model incorporating Archaea sequences could not be created or indeed generate models for specific domains, phyla, or other taxa to improve sensitivity. Note also that although this study concentrated on near-complete 16S rdna sequence records, partial sequences can also be analyzed by Pintail in the same manner (although for very short partial sequences, a smaller sampling window will be necessary to give meaningful results). DE values generated from type strain data, once anomalous sequences were removed, proved useful in calibrating our method; that is, placing observed DE values in the context of sequences identified as reliable. This raises the possibility of screening database records on a much larger scale than that tackled in this study. How should the research community tackle the problem of monitoring anomalous sequences in databases? Curators have a role to play. For example, we found three chimeras within the NCBI, labeled as such, yet not similarly flagged within the RDP database (although this is an understandable omission, given the RDP s automated nature). But the practicalities of current database management are such that the curators contribution must be limited. Primary responsibility must and indeed should lie with researchers submitting sequences. To this end, software tools must be available and used by researchers to assist in screening PCRgenerated sequences for anomalies before database deposition. Chimera_Check (13) and Bellerophon (7) are currently the programs most commonly used for detecting chimeric anomalies. Both require a database of sequences to be used, in addition to the query sequence, a requirement that can be both time consuming to prepare and prone to error. It is hoped that Pintail s simpler requirements, along with its user-friendly interface and its ability to run on all major computer platforms, will encourage greater screening of sequence data before and after submission to the public repositories. Unless chimeras and other anomalous sequences can be eliminated from public databases, microbial ecologists will have an erroneous picture of natural prokaryotic biodiversity. ACKNOWLEDGMENT This study was supported by grant BBS/B/11494 from the Biotechnology and Biological Sciences Research Council (BBSRC).

MicroSEQ Rapid Microbial Identification System

MicroSEQ Rapid Microbial Identification System MicroSEQ Rapid Microbial Identification System Giving you complete control over microbial identification using the gold-standard genotypic method The MicroSEQ ID microbial identification system, based

More information

An Investigation of Palindromic Sequences in the Pseudomonas fluorescens SBW25 Genome Bachelor of Science Honors Thesis

An Investigation of Palindromic Sequences in the Pseudomonas fluorescens SBW25 Genome Bachelor of Science Honors Thesis An Investigation of Palindromic Sequences in the Pseudomonas fluorescens SBW25 Genome Bachelor of Science Honors Thesis Lina L. Faller Department of Computer Science University of New Hampshire June 2008

More information

Infectious Disease Omics

Infectious Disease Omics Infectious Disease Omics Metagenomics Ernest Diez Benavente LSHTM ernest.diezbenavente@lshtm.ac.uk Course outline What is metagenomics? In situ, culture-free genomic characterization of the taxonomic and

More information

INTRODUCTION A clear cultivation bias exists in microbial phylogenetics. As of 2010, half of all

INTRODUCTION A clear cultivation bias exists in microbial phylogenetics. As of 2010, half of all Rachel L. Harris 1 Insights into the phylogeny and coding potential of microbial dark matter: a replication of phylogenetic anchoring methods described by Rinke et al., 2013 Rachel Harris, David Zhao,

More information

Introduction to OTU Clustering. Susan Huse August 4, 2016

Introduction to OTU Clustering. Susan Huse August 4, 2016 Introduction to OTU Clustering Susan Huse August 4, 2016 What is an OTU? Operational Taxonomic Units a.k.a. phylotypes a.k.a. clusters aggregations of reads based only on sequence similarity, independent

More information

The right answer, every time

The right answer, every time MICROSEQ MICROBIAL IDENTIFICATION SYSTEM The right answer, every time Definitive identification of bacterial and fungal isolates. Accuracy adds confidence. When microbial contamination poses problems,

More information

Supplemental Materials. DNA preparation. Dehalogenimonas lykanthroporepellens strain BL-DC-9 T (=ATCC

Supplemental Materials. DNA preparation. Dehalogenimonas lykanthroporepellens strain BL-DC-9 T (=ATCC Supplemental Materials DNA preparation. Dehalogenimonas lykanthroporepellens strain BL-DC-9 T (=ATCC BAA-1523 = JCM 15061) was grown in defined basal medium amended with 0.5 mm 1,1,2- trichloroethane (1,1,2-TCA)

More information

Carl Woese. Used 16S rrna to developed a method to Identify any bacterium, and discovered a novel domain of life

Carl Woese. Used 16S rrna to developed a method to Identify any bacterium, and discovered a novel domain of life METAGENOMICS Carl Woese Used 16S rrna to developed a method to Identify any bacterium, and discovered a novel domain of life His amazing discovery, coupled with his solitary behaviour, made many contemporary

More information

Biotechnology Explorer

Biotechnology Explorer Biotechnology Explorer C. elegans Behavior Kit Bioinformatics Supplement explorer.bio-rad.com Catalog #166-5120EDU This kit contains temperature-sensitive reagents. Open immediately and see individual

More information

Assigning Sequences to Taxa CMSC828G

Assigning Sequences to Taxa CMSC828G Assigning Sequences to Taxa CMSC828G Outline Objective (1 slide) MEGAN (17 slides) SAP (33 slides) Conclusion (1 slide) Objective Given an unknown, environmental DNA sequence: Make a taxonomic assignment

More information

Why learn sequence database searching? Searching Molecular Databases with BLAST

Why learn sequence database searching? Searching Molecular Databases with BLAST Why learn sequence database searching? Searching Molecular Databases with BLAST What have I cloned? Is this really!my gene"? Basic Local Alignment Search Tool How BLAST works Interpreting search results

More information

Identification of the major bacterial groups in the digestive tract of cod and haddock

Identification of the major bacterial groups in the digestive tract of cod and haddock Identification of the major bacterial groups in the digestive tract of cod and haddock Neil McEwan Institute of Biological, Environmental and Rural Sciences, Llanbadarn Campus, Aberystwyth University,

More information

An introduction into 16S rrna gene sequencing analysis. Stefan Boers

An introduction into 16S rrna gene sequencing analysis. Stefan Boers An introduction into 16S rrna gene sequencing analysis Stefan Boers Microbiome, microbiota or metagenomics? Microbiome The entire habitat, including the microorganisms, their genomes (i.e., genes) and

More information

Online Student Guide Types of Control Charts

Online Student Guide Types of Control Charts Online Student Guide Types of Control Charts OpusWorks 2016, All Rights Reserved 1 Table of Contents LEARNING OBJECTIVES... 4 INTRODUCTION... 4 DETECTION VS. PREVENTION... 5 CONTROL CHART UTILIZATION...

More information

Quality Control Assessment in Genotyping Console

Quality Control Assessment in Genotyping Console Quality Control Assessment in Genotyping Console Introduction Prior to the release of Genotyping Console (GTC) 2.1, quality control (QC) assessment of the SNP Array 6.0 assay was performed using the Dynamic

More information

Introduction to taxonomic analysis of metagenomic amplicon and shotgun data with QIIME. Peter Sterk EBI Metagenomics Course 2014

Introduction to taxonomic analysis of metagenomic amplicon and shotgun data with QIIME. Peter Sterk EBI Metagenomics Course 2014 Introduction to taxonomic analysis of metagenomic amplicon and shotgun data with QIIME Peter Sterk EBI Metagenomics Course 2014 1 Taxonomic analysis using next-generation sequencing Objective we want to

More information

THE LEAD PROFILE AND OTHER NON-PARAMETRIC TOOLS TO EVALUATE SURVEY SERIES AS LEADING INDICATORS

THE LEAD PROFILE AND OTHER NON-PARAMETRIC TOOLS TO EVALUATE SURVEY SERIES AS LEADING INDICATORS THE LEAD PROFILE AND OTHER NON-PARAMETRIC TOOLS TO EVALUATE SURVEY SERIES AS LEADING INDICATORS Anirvan Banerji New York 24th CIRET Conference Wellington, New Zealand March 17-20, 1999 Geoffrey H. Moore,

More information

Microbial Diversity and Assessment (III) Spring, 2007 Guangyi Wang, Ph.D. POST103B

Microbial Diversity and Assessment (III) Spring, 2007 Guangyi Wang, Ph.D. POST103B Microbial Diversity and Assessment (III) Spring, 2007 Guangyi Wang, Ph.D. POST103B guangyi@hawaii.edu http://www.soest.hawaii.edu/marinefungi/ocn403webpage.htm Overview of Last Lecture Taxonomy (three

More information

Basic Bioinformatics: Homology, Sequence Alignment,

Basic Bioinformatics: Homology, Sequence Alignment, Basic Bioinformatics: Homology, Sequence Alignment, and BLAST William S. Sanders Institute for Genomics, Biocomputing, and Biotechnology (IGBB) High Performance Computing Collaboratory (HPC 2 ) Mississippi

More information

Files for this Tutorial: All files needed for this tutorial are compressed into a single archive: [BLAST_Intro.tar.gz]

Files for this Tutorial: All files needed for this tutorial are compressed into a single archive: [BLAST_Intro.tar.gz] BLAST Exercise: Detecting and Interpreting Genetic Homology Adapted by W. Leung and SCR Elgin from Detecting and Interpreting Genetic Homology by Dr. J. Buhler Prequisites: None Resources: The BLAST web

More information

Bioinformatic tools for metagenomic data analysis

Bioinformatic tools for metagenomic data analysis Bioinformatic tools for metagenomic data analysis MEGAN - blast-based tool for exploring taxonomic content MG-RAST (SEED, FIG) - rapid annotation of metagenomic data, phylogenetic classification and metabolic

More information

Providing clear solutions to microbiological challenges TM. cgmp/iso CLIA. Polyphasic Microbial Identification & DNA Fingerprinting

Providing clear solutions to microbiological challenges TM. cgmp/iso CLIA. Polyphasic Microbial Identification & DNA Fingerprinting Providing clear solutions to microbiological challenges TM Cert. No. 2254.01 Polyphasic Microbial Identification & DNA Fingerprinting Microbial Contamination Tracking & Trending cgmp/iso-17025-2005 CLIA

More information

a-dB. Code assigned:

a-dB. Code assigned: This form should be used for all taxonomic proposals. Please complete all those modules that are applicable (and then delete the unwanted sections). For guidance, see the notes written in blue and the

More information

Comparative Bioinformatics. BSCI348S Fall 2003 Midterm 1

Comparative Bioinformatics. BSCI348S Fall 2003 Midterm 1 BSCI348S Fall 2003 Midterm 1 Multiple Choice: select the single best answer to the question or completion of the phrase. (5 points each) 1. The field of bioinformatics a. uses biomimetic algorithms to

More information

Dynamic Programming Algorithms

Dynamic Programming Algorithms Dynamic Programming Algorithms Sequence alignments, scores, and significance Lucy Skrabanek ICB, WMC February 7, 212 Sequence alignment Compare two (or more) sequences to: Find regions of conservation

More information

Alignment-free d2 oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences

Alignment-free d2 oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences Published online 28 November 2016 Nucleic Acids Research, 2017, Vol. 45, No. 1 39 53 doi: 10.1093/nar/gkw1002 Alignment-free d2 oligonucleotide frequency dissimilarity measure improves prediction of hosts

More information

SAMPLE LITERATURE Please refer to included weblink for correct version.

SAMPLE LITERATURE Please refer to included weblink for correct version. Edvo-Kit #340 DNA Informatics Experiment Objective: In this experiment, students will explore the popular bioninformatics tool BLAST. First they will read sequences from autoradiographs of automated gel

More information

cgmp/iso CLIA Experience Unsurpassed Quality

cgmp/iso CLIA Experience Unsurpassed Quality Cert. No. 2254.01 cgmp/iso-17025-2005 CLIA Experience Unsurpassed Quality Polyphasic Microbial Identification & DNA Fingerprinting Microbial Contamination Tracking & Trending Microbial Identification

More information

Assessing and Improving Methods Used in Operational Taxonomic Unit-Based Approaches for 16S rrna Gene Sequence Analysis

Assessing and Improving Methods Used in Operational Taxonomic Unit-Based Approaches for 16S rrna Gene Sequence Analysis APPLIED AND ENVIRONMENTAL MICROBIOLOGY, May 2011, p. 3219 3226 Vol. 77, No. 10 0099-2240/11/$12.00 doi:10.1128/aem.02810-10 Copyright 2011, American Society for Microbiology. All Rights Reserved. Assessing

More information

Microbiome Analysis. Research Day 2012 Ranjit Kumar

Microbiome Analysis. Research Day 2012 Ranjit Kumar Microbiome Analysis Research Day 2012 Ranjit Kumar Human Microbiome Microorganisms Bad or good? Human colon contains up to 100 trillion bacteria. Human microbiome - The community of bacteria that live

More information

Phylogenetic methods for taxonomic profiling

Phylogenetic methods for taxonomic profiling Phylogenetic methods for taxonomic profiling Siavash Mirarab University of California at San Diego (UCSD) Joint work with Tandy Warnow, Nam-Phuong Nguyen, Mike Nute, Mihai Pop, and Bo Liu Phylogeny reconstruction

More information

Outliers and Their Effect on Distribution Assessment

Outliers and Their Effect on Distribution Assessment Outliers and Their Effect on Distribution Assessment Larry Bartkus September 2016 Topics of Discussion What is an Outlier? (Definition) How can we use outliers Analysis of Outlying Observation The Standards

More information

Exploring Microbial Diversity and Taxonomy Using SSU rrna Hypervariable Tag Sequencing

Exploring Microbial Diversity and Taxonomy Using SSU rrna Hypervariable Tag Sequencing Exploring Microbial Diversity and Taxonomy Using SSU rrna Hypervariable Tag Sequencing Susan M. Huse 1, Les Dethlefsen 2, Julie A. Huber 1, David Mark Welch 1, David A. Relman 2,3,4, Mitchell L. Sogin

More information

NCBI web resources I: databases and Entrez

NCBI web resources I: databases and Entrez NCBI web resources I: databases and Entrez Yanbin Yin Most materials are downloaded from ftp://ftp.ncbi.nih.gov/pub/education/ 1 Homework assignment 1 Two parts: Extract the gene IDs reported in table

More information

ELE4120 Bioinformatics. Tutorial 5

ELE4120 Bioinformatics. Tutorial 5 ELE4120 Bioinformatics Tutorial 5 1 1. Database Content GenBank RefSeq TPA UniProt 2. Database Searches 2 Databases A common situation for alignment is to search through a database to retrieve the similar

More information

Analysis of Microarray Data

Analysis of Microarray Data Analysis of Microarray Data Lecture 3: Visualization and Functional Analysis George Bell, Ph.D. Senior Bioinformatics Scientist Bioinformatics and Research Computing Whitehead Institute Outline Review

More information

Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome

Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome Ruth Howe Bio 434W 27 February 2010 Abstract The fourth or dot chromosome of Drosophila species is composed primarily of highly condensed,

More information

Examination Assignments

Examination Assignments Bioinformatics Institute of India H-109, Ground Floor, Sector-63, Noida-201307, UP. INDIA Tel.: 0120-4320801 / 02, M. 09818473366, 09810535368 Email: info@bii.in, Website: www.bii.in INDUSTRY PROGRAM IN

More information

Exploring Similarities of Conserved Domains/Motifs

Exploring Similarities of Conserved Domains/Motifs Exploring Similarities of Conserved Domains/Motifs Sotiria Palioura Abstract Traditionally, proteins are represented as amino acid sequences. There are, though, other (potentially more exciting) representations;

More information

Quantitative Genetics

Quantitative Genetics Quantitative Genetics Polygenic traits Quantitative Genetics 1. Controlled by several to many genes 2. Continuous variation more variation not as easily characterized into classes; individuals fall into

More information

The String Alignment Problem. Comparative Sequence Sizes. The String Alignment Problem. The String Alignment Problem.

The String Alignment Problem. Comparative Sequence Sizes. The String Alignment Problem. The String Alignment Problem. Dec-82 Oct-84 Aug-86 Jun-88 Apr-90 Feb-92 Nov-93 Sep-95 Jul-97 May-99 Mar-01 Jan-03 Nov-04 Sep-06 Jul-08 May-10 Mar-12 Growth of GenBank 160,000,000,000 180,000,000 Introduction to Bioinformatics Iosif

More information

Creation of a PAM matrix

Creation of a PAM matrix Rationale for substitution matrices Substitution matrices are a way of keeping track of the structural, physical and chemical properties of the amino acids in proteins, in such a fashion that less detrimental

More information

Risk Management User Guide

Risk Management User Guide Risk Management User Guide Version 17 December 2017 Contents About This Guide... 5 Risk Overview... 5 Creating Projects for Risk Management... 5 Project Templates Overview... 5 Add a Project Template...

More information

THE RELIABILITY AND ACCURACY OF REMNANT LIFE PREDICTIONS IN HIGH PRESSURE STEAM PLANT

THE RELIABILITY AND ACCURACY OF REMNANT LIFE PREDICTIONS IN HIGH PRESSURE STEAM PLANT THE RELIABILITY AND ACCURACY OF REMNANT LIFE PREDICTIONS IN HIGH PRESSURE STEAM PLANT Ian Chambers Safety and Reliability Division, Mott MacDonald Studies have been carried out to show how failure probabilities

More information

Chimp Sequence Annotation: Region 2_3

Chimp Sequence Annotation: Region 2_3 Chimp Sequence Annotation: Region 2_3 Jeff Howenstein March 30, 2007 BIO434W Genomics 1 Introduction We received region 2_3 of the ChimpChunk sequence, and the first step we performed was to run RepeatMasker

More information

Theory and Application of Multiple Sequence Alignments

Theory and Application of Multiple Sequence Alignments Theory and Application of Multiple Sequence Alignments a.k.a What is a Multiple Sequence Alignment, How to Make One, and What to Do With It Brett Pickett, PhD History Structure of DNA discovered (1953)

More information

A new multiplatform computer program for numerical identification of microorganisms.

A new multiplatform computer program for numerical identification of microorganisms. JCM Accepts, published online ahead of print on 14 October 2009 J. Clin. Microbiol. doi:10.1128/jcm.01250-09 Copyright 2009, American Society for Microbiology and/or the Listed Authors/Institutions. All

More information

RiboPrinter microbial characterization system

RiboPrinter microbial characterization system DuPont Qualicon RiboPrinter microbial characterization system Staphylococcus epidermidis RiboGroup 102-45-3 DEDICATED TO MICROBIOLOGY OK. You found it. Now what? RiboPrinter microbial characterization

More information

Temporal and spatial diversity of the tap water microbiota in a. Norwegian hospital

Temporal and spatial diversity of the tap water microbiota in a. Norwegian hospital AEM Accepts, published online ahead of print on 16 October 2009 Appl. Environ. Microbiol. doi:10.1128/aem.01174-09 Copyright 2009, American Society for Microbiology and/or the Listed Authors/Institutions.

More information

Outline. Evolution. Adaptive convergence. Common similarity problems. Chapter 7: Similarity searches on sequence databases

Outline. Evolution. Adaptive convergence. Common similarity problems. Chapter 7: Similarity searches on sequence databases Chapter 7: Similarity searches on sequence databases All science is either physics or stamp collection. Ernest Rutherford Outline Why is similarity important BLAST Protein and DNA Interpreting BLAST Individualizing

More information

ISOLATION AND MOLECULAR CHARACTERIZATION OF THERMOTOLERANT BACTERIA FROM MANIKARAN THERMAL SPRING OF HIMACHAL PRADESH, INDIA ABSTRACT

ISOLATION AND MOLECULAR CHARACTERIZATION OF THERMOTOLERANT BACTERIA FROM MANIKARAN THERMAL SPRING OF HIMACHAL PRADESH, INDIA ABSTRACT ISOLATION AND MOLECULAR CHARACTERIZATION OF THERMOTOLERANT BACTERIA FROM MANIKARAN THERMAL SPRING OF HIMACHAL PRADESH, INDIA Ambika Verma 1*, & Poonam Shirkot 2 1*Ph.D Scholar, Department of Biotechnology,

More information

Gene-centered resources at NCBI

Gene-centered resources at NCBI COURSE OF BIOINFORMATICS a.a. 2014-2015 Gene-centered resources at NCBI We searched Accession Number: M60495 AT NCBI Nucleotide Gene has been implemented at NCBI to organize information about genes, serving

More information

HOW TO ELEVATE SALES TEAM PERFORMANCE: APPLYING THE CSO INSIGHTS SALES RELATIONSHIP/PROCESS MATRIX. A whitepaper from

HOW TO ELEVATE SALES TEAM PERFORMANCE: APPLYING THE CSO INSIGHTS SALES RELATIONSHIP/PROCESS MATRIX. A whitepaper from HOW TO ELEVATE SALES TEAM PERFORMANCE: APPLYING THE CSO INSIGHTS SALES RELATIONSHIP/PROCESS MATRIX A whitepaper from and LEARNING THE LEVELS CSO Insights first presented our Sales Relationship/Process

More information

Pareto Charts [04-25] Finding and Displaying Critical Categories

Pareto Charts [04-25] Finding and Displaying Critical Categories Introduction Pareto Charts [04-25] Finding and Displaying Critical Categories Introduction Pareto Charts are a very simple way to graphically show a priority breakdown among categories along some dimension/measure

More information

Product Applications for the Sequence Analysis Collection

Product Applications for the Sequence Analysis Collection Product Applications for the Sequence Analysis Collection Pipeline Pilot Contents Introduction... 1 Pipeline Pilot and Bioinformatics... 2 Sequence Searching with Profile HMM...2 Integrating Data in a

More information

Novel bacterial taxa in the human microbiome

Novel bacterial taxa in the human microbiome Washington University School of Medicine Digital Commons@Becker Open Access Publications 2012 Novel bacterial taxa in the human microbiome Kristine M. Wylie Washington University School of Medicine in

More information

Chapter 17. PCR the polymerase chain reaction and its many uses. Prepared by Woojoo Choi

Chapter 17. PCR the polymerase chain reaction and its many uses. Prepared by Woojoo Choi Chapter 17. PCR the polymerase chain reaction and its many uses Prepared by Woojoo Choi Polymerase chain reaction 1) Polymerase chain reaction (PCR): artificial amplification of a DNA sequence by repeated

More information

Leonardo Mariño-Ramírez, PhD NCBI / NLM / NIH. BIOL 7210 A Computational Genomics 2/18/2015

Leonardo Mariño-Ramírez, PhD NCBI / NLM / NIH. BIOL 7210 A Computational Genomics 2/18/2015 Leonardo Mariño-Ramírez, PhD NCBI / NLM / NIH BIOL 7210 A Computational Genomics 2/18/2015 The $1,000 genome is here! http://www.illumina.com/systems/hiseq-x-sequencing-system.ilmn Bioinformatics bottleneck

More information

Individual Charts Done Right and Wrong

Individual Charts Done Right and Wrong Quality Digest Daily, Feb. 2, 2010 Manuscript No. 206 Individual Charts Done Right and Wrong How does your software rate? In my column of January 7 I looked at the problems in computing limits for Average

More information

Quantitative PCR Analysis of Meat Speciation Data Using the 2 - Cq Method

Quantitative PCR Analysis of Meat Speciation Data Using the 2 - Cq Method Quantitative PCR Analysis of Meat Speciation Data Using the 2 - Cq Method Abstract In recent years the food production industry has been placed under intense scrutiny. This has led to an increased pressure

More information

Chimeric 16S rrna Sequence Formation and Detection in Sanger and 454- Pyrosequenced PCR Amplicons

Chimeric 16S rrna Sequence Formation and Detection in Sanger and 454- Pyrosequenced PCR Amplicons Chimeric 16S rrna Sequence Formation and Detection in Sanger and 454- Pyrosequenced PCR Amplicons Brian J. Haas 1, Dirk Gevers 1, Ashlee M. Earl 1, Mike Feldgarden 1, Doyle V. Ward 1, Georgia Giannoukos

More information

J. Eco. Heal. Env. 4, No. 1, 1-5 (2016) 1

J. Eco. Heal. Env. 4, No. 1, 1-5 (2016) 1 J. Eco. Heal. Env. 4, No. 1, 1-5 (2016) 1 Journal of Ecology of Health & Environment An International Journal http://dx.doi.org/10.18576/jehe/040101 Characterization and Molecular Identification of Unknown

More information

Sequence Databases and database scanning

Sequence Databases and database scanning Sequence Databases and database scanning Marjolein Thunnissen Lund, 2012 Types of databases: Primary sequence databases (proteins and nucleic acids). Composite protein sequence databases. Secondary databases.

More information

TERTIARY MOTIF INTERACTIONS ON RNA STRUCTURE

TERTIARY MOTIF INTERACTIONS ON RNA STRUCTURE 1 TERTIARY MOTIF INTERACTIONS ON RNA STRUCTURE Bioinformatics Senior Project Wasay Hussain Spring 2009 Overview of RNA 2 The central Dogma of Molecular biology is DNA RNA Proteins The RNA (Ribonucleic

More information

IT portfolio management template User guide

IT portfolio management template User guide IBM Rational Focal Point IT portfolio management template User guide IBM Software Group IBM Rational Focal Point IT Portfolio Management Template User Guide 2011 IBM Corporation Copyright IBM Corporation

More information

Sequence Analysis Lab Protocol

Sequence Analysis Lab Protocol Sequence Analysis Lab Protocol You will need this handout of instructions The sequence of your plasmid from the ABI The Accession number for Lambda DNA J02459 The Accession number for puc 18 is L09136

More information

Chief Executive Officers and Compliance Officers of All National Banks, Department and Division Heads, and All Examining Personnel

Chief Executive Officers and Compliance Officers of All National Banks, Department and Division Heads, and All Examining Personnel O OCC 2000 16 OCC BULLETIN Comptroller of the Currency Administrator of National Banks Subject: Risk Modeling Description: Model Validation TO: Chief Executive Officers and Compliance Officers of All National

More information

Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory

Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory Title SNP-VISTA: An Interactive SNPs Visualization Tool Permalink https://escholarship.org/uc/item/74z965bg Authors Shah, Nameeta

More information

Introduction to Molecular Biology

Introduction to Molecular Biology Introduction to Molecular Biology Bioinformatics: Issues and Algorithms CSE 308-408 Fall 2007 Lecture 2-1- Important points to remember We will study: Problems from bioinformatics. Algorithms used to solve

More information

BIM QUICKSCAN: BENCHMARK OF BIM PERFORMANCE IN THE NETHERLANDS

BIM QUICKSCAN: BENCHMARK OF BIM PERFORMANCE IN THE NETHERLANDS BIM QUICKSCAN: BENCHMARK OF BIM PERFORMANCE IN THE NETHERLANDS Léon van Berlo, MSc, Researcher, leon.vanberlo@tno.nl Tim Dijkmans, MSc, Researcher, tim.dijkmans@tno.nl Netherlands organization for Applied

More information

The Impact of Agile. Quantified.

The Impact of Agile. Quantified. The Impact of Agile. Quantified. Agile and lean are built on a foundation of continuous improvement: You need to inspect, learn from and adapt your performance to keep improving. Enhancing performance

More information

Probes can be designed in an evolutionary hierarchy

Probes can be designed in an evolutionary hierarchy Probes can be designed in an evolutionary hierarchy Probes can be designed to be highly redundant to increase the certainly of identification The match between clone counts and hybridization intensity

More information

Ediscovery White Paper US. The Ultimate Predictive Coding Handbook. A comprehensive guide to predictive coding fundamentals and methods.

Ediscovery White Paper US. The Ultimate Predictive Coding Handbook. A comprehensive guide to predictive coding fundamentals and methods. Ediscovery White Paper US The Ultimate Predictive Coding Handbook A comprehensive guide to predictive coding fundamentals and methods. 2 The Ultimate Predictive Coding Handbook by KLDiscovery Copyright

More information

Antisense RNA Insert Design for Plasmid Construction to Knockdown Target Gene Expression

Antisense RNA Insert Design for Plasmid Construction to Knockdown Target Gene Expression Vol. 1:7-15 Antisense RNA Insert Design for Plasmid Construction to Knockdown Target Gene Expression Ji, Tom, Lu, Aneka, Wu, Kaylee Department of Microbiology and Immunology, University of British Columbia

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Contents Cell biology Organisms and cells Building blocks of cells How genes encode proteins? Bioinformatics What is bioinformatics? Practical applications Tools and databases

More information

SPECIAL CONTROL CHARTS

SPECIAL CONTROL CHARTS INDUSTIAL ENGINEEING APPLICATIONS AND PACTICES: USES ENCYCLOPEDIA SPECIAL CONTOL CHATS A. Sermet Anagun, PhD STATEMENT OF THE POBLEM Statistical Process Control (SPC) is a powerful collection of problem-solving

More information

RNA Sequencing Analyses & Mapping Uncertainty

RNA Sequencing Analyses & Mapping Uncertainty RNA Sequencing Analyses & Mapping Uncertainty Adam McDermaid 1/26 RNA-seq Pipelines Collection of tools for analyzing raw RNA-seq data Tier 1 Quality Check Data Trimming Tier 2 Read Alignment Assembly

More information

I nternet Resources for Bioinformatics Data and Tools

I nternet Resources for Bioinformatics Data and Tools ~i;;;;;;;'s :.. ~,;;%.: ;!,;s163 ~. s :s163:: ~s ;'.:'. 3;3 ~,: S;I:;~.3;3'/////, IS~I'//. i: ~s '/, Z I;~;I; :;;; :;I~Z;I~,;'//.;;;;;I'/,;:, :;:;/,;'L;;;~;'~;~,::,:, Z'LZ:..;;',;';4...;,;',~/,~:...;/,;:'.::.

More information

Incorporating Molecular ID Technology. Accel-NGS 2S MID Indexing Kits

Incorporating Molecular ID Technology. Accel-NGS 2S MID Indexing Kits Incorporating Molecular ID Technology Accel-NGS 2S MID Indexing Kits Molecular Identifiers (MIDs) MIDs are indices used to label unique library molecules MIDs can assess duplicate molecules in sequencing

More information

GENETIC ALGORITHMS. Narra Priyanka. K.Naga Sowjanya. Vasavi College of Engineering. Ibrahimbahg,Hyderabad.

GENETIC ALGORITHMS. Narra Priyanka. K.Naga Sowjanya. Vasavi College of Engineering. Ibrahimbahg,Hyderabad. GENETIC ALGORITHMS Narra Priyanka K.Naga Sowjanya Vasavi College of Engineering. Ibrahimbahg,Hyderabad mynameissowji@yahoo.com priyankanarra@yahoo.com Abstract Genetic algorithms are a part of evolutionary

More information

From AP investigative Laboratory Manual 1

From AP investigative Laboratory Manual 1 Comparing DNA Sequences to Understand Evolutionary Relationships. How can bioinformatics be used as a tool to determine evolutionary relationships and to better understand genetic diseases? BACKGROUND

More information

Life Cycle Assessment A product-oriented method for sustainability analysis. UNEP LCA Training Kit Module f Interpretation 1

Life Cycle Assessment A product-oriented method for sustainability analysis. UNEP LCA Training Kit Module f Interpretation 1 Life Cycle Assessment A product-oriented method for sustainability analysis UNEP LCA Training Kit Module f Interpretation 1 ISO 14040 framework Life cycle assessment framework Goal and scope definition

More information

Genome Annotation Genome annotation What is the function of each part of the genome? Where are the genes? What is the mrna sequence (transcription, splicing) What is the protein sequence? What does

More information

Assessment of the Capability Review programme

Assessment of the Capability Review programme CABINET OFFICE Assessment of the Capability Review programme LONDON: The Stationery Office 14.35 Ordered by the House of Commons to be printed on 2 February 2009 REPORT BY THE COMPTROLLER AND AUDITOR GENERAL

More information

ALGORITHMS IN BIO INFORMATICS. Chapman & Hall/CRC Mathematical and Computational Biology Series A PRACTICAL INTRODUCTION. CRC Press WING-KIN SUNG

ALGORITHMS IN BIO INFORMATICS. Chapman & Hall/CRC Mathematical and Computational Biology Series A PRACTICAL INTRODUCTION. CRC Press WING-KIN SUNG Chapman & Hall/CRC Mathematical and Computational Biology Series ALGORITHMS IN BIO INFORMATICS A PRACTICAL INTRODUCTION WING-KIN SUNG CRC Press Taylor & Francis Group Boca Raton London New York CRC Press

More information

Sequence Based Function Annotation. Qi Sun Bioinformatics Facility Biotechnology Resource Center Cornell University

Sequence Based Function Annotation. Qi Sun Bioinformatics Facility Biotechnology Resource Center Cornell University Sequence Based Function Annotation Qi Sun Bioinformatics Facility Biotechnology Resource Center Cornell University Usage scenarios for sequence based function annotation Function prediction of newly cloned

More information

Top-down Forecasting Using a CRM Database Gino Rooney Tom Bauer

Top-down Forecasting Using a CRM Database Gino Rooney Tom Bauer Top-down Forecasting Using a CRM Database Gino Rooney Tom Bauer Abstract More often than not sales forecasting in modern companies is poorly implemented despite the wealth of data that is readily available

More information

CHAPTER 8 T Tests. A number of t tests are available, including: The One-Sample T Test The Paired-Samples Test The Independent-Samples T Test

CHAPTER 8 T Tests. A number of t tests are available, including: The One-Sample T Test The Paired-Samples Test The Independent-Samples T Test CHAPTER 8 T Tests A number of t tests are available, including: The One-Sample T Test The Paired-Samples Test The Independent-Samples T Test 8.1. One-Sample T Test The One-Sample T Test procedure: Tests

More information

Introduction to BIOINFORMATICS

Introduction to BIOINFORMATICS Introduction to BIOINFORMATICS Antonella Lisa CABGen Centro di Analisi Bioinformatica per la Genomica Tel. 0382-546361 E-mail: lisa@igm.cnr.it http://www.igm.cnr.it/pagine-personali/lisa-antonella/ What

More information

WORKING WITH TEST DOCUMENTATION

WORKING WITH TEST DOCUMENTATION WORKING WITH TEST DOCUMENTATION CONTENTS II. III. Planning Your Test Effort 2. The Goal of Test Planning 3. Test Planning Topics: b) High Level Expectations c) People, Places and Things d) Definitions

More information

Genome Sequence Assembly

Genome Sequence Assembly Genome Sequence Assembly Learning Goals: Introduce the field of bioinformatics Familiarize the student with performing sequence alignments Understand the assembly process in genome sequencing Introduction:

More information

Genome-Wide Association Studies (GWAS): Computational Them

Genome-Wide Association Studies (GWAS): Computational Them Genome-Wide Association Studies (GWAS): Computational Themes and Caveats October 14, 2014 Many issues in Genomewide Association Studies We show that even for the simplest analysis, there is little consensus

More information

Gasoline Consumption Analysis

Gasoline Consumption Analysis Gasoline Consumption Analysis One of the most basic topics in economics is the supply/demand curve. Simply put, the supply offered for sale of a commodity is directly related to its price, while the demand

More information

What is it? Sherlock knows! Microbial Identification System

What is it? Sherlock knows! Microbial Identification System What is it? Sherlock knows! Microbial Identification System Sherlock Microbial Identification System Microbial ID 10 Minutes... $2.50 Is it accurate? Established technology with 15 years of clinical and

More information

i) Wild animals: Those animals that do not live under human supervision or control and do not have their phenotype selected by humans.

i) Wild animals: Those animals that do not live under human supervision or control and do not have their phenotype selected by humans. The OIE Validation Recommendations provide detailed information and examples in support of the OIE Validation Standard that is published as Chapter 1.1.6 of the Terrestrial Manual, or Chapter 1.1.2 of

More information

Chapter 11. Restriction mapping. Objectives

Chapter 11. Restriction mapping. Objectives Restriction mapping Restriction endonucleases (REs) are part of bacterial defense systems. REs recognize and cleave specific sites in DNA molecules. REs are an indispensable tool in molecular biology for

More information

Last Update: 12/31/2017. Recommended Background Tutorial: An Introduction to NCBI BLAST

Last Update: 12/31/2017. Recommended Background Tutorial: An Introduction to NCBI BLAST BLAST Exercise: Detecting and Interpreting Genetic Homology Adapted by T. Cordonnier, C. Shaffer, W. Leung and SCR Elgin from Detecting and Interpreting Genetic Homology by Dr. J. Buhler Recommended Background

More information

The Polymerase Chain Reaction. Chapter 6: Background

The Polymerase Chain Reaction. Chapter 6: Background The Polymerase Chain Reaction Chapter 6: Background Invention of PCR Kary Mullis Mile marker 46.58 in April of 1983 Pulled off the road and outlined a way to conduct DNA replication in a tube Worked for

More information

Design Microarray Probes

Design Microarray Probes Design Microarray Probes Erik S. Wright October 30, 2017 Contents 1 Introduction 1 2 Getting Started 1 2.1 Startup................................................ 1 2.2 Creating a Sequence Database....................................

More information