Bellerophon; a program to detect chimeric sequences in multiple sequence alignments


Revised ms: BIOINF-03-0817

Bellerophon; a program to detect chimeric sequences in multiple sequence alignments

Thomas Huber 1,*, Geoffrey Faulkner 1 and Philip Hugenholtz 2

1 ComBinE group, Advanced Computational Modelling Centre, The University of Queensland, Brisbane 4072, Australia
2 Department of Environmental Science, Policy and Management, 11 Hilgard Hall, University of California Berkeley, Berkeley, CA 947-31, USA

* To whom correspondence should be addressed. Tel: Int + 617-336-7060; Email: huber@maths.uq.edu.au

Running head: Bellerophon: chimera detection program
Journal section: Application note

ABSTRACT

Summary: Bellerophon is a program for detecting chimeric sequences in multiple sequence datasets by an adaptation of partial treeing analysis. Bellerophon was specifically developed to detect 16S rRNA gene chimeras in PCR-clone libraries of environmental samples but can be applied to other nucleotide sequence alignments.

Availability: Bellerophon is available as an interactive web server at http://foo.maths.uq.edu.au/~huber/bellerophon.pl

Contact: huber@maths.uq.edu.au

INTRODUCTION

A PCR-generated chimeric sequence is usually composed of two phylogenetically distinct parent sequences. It arises when a prematurely terminated amplicon reanneals to a foreign DNA strand and is copied to completion in the following PCR cycles. The point at which the chimeric sequence changes from one parent to the next is called the conversion, recombination or break point. Chimeras are problematic in culture-independent surveys of microbial communities because they suggest the presence of nonexistent organisms (von Wintzingerode et al., 1997). Several methods have been developed for detecting chimeric sequences (Cole et al., 2003; Komatsoulis and Waterman, 1997; Liesack et al., 1991; Robinson-Cox et al., 1995); these generally rely on direct comparison of individual sequences to one or two putative parent sequences at a time. Here we present an alternative approach based on how well sequences fit into their complete phylogenetic context.

METHOD

Bellerophon detects chimeras based on a partial treeing approach (Wang and Wang, 1997; Hugenholtz and Huber, 2003); that is, phylogenetic trees are inferred from independent regions (fragments) of a multiple sequence alignment and the branching patterns are compared for incongruencies that may be indicative of chimeric sequences. No trees are actually built during the procedure, and the only calculations required are distance (sequence similarity) calculations. A full matrix of distances (dm) between all pairs of sequences is calculated for the fragments left and right of an assumed break point. The total absolute deviation of the distance matrices (distance matrix error, dme) of n sequences is then

$$\mathrm{dme} = \sum_{i}^{n} \sum_{j}^{n} \left| \mathrm{dm}^{\mathrm{left}}[i][j] - \mathrm{dm}^{\mathrm{right}}[i][j] \right|$$

where dm[i][j] denotes the distance between two sequences i and j, and the superscripts left and right denote the fragment from which the distance was calculated. The largest contribution to the dme is expected to arise from chimeras, since fragments from these sequences have distinctly different locations relative to all other sequences in the dataset, and therefore distinctly different distance matrices.

To rank the sequences by their contribution to the dme value, we calculate the ratio of the dme value from all sequences over the dme value (dme[i]) of a reference dataset lacking the sequence i under consideration. This ratio is called the preference score of the sequence:

$$\mathrm{preference}[i] = \frac{\mathrm{dme}}{\mathrm{dme}[i]} \qquad (1)$$

Chimeric sequences will have a preference score >1, whereas non-chimeric sequence scores are expected to be ~1. To detect all putative chimeras in a dataset, preference scores have to be calculated for all sequences. Naïvely, the calculation would
require a computationally expensive distance matrix comparison for each sequence in the dataset. This can, however, be implemented more efficiently by taking advantage of previously performed calculations. The calculation of the dme involves column sums of the form

$$\mathrm{col}[i] = \sum_{j}^{n} \left| \mathrm{dm}^{\mathrm{left}}[i][j] - \mathrm{dm}^{\mathrm{right}}[i][j] \right|$$

Because the distance matrices are symmetric and the distances between identical sequences dm[i][i] are by definition zero, removing sequence i removes exactly one row and one column of deviations, and equation (1) can be rewritten as

$$\mathrm{preference}[i] = \frac{\mathrm{dme}}{\mathrm{dme} - 2\,\mathrm{col}[i]}$$

which only involves calculation of a single matrix and some intermediate storage of the column sums.

To determine the optimal break point for putative chimeras, all sequences are scanned along their length by dividing the alignment into fragment pairs at regular character intervals. Distances are calculated from equally sized windows (200, 300 or 400 characters) of the fragments left and right of the break point to obtain similar signal-to-noise ratios for each fragment. The highest preference score calculated for each sequence over all fragment pairs indicates the optimal break point. Sequences are ranked according to their highest recorded preference score and reported as potentially chimeric if that score is >1. Absolute preference scores are dataset-dependent and should only be used for relative ranking of putative chimeras within a given dataset.

For manual confirmation of identified chimeras and phylogenetic placement of the chimeric halves, it is necessary to specify the most likely parent sequences in the dataset that gave rise to the chimera. Parent sequences are assigned to each putative
chimera by selecting the two sequences with the highest opposing paired distance contributions (dm[i][j]) to the dme at the optimal break point. The parent sequences of a chimera are most likely to be found in the same PCR-clone library, and therefore as many sequences as possible from that library should be included in the analysis. However, even if the exact parent sequences of a given chimera are not present in the dataset, Bellerophon will identify and report the closest phylogenetic neighbors of the parents. In addition, the output from Bellerophon includes the location of the optimal break point relative to an E. coli reference alignment (Brosius et al., 1978) and the percentage identities of the parent sequences to the chimera on either side of the break point. These features aid in verification of chimeras.

Mutually incompatible chimeras are screened from the Bellerophon output. That is, once a sequence (A) has been identified as chimeric, subsequent putative chimeras with lower preference scores that identify sequence A as one of their parents are removed from the output list.

PHYLOGENETIC CONSIDERATIONS

Alignment

As with all phylogeny-based methods, an alignment that ensures positional homology is critical. If sequences are poorly aligned, Bellerophon will produce meaningless results. By default, sequences are assumed to be aligned. However, the align sequence option (set to yes) will automatically align up to 300 unaligned sequences using ClustalW (Thompson et al., 1994) in preparation for running on Bellerophon.

Filtering

Regions of 16S rRNA with variable secondary structure between different organisms are
routinely removed from sequence alignments prior to phylogenetic inference (e.g. Lane, 1991) because they often cannot be unambiguously aligned and therefore introduce noise into the analysis. We have implemented an automatic consensus filter to remove variable positions from the alignment prior to calculating pairwise distances. Sequence positions (columns) that have nucleotides in less than 50% of all sequences are removed, and from the resulting consensus only sequences longer than two times the window size are used in the analysis. However, we advise removing sequences shorter than 2*window size + 0 prior to submission if the dataset contains many partial sequences, since they will cause over-filtering of the alignment. The length of the filtered alignment also dictates how close to either end of the alignment break points can be detected. For instance, if the largest window size of 400 is selected, break points can only be detected at least 400 characters from either end of the alignment.

Distance correction

In phylogenetic analysis, the distance (nucleotide divergence) between two sequences is commonly corrected for unseen nucleotide substitutions to more accurately reflect their evolution. For the detection of chimeric sequences, evolutionary distance corrections may not be ideal since they accentuate long distance relationships (e.g. Jukes and Cantor, 1969). Chimeras are more likely to result from recombination of closely related parent sequences, which led us to develop an empirical distance function (Huber-Hugenholtz correction):

$$d = 1 - id$$

where d is the corrected distance between two sequences with a fraction id of identically aligned bases. This function strongly differentiates between highly similar sequences and
puts less emphasis on differences between remote homologs. It should be noted that this is not a phylogenetic correction, as it does not attempt to correct for unseen nucleotide substitutions. Evolutionary distance calculations are sensitive to partial length sequences in an alignment, and in the worst case, non-overlapping partial sequences result in infinite pairwise distances. This problem is alleviated in the Huber-Hugenholtz correction, where non-overlapping sequences have a bound value of 1. Even though this allows the analysis of datasets containing sequences of different lengths, non-overlapping partial sequences can produce artificial skews in the distance matrices of alignment fragments, resulting in potentially unreliable chimera calls. We strongly recommend using alignments of overlapping sequences of similar length.

CONCLUSIONS

Bellerophon detects all putative chimeras in a multiple sequence alignment in a single analysis. This means that sequences from a PCR-clone library can be analyzed and the results interpreted in parallel, instead of serially as is the case with other commonly used programs such as CHIMERA_CHECK (Cole et al., 2003). This also has the advantage that individual sequences do not become invisible to the program, since they are not compared in isolation to a reference dataset (Hugenholtz and Huber, 2003). Bellerophon has been available to the scientific community since January 2003 and has been trialed by over 0 different users (ca. 70 submissions per month). As with all chimera-detection programs, putatively identified chimeras should be manually verified, for example by inspecting the sequences for signature shifts (Wang and Wang, 1997).
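To make the preference score from the METHOD section concrete, the following is a minimal Python sketch, not the Bellerophon implementation. It computes the dme, the leave-one-out preference score of equation (1), and the column-sum shortcut, and checks that the two scores agree. The function names and the toy 4-sequence distance matrices are assumptions made for illustration; the matrices are taken as given, so no particular distance function is implied. Sequence 3 is constructed to resemble sequence 0 in the left fragment and sequence 2 in the right fragment.

```python
import math


def distance_matrix_error(dm_left, dm_right):
    """Total absolute deviation between two distance matrices (the dme)."""
    n = len(dm_left)
    return sum(abs(dm_left[i][j] - dm_right[i][j])
               for i in range(n) for j in range(n))


def preference_naive(dm_left, dm_right):
    """Equation (1): full dme divided by the dme of the dataset lacking i."""
    n = len(dm_left)
    dme = distance_matrix_error(dm_left, dm_right)
    scores = []
    for i in range(n):
        keep = [k for k in range(n) if k != i]
        sub_left = [[dm_left[a][b] for b in keep] for a in keep]
        sub_right = [[dm_right[a][b] for b in keep] for a in keep]
        scores.append(dme / distance_matrix_error(sub_left, sub_right))
    return scores


def preference_fast(dm_left, dm_right):
    """Column-sum shortcut: preference[i] = dme / (dme - 2 * col[i]).

    Assumes symmetric matrices with a zero diagonal. A zero denominator
    (all deviation involving a single sequence) is not handled here.
    """
    n = len(dm_left)
    col = [sum(abs(dm_left[i][j] - dm_right[i][j]) for j in range(n))
           for i in range(n)]
    dme = sum(col)
    return [dme / (dme - 2.0 * c) for c in col]


# Hypothetical distances for the fragments left and right of one break point.
dm_left = [
    [0.0, 0.1, 0.4, 0.1],
    [0.1, 0.0, 0.4, 0.2],
    [0.4, 0.4, 0.0, 0.4],
    [0.1, 0.2, 0.4, 0.0],
]
dm_right = [
    [0.0, 0.12, 0.38, 0.4],
    [0.12, 0.0, 0.4, 0.4],
    [0.38, 0.4, 0.0, 0.1],
    [0.4, 0.4, 0.1, 0.0],
]

naive = preference_naive(dm_left, dm_right)
fast = preference_fast(dm_left, dm_right)
top = max(range(len(fast)), key=lambda i: fast[i])
print("preference scores:", [round(s, 3) for s in fast])
print("highest-ranked sequence:", top)
```

In a dataset this small every sequence scores above 1, which illustrates why absolute preference scores should be used only for relative ranking within a dataset. A full scan would repeat this calculation for each candidate break point and keep the highest score per sequence.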

REFERENCES

Brosius,J., Palmer,M.L., Kennedy,P.J. and Noller,H.F. (1978) Complete nucleotide sequence of a 16S ribosomal RNA gene from Escherichia coli. Proc. Natl. Acad. Sci. USA, 75, 4801-4805.

Cole,J.R., Chai,B., Marsh,T.L., Farris,R.J., Wang,Q., Kulam,S.A., Chandra,S., McGarrell,D.M., Schmidt,T.M., Garrity,G.M. and Tiedje,J.M. (2003) The Ribosomal Database Project (RDP-II): previewing a new autoaligner that allows regular updates and the new prokaryotic taxonomy. Nucleic Acids Res., 31, 442-443.

Hugenholtz,P. and Huber,T. (2003) Chimeric 16S rDNA sequences of diverse origin are accumulating in the public databases. Int. J. Syst. Evol. Microbiol., 53, 289-293.

Jukes,T.H. and Cantor,C.R. (1969) Evolution of protein molecules. In Munro,H.N. (ed), Mammalian Protein Metabolism, vol. 3. Academic Press, New York, pp. 21-132.

Komatsoulis,G. and Waterman,M. (1997) A new computational method for detection of chimeric 16S rRNA artifacts generated by PCR amplification from mixed bacterial populations. Appl. Environ. Microbiol., 63, 2338-2346.

Lane,D.J. (1991) 16S/23S rRNA sequencing. In Stackebrandt,E. and Goodfellow,M. (eds), Nucleic Acid Techniques in Bacterial Systematics. John Wiley and Sons, New York, pp. 115-175.

Liesack,W., Weyland,H. and Stackebrandt,E. (1991) Potential risks of gene amplification by PCR as determined by 16S rDNA analysis of a mixed-culture of strict barophilic bacteria. Microb. Ecol., 21, 191-198.

Robinson-Cox,J.F., Bateson,M.M. and Ward,D.M. (1995) Evaluation of nearest-neighbor methods for detection of chimeric small-subunit rRNA sequences. Appl. Environ. Microbiol., 61, 1240-1245.

Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673-4680.

von Wintzingerode,F., Göbel,U.B. and Stackebrandt,E. (1997) Determination of microbial diversity in environmental samples: pitfalls of PCR-based rRNA analysis. FEMS Microbiol. Rev., 21, 213-229.

Wang,G.C.Y. and Wang,Y. (1997) Frequency of formation of chimeric molecules as a consequence of PCR coamplification of 16S rRNA genes from mixed bacterial genomes. Appl. Environ. Microbiol., 63, 4645-4650.