Practical Bioinformatics for Life Scientists. Week 14, Lecture 27. István Albert Bioinformatics Consulting Center Penn State


No homework this week.
The project will be given out next Thursday (Dec 1st) and is due the following Thursday (Dec 8th).
Turn in any homework you might have missed (partial credit will be given).

Introduction to metagenomics
- Metagenomics: the study of genetic material recovered from environmental samples.
- Nothing lives on its own.
- The human body is estimated to contain 10 trillion cells.
- The human body is said to contain more bacteria than human cells (estimates of 10:1): 100 trillion bacteria!

http://xkcd.com/978/

I got intrigued! So what is the real number? How many cells are there in the human body?
Wooley JC, Godzik A, Friedberg I (2010) A Primer on Metagenomics. PLoS Comput Biol 6(2): e1000667. doi:10.1371/journal.pcbi.1000667
It says: "... to the benefits in agriculture, food industry, and medicine to name a few. We humans have more bacterial cells (10^14) inhabiting our body than our own cells (10^13) [2],[3]. It has been stated ..."
The citations are:
[2] Savage DC (1977) Microbial ecology of the gastrointestinal tract. Annu Rev Microbiol 31: 107-133.
[3] Berg R (1996) The indigenous gastrointestinal microflora. Trends Microbiol 4: 430-435.

Let's go back to 1977!
Savage DC (1977) Microbial ecology of the gastrointestinal tract. Annu Rev Microbiol 31: 107-133.
And down the rabbit hole we go, back to 1971.

Genetics of the Evolutionary Process, by Theodosius Grigorievich Dobzhansky.
No citation or source given.
It appears that in 1971 Dobzhansky stated that the human body contains ten trillion cells, and this has been cited as a fact ever since.

Metagenomics approaches: metagenome assembly
- Reconstitute entire genomes from randomly fragmented short reads.
- Challenges: many (mostly) unknown genomes, genomes with various abundances, sequences with systematic errors.
- More data, worse results?
- How many genomes could there be in the first place? Some say: over 1 billion!
- Personal note: 1 billion sounds like a really big number, and we just saw an example of where big numbers come from.

16S rRNA: the gene
- The 16S rRNA gene is highly conserved between different species of bacteria and archaea.
- Parts of the gene are identical in just about all bacteria, which is what makes universal primers possible.
- Other sections of the 16S rRNA gene contain species-specific signatures.
- LSU, SSU: the large subunit and small subunit of the ribosome. The 16S rRNA belongs to the small subunit.
- rRNA is a core part of the ribosome, which decodes mRNA into chains of amino acids. It is essential to life.

16S rRNA sequencing
- Isolate only the 16S rRNA genes and sequence only these.
- Pros: requires far less coverage and allows us to characterize the population.
- Cons: we don't know what the unknown bacteria are like other than the region that we sequenced.

Phylotyping
- Map the reads against all known genetic sequences.
- Find exact or partial (homologous) matches and characterize the sample by the known genomes.
- There are not that many fully sequenced bacteria: thousands?
- But there is data on lots of partially sequenced ones: millions.
- The tool to use is BLAST (we will cover it later in this lecture).

OTU-based approaches
- OTU: operational taxonomic unit.
- The most common approach. No taxonomy is required.
- Sequences are characterized by their similarity to one another.
- Groups are formed based on similarity thresholds.
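To make the idea of grouping by similarity threshold concrete, here is a minimal Python sketch of greedy clustering. It is an illustration only, not how any particular OTU tool works: it assumes the sequences are already aligned and of equal length, it measures similarity as the plain fraction of matching positions, and the function names `identity` and `greedy_otus` are hypothetical. The 0.97 default reflects a commonly used cutoff, which is an assumption not taken from the slides.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes pre-aligned, equal-length sequences")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def greedy_otus(seqs, threshold=0.97):
    """Put each sequence into the first OTU whose seed it matches at or
    above the threshold; otherwise it becomes the seed of a new OTU."""
    otus = []  # list of (seed sequence, member list) pairs
    for seq in seqs:
        for seed, members in otus:
            if identity(seq, seed) >= threshold:
                members.append(seq)
                break
        else:
            otus.append((seq, [seq]))
    return [members for _, members in otus]


# Toy reads, invented for illustration.
reads = ["ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC", "TTTTGGGGCA"]
clusters = greedy_otus(reads, threshold=0.9)
for number, members in enumerate(clusters, start=1):
    print(f"OTU {number}: {members}")
```

Note that the result depends on the threshold and on the order of the input, which is one way of seeing why the grouping is considered subjective.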

Strengths and weaknesses
- Phylotyping shifts/biases the view towards known sequences and has to rely on an external taxonomy.
- OTU is far more abstract: it cannot tell what is actually there, and the grouping is subjective.
- Most often people use a combination of all techniques.

Complication: competing standards Multiple sources of bacterial taxonomical classification

Competing 16S rRNA taxonomical databases
- SILVA: comprehensive ribosomal RNA database (ARB, Germany)
- RDP: Ribosomal Database Project (Michigan State, USA)
- Greengenes: comprehensive 16S rRNA gene sequence alignment
- Sequences can also be downloaded from GenBank.
Each of these resources creates a slightly different taxonomy. Silva has 9 levels.

Some tools mandate the use of the NCBI taxonomy

Using BLAST
- It used to be that bioinformatics was synonymous with using BLAST: Basic Local Alignment Search Tool.
- Accessible via a web interface or the command line.
- Aligns sequences in a FASTA query file against a target database.
- It can align nucleotides and proteins to other nucleotides and proteins, in all combinations.
- It can also translate a query/target into all reading frames when searching for a match.
- A fairly large number of tools are included.
- Mastering BLAST is an extremely valuable skill!
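Translating into "all reading frames" means three frames on the forward strand plus three on the reverse complement. The Python sketch below illustrates that idea only. It is not BLAST's implementation: it assumes the standard genetic code and uppercase A/C/G/T input, and the helper names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq, frame=0):
    """Translate one frame (offset 0, 1 or 2); unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(seq):
    """Map frame labels (+1..+3, -1..-3) to their translations."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq, offset)
        frames[f"-{offset + 1}"] = translate(rc, offset)
    return frames


for label, protein in six_frames("ATGGCCTAA").items():
    print(label, protein)
```

A translated search compares each of these six protein strings against the target, which is why it can find matches that a direct nucleotide comparison would miss.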

Installing BLAST
- Download the binary for your platform from the NCBI website.
- It comes in two versions: the legacy BLAST and the newer BLAST+.
- The new version has different program names and slightly different output (and that breaks a lot of other programs).
- NCBI recommends BLAST+.

BLAST programs old name new name

Create an index to your target
On cygwin, pass a relative path to BLAST rather than one that starts with ~ (the Windows build of BLAST cannot handle Unix-style paths).
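As a sketch of the indexing step, the Python snippet below builds the BLAST+ `makeblastdb` call as an argument list and runs it only if the program and the input file are present. The file name `yeast.fa` and database name `yeast` are hypothetical placeholders, and the snippet assumes BLAST+ rather than the legacy programs.

```python
import os
import shutil
import subprocess


def makeblastdb_command(fasta_path, db_name, dbtype="nucl"):
    """Argument list for indexing a FASTA file with BLAST+ makeblastdb."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


# Hypothetical names; a relative path sidesteps the "~" problem on cygwin.
cmd = makeblastdb_command("yeast.fa", "yeast")
print(" ".join(cmd))

# Run only when the tool and the input file actually exist.
if shutil.which("makeblastdb") and os.path.exists("yeast.fa"):
    subprocess.run(cmd, check=True)
```

Use `dbtype="prot"` when the target is a protein FASTA file.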

Run BLAST
Take the first 50 bases of chromosome 1 of the yeast genome and put them into reads.fa.
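One way to produce such a test read is sketched below in Python. It works on FASTA text held in memory so that it runs without the yeast genome. The toy sequence, the record names and the helper functions are all made up for illustration. With the real file you would read its contents and write the result to reads.fa.

```python
def first_record(fasta_text):
    """Return (header, sequence) of the first record in FASTA text."""
    header = None
    chunks = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # the second record begins here, so stop
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    return header, "".join(chunks)


def make_read(fasta_text, n=50, name="read1"):
    """FASTA text holding the first n bases of the first record."""
    _, seq = first_record(fasta_text)
    return f">{name}\n{seq[:n]}\n"


# Toy stand-in for a genome file: two records, the first split over two lines.
genome = ">chrI\n" + "ACGT" * 20 + "\n" + "GGGG" * 5 + "\n>chrII\nTTTT\n"
read = make_read(genome)
print(read, end="")

# With real data, something like:
#   open("reads.fa", "w").write(make_read(open("yeast.fa").read()))
```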

Turn off low complexity filtering
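For nucleotide searches with BLAST+ `blastn`, low complexity filtering is controlled by the `-dust` option, and `-dust no` switches it off. The Python sketch below only assembles the command line. The query and database names are hypothetical, and the helper function is not part of BLAST.

```python
def blastn_command(query, db, low_complexity_filter=True):
    """Argument list for a BLAST+ blastn search.

    blastn applies the DUST low complexity filter by default, so the
    option is only added when the filter should be switched off.
    """
    cmd = ["blastn", "-query", query, "-db", db]
    if not low_complexity_filter:
        cmd += ["-dust", "no"]
    return cmd


# Hypothetical query file and database name.
print(" ".join(blastn_command("reads.fa", "yeast", low_complexity_filter=False)))
```

Protein searches use a different filter; for `blastp` the corresponding option is `-seg no`.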

Change output formats
Next lecture we continue with BLAST and a metagenomics sample.
Please install MEGAN, the MEtaGenome Analyzer: http://ab.inf.uni-tuebingen.de/software/megan/
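On the topic of output formats: BLAST+ can write tab-separated output with `-outfmt 6`, which is far easier to process in a script than the default report. The Python sketch below parses the twelve default columns of that format. The sample line is invented for illustration and is not real BLAST output, and the parser is a minimal assumption-laden example rather than a complete reader.

```python
# Default columns of BLAST+ tabular output (-outfmt 6).
FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_FIELDS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}


def parse_tabular(text):
    """Parse -outfmt 6 text into a list of dicts, one per hit."""
    hits = []
    for line in text.splitlines():
        # Skip blank lines and the comment lines written by -outfmt 7.
        if not line.strip() or line.startswith("#"):
            continue
        hit = dict(zip(FIELDS, line.split("\t")))
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        for key in FLOAT_FIELDS:
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits


# Invented example line, for illustration only.
sample = "read1\tchrI\t100.000\t50\t0\t0\t1\t50\t1\t50\t1.2e-21\t93.5\n"
for hit in parse_tabular(sample):
    print(hit["qseqid"], hit["sseqid"], hit["pident"], hit["evalue"])
```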