Review of whole genome methods


Review of whole genome methods
Suffix-tree based: MUMmer, Mauve, multi-mauve.
Gene based: Mercator, multiple orthology approaches.
Dot plot/clustering based: MUMmer 2.0, Pipmaker, LASTZ.
10/3/17

Rationale: MUMmer 2.0
The original implementation required large amounts of memory.
Advantages: detects chromosome-scale inversions in bacteria, large-scale duplications in Arabidopsis, and ancient human duplications when amino acid space is explored.
>70% of human chr 14 derives from chr 2.

Improvements
Uses suffix trees for a linear time and space solution, but with room for improvement.
Memory reduced from 293 MB to 100 MB using the suffix tree improvements of Kurtz (20 bytes/bp).
Time down from 74 s to 27 s using streaming.

Idea of algorithm
We take a streaming string and run McCreight's algorithm to find where it would go in the suffix tree of the reference.
If it branches off in a leaf edge, the match is unique in the reference string.
We then check the character immediately to the left in both strings for left maximality.
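The matching criteria above can be illustrated with a minimal sketch. It assumes a naive substring search in place of the suffix tree and McCreight's algorithm, so it is far slower than linear time and is not MUMmer's implementation. The function names and the `min_len` parameter are hypothetical. Uniqueness is checked only in the reference, as on the slide.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def unique_maximal_matches(reference, query, min_len=3):
    """Return (ref_pos, query_pos, length) for matches that are unique in
    the reference and cannot be extended to the left or to the right.

    Naive stand-in for streaming the query against a reference suffix tree.
    """
    matches = []
    for q in range(len(query)):
        # Extend to the right while the substring still occurs in the reference.
        length = 0
        while q + length < len(query) and query[q:q + length + 1] in reference:
            length += 1
        if length < min_len:
            continue
        segment = query[q:q + length]
        # Analogue of "ends in a leaf edge": exactly one occurrence in the reference.
        if count_occurrences(reference, segment) != 1:
            continue
        r = reference.find(segment)
        # Left maximality: the characters to the left must differ (or not exist).
        if q > 0 and r > 0 and query[q - 1] == reference[r - 1]:
            continue
        matches.append((r, q, length))
    return matches


print(unique_maximal_matches("ACGTTGCAAT", "TTACGTTGGG", min_len=4))
# prints [(0, 2, 6)]
```

The single reported match is ACGTTG, at position 0 in the reference and position 2 in the query. Shorter matches that start inside it are dropped by the left-maximality check.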

A mini quiz
You are given two genomes that your biologist colleagues think have perfectly matching repeats (>2 copies in each).
How would you find the length of the longest matching repeat within one genome, and in how much time?
How would you find the longest repeat shared between two genomes?

Pros and cons
Question 1: If you stream one or more strings against a suffix tree, are matches guaranteed to be unique in the queries?
Question 2: What are the advantages and disadvantages (if any) of using protein sequences instead of nucleotide ones?

Yeast paper
Beer may have cemented human societies through social acts, rituals, medicine and uncontaminated water.
Yeast, along with crops, may also have been domesticated.

Background
Brewing evolved in Middle Ages Europe to produce ale-type beer via Saccharomyces cerevisiae, the same yeast used in wine and leavened bread.
Lager brewing arose in 15th-century Bavaria and is the most popular technique.
Lager, however, requires slow, low-temperature fermentation by cryotolerant yeast(s).

Results
Saccharomyces are associated with oak trees in the Northern Hemisphere.
This study focused on Patagonia in South America, with 123 cryotolerant species and two isolates of S. cerevisiae.
The fact that so many were cryotolerant is unique relative to the Northern Hemisphere.
In biological assays, these group with the two known contaminants of lager/cider/wine fermentation.

Genome sequencing
Relationships are contentious, as the lager yeast and related yeasts were previously found only in human fermentation efforts.
To address this issue, the authors sequenced representatives from Patagonia and breweries using short-read/next-gen technology.
Comparisons were done to inform the biology here.

Domestication and analysis
Lager yeast is a mix of at least three yeast species.
Interestingly, all cryotolerant species have the same chunk of S. cerevisiae that is useful for processing maltose.
Maltose is one of the most abundant sugars in the wort used in brewing.
Fusion seems to have happened at least twice (see optional paper on course site).

Sequence Assembly Required! ISMB 2007

Sequence Assembly: Genome -> Sequenced Fragments (reads) -> Assembled Contigs -> Finished Genome

Greedy solution is bounded

Typical assembly strategy
$\binom{n}{2}$ pairs: $\Theta(n^2 l^2)$ run-time.
Directly detect promising pairs with an exact matching filter: $O(n)$ pairs, $O(n l^2)$ run-time.
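One way to picture the exact matching filter is a shared k-mer index: only reads that share an exact substring of length k become candidates for the expensive overlap alignment. The sketch below is an illustration under that assumption, not the method of any particular assembler. The function name, the value of `k`, and the toy reads are hypothetical. The number of candidate pairs stays near $O(n)$ only if each k-mer is shared by a few reads, which repeats can violate.

```python
from collections import defaultdict
from itertools import combinations
from math import comb


def candidate_pairs(reads, k):
    """Return the set of read-index pairs (i, j), i < j, sharing an exact k-mer."""
    index = defaultdict(set)  # k-mer -> ids of the reads that contain it
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read_id)
    pairs = set()
    for ids in index.values():
        # Every pair of reads listed under the same k-mer is a promising pair.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


reads = ["ACGTAC", "GTACGG", "TTTTTT", "CGGAAA"]  # toy data
promising = candidate_pairs(reads, k=3)
print(sorted(promising))          # prints [(0, 1), (1, 3)]
print(comb(len(reads), 2))        # prints 6, the all-pairs count
```

Here two of the six possible pairs survive the filter, so only those two would go on to the quadratic-time alignment step.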

Traditional Assemblers
TIGR Assembler, CAP3/PCAP, PHRAP, Celera Assembler, ARACHNE, JAZZ, PHUSION, ATLAS.
Advantages: effective heuristics to solve this NP-complete problem; brute-force parallelization is easy to implement.
Limitations: $\Theta(n^2)$ space required in the worst case; limited scaling as a result of using disk.

A look at the maize genome: repeats and gene islands

Problems due to repeats

Types of sequencing gaps (slide from Mihai Pop and Michael Schatz)

Modern assembly using de Bruijn graphs
G = (V, E), where V is the set of all length-k subfragments and E contains a directed edge between two nodes if they overlap by k-1 characters.
Relevant papers: De Bruijn, 1946; Idury and Waterman, 1995; Pevzner, Tang, Waterman, 2001.
Good news: the correct assembly exists as a path through G.
Bad news: there are many such paths!
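The graph definition above can be sketched in a few lines. This is a minimal illustration that follows the slide's definition literally (nodes are k-mers, with an edge wherever the last k-1 characters of one node equal the first k-1 of another). It is not the construction used by any specific assembler, and the function name and toy reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each k-mer (node) to the sorted list of k-mers it points to."""
    nodes = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            nodes.add(read[i:i + k])
    # Group nodes by their first k-1 characters so successors are a lookup.
    by_prefix = defaultdict(list)
    for node in nodes:
        by_prefix[node[:k - 1]].append(node)
    # Edge u -> v when the (k-1)-suffix of u equals the (k-1)-prefix of v.
    return {node: sorted(by_prefix.get(node[1:], [])) for node in sorted(nodes)}


graph = de_bruijn_graph(["ACGT", "CGTA"], k=3)
for node, successors in graph.items():
    print(node, "->", successors)
```

An assembly then corresponds to a path through this graph. Nodes with more than one successor are where several paths, and so several candidate assemblies, become possible.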

Try it out!
Consider the text: It was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness
Nodes in the graph are overlapping phrases of length 4, e.g. "It was the best" and "was the best of".
Draw an edge between nodes if the last three words of one node match the first three of another.
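To check a hand-drawn graph, the same construction can be applied to words instead of characters. The sketch below assumes that words are compared case-insensitively (so "It" and "it" are the same word) and that repeated phrases collapse into a single node. The function name is hypothetical.

```python
from collections import defaultdict

TEXT = ("It was the best of times it was the worst of times "
        "it was the age of wisdom it was the age of foolishness")


def phrase_graph(text, k=4):
    """Build the word-level graph: nodes are k-word phrases, and an edge
    joins two nodes when the last k-1 words of one equal the first k-1
    words of the other."""
    words = text.lower().split()  # assumption: ignore capitalisation
    nodes = {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}
    by_prefix = defaultdict(list)
    for node in nodes:
        by_prefix[node[:k - 1]].append(node)
    return {
        " ".join(node): sorted(" ".join(n) for n in by_prefix.get(node[1:], []))
        for node in nodes
    }


graph = phrase_graph(TEXT)
# Nodes with more than one outgoing edge are where the reconstruction branches.
for node, successors in sorted(graph.items()):
    if len(successors) > 1:
        print(node, "->", successors)
```

The printed nodes are the places where the graph forks, which is where a path through it has to choose between continuations.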

Iowa State University

Try it out! (part 2)
Consider the text: It was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness
How could you construct an assembly based on this graph?
Are there multiple answers?
How many possible answers are correct?