De novo genome assembly with next generation sequencing data


Transcription:

Jianbin Wang
HMGP 7620 (CPBS 7620 and BMGN 7620), Genomics lectures
2/7/12

Outline
- The need for de novo genome assembly
- The nature of next generation sequencing data
- The concepts and methods
- The take home lessons

Why/when do we need de novo genome assembly?
- Many interesting organisms do not have a genome sequence available; these genomes have to be built by NGS de novo assembly.
- Within a species, each individual has its own genome.
- Within one individual, different cells may carry genome alterations.

New genomes

Within species

Within an individual

The nature of NGS data
- Higher parallel operation and yield
- Much lower cost per base
- Shorter reads (unfortunately)
  - 454: 200-400 bp
  - Illumina: 50-150 bp
  - Sanger sequencing: 600-1000 bp
  - ABI SOLiD: 35-75 bp
- Platform-based characteristic errors

Illumina paired-end vs. mate pair sequencing
- Paired-end
- Mate pair

De novo genome assembly concepts
Whole genome shotgun sequencing: genomic DNA -> genomic reads (paired-end and mate pair) -> de novo assembly -> contigs (Contig1, Contig2, Contig3, Contig4) -> scaffold, with gaps between the contigs

Some vocabulary
- Coverage (C); figure example: C = 4
- K-mer coverage (C_k); figure examples: C_k = 2 for k = 10, C_k = 3 for k = 5
- N50, N90; example contig statistics:
  - N50 = 18,063 bp, N50 number = 4,175
  - N90 = 3,548 bp, N90 number = 16,950

Methods: overlap-layout-consensus
- Pair-wise sequence alignments (computationally expensive)
- Construction of an overlap graph to produce the read layout
- Multiple sequence alignment to generate the consensus
- Examples: Phrap, Celera, Arachne, CAP, PCAP, Newbler
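The vocabulary above can be made concrete with a short sketch. It assumes the usual definitions: coverage is total sequenced bases divided by genome size, k-mer coverage follows the relation C_k = C (L - k + 1) / L for read length L, and N50 is the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the assembly size. The function names and the toy contig lengths are illustrative and do not come from the slides.

```python
def read_coverage(num_reads, read_length, genome_size):
    """Average coverage C: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def kmer_coverage(coverage, read_length, k):
    """K-mer coverage C_k = C * (L - k + 1) / L for reads of length L."""
    return coverage * (read_length - k + 1) / read_length


def nx_stats(contig_lengths, percent=50):
    """Return (Nx length, Nx number) for x = percent.

    Contigs are taken from longest to shortest until their summed length
    reaches `percent` percent of the total assembly length. The length of
    the last contig taken is Nx; how many contigs were taken is the
    "Nx number".
    """
    if not contig_lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(contig_lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length, count


# Toy assembly (made-up lengths, total 300 bp).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
n50 = nx_stats(toy_contigs, 50)
n90 = nx_stats(toy_contigs, 90)
```

With the toy lengths, the two longest contigs already hold half of the 300 bp, so N50 is 70 bp with an N50 number of 2; N90 needs five contigs and is 30 bp.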

Methods: Eulerian path/de Bruijn graph
- K-mer hash table
- de Bruijn graph/Eulerian path search
- Examples: Euler, Velvet, ALLPATHS, ABySS, SOAPdenovo, ...
- Example: the read AGATGATTCG breaks into the 3-mers AGA, GAT, ATG, TGA, GAT, ATT, TTC, TCG

Differences between an overlap graph and a de Bruijn graph (figure from Schatz et al. 2010)
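The k-mer example on this slide can be reproduced with a minimal sketch. It assumes one common formulation in which each k-mer is a node and an edge joins two k-mers that occur consecutively in a read (so they overlap by k - 1 bases); real assemblers such as Velvet or SOAPdenovo use more compact structures and also handle reverse complements, which this sketch ignores. The function names are illustrative.

```python
from collections import defaultdict


def kmers(sequence, k):
    """All overlapping k-mers of a sequence, in order of occurrence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def de_bruijn_graph(reads, k):
    """Map each k-mer to the sorted list of k-mers that directly follow it.

    K-mers that never have a successor (such as the last k-mer of a
    single read) do not appear as keys.
    """
    graph = defaultdict(set)
    for read in reads:
        words = kmers(read, k)
        for left, right in zip(words, words[1:]):
            graph[left].add(right)
    return {node: sorted(successors) for node, successors in graph.items()}


words = kmers("AGATGATTCG", 3)
graph = de_bruijn_graph(["AGATGATTCG"], 3)
# GAT occurs twice in the read, so its node has two successors (ATG and ATT).
branching = {node: nxt for node, nxt in graph.items() if len(nxt) > 1}
```

The repeated 3-mer GAT becomes a branching node, a small-scale version of how repetitive sequence makes the path through the graph ambiguous.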

Methods: challenges
- Repetitive sequence
- DNA polymorphisms and sequencing errors
- Non-uniform coverage (worse in Sanger sequencing)
- Computational complexity of processing large volumes of data

Reducing the complexity of the data
- Sub-assembly (grouped assembly)
- Repeat masking
- Reference-based assembly

Additional scaffolding
- Related genome as reference
- cDNAs/transcriptomes
- Conserved proteins
- Paired-end information
(Figure: Contig1 to Contig4 ordered into a scaffold by alignment to a reference genome, cDNA, or conserved protein.)

Genome assessment: coverage
- Read coverage/reads used
- Physical coverage
- Functional coverage
  - cDNAs
  - Small RNAs
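As an illustration of how paired-end information can support scaffolding, the sketch below estimates the gap between two contigs from read pairs whose mates map to different contigs. It is a simplified sketch, not the method of any particular scaffolder: it assumes a known library insert size, that read 1 maps forward on contig A and read 2 maps reverse on contig B, and that coordinates are 0-based. All names and numbers are hypothetical.

```python
from statistics import median


def gap_from_pair(insert_size, contig_a_length, read1_start, read2_end):
    """Gap implied by one read pair linking contig A to contig B.

    The insert spans from read1_start on contig A, across the gap, to
    read2_end on contig B (measured from the start of B), so
    insert_size = (contig_a_length - read1_start) + gap + read2_end.
    """
    covered_on_a = contig_a_length - read1_start
    return insert_size - covered_on_a - read2_end


def estimate_gap(insert_size, contig_a_length, pairs):
    """Median gap estimate over several (read1_start, read2_end) pairs."""
    if not pairs:
        raise ValueError("need at least one linking read pair")
    return median(
        gap_from_pair(insert_size, contig_a_length, start, end)
        for start, end in pairs
    )


# Hypothetical 500 bp library linking a 1,000 bp contig A to contig B.
single = gap_from_pair(500, 1000, 900, 150)
combined = estimate_gap(500, 1000, [(900, 150), (880, 140), (910, 170)])
```

Taking the median over several linking pairs dampens the effect of insert-size variation; a negative estimate would suggest that the two contigs overlap rather than being separated by a gap.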

Genome assessment: continuity
- Consistency with available genetic maps
- Paired-end discrepancy
- mRNA/cDNA intactness


The take home lessons
De novo genome assembly on NGS data
- is feasible
- is still a very hard problem
- algorithm matters, but more important is the source of the DNA and the quality of the library
- a reference genome or other higher-order genetic map is of great value
- put it into the biological context

References/Additional reading
- Schatz, M. C., A. L. Delcher, et al. (2010). "Assembly of large genomes using second-generation sequencing." Genome Research 20(9): 1165-1173.
- Earl, D., K. Bradnam, et al. (2011). "Assemblathon 1: a competitive assessment of de novo short read assembly methods." Genome Research 21(12): 2224-2241.
- Salzberg, S. L., A. M. Phillippy, et al. (2012). "GAGE: A critical evaluation of genome assemblies and assembly algorithms." Genome Research.
- Treangen, T. J. and S. L. Salzberg (2012). "Repetitive DNA and next-generation sequencing: computational challenges and solutions." Nature Reviews Genetics 13(1): 36-46.