Sequence Assembly and Alignment. Jim Noonan Department of Genetics

Similar documents
Next Genera*on Sequencing II: Personal Genomics. Jim Noonan Department of Gene*cs

Lecture 14: DNA Sequencing

10/20/2009 Comp 590/Comp Fall

De novo assembly of human genomes with massively parallel short read sequencing. Mikk Eelmets Journal Club

The Diploid Genome Sequence of an Individual Human

Genome Sequencing-- Strategies

Alignment and Assembly

NGS developments in tomato genome sequencing

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es

De novo genome assembly with next generation sequencing data!! "

NEXT GENERATION SEQUENCING. Farhat Habib

Contact us for more information and a quotation

CSCI2950-C DNA Sequencing and Fragment Assembly

Genome Projects. Part III. Assembly and sequencing of human genomes

Next Generation Sequences & Chloroplast Assembly. 8 June, 2012 Jongsun Park

Introduction to metagenome assembly. Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014

BENG 183 Trey Ideker. Genome Assembly and Physical Mapping

De Novo Assembly of High-throughput Short Read Sequences

De novo whole genome assembly

DNA Sequencing and Assembly

Course summary. Today. PCR Polymerase chain reaction. Obtaining molecular data. Sequencing. DNA sequencing. Genome Projects.

The first generation DNA Sequencing

CSE182-L16. LW statistics/assembly

De novo whole genome assembly

A shotgun introduction to sequence assembly (with Velvet) MCB Brem, Eisen and Pachter

De novo whole genome assembly

Alignment methods. Martijn Vermaat Department of Human Genetics Center for Human and Clinical Genetics

Molecular Biology: DNA sequencing

Outline. The types of Illumina data Methods of assembly Repeats Selecting k-mer size Assembly Tools Assembly Diagnostics Assembly Polishing

Genomics AGRY Michael Gribskov Hock 331

Introduction to Next Generation Sequencing

Outline. DNA Sequencing. Whole Genome Shotgun Sequencing. Sequencing Coverage. Whole Genome Shotgun Sequencing 3/28/15

Conditional Random Fields, DNA Sequencing. Armin Pourshafeie. February 10, 2015

Mate-pair library data improves genome assembly

Bioinformatics for Genomics

Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome

RNA-Sequencing analysis

The Human Genome and its upcoming Dynamics

High throughput sequencing technologies

High-Throughput Bioinformatics: Re-sequencing and de novo assembly. Elena Czeizler

De novo assembly in RNA-seq analysis.

Data Basics. Josef K Vogt Slides by: Simon Rasmussen Next Generation Sequencing Analysis

Supplementary Figure 1. Design of the control microarray. a, Genomic DNA from the

Next Gen Sequencing. Expansion of sequencing technology. Contents

Genomic resources. for non-model systems

Biol 478/595 Intro to Bioinformatics

A Guide to Consed Michelle Itano, Carolyn Cain, Tien Chusak, Justin Richner, and SCR Elgin.

The Basics of Understanding Whole Genome Next Generation Sequence Data

Illumina (Solexa) Throughput: 4 Tbp in one run (5 days) Cheapest sequencing technology. Mismatch errors dominate. Cost: ~$1000 per human genme

ARACHNE: A Whole-Genome Shotgun Assembler

Genome Analyzer. RNA ChIP

Assembling a Cassava Transcriptome using Galaxy on a High Performance Computing Cluster

BIOINFORMATICS ORIGINAL PAPER

Finishing Drosophila Ananassae Fosmid 2728G16

TruSPAdes: analysis of variations using TruSeq Synthetic Long Reads (TSLR)

Transcriptomics analysis with RNA seq: an overview Frederik Coppens

Machine Learning. HMM applications in computational biology

Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation

Services Presentation Genomics Experts

The Expanded Illumina Sequencing Portfolio New Sample Prep Solutions and Workflow

Transcriptome analysis

Technologies, resources and tools for the exploitation of the sheep and goat genomes.

DNA sequencing. Course Info

Genome Assembly With Next Generation Sequencers

Genome Assembly. J Fass UCD Genome Center Bioinformatics Core Friday September, 2015

Next-generation sequencing technologies

Modern Epigenomics. Histone Code

Matthew Tinning Australian Genome Research Facility. July 2012

Introduction to Bioinformatics

Next-generation sequencing technologies

A Short Sequence Splicing Method for Genome Assembly Using a Three- Dimensional Mixing-Pool of BAC Clones and High-throughput Technology

State of the art de novo assembly of human genomes from massively parallel sequencing data

Genome Sequence Assembly

DNA. bioinformatics. genomics. personalized. variation NGS. trio. custom. assembly gene. tumor-normal. de novo. structural variation indel.

GENETICS EXAM 3 FALL a) is a technique that allows you to separate nucleic acids (DNA or RNA) by size.

Each cell of a living organism contains chromosomes

Genome 373: Mapping Short Sequence Reads II. Doug Fowler

Lectures 18, 19: Sequence Assembly. Spring 2017 April 13, 18, 2017

* Custom assays developed based on customer requirements. * Rigorous QC process to ensure test performance and accuracy

Genome Sequencing. I: Methods. MMG 835, SPRING 2016 Eukaryotic Molecular Genetics. George I. Mias

Workflow of de novo assembly

Mapping strategies for sequence reads

Sundaram DGA43A19 Page 1. Finishing Drosophila grimshawi Fosmid: DGA43A19 Varun Sundaram 2/16/09

Slide 1. Slide 2. Slide 3

From Sample Preparation to Data Analysis

Analysing genomes and transcriptomes using Illumina sequencing

Using New ThiNGS on Small Things. Shane Byrne

Unfortunately, plate data was not available to generate an initial low coverage

The New Genome Analyzer IIx Delivering more data, faster, and easier than ever before. Jeremy Preston, PhD Marketing Manager, Sequencing

Data Analysis with CASAVA v1.8 and the MiSeq Reporter

14 March, 2016: Introduction to Genomics

Analysis of RNA-seq Data

Analysis of structural variation. Alistair Ward USTAR Center for Genetic Discovery University of Utah

A draft sequence of bread wheat chromosome 7B based on individual MTP BAC sequencing using pair end and mate pair libraries.

Bioinformatic analysis of Illumina sequencing data for comparative genomics Part I

DNA polymorphisms and RNA-Seq alternative splicing blow bubbles in de Bruijn Graphs

DE NOVO WHOLE GENOME ASSEMBLY AND SEQUENCING OF THE SUPERB FAIRYWREN. (Malurus cyaneus) JOSHUA PEÑALBA LEO JOSEPH CRAIG MORITZ ANDREW COCKBURN

Haploid Assembly of Diploid Genomes

Next generation sequencing techniques" Toma Tebaldi Centre for Integrative Biology University of Trento

Finishing of Fosmid 1042D14. Project 1042D14 is a roughly 40 kb segment of Drosophila ananassae

Transcription:

Sequence Assembly and Alignment Jim Noonan Department of Genetics james.noonan@yale.edu www.yale.edu/noonanlab

The assembly problem >>10 9 sequencing reads 36 bp - 1 kb 3 Gb

Outline Basic concepts in genome sequencing and assembly Hierarchical vs. whole-genome shotgun methods Sources of error in assemblies Repeats Polymorphism Sequencing errors Alignment and assembly of next-generation sequencing data Tiling reads onto reference vs. de novo assemblies some methods

Sequence assembly: the basic approach Generate reads Terminology and concepts Find overlapping reads genomic clone: A vector containing an insert of genomic DNA Assemble reads into contigs Join contigs into scaffolds using mate pairs mate pair contig scaffold BAC: 150-200 kb Fosmid: 40 kb Plasmid: 3-5 kb mate pair: reads from two ends of a clone (plasmid, BAC or fosmid) containing an insert physically mapped to the genome; used to order and orient contigs and scaffolds coverage: average number of reads covering a particular position in the assembly Join scaffolds into finished sequence AGTTGTATTATTAGAAACTGAGGGCTAAAAACTGTGCACATACACAGACACACATATTATTTTAATATAGATTTTCAATAATTGGTCTAGGATAAGGATAATATACAG

Hierarchical shotgun sequencing minimum tiling path plasmids: 3-5 kb inserts

Assembling the human genome

Whole genome shotgun sequencing Shear genome into 3-5kb fragments & clone into plasmids from end sequencing of BACs or fosmids

Combined hierarchical - whole genome shotgun

Assembly from individual reads Identify pairs of reads sharing a common sequence (k-mer; k > 20) Extend to full alignment - discard if alignment < 98% identical Create multiple alignment from overlapping reads TACA TAGATTACACAGATTAC T GA TAGT TAGATTACACAGATTAC TAGA TAG TTACACAGATTATTGA TAG TTACACAGATTATTGA build contigs, scaffolds, etc. Issues: repeats sequence errors polymorphism

Repeats Assembly from individual reads: issues a k-mer represented 1,000,000 times results in 1,000,000 2 comparisons remove overrepresented k-mers increase read length = increase k problematic for short read methods Sequencing errors increase coverage TAGATTACACAGATTATTGA TAG-TTACACAGATTACTGA Polymorphism produce consistent high-quality mismatches in one contig or multiple virtually identical contigs TAG-TTACACAGATTATTGA TAG-TTACACAGATTATTGA increase coverage sequence multiple people TAG-TTACACAGATTATTGA TAG-TTACACAGATTATTGA repeats can also cause this

Assembly quality Human draft ~7.5x coverage Mouse draft ~7.7x coverage Assemblers Phrap Celera Arachne N50 length: contig length containing a typical nucleotide, i.e. the maximum length L such that 50% of all bases lie in contigs at least L bases long. designed for Sanger sequencing (read length, errors, quality scores)

Alignment and assembly with short reads Two tasks: Map to reference genome many tools De novo assembly much harder reference-guided assembly (MOSAIK) true de novo assembly (Velvet)

Analysis depends on application Mapping to reference genome useful for interrogating the known genome RNA sequencing ChIP sequencing SNP detection (targeted and whole-genome) methyl-seq CNV detection (sometimes) De novo assembly no genome sequence unbiased ascertainment of variation in known genome by whole-genome reseq

Eland aligner for Illumina data alignment policies: allows up to 2 mismatches/alignment non-unique alignments are discarded Mapping short reads to a reference Maq quality aware - takes seq quality into account allows non-unique alignments Index methods reference genome is loaded into active memory as k-mers very fast alignments SOAP Bowtie SNP detection, paired-end mapping, RNA-seq, ChIP-seq, etc.

Maq dataflow

De novo assembly Sequencing a new genome Resequencing an existing genome Accomodate repeats, polymorphism, sequence errors Reference guided assembly use pairwise alignments to reference genome to guide assembly allows gapped alignments True de novo assembly Velvet: graph-based analysis observed k-mers, rather than pairwise alignment of reads

Velvet assembly process