Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es

Similar documents
Introduction to metagenome assembly. Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014

Sequence Assembly and Alignment. Jim Noonan Department of Genetics

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Alignment and Assembly

De novo genome assembly with next generation sequencing data!! "

De novo whole genome assembly

TruSPAdes: analysis of variations using TruSeq Synthetic Long Reads (TSLR)

De novo assembly of human genomes with massively parallel short read sequencing. Mikk Eelmets Journal Club

BIOINFORMATICS ORIGINAL PAPER

Lecture 14: DNA Sequencing

De novo whole genome assembly

De Novo Assembly of High-throughput Short Read Sequences

10/20/2009 Comp 590/Comp Fall

De novo Genome Assembly

De novo whole genome assembly

Genome Assembly. J Fass UCD Genome Center Bioinformatics Core Friday September, 2015

Outline. The types of Illumina data Methods of assembly Repeats Selecting k-mer size Assembly Tools Assembly Diagnostics Assembly Polishing

Mate-pair library data improves genome assembly

The Basics of Understanding Whole Genome Next Generation Sequence Data

Genome Projects. Part III. Assembly and sequencing of human genomes

Biol 478/595 Intro to Bioinformatics

NGS developments in tomato genome sequencing

The Diploid Genome Sequence of an Individual Human

Contact us for more information and a quotation

DNA Sequencing and Assembly

CSE182-L16. LW statistics/assembly

A shotgun introduction to sequence assembly (with Velvet) MCB Brem, Eisen and Pachter

Consensus Ensemble Approaches Improve De Novo Transcriptome Assemblies

Assembly of Ariolimax dolichophallus using SOAPdenovo2

Mapping strategies for sequence reads

Course summary. Today. PCR Polymerase chain reaction. Obtaining molecular data. Sequencing. DNA sequencing. Genome Projects.

Workflow of de novo assembly

Genome Assembly Software for Different Technology Platforms. PacBio Canu Falcon. Illumina Soap Denovo Discovar Platinus MaSuRCA.

De Novo and Hybrid Assembly

Bioinformatic analysis of Illumina sequencing data for comparative genomics Part I

Finishing of Fosmid 1042D14. Project 1042D14 is a roughly 40 kb segment of Drosophila ananassae

Assembly and Validation of Large Genomes from Short Reads Michael Schatz. March 16, 2011 Genome Assembly Workshop / Genome 10k

State of the art de novo assembly of human genomes from massively parallel sequencing data

Genomic resources. for non-model systems

ChIP-seq and RNA-seq. Farhat Habib

Next Generation Sequences & Chloroplast Assembly. 8 June, 2012 Jongsun Park

short read genome assembly Sorin Istrail CSCI1820 Short-read genome assembly algorithms 3/6/2014

NEXT GENERATION SEQUENCING. Farhat Habib

Data Basics. Josef K Vogt Slides by: Simon Rasmussen Next Generation Sequencing Analysis

AMOS Assembly Validation and Visualization

We begin with a high-level overview of sequencing. There are three stages in this process.

Gap Filling for a Human MHC Haplotype Sequence

ChIP-seq and RNA-seq

Transcriptome analysis

PERGA: A Paired-End Read Guided De Novo Assembler for Extending Contigs Using SVM and Look Ahead Approach

Greene 1. Finishing of DEUG The entire genome of Drosophila eugracilis has recently been sequenced using Roche

De novo meta-assembly of ultra-deep sequencing data

Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation

DE NOVO WHOLE GENOME ASSEMBLY AND SEQUENCING OF THE SUPERB FAIRYWREN. (Malurus cyaneus) JOSHUA PEÑALBA LEO JOSEPH CRAIG MORITZ ANDREW COCKBURN

Analysis of structural variation. Alistair Ward USTAR Center for Genetic Discovery University of Utah

COPE: An accurate k-mer based pair-end reads connection tool to facilitate genome assembly

Bionano Access : Assembly Report Guidelines

Lecture 10 : Whole genome sequencing and analysis. Introduction to Computational Biology Teresa Przytycka, PhD

Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Supplemental Materials

Genome Sequencing-- Strategies

Genomics and Transcriptomics of Spirodela polyrhiza

De novo assembly in RNA-seq analysis.

Experimental Design. Sequencing. Data Quality Control. Read mapping. Differential Expression analysis

ABSTRACT. Genome Assembly:

Analysis of structural variation. Alistair Ward - Boston College

Why are we here? Introduction

Supplementary Data for Hybrid error correction and de novo assembly of single-molecule sequencing reads

Each cell of a living organism contains chromosomes

De novo genome assembly. Dr Torsten Seemann

Genome Assembly Using de Bruijn Graphs. Biostatistics 666

Next Genera*on Sequencing II: Personal Genomics. Jim Noonan Department of Gene*cs

A Short Sequence Splicing Method for Genome Assembly Using a Three- Dimensional Mixing-Pool of BAC Clones and High-throughput Technology

Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Supplementary Material

Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory

A Roadmap to the De-novo Assembly of the Banana Slug Genome

Human Genome Sequencing Over the Decades The capacity to sequence all 3.2 billion bases of the human genome (at 30X coverage) has increased

Finishing Drosophila ananassae Fosmid 2410F24

High-Throughput Bioinformatics: Re-sequencing and de novo assembly. Elena Czeizler

Finishing Drosophila Ananassae Fosmid 2728G16

Direct determination of diploid genome sequences. Supplemental material: contents

SUPPLEMENTARY MATERIAL FOR THE PAPER: RASCAF: IMPROVING GENOME ASSEMBLY WITH RNA-SEQ DATA

Genome Sequencing and Assembly

Genome Assembly. Microbial Single Cell Genomics Workshop 2010 Sergey Koren JCVI, CBCB at UMD

de novo paired-end short reads assembly

Next Gen Sequencing. Expansion of sequencing technology. Contents

Purpose of sequence assembly

Finished (Almost) Sequence of Drosophila littoralis Chromosome 4 Fosmid Clone XAAA73. Seth Bloom Biology 4342 March 7, 2004

The Irys System. Rapid Genome Wide Mapping for de novo Assembly and Structural Variation Analysis. Jack Peart, Ph.D. Director of Sales EMEA

Bionano Solve Theory of Operation: Variant Annotation Pipeline

The Human Genome and its upcoming Dynamics

ARACHNE: A Whole-Genome Shotgun Assembler

Understanding Accuracy in SMRT Sequencing

Conditional Random Fields, DNA Sequencing. Armin Pourshafeie. February 10, 2015

3) This diagram represents: (Indicate all correct answers)

GENETICS - CLUTCH CH.15 GENOMES AND GENOMICS.

BroadE Workshop: Genome Assembly. March 20 th, 2013

SUPPLEMENTARY INFORMATION

Genome annotation & EST

Transcription:

Sequence assembly Jose Blanca COMAV institute bioinf.comav.upv.es

Sequencing project Unknown sequence { experimental evidence result read 1 read 4 read 2 read 5 read 3 read 6 read 7

Computational requirements mykl_roventine@flickr Main requirements: Memory. Time. The kmer-based assemblers requirements depend on the genome complexity and not so much on the reads.

Read cleaning Some assemblers choke with: bad quality stretches, adapters, low complexity regions, contamination. Bad quality vector Trim Unknown sequence Trim or mask

The assembly problem Only one read is usually not capable of producing the complete sequence. Even if the problem sequence is short one read might have sequencing errors. repeats Unknown sequence Reads contig5 (repeat) Contigs / scaffolds Reasons for the gaps contig1 Repeated region contig2 contig3 no coverage contig4

Long reads

Read size influence

Mate pairs and paired-ends Read length is critical, but constrained by sequencing technology, we can sequence molecule ends. Clone (template) reads Useful for dealing with repetitive genomic DNA and with complex transcriptome structure. Paired-ends: Illumina can sequence from both ends of the molecules. (150-500 bp) Mate-pairs: Can be generated from libraries with different lengths (2-20 Kb). BAC-ends: Usually sequenced with Sanger.

Mate-pairs

Library types Single reads Illumina Pair Ends (150-500 pb) Mate Pairs (2-10 kb) Illumina

Illumina mate-pair chimeras

Long reads Third-generation sequencing and the future of genomics. Lee et al.

Short reads are harder to assemble Taken from Jason Miller

The need for paired reads Taken from Jason Miller

Algorithm I Overlap Layout - Consensus

Overlap Layout - Consensus Algorithm: All-against-all, pair-wise read comparison. Construction of an overlap graph with approximate read layout. Multiple sequence alignment determines the precise layout and the consensus sequence. clean reads Overlaps detection All vs all alignment Contigs Consensus

Overlap problems Two reads are similar if they have a good overlap. We assume that overlap implies common origin in the genome. This goodness depends on the overlap quality and similarity. But several things might go wrong repetitive Too many sequencing errors Chimeric clone Short overlap false negative false positive Vector contamination or repetitive

Overlap problems False positives will be induced by chance and repeats Avoid false positives with stringent criteria: Overlaps must be long enough Sequence similarity must be high (identity threshold) Overlaps must reach the ends of both reads Ignore high-frequency overlaps (repetitive regions in genomic?) But stringency will induce false negatives.

Assembly terminology Consensus Scaffold = Contigs + Parings (gap lengths derived from pairings) Contigs = Reads + Overlaps Overlaps Reads & Pairing Taken from Jason Miller ACTGATTAGCTGACGNNNNNNNNNNTCGTATCTAGCATT

Unitigs Contigs: High-confidence overlaps Maximal contigs with no contradiction (or almost no contradiction) in the data Ungapped Contigs capture the unique stretches of the genome. Usually where unitigs end, repeats begin Contig example. The ideal Contig is A, B, C. C, D and E have a repetitive sequence.

Scaffolds Build with mate pairs Mate pairs help resolve ambiguity in overlap patterns Scaffolds Every contig is a scaffold Every scaffold contains one or more contigs Scaffolds can contain sequence gaps of any size (including negative)

Overlap-Consensus-Layout limitation 30 Million 5Kb (long) reads Number of alignments: 30 M reads x 30 M reads = 900 trillion alignments It does not scale

Algorithm II kmer-based

K-mer based assembly K-mer graphs are de Bruijn graph. Nodes represent all K-mers. Edges represent fixed-length overlaps between consecutive K-mers. Assembly is reduced to a graph reduction problem. Read 1 Read 2 AACCGG CCGGTT Graph: AACC -> ACCG -> CCGG -> CGGT -> GGTT K-mer length = 4

K-mer based assembly

K-mer bubbles

K-mer repetitive

Graph problems Repeats, sequencing errors and polymorphisms increase graph complexity, leading to tangles difficult to resolve K-mer graphs are more sensitive to repeats and sequencing errors than overlap based methods. Optimal graph reductions algorithms are NP-hard, so assemblers use heuristic algorithms Polymorphism Sequencing error Repeat

K-mer analysis 21-mer distribution of sea urchin (Strongylocentrotus purpuratus) genome (900 Mbase long) unique repetitive

K-mer analysis Genome K-mer distribution Simulated reads. 50X coverage No errors reads unique Repetitive (duplication)

K-mer analysis Genome K-mer distribution real reads. 50X coverage With errors Real reads (with errors) Simulated reads (no error) noise signal

Final remarks

N50 N50 is defined as the contig length such that using equal or longer contigs produces half the bases of the genome. The N50 statistic is a measure of the average length of a set of sequences, with greater weight given to longer sequences. It is widely used in genome assembly N50: 10829 [ [ [ [ [ [ [ [ [ [ [ [ [ [ 100 7414 14728 22042 29356 36670 43984 51298 58612 65926 73240 80554 87868 95182 N95: 20, 7414[, 14728[, 22042[, 29356[, 36670[, 43984[, 51298[, 58612[, 65926[, 73240[, 80554[, 87868[, 95182[, 146380[ Taken from Wikipedia minimum: 100 maximum: 146,370 average: 1618.01 num. seqs.: 141503 (131545): ************************************************************************ ( 6247): *** ( 2353): * ( 842): ( 324): ( 118): ( 42): ( 13): ( 10): ( 4): ( 0): ( 1): ( 0): ( 4):

Common assembly errors Collapsed repeats Align reads from distinct (polymorphic) repeat copies Fewer repeats in assembly than in genome Fewer tandem repeat copies in assembly than genome Missed joins Missed overlaps due to sequencing error Contradictory evidence from overlaps Contradictory or insufficient evidence from pairs Chimera Enter repeat at copy 1 but exit repeat at copy 2 Assembly joins unrelated sequences

Ingredients for Good Assembly Coverage and read length 20X for (corrected) PacBio, 150X for Illumina Paired reads Read lengths long enough to place uniquely Inserts longer than long repeats Pair density sufficient to traverse repeat clusters Tight insert size variance Diversity of insert sizes Read Quality: Sequencing errors No vector contamination Contamination: mitochondrial or chloroplastic The requirements depend greatly on the quality: it is not the same a draft that high quality genome.

Common assemblers Genomic SOAPdenovo, Illumina CABOG: Celera assembler, 454 and few Illumina Staden for Sanger reads. Transcriptomic assemblers are specialized software

From scafflods to chromosomes

Genetic maps

Mapping platforms Third-generation sequencing and the future of genomics. Lee et al.

Bionano

Bionano Bionano helps you solve long range structures (although not chromosome wide assemblies) 1 2

Bionano

Bionano

10X linked reads Reads generated from the same long DNA sequence are tagged with the same index and sequenced using Illumina Reads with the same index convey medium range information (Up to 100Kb)

10X GemCode

10X linked reads assembly

10x possible uses De novo genome assembly Haplotype phasing Structural Variants: Linked read sequencing resolves complex genomic rearrangements in gastric cancer metastases (https://doi.org/10.1186/s13073-017-0447-8) Integrative analysis of genomic alterations in triple-negative breast cancer in association with homologous recombination deficiency. (DOI: 10.1371/journal.pgen.1006853)

Tomato assembly example PacBio 10X PacBio-Bionano 10X-Bionano N75 1.789.934 813.655 16.105.568 6.758.801 L75 141 288 18 40 N s per 100 Kb 0 6071 2325 14885 Cost $$$ $ $$$$ $$

Hi-C Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome. (DOI: 10.1126/science.1181369)

Hi-C

Hi-C Assembly De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. (DOI: 10.1126/science.aal3327) Hi-C data provide links across a variety of length scale Unlike paired-end reads Hi-C contact spans an unknown length and may connect loci on different chromosomes Human genome assembly: Pair-end Illumina reads (67X coverage) Hi-C data (6.7 coverage) 23 scaffold that span the 99.5% of the 23 human chromosomes Errors remain in the ordering of short distances than could be fixed by matepairs, long reads or bionano

This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA. Jose Blanca COMAV institute bioinf.comav.upv.es