De novo genome assembly. Dr Torsten Seemann

De novo genome assembly
Dr Torsten Seemann
IMB Winter School - Brisbane
Mon 1 July 2013

Introduction

Ideal world
I would not need to give this talk!
Human DNA -> non-existent USB3 device -> 46 complete haplotype chromosome sequences
AGTCTAGGATTCGCTA TAGATTCAGGCTCTGA TATATTTCGCGGGATT AGCTAGATCGCTATGC TATGATCTAGATCTCG AGATTCGTATAAGTCT AGGATTCGCTATAGAT TCAGGCTCTGATATAT TTCGCGGGATTAGCTA

Real world
Can't sequence full-length native DNA: no instrument exists (yet)
But we can sequence short fragments
  100 at a time (Sanger)
  100,000 at a time (Roche 454)
  1,000,000 at a time (PGM)
  10,000,000 at a time (Proton, MiSeq)
  100,000,000 at a time (HiSeq)

Make a DNA library
DNA preparation depends on the sequencing platform being used
Typical steps
  Shearing: chop DNA into smaller fragments
  Size selection: choose the size range you need
  Adaptor ligation: add special sequence to ends
Now ready to sequence!

Instruments
Platform   Method                       Read length (bp)  Yield  Quality  Value
Illumina   synthesis + fluorescence     250               ++++   +++++    ++++
SOLiD      ligation + fluorescence      75                ++++   +++      +++
PGM        non-term NTP + pH wells      300               ++     +++      +++
Proton     non-term NTP + pH wells      400               +++    ++       +++
Roche 454  non-term NTP + luminescence  600               +      +++      ++
PacBio     synthesis + ZMW              12000             ++     +        ++

Which sequencing platform?
Pick any 3:
  Long reads
  Low cost
  High yield
  High quality

De novo assembly
The process of reconstructing the original DNA sequence from the fragment reads alone.
Instinctively like a jigsaw puzzle
  Find reads which fit together (overlap)
  Could be missing pieces (sequencing bias)
  Some pieces will be dirty (sequencing errors)

An example

A small genome
I'll return them tomorrow!
Friends, Romans, countrymen, lend me your ears;

Shakespearomics
Reads
  ds, Romans, count
  ns, countrymen, le
  Friends, Rom
  send me your ears;
  crymen, lend me
Oops! I dropped them.

Shakespearomics
Reads
  ds, Romans, count
  ns, countrymen, le
  Friends, Rom
  send me your ears;
  crymen, lend me
Overlaps
  Friends, Rom
       ds, Romans, count
               ns, countrymen, le
                       crymen, lend me
                               send me your ears;
I'm good with words.

Shakespearomics
Reads
  ds, Romans, count
  ns, countrymen, le
  Friends, Rom
  send me your ears;
  crymen, lend me
Overlaps
  Friends, Rom
       ds, Romans, count
               ns, countrymen, le
                       crymen, lend me
                               send me your ears;
Majority consensus
  Friends, Romans, countrymen, lend me your ears;
We have a consensus!
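The majority consensus step on the Shakespearomics slide can be sketched in a few lines of Python. This is only an illustration: the offsets below were worked out by hand from the slide's overlap layout and are an assumption, and `majority_consensus` is a hypothetical helper, not part of any assembler. Note how the two "dirty" reads (`crymen`, `send`) are outvoted by the reads that overlap them.

```python
from collections import Counter

def majority_consensus(placed_reads):
    """Column-wise majority vote over reads placed at known offsets."""
    columns = {}
    for offset, read in placed_reads:
        for i, char in enumerate(read):
            columns.setdefault(offset + i, Counter())[char] += 1
    # Positions that no read covers are simply skipped; ties go to the
    # character that was seen first in that column.
    return "".join(columns[pos].most_common(1)[0][0] for pos in sorted(columns))

# (offset, read) pairs; offsets are hand-derived assumptions.
placed = [
    (0, "Friends, Rom"),
    (5, "ds, Romans, count"),
    (13, "ns, countrymen, le"),
    (21, "crymen, lend me"),      # contains an error: 'c' instead of 't'
    (29, "send me your ears;"),   # contains an error: 's' instead of 'l'
]

print(majority_consensus(placed))
```

A real assembler has to discover the offsets itself, which is the hard part; here they are given.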

So far, so good.

The awful truth
"Genome assembly is impossible."
A/Prof. Mihai Pop
World leader in de novo assembly research.
He wears glasses so he must be smart

Methods

Approaches
  greedy assembly
  overlap :: layout :: consensus
  de Bruijn graphs
  string graphs
  seed and extend
All essentially doing the same thing, but taking different short cuts.

Assembly recipe
Find all overlaps between reads
  hmm, sounds like a lot of work
Build a graph
  a picture of read connections
Simplify the graph
  sequencing errors will mess it up a lot
Traverse the graph
  trace a sensible path to produce a consensus

Find read overlaps
If we have N reads of length L
  we have to do ½N(N-1) ~ O(N²) comparisons
  each comparison is an ~ O(L²) alignment
  use special tricks/heuristics to reduce these!
What counts as overlapping?
  minimum overlap length, e.g. 20 bp
  minimum % identity across overlap, e.g. 95%
  choice depends on L and expected error rate
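As a rough sketch of the two overlap criteria above, the Python below counts the read pairs to compare and tests a suffix/prefix overlap against a minimum length and a minimum identity. It is a simplification under stated assumptions: the comparison is ungapped (the slide's O(L²) figure refers to a proper alignment), and `pair_count`, `identity` and `best_overlap` are hypothetical names chosen for this example.

```python
def pair_count(n_reads):
    """Number of unordered read pairs: N(N-1)/2."""
    return n_reads * (n_reads - 1) // 2

def identity(a, b):
    """Fraction of matching positions between two equal-length strings."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)

def best_overlap(left, right, min_len=20, min_identity=0.95):
    """Longest ungapped overlap between the end of `left` and the start of
    `right` that satisfies both thresholds; 0 if there is none."""
    longest = min(len(left), len(right))
    for length in range(longest, max(min_len, 1) - 1, -1):
        if identity(left[-length:], right[:length]) >= min_identity:
            return length
    return 0

print(pair_count(6))
# Thresholds are relaxed here because the toy reads are so short.
print(best_overlap("Friends, Rom", "ds, Romans, count", min_len=5, min_identity=1.0))
print(best_overlap("ns, countrymen, le", "crymen, lend me", min_len=5, min_identity=0.8))
```

Trying every overlap length for every pair is exactly the brute-force cost the slide warns about; real tools index k-mers or suffixes to avoid it.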

N=6 reads, 15 alignment scores
Read#    1    2    3    4    5    6
1        -    -    -    -    -    -
2       80    -    -    -    -    -
3       95   85    -    -    -    -
4        0   30   20    -    -    -
5        0    0   25   70    -    -
6        0   35   25   60   50    -

Graph construction
Each read is a node (vertex); each overlap is an edge (arc)
Thicker lines mean stronger evidence for overlap
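To make the step from the N=6 score table to a graph concrete, here is a minimal Python sketch. The scores are copied from the slide; the cut-off of 50 and the function names are assumptions made for this example only.

```python
# Pairwise alignment scores from the N=6 slide, keyed by (read, read).
scores = {
    (2, 1): 80,
    (3, 1): 95, (3, 2): 85,
    (4, 1): 0, (4, 2): 30, (4, 3): 20,
    (5, 1): 0, (5, 2): 0, (5, 3): 25, (5, 4): 70,
    (6, 1): 0, (6, 2): 35, (6, 3): 25, (6, 4): 60, (6, 5): 50,
}

def build_graph(scores, min_score):
    """Undirected overlap graph: keep an edge only if its score passes the cut-off."""
    graph = {}
    for (a, b), score in scores.items():
        graph.setdefault(a, {})
        graph.setdefault(b, {})
        if score >= min_score:
            graph[a][b] = score   # edge weight = strength of evidence
            graph[b][a] = score
    return graph

def components(graph):
    """Connected components, found with a simple depth-first search."""
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node])
        seen |= group
        result.append(group)
    return result

graph = build_graph(scores, min_score=50)  # 50 is a hypothetical cut-off
print(components(graph))
```

With this cut-off the six reads fall into two groups, reads 1-3 and reads 4-6; a lower cut-off would join them through the weaker overlaps.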

A more realistic graph

What ruins the graph?
Read errors
  introduce false edges and nodes
Non-haploid organisms
  heterozygosity causes lots of detours
Repeats
  if longer than read length
  cause nodes to be shared, locality confusion

Graph simplification
Squash small bubbles
  collapse small errors (or minor heterozygosity)
Remove spurs
  short "dead end" hairs on the graph
Join unambiguously connected nodes
  reliable stretches of unique DNA

Graph traversal
For each unconnected sub-graph
  at least one per replicon in original sample
Find a path which visits each node once
  Hamiltonian path/cycle is NP-hard (this is bad)
  solution will be a set of paths which terminate at decision points
Form consensus sequences from paths
  use all the overlap alignments
  each of these collapsed paths is a contig
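The idea of paths that "terminate at decision points" can be sketched as follows. This is a toy illustration, not how any particular assembler does it: the graph is a hypothetical directed graph given as an edge list, and `unambiguous_paths` is a name invented for this example. Each path starts at a node that is not a simple pass-through and is extended while the next node has exactly one way in and one way out.

```python
from collections import defaultdict

def unambiguous_paths(edges):
    """Split a directed graph into paths that start and end at decision
    points or dead ends. Isolated cycles with no branching are ignored."""
    succ, pred, nodes = defaultdict(list), defaultdict(list), set()
    for a, b in edges:
        succ[a].append(b)
        pred[b].append(a)
        nodes.update((a, b))

    def is_simple(node):
        # Exactly one way in and one way out: no decision to make here.
        return len(pred[node]) == 1 and len(succ[node]) == 1

    paths = []
    for node in sorted(nodes):
        if is_simple(node):
            continue  # simple nodes only ever appear inside a path
        for nxt in succ[node]:
            path = [node, nxt]
            while is_simple(path[-1]):
                path.append(succ[path[-1]][0])
            paths.append(path)
    return paths

# Hypothetical graph with one bubble (C -> D -> F and C -> E -> F).
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E"),
         ("D", "F"), ("E", "F"), ("F", "G")]
for path in unambiguous_paths(edges):
    print(" -> ".join(path))
```

In this example the fork at C and the merge at F are the decision points, so the graph yields four paths rather than one.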

Contigs
Contiguous, unambiguous stretches of assembled DNA sequence
Contig ends correspond to
  Real ends (for linear DNA molecules)
  Dead ends (missing sequence)
  Decision points (forks in the road)

Repeats

What is a repeat?
A segment of DNA which occurs more than once in the genome sequence
Very common
  Transposons (self-replicating genes)
  Satellites (repetitive adjacent patterns)
  Gene duplications (paralogs)

Dot plots Self similarity plot, genome versus itself

Effect on assembly The repeated element is collapsed into a single contig

Repeat mis-assembly
[Figure: three kinds of repeat mis-assembly: collapsed tandem, excision, rearrangement]

The law of repeats
It is impossible to resolve repeats of length S unless you have reads longer than S.
It is impossible to resolve repeats of length S unless you have reads longer than S.

Scaffolding

Beyond contigs
Contig sizes are limited by:
  the length of repeats in your genome
    Can't change this!
  the length (or "span") of the reads
    Wait for new technology
    Use tricks with existing technology

Types of reads
Example fragment
  atcgtatgatcttgagattctctcttcccttatagctgctata
Single-end read
  atcgtatgatcttgagattctctcttcccttatagctgctata
  Sequence one end of the fragment
Paired-end read
  atcgtatgatcttgagattctctcttcccttatagctgctata
  Sequence both ends of same fragment
We can exploit this information!

Scaffolding
Paired-end reads
  known sequences at either end
  roughly known distance between ends
  unknown sequence between ends
Most ends will occur in same contig
  if our contigs are longer than pair distance
Some ends will be in different contigs
  evidence that these contigs are linked!
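The linking evidence described above can be tallied with a short Python sketch. It assumes the reads have already been mapped back to contigs, so each pair is reduced to the two contig names its ends landed in; the contig names, the `min_pairs` threshold and the `contig_links` function are all hypothetical.

```python
from collections import Counter

def contig_links(pair_hits, min_pairs=2):
    """Count read pairs whose two ends fall in different contigs and keep
    the contig pairs supported by at least `min_pairs` read pairs."""
    links = Counter()
    for contig_a, contig_b in pair_hits:
        if contig_a != contig_b:  # ends in the same contig carry no link
            links[tuple(sorted((contig_a, contig_b)))] += 1
    return {pair: n for pair, n in links.items() if n >= min_pairs}

# One tuple per read pair: (contig of end 1, contig of end 2).
hits = [("c1", "c1"), ("c1", "c2"), ("c2", "c1"),
        ("c2", "c3"), ("c1", "c2"), ("c3", "c3")]

print(contig_links(hits))
```

Requiring more than one supporting pair is a common-sense guard against chimeric or mis-mapped pairs; a real scaffolder would also use the known pair distance and read orientation to order the contigs and size the gaps.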

Contigs to scaffolds
[Figure: paired-end reads joining contigs into a scaffold, with gaps between the contigs]

Assumptions

What can we assemble?
Genomes
  A single organism, e.g. its chromosomal DNA
Meta-genomes
  Genomic DNA from a mixture of organisms
Transcriptomes
  A single organism's RNA, inc. mRNA, ncRNA
Meta-transcriptomes
  RNA from a mixture of organisms

Genomes
Expect uniformity
  Each part of genome represented by roughly equal number of reads
Average depth of coverage
  Genome: 4 Mbp
  Yield: 4 million x 50 bp reads = 200 Mbp
  Coverage: 200 ÷ 4 = 50x (reads per bp)
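The coverage arithmetic on this slide is easy to wrap in two small Python helpers. The function names are made up for this example; the numbers are the slide's own.

```python
import math

def depth_of_coverage(n_reads, read_len_bp, genome_bp):
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_len_bp / genome_bp

def reads_needed(target_depth, read_len_bp, genome_bp):
    """Inverse question: how many reads give the target average depth?"""
    return math.ceil(target_depth * genome_bp / read_len_bp)

# Slide example: 4 Mbp genome, 4 million reads of 50 bp.
print(depth_of_coverage(4_000_000, 50, 4_000_000))
print(reads_needed(50, 50, 4_000_000))
```

This is an average only; the uniformity assumption above is what lets you treat it as the expected depth at any one position.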

Meta-genomes
Expect proportionality & uniformity
  Each genome represented by a proportion of reads similar to its proportion in the mixture
Example
  Mix of 3 species: ¼ Staph, ¼ Clost, ½ E. coli
  Say we get 4M reads
  Then we expect about: 1M from Staph, 1M from Clost, 2M from E. coli
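The proportionality expectation can be written out as a tiny Python sketch. It assumes, as the slide does, that each species' share of reads matches its share of the mixture; `expected_reads` is a hypothetical helper.

```python
from fractions import Fraction

def expected_reads(total_reads, proportions):
    """Expected read count per species, assuming reads are drawn in
    proportion to each species' share of the mixture."""
    return {name: int(total_reads * share) for name, share in proportions.items()}

# Slide example: quarter Staph, quarter Clost, half E. coli, 4M reads in total.
mix = {"Staph": Fraction(1, 4), "Clost": Fraction(1, 4), "Ecoli": Fraction(1, 2)}
print(expected_reads(4_000_000, mix))
```

Exact fractions are used so that the shares add up to one without floating-point rounding.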

Meta-genome issues
Closely related species
  will have very similar reads
  lots of shared nodes in the graph
Conserved sequence
  bits of DNA common to lots of organisms
  hub nodes in the graph
Untangling is difficult
  need longer reads

Assessment

Assessing assemblies
We desire
  Total length similar to genome size
  Fewer, larger contigs
  Correct contigs
Metrics
  No generally useful objective measure
  Longest contig, total bp, N50,

The N50
The length of the contig such that it and all shorter contigs together contain at least 50% of the bases
Imagine we got 7 contigs with lengths:
  1, 1, 3, 5, 8, 12, 20
Total
  1+1+3+5+8+12+20 = 50
The halfway sum is 25
  1+1+3+5+8 = 18 (< 25)
  1+1+3+5+8+12 = 30 (≥ 25), so N50 is 12

N50 concerns
Optimizing for N50 encourages mis-assemblies!
An aggressive assembler may over-join:
  1, 1, 3, 5, 8, 12, 20 (previous)
  1, 1, 3, 5, 20, 20 (now)
  1+1+3+5+20+20 = 50 (unchanged)
The halfway sum is still 25
  1+1+3+5+20 = 30 (≥ 25), so N50 is 20 (was 12)
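Both N50 examples can be checked with a few lines of Python. One caveat, stated as an assumption: the sketch uses the more commonly seen formulation, which walks from the longest contig down until half of the total bases are covered, whereas the slides count up from the shortest contig. The two give the same answer on both examples here, but they can differ on other inputs.

```python
def n50(lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downwards, first reaches half of all assembled bases."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")

print(n50([1, 1, 3, 5, 8, 12, 20]))  # the original assembly
print(n50([1, 1, 3, 5, 20, 20]))     # after over-joining 8 and 12
```

The second call shows the concern on the slide: joining two contigs, rightly or wrongly, raises N50 without adding a single base of information.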

Validation
Self consistency
  Align reads back to contigs
  Check for errors or discordant pairs
Second opinion
  Use two complementary sequencing methods
  Target troublesome areas for PCR
  Use a genome-wide optical map

How do I do it?

Example
Culture your bacterium
Extract your genomic DNA
Send it to AGRF for Illumina sequencing
  100 bp paired end
Get back two files:
  MRSA_R1.fastq.gz
  MRSA_R2.fastq.gz
Now what?

Assembly tools
Genome
  Velvet, Abyss, Mira, Newbler, SGA, AllPaths, Ray, SOAPdenovo, Spades, Masurca,
Meta-genome
  MetaVelvet, SGA, custom scripts + above
Transcriptome
  Trans-Abyss, Oases, Trinity
Meta-Transcriptome
  custom scripts + above

Online tutorial
The GVL
  Genomics Virtual Laboratory
  http://genome.edu.au
Protocols
  Microbial de novo assembly for Illumina data
  Written by Simon Gladman (VBC/LSCC)
  https://genome.edu.au/wiki/protocols

Velvet: hash reads
  velveth MyFolder 71 -shortpaired -fastq.gz -separate MRSA_R1.fastq.gz MRSA_R2.fastq.gz
K-mer size: 71
Read type: -shortpaired
Read files: MRSA_R1.fastq.gz MRSA_R2.fastq.gz

Velvet: assembly
  velvetg MyFolder -exp_cov auto -cov_cutoff auto
Signal level: -exp_cov
Noise level: -cov_cutoff

Velvet: examine results
  less MyFolder/contigs.fa
>NODE_1_length_43211_cov_27.36569
AGTCGATGCTTAGAGAGTATGACCTTCTATACAAAA
ATCTTATATTAGCGCTAGTCTGATAGCTCCCTAGAT
CTGATCTGATATGATCTTAGAGTATCGGCTATTGCT
AGTCTCGCGTATAATAAATAATATATTTTTCTAATG
ATCTTATATTAGCGCTAGTCTGATAGCTCCCTAGAT
CTGATCTGATATGATCTTAGAGTATCGGCTATTGCT
AGTCTCGCGTATAATAAATAATATATTTAGTAGTCT
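The contig header shown above packs three numbers into the name, and a short Python sketch can pull them out. The pattern is inferred from the single header on the slide, so treat it as an assumption rather than a specification; `parse_velvet_header` is a hypothetical helper. Also worth checking against the Velvet manual for your version: it describes the `length` field as counted in k-mers rather than base pairs.

```python
import re

# Layout inferred from the example header: >NODE_<id>_length_<n>_cov_<x>
HEADER = re.compile(r"^>NODE_(\d+)_length_(\d+)_cov_([\d.]+)$")

def parse_velvet_header(line):
    """Return (node id, length, coverage) from one contigs.fa header line."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {line!r}")
    node, length, coverage = match.groups()
    return int(node), int(length), float(coverage)

print(parse_velvet_header(">NODE_1_length_43211_cov_27.36569"))
```

Looping this over every header line gives the list of contig lengths needed for the summary metrics discussed earlier.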

Velvet: GUI
Velvet Assembler Graphical User Environment
  Where to save
  Add your reads
  Click run

Contact
Email
  torsten.seemann@monash.edu
Blog
  TheGenomeFactory.blogspot.com
Web
  vicbioinformatics.com
  vlsci.org.au/lscc