De novo sequence assembly

2015.11.17 De novo sequence assembly. Paul Wei-Che HSU (徐唯哲), Assistant Research Specialist, Bioinformatics Service Core, Institute of Molecular Biology, Academia Sinica, Taiwan, R.O.C.

De novo sequence assembly: genome assembly, transcriptome assembly, metagenome assembly

De novo genome assembly. Unknown genome -> shotgun sequencing (DNA is sheared into random fragments, which are sequenced as reads or tags) -> assembly

Shortest common superstring (SCS)
Given a collection of strings S, find SCS(S): the shortest string that contains all strings in S as substrings.
Example: S = BAA, AAB, BBA, ABA, ABB, BBB, AAA, BAB
Concatenation (without the shortest requirement): BAAAABBBAABAABBBBBAAABAB (length 24)
SCS(S): AAABBBABAA (length 10), which contains AAA, AAB, ABB, BBB, BBA, BAB, ABA, BAA in that order
Finding overlaps (Ben Langmead, http://www.langmead-lab.org/teaching-materials/)
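The definition above can be checked on the slide's example with a brute-force search. This is only a sketch for tiny inputs (it tries every ordering, so it is factorial in the number of strings); the function names `overlap` and `brute_force_scs` are illustrative and do not come from the slides.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def brute_force_scs(strings):
    """Try every ordering, merge neighbours by maximal overlap, keep the shortest."""
    shortest = None
    for order in permutations(strings):
        superstring = order[0]
        for prev, nxt in zip(order, order[1:]):
            superstring += nxt[overlap(prev, nxt):]
        if shortest is None or len(superstring) < len(shortest):
            shortest = superstring
    return shortest


S = ["BAA", "AAB", "BBA", "ABA", "ABB", "BBB", "AAA", "BAB"]
result = brute_force_scs(S)
print(len("".join(S)), len(result))  # plain concatenation vs. shortest superstring
```

Several different superstrings of the minimum length can exist, so the sketch reports a length rather than one specific string.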

Finding overlaps: semiglobal alignment, exact string matching, suffix tree

Semiglobal alignment: Needleman-Wunsch algorithm (dynamic programming). Initialize the first row to 0s. The answer is the maximum score in the bottom row. Traceback starts from the maximum score and continues until it falls off the top side. Example: ACTG and CTG
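A minimal sketch of this dynamic program, using the slide's pair ACTG and CTG. The scoring scheme (match +1, mismatch -1, gap -1) and the choice to put the shorter string on the rows are assumptions made for illustration; the slide does not specify them.

```python
def semiglobal(pattern, text, match=1, mismatch=-1, gap=-1):
    """Align all of `pattern` against any stretch of `text`.

    Returns (score, start, end) so that text[start:end] is the aligned stretch.
    """
    rows, cols = len(pattern) + 1, len(text) + 1
    D = [[0] * cols for _ in range(rows)]  # first row stays 0: free start in text
    for i in range(1, rows):
        D[i][0] = D[i - 1][0] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = D[i - 1][j - 1] + (match if pattern[i - 1] == text[j - 1] else mismatch)
            D[i][j] = max(diag, D[i - 1][j] + gap, D[i][j - 1] + gap)

    # The answer is the maximum score in the bottom row.
    end = max(range(cols), key=lambda j: D[rows - 1][j])
    score = D[rows - 1][end]

    # Trace back from that cell until we fall off the top row.
    i, j = rows - 1, end
    while i > 0:
        if j > 0 and D[i][j] == D[i - 1][j - 1] + (
            match if pattern[i - 1] == text[j - 1] else mismatch
        ):
            i, j = i - 1, j - 1
        elif D[i][j] == D[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return score, j, end


score, start, end = semiglobal("CTG", "ACTG")
print(score, "ACTG"[start:end])
```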

Exact string matching (L = 3)

Suffix tree: generalized suffix tree for GACATA and ATAGAC, built from the concatenation GACATA$0ATAGAC$1 (positions 0 to 13). [Figure: suffix tree whose leaves are labeled with suffix start positions]

String overlap algorithm: greedy-extension algorithm
Finding overlaps: identify the overlapping regions and select the highest-scoring overlap.
Merging: merge the overlapping sequences.
Identify overlaps again, then merge again.
Repeat until no sequences can be merged anymore.

Greedy-extension algorithm (string-based assemblers)
SSAKE (2007), SHARCGS (2007) and QSRA (2009) are applicable to the Illumina platform.
More time-consuming; suitable for a small number of reads (low throughput) and smaller genomes.
The greedy algorithm is not guaranteed to choose the overlaps that yield the SCS, but it is a good approximation.

Shortest common superstring using the greedy-extension algorithm: Greedy-SCS in action
Rounds of merging, one merge per line. The number in the first column is the length of the overlap merged before that round. (In the original slide, the strings that get merged before the next round are shown in red.)
Input strings: ABA ABB AAA AAB BBB BBA BAB BAA
2: BAAB ABA ABB AAA BBB BBA BAB
2: BABB BAAB ABA AAA BBB BBA
2: BBAAB BABB ABA AAA BBB
2: BBBAAB BABB ABA AAA
2: BBBAABA BABB AAA
2: BABBBAABA AAA
1: BABBBAABAAA
Superstring (greedy answer): BABBBAABAAA
Actual SCS: AAABBBABAA
(Ben Langmead, http://www.langmead-lab.org/teaching-materials/)
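The merging rounds can be sketched in a few lines. This is an illustrative implementation, not the code of any assembler named in the slides; ties between equally long overlaps are broken by whichever pair is seen first, so the exact superstring may differ from the one on the slide while still showing the same behaviour.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_scs(strings):
    """Repeatedly merge the pair with the longest overlap until one string is left."""
    strings = list(strings)
    while len(strings) > 1:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(strings, 2):
            n = overlap(a, b)
            if n > best_len:
                best_len, best_a, best_b = n, a, b
        if best_len == 0:
            # Nothing overlaps anymore: concatenate what is left.
            return "".join(strings)
        strings.remove(best_a)
        strings.remove(best_b)
        strings.append(best_a + best_b[best_len:])
    return strings[0]


S = ["ABA", "ABB", "AAA", "AAB", "BBB", "BBA", "BAB", "BAA"]
superstring = greedy_scs(S)
print(superstring, len(superstring))
```

The result always contains every input string, but its length can exceed the true SCS length of 10, which is the point of the slide.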

Graph-based assemblers
High speed; suitable for a large number of reads (high throughput) and bigger genomes.
Overlap-layout-consensus (OLC): Newbler (2006, 454 platform), Forge (2009, 454 + Illumina)
de Bruijn graph assembly (dbg): Velvet (2008), CLCbio (2009), ABySS (2009) and SOAPdenovo (2010) are applicable to the Illumina platform

Overlap-layout-consensus (OLC)
Software: Newbler (454 platform), SGA
Overlap: 1. Find overlaps. 2. Build the overlap graph.
Layout: bundle stretches of the overlap graph into contigs.
Consensus: pick the most likely nucleotide sequence for each contig.

Finding overlaps: semiglobal alignment, exact string matching, suffix tree

Build overlap graph: determine the overlap relationships between all reads, then draw them as a graph (nodes are reads, edges are overlapping sequences).
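As a sketch of this step, the function below compares every ordered pair of reads and records suffix-prefix overlaps of at least `min_length` bases. The three reads are hypothetical 7-character windows cut from the start of the example genome used in the layout slides, and the all-against-all loop is the simplest possible approach rather than what production assemblers do.

```python
def overlap_graph(reads, min_length):
    """Return {read: {other_read: overlap_length}} for suffix-prefix overlaps."""
    graph = {read: {} for read in reads}
    for a in reads:
        for b in reads:
            if a == b:
                continue
            # Try the longest possible overlap first and stop at the first hit.
            for n in range(min(len(a), len(b)), min_length - 1, -1):
                if a.endswith(b[:n]):
                    graph[a][b] = n
                    break
    return graph


reads = ["to_ever", "every_t", "y_thing"]  # hypothetical reads
graph = overlap_graph(reads, min_length=3)
for read, edges in graph.items():
    print(read, "->", edges)
```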

Layout

Layout: Hamiltonian path. A Hamiltonian path is a path between two vertices of a graph that visits each vertex exactly once. If the graph also has an edge from the last vertex of the path back to the first, the path closes into a Hamiltonian circuit. [Figure: example graph with vertices A to I]
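A brute-force sketch of the layout idea: search for an ordering of the reads in which every consecutive pair is joined by an overlap edge, then spell out the sequence along that path. The small overlap graph is a hypothetical example, and trying all orderings is only feasible for a handful of reads, which is exactly why this formulation does not scale.

```python
from itertools import permutations


def hamiltonian_path(graph):
    """Return an ordering that visits every vertex exactly once, or None."""
    for order in permutations(graph):
        if all(b in graph[a] for a, b in zip(order, order[1:])):
            return list(order)
    return None


def spell(path, graph):
    """Merge the reads along the path, dropping the overlapping prefix each time."""
    sequence = path[0]
    for a, b in zip(path, path[1:]):
        sequence += b[graph[a][b]:]
    return sequence


# Hypothetical overlap graph: {read: {next_read: overlap_length}}
graph = {
    "y_thing": {},
    "to_ever": {"every_t": 4},
    "every_t": {"y_thing": 3},
}
path = hamiltonian_path(graph)
print(path, spell(path, graph))
```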

Layout Genome: to_every_thing_turn_turn_turn_there_is_a_season (Ben Langmead, http://www.langmead-lab.org/teaching-materials/)

Consensus: pick the most likely nucleotide sequence for each contig. Where reads disagree, is it a deletion, an insertion, a sequencing error, or a SNP? (Ben Langmead, http://www.langmead-lab.org/teaching-materials/)
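One simple way to pick the most likely base is a per-column majority vote over reads that are already laid out against each other. The sketch below assumes the layout is given as equal-width rows padded with spaces; the reads are hypothetical, and real consensus callers also weigh base qualities and handle indels, which this does not.

```python
from collections import Counter


def consensus(rows):
    """Majority vote per column; spaces mean the read does not cover that column."""
    width = max(len(row) for row in rows)
    called = []
    for col in range(width):
        votes = Counter(
            row[col] for row in rows if col < len(row) and row[col] != " "
        )
        if votes:
            called.append(votes.most_common(1)[0][0])
    return "".join(called)


layout = [
    "ACGTTGCA    ",
    "  GTTGCATT  ",
    "  GTAGCATT  ",  # hypothetical read with one mismatch (A instead of T)
    "    TGCATTGA",
]
print(consensus(layout))
```

With three reads voting T against one voting A, the mismatch is treated as a sequencing error; with even votes it could just as well be a SNP, which is the ambiguity the slide points at.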

Limitation of OLC: data sets of more than about a million reads cannot be handled efficiently.

A more efficient way? Indexing vs. one-to-one comparison

Use k-mer sequences instead of reads: true genome (which you never know) -> reads -> k-mer sequences. Break the reads into smaller k-mer sequences, then perform de Bruijn graph assembly (dbg).

de Bruijn graph assembly (dbg): Velvet (2008), CLCbio (2009), ABySS (2009), SOAPdenovo (2010)
Step 1: replace each read with all of its substrings of length k (k-mers).
Example with k = 3: the read AGATGATTCG has the 3-mers AGA, GAT, ATG, TGA, GAT, ATT, TTC, TCG.
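Step 1 is a sliding window over the read. A minimal sketch (the function name `kmers` is illustrative):

```python
def kmers(read, k):
    """All substrings of length k, in order, including repeats."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


print(kmers("AGATGATTCG", 3))
```

A read of length n yields n - k + 1 k-mers, so the 10-base read above gives 8 3-mers, with GAT appearing twice.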

de Bruijn graph assembly (dbg): Velvet (2008), CLCbio (2009), ABySS (2009), SOAPdenovo (2010)
Step 2: use (k-1)-mers as vertices and k-mers as edges, then draw the graph. Each distinct (k-1)-mer appears only once in the graph.
Read: AGATGATTCG
k-mers: AGA, GAT, ATG, TGA, GAT, ATT, TTC, TCG
(k-1)-mer pairs: AG->GA, GA->AT, AT->TG, TG->GA, GA->AT, AT->TT, TT->TC, TC->CG
[Figure: de Bruijn graph with vertices AG, GA, AT, TG, TT, TC, CG]
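Step 2 can be sketched by turning each k-mer into one directed edge from its left (k-1)-mer to its right (k-1)-mer. Storing the edges in a `Counter` is a choice made here so that repeated k-mers keep their multiplicity; it is not taken from any of the tools listed.

```python
from collections import Counter


def de_bruijn_graph(kmers):
    """Return (sorted vertex list, Counter of directed edges with multiplicity)."""
    edges = Counter((kmer[:-1], kmer[1:]) for kmer in kmers)
    nodes = sorted({node for edge in edges for node in edge})
    return nodes, edges


kmer_list = ["AGA", "GAT", "ATG", "TGA", "GAT", "ATT", "TTC", "TCG"]
nodes, edges = de_bruijn_graph(kmer_list)
print(nodes)
for (left, right), count in edges.items():
    print(left, "->", right, "x", count)
```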

de Bruijn graph assembly (dbg): Velvet (2008), CLCbio (2009), ABySS (2009), SOAPdenovo (2010)
Step 3: find an Eulerian walk, that is, a walk through the directed graph that traverses each edge exactly once.
Read: AGATGATTCG
k-mers: AGA, GAT, ATG, TGA, GAT, ATT, TTC, TCG
Walk: AG -> GA -> AT -> TG -> GA -> AT -> TT -> TC -> CG, and so on
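A sketch of step 3 using Hierholzer's algorithm, which is one standard way to find an Eulerian walk; the slides do not say which method the listed assemblers use. The walk starts at the vertex with one more outgoing than incoming edge, and the sequence is spelled by appending the last base of each visited vertex.

```python
from collections import defaultdict


def eulerian_walk(kmers):
    """Return the vertex sequence of a walk that uses every k-mer edge exactly once."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for kmer in kmers:
        left, right = kmer[:-1], kmer[1:]
        graph[left].append(right)
        balance[left] += 1
        balance[right] -= 1
    start = next((node for node, b in balance.items() if b == 1), kmers[0][:-1])

    stack, walk = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            walk.append(stack.pop())         # dead end: emit and backtrack
    return walk[::-1]


def spell(walk):
    return walk[0] + "".join(node[-1] for node in walk[1:])


kmer_list = ["AGA", "GAT", "ATG", "TGA", "GAT", "ATT", "TTC", "TCG"]
walk = eulerian_walk(kmer_list)
print(" -> ".join(walk))
print(spell(walk))
```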

If reads are always broken into k-mer sequences before assembly, it is more efficient to use dbg than OLC (Compeau et al., 2011, Nature Biotechnology). [Figure: OLC graph vs. dbg graph]

Error correction: in order to assemble fewer and longer contigs, most assembly programs modify the result.

Error correction [Figure]

dbg algorithm (Velvet software)
Step 1: sequencing. The read length is 7. (In the figure, red marks a sequencing error.)
Step 2: build a k-mer lookup table (k = 4) and link all overlapping k-mers.

dbg algorithm (Velvet software)
Step 3: simplify the graph by combining overlapping k-mers into longer sequences. Note that the simplified graph can still contain several possible paths.
Step 4: remove the error path, which leaves four contigs.
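One simple way to see why an error path stands out is to count k-mers: a sequencing error creates k-mers that are seen only once, while true k-mers are seen in several reads. The sketch below uses a plain abundance filter, which is a simplification swapped in for illustration and is not Velvet's actual tip and bubble removal. The genome, the reads (length 7, k = 4 as on the slide) and the threshold `min_count` are all hypothetical.

```python
from collections import Counter


def solid_kmers(reads, k, min_count=2):
    """Keep only k-mers that occur at least min_count times across all reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, count in counts.items() if count >= min_count}


genome = "ATGGCGTGCA"  # hypothetical genome
true_reads = ["ATGGCGT", "TGGCGTG", "GGCGTGC", "GCGTGCA"]
error_read = "GGCTTGC"  # GGCGTGC with one wrong base
reads = true_reads * 2 + [error_read]

kept = solid_kmers(reads, k=4)
print(sorted(kept))
```

The four k-mers that span the wrong base each occur once and are dropped, so the graph built from the remaining k-mers no longer branches at the error.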

Required conditions for a perfect dbg
1. The k-mers cover the entire genome. In practice this is hardly possible: some regions of the genome are hard to sequence (GC-rich regions or regions with structural problems) and others are very easy to sequence, so some regions are covered by many reads while others have none.
2. No k-mer contains an error. This is impossible. Even Illumina, the most accurate platform so far, only guarantees that about 80% of bases reach Q30 (one error per 1,000 bases).
3. Each k-mer appears only once in the genome. This is impossible. Most cellular and viral genomes contain repeated sequences of varying lengths; about 45% of the human genome consists of repeats.
Reference: Human Molecular Genetics, 4th ed., 2010
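The Q30 figure follows from the Phred definition Q = -10 log10(P), where P is the probability that a base call is wrong. A short sketch of the conversion; the read length of 150 in the example is an arbitrary illustrative value.

```python
def phred_to_error_probability(q):
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def expected_errors(read_length, q):
    """Expected number of wrong bases if every base has quality q."""
    return read_length * phred_to_error_probability(q)


for q in (10, 20, 30, 40):
    print(f"Q{q}: 1 error in {round(1 / phred_to_error_probability(q))} bases")
print(expected_errors(150, 30))
```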

Repeats are very problematic in genome assembly. With short reads, no algorithm can resolve repeats exactly. [Figure: OLC overlap graph of read1 to read4 across a repeat]

Repeats are very problematic in genome assembly. dbg: reads are immediately split into shorter k-mers, so it may not resolve repeats as well as an overlap graph.

Typical results of the different algorithms when the sequence contains repeats: string overlap algorithm vs. graph-based algorithms. Source: www.langmead-lab.org/teaching-materials

How to select k in dbg algorithms: find the optimal balance between sensitivity and graph complexity.
Guideline for k selection:
Low coverage: use a smaller k, which increases the number of overlapping reads that contribute to the graph.
High coverage: use a larger k; there is no need to be so sensitive, and the graph complexity must be reduced.

CLC automatically determines the k-mer length from the number of input base pairs, up to a maximum of 64: the range is 12-24 on 32-bit computers and 12-64 on 64-bit computers. Source: http://www.clcsupport.com/clcassemblycell/4 20/index.php?manual=How_it_works.html

Comparison of assembly algorithms: OLC and dbg
OLC: low coverage, long reads, small genome assembly
dbg: high coverage, short reads, large genome assembly

Merits
OLC: It can analyze sequences of varying length from different platforms. It assembles from overlapping sequences, which gives high reliability.
dbg: High speed, high efficiency.
Faults
OLC: Very low speed and computationally difficult. It is mainly applicable to long-read sequencing.
dbg: If a repeat is longer than the k-mer, the assembly is error-prone. Any error in a read, regardless of its size, makes the graph bifurcate, so correction is necessary. The assembled genome sometimes does not match the original reads 100%.
No assembler or algorithm performed consistently well in all the statistics.

What is N50?
1. After sequence assembly, we get a set of contigs.
2. Sort the contigs by length in descending order and calculate the sum of all contig lengths.
3. Walk down the sorted list, adding up the lengths. The N50 is the length N of the contig at which this running sum first reaches 50% of the total length of all contigs.
In the slide's example with nine contigs, half of the total length is reached at contig #2, so N50 = the length of contig #2.
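The three steps translate directly into code. In this sketch the nine contig lengths are hypothetical values chosen so that, as on the slide, the halfway point falls in the second-longest contig.

```python
def n50(lengths):
    """Length of the contig at which the running sum first reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


contig_lengths = [100, 80, 40, 30, 20, 20, 10, 10, 10]  # hypothetical, total 320
print(n50(contig_lengths))
```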

Does a longer N50 mean better assembly quality? [Figure: Assemblies A and B with their 50% lengths marked] The N50 of Assembly B >> the N50 of Assembly A. Is the result of Assembly B therefore better?

N25 and N75 [Figure: N25, N50 (50% length) and N75 marked on the sorted contigs]
If N50 is close to N25, most contigs are long.
If N50 is close to N75, most contigs are of medium to short length.
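N25 and N75 use the same running-sum idea with thresholds of 25% and 75% of the total length. A sketch with a general `nx` function; the contig lengths are hypothetical.

```python
def nx(lengths, percent):
    """Length of the contig at which the running sum first reaches percent% of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= total * percent:
            return length
    raise ValueError("no contigs given")


contig_lengths = [100, 80, 40, 30, 20, 20, 10, 10, 10]  # hypothetical, total 320
for percent in (25, 50, 75):
    print(f"N{percent} = {nx(contig_lengths, percent)}")
```

For any assembly N25 >= N50 >= N75, so comparing the three values gives a rough picture of how the contig lengths are distributed.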

De novo transcriptome assembly (Nature Reviews Genetics, 2011)

Overview of the de novo transcriptome assembly strategy. Step 1: Generate k-mer sequences from the reads. (Martin & Wang, Nat. Rev. Genet., 2011)

Overview of the de novo transcriptome assembly strategy. Step 2: Generate the de Bruijn graph. Step 3: Simplify the de Bruijn graph. (Martin & Wang, Nat. Rev. Genet., 2011)

Overview of the de novo transcriptome assembly strategy. Step 4: Traverse the graph. Step 5: Assemble the isoforms. (Martin & Wang, Nat. Rev. Genet., 2011)

Contrasting genome and transcriptome assembly
Genome assembly: uniform coverage; single contig per locus; double-stranded.
Transcriptome assembly: exponentially distributed coverage levels; multiple contigs per locus (alternative splicing); strand-specific.

Genome assembly: a single massive graph; entire chromosomes are represented.
Transcriptome assembly: many thousands of small graphs; ideally, one graph per expressed gene.

Trinity (Haas et al., Nat Protoc, 2013)

Trinity: RNA-Seq de novo assembly. RNA-Seq reads -> linear contigs -> de Bruijn graphs -> transcripts + isoforms (Haas et al., Nat Protoc, 2013)

Inchworm
Step 1: Decompose all reads into k-mers (k = 25).
Step 2: Identify the seed k-mer as the most abundant k-mer, ignoring low-complexity k-mers.
Step 3: Extend the k-mer at the 3'-end, guided by coverage.
Step 4: Remove the assembled k-mers from the catalog, then repeat the entire process.
[Figure: extension from the seed GATTACA; at each step the candidate next bases G, A, T, C are shown with their k-mer counts and the most abundant one is chosen]
Report contig: AAGATTACAGA
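The four steps can be sketched as a greedy walk over a k-mer count table. This is an illustration of the idea only, not Trinity's Inchworm code: it extends in the 3' direction only (as in step 3 above), skips the low-complexity filter, and uses k = 5 with hypothetical reads so that the example stays readable.

```python
from collections import Counter


def count_kmers(reads, k):
    return Counter(read[i:i + k] for read in reads for i in range(len(read) - k + 1))


def extend_from_seed(counts, k):
    """Greedy 3' extension from the most abundant k-mer; returns (contig, used k-mers)."""
    seed = max(counts, key=counts.get)
    contig, used = seed, {seed}
    while True:
        suffix = contig[-(k - 1):]
        candidates = [
            (counts[suffix + base], suffix + base)
            for base in "ACGT"
            if suffix + base in counts and suffix + base not in used
        ]
        if not candidates:
            return contig, used
        _, best = max(candidates)  # highest coverage wins
        contig += best[-1]
        used.add(best)


def inchworm_like(reads, k):
    """Repeat seed-and-extend, removing used k-mers from the catalog each time."""
    counts = count_kmers(reads, k)
    contigs = []
    while counts:
        contig, used = extend_from_seed(counts, k)
        contigs.append(contig)
        for kmer in used:
            del counts[kmer]
    return contigs


reads = ["AAGATTACAGA", "AAGATTACAGA", "GATTACA", "GATTA"]  # hypothetical reads
print(inchworm_like(reads, k=5))
```

Because the sketch never extends toward the 5' end, the bases upstream of the seed end up in separate short contigs.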

Chrysalis: pools the Inchworm contigs and joins linear sequences that overlap by k-1 bases to build graph components, integrating isoforms via k-1 overlaps. (Haas et al., Nat Protoc, 2013)

Butterfly: compacting. Build dbg graphs, ideally one per gene.

De novo metagenome assembly: MetaVelvet software. DNA extraction from microbial community -> (sequencing) -> mixed sequence reads of multiple species -> (assembly) -> contigs or scaffolds for metagenomic sequences (Sakakibara et al., NAR, 2014)

De novo metagenome assembly. DNA extraction from microbial community -> (sequencing) -> mixed sequence reads of multiple species -> (assembly, or clustering followed by single genome assembly) -> contigs or scaffolds for metagenomic sequences (Sakakibara et al., NAR, 2014)

Construct a large de Bruijn graph for the mixed reads of multiple species, decompose it into subgraphs, then assemble each subgraph separately: assembly for species A, assembly for species B, assembly for species C. [Figure: mixed de Bruijn graph split into three subgraphs]

Velvet vs. MetaVelvet: de Bruijn graph of a metagenome assembly
Low coverage (assume = 10): Species A in MetaVelvet; mis-removed as error in Velvet.
Mid coverage (assume = 30): Species B in MetaVelvet.
High coverage (assume = 60): Species C in MetaVelvet; mislabeled as repeat in Velvet.

A reality check: before running de novo assembly, please read this article first.

Or at least read Box 1 of this article, a shortcut to the whole picture.

de novo assembly improvement suggestions
Good-quality data is key to a successful assembly: trim reads based on quality and trim adapters from the sequences.
Scan over many k values (25-65) and pick the one with the best N50. High-quality data -> larger k-mer. Data with homopolymer errors -> smaller k-mer.
Combining genome and transcriptome assembly can vastly improve assemblies.
Expect lower quality in difficult regions: repeats and high GC content.
Bubble size (using CLC): if you do not expect a repetitive genome -> higher bubble size; if your sequence quality is not good -> higher bubble size; if you anticipate more repeats -> smaller bubble size.

Bubble size (using CLC): increasing the bubble size also increases the chance of misassemblies. (CLCbio manual)

"Don't take as gospel the output of an assembly program. If your paper is going to rely on that, it is absolutely essential that you do PCR and other follow-up experiments." Benedict Paten, Assistant Research Scientist, University of California, Santa Cruz

Thank you for your attention. My email: paul@imb.sinica.edu.tw. Bioinformatics Core @ IMB, Rm. N107, IMB BSC, No. 128 Academia Road, Section 2, Nankang, Taipei 115, Taiwan R.O.C. TEL: 886-2-2789-9967