De novo sequence assembly

De novo sequence assembly
2015.6.12
Paul Wei Che HSU (徐唯哲)
Assistant Research Specialist, Bioinformatics Service Core, Institute of Molecular Biology, Academia Sinica, Taiwan, R.O.C.

De novo sequence assembly
Genome assembly
Transcriptome assembly
Metagenome assembly

Shortest common superstring (SCS)
Given a collection of strings S, find SCS(S): the shortest string that contains all strings in S as substrings.
Example:
S: BAA AAB BBA ABA ABB BBB AAA BAB
Concatenation (without the requirement of shortest): BAAAABBBAABAABBBBBAAABAB (length 24)
SCS(S): AAABBBABAA (length 10), which contains AAA AAB ABB BBB BBA BAB ABA BAA
(Ben Langmead, http://www.langmead-lab.org/teaching-materials/)
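The SCS definition can be checked on the example set with a brute-force sketch in Python. This is only an illustration, not how any assembler works: it tries every ordering of the strings, so it is usable only for a handful of strings, and it assumes no string is a substring of another. The function names `overlap` and `scs` are chosen here for the example.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def scs(strings):
    """Shortest common superstring by trying every ordering (exponential time)."""
    best = None
    for order in permutations(strings):
        sup = order[0]
        for s in order[1:]:
            # Append only the part of s that does not already overlap.
            sup += s[overlap(sup, s):]
        if best is None or len(sup) < len(best):
            best = sup
    return best

S = ["BAA", "AAB", "BBA", "ABA", "ABB", "BBB", "AAA", "BAB"]
result = scs(S)
print(result, len(result))
```

The result has length 10, the same as AAABBBABAA; several different superstrings of that length may exist, so the exact string printed depends on the order in which permutations are tried.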

De novo genome assembly
Unknown genome -> shotgun sequencing -> assembly
Shotgun sequencing: DNA is sheared into random fragments (reads or tags), which are then assembled.

De novo assembly algorithms
String-based assemblers (greedy extension algorithm)
Graph-based assemblers:
  Overlap-layout-consensus (OLC)
  de Bruijn graph assembly (dbg)

String-based assemblers (greedy extension algorithm)
SSAKE (2007), SHARCGS (2007), and QSRA (2009) are applicable to the Illumina platform.
More time-consuming; suitable for small numbers of reads (low throughput) and smaller genomes.
The greedy algorithm is not guaranteed to choose the overlaps that yield the SCS, but it is a good approximation.

Shortest common superstring: greedy
Greedy SCS algorithm in action (l = 1)
[Figure: rounds of merging, one merge per line. The number in the first column is the length of the overlap merged before that round; strings shown in red are merged before the next round.]
Greedy answer: BABBBAABAAA
Actual SCS: AAABBBABAA
(Ben Langmead, http://www.langmead-lab.org/teaching-materials/)

String overlap algorithm (greedy extension algorithm)
1. Identify overlapping regions and select the highest-scoring overlap.
2. Merge the overlapping sequences.
3. Identify overlaps again, then merge (repeat).
4. Stop when no sequences can be merged any more.
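The greedy steps can be sketched in a few lines of Python. This is a minimal illustration under simple assumptions (exact overlaps only, overlap length as the score, ties broken by whichever pair is found first); the names `overlap` and `greedy_scs` are made up for the example, and real string-based assemblers handle errors and scale very differently.

```python
from itertools import permutations

def overlap(a, b, min_len=1):
    """Longest suffix of a matching a prefix of b, at least min_len long (min_len >= 1)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_scs(strings, min_len=1):
    """Repeatedly merge the pair with the longest overlap until none is left."""
    reads = list(strings)
    while len(reads) > 1:
        best_len, best_pair = 0, None
        for i, j in permutations(range(len(reads)), 2):
            n = overlap(reads[i], reads[j], min_len)
            if n > best_len:
                best_len, best_pair = n, (i, j)
        if best_pair is None:
            break  # nothing overlaps any more
        i, j = best_pair
        merged = reads[i] + reads[j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    # Any sequences that could not be merged are simply concatenated.
    return "".join(reads)

S = ["BAA", "AAB", "BBA", "ABA", "ABB", "BBB", "AAA", "BAB"]
superstring = greedy_scs(S)
print(superstring, len(superstring))
```

The output always contains every input string, but its length depends on how ties are broken, so it may be longer than the true SCS of length 10.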

Graph-based assemblers
High speed; suitable for large numbers of reads (high throughput) and bigger genomes.
Overlap-layout-consensus (OLC): Newbler (2006, 454 platform), Forge (2009, 454 + Illumina)
de Bruijn graph assembly (dbg): Velvet (2008), CLCbio (2009), ABySS (2009), and SOAPdenovo (2010) are applicable to the Illumina platform.

Overlap-layout-consensus (OLC)
Software: Newbler (454 platform), SGA
Overlap: 1. Find overlaps. 2. Build the overlap graph.
Layout: bundle stretches of the overlap graph into contigs.
Consensus: pick the most likely nucleotide sequence for each contig.

Finding overlaps: semiglobal alignment
Find the optimal alignment between a suffix (prefix) of S1 and a prefix (suffix) of S2.
Needleman-Wunsch algorithm (dynamic programming)

Finding overlaps: exact string matching (L = 3)

Finding overlaps: suffix tree

Build overlap graph
Find the overlap relationships between all pairs of reads, then draw them as a graph: each read is a node and each overlapping sequence is an edge.
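A small Python sketch of overlap finding and graph building, assuming exact suffix-prefix matches with a minimum overlap length of 3. The three reads are hypothetical and chosen only for illustration; `overlap` and `overlap_graph` are names made up for this example.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that exactly matches a prefix of b.

    Returns 0 if no match of at least min_len characters exists.
    """
    start = 0
    while True:
        # Look for the first min_len characters of b somewhere in a.
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        # Check that the rest of a, from that point, is a prefix of b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def overlap_graph(reads, min_len=3):
    """Directed graph as a dict: (read_a, read_b) -> overlap length."""
    graph = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n > 0:
            graph[(a, b)] = n
    return graph

reads = ["CTCTAGGCC", "GCCCTCAAT", "CAATTTTT"]  # hypothetical reads
graph = overlap_graph(reads)
for (a, b), n in graph.items():
    print(a, "->", b, "overlap", n)
```

Comparing every pair of reads is quadratic in the number of reads, which is one reason suffix-based indexes are used in practice.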

Layout

Layout: Hamiltonian path
A Hamiltonian path is a path between two vertices of a graph that visits each vertex exactly once. If there is also an edge from the last vertex of the path back to the first, the closed path is called a Hamiltonian circuit.
[Figure: example graph with vertices A to I]

Layout
Genome: to_every_thing_turn_turn_turn_there_is_a_season
(Ben Langmead, http://www.langmead-lab.org/teaching-materials/)

Consensus
Pick the most likely nucleotide sequence for each contig.
Where the reads disagree at a position, is it a sequencing error, a SNP, an insertion, or a deletion?
(Ben Langmead, http://www.langmead-lab.org/teaching-materials/)
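One simple way to pick the most likely nucleotide is a per-column majority vote. The Python sketch below assumes the reads have already been laid out against each other, with spaces marking positions a read does not cover; the layout is hypothetical, and real consensus callers also weigh base qualities and handle insertions and deletions.

```python
from collections import Counter

def consensus(layout):
    """Majority-vote consensus over reads padded with spaces to a common frame."""
    width = max(len(row) for row in layout)
    bases = []
    for i in range(width):
        column = [row[i] for row in layout if i < len(row) and row[i] != " "]
        # "N" marks a position that no read covers.
        bases.append(Counter(column).most_common(1)[0][0] if column else "N")
    return "".join(bases)

layout = [
    "ACGTTGCA    ",
    "  GTAGCATT  ",  # hypothetical read with an error at position 4 (A instead of T)
    "    TGCATTAC",
]
contig = consensus(layout)
print(contig)
```

With only three reads, a majority vote cannot tell a sequencing error from a true variant; it simply reports the most frequent base.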

Limitation of OLC
Data sets of more than a million reads cannot be resolved efficiently.

Use k-mer sequences instead of reads
True genome (never known) -> reads -> k-mer sequences
Break the reads into smaller k-mer sequences: de Bruijn graph assembly (dbg).

de Bruijn graph assembly (dbg)
Velvet (2008), CLCbio (2009), ABySS (2009), SOAPdenovo (2010)
Step 1: break each read into all of its substrings of length k (k-mers).
Example with k = 3:
Read: AGATGATTCG
3-mers: AGA, GAT, ATG, TGA, GAT, ATT, TTC, TCG
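Step 1 amounts to sliding a window of length k along the read. A minimal Python sketch (the function name `kmers` is chosen for this example):

```python
def kmers(read, k):
    """All substrings of length k, in order, including repeats."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

three_mers = kmers("AGATGATTCG", 3)
print(three_mers)
```

A read of length n yields n - k + 1 k-mers, here 10 - 3 + 1 = 8; GAT appears twice because it occurs twice in the read.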

de Bruijn graph assembly (dbg)
Velvet (2008), CLCbio (2009), ABySS (2009), SOAPdenovo (2010)
Step 2: use each (k-1)-mer as a vertex and each k-mer as an edge, and draw the graph. Each distinct (k-1)-mer appears only once in the graph.
Read: AGATGATTCG
k-mers: AGA, GAT, ATG, TGA, GAT, ATT, TTC, TCG
Edges between (k-1)-mers: AG -> GA, GA -> AT, AT -> TG, TG -> GA, GA -> AT, AT -> TT, TT -> TC, TC -> CG
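Step 2 can be sketched as an adjacency list keyed by (k-1)-mer. This Python example is an illustration only (the name `de_bruijn` is made up here); it keeps repeated edges, so the graph is a multigraph.

```python
from collections import defaultdict

def de_bruijn(read, k):
    """Map each (k-1)-mer to the list of (k-1)-mers it points to, one entry per k-mer."""
    graph = defaultdict(list)
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        # The k-mer is an edge from its first k-1 bases to its last k-1 bases.
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

graph = de_bruijn("AGATGATTCG", 3)
for node, successors in graph.items():
    print(node, "->", successors)
```

Each vertex appears once as a key even when it occurs several times in the read; the edge GA -> AT is stored twice because the 3-mer GAT occurs twice.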

de Bruijn graph assembly (dbg)
Velvet (2008), CLCbio (2009), ABySS (2009), SOAPdenovo (2010)
Step 3: find an Eulerian walk, that is, a path through the directed graph that traverses each edge exactly once.
Read: AGATGATTCG
k-mers: AGA, GAT, ATG, TGA, GAT, ATT, TTC, TCG
Walk: AG -> GA -> AT -> TG -> GA -> AT -> TT -> TC -> CG, which spells AGATGATTCG.
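A sketch of Step 3 in Python, using Hierholzer's algorithm to find an Eulerian walk in the directed graph built from the k-mers. It assumes such a walk exists (error-free k-mers from a single sequence); the function name `eulerian_walk` is made up for this example, and production assemblers do not reconstruct genomes this directly.

```python
from collections import defaultdict

def eulerian_walk(kmers):
    """Spell the sequence given by an Eulerian walk over the de Bruijn graph."""
    graph = defaultdict(list)
    indegree = defaultdict(int)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
        indegree[kmer[1:]] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    nodes = list(graph)
    start = next((n for n in nodes if len(graph[n]) - indegree[n] == 1), nodes[0])
    stack, walk = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            walk.append(stack.pop())         # dead end: add node to the walk
    walk.reverse()
    # The first node contributes k-1 bases, every later node one more base.
    return walk[0] + "".join(n[-1] for n in walk[1:])

kmers = ["AGA", "GAT", "ATG", "TGA", "GAT", "ATT", "TTC", "TCG"]
sequence = eulerian_walk(kmers)
print(sequence)
```

For this example only one Eulerian walk exists, so the original read is recovered; with repeats longer than k-1, several walks can exist and the reconstruction becomes ambiguous.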

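If reads are broken into k-mer sequences anyway, it is more efficient to use dbg: the assembly becomes an Eulerian walk problem (visit every edge once) instead of the Hamiltonian path problem of OLC (visit every vertex once).
(Compeau et al., 2011, Nature Biotechnology)
[Figure: OLC graph vs. dbg graph]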

Error correction
To assemble fewer and longer contigs, most assembly programs modify the result by correcting suspected errors.

dbg algorithm (Velvet software)
Step 1: Sequencing. The genome is sequenced as reads of length 7 (in the figure, red marks a sequencing error).
Step 2: Build a lookup table of k-mers (k = 4) and link all k-mers.

dbg algorithm (Velvet software)
Step 3: Simplify the graph by combining overlapping k-mers into longer sequences. Note that simplifying the graph can leave several possible paths.
Step 4: Remove the error paths, giving four contigs.

Required conditions for a perfect dbg
1. The k-mers cover the entire genome.
   Hardly possible: some regions of the genome are difficult to sequence (GC-rich regions or structural problems) and others are very easy, so some regions are covered by many reads while others have none.
2. All k-mers are free of errors.
   Impossible. Illumina, currently the highest-quality platform, guarantees only that about 80% of bases reach Q30 (one error per 1000 bases).
3. Each k-mer appears only once in the genome.
   Impossible. Most genomes, including viral genomes, contain repeated sequences of varying lengths. About 45% of the human genome is repeated sequence.
Reference: Human Molecular Genetics, 4/e, 2010
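The link between Q30 and "one error per 1000 bases" follows from the standard Phred scale, Q = -10 log10(P), where P is the probability that a base call is wrong. A short Python check (the function name is chosen for this example, and the read length of 100 is an assumed value):

```python
def phred_to_error_prob(q):
    """Error probability for a Phred quality score: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

p_q30 = phred_to_error_prob(30)
print(p_q30)  # probability that a Q30 base call is wrong

# Expected number of errors in a read of 100 bases that are all exactly Q30.
read_length = 100  # assumed read length, for illustration only
print(read_length * p_q30)
```

Even at Q30, a data set of many millions of reads contains a large number of erroneous k-mers, which is why condition 2 cannot hold.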

Repeats are very problematic in genome assembly
With short reads, no algorithm can resolve repeats exactly.
[Figure: OLC example with read1 to read4]

Repeats are very problematic in genome assembly
dbg: reads are immediately split into shorter k-mers, so it may not resolve repeats as well as an overlap graph.

Typical results of the different algorithms when the sequence contains repeats
[Figure: string overlap algorithm vs. graph-based algorithms]
Source: www.langmead-lab.org/teaching-materials

How to select k in dbg algorithms
Find the optimal balance between sensitivity and graph complexity.
Guideline for k selection:
Low coverage: use a smaller k, which increases the number of overlapping reads that contribute to the graph.
High coverage: use a larger k; there is no need to be so sensitive, and graph complexity has to be reduced.

CLC automatically determines the k-mer length from the number of base pairs in the input: 12 to 24 on 32-bit computers and 12 to 64 on 64-bit computers (maximum 64).
Source: http://www.clcsupport.com/clcassemblycell/4 20/index.php?manual=How_it_works.html

Comparison of assembly algorithms: OLC and dbg
OLC: low coverage, long reads, small genome assembly
dbg: high coverage, short reads, large genome assembly

Merits
OLC: It can analyze sequences of varying lengths from different platforms. It assembles from overlapping sequences, so reliability is high.
dbg: High speed, high efficiency.

Faults
OLC: Very slow and computationally demanding. Applicable mainly to long-read sequencing.
dbg: If a repeat is longer than the k-mer, the assembly becomes error-prone. Any error in a sequence, regardless of its size, creates a bifurcation in the graph, so correction is necessary. The assembled genome sometimes does not match the original reads 100%.

No assembler or algorithm performed consistently well on all the statistics.

What is N50?
1. After sequence assembly, we get a set of contigs.
2. Sort the contigs by length in descending order and calculate the sum of all contig lengths.
3. Add up the contig lengths from the longest downwards until the running total reaches 50% of the sum. The N50 is the length of the contig at which this happens; in other words, contigs of length N50 or longer contain at least half of the total assembled length.
In the slide's example with nine contigs, half of the total length is reached within contig #2, so N50 = the length of contig #2.
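The N50 steps translate directly into code. The Python sketch below generalizes to any percentage, so the same function also gives N25 and N75; the contig lengths are hypothetical and the function name `nx` is chosen for this example.

```python
def nx(lengths, x=50):
    """Length of the contig at which the running total reaches x% of the sum."""
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return 0  # only reached for an empty list

contigs = [80, 70, 50, 40, 30, 20, 10]  # hypothetical contig lengths, total 300
print("N50:", nx(contigs, 50))
print("N25:", nx(contigs, 25))
print("N75:", nx(contigs, 75))
```

Here half of the total (150) is reached after the second contig, so N50 is 70, the length of contig #2.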

Does a longer N50 mean better assembly quality?
The N50 of Assembly B is much greater than the N50 of Assembly A. Does that make Assembly B the better result?

N25 and N75
If N50 and N25 are similar, most contigs are long.
If N50 and N75 are similar, most contigs are of medium to short length.

De novo transcriptome assembly
(Nature Reviews Genetics, 2011)

Overview of the de novo transcriptome assembly strategy
Step 1: Generate k-mer sequences from the reads
(Martin & Wang, Nat. Rev. Genet., 2011)

Overview of the de novo transcriptome assembly strategy
Step 2: Generate the de Bruijn graph
Step 3: Simplify the de Bruijn graph
(Martin & Wang, Nat. Rev. Genet., 2011)

Overview of the de novo transcriptome assembly strategy
Step 4: Traverse the graph
Step 5: Assembled isoforms
(Martin & Wang, Nat. Rev. Genet., 2011)

Contrasting genome and transcriptome assembly
Genome assembly: uniform coverage; single contig per locus; double stranded
Transcriptome assembly: exponentially distributed coverage levels; multiple contigs per locus (alternative splicing); strand specific

Genome assembly: a single massive graph; entire chromosomes are represented.
Transcriptome assembly: many thousands of small graphs; ideally, one graph per expressed gene.

Trinity (Haas et al., Nat Protoc, 2013)

Trinity: RNA-Seq de novo assembly
Inchworm assembles reads, generating unique full-length transcripts for a dominant isoform (contigs).
Chrysalis clusters the contigs and constructs complete de Bruijn graphs for each cluster.
Butterfly compacts the graph with reads, reporting full-length transcripts for alternatively spliced isoforms.
(Haas et al., Nat Protoc, 2013)

De novo metagenome assembly: MetaVelvet software
DNA extraction from microbial community -> (sequencing) -> mixed sequence reads of multiple species -> (assembly) -> contigs or scaffolds for metagenomic sequences
(Sakakibara et al., NAR, 2014)

De novo metagenome assembly
DNA extraction from microbial community -> (sequencing) -> mixed sequence reads of multiple species -> (assembly) -> contigs or scaffolds for metagenomic sequences
Advantages: high-throughput sequencing; deep sequencing of low-abundance populations
Problems: short read length; the mixture of sequence reads leads to chimeric assembly

De novo metagenome assembly
DNA extraction from microbial community -> (sequencing) -> mixed sequence reads of multiple species -> (assembly) -> contigs or scaffolds for metagenomic sequences
Additional steps shown: clustering, then single genome assembly

MetaVelvet strategy
Construct a large de Bruijn graph for the mixed reads of multiple species, then decompose it into subgraphs: one assembly each for species A, species B, and species C.
[Figure: de Bruijn graph decomposed into per-species subgraphs]

Problem with metagenome assembly using Velvet
Velvet mislabels nodes when applied to a metagenome:
High-coverage nodes are mislabeled as repeats.
Low-coverage nodes are wrongly removed as errors.
Example: species A with high coverage (assume 60), species B with mid coverage (assume 30), species C with low coverage (assume 10).

Setting expectations: before running de novo assembly, please read this article first.

Otherwise, at least read Box 1 of this article: a shortcut to the whole picture.

De novo assembly improvement suggestions
Good-quality data is key to a successful assembly:
  Trim reads based on quality.
  Trim adapters from sequences.
Scan over many k values (25 to 65) and pick the one with the best N50:
  High-quality data: larger k
  Data with homopolymer errors: smaller k
Genome + transcriptome assembly can vastly improve assemblies.
Expect lower quality in difficult regions:
  Repeats
  High GC content
Bubble size (using CLC):
  If you do not expect a repetitive genome: higher bubble size
  If your sequence quality is not good: higher bubble size
  If you anticipate more repeats: smaller bubble size

"Don't take as gospel the output of an assembly program. If your paper is going to rely on that, it is absolutely essential that you do PCR and other follow-up experiments."
Benedict Paten, Assistant Research Scientist, University of California, Santa Cruz

Thank you for your attention.
Email: paul@imb.sinica.edu.tw
Rm. N107, IMB BSC, No. 128 Academia Road, Section 2, Nankang, Taipei 115, Taiwan R.O.C.
Bioinformatics Core @ IMB
TEL: 886 2 2789 9967