Genome Assembly. J Fass UCD Genome Center Bioinformatics Core Friday September, 2015

1 Genome Assembly J Fass UCD Genome Center Bioinformatics Core Friday September, 2015

2 From reads to molecules

3 What's the Problem? How to get the best assemblies for the smallest expense (sequencing) and least effort (bioinformatics).

4 What's the Problem? "[...] repeats are the single biggest impediment to all assembly algorithms and sequencing technologies." ~ Koren 2012 Nat Biotech

5 What's the Problem? Repeats larger than the read (or template) length are impossible to resolve unambiguously.

6 What's the Problem? Repeats larger than the read (or template) length are impossible to resolve unambiguously. A R B

7 What's the Problem? Repeats larger than the read (or template) length are impossible to resolve unambiguously. R A R B R?? A R B

8 What's the Problem? Repeats larger than the read (or template) length are impossible to resolve unambiguously. R B R A R?? A R B

9 What's the Problem? Magical FutureSeq™ reads easily resolve these long repetitive regions, but have unfortunately been slow coming to market. R A R B R!! R A R B R
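
The ambiguity in the slides above can be checked with a small sketch. The sequences R, A and B below are made-up placeholders, and reads are modelled simply as every substring of a fixed length with no errors, which is an assumption for illustration only. With reads no longer than the repeat, the two arrangements R A R B R and R B R A R produce exactly the same set of reads; once a read spans the whole repeat plus a base on each side, the two become distinguishable.

```python
from collections import Counter

def kmer_spectrum(seq, k):
    """Multiset of all length-k substrings (idealised, error-free reads)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Hypothetical sequences: R is the repeat, A and B are unique segments.
R = "ACGTACGT"
A = "TTGCA"
B = "GGATC"

genome_1 = R + A + R + B + R
genome_2 = R + B + R + A + R

short_read = len(R)      # reads no longer than the repeat
long_read = len(R) + 2   # reads spanning the repeat plus one base each side

same_with_short = kmer_spectrum(genome_1, short_read) == kmer_spectrum(genome_2, short_read)
same_with_long = kmer_spectrum(genome_1, long_read) == kmer_spectrum(genome_2, long_read)

print("short reads identical:", same_with_short)
print("long reads identical: ", same_with_long)
```

Real reads are sampled randomly and contain errors, so this only illustrates why read length relative to repeat length matters, not how an assembler behaves.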

10 What's the Problem? Assembly graphs with perfect reads of length k. Koren & Phillippy 2015 Current Opinion in Microbiology 23:110

11 Software ~timeline: Celera Assembler (OLC assembler used for whole-genome shotgun human assembly, as opposed to NIH BAC-by-BAC approach)... now open source wgs-assembler; Velvet (one of 1st de Bruijn graph assemblers); ALLPATHS-LG (de Bruijn, recipe-based); SGA - String Graph Assembler

12 OLC Assemblers

13 OLC Assemblers Overlap

14 OLC Assemblers Overlap Layout A R B

15 OLC Assemblers Overlap Layout Consensus. R OLC A R B R A R B
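
As a rough illustration of the overlap and layout steps, here is a toy sketch. The function names `overlap` and `merge`, the `min_len` parameter and the three reads are all hypothetical; the reads are assumed error-free and already in the right order, so the consensus step is trivial and is skipped. Production OLC assemblers use indexed, error-tolerant overlap detection rather than this brute-force comparison.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def merge(a, b, min_len=3):
    """Join `a` and `b` on their overlap, or return None if they do not overlap."""
    n = overlap(a, b, min_len)
    return a + b[n:] if n else None

# Hypothetical error-free reads, listed in layout order.
reads = ["ATTAGACC", "GACCTGCC", "TGCCGGAA"]

contig = reads[0]
for read in reads[1:]:
    contig = merge(contig, read)

print(contig)
```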

16 de Bruijn graph assembly: To reduce the computational challenge from millions of reads, break them up into smaller chunks (?!)

17 Constructing an assembly "graph"

18 Constructing an assembly "graph"

19 Constructing an assembly "graph"

20 Constructing an assembly "graph"

21 Constructing an assembly "graph"

22 de Bruijn graph assembler, Velvet: Build graph from 7 bp reads, with errors... using 4 bp k-mers. Tracking k-mers, not reads, essentially compresses the data... important for NextGen era!
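
A minimal sketch of the idea, using the slide's sizes (7 bp reads, 4 bp k-mers). The three reads are invented and error-free, and the representation chosen here (nodes are (k-1)-mers, each k-mer is an edge) is one common textbook convention, assumed for illustration; it is not a description of Velvet's internal data structures. The `walk` helper only follows out-degree and ignores in-degree, which is a further simplification.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk(graph, start):
    """Extend from `start` while exactly one successor exists; stop at branches, dead ends or cycles."""
    path, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        node = next(iter(graph[node]))
        if node in seen:
            break
        seen.add(node)
        path += node[-1]
    return path

# Hypothetical error-free 7 bp reads that overlap each other.
reads = ["TAGTCGA", "AGTCGAG", "GTCGAGG"]
graph = de_bruijn(reads, k=4)

total_kmers = sum(len(r) - 4 + 1 for r in reads)
distinct_kmers = sum(len(successors) for successors in graph.values())

print(total_kmers, "k-mer occurrences collapse to", distinct_kmers, "edges")
print(walk(graph, "TAG"))
```

The drop from k-mer occurrences to distinct edges is the "compression" the slide refers to: deeper coverage adds occurrences but not new edges, unless reads contain errors.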

23 de Bruijn graph assembler, Velvet: Tip Removal; Bubble Popping; (Coverage Constraints). Cutting at every ambiguity (branch point) yields the final contigs: TAGTCGAG, GAGGCTTAGA, AGATCGGATGAG, AGAGACAG. Zerbino 2008 Genome Research 18:

24 K-mer coverage...? Performance (speed, memory, effectiveness of assembly) of de Bruijn-graph assemblers is correlated with k-mer coverage, not base coverage.

25 Base coverage

26 K-mer coverage: k-mers tile across reads; there are (L - k + 1) k-mers in a read of length L
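
The count above leads to the usual conversion between base coverage and k-mer coverage, C_k = C * (L - k + 1) / L, which is the relation given in the Velvet manual. A small sketch follows; the function names and the example numbers (100 bp reads, k = 31, 50x base coverage) are illustrative assumptions, not values from the slides.

```python
def kmers_per_read(read_len, k):
    """Number of k-mers that tile across a read of length read_len."""
    return read_len - k + 1

def kmer_coverage(base_coverage, read_len, k):
    """Expected k-mer coverage: C_k = C * (L - k + 1) / L."""
    return base_coverage * kmers_per_read(read_len, k) / read_len

# Illustrative numbers only.
print(kmers_per_read(100, 31))     # 70
print(kmer_coverage(50, 100, 31))  # 35.0
```

Note how a larger k lowers k-mer coverage for the same data set, which is one reason the choice of k interacts with sequencing depth.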

27 Error Exclusion: Smaller k will increase coverage of true k-mers (peak shifts to the right), but not error k-mers. Choosing a coverage cutoff that separates the two distributions will simplify the graph, removing noise and leaving signal. Simple graph = longer contigs!
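
A toy sketch of a coverage cutoff, assuming exact k-mer counting in memory. The reads are invented: four identical error-free copies plus one copy with a single substitution. The helper names `count_kmers` and `solid_kmers` and the cutoff value are hypothetical choices for illustration; real assemblers derive the cutoff from the observed k-mer coverage distribution.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def solid_kmers(counts, cutoff):
    """K-mers seen at least `cutoff` times (treated as likely real)."""
    return {kmer for kmer, n in counts.items() if n >= cutoff}

# Hypothetical data: four correct reads and one read with a T->A error at position 4.
reads = ["ACGTTGCA"] * 4 + ["ACGATGCA"]

counts = count_kmers(reads, k=4)
kept = solid_kmers(counts, cutoff=2)
dropped = set(counts) - kept

print(sorted(kept))
print(sorted(dropped))
```

The single substitution creates four low-count k-mers, all of which fall below the cutoff, while the true k-mers survive.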

28 Choosing k Smaller k-mers increase the connectivity of the graph by simultaneously increasing the chance of observing an overlap between two reads and the number of ambiguous repeats in the graph. There is therefore a balance between sensitivity and specificity determined by k. ~Zerbino (2008) Genome Research 18:821

29 Choosing k

30 Choosing k

31 Assembly Miscellanea

32 Hierarchical Assembly Amplify Bacterial Artificial Chromosomes, Fosmids, etc.... sequence, assemble (simpler problem for BACs than chromosomes), then assemble the assemblies.

33 Scaffolding

34 Gap filling / contig extension

35 Gap filling / contig extension IMAGE (Iterative Mapping and Assembly for Gap Elimination) Tsai 2010 Genome Biology 11:R41 PRICE (Paired Read Iterative Contig Extension) DeRisi lab, UCSF

36 Reference-assisted assembly

37 Error Correction (Quake)

38 Error Correction (Quake)

39 Error Correction (Quake)

40 Error Correction Similar correction methods are incorporated into modern assemblers (like SOAPdenovo, SGA, ALLPATHS), and error exclusion (based on k-mer coverage) is an element of some (Velvet...)

41 Digital Normalization K-mer based one-pass filtering/trimming of short reads; discards redundant data to even out uneven coverage, and preferentially discards or trims error-containing reads. This reduces graph size (RAM) and computation time for assemblers. Brown 2012 arxiv: v2

42 Digital Normalization: Based on median k-mer abundance / coverage, diginorm discards the majority of error-containing k-mers, while retaining nearly all real k-mers (discards data, not information). Brown 2012 arxiv: v2
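
The median-based rule can be sketched as follows. This is a simplified illustration that uses an exact in-memory counter; the function name, the toy reads and the cutoff of 3 are assumptions made here, and a production tool would use a memory-efficient probabilistic counting structure instead. A read is kept only if the median abundance of its k-mers, among reads kept so far, is still below the cutoff.

```python
from collections import Counter
from statistics import median

def normalize_by_median(reads, k, cutoff):
    """One-pass filter: keep a read only while its median k-mer abundance is below `cutoff`."""
    counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k
        if median(counts[kmer] for kmer in kmers) < cutoff:
            kept.append(read)
            counts.update(kmers)  # only kept reads contribute to the counts
    return kept

# Hypothetical input: one region sequenced ten times, another only once.
reads = ["ACGTTGCA"] * 10 + ["GGCATTAC"]
kept = normalize_by_median(reads, k=4, cutoff=3)

print(len(reads), "reads in,", len(kept), "reads kept")
```

The over-sampled region is capped at the cutoff while the rare read is retained, which is how the method evens out coverage in a single pass.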

43 Diginorm (second pass - trimming): After digital normalization, make a second pass wherein the 3' ends of reads are trimmed to remove low-frequency k-mers.

44 Diginorm (third pass - normalization) After trimming, do another normalization pass. Trimming in between two normalization passes allows more discrimination between erroneous and real k-mers. Majority of computational time is in first pass (normalization), so three-pass approach is not much more demanding than single-pass approach.

45 Assemblers of note...

46 SPAdes: uneven coverage, chimerism ("St. Petersburg Assembler"). Nurk, Bankevich et al. (2013 book chapter) DOI: / _13; Bankevich, Nurk et al. (2012) J Comp Biol DOI: /cmb. Deals with highly uneven coverage depth (like IDBA_UD) but also high rates of chimerism in sequencing libraries (more of a problem for single-cell assemblies amplified with Multiple Displacement Amplification - micrometagenomes?). Users of SPAdes report: "It just works"

47 Allpaths-LG... and its "recipe". Ribeiro (2012) Genome Research doi: /gr; Gnerre (2011) PNAS 108:1513. Makes use of a recipe of three (or four) different libraries (see below); can be run without largest scale libraries, but not for best results. Makes sense for an institute that can standardize its sequencing and bioinformatics together. Gnerre 2011: 45x Overlapping PE reads (180 bp ISIZE, >100bp reads); 45x Short jump / MP (3kb); 5x.. Optional long jump / MP (6kb); 1x.. Optional fosmid jump / MP (40kb). Ribeiro 2012: 50x Overlapping PE reads (180bp ISIZE, >100bp reads); 50x 1-3kb PacBio reads; 50x Long jump / MP (2-10kb)

48 sga: String Graph Assembler. Simpson, J and Durbin, R (2010) Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26: i367. String graphs retain the information lost by de Bruijn graphs (full read context) by building graphs based on the full overlaps between reads (instead of k-mers). But, this requires all-to-all overlap detection! sga utilizes BWT & FM-index to make this tractable, but graph construction is still the most (computationally) expensive step. Compared to de Bruijn graph assemblers, sga uses less memory, but is significantly slower.

49 DISCOVAR de novo Weisenfeld, et al. (2014) Comprehensive variation discovery in single human genomes Nature Genetics 46:1350 (Publication is for DISCOVAR -- assembly and variant finding for smaller organisms -- not DISCOVAR de novo -- assembler for large genomes) Uses a single PCR-free, SPRI bead size selected library, and at least 60x coverage with PE250 reads. The size selection yields a broad spectrum of fragment sizes, and the longer distance read pairs are used for scaffolding. Polymorphic sequences can be pulled from the resulting graph structure, or consensus sequences.

50 Bringing PacBio into the picture

51 PacBio Read Correction: "PBcR" (web page at UMd) links to spec files, raw data; PBcR (wgs-assembler script) pages in wgs-assembler (Celera Assembler) wiki: http://wgs-assembler.sourceforge.net/wiki/index.php/pbcr ; ec-tools code on GitHub: plus data: also in SMRT-analysis software code on GitHub: com/pacificbiosciences/smrt-analysis

52 PacBio Read Correction: short, high accuracy reads (Illumina, 454, PB-CCS) mapped to PB reads; small coverage gaps recruit other PB reads to fill them; large coverage gaps split reads (maxgap option controls cutoff size). Recommended minimum: X PacBio, 50 X high accuracy reads

53 PacBio Read Correction, maxgap? Gaps shorter than the 'maxgap' setting get a chance to recruit multiple PB reads for support / correction. Gaps longer than the 'maxgap' setting are automatically split. (Koren, personal communication)

54 PacBio Read Correction: More recent Koren paper available at arxiv.org... check: Discusses PB read self-correction (for long reads from C2 or better chemistry). No independent high-accuracy reads needed; PB reads are aligned to each other to infer a consensus sequence. Implemented in Celera Assembler (wgs-assembler pacbiotoca script) and in PacBio's HGAP pipeline. Also, MHAP for faster alignment of long, noisy reads (reduces bottleneck in assembly).

55 "Historic" Genome Assemblers: Celera Assembler (used for whole-genome shotgun human assembly, as opposed to NIH BAC-by-BAC approach)... now, wgs-assembler (PBcR!); Velvet (one of 1st de Bruijn graph assemblers); ALLPATHS-LG (de Bruijn, recipe-based); SGA - String Graph Assembler. With high accuracy long reads, older OLC assemblers become more appropriate

56 How to incorporate PacBio? Small-ish (< 10 Mbp) genomes: 100x PacBio; PBcR or HGAP. Medium ( Mbp): x PacBio; HGAP. Moderate (< 1 Gbp): > 20x PacBio, 50x Illumina; PBcR or EC Tools, DBG2OLC? Large (> 1 Gbp): > 5x PacBio, x Illumina; Illumina assembly, then PBSuite

57 PacBio's HGAP: (Not shown) - Quiver algorithm polishes assembly by aligning all reads to finished genome, and calling a new consensus. Chin (2013) Nature Methods 10:563 doi: /nmeth.2474

58 Assembly Assessment

59 Assembly Competitions: Assemblathon 1: Earl 2011 Genome Research 21:2224; Assemblathon 2: ArXiv.org; GAGE - Genome Assembly Gold-standard Evaluations; dngasp - de novo Genome Assembly Project

60 Assembly Assessment: N50; NG50; Cumulative Length Plots; Feature Response Curves; (Alignment) Block NG50 (versus a good? reference); Read alignment methods

61 N50

62 N50, confused

63 NG50
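
Both statistics can be computed with the same few lines. This sketch assumes the usual definitions: N50 is the length of the contig at which the running total of contigs, sorted longest first, reaches half of the total assembly length, and NG50 uses half of an (estimated) genome size instead. The contig lengths and the genome size of 400 are made-up numbers, and the function name and `total` parameter are hypothetical.

```python
def n50(lengths, total=None):
    """Contig length at which the cumulative sum (longest first) reaches half of `total`.

    With total=None this is N50 (half of the assembly length);
    pass an estimated genome size as `total` to get NG50.
    """
    ordered = sorted(lengths, reverse=True)
    target = (total if total is not None else sum(ordered)) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return 0  # the assembly covers less than half of `total`

# Hypothetical contig lengths (arbitrary units).
contigs = [80, 70, 50, 40, 30, 20]

print(n50(contigs))             # N50  -> 70
print(n50(contigs, total=400))  # NG50 -> 50
```

Because NG50 is anchored to the genome size rather than the assembly size, it stays comparable between assemblies of the same genome that differ in total length.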

64 Cumulative Length Plots

65 Align to Trusted Reference. Mauve's contig reorder tool:

66 (Alignment) Block NG50

67 Cumulative (Alignment) Length Plots

68 Read-based Assessment (AMOS-validate), FRCurves, REAPR Vezzi 2012 PLoS One DOI: /journal.pone

69 Questions?


More information

State of the art de novo assembly of human genomes from massively parallel sequencing data

State of the art de novo assembly of human genomes from massively parallel sequencing data State of the art de novo assembly of human genomes from massively parallel sequencing data Yingrui Li, 1 Yujie Hu, 1,2 Lars Bolund 1,3 and Jun Wang 1,2* 1 BGI-Shenzhen, Shenzhen, Guangdong 518083, China

More information

De Novo Assembly (Pseudomonas aeruginosa MAPO1 ) Sample to Insight

De Novo Assembly (Pseudomonas aeruginosa MAPO1 ) Sample to Insight De Novo Assembly (Pseudomonas aeruginosa MAPO1 ) Sample to Insight 1 Workflow Import NGS raw data QC on reads De novo assembly Trim reads Finding Genes BLAST Sample to Insight Case Study Pseudomonas aeruginosa

More information

Microbiome: Metagenomics 4/4/2018

Microbiome: Metagenomics 4/4/2018 Microbiome: Metagenomics 4/4/2018 metagenomics is an extension of many things you have already learned! Genomics used to be computationally difficult, and now that s metagenomics! Still developing tools/algorithms

More information

Faction 2: Genome Assembly Lab and Preliminary Data

Faction 2: Genome Assembly Lab and Preliminary Data Faction 2: Genome Assembly Lab and Preliminary Data [Computational Genomics 2017] Christian Colon, Erisa Sula, David Lu, Tian Jin, Lijiang Long, Rohini Mopuri, Bowen Yang, Saminda Wijeratne, Harrison Kim

More information

Introduction to Short Read Alignment. UCD Genome Center Bioinformatics Core Tuesday 14 June 2016

Introduction to Short Read Alignment. UCD Genome Center Bioinformatics Core Tuesday 14 June 2016 Introduction to Short Read Alignment UCD Genome Center Bioinformatics Core Tuesday 14 June 2016 From reads to molecules Why align? Individual A Individual B ATGATAGCATCGTCGGGTGTCTGCTCAATAATAGTGCCGTATCATGCTGGTGTTATAATCGCCGCATGACATGATCAATGG

More information

From Infection to Genbank

From Infection to Genbank From Infection to Genbank How a pathogenic bacterium gets its genome to NCBI Torsten Seemann VLSCI - Life Sciences Computation Centre - Genomics Theme - Lab Meeting - Friday 27 April 2012 The steps 1.

More information

Genome Assembly Using de Bruijn Graphs. Biostatistics 666

Genome Assembly Using de Bruijn Graphs. Biostatistics 666 Genome Assembly Using de Bruijn Graphs Biostatistics 666 Previously: Reference Based Analyses Individual short reads are aligned to reference Genotypes generated by examining reads overlapping each position

More information

A near perfect de novo assembly of a eukaryotic genome using sequence reads of greater than 10 kilobases generated by the Pacific Biosciences RS II

A near perfect de novo assembly of a eukaryotic genome using sequence reads of greater than 10 kilobases generated by the Pacific Biosciences RS II A near perfect de novo assembly of a eukaryotic genome using sequence reads of greater than 10 kilobases generated by the Pacific Biosciences RS II W. Richard McCombie Disclosures Introduction to the challenge

More information

Looking Ahead: Improving Workflows for SMRT Sequencing

Looking Ahead: Improving Workflows for SMRT Sequencing Looking Ahead: Improving Workflows for SMRT Sequencing Jonas Korlach FIND MEANING IN COMPLEXITY Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, and SMRTbell are trademarks of Pacific Biosciences

More information

The Resurgence of Reference Quality Genome Sequence

The Resurgence of Reference Quality Genome Sequence The Resurgence of Reference Quality Genome Sequence Michael Schatz Jan 12, 2016 PAG IV @mike_schatz / #PAGIV Genomics Arsenal in the year 2015 Sample Preparation Sequencing Chromosome Mapping Summary &

More information

Reevaluating Assembly Evaluations with Feature Response Curves: GAGE and Assemblathons Francesco Vezzi 1,, Giuseppe Narzisi 2, Bud Mishra 2,3,4

Reevaluating Assembly Evaluations with Feature Response Curves: GAGE and Assemblathons Francesco Vezzi 1,, Giuseppe Narzisi 2, Bud Mishra 2,3,4 1 Reevaluating Assembly Evaluations with Feature Response Curves: GAGE and Assemblathons Francesco Vezzi 1,, Giuseppe Narzisi 2, Bud Mishra 2,3,4 1 School of Computer Science and Communication, KTH Royal

More information

Assembly. Ian Misner, Ph.D. Bioinformatics Crash Course. Bioinformatics Core

Assembly. Ian Misner, Ph.D. Bioinformatics Crash Course. Bioinformatics Core Assembly Ian Misner, Ph.D. Bioinformatics Crash Course Multiple flavors to choose from De novo No prior sequence knowledge required Takes what you have and tries to build the best contigs/scaffolds possible

More information

Sars International Centre for Marine Molecular Biology, University of Bergen, Bergen, Norway

Sars International Centre for Marine Molecular Biology, University of Bergen, Bergen, Norway Joseph F. Ryan* Sars International Centre for Marine Molecular Biology, University of Bergen, Bergen, Norway Current Address: Whitney Laboratory for Marine Bioscience, University of Florida, St. Augustine,

More information

de novo Transcriptome Assembly Nicole Cloonan 1 st July 2013, Winter School, UQ

de novo Transcriptome Assembly Nicole Cloonan 1 st July 2013, Winter School, UQ de novo Transcriptome Assembly Nicole Cloonan 1 st July 2013, Winter School, UQ de novo transcriptome assembly de novo from the Latin expression meaning from the beginning In bioinformatics, we often use

More information

Genomic resources. for non-model systems

Genomic resources. for non-model systems Genomic resources for non-model systems 1 Genomic resources Whole genome sequencing reference genome sequence comparisons across species identify signatures of natural selection population-level resequencing

More information

Next Gen Sequencing. Expansion of sequencing technology. Contents

Next Gen Sequencing. Expansion of sequencing technology. Contents Next Gen Sequencing Contents 1 Expansion of sequencing technology 2 The Next Generation of Sequencing: High-Throughput Technologies 3 High Throughput Sequencing Applied to Genome Sequencing (TEDed CC BY-NC-ND

More information

Haploid Assembly of Diploid Genomes

Haploid Assembly of Diploid Genomes Haploid Assembly of Diploid Genomes Challenges, Trials, Tribulations 13 October 2011 İnanç Birol Assembly By Short Sequencing IEEE InfoVis 2009 2 3 in Literature ~40 citations on tool comparisons ~20 citations

More information

ChIP-Seq Data Analysis. J Fass UCD Genome Center Bioinformatics Core Wednesday 15 June 2015

ChIP-Seq Data Analysis. J Fass UCD Genome Center Bioinformatics Core Wednesday 15 June 2015 ChIP-Seq Data Analysis J Fass UCD Genome Center Bioinformatics Core Wednesday 15 June 2015 What s the Question? Where do Transcription Factors (TFs) bind genomic DNA 1? (Where do other things bind DNA

More information

ChIP-Seq Tools. J Fass UCD Genome Center Bioinformatics Core Wednesday September 16, 2015

ChIP-Seq Tools. J Fass UCD Genome Center Bioinformatics Core Wednesday September 16, 2015 ChIP-Seq Tools J Fass UCD Genome Center Bioinformatics Core Wednesday September 16, 2015 What s the Question? Where do Transcription Factors (TFs) bind genomic DNA 1? (Where do other things bind DNA or

More information

Validation of synthetic long reads for use in constructing variant graphs for dairy cattle breeding

Validation of synthetic long reads for use in constructing variant graphs for dairy cattle breeding Validation of synthetic long reads for use in constructing variant graphs for dairy cattle breeding M Keehan and C Couldrey Research and Development, Livestock Improvment Corporation, Hamilton, New Zealand

More information

DE NOVO WHOLE GENOME ASSEMBLY AND SEQUENCING OF THE SUPERB FAIRYWREN. (Malurus cyaneus) JOSHUA PEÑALBA LEO JOSEPH CRAIG MORITZ ANDREW COCKBURN

DE NOVO WHOLE GENOME ASSEMBLY AND SEQUENCING OF THE SUPERB FAIRYWREN. (Malurus cyaneus) JOSHUA PEÑALBA LEO JOSEPH CRAIG MORITZ ANDREW COCKBURN DE NOVO WHOLE GENOME ASSEMBLY AND SEQUENCING OF THE SUPERB FAIRYWREN (Malurus cyaneus) JOSHUA PEÑALBA LEO JOSEPH CRAIG MORITZ ANDREW COCKBURN ... 2014 2015 2016 2017 ... 2014 2015 2016 2017 Synthetic

More information

Genomics and Transcriptomics of Spirodela polyrhiza

Genomics and Transcriptomics of Spirodela polyrhiza Genomics and Transcriptomics of Spirodela polyrhiza Doug Bryant Bioinformatics Core Facility & Todd Mockler Group, Donald Danforth Plant Science Center Desired Outcomes High-quality genomic reference sequence

More information

ChIP-Seq Data Analysis. J Fass UCD Genome Center Bioinformatics Core Wednesday December 17, 2014

ChIP-Seq Data Analysis. J Fass UCD Genome Center Bioinformatics Core Wednesday December 17, 2014 ChIP-Seq Data Analysis J Fass UCD Genome Center Bioinformatics Core Wednesday December 17, 2014 What s the Question? Where do Transcription Factors (TFs) bind genomic DNA 1? (Where do other things bind

More information

PERGA: A Paired-End Read Guided De Novo Assembler for Extending Contigs Using SVM and Look Ahead Approach

PERGA: A Paired-End Read Guided De Novo Assembler for Extending Contigs Using SVM and Look Ahead Approach Title for Extending Contigs Using SVM and Look Ahead Approach Author(s) Zhu, X; Leung, HCM; Chin, FYL; Yiu, SM; Quan, G; Liu, B; Wang, Y Citation PLoS ONE, 2014, v. 9 n. 12, article no. e114253 Issued

More information

SCIENCE CHINA Life Sciences. Comparative analysis of de novo transcriptome assembly

SCIENCE CHINA Life Sciences. Comparative analysis of de novo transcriptome assembly SCIENCE CHINA Life Sciences SPECIAL TOPIC February 2013 Vol.56 No.2: 156 162 RESEARCH PAPER doi: 10.1007/s11427-013-4444-x Comparative analysis of de novo transcriptome assembly CLARKE Kaitlin 1, YANG

More information

Lecture 14: DNA Sequencing

Lecture 14: DNA Sequencing Lecture 14: DNA Sequencing Study Chapter 8.9 10/17/2013 COMP 465 Fall 2013 1 Shear DNA into millions of small fragments Read 500 700 nucleotides at a time from the small fragments (Sanger method) DNA Sequencing

More information

Assembling metagenomes: a not so practical guide

Assembling metagenomes: a not so practical guide Assembling metagenomes: a not so practical guide C. Titus Brown Assistant Professor CSE, MMG, BEACON Michigan State University September 2013 ctb@msu.edu Acknowledgements Lab members involved Adina Howe

More information

De novo genome assembly. Dr Torsten Seemann

De novo genome assembly. Dr Torsten Seemann De novo genome assembly Dr Torsten Seemann IMB Winter School - Brisbane Mon 1 July 2013 Introduction Ideal world I would not need to give this talk! Human DNA Non-existent USB3 device AGTCTAGGATTCGCTA

More information

Genome Assembly With Next Generation Sequencers

Genome Assembly With Next Generation Sequencers Genome Assembly With Next Generation Sequencers Personal Genomics Institute 3 May, 2011 Jongsun Park Table of Contents 1 Central Dogma and Omics Studies 2 History of Sequencing Technologies 3 Genome Assembly

More information

Outline General NGS background and terms 11/14/2016 CONFLICT OF INTEREST. HLA region targeted enrichment. NGS library preparation methodologies

Outline General NGS background and terms 11/14/2016 CONFLICT OF INTEREST. HLA region targeted enrichment. NGS library preparation methodologies Eric T. Weimer, PhD, D(ABMLI) Assistant Professor, Pathology & Laboratory Medicine, UNC School of Medicine Director, Molecular Immunology Associate Director, Clinical Flow Cytometry, HLA, and Immunology

More information

AMOS Assembly Validation and Visualization

AMOS Assembly Validation and Visualization AMOS Assembly Validation and Visualization Michael Schatz Center for Bioinformatics and Computational Biology University of Maryland August 13, 2006 University of Hawaii Outline AMOS Validation Pipeline

More information

Experimental Design. Sequencing. Data Quality Control. Read mapping. Differential Expression analysis

Experimental Design. Sequencing. Data Quality Control. Read mapping. Differential Expression analysis -Seq Analysis Quality Control checks Reproducibility Reliability -seq vs Microarray Higher sensitivity and dynamic range Lower technical variation Available for all species Novel transcript identification

More information