Assembly. Ian Misner, Ph.D. Bioinformatics Crash Course. Bioinformatics Core

Multiple flavors to choose from
De novo:
- No prior sequence knowledge required.
- Takes what you have and tries to build the best contigs/scaffolds possible.
- The more data the better: multiple library types, multiple sequencing platforms.
Reference:
- Take your reads and put them back together, but using a guide.
- The reference doesn't have to be the exact same species or strain.
- You cannot align what isn't there.
DNA sequences: genomes
RNA sequences: transcriptomes

De Novo Assembly
The process of reconstructing the original DNA sequence from the fragment reads alone. Think of it like a jigsaw puzzle:
- Find reads that fit together (overlap).
- Some reads fit in multiple locations (repeats).
- Your children lost some pieces (sequencing bias).
- Some pieces are dirty (adaptor contamination, errors).

Star Wars-omics
Small genome: Come to the Dark Side we have cookies

Star Wars-omics
Reads:
  e......w
  o The Dar
  Come to
  he Dork Side.
  .we have cookies
  de......
  he Dark Sid
Overlap:
  Come to
        o The Dar
           he Dork Side.
           he Dark Sid
                     de......
                      e......w
                            .we have cookies
Consensus:
  Come to the Dark Side......we have cookies

Torsten Seemann

Assembly approaches
- Greedy assembly
- Overlap::layout::consensus (OLC)
- de Bruijn graphs
- String graphs
- Seed and extend
These all do the same thing; they simply use different shortcuts to deal with the data.
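To make one of these approaches concrete, here is a minimal sketch of how a de Bruijn graph can be built from reads. It assumes error-free reads and a tiny k; the function name `de_bruijn_edges` and the toy reads are illustrative only and do not come from any particular assembler.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read.

    Every k-mer contributes one edge: prefix (k-1)-mer -> suffix (k-1)-mer.
    Repeated edges are kept, so the list length acts as a crude coverage count.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Hypothetical toy reads sampled from the sequence ATGGCAT.
reads = ["ATGGC", "TGGCA", "GGCAT"]
graph = de_bruijn_edges(reads, k=3)
for node, successors in graph.items():
    print(node, "->", successors)
```

Contigs then correspond to walks through this graph. Real assemblers add error correction and graph simplification on top of this.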

The Ghost of Genomes Past
Genetic maps                       Yes
Physical maps                      Yes
Knowledge of genome structure      Yes
Haploid genomes                    Yes
Accurate & long reads              Yes
Resources (time, people, $$$$$)    Yes
What do you get when you have millions of dollars to spend, tons of people to help, accurate long reads, and genetic maps?
Keith Bradnam, UC Davis

The Ghost of Genomes Present
Genetic maps                       Yes
Physical maps                      Yes
Knowledge of genome structure      Yes
Haploid genomes                    Yes
Accurate & long reads              Yes
Resources (time, people, $$$$$)    Yes
What do you get when you have no dollars to spend, one person to help, less accurate short reads, and no genetic maps?
Keith Bradnam, UC Davis

What's the problem?
We want the best possible assembly from the smallest sequencing cost and the least amount of bioinformatics effort.
"I have one simple request, and that is to have sharks with freaking laser beams attached to their heads!" (Dr. Evil)

What's the problem?
REPEATS! REPEATS! REPEATS! REPEATS! REPEATS... you get the point.
The repeat paradox: it is nearly impossible to resolve repeats of length n unless you have reads longer than n.
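The repeat paradox can be checked on a toy example. The sketch below builds two hypothetical genomes that differ only in the order of the unique segments between three copies of the same repeat. It assumes error-free reads taken from every position, and all names and sequences are made up for illustration. Reads that cannot span the repeat plus one flanking base on each side come out identical for both genomes, so no assembler could tell the two apart.

```python
from collections import Counter


def read_spectrum(genome, read_len):
    """Multiset of every error-free read of length read_len in the genome."""
    return Counter(
        genome[i:i + read_len] for i in range(len(genome) - read_len + 1)
    )


REPEAT = "ACGTTGCA"  # hypothetical repeat, n = 8
# Same segments, but the two middle unique segments are swapped.
genome_a = "GGAT" + REPEAT + "CCTA" + REPEAT + "TTGC" + REPEAT + "AAGT"
genome_b = "GGAT" + REPEAT + "TTGC" + REPEAT + "CCTA" + REPEAT + "AAGT"

n = len(REPEAT)
# Reads of length n + 1 never see both flanks of a repeat copy.
short_reads_identical = read_spectrum(genome_a, n + 1) == read_spectrum(genome_b, n + 1)
# Reads of length n + 2 can span a repeat copy with one base on each side.
long_reads_identical = read_spectrum(genome_a, n + 2) == read_spectrum(genome_b, n + 2)

print("reads of length n + 1 identical:", short_reads_identical)
print("reads of length n + 2 identical:", long_reads_identical)
```

Only the longer reads carry the information about which unique segment follows which repeat copy.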

What's the problem?

Greedy Assembly
Find sequences with overlaps:
1. Find the largest overlaps.
2. Merge those overlaps.
Pros: Simple in practice.
Cons: Early mistakes can create bad assemblies.
Lars Arvestad
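The two steps above can be sketched as a short loop. This is a minimal illustration and not the implementation of any real assembler: it assumes error-free reads on a single strand, and the names `overlap_len`, `greedy_assemble` and `min_len` are hypothetical.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when the best overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(seqs) > 1:
        best_n, best_a, best_b = 0, None, None
        for a in seqs:
            for b in seqs:
                if a != b:
                    n = overlap_len(a, b, min_len)
                    if n > best_n:
                        best_n, best_a, best_b = n, a, b
        if best_n == 0:
            break  # nothing overlaps: what is left are separate contigs
        seqs.remove(best_a)
        seqs.remove(best_b)
        seqs.append(best_a + best_b[best_n:])
    return seqs


# Hypothetical toy reads from the sequence ATGGCGTACCTGA.
contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"])
print(contigs)
```

Because each merge is final, a wrong merge caused by a repeat is never undone, which is the weakness named in the cons.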

O-L-C
Overlap: What reads overlap?
- Create a node for each read.
- Create a directed edge between each pair of overlapping reads.
Layout: How do we combine those reads?
- Simplify the graph.
- Find the shortest paths in the graph.
Consensus: Derive the contigs from the graph.
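As a rough illustration of the overlap and layout stages, the sketch below builds a directed overlap graph and then removes transitive edges, which is one common way to simplify it. It assumes error-free reads and an acyclic graph. The function names, the `min_len` threshold and the toy reads are hypothetical.

```python
from itertools import permutations


def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a matching a prefix of b (0 if too short)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def build_overlap_graph(reads, min_len=2):
    """One node per read; edge a -> b is weighted by the overlap length."""
    graph = {read: {} for read in reads}
    for a, b in permutations(reads, 2):
        n = suffix_prefix_overlap(a, b, min_len)
        if n:
            graph[a][b] = n
    return graph


def remove_transitive_edges(graph):
    """Drop a -> c whenever a -> b and b -> c already connect the two reads."""
    reduced = {a: dict(targets) for a, targets in graph.items()}
    for a, targets in graph.items():
        for b in targets:
            for c in graph[b]:
                if c != b and c in reduced[a]:
                    del reduced[a][c]
    return reduced


# Hypothetical toy reads from the sequence CTCGGATTAC.
reads = ["CTCGGA", "CGGATT", "GATTAC"]
overlap_graph = build_overlap_graph(reads)
layout_graph = remove_transitive_edges(overlap_graph)
print(overlap_graph)
print(layout_graph)
```

After simplification, the remaining path through the three reads gives their order and offsets for the consensus stage.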

Overlap

Layout

Consensus
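One simple way to derive a consensus is a per-column majority vote over reads that have already been placed by the layout stage. The sketch below assumes the offsets are known and that there are no insertions or deletions. The `consensus` function and the toy layout are illustrative only.

```python
from collections import Counter


def consensus(layout):
    """Majority base per column for a list of (offset, read) placements.

    Columns that no read covers are reported as "N".
    """
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return "".join(
        col.most_common(1)[0][0] if col else "N" for col in columns
    )


# Hypothetical layout; the second read carries one error (T instead of G).
layout = [
    (0, "ATGGCGT"),
    (3, "GCTTACC"),
    (3, "GCGTAC"),
    (6, "TACCTGA"),
]
contig = consensus(layout)
print(contig)
```

The erroneous base is outvoted by the two reads that agree at that column, which is why coverage depth matters for consensus quality.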

Ben Langmead - JHU

Keith Bradnam UC Davis

Assembly Quality Assessment

Key Metrics Bird Genome Assembly
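The transcript does not list the individual metrics on this slide. As an assumed example of a widely reported contiguity metric, the sketch below computes N50: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The contig lengths are made up.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


# Hypothetical contig lengths; total = 290, so half = 145.
lengths = [80, 70, 50, 40, 30, 20]
print("N50 =", n50(lengths))
```

A higher N50 only says the assembly is more contiguous. It says nothing about whether the contigs were joined correctly.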

Keith Bradnam UC Davis

Where does this leave us?
- There is not one single assembler that works for every data set.
- No single assembler performs well across all measures.
- In the Assemblathon 2 paper, the choice of one command option by one tool for one metric caused scoring errors in the overall assembler ranking.

What does this all mean?
- There is no consensus on how to make a good assembly.
- Use different assemblers, and use different options within assemblers.
- Assembly will get better (it has to), but it will take time.
- Long reads will help!
- LOOK AT YOUR DATA!