A Roadmap to the De-novo Assembly of the Banana Slug Genome

Similar documents
De novo assembly of human genomes with massively parallel short read sequencing. Mikk Eelmets Journal Club

Outline. The types of Illumina data Methods of assembly Repeats Selecting k-mer size Assembly Tools Assembly Diagnostics Assembly Polishing

Genome Assembly. J Fass UCD Genome Center Bioinformatics Core Friday September, 2015

De novo whole genome assembly

Assembly of Ariolimax dolichophallus using SOAPdenovo2

De novo whole genome assembly

De Novo Assembly of High-throughput Short Read Sequences

Faction 2: Genome Assembly Lab and Preliminary Data

De novo genome assembly with next generation sequencing data!! "

Mapping. Main Topics Sept 11. Saving results on RCAC Scaffolding and gap closing Assembly quality

Genome Assembly Software for Different Technology Platforms. PacBio Canu Falcon. Illumina Soap Denovo Discovar Platinus MaSuRCA.

De novo whole genome assembly

Workflow of de novo assembly

De Novo and Hybrid Assembly

Experimental Design. Sequencing. Data Quality Control. Read mapping. Differential Expression analysis

Sequence Assembly and Alignment. Jim Noonan Department of Genetics

NGS technologies: a user s guide. Karim Gharbi & Mark Blaxter

Third Generation Sequencing

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es

Next Generation Sequences & Chloroplast Assembly. 8 June, 2012 Jongsun Park

DE NOVO WHOLE GENOME ASSEMBLY AND SEQUENCING OF THE SUPERB FAIRYWREN. (Malurus cyaneus) JOSHUA PEÑALBA LEO JOSEPH CRAIG MORITZ ANDREW COCKBURN

GENOME ASSEMBLY FINAL PIPELINE AND RESULTS

Looking Ahead: Improving Workflows for SMRT Sequencing

De novo assembly in RNA-seq analysis.

Transcriptome analysis

Aaron Liston, Oregon State University Botany 2012 Intro to Next Generation Sequencing Workshop

short read genome assembly Sorin Istrail CSCI1820 Short-read genome assembly algorithms 3/6/2014

Next Generation Sequencing. Jeroen Van Houdt - Leuven 13/10/2017

Computational Genomics [2017] Faction 2: Genome Assembly Results, Protocol & Demo

TruSPAdes: analysis of variations using TruSeq Synthetic Long Reads (TSLR)

Next Generation Sequencing Lecture Saarbrücken, 19. March Sequencing Platforms

Current'Advances'in'Sequencing' Technology' James'Gurtowski' Schatz'Lab'

The Journey of DNA Sequencing. Chromosomes. What is a genome? Genome size. H. Sunny Sun

NOW GENERATION SEQUENCING. Monday, December 5, 11

DNA Sequencing and Assembly

de novo paired-end short reads assembly

De novo assembly of complex genomes using single molecule sequencing

Next-Generation Sequencing. Technologies

RNA-Seq de novo assembly training

Sequencing Theory. Brett E. Pickett, Ph.D. J. Craig Venter Institute

DATA FORMATS AND QUALITY CONTROL

Alignment and Assembly

Greene 1. Finishing of DEUG The entire genome of Drosophila eugracilis has recently been sequenced using Roche

Why are we here? Introduction

Sars International Centre for Marine Molecular Biology, University of Bergen, Bergen, Norway

The Diploid Genome Sequence of an Individual Human

Illumina (Solexa) Throughput: 4 Tbp in one run (5 days) Cheapest sequencing technology. Mismatch errors dominate. Cost: ~$1000 per human genme

Deep Sequencing technologies

Consensus Ensemble Approaches Improve De Novo Transcriptome Assemblies

Mate-pair library data improves genome assembly

Genome Assembly CHRIS FIELDS MAYO-ILLINOIS COMPUTATIONAL GENOMICS WORKSHOP, JUNE 19, 2018

Genome Sequencing and Assembly

Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Supplemental Materials

Using New ThiNGS on Small Things. Shane Byrne

10/20/2009 Comp 590/Comp Fall

Overview of Next Generation Sequencing technologies. Céline Keime

Lecture 14: DNA Sequencing

Introduction to metagenome assembly. Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014

C3BI. VARIANTS CALLING November Pierre Lechat Stéphane Descorps-Declère

Assembly and Validation of Large Genomes from Short Reads Michael Schatz. March 16, 2011 Genome Assembly Workshop / Genome 10k

Genome Projects. Part III. Assembly and sequencing of human genomes

Next generation sequencing techniques" Toma Tebaldi Centre for Integrative Biology University of Trento

NGS technologies approaches, applications and challenges!

Next Gen Sequencing. Expansion of sequencing technology. Contents

Assembling a Cassava Transcriptome using Galaxy on a High Performance Computing Cluster

High Throughput Sequencing Technologies. UCD Genome Center Bioinformatics Core Monday 15 June 2015

BIOINFORMATICS ORIGINAL PAPER

High Throughput Sequencing Technologies. J Fass UCD Genome Center Bioinformatics Core Tuesday December 16, 2014

Data Basics. Josef K Vogt Slides by: Simon Rasmussen Next Generation Sequencing Analysis

Genomics AGRY Michael Gribskov Hock 331

High throughput sequencing technologies

Wheat CAP Gene Expression with RNA-Seq

De Novo Assembly (Pseudomonas aeruginosa MAPO1 ) Sample to Insight

De novo meta-assembly of ultra-deep sequencing data

Gap Filling for a Human MHC Haplotype Sequence

Long Read Sequencing Technology

Experimental Design Microbial Sequencing

Genome Assembly Background and Strategy

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

RNA-Seq analysis workshop

2nd (Next) Generation Sequencing 2/2/2018

Parts of a standard FastQC report

NEXT GENERATION SEQUENCING. Farhat Habib

How much sequencing do I need? Emily Crisovan Genomics Core

Next Generation Sequencing. Tobias Österlund

Combined final report: genome and transcriptome assemblies

BroadE Workshop: Genome Assembly. March 20 th, 2013

Course summary. Today. PCR Polymerase chain reaction. Obtaining molecular data. Sequencing. DNA sequencing. Genome Projects.

Yellow-bellied marmot genome. Gabriela Pinho Graduate Student Blumstein & Wayne Labs EEB - UCLA

Matthew Tinning Australian Genome Research Facility. July 2012

CBC Data Therapy. Metatranscriptomics Discussion

DNA-Sequencing. Technologies & Devices. Matthias Platzer. Genome Analysis Leibniz Institute on Aging - Fritz Lipmann Institute (FLI)

Genomic Technologies. Michael Schatz. Feb 1, 2018 Lecture 2: Applied Comparative Genomics

The goal of this project was to prepare the DEUG contig which covers the

De novo Genome Assembly

A Guide to Consed Michelle Itano, Carolyn Cain, Tien Chusak, Justin Richner, and SCR Elgin.

Contact us for more information and a quotation

Bioinformatics for Genomics

Genomics and Transcriptomics of Spirodela polyrhiza

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Transcription:

A Roadmap to the De-novo Assembly of the Banana Slug Genome Stefan Prost 1 1 Department of Integrative Biology, University of California, Berkeley, United States of America April 6th-10th, 2015

Outline 1 A priori Information about the Genome 2 Sequencing Strategies and Platforms 3 Sequencing Libraries 4 Raw Data Processing and Quality Assessment 5 Assembly Strategies and Tools 6 Assembly Quality Assessment 7 Further Improvement of the Assembly 8 What is a finished Assembly? 9 Downstream Processing

Basics De-novo Assembly Changed after Green 2001 Reference-based Mapping Wikipedia

Basics Genome Assembly Steps Changed after Green 2001

Basics Genome Assembly Steps Definition: Kmer Short, unique element of DNA sequence of length n

(1) A priori Information about the Genome Expected Genome Size Expected Repeat Content Expected Heterozygosity Diploid or Polyploid?

(1) A priori Information about the Genome Genome Size A Genome Size (C-Value) 0 25 50 100 Insects Arachnoids Amphibians Reptiles Birds Fishes Mammals B C-values obtained from: www.genomesize.com/ 50 60 )

(1) A priori Information about the Genome A Genome Size Genome Size (C-Value) 0 25 50 100 Insects Arachnoids Amphibians Reptiles Birds Genome Size Range in Fishes Fishes Mammals B at Content (%) 0 40 50 60

Fishes (1) A priori Information about the Genome Mammals Genome Size and Repeat Content B Repeat Content (%) 10 20 30 40 50 60 0 1 2 3 4 5 6 Genome Size (Gb)

(1) A priori Information about the Genome Genome Synteny Poelstra et al. 2014

(1) A priori Information about the Genome Genome Size Helpful Online Databases: Genomes www.ncbi.nlm.nih.gov/genome/browse www.gigadb.org Genome Sizes (C-values) www.genomesize.com

(1) A priori Information about the Genome Expected Genome Size Expected Repeat Content Expected Heterozygosity Diploid or Polyploid?

(2) Sequencing Strategies and Platforms First Generation Sequencing Sanger Sequencing Second Generation Sequencing (PCR Needed) Illumina: MiSeq & HiSeq Roche: 454 Life Science: IONtorrent & IONproton ABI: SOLiD Third Generation Sequencing (Single Molecule Sequencing) Helicos Biosciences: Heliscope Pacific Biosciences: PacBio RS II Oxford Nanopore: MinION & GridION

(2) Sequencing Strategies and Platforms Illumina Illumina Sequencing Metzker 2010 Metzker 2010

(2) Sequencing Strategies and Platforms PacBio Pacific Biosciences Eid et al. 2009

(2) Sequencing Strategies and Platforms Nanopore Oxford Nanopore Schadt et al. 2010; Clarke et al. 2009

(2) Sequencing Strategies and Platforms Nanopore First Generation Sequencing Sanger Sequencing Second Generation Sequencing (PCR Needed) Illumina: MiSeq & HiSeq Roche: 454 Life Science: IONtorrent & IONproton ABI: SOLiD Third Generation Sequencing (Single Molecule Sequencing) Helicos Biosciences: Heliscope Pacific Biosciences: PacBio RS II Oxford Nanopore: MinION & GridION

(3) Sequencing Libraries Illumina PE Libraries Illumina Paired-end Sequencing Libraries Illumina Homepage

(3) Sequencing Libraries Illumina MP Libraries Illumina Mate Pair Sequencing Libraries

(3) Sequencing Libraries Library Recipe Allpaths-LG Recipe Custom Recipe for 1.3Gb Bird Genome (<15% Repeats) 180bp PE - 1 lane HiSeq (30x coverage) 650bp PE - 1 lane HiSeq (30x coverage) 5kb/8kb MP - 1 lane HiSeq (20x coverage) Custom Recipe for 2.5Gb Mammal Genome (<50% Repeats) 180bp PE - 4 lanes HiSeq (50x coverage) 3kb MP - 4 lanes HiSeq (40x coverage) 8kb/15kb MP - 1 lane (10x coverage)

(3) Sequencing Libraries BAC and Fosmid Libraries Fosmid Vector Bacterial F-plasmid <40kb Insert Size Bacterial Artificial Chromosome (BAC) <300kb Insert size

(4) Raw Data Processing and Quality Assessment Read Quality Assessment Base Quality Phred Score: Q Phred = -10 log 10 P(error) e.g. a Phred score of 20 translates to a 1% error rate GC Content Sequence Duplication Levels Kmer Content... Software Tools FastQC Preqc

(4) Raw Data Processing and Quality Assessment Fastq Format Quality Score Encoding

(4) Raw Data Processing and Quality Assessment Read Quality Assessment Base Quality Phred Score: Q Phred = -10 log 10 P(error) e.g. a Phred score of 20 translates to a 1% error rate GC Content Sequence Duplication Levels Kmer Content... Software Tools FastQC Preqc

(4) Raw Data Processing and Quality Assessment Genome Size Estimation from Read Data G = pn(l k+1) λ k G = Genome Size pn = proportion of correct reads l = read length k = kmer length λ k = mode of the k-mer count histogram Simpson 2013, arxiv

(4) Raw Data Processing and Quality Assessment Error Correction Error Correction (EC) using the k-mer spectrum BLESS SGA CUDA DecGPU Euler Musket Quake Reptile EC using kmer counts RACER SHREC HiTEC HSHREC PSAEC EC using Multiple Sequence Alignment Coral Echo MyHybrid

(4) Raw Data Processing and Quality Assessment Error Correction

(4) Raw Data Processing and Quality Assessment Error Correction Error Correction (EC) using the k-mer spectrum BLESS SGA CUDA DecGPU Euler Musket Quake Reptile EC using kmer counts RACER SHREC HiTEC HSHREC PSAEC EC using Multiple Sequence Alignment Coral Echo MyHybrid

(4) Raw Data Processing and Quality Assessment Error Correction Adapter and Low Quality Base Trimming Skewer (Jiang et al. 2014) AdapterRemoval (Lindgreen 2012) Trimmomatic (Bolger et al. 2014) Contamination Filtering Blast Allpaths-LG Removing Low Frequency k-mers Discarding Scaffolds Shorter than 1kb

(5) Assembly Strategies and Tools de Bruijn graph Bridges of Koenigsberg problem (Graph Theory by Euler) Compeau et al. 2011

(5) Assembly Strategies and Tools de Bruijn graph Compeau et al. 2011

(5) Assembly Strategies and Tools de Bruijn graph Berger et al. 2013

(5) Assembly Strategies and Tools de Bruijn graph Overlap-layout-consensus vs. de Bruijn graph Li et al. 2011, Briefings in Functional Genomcis Overlap-layout-consensus based Tools: SGA (Simpson and Durbin 2012), Celera assembler (Myers et al. 2002) de Bruijn graph based Tools: Allpaths-LG (Gnerre et al. 2011), Abyss (Simpson et al. 2009), SOAPdenovo (Luo et al. 2012)

(5) Assembly Strategies and Tools Assembly Tools Short Read Assembler Allpaths-LG (Gnerre et al. 2011) Abyss (Simpson et al. 2009) SOAPdenovo (Luo et al. 2012) Platanus (Kajitani et al. 2014) SSAKE (Warren et al. 2007) SGA (Simpson and Durbin 2012) Celera assembler (Myers et al. 2002) MaSuRCA (Zimin et al. 2013) Long Read Assembler Allpaths-LG (Gnerre et al. 2011) DBG2OLC (Ye et al. 2014) PBcR - Celera Assembler (Berlin et al. 2014) Short or Long Read Scaffolder SSPACE (Boetzer et al. 2011) PBjelly (English et al. 2012)

(5) Assembly Strategies and Tools ABySS ABySS command Output Files little eagle-unitigs.fa little eagle-contigs.fa little eagle-scaffolds.fa Get Assembly Stats: abyss-fac little eagle-scaffolds.fa

(6) Assembly Quality Assessment By annotation: CEGMA (Parra et al. 2007) Using RNA transcripts: Baa.pl (Ryan 2013/4, Arxiv) De novo likelihood-based measures (LAP; Ghodsi et al. 2013) Feature Response Curve (FRC; Vezzi et al. 2012) Assembly Statistics Definition: N50 contig length also called L50 A contig N50 is calculated by first ordering every contig by length from longest to shortest. Next, starting from the longest contig, the lengths of each contig are summed, until this running sum equals one-half of the total length of all contigs in the assembly. The contig N50 of the assembly is the length of the shortest contig in this list. (Yandel and Ence 2012)

(6) Assembly Quality Assessment Feature Response Curve Features: Mate-pair Orientations and Separations Repeat Content by k-mer Analysis Depth-of-coverage Correlated Polymorphism in the Read Alignments Read Alignment Breakpoints

(6) Assembly Quality Assessment Feature Response Curve By annotation: CEGMA (Parra et al. 2007) Using RNA transcripts: Baa.pl (Ryan 2013/4, Arxiv) De novo likelihood-based measures (LAP; Ghodsi et al. 2013) Feature Response Curve (FRC; Vezzi et al. 2012) Assembly Statistics Definition: N50 contig length also called L50 A contig N50 is calculated by first ordering every contig by length from longest to shortest. Next, starting from the longest contig, the lengths of each contig are summed, until this running sum equals one-half of the total length of all contigs in the assembly. The contig N50 of the assembly is the length of the shortest contig in this list. (Yandel and Ence 2012)

(7) Further Improvement of the Assembly Overview Gap Filling GapCloser (Luo et al. 2012) GapFiller (Nadalin et al. 2011) Resolving Mis-Assemblies REAPR (Hunt et al. 2013) NxRepair (Murphy et al. 2014) Genome Merging Metassembler (Wences and Schatz 2015, Arxiv) GAM-NGS (Vicedomini et al. 2013)

(7) Further Improvement of the Assembly REAPR

(7) Further Improvement of the Assembly Overview Gap Filling GapCloser (Luo et al. 2012) GapFiller (Nadalin et al. 2011) Resolving Mis-Assemblies REAPR (Hunt et al. 2013) NxRepair (Murphy et al. 2014) Genome Merging Metassembler (Wences and Schatz 2015, Arxiv) GAM-NGS (Vicedomini et al. 2013)

(7) Further Improvement of the Assembly Assembly Merging (Wences and Schatz 2015, Arxiv)

(7) Further Improvement of the Assembly Overview Gap Filling GapCloser (Luo et al. 2012) GapFiller (Nadalin et al. 2011) Resolving Mis-Assemblies REAPR (Hunt et al. 2013) NxRepair (Murphy et al. 2014) Genome Merging Metassembler (Wences and Schatz 2015, Arxiv) GAM-NGS (Vicedomini et al. 2013)

(7) Further Improvement of the Assembly Basic Workflow Allpaths-LG ABySS SOAPdenovo2 assembly using Allpaths-LG, etc. GapCloser Gap filling to resolve diffcult regions REAPR Quality assessment and breakup of misassemblies GapCloser Gap filling to resolve misassembled areas Final Assembly

(8) What is a finished Assembly? What is a good assembly and when is it finished? No Finished Assemblies >100 Euchromatic Gaps in Human Genome Drosophila melongaster release 6.1 (More Centric Heterochromatic Sequences Added) Contig L50 Close to or Bigger than Average Gene Size Number of Scaffolds Should be Close to Number of Chromosomes

(9) Downstream Processing Downstream Processing Repeat Annotation Gene Annotation Mapping to get Diploid Genome

Appendix - Sequencing Platforms and Libraries Roche 454 Roche 454

Appendix - Sequencing Platforms and Libraries ABI SOLiD ABI SOLiD

Appendix - Sequencing Platforms and Libraries Helicos Helicos Kircher and Kelso 2010

Appendix - Sequencing Platforms and Libraries Illumina SLR Libraries TruSeq Synthetic Long-Read Libraries (Moleculo)

Appendix - Sequencing Platforms and Libraries Illumina SLR Libraries Moleculo PacBio

Appendix - Sequencing Platforms and Libraries Dovetail Genomics Chicago Library Putnam et al. 2015 (ArXiv)