Computational assembly for prokaryotic sequencing projects

Similar documents
Computational assembly for prokaryotic sequencing projects

Outline. The types of Illumina data Methods of assembly Repeats Selecting k-mer size Assembly Tools Assembly Diagnostics Assembly Polishing

De Novo Assembly of High-throughput Short Read Sequences

Genome Assembly: Background and Strategy

10/20/2009 Comp 590/Comp Fall

Lecture 14: DNA Sequencing

Mapping. Main Topics Sept 11. Saving results on RCAC Scaffolding and gap closing Assembly quality

Introduction to metagenome assembly. Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014

Genome Assembly. Background and Approach 28 Jan Jillian Walker Diana Williams

De novo meta-assembly of ultra-deep sequencing data

Faction 2: Genome Assembly Lab and Preliminary Data

The Basics of Understanding Whole Genome Next Generation Sequence Data

Introduction to NGS Analysis Tools

Genome Assembly Background and Strategy

GENOME ASSEMBLY FINAL PIPELINE AND RESULTS

TruSPAdes: analysis of variations using TruSeq Synthetic Long Reads (TSLR)

De novo genome assembly with next generation sequencing data!! "

Genomics AGRY Michael Gribskov Hock 331

C3BI. VARIANTS CALLING November Pierre Lechat Stéphane Descorps-Declère

A shotgun introduction to sequence assembly (with Velvet) MCB Brem, Eisen and Pachter

Master's Degree Project

The first generation DNA Sequencing

Experimental Design. Sequencing. Data Quality Control. Read mapping. Differential Expression analysis

De novo assembly of human genomes with massively parallel short read sequencing. Mikk Eelmets Journal Club

Bioinformatics for Genomics

N50 must die!? Genome assembly workshop, Santa Cruz, 3/15/11

UAB DNA-Seq Analysis Workshop. John Osborne Research Associate Centers for Clinical and Translational Science

short read genome assembly Sorin Istrail CSCI1820 Short-read genome assembly algorithms 3/6/2014

Genome Assembly. J Fass UCD Genome Center Bioinformatics Core Friday September, 2015

CDC s Advanced Molecular Detection (AMD) Sequence Data Analysis and Management

Genomic DNA ASSEMBLY BY REMAPPING. Course overview

Computational Genomics [2017] Faction 2: Genome Assembly Results, Protocol & Demo

Short Read Alignment to a Reference Genome

De novo assembly in RNA-seq analysis.

CSE182-L16. LW statistics/assembly

Mate-pair library data improves genome assembly

A Roadmap to the De-novo Assembly of the Banana Slug Genome

De novo whole genome assembly

RNA-Seq de novo assembly training

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es

Genome Assembly Using de Bruijn Graphs. Biostatistics 666

NOW GENERATION SEQUENCING. Monday, December 5, 11

Bioinformatic analysis of Illumina sequencing data for comparative genomics Part I

Bioinformatics? Assembly, annotation, comparative genomics and a bit of phylogeny.

de novo paired-end short reads assembly

The Basics of Understanding Whole Genome Next Generation Sequence Data

Assembly of Ariolimax dolichophallus using SOAPdenovo2

Ecole de Bioinforma(que AVIESAN Roscoff 2014 GALAXY INITIATION. A. Lermine U900 Ins(tut Curie, INSERM, Mines ParisTech

de novo Transcriptome Assembly Nicole Cloonan 1 st July 2013, Winter School, UQ

Transcriptome analysis

BIOINFORMATICS ORIGINAL PAPER

De novo whole genome assembly

Bioinformatics- Data Analysis

Mapping Next Generation Sequence Reads. Bingbing Yuan Dec. 2, 2010

Sequence Assembly and Alignment. Jim Noonan Department of Genetics

Data Basics. Josef K Vogt Slides by: Simon Rasmussen Next Generation Sequencing Analysis

Genome Assembly Software for Different Technology Platforms. PacBio Canu Falcon. Illumina Soap Denovo Discovar Platinus MaSuRCA.

Bioinformatics Tools and Pipelines for Real-Time Pathogen Surveillance

Consensus Ensemble Approaches Improve De Novo Transcriptome Assemblies

SUPPLEMENTARY INFORMATION

NGS part 2: applications. Tobias Österlund

Total RNA isola-on End Repair of double- stranded cdna

DNA concentration and purity were initially measured by NanoDrop 2000 and verified on Qubit 2.0 Fluorometer.

DE NOVO WHOLE GENOME ASSEMBLY AND SEQUENCING OF THE SUPERB FAIRYWREN. (Malurus cyaneus) JOSHUA PEÑALBA LEO JOSEPH CRAIG MORITZ ANDREW COCKBURN

Developing Tools for Rapid and Accurate Post-Sequencing Analysis of Foodborne Pathogens. Mitchell Holland, Noblis

Assembling a Cassava Transcriptome using Galaxy on a High Performance Computing Cluster

Assembly and Validation of Large Genomes from Short Reads Michael Schatz. March 16, 2011 Genome Assembly Workshop / Genome 10k

Analytics Behind Genomic Testing

High-Throughput Bioinformatics: Re-sequencing and de novo assembly. Elena Czeizler

De novo whole genome assembly

Francisco García Quality Control for NGS Raw Data

Genome Projects. Part III. Assembly and sequencing of human genomes

de novo metagenome assembly

SCIENCE CHINA Life Sciences. Comparative analysis of de novo transcriptome assembly

Whole-genome sequencing (WGS) of microbes employing nextgeneration sequencing (NGS) technologies enables pathogen

Gap Filling for a Human MHC Haplotype Sequence

Introduction to RNA sequencing

Connect-A-Contig Paper version

Yellow-bellied marmot genome. Gabriela Pinho Graduate Student Blumstein & Wayne Labs EEB - UCLA

IDBA-UD: A de Novo Assembler for Single-Cell and Metagenomic Sequencing Data with Highly Uneven Depth

ChIP-seq and RNA-seq

Workflow of de novo assembly

De Novo and Hybrid Assembly

De Novo Assembly (Pseudomonas aeruginosa MAPO1 ) Sample to Insight

Assembly. Ian Misner, Ph.D. Bioinformatics Crash Course. Bioinformatics Core

White paper on de novo assembly in CLC Assembly Cell 4.0

A Short Sequence Splicing Method for Genome Assembly Using a Three- Dimensional Mixing-Pool of BAC Clones and High-throughput Technology

Validating Bionumerics 7.6: A strategic approach from Oregon

RNA-Seq Module 2 From QC to differential gene expression.

Alignment and Assembly

NEXT GENERATION SEQUENCING. Farhat Habib

Improving Genome Assemblies without Sequencing

Genome Sequencing and Assembly

How to deal with your RNA-seq data?

Compute- and Data-Intensive Analyses in Bioinformatics"

BroadE Workshop: Genome Assembly. March 20 th, 2013

Sequencing techniques

De novo Genome Assembly

ChIP-seq and RNA-seq. Farhat Habib

Transcription:

Computational assembly for prokaryotic sequencing projects Lee Katz, Ph.D. Bioinformatician, Enteric Diseases Laboratory Branch January 21, 2015 Disclaimers The findings and conclusions in this presentation have not been formally disseminated by the Centers for Disease Control and Prevention and should not be construed to represent any agency determination or policy. The findings and conclusions in this [report/presentation] are those of the author(s) and do not necessarily represent the official position of CDC

2

Partners in Public Health 3

Graduated Oct 2010 4

Public Health Impact Award 2011: Helping to rid Africa of meningitis A 5

CDC 2010 - present 6

2011 to present ENTERIC DISEASES LABORATORY BRANCH 7

Thanks to John Besser for letting me borrow this slide August 29, 2014

Thanks to John Besser for letting me borrow this slide PulseNet Detected Outbreaks 2014 (so far)

Lee Katz, Present Currently in the National Enteric Reference Laboratory Vibrio, Campylobacter, Escherichia, Shigella, Yersinia, Salmonella Current major projects Listeria WGS for outbreak analysis Spread of Vibrio cholerae into Haiti and into surrounding areas High quality SNP analysis software (Lyve-SET) Gulf coast V. cholerae etc. 10

One of my projects is #2 on CDC s list of accomplishments for 2013! #2 http://www.cdc.gov/features/endofyear/ (2014 accomplishments have not been published yet) 11

Outline Illumina sequencing Reads Quality control (Q/C) Read metrics Read-cleaning Assembly Algorithms Assembly metrics 12

Prokaryotic Sequencing Projects Stages Sequencing Assembly Feature prediction Functional annotation analysis Display (Genome Browser) Examples Haemophilus influenzae Neisseria meningitidis Bordetella bronchisceptica Vibrio cholerae Listeria monocytogenes Sequencing Assembly prediction annotation Display Fleischman et al. (1995) Whole-Genome Random Sequencing and Assembly of Haemophilus influenzae Rd Science 269:5223 Kislyuk et al. (2010) A computational genomics pipeline for prokaryotic sequencing projects Bioinformatics 26:15

Illumina sequencing (2 nd Gen) I think I ve heard you have sequences from the following machines (?) GAIIx HiSeq 14

15

Illumina sequencing video http://www.youtube.com/watch?v=womkfik WlxM 16

Q/C + cleaning READS 17

Q/C You need to know if your data are good! Example software that give you read metrics FastQC Computational Genomics Pipeline (CG-Pipeline) run_assembly_readmetrics.pl AMOS FastqQC (I promise it s a different program than FastQC) 18

Quality Control 19 FastQC output

Quality Control bioinformatics 20 FastQC output

The CG-Pipeline way run_assembly_readmetrics.pl File avgreadlength totalbases minreadlength maxreadlength avgquality tmp.fastq 80.00 177777760 80 80 35.39 21

READ CLEANING 22

Read cleaning with CG-Pipeline (not validated; please use with caution) Read F. Read R. Read %ACGT Phred http://sourceforge.net/projects/cg-pipeline/ https://github.com/lskatz/cg-pipeline Graphs made with FastqQC (AMOS) 23

1. Trimming low-qual ends run_assembly_trimlowqualends.pl Read F. Read R. Read 1A. %ACGT 1B. Phred http://sourceforge.net/projects/cg-pipeline/ https://github.com/lskatz/cg-pipeline Graphs made with FastqQC (AMOS) 24

2a. Removing duplicate reads 2b. Sometimes: downsampling run_assembly_removeduplicatereads.pl Trimmed reads http://sourceforge.net/projects/cg-pipeline/ https://github.com/lskatz/cg-pipeline 25

3. Trimming and filtering run_assembly_trimclean.pl Min length Min avg. quality 3A. trimming Min length Min avg. quality 3B. filtering http://sourceforge.net/projects/cg-pipeline/ https://github.com/lskatz/cg-pipeline 26

More read cleaning and correction Software CGP documentation: http://cg-pipeline.sourceforge.net/wiki Fastx toolkit http://hannonlab.cshl.edu/fastx_toolkit EA-utils https://code.google.com/p/ea-utils AMOS amos.sourceforge.net Sickle https://github.com/najoshi/sickle Quake http://www.cbcb.umd.edu/software/quake BayesHammer http://bioinf.spbau.ru/spades/bayeshammer and more is out there! Evaluation Fabbro et al 2013, An extensive evaluation of read trimming effects on Illumina NGS data analysis 27

ASSEMBLY CONCEPTS 28

Assembly Overlaps between reads or between nodes in a graph Generate contigs (contiguous sequences) Generate scaffolds NNN N 29

Further reading: Zhang et al 2011 Plos One; Miller et al 2011 Genomics 30

Greedy extension Given a set of sequence fragments the object is to find the shortest common supersequence. Сalculate pairwise alignments of all fragments. Choose two fragments with the largest overlap. Merge chosen fragments. Repeat step 2 and 3 until only one fragment is left. The result is a suboptimal solution to the problem. [citation needed] https://en.wikipedia.org/wiki/sequence_assembly#greedy_algorithm 31

Overlap-layout-consensus Great at longer, fewer, imperfect reads, e.g., Ion Torrent, 454, PacBio Stephen Schlebusch, Nicola Illing. Next generation shotgun sequencing and the challenges of de novo genome assembly. 32 South African Journal of Science. Vol. 108

Derive consensus sequence TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA Derive each consensus base by weighted voting Slide adapted from Andrey Kislyuk, http://www.compgenomics2009.biology.gatech.edu/images/1/12/2009-01-14- compgenomics-kislyuk.pdf 33

De bruijn (kmer) Great at shorter, numerous, less error-prone reads, e.g., Illumina http://www.homolog.us/blogs/blog/2012/06/17/an-intuitive-explanation-for-running-de-bruijn-assembler-with-varyingk-mer-sizes/ 34

Another way to look at de bruijn assembly A (1) D (1) B (2) C (3) F (1) E (2) G (1) Unipath graph of the 1.8-Mb genome of C. jejuni Possible paths: ABCDBCEFCEG ABCEFCDBCEG 35 Butler J et al. (2008) ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. 18:810-820

Another way to look at de bruijn assembly A (1) D (1) B (2) C (3) F (1) E (2) G (1) Unipath graph of the 1.8-Mb genome of C. jejuni Possible paths: ABCDBCEFCEG ABCEFCDBCEG 36 Butler J et al. (2008) ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. 18:810-820

Recap of assembly reads Paired end reads contigs Scaffold NNNNNNNNNN 37

Further reading Ekblom and Wolf 2013, A field guide to wholegenome sequencing, assembly, and annotation. Evolutionary applications I really like the glossary and introductions in this paper 38

Algorithms + metrics ASSEMBLY 39

Some assemblers to try Open source SPAdes Velvet Mira Edena CG-Pipeline (uses SPAdes + Velvet) Proprietary Geneious CLC Genomics Workbench Lasergene Bionumerics These are some ideas but it is up to you, the class, to do a more thorough review! 40

Comparisons of Illumina assemblers Zhang et al 2011 A practical comparison of de novo genome assembly software tools for nextgeneration sequencing technolgies Genome Assembly Gold-standard Evaluations (GAGE) - http://gage.cbcb.umd.edu/ Lin et al 2011, Comparative studies of de novo assembly tools for next-generation sequencing technologies Don t forget to look at many newer assemblers including SPAdes (currently my favorite) 41

CG-Pipeline way for Illumina run_assembly reads.fastq.gz o assembly.fasta -e 2100000 --numcpus 8 # for a 2.1 MB genome 42

Assembler comparison (further reading) GAGE-B: http://ccb.jhu.edu/gage_b/ Assemblathon 2: http://www.gigasciencejournal.com/content/ 2/1/10 GABenchToB: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0107014#pone-0107014- g003 43

ASSEMBLY DIFFICULTIES 44

One problem: randomly low coverage Assuming random distribution of reads and ignoring repeat resolution issues, G= genome length L = length of a single read N= number of reads sequenced T= minimum overlap to align the reads together Then overall coverage is C = LN/G (Lander-Waterman) Coverage for any given base obeys the Poisson distribution: The number of gaps(bases with 0 coverage) is: A good Lander-Waterman reference: http://www.math.ucsd.edu/~gptesler/186/shotgun_08-handout.pdf Lander and Waterman (1988). Genomic Mapping by Fingerprinting random clones: a mathematical analysis. Genomics 2 pp. 231-239 45 http://www.cmb.usc.edu/papers/msw_papers/msw-081.pdf

Major Problem: repeat elements A (1) D (1) B (2) C (3) F (1) E (2) G (1) Unipath graph of the 1.8-Mb genome of C. jejuni Possible paths: ABCDBCEFCEG ABCEFCDBCEG 46 Butler J et al. (2008) ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. 18:810-820

Another way to look at de bruijn assembly A (1) D (1) B (2) C (3) F (1) E (2) G (1) Unipath graph of the 1.8-Mb genome of C. jejuni Possible paths: ABCDBCEFCEG ABCEFCDBCEG 47 Butler J et al. (2008) ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. 18:810-820

A quick note on reference assembly It is possible to map reads to a reference genome using short-read mappers, and then deriving a consensus sequence What makes a good reference genome? Must be a closely related genome (same species or same serogroup or same sequence type ) Relatively few SNPs (>90 or 95% identity to your genome) Relatively few rearrangements 48

Software to try Mapping software Smalt BWA Bowtie/Bowtie2 SNP-calling software Samtools mpileup or bcftools FreeBayes GATK Varscan SolSNP Must know how to use samtools for this route 49

Reference assembly notes To my knowledge no paper exists that compares reference assemblers (could be wrong) Assembly could be biased Miss genomic islands Ends of contigs might be tapered Good practice for reference assembly: perform de novo assembly on unused reads just in case you missed something 50

ASSEMBLY EVALUATION 51

Assembly Metrics How do you tell if your assembly is good? Metric description reference Assembly Length Number of Contigs N50 Longest Contig Average contig length Kmer21 GC-content The size of the concatenated assembly The count of contigs The size of the contig at where half the genome is located in size >N50 and half is located in size <N50 Frequency of kmers with k=21 Percentage of the genome that is either G or C http://www.homolog.us/blogs/blog/2012/06/26/ what-is-wrong-with-n50-how-can-we-make-itbetter-part-ii/ Assembly score Log(N50/numberOfContigs) CG-Pipeline/Lee Katz 52

QUAST: http://bioinf.spbau.ru/quast 53

QUAST results 54

The CG-Pipeline way $ run_assembly_metrics.pl assembly.fasta column t File genomelength N50 numcontigs assemblyscore assembly.fasta 2992976 2992976 1 14.9117787680924 55

AMOS amosvalidate http://amos.sourceforge.net (build this from source control to avoid bugs) 56

Reminder: your goal is to make the very best assembly, and here are ways to go above and beyond. POST ASSEMBLY METHODS 57

Merging assemblies You can take the output of two assemblers and merge them! However, this could lead to misassemblies and so you might not want to risk combining two assemblies. Some software: NGS-GAM Minimus Reconciliator 58

Verifying the assembly View the pileups and see if you agree with base calls and synteny Hawkeye (AMOS) IGV viewer - very smooth GUI Artemis viewer - many bells and whistles Tablet viewer - no bells and whistles (lightweight) Samtools tview (command line interface) Look for bad base calls; look for indications of bad structure/synteny 59

Sort contigs Compare to other genomes; sort contigs Must have a closely related reference genome (See: reference assembly slides) MAUVE Contig Mover Mummer (mummerplot) Abacas CONTIGuator A sorted genome assembly will be easier to analyze and there will probably be fewer mistakes in an analysis Result: fasta file with sorted and unsorted contigs 60

Making scaffolds Babmus2 (AMOS) IMAGE SOPRA SSPACE Use paired ends to give an idea if two contigs belong together NNN N Result: sorted scaffolds with Ns (unknown bases) 61

Close gaps in scaffolds IMAGE GapFiller FinIS SOAPdenovo2 GapCloser Basic idea: map reads back onto your assembly and make base calls where gaps are Result: scaffold with fewer or no Ns 62

Acknowledgements Every single compgenomics class since 2008 Many slides were borrowed from my 2014 talk For letting me off work, my supervisor Cheryl Tarr Many, many others who I work with on a daily basis 63

Geny Credit to Gladys Gonzalez-Aviles in PulseNet, CDC Based on a figure in Katz et al 2013 MBio 64

Questions? Lee Katz lkatz@cdc.gov The findings and conclusions in this presentation have not been formally disseminated by the Centers for Disease Control and Prevention and should not be construed to represent any agency determination or policy. The findings and conclusions in this presentation are those of the author and do not necessarily represent the official position of CDC 65