De novo whole genome assembly

Similar documents
De novo whole genome assembly

De novo whole genome assembly

Outline. The types of Illumina data Methods of assembly Repeats Selecting k-mer size Assembly Tools Assembly Diagnostics Assembly Polishing

De novo genome assembly with next generation sequencing data!! "

Genome Assembly Software for Different Technology Platforms. PacBio Canu Falcon. Illumina Soap Denovo Discovar Platinus MaSuRCA.

Workflow of de novo assembly

Sequence Assembly and Alignment. Jim Noonan Department of Genetics

Introduction to metagenome assembly. Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014

De novo assembly of human genomes with massively parallel short read sequencing. Mikk Eelmets Journal Club

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es

Genome Assembly. J Fass UCD Genome Center Bioinformatics Core Friday September, 2015

Looking Ahead: Improving Workflows for SMRT Sequencing

A shotgun introduction to sequence assembly (with Velvet) MCB Brem, Eisen and Pachter

TruSPAdes: analysis of variations using TruSeq Synthetic Long Reads (TSLR)

De novo assembly in RNA-seq analysis.

High-Throughput Bioinformatics: Re-sequencing and de novo assembly. Elena Czeizler

short read genome assembly Sorin Istrail CSCI1820 Short-read genome assembly algorithms 3/6/2014

A Roadmap to the De-novo Assembly of the Banana Slug Genome

DE NOVO WHOLE GENOME ASSEMBLY AND SEQUENCING OF THE SUPERB FAIRYWREN. (Malurus cyaneus) JOSHUA PEÑALBA LEO JOSEPH CRAIG MORITZ ANDREW COCKBURN

Mate-pair library data improves genome assembly

Why are we here? Introduction

De Novo and Hybrid Assembly

De Novo Assembly of High-throughput Short Read Sequences

Assembly of Ariolimax dolichophallus using SOAPdenovo2

Genomic resources. for non-model systems

The Diploid Genome Sequence of an Individual Human

Assembly and Validation of Large Genomes from Short Reads Michael Schatz. March 16, 2011 Genome Assembly Workshop / Genome 10k

Next Generation Sequences & Chloroplast Assembly. 8 June, 2012 Jongsun Park

De novo Genome Assembly

State of the art de novo assembly of human genomes from massively parallel sequencing data

Genome Assembly: Background and Strategy

De novo meta-assembly of ultra-deep sequencing data

Genome Assembly Using de Bruijn Graphs. Biostatistics 666

Add 2016 GBS Poster As Slide One

Yellow-bellied marmot genome. Gabriela Pinho Graduate Student Blumstein & Wayne Labs EEB - UCLA

de novo paired-end short reads assembly

De novo assembly of complex genomes using single molecule sequencing

NOW GENERATION SEQUENCING. Monday, December 5, 11

Genome Assembly CHRIS FIELDS MAYO-ILLINOIS COMPUTATIONAL GENOMICS WORKSHOP, JUNE 19, 2018

Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Supplemental Materials

Current'Advances'in'Sequencing' Technology' James'Gurtowski' Schatz'Lab'

DNA Sequencing and Assembly

de novo Transcriptome Assembly Nicole Cloonan 1 st July 2013, Winter School, UQ

UHT Sequencing Course Large-scale genotyping. Christian Iseli January 2009

Genome Projects. Part III. Assembly and sequencing of human genomes

De novo genome assembly. Dr Torsten Seemann

Genome Sequencing and Assembly

Biol 478/595 Intro to Bioinformatics

Next Genera*on Sequencing II: Personal Genomics. Jim Noonan Department of Gene*cs

CSE182-L16. LW statistics/assembly

Next-Generation Sequencing. Technologies

BIOINFORMATICS ORIGINAL PAPER

High Throughput Sequencing Technologies. UCD Genome Center Bioinformatics Core Monday 15 June 2015

Haploid Assembly of Diploid Genomes

Supplementary Figure 1. Design of the control microarray. a, Genomic DNA from the

Genome Assembly, part II. Tandy Warnow

Genome Assembly of the Obligate Crassulacean Acid Metabolism (CAM) Species Kalanchoë laxiflora

Comparison and Evaluation of Cotton SNPs Developed by Transcriptome, Genome Reduction on Restriction Site Conservation and RAD-based Sequencing

10/20/2009 Comp 590/Comp Fall

Mapping strategies for sequence reads

Genome Assembly Background and Strategy

GENOME ASSEMBLY FINAL PIPELINE AND RESULTS

Experimental Design. Sequencing. Data Quality Control. Read mapping. Differential Expression analysis

Genome Sequencing-- Strategies

Lecture 14: DNA Sequencing

Consensus Ensemble Approaches Improve De Novo Transcriptome Assemblies

NEXT GENERATION SEQUENCING. Farhat Habib

Purpose of sequence assembly

de novo metagenome assembly

Figure S1. Data flow of de novo genome assembly using next generation sequencing data from multiple platforms.

CSCI2950-C DNA Sequencing and Fragment Assembly

Understanding Accuracy in SMRT Sequencing

Data Basics. Josef K Vogt Slides by: Simon Rasmussen Next Generation Sequencing Analysis

The Irys System. Rapid Genome Wide Mapping for de novo Assembly and Structural Variation Analysis. Jack Peart, Ph.D. Director of Sales EMEA

Compute- and Data-Intensive Analyses in Bioinformatics"

RNA-Sequencing analysis

Outline. DNA Sequencing. Whole Genome Shotgun Sequencing. Sequencing Coverage. Whole Genome Shotgun Sequencing 3/28/15

Bioinformatics for Genomics

SUPPLEMENTARY INFORMATION

The Human Genome and its upcoming Dynamics

Alignment and Assembly

Genome 373: Mapping Short Sequence Reads II. Doug Fowler

Outline General NGS background and terms 11/14/2016 CONFLICT OF INTEREST. HLA region targeted enrichment. NGS library preparation methodologies

Supplementary Figure 1 Genotyping by Sequencing (GBS) pipeline used in this study to genotype maize inbred lines. The 14,129 maize inbred lines were

BIOINFORMATICS 1 SEQUENCING TECHNOLOGY. DNA story. DNA story. Sequencing: infancy. Sequencing: beginnings 26/10/16. bioinformatic challenges

Bioinformatic analysis of Illumina sequencing data for comparative genomics Part I

The Basics of Understanding Whole Genome Next Generation Sequence Data

AMOS Assembly Validation and Visualization

The Bioluminescence Heterozygous Genome Assembler

Metagenomic 3C, full length 16S amplicon sequencing on Illumina, and the diabetic skin microbiome

The MaSuRCA genome Assembler Aleksey Zimin 1,*, Guillaume Marçais 1, Daniela Puiu 2, Michael Roberts 1, Steven L. Salzberg 2, and James A.

Lecture 10 : Whole genome sequencing and analysis. Introduction to Computational Biology Teresa Przytycka, PhD

Transcriptome analysis

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Number and length distributions of the inferred fosmids.

Assembling a Cassava Transcriptome using Galaxy on a High Performance Computing Cluster

Human Genome Sequencing Over the Decades The capacity to sequence all 3.2 billion bases of the human genome (at 30X coverage) has increased

Analysis of RNA-seq Data

Title: High-quality genome assembly of channel catfish, Ictalurus punctatus

Analysis of genome-wide genotype data

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Transcription:

De novo whole genome assembly Lecture 1 Qi Sun Minghui Wang Bioinformatics Facility Cornell University

DNA Sequencing Platforms Illumina sequencing (100 to 300 bp reads) Overlapping reads ~180bp fragment Paired-end reads Mate pair reads <500bp fragment >5kb fragment Soap denovo AllPaths LG >250 bppcr free reads Discovar PacBiolong-read (10 15 kb reads) 10-15 kb SMRT analysis Illumina Hybrid tools

De novo whole genome assembly Paired-end sequencing reads Contiging Contigs Mate-pair long insert reads Scaffolding Scaffolds NNNN

Two categories of contiging strategies Long reads overlap layout consensus Short reads de-bruijn-graph Celera assembler, 454 assembler Velvet, Soap-denovo, Abyss, Trinity et al source: http://www.homolog.us/tutorials/index.php

de-bruijn-graph for contiging short reads Kmers source: http://www.homolog.us/tutorials/index.php

Kmer counting from sequencing reads AATTTGTGAATGGCT 1 ATGCAAGACCAATTG 128 ACAAATAGGGGTAGT 1 AACACAAAATAATAA 1 GCTGTAAGCTTTAAC 1 AACCTATGAGTGAAA 1 ATTCGTAGATCTTCC 101 CAGGTATCAGGCATG 146 TATCAGCTGGAGTCA 60 ACCAGAGATATAAAA 1 GGTCCATTTTGTTTA 1 GTCAAAAAATTACGA 1 ACATTCCCTCGGTAA 1 AGCATTTCTTAACAC 1 TAAAAACCATCTTTA 137 ATTCGATTTGTCAAG 62 ACATTGGAAAAAACT 1 AATGGAAATACAATA 2 CATTCGTTAGATCAA 1 CATTCGTGAGATCAA 128 CCTTGTCATTAACTC 667 (15mers) Kmer depth distribution plots 31mer 15mer

Impact of kmer-length (1) Short kmerswould collapse paralogousregions, and result in more crossovers Paralogous regions GGATGGAAGTCG CGATGGAAGGAT 7mer GGATTGGA CCGATTGGA GATGGAA ATGGAAG TGGAAGT TGGAAGG 9mer GGATGGAAGT CGATGGAAGG GATGGAAGT GATGGAAGG

Impact of kmer-length (2) Read depth vs Kmerdepth 80mer 30mer Kmerdepth =1 Kmerdepth =3 Read depth: 3 Longer kmers could results in too little kmer depth

Sequencing errors and Tips, bubbles and crosslinks Tips Bubbles source: http://www.homolog.us/tutorials/index.php Crosslinks

Deal with sequencing errors and repetitive regions 1. Sequencing errors Remove low depth kmersin a bubble; Too long kmers would cause coverage problem; 2. Repetitive region Longer kmers Tiny repeat: Separate the path Break boundary between low and high copy regions

Examine an assembly software 1. Soap denovo

Heterozygous genomes Experimentally: Create inbred lines or haploid cell culture. Assembly of clonal fragments, merging allelic regions. Assembly SNP: merge bubbles Highly polymorphic regions Kmer coverage distribution Highly polymorphic regions

Schematic overview of the Platanus algorithm. Examine an assembly software 2. Platanus Rei Kajitani et al. Genome Res. 2014;24:1384-1395 2014 Kajitani et al.; Published by Cold Spring Harbor Laboratory Press

Scaffolding strategies 1. Mate-pair reads Contigs Illumina mate pair data: 5kb, 8kb or 15kb inserts Software: Soap denovo, et al

Scaffolding strategies 2. Long reads PacBio long reads: Long but with high error rate Current size: 10-15 kb S. Koren, et al. Nature Biotechnology, 20:693 (2012)

Scaffolding strategies 3. Illumina Moleculo Generate synthetic long reads forde novoassembly and genome finishing applications Perform whole human genome phasing to identify co-inherited alleles, haplotype information, and phase de novo mutations

Scaffolding strategies 4. Physical maps BioNano: Generating high-quality genome maps by labeling specific 7-mer nickase recognition sites in a genome with a single-color fluorophore Hi-C: Sequence cross-linked and ligated DNA fragments

Generate kmer file /programs/jellyfish-2.1.4/bin/jellyfish count -s 1G -m 21 -t 8 -o kmer -C SRR1554178.fastq /programs/jellyfish-2.1.4/bin/jellyfish dump -c -t kmer > kmer21.txt CATTATTGGAGTTGATGAAAA 134 CCGCCACCACCACCACCGCCA 52 TAATTAATTTCAAATTTATAA 1 CCCTGTAGAAGCCTGTGTACC 1 AAAAAAAAATTTGTAGGGGGG 79 GGCATCTAGGTCATGTATTTA 3 AAATGAAAGTACATCAGCTTA 1 AAAATTTGTTCTACCACCGCC 2 CGAGGAGAATTGGGAAAAAAA 103 CTATTTGTGGAATCATAGATC 1 CATCACAAGAACCTATAGAAG 1 CGCCAACCATCCATTATCAAA 1 AAAGTATCGCCAACTTTTAAT 1 CAAAATGAATTTCTTATTTTC 76 GGAAATATGGAGTAGTTACCA 1 CTGATTTCTCTCTCTATCTCC 1 ATTTATCTCGCAAATTTTAGA 1 CAAAGAGAGAATCACAACAAA 86 awk '{print $2}' kmer21.txt sort -n uniq -c \ awk '{print $2 "\t" $1}'> histogram 1 14747539 2 798919 3 131160 4 39150 5 19202 6 11755 7 8734 8 6752 9 5795 10 4679 11 3848 12 3232 13 2978 14 2819 15 2655 16 2373

Estimate genome size based on kmer distribution 100 60272 101 64317 102 68183 103 72164 104 75022 105 78876 106 81907 107 85077 108 88059 109 89946 110 91659 111 91950 112 92535 113 91935 114 91286 115 90194 116 87616 117 85215 118 82041 119 78751 120 75306 Peak kmer depth: 112

Estimate genome size based on kmer distribution Step 1: convert read depth to kmerdepth N = M* L/(L-K+1) 75 mer Read depth = 3 Kmerdepth = 2 M: kmerdepth = 112 L: read length = 101 bp K: Kmersize =21 bp N: read depth =140 Step 2: genome size is total sequenced basepairs devided by read depth Genome size = T/N T: total base pairs = 0.505 gb N: read depth = 140 Genome size: 3.6 mb

Estimate genome size based on kmer distribution R code to fit Poisson distribution Total kmerbases: 387 mb Kmer depth: 113 Genome size: 3.4 mb data <- read.table(file="histogram",header=f) poisson_expeact_k=function(file,start=3,end=200,step=0.1){ diff<-1e20 min<-1e20 pos<-0 total<-sum(as.numeric(file[start:dim(file)[1],1]*file[start:dim(file)[1],2])) singlecopy_total <- sum(as.numeric(file[10:500,1]*file[10:500,2])) for (i in seq(start,end,step)) { singlec <- singlecopy_total/i a<-sum((dpois(start:end, i)*singlec-file[start:end,2])^2) if (a < diff){ pos<-i diff<-a } } print(paste("total kmer bases: ", total)) print(paste("kmer depth: ", pos)) print(paste("genome size: ", total/pos)) pdf("kmerplot.pdf") plot(1:200,dpois(1:200, pos)*singlec, type = "l", col=3, lwd=2, lty=2, ylim=c(0,150000), xlab="k-mer coverage",ylab="k-mer counts frequency") lines(file[1:200,],type="l",lwd=2) dev.off() } poisson_expeact_k(data)

Using short kmer could underestimate genome size (<20) Low kmer depth could overestimate genome size (<20)

Exercise: Run kmer analysis tools to estimate genome size You will be provided with a Fastqfile. Using Jellyfish to get the kmercount, then estimate the kmer depth /programs/jellyfish-2.1.4/bin/jellyfish count -s 1G -m 21 -t 4 -o kmer -C SRR1554178.fastq /programs/jellyfish-2.1.4/bin/jellyfish dump -c -t kmer > kmer21.txt