De novo whole genome assembly

Similar documents
De novo whole genome assembly

Workflow of de novo assembly

Sequence Assembly and Alignment. Jim Noonan Department of Genetics

De novo genome assembly with next generation sequencing data!! "

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es

A Roadmap to the De-novo Assembly of the Banana Slug Genome

A shotgun introduction to sequence assembly (with Velvet) MCB Brem, Eisen and Pachter

Genome Assembly. J Fass UCD Genome Center Bioinformatics Core Friday September, 2015

Assembly of Ariolimax dolichophallus using SOAPdenovo2

De Novo Assembly of High-throughput Short Read Sequences

High Throughput Sequencing Technologies. UCD Genome Center Bioinformatics Core Monday 15 June 2015

Next-Generation Sequencing. Technologies

Mate-pair library data improves genome assembly

De novo Genome Assembly

BIOINFORMATICS 1 SEQUENCING TECHNOLOGY. DNA story. DNA story. Sequencing: infancy. Sequencing: beginnings 26/10/16. bioinformatic challenges

De novo genome assembly. Dr Torsten Seemann

Announcements. Coffee! Evalua,on. Dr. Yoshiki Sasai, R.I.P.

NOW GENERATION SEQUENCING. Monday, December 5, 11

Why can GBS be complicated? Tools for filtering, error correction and imputation.

Title: High-quality genome assembly of channel catfish, Ictalurus punctatus

Introductie en Toepassingen van Next-Generation Sequencing in de Klinische Virologie. Sander van Boheemen Medical Microbiology

Illumina (Solexa) Throughput: 4 Tbp in one run (5 days) Cheapest sequencing technology. Mismatch errors dominate. Cost: ~$1000 per human genme

RNA-Sequencing analysis

COPE: An accurate k-mer based pair-end reads connection tool to facilitate genome assembly

Mapping strategies for sequence reads

Next Generation Sequencing Lecture Saarbrücken, 19. March Sequencing Platforms

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Genome Assembly, part II. Tandy Warnow

Genome Sequencing-- Strategies

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Graph structures for represen/ng and analysing gene/c varia/on. Gil McVean

Genomics AGRY Michael Gribskov Hock 331

Outline. General principles of clonal sequencing Analysis principles Applications CNV analysis Genome architecture

The tomato genome re-seq project

Outline. DNA Sequencing. Whole Genome Shotgun Sequencing. Sequencing Coverage. Whole Genome Shotgun Sequencing 3/28/15

Next Gen Sequencing. Expansion of sequencing technology. Contents

Next Generation Sequencing. Jeroen Van Houdt - Leuven 13/10/2017

Introduction to NGS Analysis Tools

Genomics and Transcriptomics of Spirodela polyrhiza

Connect-A-Contig Paper version

Meraculous-2D: Haplotype-sensitive Assembly of Highly Heterozygous genomes.

RADSeq Data Analysis. Through STACKS on Galaxy. Yvan Le Bras Anthony Bretaudeau Cyril Monjeaud Gildas Le Corguillé

High Throughput Sequencing Technologies. J Fass UCD Genome Center Bioinformatics Core Monday June 16, 2014

Genome Sequencing. I: Methods. MMG 835, SPRING 2016 Eukaryotic Molecular Genetics. George I. Mias

Introduction to RNA-Seq

Variation detection based on second generation sequencing data. Xin LIU Department of Science and Technology, BGI

Analysis of Structural Variants using 3 rd generation Sequencing

Lees J.A., Vehkala M. et al., 2016 In Review

Sequencing the genomes of Nicotiana sylvestris and Nicotiana tomentosiformis Nicolas Sierro

De novo assembly of human genomes with massively parallel short read sequencing

Sequencing techniques and applications

DNA-Sequencing. Technologies & Devices. Matthias Platzer. Genome Analysis Leibniz Institute on Aging - Fritz Lipmann Institute (FLI)

Hybrid Error Correction and De Novo Assembly with Oxford Nanopore

Analysing genomes and transcriptomes using Illumina sequencing

Introduction to RNA sequencing

DNA-Sequencing. Technologies & Devices. Matthias Platzer. Genome Analysis Leibniz Institute on Aging - Fritz Lipmann Institute (FLI)

Sequencing Theory. Brett E. Pickett, Ph.D. J. Craig Venter Institute

Gap Filling for a Human MHC Haplotype Sequence

Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads

Data Analysis with CASAVA v1.8 and the MiSeq Reporter

ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Number and length distributions of the inferred fosmids.

Human genome sequence

BENG 183 Trey Ideker. Genome Assembly and Physical Mapping

Harnessing the power of RADseq for ecological and evolutionary genomics

Third Generation Sequencing

Next Generation Sequencing for Metagenomics

Researching Cancer with the minion: Methylation and Structural Variation

RNASEQ WITHOUT A REFERENCE

Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Supplementary Material

Supplementary Figures and Tables

Bioinformatics small variants Data Analysis. Guidelines. genomescan.nl

HLA and Next Generation Sequencing it s all about the Data

BST227 Introduction to Statistical Genetics. Lecture 8: Variant calling from high-throughput sequencing data

Lectures 18, 19: Sequence Assembly. Spring 2017 April 13, 18, 2017

Measuring transcriptomes with RNA-Seq

Sept 2. Structure and Organization of Genomes. Today: Genetic and Physical Mapping. Sept 9. Forward and Reverse Genetics. Genetic and Physical Mapping

RNA-seq Data Analysis

Opportunities offered by new sequencing technologies

SNP calling and VCF format

RNA-Seq Software, Tools, and Workflows

Targeted Sequencing Using Droplet-Based Microfluidics. Keith Brown Director, Sales

A near perfect de novo assembly of a eukaryotic genome using sequence reads of greater than 10 kilobases generated by the Pacific Biosciences RS II

Computational assembly for prokaryotic sequencing projects

Introduction to Bioinformatics

ABSTRACT COMPUTATIONAL METHODS TO IMPROVE GENOME ASSEMBLY AND GENE PREDICTION. David Kelley, Doctor of Philosophy, 2011

Introduction: Methods:

Genome Assembly Workshop Titles and Abstracts

Applications of PacBio Single Molecule, Real- Time (SMRT) DNA Sequencing

Local assembly and pre-mrna splicing analyses by high-throughput sequencing data

From Variants to Pathways: Agilent GeneSpring GX s Variant Analysis Workflow

Genome Sequence of Medicago sativa: Cultivated Alfalfa at the Diploid Level (CADL)

Genetic Exercises. SNPs and Population Genetics

RIPTIDE HIGH THROUGHPUT RAPID LIBRARY PREP (HT-RLP)

Jenny Gu, PhD Strategic Business Development Manager, PacBio

Next Generation Sequencing Technologies. Some slides are modified from Robi Mitra s lecture notes

Next Generation Sequencing: An Overview

de novo sequencing of the sunflower genome

Quantitative Genetics

Transcription:

De novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell University

Data generation Sequencing Platforms Short reads: Illumina Long reads: PacBio; Oxford Nanopore Contiging/Scaffolding Platforms Linked-Reads: 10X Chromium Optical Mapping: BioNano Hi-C: Dovetail; Phase Genomics

DNA fragment length vs Sequencing read length DNA fragment Sequencing Read: ACGGGAGGGACCCG

Short-read Sequencing Platform: Illumina Paired-end Fragment: <500bp fragment Read: 100-250 bp Mate pair reads Fragment: 5kb, 8kb or 15kb Read: 50 bp

Illumina Reads Assembly Strategy Contiging Paired-end reads Contigs Scaffolding Mate-pair reads Scaffolds NNNN

Long-read Sequencing Platform: PacBio Fragment: >>10kb Read: 10-20 kb

PacBio: Long read (>10kb); High error rate (>10%) Error correction methods Self error correction (100x depth) By Illumina reads Chin et al.(2013) Nature Methods 10:563

Oxford Nanopore Error rate is catching up to PacBio but much much cheaper MinION

Software for short-read assembly Easy SOAP de novo: easy to run; relatively fast Discovar: require longer reads (>=250bp); MaSuRCA: high quality assembly; slow and memory intensive difficult

Two categories of contiging strategies Long reads overlap layout consensus Short reads de-bruijn-graph Canu, Falcon, Celera et al. Velvet, Soap-denovo, Abyss, Trinity et al source: http://www.homolog.us/tutorials/index.php

de-bruijn-graph for contiging short reads Kmers source: http://www.homolog.us/tutorials/index.php

Kmer Paths in paralogous regions Paralogous regions GGATGGAAGTCG CGATGGAAGGAT 7mer GGATTGGA CCGATTGGA GATGGAA ATGGAAG TGGAAGT TGGAAGG 9mer GGATGGAAGT CGATGGAAGG GATGGAAGT GATGGAAGG

Sequencing errors and Tips, bubbles and crosslinks Tips Bubbles source: http://www.homolog.us/tutorials/index.php Crosslinks

Deal with sequencing errors and repetitive regions 1. Sequencing errors Remove low depth kmers in a bubble; Too long kmers would cause coverage problem; 2. Repetitive region Longer kmers Tiny repeat: Separate the path Break boundary between low and high copy regions

Read depth vs Kmer depth Read depth: number of reads at a genome position Read depth: 4 Kmer depth: number of occurrence of an identical kmer Kmer occurance: 3

Impact of kmer-length (2) Read depth vs Kmer depth 80mer 30mer Kmer depth =1 Kmer depth =3 Read depth: 3 Longer kmers could results in too little kmer depth

Kmer counting from sequencing reads AATTTGTGAATGGCT 1 ATGCAAGACCAATTG 128 ACAAATAGGGGTAGT 1 AACACAAAATAATAA 1 GCTGTAAGCTTTAAC 1 AACCTATGAGTGAAA 1 ATTCGTAGATCTTCC 101 CAGGTATCAGGCATG 146 TATCAGCTGGAGTCA 60 ACCAGAGATATAAAA 1 GGTCCATTTTGTTTA 1 GTCAAAAAATTACGA 1 ACATTCCCTCGGTAA 1 AGCATTTCTTAACAC 1 TAAAAACCATCTTTA 137 ATTCGATTTGTCAAG 62 ACATTGGAAAAAACT 1 AATGGAAATACAATA 2 CATTCGTTAGATCAA 1 CATTCGTGAGATCAA 128 CCTTGTCATTAACTC 667 (15mers) Kmer depth distribution plots 31mer 15mer

Heterozygous genomes Experimentally: Create inbred lines or haploid cell culture. Assembly of clonal fragments, merging allelic regions. Assembly SNP: merge bubbles Highly polymorphic regions Kmer coverage distribution Highly polymorphic regions

Examine an assembly software 1. Soap denovo

Scaffolding strategies 1. Mate-pair reads Contigs Illumina mate pair data: 5kb, 8kb or 15kb inserts Software: Soap denovo, et al

Schematic overview of the Platanus algorithm. Examine an assembly software 2. Platanus Rei Kajitani et al. Genome Res. 2014;24:1384-1395 2014 Kajitani et al.; Published by Cold Spring Harbor Laboratory Press

Scaffolding strategies 2. Illumina Moleculo and 10X Illumina Moleculo Scaffolding contigs Phasing

Scaffolding strategies 3. Physical maps BioNano: Generating high-quality genome maps by labeling specific 7-mer nickase recognition sites in a genome with a single-color fluorophore Hi-C: Sequence cross-linked and ligated DNA fragments

Eucaryotic genome assembly of in 2016 Small and Inbred Highly heterozyougs/polyploidy Illumina paired-end + mate-pair OR PacBio long reads Scaffolding by physical mapping BioNano Hi-C 10X Pseudo-chromsoomes through genetics mapping

Exercise: Estimate genome size based on kmer distribution /programs/jellyfish-2.1.4/bin/jellyfish count -s 1G -m 21 -t 8 -o kmer -C SRR1554178.fastq /programs/jellyfish-2.1.4/bin/jellyfish dump -c -t kmer > kmer21.txt CATTATTGGAGTTGATGAAAA 134 CCGCCACCACCACCACCGCCA 52 TAATTAATTTCAAATTTATAA 1 CCCTGTAGAAGCCTGTGTACC 1 AAAAAAAAATTTGTAGGGGGG 79 GGCATCTAGGTCATGTATTTA 3 AAATGAAAGTACATCAGCTTA 1 AAAATTTGTTCTACCACCGCC 2 CGAGGAGAATTGGGAAAAAAA 103 CTATTTGTGGAATCATAGATC 1 CATCACAAGAACCTATAGAAG 1 CGCCAACCATCCATTATCAAA 1 AAAGTATCGCCAACTTTTAAT 1 CAAAATGAATTTCTTATTTTC 76 GGAAATATGGAGTAGTTACCA 1 CTGATTTCTCTCTCTATCTCC 1 ATTTATCTCGCAAATTTTAGA 1 CAAAGAGAGAATCACAACAAA 86 awk '{print $2}' kmer21.txt sort -n uniq -c \ awk '{print $2 "\t" $1}'> histogram 1 14747539 2 798919 3 131160 4 39150 5 19202 6 11755 7 8734 8 6752 9 5795 10 4679 11 3848 12 3232 13 2978 14 2819 15 2655 16 2373

Exercise: Estimate genome size based on kmer distribution 100 60272 101 64317 102 68183 103 72164 104 75022 105 78876 106 81907 107 85077 108 88059 109 89946 110 91659 111 91950 112 92535 113 91935 114 91286 115 90194 116 87616 117 85215 118 82041 119 78751 120 75306 Peak kmer depth: 112

Estimate genome size based on kmer distribution Step 1: convert read depth to kmer depth N = M* L/(L-K+1) 75 mer Read depth = 3 Kmer depth = 2 M: kmer depth = 112 L: read length = 101 bp K: Kmer size =21 bp N: read depth =140 Step 2: genome size is total sequenced basepairs devided by read depth Genome size = T/N T: total base pairs = 0.505 gb N: read depth = 140 Genome size: 3.6 mb

Estimate genome size based on kmer distribution R code to fit Poisson distribution Total kmer bases: 387 mb Kmer depth: 113 Genome size: 3.4 mb data <- read.table(file="histogram",header=f) poisson_expeact_k=function(file,start=3,end=200,step=0.1){ diff<-1e20 min<-1e20 pos<-0 total<-sum(as.numeric(file[start:dim(file)[1],1]*file[start:dim(file)[1],2])) singlecopy_total <- sum(as.numeric(file[10:500,1]*file[10:500,2])) for (i in seq(start,end,step)) { singlec <- singlecopy_total/i a<-sum((dpois(start:end, i)*singlec-file[start:end,2])^2) if (a < diff){ pos<-i diff<-a } } print(paste("total kmer bases: ", total)) print(paste("kmer depth: ", pos)) print(paste("genome size: ", total/pos)) pdf("kmerplot.pdf") plot(1:200,dpois(1:200, pos)*singlec, type = "l", col=3, lwd=2, lty=2, ylim=c(0,150000), xlab="k-mer coverage",ylab="k-mer counts frequency") lines(file[1:200,],type="l",lwd=2) dev.off() } poisson_expeact_k(data)

Using short kmer could underestimate genome size (<20) Low kmer depth could overestimate genome size (<20) - Minghui Wang

Exercise: Run kmer analysis tools to estimate genome size You will be provided with a Fastq file. Using Jellyfish to get the kmer count, then estimate the kmer depth /programs/jellyfish-2.2.3/bin/jellyfish count -s 1G -m 21 -t 4 -o kmer -C SRR1554178.fastq /programs/jellyfish-2.2.3/bin/jellyfish dump -c -t kmer > kmer21.txt