De novo Genome Assembly

Similar documents
De novo genome assembly. Dr Torsten Seemann

Introduction to metagenome assembly. Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es

Assembly. Ian Misner, Ph.D. Bioinformatics Crash Course. Bioinformatics Core

Mate-pair library data improves genome assembly

Mapping strategies for sequence reads

De Novo and Hybrid Assembly

The Basics of Understanding Whole Genome Next Generation Sequence Data

ChIP-seq and RNA-seq

De novo whole genome assembly

10/20/2009 Comp 590/Comp Fall

We begin with a high-level overview of sequencing. There are three stages in this process.

Genomic resources. for non-model systems

ChIP-seq and RNA-seq. Farhat Habib

Lecture 14: DNA Sequencing

Finishing of Fosmid 1042D14. Project 1042D14 is a roughly 40 kb segment of Drosophila ananassae

De Novo Assembly of High-throughput Short Read Sequences

Genomic Technologies. Michael Schatz. Feb 1, 2018 Lecture 2: Applied Comparative Genomics

Whole Human Genome Sequencing Report This is a technical summary report for PG DNA

Compositional Analysis of Homogeneous Regions in Human Genomic DNA

Sundaram DGA43A19 Page 1. Finishing Drosophila grimshawi Fosmid: DGA43A19 Varun Sundaram 2/16/09

Sequence Assembly and Alignment. Jim Noonan Department of Genetics

High Throughput Sequencing the Multi-Tool of Life Sciences. Lutz Froenicke DNA Technologies and Expression Analysis Cores UCD Genome Center

UHT Sequencing Course Large-scale genotyping. Christian Iseli January 2009

Poisson Distribution in Genome Assembly

TruSPAdes: analysis of variations using TruSeq Synthetic Long Reads (TSLR)

De novo whole genome assembly

Bioinformatics Course AA 2017/2018 Tutorial 2

De novo whole genome assembly

Data Basics. Josef K Vogt Slides by: Simon Rasmussen Next Generation Sequencing Analysis

Connect-A-Contig Paper version

Wet-lab Considerations for Illumina data analysis

de novo Transcriptome Assembly Nicole Cloonan 1 st July 2013, Winter School, UQ

Consensus Ensemble Approaches Improve De Novo Transcriptome Assemblies

De novo assembly in RNA-seq analysis.

De novo assembly of human genomes with massively parallel short read sequencing. Mikk Eelmets Journal Club

The Human Genome and its upcoming Dynamics

Understanding Accuracy in SMRT Sequencing

The Human Genome Project

short read genome assembly Sorin Istrail CSCI1820 Short-read genome assembly algorithms 3/6/2014

DNA concentration and purity were initially measured by NanoDrop 2000 and verified on Qubit 2.0 Fluorometer.

Alignment and Assembly

Outline. The types of Illumina data Methods of assembly Repeats Selecting k-mer size Assembly Tools Assembly Diagnostics Assembly Polishing

Bioinformatic analysis of Illumina sequencing data for comparative genomics Part I

DE NOVO WHOLE GENOME ASSEMBLY AND SEQUENCING OF THE SUPERB FAIRYWREN. (Malurus cyaneus) JOSHUA PEÑALBA LEO JOSEPH CRAIG MORITZ ANDREW COCKBURN

From Infection to Genbank

A shotgun introduction to sequence assembly (with Velvet) MCB Brem, Eisen and Pachter

Genome Assembly. J Fass UCD Genome Center Bioinformatics Core Friday September, 2015

Next Gen Sequencing. Expansion of sequencing technology. Contents

DNA Sequencing and Assembly

Lander-Waterman Statistics for Shotgun Sequencing Math 283: Ewens & Grant 5.1 Math 186: Not in book

Introduction to Bioinformatics

Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014

N50 must die!? Genome assembly workshop, Santa Cruz, 3/15/11

Genomics. Data Analysis & Visualization. Camilo Valdes

Crash-course in genomics

Genome Projects. Part III. Assembly and sequencing of human genomes

Purpose of sequence assembly

Next Generation Sequencing. Tobias Österlund

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Read Complexity

De novo meta-assembly of ultra-deep sequencing data

Genome Resequencing. Rearrangements. SNPs, Indels CNVs. De novo genome Sequencing. Metagenomics. Exome Sequencing. RNA-seq Gene Expression

DNA. bioinformatics. genomics. personalized. variation NGS. trio. custom. assembly gene. tumor-normal. de novo. structural variation indel.

Greene 1. Finishing of DEUG The entire genome of Drosophila eugracilis has recently been sequenced using Roche

Analysis of structural variation. Alistair Ward - Boston College

Analysis of NGS data. resources. Grid computing workshop 2015 Jan Oppelt, NCBR & CEITEC MU 1 st December, 2015

Whole Genome Amplification (WGA): What to Do When You Don t Have Enough Genomic DNA

Bioinformatics for Genomics

Assembly of Ariolimax dolichophallus using SOAPdenovo2

Complete Genomics Technology Overview

Genome Assembly Using de Bruijn Graphs. Biostatistics 666

RNA-Seq de novo assembly training

Genome Assembly Software for Different Technology Platforms. PacBio Canu Falcon. Illumina Soap Denovo Discovar Platinus MaSuRCA.

Supplementary Table 1. Summary of whole genome shotgun sequence used for genome assembly

NGS developments in tomato genome sequencing

Genome Assembly CHRIS FIELDS MAYO-ILLINOIS COMPUTATIONAL GENOMICS WORKSHOP, JUNE 19, 2018

The Resurgence of Reference Quality Genome Sequence

Workflow of de novo assembly

BroadE Workshop: Genome Assembly. March 20 th, 2013

Molecular Cell Biology - Problem Drill 06: Genes and Chromosomes

BIOINFORMATICS ORIGINAL PAPER

CSE182-L16. LW statistics/assembly

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Analysis of structural variation. Alistair Ward USTAR Center for Genetic Discovery University of Utah

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Analytics Behind Genomic Testing

Microbiome: Metagenomics 4/4/2018

Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome

DNA vs. RNA DNA: deoxyribonucleic acid (double stranded) RNA: ribonucleic acid (single stranded) Both found in most bacterial and eukaryotic cells RNA

Human Genome Sequencing Over the Decades The capacity to sequence all 3.2 billion bases of the human genome (at 30X coverage) has increased

How much sequencing do I need? Emily Crisovan Genomics Core September 26, 2018

Finishing Drosophila ananassae Fosmid 2410F24

The goal of this project was to prepare the DEUG contig which covers the

How much sequencing do I need? Emily Crisovan Genomics Core

Building a platinum human genome assembly from single haplotype human genomes. Karyn Meltz Steinberg PacBio UGM December,

Finishing of DFIC This project sought to finish DFIC , the terminal 45 kb of the Drosophila

Next Generation Sequencing. Jeroen Van Houdt - Leuven 13/10/2017

NEXT GENERATION SEQUENCING. Farhat Habib

GENETICS - CLUTCH CH.15 GENOMES AND GENOMICS.

Deep Sequencing technologies

Transcription:

De novo Genome Assembly A/Prof Torsten Seemann Winter School in Mathematical & Computational Biology - Brisbane, AU - 3 July 2017

Introduction

The human genome has 47 pieces MT (or XY)

The shortest piece is 48,000,000 bp Total haploid size 3,200,000,000 bp (3.2 Gbp) Total diploid size 6,400,000,000 bp (6.4 Gbp) = 6,400,000,000 bp

In an ideal world... AGTCTAGGATTCGCTATAG ATTCAGGCTCTGATATATT TCGCGGCATTAGCTAGAGA TCTCGAGATTCGTCCCAGT CTAGGATTCGCTAT AAGTCTAAGATTC... Human DNA isequencer 46 chromosomal and 1 mitochondrial sequences

The real world ( for now ) Genome Short fragments Sequencing Reads are stored in FASTQ files

Read lengths 100-300 bp 100-400 bp 5,000-40,000+ bp 1,000-897,000+ bp

Assemble Compare

Genome assembly (the red pill)

De novo genome assembly Reconstruct the original genome from the sequence reads only Millions of short sequences (reads) A few long sequences (contigs) Ideally, one sequence per replicon.

De novo genome assembly From scratch

An example

A small genome I ll return them tomorrow! Friends, Romans, countrymen, lend me your ears;

Shakespearomics Reads ds, Romans, count ns, countrymen, le Friends, Rom send me your ears; crymen, lend me Whoops! I dropped them.

Shakespearomics Reads ds, Romans, count ns, countrymen, le Friends, Rom send me your ears; crymen, lend me I am good with words. Overlaps Friends, Rom ds, Romans, count ns, countrymen, le crymen, lend me send me your ears;

Shakespearomics Reads ds, Romans, count ns, countrymen, le Friends, Rom send me your ears; crymen, lend me We have reached a consensus! Overlaps Friends, Rom ds, Romans, count ns, countrymen, le crymen, lend me send me your ears; Majority consensus Friends, Romans, countrymen, lend me your ears; (1 contig)

Overlap - Layout - Consensus Amplified DNA Shear DNA Sequenced reads Overlaps Layout Consensus Contigs

Assembly graphs (not Excel bar charts)

Overlap graph

Another example is this

Overlaps find

The graph one can simplify Not, look is fully self contained within the other longer read, and can be removed.

Do the graph traverse Size matters not. Look at me. Judge me by my size, do you? Size matters not. Look at me. Judge me by my soze, do you? 2 supporting reads 1 supporting reads

So far, so good.

Why is it so hard?

What makes a jigsaw puzzle hard? Lots of pieces No box Frayed pieces Missing pieces Repetitive regions :., Dirty pieces Multiple copies No corners (circular genomes)

What makes genome assembly hard 1. Many pieces (read length is very short compared to the genome) Size of the human genome = 3.2 x 109 bp (3,200,000,000) Typical short read length = 102 bp (100) A puzzle with millions to billions of pieces Storing in RAM is a challenge succinct data structure

What makes genome assembly hard 2. Lots of overlaps Finding overlaps means examining every pair of reads Comparisons = N (N-1)/2 ~ N2 Lots of smart tricks to reduce this close to ~N

What makes genome assembly hard 3. Lots of sky (short repeats) ATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA How could we possibly assemble this segment of DNA? All the reads are the same!

What makes genome assembly hard 4. Dirty pieces (sequencing errors) Read 1: GGAACCTTTGGCCCTGT Read 2: GGCGCTGTCCATTTTAGAAACC What counts as overlapping? Minimum overlap length Minimum DNA indentity

What makes genome assembly hard 5. Multiple copies (long repeats) Gene Copy 1 Gene Copy 2 Our old nemesis the REPEAT!

Repeats

What is a repeat? A segment of DNA that occurs more than once in the genome

Major classes of repeats in the human genome Repeat Class Arrangement Coverage (Hg) Length (bp) Satellite (micro, mini) Tandem 3% 2-100 SINE Interspersed 15% 100-300 Transposable elements Interspersed 12% 200-5k LINE Interspersed 21% 500-8k rdna Tandem 0.01% 2k-43k Segmental Duplications Tandem or Interspersed 0.2% 1k-100k

Repeats Repeat copy 1 Repeat copy 2 1 locus 4 contigs Collapsed repeat consensus

Repeats cause graph ambiguity Repeat copy 1 Repeat copy 2 1 locus Overlap graph

Repeats are hubs in the graph

Long reads can span repeats Repeat copy 1 Repeat copy 2 long reads 1 contig

Draft vs Finished genomes (bacteria) 250 bp - Illumina - $200 10,000 bp - Pacbio - $2000

The two laws of repeats 1. It is impossible to resolve repeats of length L unless you have reads longer than L. 2. It is impossible to resolve repeats of length L unless you have reads longer than L.

How much data do we need?

Coverage vs Depth Coverage (aka Breadth) Depth Fraction of the genome sequenced by at least one read Average number of reads that cover any given region Intuitively more reads should increase coverage and depth

Example: Read length = ⅛ Genome Length Coverage = ⅜ Depth = ⅜ Genome Reads

Example: Read length = ⅛ Genome Length Coverage = 4.5/8 Depth = 5/8 Genome Reads Newly Covered Regions = 1.5 Reads

Example: Read length = ⅛ Genome Length Coverage = 5.2/8 Depth = 7/8 Genome Reads Newly Covered Regions = 0.7 Reads

Example: Read length = ⅛ Genome Length Coverage = 6.7/8 Depth = 10/8 Genome Reads Newly Covered Regions = 1.5 Reads

Depth is easy to calculate Depth = N L / G Depth = Y / G N = Number of reads L = Length of a read G = Genome length Y = Sequence yield (N x L)

Example: Coverage of the Tarantula genome The size estimate of the tarantula genome based on k-mer analysis is 6 Gb and we sequenced at 40x coverage from a single female A. geniculata Sangaard et al 2014. Nature Comm (5) p1-11 Depth = N L / G 40 = N x 100 / (6 Billion) N = 40 * 6 Billion / 100 N = 2.4 Billion

Coverage and depth are related Approximate formula for coverage assuming random reads Coverage = 1 - e-depth

Much more sequencing needed in reality Sequencing is not random GC and AT rich regions are under represented Other chemistry quirks More depth needed for: sequencing errors polymorphisms Pacbio Illumina (both)

Assessing assemblies

Contiguity Completeness Correctness

Contiguity Desire Fewer contigs Longer contigs Metrics Number of contigs Average contig length Median contig length Maximum contig length N50, NG50, D50

Contiguity: the N50 statistic

Completeness : Total size Proportion of the original genome represented by the assembly Can be between 0 and 1 Proportion of estimated genome size but estimates are not perfect

Completeness: core genes Proportion of coding sequences can be estimated based on known core genes thought to be present in a wide variety of organisms. Assumes that the proportion of assembled genes is equal to the proportion of assembled core genes. In the past this was done with a tool called CEGMA There is a new tool for this called BUSCO Number of Core Genes in Assembly Number of Core Genes in Database

Correctness Proportion of the assembly that is free from mistakes Errors include 1. 2. 3. 4. Mis-joins Repeat compressions Unnecessary duplications Indels / SNPs caused by assembler

Correctness: check for self consistency Align all the reads back to the contigs Look for inconsistencies Assembly Align Original Reads Mapped Read

Assemble ALL the things

Why assemble genomes Produce a reference for new species Genomic variation can be studied by comparing against the reference A wide variety of molecular biological tools become available or more effective Coding sequences can be studied in the context of their related non-coding (eg regulatory) sequences High level genome structure (number, arrangement of genes and repeats) can be studied

Vast majority of genomes remain unsequenced Eukaryotes Bacteria & Archaea Estimated number of species ~9M Estimated number of species Millions to Billions Whole genome sequences (including incomplete drafts) ~ 3000 Genomic sequences ~ 600,000 Every individual has a different genome Some diseases such as cancer give rise to massive genome level variation

Not just genomes Transcriptomes One contig for every isoform Do not expect uniform coverage Meta-genomes Mixture of different organisms Host, bacteria, virus, fungi all at once All different depths Meta-transcriptomes Combination of above!

Conclusions

Take home points De novo assembly is the process of reconstructing a long sequence from many short ones Represented as a mathematical overlap graph Assembly is very challenging ( impossible ) because sequencing bias under represents certain regions Reads are short relative to genome size Repeats create tangled hubs in the assembly graph Sequencing errors cause detours and bubbles in the assembly graph

Acknowledgments Nick Hamilton Ira Cooke Ann-Marie Patch Katia Nones Simon Gladman Mark Ragan Anna Syme

Contact tseemann.github.io torsten.seemann@gmail.com @torstenseemann

The End