Genome Assembly Using de Bruijn Graphs. Biostatistics 666

Similar documents
A shotgun introduction to sequence assembly (with Velvet) MCB Brem, Eisen and Pachter

De Novo Assembly of High-throughput Short Read Sequences

Graph structures for represen/ng and analysing gene/c varia/on. Gil McVean

Single Nucleotide Variant Analysis. H3ABioNet May 14, 2014

Why can GBS be complicated? Tools for filtering, error correction and imputation.

Axiom mydesign Custom Array design guide for human genotyping applications

Genome Sequencing-- Strategies

De novo whole genome assembly

MoGUL: Detecting Common Insertions and Deletions in a Population

Sequence Assembly and Alignment. Jim Noonan Department of Genetics

Introduction to RNA sequencing

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es

Mutations during meiosis and germ line division lead to genetic variation between individuals

De novo whole genome assembly

Haploid Assembly of Diploid Genomes

Bioinformatics small variants Data Analysis. Guidelines. genomescan.nl

short read genome assembly Sorin Istrail CSCI1820 Short-read genome assembly algorithms 3/6/2014

Runs of Homozygosity Analysis Tutorial

Developing Tools for Rapid and Accurate Post-Sequencing Analysis of Foodborne Pathogens. Mitchell Holland, Noblis

Population Dynamics. Population: all the individuals of a species that live together in an area

Variant calling in NGS experiments

BST227 Introduction to Statistical Genetics. Lecture 8: Variant calling from high-throughput sequencing data

Mapping strategies for sequence reads

De novo genome assembly with next generation sequencing data!! "

SNP calling and VCF format

Create a Planned Run. Using the Ion AmpliSeq Pharmacogenomics Research Panel Plugin USER BULLETIN. Publication Number MAN Revision A.

Sequencing, Assembling, and Correcting Draft Genomes Using Recombinant Populations

Assembly of Ariolimax dolichophallus using SOAPdenovo2

Personal Genomics Platform White Paper Last Updated November 15, Executive Summary

Genetics of dairy production

Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Supplementary Material

Genome Assembly. J Fass UCD Genome Center Bioinformatics Core Friday September, 2015

De novo Genome Assembly

Germline variant calling and joint genotyping

Genome Assembly: Background and Strategy

The tomato genome re-seq project

Amapofhumangenomevariationfrom population-scale sequencing

Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome

User Bulletin. Veriti 96-Well Thermal Cycler AmpFlSTR Kit Validation. Overview

Read Mapping and Variant Calling. Johannes Starlinger

Variation detection based on second generation sequencing data. Xin LIU Department of Science and Technology, BGI

Bioinformatics Advice on Experimental Design

Using the Potato Genome Sequence! Robin Buell! Michigan State University! Department of Plant Biology! August 15, 2010!

Lees J.A., Vehkala M. et al., 2016 In Review

Workflow of de novo assembly

LightScanner Hi-Res Melting Comparison of Six Master Mixes for Scanning and Small Amplicon and LunaProbes Genotyping

NOW GENERATION SEQUENCING. Monday, December 5, 11

Alignment. J Fass UCD Genome Center Bioinformatics Core Wednesday December 17, 2014

Human SNP haplotypes. Statistics 246, Spring 2002 Week 15, Lecture 1

Genomics and Transcriptomics of Spirodela polyrhiza

Variant Detection in Next Generation Sequencing Data. John Osborne Sept 14, 2012

ABSTRACT COMPUTATIONAL METHODS TO IMPROVE GENOME ASSEMBLY AND GENE PREDICTION. David Kelley, Doctor of Philosophy, 2011

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Number and length distributions of the inferred fosmids.

BIOINFORMATICS 1 SEQUENCING TECHNOLOGY. DNA story. DNA story. Sequencing: infancy. Sequencing: beginnings 26/10/16. bioinformatic challenges

Meraculous-2D: Haplotype-sensitive Assembly of Highly Heterozygous genomes.

The first and only fully-integrated microarray instrument for hands-free array processing

RADSeq Data Analysis. Through STACKS on Galaxy. Yvan Le Bras Anthony Bretaudeau Cyril Monjeaud Gildas Le Corguillé

Targeted Sequencing Reveals Large-Scale Sequence Polymorphism in Maize Candidate Genes for Biomass Production and Composition

Assignment 9: Genetic Variation

Basic Concepts of Human Genetics

SVMerge Output File Format Specification Sheet

Answers to additional linkage problems.

Assessing De-Novo Transcriptome Assemblies

Genomics AGRY Michael Gribskov Hock 331

RareVariantVis 2: R suite for analysis of rare variants in whole genome sequencing data.

Péter Antal Ádám Arany Bence Bolgár András Gézsi Gergely Hajós Gábor Hullám Péter Marx András Millinghoffer László Poppe Péter Sárközy BIOINFORMATICS

BENG 183 Trey Ideker. Genome Assembly and Physical Mapping

Bioinformatics. Ingo Ruczinski. Some selected examples... and a bit of an overview

Gap Filling for a Human MHC Haplotype Sequence

SNP GENOTYPING WITH iplex REAGENTS AND THE MASSARRAY SYSTEM

Genetic load. For the organism as a whole (its genome, and the species), what is the fitness cost of deleterious mutations?

Creation of a PAM matrix

BIOINFORMATICS AND SYSTEM BIOLOGY (INTERNATIONAL PROGRAM)

Whole genome sequencing in drug discovery research: a one fits all solution?

What is genetic variation?

Midterm 1 Results. Midterm 1 Akey/ Fields Median Number of Students. Exam Score

POPULATION GENETICS Winter 2005 Lecture 18 Quantitative genetics and QTL mapping

Next Gen Sequencing. Expansion of sequencing technology. Contents

The first generation DNA Sequencing

Petar Pajic 1 *, Yen Lung Lin 1 *, Duo Xu 1, Omer Gokcumen 1 Department of Biological Sciences, University at Buffalo, Buffalo, NY.

White paper on de novo assembly in CLC Assembly Cell 4.0

Review of whole genome methods

Sequence Variations. Baxevanis and Ouellette, Chapter 7 - Sequence Polymorphisms. NCBI SNP Primer:

ON USING DNA DISTANCES AND CONSENSUS IN REPEATS DETECTION

Alignment methods. Martijn Vermaat Department of Human Genetics Center for Human and Clinical Genetics

Basic Concepts of Human Genetics

Tangram: a comprehensive toolbox for mobile element insertion detection

The Genetics of Parenthood: Background Information

LUMPY: A probabilistic framework for structural variant discovery

Typically, to be biologically related means to share a common ancestor. In biology, we call this homologous

Data Mining and Applications in Genomics

GENE MAPPING. Genetica per Scienze Naturali a.a prof S. Presciuttini

ARTICLE Rapid and Accurate Haplotype Phasing and Missing-Data Inference for Whole-Genome Association Studies By Use of Localized Haplotype Clustering

1/21/ Exploring Mendelian Genetics. What is the principle of independent assortment? Independent Assortment. Biology.

Oral Cleft Targeted Sequencing Project

Expressed Sequence Tags: Clustering and Applications

Chang Xu Mohammad R Nezami Ranjbar Zhong Wu John DiCarlo Yexun Wang

GENE FLOW AND POPULATION STRUCTURE

GENETIC DRIFT INTRODUCTION. Objectives

Transcription:

Genome Assembly Using de Bruijn Graphs Biostatistics 666

Previously: Reference Based Analyses Individual short reads are aligned to reference Genotypes generated by examining reads overlapping each position Works very well for SNPs and relatively well for other types of variant

Shotgun Sequence Reads Typical short read might be <25-100 bp long and not very informative on its own Reads must be arranged (aligned) relative to each other to reconstruct longer sequences

Read Alignment GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA Short Read (30-100 bp) 5 -ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3 Reference Genome (3,000,000,000 bp) The first step in analysis of human short read data is to align each read to genome, typically using a hash table based indexing procedure This process now takes no more than a few hours per million reads Analyzing these data without a reference human genome would require much longer reads or result in very fragmented assemblies

Mapping Quality Measures the confidence in an alignment, which depends on: Size and repeat structure of the genome Sequence content and quality of the read Number of alternate alignments with few mismatches The mapping quality is usually also measured on a Phred scale Idea introduced by Li, Ruan and Durbin (2008) Genome Research 18:1851-1858

Shotgun Sequence Data TAGCTGATAGCTAGATAGCTGATGAGCCCGAT ATAGCTAGATAGCTGATGAGCCCGATCGCTGCTAGCTC ATGCTAGCTGATAGCTAGCTAGCTGATGAGCC AGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTG GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA Sequence Reads 5 -ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3 Reference Genome Reads overlapping a position of interest are used to calculate genotype likelihoods and Interpreted using population information.

Limitations of Reference Based Analyses For some species, no suitable reference genome available The reference genome may be incomplete, particularly near centromeres and telomeres Alignment is difficult in highly variable regions Alignment and analysis methods need to be customized for each type of variant

Assembly Based Analyses Assembly based approaches to study genetic variation Implementation, challenges and examples Approaches that naturally extend to multiple variant types

De Bruijn Graphs A representation of available sequence data Each k-mer (or short word) is a node in the graph Words linked together when they occur consecutively Short Sequence De Bruijn Graph Representation Iqbal (2012)

Effective Read Depth Overlaps must exceed k-mer length to register in a de Bruijn graph This requirement effectively reduces coverage Give read read length L, word length k, and expected depth D DD eeeeeeeeeeeeeeeeee = DD LL kk + 1 LL

Cleaning De Bruijn graphs are typically cleaned before analysis Cleaning involves removing portions of the graph that have very low coverage For example, most paths with depth = 1 and even with depth <= 2 are likely to be errors

Variation in a de Bruijn Graph Variation in sequence produces a bubble in a de Bruijn graph Do all bubbles represent true variation? What are other alternative explanations? Iqbal (2012)

Effective Read Depth - Consequences Consider a simple example where L = 100 With k = 21 Each read includes 80 words Each SNP generates a bubble of length 22 A single read may enable SNP discovery With k = 75 Each read includes 26 words Each SNP generates a bubble of length 76 Multiple overlapping reads required to discover SNP

Properties of de Bruijn Graphs Many useful properties of genome assemblies (including de Bruijn graphs) can be studied using results of Lander and Waterman (1988) Described number of assembled contigs and their lengths as a function of genome size, length of fragments, and required overlap

Lander and Waterman (1988) The genome size G Notation The number of fragments in assembly N The length of sequenced fragments L The fractional overlap required for assembly Ѳ The depth of coverage c = NL/G Probability a clone starts at a position α = N/G

Number of Contigs Ne -c(1-ѳ) Consider the probability that a fragment starts is not linked to another before ending αα 1 αα LL 1 θθ = αα 1 NN/GG GGGG NN 1 θθ cc 1 θθ = ααee Then, the expected number of fragments that are not linked to another is GGGGee cc 1 θθ = NNee cc 1 θθ This is also the number of contigs!

Number of Contigs Number of Contigs (G/L units) Genome Coverage (or depth) c Number of contigs peaks when depth c = (1 α) -1

Contig Lengths Probability a fragment ends the contig: cc 1 θθ ee Probability of contig with exactly j fragments: 1 ee cc 1 θθ jj 1 ee cc 1 θθ The number of contigs with j fragments is: NNee cc 1 θθ cc 1 θθ jj 1 1 ee How many contigs will have 2+ fragments?

Contig Lengths (in bases) The expected contig length, in fragments, is cc 1 θθ EE JJ = ee Each fragment contributes X bases PP XX = mm = 1 αα mm 1 αα for 0 < mm LL(1 θθ) PP XX = LL = 1 αα LL(1 θθ) After some algebra: 1 ee cc 1 θθ EE XX = LL[ cc θθee cc 1 θθ ] The expected contig length in bases is E(X) E(J) LL[ eecc 1 θθ 1 θθ] cc

Contig Lengths Contig Length (in units of L base pairs) Genome Coverage (or depth) c Lander and Waterman also studied gap lengths

Enhanced De Bruijn Graphs Usefulness of a de Bruijn graph increases if we annotate each node with useful information Basic information might include the number of times each word was observed More detailed information might include the specific individuals in which the word was present

Variant Analysis Algorithm 1: Bubble Calling Create a de Bruijn graph of reference genome Bubbles in this graph are paralogous sequences Using a different label, assemble sample of interest Systematically search for bubbles Nodes where two divergent paths eventually connect Iqbal (2012)

Word size k and Accessible Genome Iqbal (2012)

Power of Homozygous Variant Discovery (100-bp reads, no errors) Iqbal (2012)

Power of Heterozygous Variant Discovery (100-bp reads, no errors) Iqbal (2012)

Power of Homozygous Variant Discovery (Simulated 30x genomes, 100-bp reads) Iqbal (2012)

Power of Heterozygous Variant Discovery (Simulated 30x genomes, 100-bp reads) Dotted lines ( ) refer to theoretical expectations. Solid lines (---) refer to simulation results. Iqbal (2012)

Variant Analysis Algorithm 2: Path Divergence Bubble calling requires accurately both alleles Power depends on word length k, allele length, genome complexity and error model Low power for the largest events Path divergence searches for regions where a sample path differs from the reference Especially increases power for deletions Deletion often easier to assemble than reference

Path Divergence Example Black line represents assembly of sample. We can infer a variant between positions a and b, because the path between them differs from reference.

Variant Analysis Algorithm 3: Multi-Sample Analysis Improves upon simple bubble calling by tracking which paths occur on each sample Improved ability to distinguish true variation from paralogous sequence and errors Iqbal (2012)

Classifying Sites Evaluate ratio of coverage along the two branches of each bubble and in each individual If the ratio is uniform across individuals Error: Ratio consistently low for one branch Repeat: Ratio constant across individuals If the ratio varies across individuals Variant: Ratio clusters around 0, ½ and 1 with probability of these outcomes depending on HWE

Variant Analysis Algorithm 4: Genotyping Calculate probability that a certain number of k-mers cover each path To improve accuracy, short duplicate regions within a path can be ignored. Allows likelihood calculation for use in imputation algorithms Iqbal (2012)

Example Application to High Coverage Genome 26x, 100-bp reads, k = 55 2,777,252,792 unique k-mers 2,691,115,653 also in reference 23% more k-mers before cleaning 2,686,963 bubbles found by Bubble Caller 5.6% of these also present in reference 528,651 divergent paths 39.8% of these also present in reference 2,245,279 SNPs, 361,531 short indels, 1,100 large or complex events Reproduces 67% of heterozygotes from mapping (87% of homozygotes)

Comparison to Mapping Based Algorithms

Summary Assembly based algorithms currently reach about 80% of the genome These algorithms can handle different variant types more conveniently than mapping based approaches Incorporating population information allows repeats to be distinguished from true variation

Recommended Reading Iqbal, Caccamo, Turner, Flicek and McVean (2012) Nature Genetics 44:226-232 Lander and Waterman (1988) Genomics 2:231-239