short read genome assembly Sorin Istrail CSCI1820 Short-read genome assembly algorithms 3/6/2014



Genomathica Assembler: a Mathematica notebook for genome assembly simulation.
The assembler can be found at: http://cs.brown.edu/courses/csci1820/software/minimal_assembler.nb
The sample FASTA genome phix174.fasta can be found in HW5 Biology: http://cs.brown.edu/courses/csci1820/software/phix174.fasta
Remember to change the input genome to your FASTA file's location.
Evaluate each cell initially; after that, you only need to evaluate the last two cells to re-run the assembly and display the results, respectively.
Mathematica can be downloaded here: http://www.brown.edu/information-technology/software/

Figures: assemblies at coverage = 1, 2, 3, 4, and 5. Sequence reads are in black; contiguous strings of assembled DNA (contigs) are in red.

coverage = 2, paired ends

Sample prep, raw sequence reads, sequence data.
Wet-lab experimental methods to isolate, prepare, and sequence the DNA result in a number of large FASTQ files.
FastQC can be used to check basic statistics of the files: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Many other tools are available for QC, e.g. http://hannonlab.cshl.edu/fastx_toolkit/

Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Available at: www.genome.gov/sequencingcosts. Accessed April 2013.

http://www.ncbi.nlm.nih.gov/traces/sra/

Genome assembly software
Overlap-layout-consensus:
Celera: http://wgs-assembler.sourceforge.net/
K-mer based:
Velvet: http://www.ebi.ac.uk/~zerbino/velvet/
SOAP-denovo: http://soap.genomics.org.cn/soapdenovo.html
ALLPATHS-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/
IDBA-UD: http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/

Two graph models: a first graph model. Nodes (vertices) are contiguous sequences of k characters (k-mers). There is a directed edge from v_i to v_j if v_i[2..k] = v_j[1..k-1]. Example (k = 3): the sequence ACGTTC gives the nodes ACG, CGT, GTT, TTC.
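The first graph model can be sketched in a few lines of Python. This is a minimal illustration, not the course's Mathematica assembler; the function names `kmers` and `overlap_graph` are made up for the example, and the input is the slide's sequence ACGTTC with k = 3.

```python
def kmers(seq, k):
    """Return the k-mers of seq in order of appearance (duplicates kept)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def overlap_graph(seq, k):
    """Nodes are the distinct k-mers of seq.

    There is an edge u -> v whenever the last k-1 characters of u
    equal the first k-1 characters of v, i.e. u[2..k] = v[1..k-1]
    in the 1-based notation of the slide.
    """
    nodes = sorted(set(kmers(seq, k)))
    return {u: [v for v in nodes if u[1:] == v[:-1]] for u in nodes}


graph = overlap_graph("ACGTTC", 3)
for u, successors in graph.items():
    print(u, "->", successors)
```

Note that the edge test only looks at the two node labels, so this model connects any two k-mers that overlap, whether or not the joined (k+1)-mer was ever observed.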

Two graph models: the de Bruijn graph. Nodes (vertices) are contiguous sequences of k-1 characters ((k-1)-mers). There is a directed edge from v_i to v_j if v_i[1..k-1] + v_j[k-1] is a valid k-mer. Example (k = 3): the sequence ACGTTC has the 3-mers ACG, CGT, GTT, TTC, which become edges between the nodes AC, CG, GT, TT, TC.
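A minimal Python sketch of the de Bruijn construction follows. It assumes "valid k-mer" means a k-mer that occurs in the input reads; the function name `de_bruijn` and the use of a `Counter` to record edge multiplicities are choices made for this example only.

```python
from collections import Counter


def de_bruijn(reads, k):
    """Build de Bruijn edges from reads.

    Every k-mer in a read contributes one edge from its first k-1
    characters to its last k-1 characters. The Counter value is the
    number of times that k-mer was seen across all reads.
    """
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


edges = de_bruijn(["ACGTTC"], 3)
for (u, v), count in edges.items():
    print(u, "->", v, "seen", count, "time(s)")
```

Because edges come only from observed k-mers, the graph never grows to all 4^k possible k-mers.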

Note edges that are not reflected in the input! Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly

Genome assembly: building the k-mer graph. Nodes as k-mers, edges as (k-1) overlaps.

Genome assembly (figure, built up one read at a time). Genome: GACGTACGTT. Reads: GACGTA, CGTACG, TACGTT. Nodes are k-mers; each edge is labeled with the number of times that overlap is seen in the reads.

After read GACGTA: k=4 nodes GACG, ACGT, CGTA; k=3 nodes GAC, ACG, CGT, GTA; all edge counts are 1.

After read CGTACG: k=4 adds GTAC, TACG; k=3 adds TAC, and the k=3 edge CGT -> GTA has count 2.

After read TACGTT: k=4 adds CGTT; k=3 adds GTT, and the k=3 edges ACG -> CGT, CGT -> GTA, and TAC -> ACG have count 2.

Genome assembly: building the k-mer graph. Nodes as k-mers, edges as (k-1) overlaps; or nodes as (k-1)-mers, edges form k-mers.

Genome assembly (figure, built up one read at a time). Genome: GACGTACGTT. Reads: GACGTA, CGTACG, TACGTT. Nodes are (k-1)-mers; edges are k-mers.

After read GACGTA: k=4 nodes GAC, ACG, CGT, GTA; k=3 nodes GA, AC, CG, GT, TA.

After read CGTACG: k=4 adds TAC; k=3 adds no new nodes.

After read TACGTT: k=4 adds GTT; k=3 adds TT.

Genome assembly: building the k-mer graph. G(k): nodes as k-mers, edges as (k-1) overlaps. H(k): nodes as (k-1)-mers, edges form k-mers. H(k) = G(k-1), so it really does not matter which you choose to implement. Where does the complexity come from? Sequencing errors, repeats, uneven coverage, contamination from other organisms, ploidy, unsequenced regions.
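The relation between the two models can be checked on the running example. The sketch below is an illustration under stated assumptions: `g_edges` and `h_edges` are hypothetical helper names, and both graphs are built from a single error-free sequence rather than from reads. On GACGTACGTT the edge sets of G(3) and H(4) coincide. On other inputs G(k-1) can also contain overlap edges whose joined k-mer was never observed, which are the "edges not reflected in the input" noted earlier.

```python
def kmer_set(seq, k):
    """Distinct k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def g_edges(seq, k):
    """G(k): nodes are k-mers; edge whenever two nodes overlap by k-1."""
    nodes = kmer_set(seq, k)
    return {(u, v) for u in nodes for v in nodes if u[1:] == v[:-1]}


def h_edges(seq, k):
    """H(k): nodes are (k-1)-mers; one edge per observed k-mer."""
    return {(kmer[:-1], kmer[1:]) for kmer in kmer_set(seq, k)}


genome = "GACGTACGTT"
same_on_example = g_edges(genome, 3) == h_edges(genome, 4)
print(same_on_example)

# A small input where G(2) has one overlap edge that no 3-mer supports.
extra = g_edges("ACGA", 2) - h_edges("ACGA", 3)
print(extra)
```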

Popping bubbles: an error occurs in the middle of a read and is propagated to many k-mers.

Trimming tips: an error creates an erroneous ending k-mer.

Chimeric extensions: errors connect two nodes in the graph which do not correspond to a valid extension in the genome sequence. Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly
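Tip trimming can be sketched as follows. This is a simplified illustration, not Velvet's actual procedure: it only handles dead ends that have no outgoing edge, it measures tip length in edges, and it ignores coverage, which real assemblers typically also consult before deleting a branch. The names `trim_tips`, `find_tip`, and `max_tip_len` are hypothetical.

```python
from collections import defaultdict


def find_tip(succ, pred, max_tip_len):
    """Return the edges of one short dead-end chain hanging off a branch node."""
    for end in list(pred):
        if succ.get(end):
            continue  # has successors, so it is not a dead end
        chain_edges = set()
        node = end
        # Walk backwards while the chain is unbranched and still short.
        while len(chain_edges) < max_tip_len and len(pred.get(node, ())) == 1:
            (parent,) = pred[node]
            chain_edges.add((parent, node))
            if len(succ[parent]) > 1:
                return chain_edges  # reached the branch point
            node = parent
    return None


def trim_tips(edges, max_tip_len):
    """Repeatedly delete tips of at most max_tip_len edges from a set of edges."""
    edges = set(edges)
    while True:
        succ, pred = defaultdict(set), defaultdict(set)
        for u, v in edges:
            succ[u].add(v)
            pred[v].add(u)
        tip = find_tip(succ, pred, max_tip_len)
        if tip is None:
            return edges
        edges -= tip


# Main path spells ACGTTCA (k = 3); the edge CG -> GA comes from an
# erroneous read ending in ...CGA.
main_path = {("AC", "CG"), ("CG", "GT"), ("GT", "TT"), ("TT", "TC"), ("TC", "CA")}
trimmed = trim_tips(main_path | {("CG", "GA")}, max_tip_len=2)
print(sorted(trimmed))
```

With a large `max_tip_len` this sketch could also delete the genuine end of the sequence, which is one reason a length threshold alone is usually not enough.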

Repetitive regions: satellites, SINEs, LINEs. Homologous genes. Ortholog: descended from the same ancestral sequence and separated by speciation. Paralog: genes created by a duplication event.

Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly


Velvet assembler. Four stages: hashing reads into k-mers; constructing the de Bruijn graph (not all 4^k k-mers, only those that exist in the input); correcting errors; resolving repeats. But what comes after? The paper gives very little information on this...

The Chinese postman problem (CPP): compute a closed tour of minimum length that visits each edge at least once. This is similar to what we want, except we may want to visit edges more than once due to repeats. How do we deal with repeats? Also, the starting and ending vertices are distinct in genome assembly. How can we convert the closed tour to an open one?

Your homework. You are not required to implement section 4 of
http://web.eecs.umich.edu/~pettie/matching/Edmonds-Johnson-chinese-postman.pdf
You are not even required to model genome assembly as CPP. But you do have to build the k-mer graph, correct errors, resolve repeats, and compute a CPP or Eulerian-like tour.
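One way to picture an "Eulerian-like" open tour is Hierholzer's algorithm on the de Bruijn multigraph. The sketch below is only an illustration under strong assumptions: the k-mer list is taken directly from an error-free genome, so each k-mer appears exactly as often as it occurs in the sequence; the graph is assumed to be connected and to admit an Eulerian path; and the list must be non-empty. The names `eulerian_path` and `spell` are hypothetical.

```python
from collections import defaultdict


def eulerian_path(kmers):
    """Return the node sequence of an Eulerian path over the given k-mers.

    Each k-mer is one edge from its (k-1)-prefix to its (k-1)-suffix.
    Repeated k-mers become parallel edges, so repeats are traversed
    as many times as they were supplied.
    """
    succ = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]
        succ[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # An open path starts at the node with one surplus outgoing edge;
    # if there is none, the tour is closed and any node will do.
    start = next((n for n, b in balance.items() if b == 1), next(iter(succ)))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if succ[node]:
            stack.append(succ[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Spell the sequence traced by consecutive overlapping nodes."""
    return path[0] + "".join(node[-1] for node in path[1:])


genome = "GACGTACGTT"
k = 4
kmer_list = [genome[i:i + k] for i in range(len(genome) - k + 1)]
assembled = spell(eulerian_path(kmer_list))
print(assembled)
```

Choosing the start node by degree imbalance is what turns the closed-tour formulation into an open path. With real reads, k-mer counts reflect coverage rather than copy number, so edge multiplicities would have to be estimated first.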

Evaluating assembly. The Assemblathon 2 study lists 102 measures for evaluating assembly quality. Bradnam et al. (2013) Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species
1. NG50 scaffold length: the largest length x such that the scaffolds of length x or longer together make up at least 50% of the genome size.
2. NG50 contig length: the largest length x such that the contigs of length x or longer together make up at least 50% of the genome size.
3. Amount of gene-sized scaffolds (>25 kbp). Useful for gene finding.
4. CEGMA: number of 458 core genes mapped.

Evaluating assembly (continued)
5. Fosmid coverage: how many validated fosmid regions were captured in the assembly.
6. Fosmid validity: percentage of the assembly validated by validated fosmid regions.
7. Validated fosmid region tag scaffold summary score: number of validated fosmid region tag pairs that match the same scaffold, multiplied by the percentage of uniquely mapping tag pairs that map with the correct distance. Rewards short-range accuracy.
8 and 9. Using local and global alignments of optical map data, how well the assembly is ordered.
10. REAPR summary score: REAPR is a tool that evaluates the accuracy of an assembly using paired reads.
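The NG50 measure from items 1 and 2 is simple to compute once scaffold or contig lengths are known. The sketch below is a minimal illustration; the function name `ng50` and the example lengths are made up, and the genome size is assumed to be known or estimated independently of the assembly.

```python
def ng50(lengths, genome_size):
    """Return the NG50 of a list of contig or scaffold lengths.

    Walk the lengths from longest to shortest and return the length
    at which the running total first reaches half the genome size.
    Returns None when the assembly covers less than half the genome.
    """
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if 2 * total >= genome_size:
            return length
    return None


contig_lengths = [80, 70, 50, 40, 30, 20]  # made-up example values
print(ng50(contig_lengths, 300))
print(ng50(contig_lengths, 400))
```

Replacing `genome_size` with the sum of the lengths would give the N50 instead, which depends only on the assembly itself.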