Outline. DNA Sequencing. Whole Genome Shotgun Sequencing. Sequencing Coverage. Whole Genome Shotgun Sequencing 3/28/15

Similar documents
Lectures 18, 19: Sequence Assembly. Spring 2017 April 13, 18, 2017

10/20/2009 Comp 590/Comp Fall

Lecture 14: DNA Sequencing

CSCI2950-C DNA Sequencing and Fragment Assembly

Lecture 18: Single-cell Sequencing and Assembly. Spring 2018 May 1, 2018

Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Daniel R. Zerbino and Ewan Birney 2008

Sequence Assembly and Alignment. Jim Noonan Department of Genetics

The first generation DNA Sequencing

Lectures 20, 21: Single- cell Sequencing and Assembly. Spring 2017 April 20,25, 2017

CSE182-L16. LW statistics/assembly

De novo genome assembly with next generation sequencing data!! "

De Novo Assembly of High-throughput Short Read Sequences

Alignment and Assembly

De novo assembly of human genomes with massively parallel short read sequencing. Mikk Eelmets Journal Club

Introduction to metagenome assembly. Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014

Bioinformatics for Genomics

Outline. The types of Illumina data Methods of assembly Repeats Selecting k-mer size Assembly Tools Assembly Diagnostics Assembly Polishing

Each cell of a living organism contains chromosomes

A shotgun introduction to sequence assembly (with Velvet) MCB Brem, Eisen and Pachter

Lecture 11: Gene Prediction

De Novo Co-Assembly Of Bacterial Genomes From Multiple Single Cells

Genome Assembly. J Fass UCD Genome Center Bioinformatics Core Friday September, 2015

De novo whole genome assembly

De novo whole genome assembly

Genome Projects. Part III. Assembly and sequencing of human genomes

Purpose of sequence assembly

The Diploid Genome Sequence of an Individual Human

short read genome assembly Sorin Istrail CSCI1820 Short-read genome assembly algorithms 3/6/2014

The Seven Bridges of Königsberg. The Seven Bridges of Königsberg. Graphs

Mate-pair library data improves genome assembly

de novo Transcriptome Assembly Nicole Cloonan 1 st July 2013, Winter School, UQ

Mapping strategies for sequence reads

TruSPAdes: analysis of variations using TruSeq Synthetic Long Reads (TSLR)

Meta-IDBA: A de Novo Assembler for Metagenomic Data

De novo assembly in RNA-seq analysis.

Course summary. Today. PCR Polymerase chain reaction. Obtaining molecular data. Sequencing. DNA sequencing. Genome Projects.

Conditional Random Fields, DNA Sequencing. Armin Pourshafeie. February 10, 2015

Next Generation Sequences & Chloroplast Assembly. 8 June, 2012 Jongsun Park

State of the art de novo assembly of human genomes from massively parallel sequencing data

De novo whole genome assembly

Genome Assembly, part II. Tandy Warnow

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es

High-Throughput Bioinformatics: Re-sequencing and de novo assembly. Elena Czeizler

Lecture 10, 20/2/2002: The process of solution development - The CODEHOP strategy for automatic design of consensus-degenerate primers for PCR

We begin with a high-level overview of sequencing. There are three stages in this process.

Genome Sequencing-- Strategies

ChIP-seq and RNA-seq

Challenging algorithms in bioinformatics

Supplementary Figure 1. Design of the control microarray. a, Genomic DNA from the

de novo paired-end short reads assembly

BENG 183 Trey Ideker. Genome Assembly and Physical Mapping

De novo meta-assembly of ultra-deep sequencing data

ادخ مانب DNA Sequencing یرهطم لضفلاوبا دیس

Introduction to Bioinformatics

IDBA-UD: A de Novo Assembler for Single-Cell and Metagenomic Sequencing Data with Highly Uneven Depth

Biol 478/595 Intro to Bioinformatics

De novo Genome Assembly

Lander-Waterman Statistics for Shotgun Sequencing Math 283: Ewens & Grant 5.1 Math 186: Not in book

ChIP-seq and RNA-seq. Farhat Habib

Review of whole genome methods

Introduction to Bioinformatics. Genome sequencing & assembly

DNA Sequencing and Assembly

A thesis submitted in partial fulfillment of the requirements for the degree in Master of Science

ARACHNE: A Whole-Genome Shotgun Assembler

GENOME ASSEMBLY FINAL PIPELINE AND RESULTS

Repetitive DNA sequence assembly

PERGA: A Paired-End Read Guided De Novo Assembler for Extending Contigs Using SVM and Look Ahead Approach

Supplementary. Table 1: Oligonucleotides and Plasmids. complementary to positions from 77 of the SRα '- GCT CTA GAG AAC TTG AAG TAC AGA CTG C

Building and Improving Reference Genome Assemblies

A Guide to Consed Michelle Itano, Carolyn Cain, Tien Chusak, Justin Richner, and SCR Elgin.

PCR analysis was performed to show the presence and the integrity of the var1csa and var-

Analysis of large deletions in human-chimp genomic alignments. Erika Kvikstad BioInformatics I December 14, 2004

Glenn Tesler. University of California, San Diego Department of Mathematics

Disease and selection in the human genome 3

Contact us for more information and a quotation

Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Supplementary Material

Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Supplemental Materials

Analysis of structural variation. Alistair Ward USTAR Center for Genetic Discovery University of Utah

From Infection to Genbank

Complete Genomics Technology Overview

Haploid Assembly of Diploid Genomes

A Short Sequence Splicing Method for Genome Assembly Using a Three- Dimensional Mixing-Pool of BAC Clones and High-throughput Technology

Lecture 10 : Whole genome sequencing and analysis. Introduction to Computational Biology Teresa Przytycka, PhD

GENETICS - CLUTCH CH.15 GENOMES AND GENOMICS.

Analysis of structural variation. Alistair Ward - Boston College

BIOINFORMATICS 1 SEQUENCING TECHNOLOGY. DNA story. DNA story. Sequencing: infancy. Sequencing: beginnings 26/10/16. bioinformatic challenges

Microbiome: Metagenomics 4/4/2018

Illumina (Solexa) Throughput: 4 Tbp in one run (5 days) Cheapest sequencing technology. Mismatch errors dominate. Cost: ~$1000 per human genme

Connect-A-Contig Paper version

The Human Genome and its upcoming Dynamics

BIOINFORMATICS ORIGINAL PAPER

Genome Assembly: Background and Strategy

De Novo and Hybrid Assembly

Electronic Supplementary Information

Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation

3) This diagram represents: (Indicate all correct answers)

DNA is normally found in pairs, held together by hydrogen bonds between the bases

Assembly of Ariolimax dolichophallus using SOAPdenovo2

DNA sequencing. Course Info

Bioinformatics? Assembly, annotation, comparative genomics and a bit of phylogeny.

Transcription:

Outline Introduction Lectures 22, 23: Sequence Assembly Spring 2015 March 27, 30, 2015 Sequence Assembly Problem Different Solutions: Overlap-Layout-Consensus Assembly Algorithms De Bruijn Graph Based Assembly Algorithms Resolving Repeats Introduction to Single-Cell Sequencing 1 2 Whole Genome Shotgun Sequencing Frederick Sanger (and others) shared a Nobel Prize in Chemistry in 1980 for developing a method to sequence short regions of DNA. There is no current technology to simply read the whole genome sequence from one end to the other. DNA Sequencing Shear DNA into millions of small fragments Read 500 700 nucleotides at a time from the small fragments (Sanger method) The human genome is 3 billion nucleotides long. Sequencing it requires breaking it into little pieces, sequencing the pieces separately, and fitting them back together, like a jigsaw puzzle. 3 Whole Genome Shotgun Sequencing Start with many copies of genome. Bacterial genome length: 5 million. Sequencing Coverage Fragment them and sequence reads at both ends. Read length: 35 to 1000 bp. Find overlapping reads. ACGTAGAATCCATG......AACATAGTTGTAGAATC Merge overlapping reads into contigs....aacatagttgtagaatccatg... Gap Gap Contig Contig Contig Coverage at this location=2 5 Number of reads: ~28 million, read length: 100 bp, genome size: 4.6 Mbp, coverage: ~600x H. Chitsaz, et al., Nature Biotech (2011) 6 1

Sequencing by Hybridization (SBH): History How SBH Works 1988: SBH suggested as an an alternative sequencing method. Nobody believed it will ever work 1991: Light directed polymer synthesis developed by Steve Fodor and colleagues. 1994: Affymetrix develops first 64-kb DNA microarray First microarray prototype (1989) First commercial DNA microarray prototype w/16,000 features (1994) 500,000 features per chip (2002) Attach all possible DNA probes of length l to a flat surface, each probe at a distinct and known location. This set of probes is called the DNA array. Apply a solution containing fluorescently labeled DNA fragment to the array. The DNA fragment hybridizes with those probes that are complementary to substrings of length l of the fragment. How SBH Works (cont d) Hybridization on DNA Array Using a spectroscopic detector, determine which probes hybridize to the DNA fragment to obtain the l mer composition of the target DNA fragment. Apply the combinatorial algorithm (below) to reconstruct the sequence of the target DNA fragment from the l mer composition. l-mer composition Spectrum ( s, l ) - unordered multiset of all possible (n l + 1) l-mers in a string s of length n The order of individual elements in Spectrum ( s, l ) does not matter For s = TATGGTGC all of the following are equivalent representations of Spectrum ( s, 3 ): {TAT, ATG, TGG, GGT, GTG, TGC} {ATG, GGT, GTG, TAT, TGC, TGG} {TGG, TGC, TAT, GTG, GGT, ATG} Different sequences the same spectrum Different sequences may have the same spectrum: Spectrum(GTATCT,2)= Spectrum(GTCTAT,2)= {AT, CT, GT, TA, TC} 2

The SBH Problem Goal: Reconstruct a string from its l-mer composition Input: A set S, representing all l-mers from an (unknown) string s Output: String s such that Spectrum ( s,l ) = S Some Applications of Sequencing 1000 Human Genomes Project An international effort to map variability in the genome The 1000 Genomes Project Consortium, Nature (Oct 2010) 467: 1061 1073 Prostate Cancer Genomics M.F. Berger et al., Nature (Feb 2011) 470: 214-220 Genome 10K Project A continuation of Human (2001), Mouse (2002), Rat (2004), Chicken (2004), Dog (2005), Chimpanzee (2005), Macaque (2007), Cat (2007), Horse (2007), Elephant (2009), Turkey (2011), etc. genomes. An international effort to sequence, de novo assemble, and annotate 10,000 vertebrate genomes; 300+ species to be started in 2011. Genome 10K Community of Scientists, J Heredity (Sep 2009) 100 (6): 659-674 14 De Novo Genome Assembly Problem: given a collection of reads, i.e. short subsequences of the genomic sequence in the alphabet A, C, G, T, completely reconstruct the genome from which the reads are derived. Challenges: Repeats in the genome ACCCAGTTTGGGATCCTTTTTTGGGATTTTAACGCG CAGTTTG ACTGGGATCC Sample reads TGGGATT Sequencing errors: substitutions, insertions, deletions, and others. TTTTTATAGA (substitution), CCTT TCG (deletion and insertion) Size of the data, e.g. 1.5 billion reads in 110GB FASTA file. Challenges in Fragment Assembly Repeats: A major problem for fragment assembly > 50% of human genome are repeats: - over 1 million Alu repeats (about 300 bp) - about 200,000 LINE repeats (1000 bp and longer) Repeat Repeat Repeat Green and blue fragments are interchangeable when assembling repetitive DNA 15 Repeat Types Low-Complexity DNA (e.g. ATATATATACATA ) Microsatellite repeats (a 1 a k ) N where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG) Transposons/retrotransposons SINE Short Interspersed Nuclear Elements (e.g., Alu: ~300 bp long, 10 6 copies) LINE Long Interspersed Nuclear Elements ~500-5,000 bp long, 200,000 copies LTR retroposons Long Terminal Repeats (~700 bp) at each end Triazzle: A Fun Example The puzzle looks simple BUT there are repeats!!! The repeats make it very difficult. Try it Gene Families genes duplicate & then diverge Segmental duplications ~very long, very similar copies 3

De Novo Genome Assembly Current solutions Overlap-layout-consensus (Celera, Newbler) Suitable for low coverage, long reads Highly parallelizable De Bruijn graph construction (ALLPATHS-LG, ABySS, Velvet, SOAPdenovo, EULER-SR, SPAdes, and HyDA) Suitable for high coverage, short reads Fast but memory-intensive Sensitive to sequencing errors Mathematically elegant repeat classification Overlap-Layout-Consensus Assembly 19 20 Overlap-Layout-Consensus Assemblers: SGA, ARACHNE, PHRAP, CAP, TIGR, CELERA Overlap: find potentially overlapping reads Layout: merge reads into contigs and contigs into supercontigs Overlap Find the best match between the suffix of one read and the prefix of another Due to sequencing errors, need to use dynamic programming to find the optimal overlap alignment Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring Consensus: derive the DNA sequence and correct read errors..acgattacaataggtt.. Overlapping Reads Sort all k-mers in reads (k ~ 24) Find pairs of reads sharing a k-mer Extend to full alignment throw away if not >95% similar Overlapping Reads and Repeats A k-mer that appears N times, initiates N 2 comparisons For an Alu that appears 10 6 times à 10 12 comparisons too much Solution: Discard all k-mers that appear more than t Coverage, (t ~ 10) TACA TAGATTACACAGATTAC T GA TAGT TAGATTACACAGATTAC TAGA 4

Finding Overlapping Reads Create local multiple alignments from the overlapping reads TAG TTACACAGATTATTGA TAG TTACACAGATTATTGA Finding Overlapping Reads (cont d) Correct errors using multiple alignment C: 20 C: 20 C: 35 C: 35 T: 30 C: 0 C: 35 C: 35 C: 40 C: 40 TAG TTACACAGATTATTGA A: 15 - A: 40 Score alignments Accept alignments with good scores A: 15 A: 0 A: 40 Layout Repeats are a major challenge. Do two aligned fragments really overlap, or are they from two copies of a repeat? Merge Reads into Contigs repeat region Solution: repeat masking hide the repeats!!! Masking results in high rate of misassembly (up to 20%). Misassembly means alot more work at the finishing step. Merge reads up to potential repeat boundaries Repeats, Errors, and Contig Lengths Repeats shorter than read length are OK. Repeats with more base pair differences than sequencing error rate are OK. De Bruijn Graph Based Assembly To make a smaller portion of the genome appear repetitive, try to: Increase read length. Decrease sequencing error rate. 30 5

De Bruijn Graph Example Shred reads into k-mers (k = 3) Read 1 Read 2 G G A C T A A A G A C C A A A T G G A G A C A C T C T A T A A A A A G A C A C C C C A C A A A A A A A T De Bruijn Graph Example Merge vertices labeled by identical k-mers Read 1: Read 2: GGA ACT ACC CTA CCA TAA CAA AAT GGA ACT CTA TAA P. Pevzner, J Biomol Struct Dyn (1989) 7:63 73 R. Idury, M. Waterman, J Comput Biol (1995) 2:291 306 ACC CCA CAA AAT Resulting Graph: GGA ACT ACC CTA CCA TAA CAA AAT 31 32 Another Example Constructing the graph (k = 4) Example After condensation TGAG ATGA GATG CGAT CCGA TCCG ATCC GATC AGAT (9x) ( 8x ) ( 5x ) ( 6x ) ( 7x ) ( 7x ) ( 7x ) ( 8x ) ( 8x ) AGATCCGATGAG GCTC CTCT TCTA CTAG AAGTCGA CGAG GCTCTAG AAGT ( 3x ) AGTC ( 7x ) GTCG ( 9x ) TCGA ( 10x ) CGAG ( 8x ) C GAGG ( 16x ) G AGGC GGCT TAGA AGAG GAGA A A ( 16x ) ( 11x ) ( 16x ) ( 9x ) ( 12x ) ( 9x ) ( 8x ) GCTT CTTT TTTA TTAG ( 8x ) ( 8x ) ( 8x ) ( 12x ) ACGC Sequencing errors are normally detected by a coverage cutoff threshold ACAA ( 5x ) CGC GAGGCT GCTTTAG TAGA AGAG G A branching vertex is caused by either a repeat in the original sequence or a sequencing error 33 34 Example After error removal Example After recondensation AGATCCGATGAG AGATCCGATGAG AAGTCGA CGAG GAGGCT GCTTTAG TAGA AGAG G AAGTCGAG GAGGCTTTAGA AGAG G Any non-branching path in this graph corresponds to a contig in the original sequence. Taking the risk of following arbitrary branching paths may create chimeric species 35 Source: Serafim Batzoglou 36 6

Resolving Repeats Using paired reads Resolving Repeats Equivalent transformations Genome: S 1 REPEAT S 2. S 3 REPEAT S 4 Insert size: a design parameter Matches the distance in the graph, Longer than repeat length Genome Read 1 Read 2 S 1 S 2 S 1 REPEAT S 2 REPEAT S 3 REPEAT S 4 S 3 S 4 P. Pevzner and H. Tang, Bioinformatics (2001) Suppl1:S225-33 37 38 Resolving Repeats Failure Spans multiple paths S P 1 1 S 2 Single Cell Sequencing Whole genome amplification Start with a single copy of genome. REPEAT1 REPEAT2 S 3 P 2 S 4 Amplify (copy) the genome using multiple displacement amplification (MDA) technique invented by Roger Lasken at J. Craig Venter Institute. Mate pair transformation (Velvet, ABySS, EULER-SR) Find a unique path between mates in the graph. When multiple paths match the distance between mate-pairs, mate pair transformation fails. To resolve a repeat, insert size must be larger than the repeat length and smaller than the length of potential conjugate paths (same length paths passing through the repeat). F.B. Dean,et al., PNAS (2002) 99(8): 5261-6 Fragment them and sequence reads at both ends. 39 40 Sequencing Coverage Normal multicell vs. single cell Distribution of Coverage H. Chitsaz, et al., Nature Biotech (2011) Green regions are blackout Number of reads: ~28 million, read length: 100 bp, genome size: 4.6 Mbp, coverage: ~600x A cutoff threshold will eliminate about 25% of valid data in the single cell case, whereas it eliminates noise in the normal multicell case. H. Chitsaz, et al., Nature Biotech (2011) 41 42 7