Introduction to Bioinformatics. Genome sequencing & assembly

Similar documents
The first generation DNA Sequencing

Each cell of a living organism contains chromosomes

DNA sequencing. Course Info

Genome Sequencing. I: Methods. MMG 835, SPRING 2016 Eukaryotic Molecular Genetics. George I. Mias

Next Generation Sequencing. Jeroen Van Houdt - Leuven 13/10/2017

Next Generation Sequencing Lecture Saarbrücken, 19. March Sequencing Platforms

1. A brief overview of sequencing biochemistry

Sequence Assembly and Alignment. Jim Noonan Department of Genetics

Human genome sequence

BENG 183 Trey Ideker. Genome Assembly and Physical Mapping

Bioinformatics Support of Genome Sequencing Projects. Seminar in biology

Mate-pair library data improves genome assembly

Molecular Biology: DNA sequencing

De novo genome assembly with next generation sequencing data!! "

De Novo Assembly of High-throughput Short Read Sequences

CHEM 4420 Exam I Spring 2013 Page 1 of 6

Genome Sequencing-- Strategies

Genomics AGRY Michael Gribskov Hock 331

Genetics Lecture 21 Recombinant DNA

Chapter 17. PCR the polymerase chain reaction and its many uses. Prepared by Woojoo Choi

A shotgun introduction to sequence assembly (with Velvet) MCB Brem, Eisen and Pachter

DNA-Sequencing. Technologies & Devices. Matthias Platzer. Genome Analysis Leibniz Institute on Aging - Fritz Lipmann Institute (FLI)

Genetic Fingerprinting

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es

Third Generation Sequencing

DNA-Sequencing. Technologies & Devices. Matthias Platzer. Genome Analysis Leibniz Institute on Aging - Fritz Lipmann Institute (FLI)

High Throughput Sequencing Technologies. J Fass UCD Genome Center Bioinformatics Core Monday June 16, 2014

Introduction to Molecular Biology

2/5/16. Honeypot Ants. DNA sequencing, Transcriptomics and Genomics. Gene sequence changes? And/or gene expression changes?

DNA-Sequencing. Technologies & Devices

Multiple choice questions (numbers in brackets indicate the number of correct answers)

Opportunities offered by new sequencing technologies

Genetics and Genomics in Medicine Chapter 3. Questions & Answers

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

Next Gen Sequencing. Expansion of sequencing technology. Contents

DNA Technology. Asilomar Singer, Zinder, Brenner, Berg

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

MCB 102 University of California, Berkeley July 28-30, Problem Set 6

High Throughput Sequencing Technologies. UCD Genome Center Bioinformatics Core Monday 15 June 2015

Lecture Four. Molecular Approaches I: Nucleic Acids

Additional Activity: Sanger Dideoxy Sequencing: A Simulation Activity

CHAPTER 20 DNA TECHNOLOGY AND GENOMICS. Section A: DNA Cloning

CSC Assignment1SequencingReview- 1109_Su N_NEXT_GENERATION_SEQUENCING.docx By Anonymous. Similarity Index

Computational Biology I LSM5191

Targeted Sequencing Using Droplet-Based Microfluidics. Keith Brown Director, Sales

Introduction to Next Generation Sequencing (NGS)

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Sequencing the Human Genome

Biochemistry 412. New Strategies, Technologies, & Applications For DNA Sequencing. 12 February 2008

DNA-Sequenzierung. Technologien & Geräte

BIOINFORMATICS 1 SEQUENCING TECHNOLOGY. DNA story. DNA story. Sequencing: infancy. Sequencing: beginnings 26/10/16. bioinformatic challenges

Biotechnology. DNA Cloning Finding Needles in Haystacks. DNA Sequencing. Genetic Engineering. Gene Therapy

Genome Sequence Assembly

Outline. General principles of clonal sequencing Analysis principles Applications CNV analysis Genome architecture

Molecular Cell Biology - Problem Drill 11: Recombinant DNA

R1 12 kb R1 4 kb R1. R1 10 kb R1 2 kb R1 4 kb R1

Genome Assembly. J Fass UCD Genome Center Bioinformatics Core Friday September, 2015

NPTEL VIDEO COURSE PROTEOMICS PROF. SANJEEVA SRIVASTAVA

BST227 Introduction to Statistical Genetics. Lecture 8: Variant calling from high-throughput sequencing data

SAMPLE LITERATURE Please refer to included weblink for correct version.

Next-Generation Sequencing. Technologies

2. Outline the levels of DNA packing in the eukaryotic nucleus below next to the diagram provided.

PV92 PCR Bio Informatics

Lecture 3 (FW) January 28, 2009 Cloning of DNA; PCR amplification Reading assignment: Cloning, ; ; 330 PCR, ; 329.

CloG: a pipeline for closing gaps in a draft assembly using short reads

Finishing Drosophila Ananassae Fosmid 2728G16

Research school methods seminar Genomics and Transcriptomics

Next Generation Sequencing. Simon Rasmussen Assistant Professor Center for Biological Sequence analysis Technical University of Denmark

DNA Structure and Analysis. Chapter 4: Background

Gene Identification in silico

Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome

Sequencing the Human Genome

Connect-A-Contig Paper version

Biotechnology. Chapter 20. Biology Eighth Edition Neil Campbell and Jane Reece. PowerPoint Lecture Presentations for

DNA replication: Enzymes link the aligned nucleotides by phosphodiester bonds to form a continuous strand.

Sept 2. Structure and Organization of Genomes. Today: Genetic and Physical Mapping. Sept 9. Forward and Reverse Genetics. Genetic and Physical Mapping

Incorporating Molecular ID Technology. Accel-NGS 2S MID Indexing Kits

Next Generation Sequencing Technologies

PRESENTING SEQUENCES 5 GAATGCGGCTTAGACTGGTACGATGGAAC 3 3 CTTACGCCGAATCTGACCATGCTACCTTG 5

Session 3 Cloning Overview & Polymerase Chain Reaction

LECTURE TOPICS 3) DNA SEQUENCING, RNA SEQUENCING, DNA SYNTHESIS 5) RECOMBINANT DNA CONSTRUCTION AND GENE CLONING

DNA Function: Information Transmission

Chapter 15 Gene Technologies and Human Applications

Genetic Engineering & Recombinant DNA

CHAPTER 9 DNA Technologies

RIPTIDE HIGH THROUGHPUT RAPID LIBRARY PREP (HT-RLP)

Recombinant DNA Technology. The Role of Recombinant DNA Technology in Biotechnology. yeast. Biotechnology. Recombinant DNA technology.

Illumina (Solexa) Throughput: 4 Tbp in one run (5 days) Cheapest sequencing technology. Mismatch errors dominate. Cost: ~$1000 per human genme

Genomics and Gene Recognition Genes and Blue Genes

DNA vs. RNA DNA: deoxyribonucleic acid (double stranded) RNA: ribonucleic acid (single stranded) Both found in most bacterial and eukaryotic cells RNA

Genome Sequencing Technologies. Jutta Marzillier, Ph.D. Lehigh University Department of Biological Sciences Iacocca Hall

Part IV => DNA and RNA. 4.1 Nucleotide Properties 4.1a Nucleotide Nomenclature 4.1b DNA Sequencing

1.1 Post Run QC Analysis

2 Gene Technologies in Our Lives

Carl Woese. Used 16S rrna to developed a method to Identify any bacterium, and discovered a novel domain of life

Outline. DNA Sequencing. Whole Genome Shotgun Sequencing. Sequencing Coverage. Whole Genome Shotgun Sequencing 3/28/15

Replication. Obaidur Rahman

Fatchiyah

Chapter 20: Biotechnology

Transcription:

Introduction to Bioinformatics Genome sequencing & assembly

Genome sequencing & assembly p DNA sequencing How do we obtain DNA sequence information from organisms? p Genome assembly What is needed to put together DNA sequence information from sequencing? p First statement of sequence assembly problem (according to G. Myers): Peltola, Söderlund, Tarhio, Ukkonen: Algorithms for some string matching problems arising in molecular genetics. Proc. 9th IFIP World Computer Congress, 1983 123

Recovery of shredded newspaper? 124

DNA sequencing p DNA sequencing: resolving a nucleotide sequence (whole-genome or less) p Many different methods developed Maxam-Gilbert method (1977) Sanger method (1977) High-throughput methods 125

Sanger sequencing: sequencing by synthesis p A sequencing technique developed by Fred Sanger p Also called dideoxy sequencing 126

DNA polymerase p A DNA polymerase is an enzyme that catalyzes DNA synthesis p DNA polymerase needs a primer Synthesis proceeds always in 5 ->3 direction 127 http://en.wikipedia.org/wiki/dna_polymerase

Dideoxy sequencing p In Sanger sequencing, chain-terminating dideoxynucleoside triphosphates (ddxtps) are employed ddatp, ddctp, ddgtp, ddttp lack the 3 -OH tail of dxtps p A mixture of dxtps with small amount of ddxtps is given to DNA polymerase with DNA template and primer p ddxtps are given fluorescent labels 128

Dideoxy sequencing p When DNA polymerase encounters a ddxtp, the synthesis cannot proceed p The process yields copied sequences of different lengths p Each sequence is terminated by a labeled ddxtp 129

Determining the sequence p Sequences are sorted according to length by capillary electrophoresis p Fluorescent signals corresponding to labels are registered p Base calling: identifying which base corresponds to each position in a read Non-trivial problem! Output sequences from base calling are called reads 130

Reads are short! p Modern Sanger sequencers can produce quality reads up to ~750 bases 1 Instruments provide you with a quality file for bases in reads, in addition to actual sequence data p Compare the read length against the size of the human genome (2.9x10 9 bases) p Reads have to be assembled! 131 1 Nature Methods - 5, 16-18 (2008)

Problems with sequencing p Sanger sequencing error rate per base varies from 1% to 3% 1 p Repeats in DNA For example, ~300 base Alu sequence repeated is over million times in human genome Repeats occur in different scales p What happens if repeat length is longer than read length? We will get back to this problem later 132 1 Jones, Pevzner (2004)

Shortest superstring problem p Find the shortest string that explains the reads p Given a set of strings (reads), find a shortest string that contains all of them 133

Example: Shortest superstring Set of strings: {000, 001, 010, 011, 100, 101, 110, 111} Concetenation of strings: 000001010011100101110111 010 110 011 000 Shortest superstring: 0001110100 001 111 101 100 134

Shortest superstrings: issues p NP-complete problem: unlike to have an efficient (exact) algorithm p Reads may be from either strand of DNA p Is the shortest string necessarily the correct assembly? p What about errors in reads? p Low coverage -> gaps in assembly Coverage: average number of times each base occurs in the set of reads (e.g., 5x coverage) 135

Sequence assembly and combination locks p What is common with sequence assembly and opening keypad locks? 136

Whole-genome shotgun sequence p Whole-genome shotgun sequence assembly starts with a large sample of genomic DNA 1. Sample is randomly partitioned into inserts of length > 500 bases 2. Inserts are multiplied by cloning them into a vector which is used to infect bacteria 3. DNA is collected from bacteria and sequenced 4. Reads are assembled 137

Assembly of reads with Overlap-Layout- Consensus algorithm p Overlap Finding potentially overlapping reads p Layout Finding the order of reads along DNA p Consensus (Multiple alignment) Deriving the DNA sequence from the layout p Next, the method is described at a very abstract level, skipping a lot of details 138 Kececioglu, J.D. and E.W. Myers. 1995. Combinatorial algorithms for DNA sequence assembly. Algorithmica 13: 7-51.

Finding overlaps p First, pairwise overlap alignment of reads is resolved p Reads can be from either DNA strand: The reverse complement r* of each read r has to be considered acggagtcc agtccgcgctt r 1 5 a t g a g t g g a 3 3 t a c t c a c c t 5 r 2 r 1 : tgagt, r 1* : actca r 2 : tccac, r 2* : gtgga 139

Example sequence to assemble 5 CAGCGCGCTGCGTGACGAGTCTGACAAAGACGGTATGCGCATCG TGATTGAAGTGAAACGCGATGCGGTCGGTCGGTGAAGTTGTGCT - 3 p 20 reads: # Read Read* 1 CATCGTCA TCACGATG 2 CGGTGAAG CTTCACCG 3 TATGCGCA TGCGCATA 4 GACGAGTC GACTCGTC 5 CTGACAAA TTTGTCAG 6 ATGCGCAT ATGCGCAT 7 ATGCGGTC GACCGCAT 8 CTGCGTGA TCACGCAG 9 GCGTGACG CGTCACGC 10 GTCGGTGA TCACCGAC # Read Read* 11 GGTCGGTG CACCGACC 12 ATCGTGAT ATCACGAT 13 GCGCTGCG CGCAGCGC 14 GCATCGTG CACGATGC 15 AGCGCGCT AGCGCGCT 16 GAAGTTGT ACAACTTC 17 AGTGAAAC GTTTCACT 18 ACGCGATG CATCGCGT 19 GCGCATCG CGATGCGC 20 AAGTGAAA TTTCACTT 140

Finding overlaps p Overlap between two reads canbefoundwitha dynamic programming algorithm Errors can be taken into account p Dynamic programming will be discussed more on next lecture p Overlap scores stored into the overlap matrix Entries (i, j) below the diagonal denote overlap of read r i and r j * Overlap(1, 6) = 3 6 ATGCGCAT 1 CATCGTCA 12 ATCGTGAT Overlap(1, 12) = 7 6 12 1 3 7 141

Finding layout & consensus p Method extends the assembly greedily by choosing the best overlaps p Both orientations are considered p Sequence is extended as far as possible 7* GACCGCAT 6=6* ATGCGCAT 14 GCATCGTG 1 CATCGTGA 12 ATCGTGAT 19 GCGCATCG 13* CGCAGCGC --------------------- CGCATCGTGAT Consensus sequence Ambiguous bases 142

Finding layout & consensus p We move on to next best overlaps and extend the sequence from there p The method stops when there are no more overlaps to consider p A number of contigs is produced p Contig stands for contiguous sequence, resulting from merging reads 2 CGGTGAAG 10 GTCGGTGA 11 GGTCGGTG 7 ATGCGGTC --------------------- ATGCGGTCGGTGAAG 143

Whole-genome shotgun sequencing: summary Original genome sequence Reads Non-overlapping read Overlapping reads => Contig p Ordering of the reads is initially unknown p Overlaps resolved by aligning the reads p In a 3x10 9 bp genome with 500 bp reads and 5x coverage, there are ~10 7 reads and ~10 7 (10 7-1)/2 = ~5x10 13 pairwise sequence comparisons 144

Repeats in DNA and genome assembly Two instances of the same repeat 145 Pop, Salzberg, Shumway (2002)

Repeats in DNA cause problems in sequence assembly p Recap: if repeat length exceeds read length, we might not get the correct assembly p This is a problem especially in eukaryotes ~3.1% of genome consists of repeats in Drosophila, ~45% in human p Possible solutions 1. Increase read length feasible? 2. Divide genome into smaller parts, with known order, and sequence parts individually 146

Divide and conquer sequencing approaches: BAC-by-BAC Genome Whole-genome shotgun sequencing Genome Divide-and-conquer BAC library 147

BAC-by-BAC sequencing p Each BAC (Bacterial Artificial Chromosome) is about 150 kbp p Covering the human genome requires ~30000 BACs p BACs shotgun-sequenced separately Number of repeats in each BAC is significantly smaller than in the whole genome......needs much more manual work compared to whole-genome shotgun sequencing 148

Hybrid method p Divide-and-conquer and whole-genome shotgun approaches can be combined Obtain high coverage from whole-genome shotgun sequencing for short contigs Generate of a set of BAC contigs with low coverage Use BAC contigs to bin short contigs to correct places p This approach was used to sequence the brown Norway rat genome in 2004 149

Paired end sequencing p Paired end (or mate-pair) sequencing is technique where both ends of an insert are sequenced For each insert, we get two reads We know the distance between reads, and that they are in opposite orientation k Read 1 Read 2 Typically read length < insert length 150

Paired end sequencing p The key idea of paired end sequencing: Both reads from an insert are unlikely to be in repeat regions If we know where the first read is, we know also second s location Repeat region k Read 1 Read 2 p This technique helps to WGSS higher organisms 151

First whole-genome shotgun sequencing project: Drosophila melanogaster p Fruit fly is a common model organism in biological studies p Whole-genome assembly reported in Eugene Myers, et al., A Whole-Genome Assembly of Drosophila, Science 24, 2000 p Genome size 120 Mbp 152 http://en.wikipedia.org/wiki/drosophila_melanogaster

Sequencing of the Human Genome p The (draft) human genome was published in 2001 p Two efforts: Human Genome Project (public consortium) Celera (private company) p HGP: BAC-by-BAC approach p Celera: whole-genome shotgun sequencing HGP: Nature 15 February 2001 Vol 409 Number 6822 Celera: Science 16 February 2001 Vol 291, Issue 5507 153

Genome assembly software p phrap (Phil s revised assembly program) p AMOS (A Modular, Open-Source wholegenome assembler) p CAP3 / PCAP p TIGR assembler 154

Next generation sequencing techniques p Sanger sequencing is the prominent firstgeneration sequencing method p Many new sequencing methods are emerging p See Lars Paulin s slides (course web page) for details 155

Next-gen sequencing: 454 p Genome Sequencer FLX (454 Life Science / Roche) >100 Mb / 7.5 h run Read length 250-300 bp >99.5% accuracy / base in a single run >99.99% accuracy / base in consensus 156

Next-gen sequencing: Illumina Solexa p Illumina / Solexa Genome Analyzer Read length 35-50 bp 1-2 Gb / 3-6 day run > 98.5% accuracy / base in a single run 99.99% accuracy / consensus with 3x coverage 157

Next-gen sequencing: SOLiD p SOLiD Read length 25-30 bp 1-2 Gb / 5-10 day run >99.94% accuracy / base >99.999% accuracy / consensus with 15x coverage 158

Next-gen sequencing: Helicos p Helicos: Single Molecule Sequencer No amplification of sequences needed Read length up to 55 bp p Accuracy does not decrease when read length is increased p Instead, throughput goes down 25-90 Mb / h >2 Gb / day 159

Next-gen sequencing: Pacific Biosciences p Pacific Biosciences Single-Molecule Real-Time (SMRT) DNA sequencing technology Read length thousands of nucleotides p Should overcome most problems with repeats Throughput estimate: 100 Gb / hour First instruments in 2010? 160