The first generation DNA Sequencing


Slides 3-17 are modified from faperta.ugm.ac.id/newbie/download/pak_tar/.../instrument20072.ppt; slides 18-43 are from Chengxiang Zhai at UIUC.

The strand direction http://en.wikipedia.org/wiki/dna

DNA sequencing
Determination of nucleotide sequence: the determination of the precise sequence of nucleotides in a sample of DNA.
Two similar methods:
1. Maxam and Gilbert method
2. Sanger method
Both depend on producing a mixture of oligonucleotides, labeled either radioactively or with fluorescein, that share one common end and differ in length by a single nucleotide at the other end.
This mixture of oligonucleotides is separated by high-resolution electrophoresis on polyacrylamide gels, and the positions of the bands are determined.

Maxam-Gilbert
Walter Gilbert: Harvard physicist; knew James Watson; became intrigued with the biological side; became a biophysicist
Allan Maxam

The Maxam-Gilbert Technique
Principle: chemical degradation of purines
Purines (A, G) are damaged by dimethyl sulfate
Methylation of the base
Heat releases the base
Alkali cleaves G
Dilute acid cleaves A > G

The Maxam-Gilbert Technique
Principle: chemical degradation of pyrimidines
Pyrimidines (C, T) are damaged by hydrazine
Piperidine cleaves the backbone
2 M NaCl inhibits the reaction with T

Advantages/disadvantages
Maxam-Gilbert sequencing:
Requires lots of purified DNA and many intermediate purification steps
Relatively short reads
No automation available (no sequencers)
Remaining use: footprinting (proteins bound to specific regions partially protect the DNA against modification, which produces holes in the sequence ladder)
In contrast, the Sanger sequencing methodology requires little if any DNA purification, no restriction digests, and no labeling of the DNA sequencing template.

Fred Sanger
Was originally a protein chemist
Made his first mark in sequencing proteins (Nobel Prize, 1958)
Made his second mark in sequencing RNA
Dideoxy sequencing (Nobel Prize, 1980)

Sanger Method
In-vitro DNA synthesis using terminators: dideoxynucleotides that do not permit chain elongation after their integration
DNA synthesis with deoxy- and dideoxynucleotides terminates at specific nucleotides
Requires a primer, DNA polymerase, a template, a mixture of nucleotides, and a detection system
Incorporation of dideoxynucleotides into the growing strand terminates synthesis
Synthesized strand sizes are determined for each dideoxynucleotide by gel or capillary electrophoresis
Enzymatic method

deoxyribonucleotide

Dideoxynucleotide
[Figure: structure of a dideoxynucleotide, labeled PPP, O, 5' CH2, O, BASE, and 3'.]
No hydroxyl group at the 3' end prevents strand extension.

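[Figure: chain termination on a short template.]
Template: 3' CCGTAC 5'
Primer extended 5' to 3' with dNTPs plus one of ddATP, ddTTP, ddCTP, or ddGTP
ddATP: GGCA
ddTTP: GGCAT
ddCTP: GGC
ddGTP: G, GG, GGCATG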
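The fragment ladder in the figure can be reproduced with a short simulation. This is a minimal sketch, not a model of the real chemistry: it assumes every possible termination product is formed, it ignores the primer sequence itself, and the function names are made up for illustration.

```python
# Complement map for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def synthesized_strand(template_3_to_5: str) -> str:
    """Return the new strand (5'->3') built on a template written 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3_to_5)


def termination_fragments(template_3_to_5: str) -> dict:
    """Group every terminated fragment by the dideoxynucleotide that ended it."""
    strand = synthesized_strand(template_3_to_5)
    fragments = {"A": [], "T": [], "C": [], "G": []}
    for end in range(1, len(strand) + 1):
        fragment = strand[:end]
        # The last base of the fragment is the incorporated ddNTP.
        fragments[fragment[-1]].append(fragment)
    return fragments


def read_ladder(fragments: dict) -> str:
    """Read the sequence by ordering fragments by length, as on a gel."""
    ordered = sorted(
        (len(frag), base) for base, frags in fragments.items() for frag in frags
    )
    return "".join(base for _, base in ordered)


fragments = termination_fragments("CCGTAC")
print(fragments["G"])          # fragments ending in ddGTP
print(read_ladder(fragments))  # sequence read from shortest to longest fragment
```

Sorting the fragments by length and reading off the terminating base of each is the same operation that gel or capillary electrophoresis performs physically.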

Sample Output 1 lane

Phred http://www.phrap.org/phrap.docs/phred.html

Sanger sequencing
Laser excitation of fluorescent labels as fragments of discrete lengths exit the capillary, coupled to four-color detection of emission spectra, provides the readout that is represented in a Sanger sequencing trace. Software translates these traces into DNA sequence, while also generating error probabilities for each base call. Simultaneous electrophoresis in 96 or 384 independent capillaries provides a limited level of parallelization. After three decades of gradual improvement, the Sanger biochemistry can be applied to achieve read lengths of up to ~1,000 bp, and per-base raw accuracies as high as 99.999%. In the context of high-throughput shotgun genomic sequencing, Sanger sequencing costs on the order of $0.50 per kilobase.
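Per-base error probabilities are conventionally reported as Phred quality scores, Q = -10 log10(P). The sketch below converts between the two. The function names are illustrative and are not part of the Phred program.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Convert a base-call error probability into a Phred quality score."""
    return -10.0 * math.log10(error_probability)


def error_probability(quality: float) -> float:
    """Convert a Phred quality score back into an error probability."""
    return 10.0 ** (-quality / 10.0)


# A raw accuracy of 99.999% corresponds to an error probability of 1e-5.
print(phred_quality(1e-5))
# A quality score of 20 means one expected error per 100 base calls.
print(error_probability(20))
```

On this scale every 10 points is a tenfold drop in the error rate, which is why quality values are convenient as weights when reads disagree.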

Comparison
Sanger method:
Enzymatic
Requires DNA synthesis
Termination of chain elongation
Maxam-Gilbert method:
Chemical
Requires DNA
Requires long stretches of DNA
Breaks DNA at different nucleotides

How to obtain the human genome sequence
Sanger sequencing can only generate DNA segments about 1 kb long.
How can we obtain the human genome, which is 3 billion letters long?
The answer is to sequence many DNA segments and assemble them into the genome.

Challenges with Fragment Assembly
Sequencing errors: ~1-2% of bases are wrong
Repeats: false overlaps due to repeats (bacterial genomes: 5%; mammals: 50%)

Repeat Types
Low-complexity DNA (e.g. ATATATATACATA )
Microsatellite repeats: (a_1 ... a_k)^N where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG)
Transposons/retrotransposons:
SINE: Short Interspersed Nuclear Elements (e.g., Alu: ~300 bp long, 10^6 copies)
LINE: Long Interspersed Nuclear Elements (~500-5,000 bp long, 200,000 copies)
LTR retroposons: Long Terminal Repeats (~700 bp) at each end
Gene families: genes duplicate and then diverge
Segmental duplications: very long, very similar copies

Strategies for whole genome sequencing
1. Hierarchical (clone by clone): yeast, worm, human
i. Break genome into many long fragments
ii. Map each long fragment onto the genome
iii. Sequence each fragment with shotgun
2. Online version of (1) (walking): rice genome
i. Break genome into many long fragments
ii. Start sequencing each fragment with shotgun
iii. Construct map as you go
3. Whole Genome Shotgun: fly, human, mouse, rat, fugu
One large shotgun pass on the whole genome

Hierarchical Sequencing vs. Whole Genome Shotgun
Hierarchical Sequencing
Advantages: easy assembly
Disadvantages: must build a library and physical map; redundant sequencing
Whole Genome Shotgun (WGS)
Advantages: no mapping, no redundant sequencing
Disadvantages: difficult to assemble and to resolve repeats
Whole Genome Shotgun appears to be becoming more popular

Whole Genome Shotgun Sequencing
[Figure: the genome is cut many times at random; forward-reverse paired reads of ~500 bp each are separated by a known distance.]

Fragment Assembly
Cover the region with reads at ~7-fold redundancy
Overlap reads and extend to reconstruct the original genomic region

Read Coverage
Length of genomic segment: G
Number of reads: N
Length of each read: L
Definition: coverage C = NL/G
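A small worked example of the coverage definition, using figures quoted elsewhere in these slides (a 3-billion-letter genome, ~500 bp reads, ~7-fold redundancy). The function names are illustrative only.

```python
import math


def coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Coverage C = N * L / G."""
    return num_reads * read_length / genome_length


def reads_needed(target_coverage: float, read_length: int, genome_length: int) -> int:
    """Smallest number of reads N that reaches the target coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


GENOME_LENGTH = 3_000_000_000  # about 3 billion letters
READ_LENGTH = 500              # read length in bp

n = reads_needed(7, READ_LENGTH, GENOME_LENGTH)
print(n)                                        # reads required for C = 7
print(coverage(n, READ_LENGTH, GENOME_LENGTH))  # check: recovers the coverage
```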

Enough Coverage
How much coverage is enough?
According to the Lander-Waterman model, assuming reads are uniformly distributed, C = 7 leaves about 0.09% of bases unsequenced, roughly 1 base in every 1,000 nucleotides.

Lander-Waterman Model
Major assumptions:
Reads are randomly distributed in the genome
The number of times a base is sequenced follows a Poisson distribution: $P(x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$, where $\lambda$ is the average number of times a base is sequenced
G = genome length, L = read length, N = number of reads
Mean of the Poisson distribution: $\lambda = LN/G$ (the coverage)
Fraction of bases not sequenced: $P(x = 0) = e^{-\lambda}$; for $\lambda = 7$ this is 0.0009 = 0.09%
Total gap length: $P(x = 0) \cdot G$
Total number of gaps: $P(x = 0) \cdot N$
Implications:
This model was used to plan the Human Genome Project
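The quantities on this slide can be evaluated directly. The sketch below assumes the Poisson model exactly as stated; the genome size and read count in the example are hypothetical numbers chosen to give a coverage of 7.

```python
import math


def poisson_pmf(x: int, mean: float) -> float:
    """P(x) = mean^x * e^(-mean) / x!"""
    return mean ** x * math.exp(-mean) / math.factorial(x)


def lander_waterman(genome_length: int, read_length: int, num_reads: int) -> dict:
    """Evaluate the slide's Lander-Waterman quantities for given G, L and N."""
    mean = read_length * num_reads / genome_length  # coverage
    p_zero = poisson_pmf(0, mean)                   # fraction of bases never sequenced
    return {
        "coverage": mean,
        "fraction_not_sequenced": p_zero,
        "total_gap_length": p_zero * genome_length,
        "number_of_gaps": p_zero * num_reads,
    }


# Hypothetical example: a 1 Mb region covered by 14,000 reads of 500 bp.
stats = lander_waterman(genome_length=1_000_000, read_length=500, num_reads=14_000)
for name, value in stats.items():
    print(f"{name}: {value:.6g}")
```

For this example the model predicts roughly 900 unsequenced bases spread over about a dozen gaps, so the gaps are few but not negligible in size.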

Overlap-Layout-Consensus
Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA
Overlap: find potentially overlapping reads
Layout: merge reads into contigs and contigs into supercontigs
Consensus: derive the DNA sequence and correct read errors
..ACGATTACAATAGGTT..

Overlap
Find the best match between the suffix of one read and the prefix of another
Because of sequencing errors, dynamic programming is needed to find the optimal overlap alignment
Apply a filtration method to discard pairs of fragments that do not share a significantly long common substring
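One way to realize the suffix-prefix search is an overlap (semi-global) alignment, in which skipping the start of the first read and the end of the second read costs nothing. This is a minimal sketch with assumed scoring values (match +1, mismatch -1, gap -2); real assemblers use tuned scores and banded or seeded alignment.

```python
def best_overlap(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Best alignment of a suffix of `a` with a prefix of `b`.

    Returns (score, number of leading characters of `b` in the overlap).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][0] stays 0: any prefix of `a` may be skipped for free.
    score = [[0] * cols for _ in range(rows)]
    # `b` must be aligned from its first character, so leading gaps are charged.
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # The overlap must reach the end of `a`; the rest of `b` is free.
    last_row = score[rows - 1]
    best_j = max(range(cols), key=lambda j: last_row[j])
    return last_row[best_j], best_j


# The last six bases of the first read match the first six of the second.
print(best_overlap("TAGATTACACAG", "ACACAGATTACT"))
```

The table has len(a) x len(b) cells, which is why the slide pairs it with a filtration step: the quadratic alignment is only worth running on pairs that already share a long exact substring.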

Overlapping Reads
Sort all k-mers in reads (k ~ 24)
Find pairs of reads sharing a k-mer
Extend to full alignment; throw away if not >95% similar
TACA TAGATTACACAGATTACT GA
TAGT TAGATTACACAGATTACTAGA
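The k-mer seeding step can be sketched as follows. Note one substitution: the slide sorts all k-mers, whereas this sketch groups them with a hash table, which finds the same shared k-mers. The extension to a full alignment and the 95% similarity check are left out, and k is kept tiny so the toy reads are readable.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads: list, k: int) -> set:
    """Return pairs of read indices that share at least one exact k-mer."""
    index = defaultdict(set)  # k-mer -> indices of the reads containing it
    for read_id, read in enumerate(reads):
        for start in range(len(read) - k + 1):
            index[read[start:start + k]].add(read_id)
    pairs = set()
    for read_ids in index.values():
        # Every pair of reads listed under the same k-mer is a candidate overlap.
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


reads = ["TAGATTACACAG", "ACACAGATTACT", "GGGGCCCCGGGG"]
print(candidate_pairs(reads, k=6))
```

Only the candidate pairs are passed on to the full alignment, so unrelated reads such as the third one above are never compared.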

Overlapping Reads and Repeats
A k-mer that appears N times initiates N^2 comparisons
For an Alu that appears 10^6 times: 10^12 comparisons, which is too many
Solution: discard all k-mers that appear more than t x Coverage times (t ~ 10)
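The frequency cutoff might look like the sketch below. The function name and the toy values of k, t and coverage are assumptions for illustration; the slide suggests t ~ 10 and k ~ 24 for real data.

```python
from collections import Counter


def split_kmers_by_frequency(reads: list, k: int, coverage: float, t: float = 10):
    """Split k-mers into (kept, discarded) using the t * coverage cutoff."""
    counts = Counter(
        read[start:start + k]
        for read in reads
        for start in range(len(read) - k + 1)
    )
    limit = t * coverage
    kept = {kmer for kmer, n in counts.items() if n <= limit}
    discarded = {kmer for kmer, n in counts.items() if n > limit}
    return kept, discarded


# Toy data: the k-mer AAA occurs 7 times, above the cutoff of 2 * 1 = 2.
reads = ["AAAAAA", "AAAAAC", "ACGTGC"]
kept, discarded = split_kmers_by_frequency(reads, k=3, coverage=1, t=2)
print(sorted(kept))
print(sorted(discarded))

# Scale of the problem without the filter: 10^6 copies give about 10^12 comparisons.
print(f"{(10 ** 6) ** 2:.0e}")
```

A k-mer from unique sequence should occur roughly once per read that covers it, so a count far above the coverage is a strong hint that the k-mer comes from a repeat.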

Finding Overlapping Reads
Create local multiple alignments from the overlapping reads
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA

Finding Overlapping Reads (cont'd)
Correct errors using multiple alignment
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
Base calls and quality values in the two disagreeing columns, before correction:
C: 20  C: 35  T: 30  C: 35  C: 40
A: 15  A: 25  -      A: 40  A: 25
After correction:
C: 20  C: 35  C: 0   C: 35  C: 40
A: 15  A: 25  A: 0   A: 40  A: 25
Score alignments
Accept alignments with good scores
Multiple alignments will be covered later in the course

Layout
Repeats are a major challenge
Do two aligned fragments really overlap, or are they from two copies of a repeat?

Merge Reads into Contigs
[Figure: reads spanning a repeat region.]
Merge reads up to potential repeat boundaries

Merge Reads into Contigs (cont'd)
[Figure: reads spanning a repeat region.]
Ignore non-maximal reads
Merge only maximal reads into contigs

Merge Reads into Contigs (cont'd)
[Figure: is it a repeat boundary or a sequencing error? Reads a and b.]
Ignore hanging reads when detecting repeat boundaries

Merge Reads into Contigs (cont'd)
[Figure: ambiguous and unambiguous placements of non-maximal reads.]
Insert non-maximal reads whenever unambiguous

Link Contigs into Supercontigs
Normal density
Too dense: overcollapsed? (Myers et al. 2000)
Inconsistent links: overcollapsed?

Link Contigs into Supercontigs (cont'd)
Find all links between unique contigs
Connect contigs incrementally, if ≥ 2 links

Link Contigs into Supercontigs (cont'd)
Fill gaps in supercontigs with paths of overcollapsed contigs

Consensus
A consensus sequence is derived from a profile of the assembled fragments
A sufficient number of reads is required to ensure a statistically significant consensus
Reading errors are corrected

Derive Consensus Sequence
TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA
Consensus:
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
Derive multiple alignment from pairwise read alignments
Derive each consensus base by weighted voting
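Weighted voting over the columns of the alignment can be sketched as below. It assumes the reads are already aligned into equal-length rows with "-" marking gaps, and that each base call carries a weight such as a quality value. With no weights given, every call counts equally, which is a simplification.

```python
from collections import defaultdict


def consensus(rows: list, weights: list = None) -> str:
    """Pick the highest-weight symbol in each column, then drop gap columns."""
    if weights is None:
        weights = [[1] * len(row) for row in rows]  # unweighted majority vote
    result = []
    for col in range(len(rows[0])):
        votes = defaultdict(float)
        for row, row_weights in zip(rows, weights):
            votes[row[col]] += row_weights[col]
        result.append(max(votes, key=votes.get))
    return "".join(result).replace("-", "")


rows = [
    "TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA",
]
print(consensus(rows))

# One column with the quality values from the error-correction slide:
# C:20, C:35, T:30, C:35, C:40. The C calls outweigh the single T.
print(consensus(["C", "C", "T", "C", "C"], [[20], [35], [30], [35], [40]]))
```

Each disagreeing column is outvoted by the other reads, which is how isolated sequencing errors disappear from the consensus once coverage is high enough.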