Introduction to Next Generation Sequencing (NGS)

Simon Rasmussen, Assistant Professor
Center for Biological Sequence Analysis, Technical University of Denmark, 2012

Today
9.00-9.45: Introduction to NGS, how it works
10.00-10.30: Data basics - what does the data look like?
10.30-11.00: De novo assembly exercise
11.15-12.00: Alignment of reads
13.00-13.30: Introduction to exercise (variations, alignment processing, genotyping)
13.30-16.30: Afternoon exercise

DNA sequencing
Reading the order of bases in DNA fragments

Why NGS?
It is transforming how we do biological science (and bioinformatics) by allowing experiments that could not have been done before, and by letting us perform experiments much faster

How?
By producing massive amounts of sequence data, really fast

1st generation to NGS
Figure (Stratton et al., Nature 2009): sequencing throughput in kilobases per day per machine (log scale, 10 to 1,000,000,000) against year (1980 to 2010 and future). Gel-based systems: manual slab gel, automated slab gel. Capillary sequencing: first-generation capillary, second-generation capillary sequencer. Massively parallel sequencing: microwell pyrosequencing, short-read sequencers, single molecule?
1977 - Sanger chain-termination method

1st generation to NGS
Same figure (Stratton et al., Nature 2009), annotated with the Human genome and with the platforms Illumina, SOLiD, 454, Ion Torrent, Pacific Biosciences and Oxford Nanopore.

Read throughput
1977: x 1
1998: x 384
2006: x very big number
2007: x even bigger number
2008: x gigantic number
2011: x big number
First generation: 1-384 reads in parallel; NGS: 10^5-10^9

Sequencing costs
The drop in costs is faster than Moore's Law (computer power doubles every 2 years)

Human sequencing
First draft genome of human in 2001, final in 2004
Estimated cost $3 billion, time 13 years
Today: Illumina: 1 week, $4,000
Exome: 6 weeks*, $998
Towards the $1,000 genome?
* Real time, not machine time

Storage and analysis
The highest cost is (almost) not the sequencing but the storage and analysis
A standard human (30-40x) whole-genome sequencing experiment would create 100 Gb of data

Storage and analysis
BGI, based in China, is the world's largest genomics research institute, with 167 DNA sequencers producing the equivalent of 2-4,000 human genomes a day.
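The 100 Gb figure for a 30-40x human genome can be checked with simple arithmetic: sequenced bases = coverage x genome size. The sketch below is only a rough estimate. The genome size of 3.1 billion bases and the 100-base read length are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

# Assumption: approximate haploid human genome size in bases.
HUMAN_GENOME_BP = 3.1e9


def sequenced_gigabases(coverage, genome_bp=HUMAN_GENOME_BP):
    """Total sequenced bases, in gigabases (Gb), for a target average coverage."""
    return coverage * genome_bp / 1e9


def reads_needed(coverage, read_length, genome_bp=HUMAN_GENOME_BP):
    """Number of reads of a given length needed to reach the target coverage."""
    return math.ceil(coverage * genome_bp / read_length)


if __name__ == "__main__":
    for depth in (30, 40):
        print(f"{depth}x coverage: about {sequenced_gigabases(depth):.0f} Gb of bases")
    # Assumption: 100-base reads.
    print(f"Reads needed for 30x at 100 bases: {reads_needed(30, 100):,}")
```

This counts bases only. The files on disk are larger, because each base is stored together with a quality score and read identifiers.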

The X Genomes projects
1000 Genomes Project: catalog of human genetic variation, including SNPs and structural variants, and their haplotype contexts
Sequence 2,500 unidentified people from about 25 populations around the world
10,000 microbial genomes project, Earth Microbiome Project, Cancer Genome Project, plants and animals, ...

NGS & Bioinformatics
Extreme data size causes problems
Just transferring and storing the data is hard
Standard all-against-all comparisons fail (N*N)
Standard tools cannot be used
Think in fast and parallel programs
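To see why N*N comparisons fail, it helps to count them. The sketch below counts the unordered pairs among N reads and converts that into running time. The rate of one billion comparisons per second is an arbitrary assumption, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 86400


def pairwise_comparisons(n):
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def estimated_days(n, comparisons_per_second=1e9):
    """Days needed to compare every pair at an assumed constant rate."""
    return pairwise_comparisons(n) / comparisons_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    for n_reads in (10**3, 10**6, 10**9):
        pairs = pairwise_comparisons(n_reads)
        days = estimated_days(n_reads)
        print(f"{n_reads:>13,} reads: {pairs:.2e} pairs, about {days:.2e} days")
```

A thousand times more reads means roughly a million times more comparisons. That is why NGS tools rely on indexing and hashing rather than comparing every read with every other read.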

What can we use it for?
Whole genome re-sequencing
Ancient genomes
Metagenomics
Cancer genomics
Exome sequencing (targeted)
RNA sequencing
ChIP-seq
Genomic epidemiology
Anything with DNA

How does it work?

First generation: Sanger (dye)
Fragment DNA
Clone into plasmid and amplify
Sequence using dNTPs + labelled ddNTPs (which stop the reaction)
Run capillary electrophoresis and read the DNA code
Low output, long reads (~300-1000 nt), high quality
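The chain-termination idea can be illustrated with a toy simulation. Many copies of the template are extended, each copy stops at a random position where a labelled ddNTP was incorporated, the fragments are sorted by length (the electrophoresis step), and the label on the last base of each length is read. This is a simplified sketch, not a model of the real chemistry. It ignores strand orientation and signal noise, and the function name and parameters are hypothetical.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sanger_read(template, n_molecules=2000, seed=0):
    """Toy chain-termination read of the strand synthesized along `template`."""
    rng = random.Random(seed)
    terminal_base_by_length = {}
    for _ in range(n_molecules):
        # Position at which this copy happened to incorporate a labelled ddNTP.
        stop = rng.randint(1, len(template))
        fragment = "".join(COMPLEMENT[base] for base in template[:stop])
        # The dye label identifies the terminating (last) base of the fragment.
        terminal_base_by_length[stop] = fragment[-1]
    # Electrophoresis separates fragments by length, shortest first.
    lengths = sorted(terminal_base_by_length)
    return "".join(terminal_base_by_length[length] for length in lengths)


if __name__ == "__main__":
    print(sanger_read("ATGCCGTA"))
```

With enough molecules every fragment length is represented, so the labels read in order of length spell out the synthesized strand.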

2nd generation
1. Create library molecule
2. Amplification (PCR)
3. Massive parallel sequencing

Creating the library: DNA from extract -> fragment & polish DNA -> add adapters -> library molecule

Amplification and immobilization
Emulsion PCR (454, SOLiD, Ion Torrent): water, oil, beads, one DNA template per droplet
Bridge PCR (Illumina): one DNA template per cluster, primers on the surface, clusters grow by bridging to neighbouring primers
Metzker, Nat Rev Genet 2010

Fluorescence detection
Illumina - Cyclic reversible termination: incorporate all four nucleotides, each labelled with a different dye; wash, four-colour imaging; cleave dye and terminating groups, wash; repeat cycles
Helicos BioSciences - Reversible terminators: incorporate single, dye-labelled nucleotides; wash, one-colour imaging; cleave dye and inhibiting groups, cap, wash; repeat cycles
454 - Pyrosequencing: load template beads into wells; flow one dNTP across the wells; polymerase incorporates the nucleotide; release of PPi leads to light; imaging, then the next dNTP
Metzker, Nat Rev Genet 2010
27803 - Biological Sequence Analysis

2: Imaging handout
Figure (Metzker, Nat Rev Genet 2010, Figure 2): Four-colour and one-colour cyclic reversible termination methods. Example images: Illumina 1, Illumina 2 and 454 (the base calls shown in the panels did not survive extraction).
a: The four-colour cyclic reversible termination (CRT) method uses Illumina/Solexa's 3'-O-azidomethyl reversible terminator chemistry [...] solid-phase-amplified template clusters (shown as single templates for illustrative purposes). Following imaging, a cleavage step removes the fluorescent dyes and regenerates the 3'-OH group using the reducing agent tris(2-carboxyethyl)phosphine (TCEP).
b: The four-colour images highlight the sequencing data from two [...] amplified templates.
c: Unlike Illumina/Solexa's terminators, the Helicos Virtual Terminators are labelled [...] same dye and dispensed individually in a predetermined order, analogous to a single-nucleotide addition [...]. Following total internal reflection fluorescence imaging, a cleavage step removes the fluorescent dye and [...] groups using TCEP to permit the addition of the next Cy5-2'-deoxyribonucleoside triphosphate (dNTP) [...]. Free sulphhydryl groups are then capped with iodoacetamide before the next nucleotide addition.
d: The one-colour images highlight the sequencing data from two single-molecule templates.
One-base-encoded probe: an oligonucleotide sequence in which one interrogation base is associated with a particular [...]

2.5: Ion Torrent
[Ion Torrent video]
Based on semiconductors, i.e. no fluorescence
The release of a hydrogen ion when a nucleotide is incorporated is measured by a pH meter
Small machine, low price per run
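Both 454 pyrosequencing and Ion Torrent flow one type of nucleotide at a time and measure a signal (light or pH change) that grows with the number of bases incorporated in that flow. The sketch below shows an idealised, noise-free flowgram for a strand being synthesized. The flow order "TACG" and the function names are assumptions for illustration only.

```python
def flowgram(sequence, flow_order="TACG"):
    """Return (nucleotide, signal) per flow; signal = bases incorporated in that flow."""
    unknown = set(sequence) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not covered by the flow order: {sorted(unknown)}")
    flows = []
    pos = 0
    while pos < len(sequence):
        for nucleotide in flow_order:
            run = 0
            # A homopolymer run is incorporated completely within a single flow.
            while pos < len(sequence) and sequence[pos] == nucleotide:
                run += 1
                pos += 1
            flows.append((nucleotide, run))
            if pos >= len(sequence):
                break
    return flows


def call_bases(flows):
    """Convert idealised flow signals back into a base sequence."""
    return "".join(nucleotide * signal for nucleotide, signal in flows)


if __name__ == "__main__":
    flows = flowgram("TTAGGG")
    print(flows)
    print(call_bases(flows))
```

In a real instrument the signal is noisy, so telling a run of 6 from a run of 7 identical bases is hard. This is the reason these platforms have indel errors in homopolymer regions.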

3rd generation
No amplification (PCR introduces bias!)
Simple sample preparation
Helicos
Pacific Biosciences
Oxford Nanopore

Platform | Method of amplification | Chemistry | Instrument cost | Yield per run | Read length (bases) | Reagent cost (library + run) | Cost per Mb | Primary error & error rate
3730XL | Clonal plasmid amplification | Chain termination | $376k | 60 kb | 650 | $96 | $1600 | Substitution, 0.1-1%
454 FLX | emPCR on beads | Synthesis (pyrosequencing) | $500k | 900 Mb | 750 | $6,200 | $7 | Indel, 1%
454 GS Jr | emPCR on beads | Synthesis (pyrosequencing) | $108k | 50 Mb | 400 | $1,100 | $22 | Indel, 1%
HiSeq 2000 | Bridge PCR amplification | Synthesis (reversible termination) | $690k | 600 Gb | 100 | $23,610 | $0.039 | Substitution, >0.1%
MiSeq | Bridge PCR amplification | Synthesis (reversible termination) | $125k | 1 Gb | 150 | $1,035 | $1 | Substitution, >0.1%
SOLiD 5500 | emPCR on bead | Ligation (dual-base encoding) | $595k | 155 Gb | 75 + 35 | $10,503 | $0.068 | Indel, >0.01%
IonTorrent | emPCR on bead | Synthesis (H+ detection) | $67.5k | 1 Gb | 200 (318 chip) | $925 | $0.93 | Indel, ~1%
PacBio RS | None | Synthesis | $695k | 20-80 Mb | <1,800 - >5,000 | $272 | $3.4-13.6 | Indel, ~15%

Primary advantage
3730XL: Low cost for small study
454 FLX: Long read length
454 GS Jr: Long read length
HiSeq 2000: Most output at lowest cost
MiSeq: Easy workflow & fast run
SOLiD 5500: Each lane can be run independently & ability to rescue failed cycle
IonTorrent: Fast run, low cost, and trajectory to longer reads
PacBio RS: Longest read length, single molecule real-time sequencing

Primary disadvantage
3730XL: High cost for large study
454 FLX: Unreliable for homopolymer regions; high cost for NGS
454 GS Jr: High cost per Mb
HiSeq 2000: High capital cost & computation need
MiSeq: Few reads & higher cost per Mb
SOLiD 5500: Relatively short reads, more gaps in assemblies
IonTorrent: Unreliable for long homopolymer regions
PacBio RS: High error rates, low output, expensive
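The cost per Mb values in the table appear to be the reagent cost divided by the yield per run. The sketch below recomputes them as a sanity check. It assumes that the larger yields are in gigabases (Gb), that 1 Gb = 1,000 Mb, and that 60 kb = 0.06 Mb. The dictionary and function names are hypothetical.

```python
# Reagent cost (USD) and yield per run (Mb), transcribed from the table above.
# Assumption: yields given in Gb were converted with 1 Gb = 1,000 Mb.
RUNS = {
    "3730XL": (96, 0.06),
    "454 FLX": (6200, 900),
    "454 GS Jr": (1100, 50),
    "HiSeq 2000": (23610, 600_000),
    "MiSeq": (1035, 1_000),
    "SOLiD 5500": (10503, 155_000),
    "IonTorrent": (925, 1_000),
}


def cost_per_mb(reagent_cost_usd, yield_mb):
    """Reagent cost divided by the number of megabases produced in one run."""
    return reagent_cost_usd / yield_mb


if __name__ == "__main__":
    for platform, (cost, yield_mb) in RUNS.items():
        print(f"{platform:<11} ${cost_per_mb(cost, yield_mb):,.3f} per Mb")
```

PacBio RS is left out because its yield is a range: $272 over 20-80 Mb gives the $3.4-13.6 per Mb listed in the table.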