Genome 373: High-Throughput DNA Sequencing. Doug Fowler

Tasks give ML unity We learned about three tasks that are commonly encountered in ML

Models/Algorithms Give ML Diversity Classification Regression Clustering We covered a handful of the many models that can be employed to solve these tasks

Models/Algorithms Give ML Diversity Classification: Decision trees, Random forests, K-nearest neighbor, Neural networks, Naïve Bayes classifier, SVM, Perceptron. Regression: SVR, Linear regression, Kernel regression, Generalized linear models. Clustering: K-means, Expectation-maximization, Hierarchical, Mean-shift, Density-based, Graph-based, Biclustering. We learned about three tasks that are commonly encountered in ML

ML In Genomics Has Been Driven By an Onslaught of Data

Outline 1. HTS technology and a bit of history 2. HTS applications: a data onslaught 3. The future

DNA sequencing throughput [Figure: nucleotides sequenced per day per instrument, rising through 100's, 1,000's, 1,000,000's and 1,000,000,000's across 1996, 1998, 2001 and 2009]

How? New Sequencing Paradigms How does Sanger sequencing work? Sanger sequencing

How? New Sequencing Paradigms Each DNA molecule of interest is amplified and sequenced separately Sanger sequencing

How? New Sequencing Paradigms Massively parallel sequencing Sanger sequencing Millions of DNA molecules are amplified (maybe) and sequenced in parallel

Two different strategies for parallel amplification BRIDGE PCR EMULSION PCR

How Does PCR Work?

Bridge PCR To Amplify DNA Cluster generating primer #2 DNA to be sequenced Cluster generating primer #1 The DNA to be sequenced is prepared by attaching cluster generating primers to each end

Bridge PCR To Amplify DNA Single-stranded DNA molecules are attached to a glass flowcell seeded with surface-bound cluster generating primers

Bridge PCR To Amplify DNA Templates are annealed to the complementary surface-bound cluster generating primers

Bridge PCR To Amplify DNA A single-stranded bridge Templates are annealed to the complementary surface-bound cluster generating primers

Bridge PCR To Amplify DNA A double-stranded bridge DNA polymerase extends the annealed primer

Bridge PCR To Amplify DNA This completes one cycle of bridge PCR (note we now have two surface-bound copies of each template)

Bridge PCR To Amplify DNA Repeated cycles result in many copies of each clonal template per cluster. The clusters form a randomly addressed DNA array
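
The growth of a cluster over repeated cycles can be sketched as idealized doubling. This is a minimal illustration, not a model of real flowcell chemistry: perfect doubling per cycle and the `max_copies` ceiling (standing in for the finite supply of nearby surface-bound primers) are assumptions, and the function name is hypothetical.

```python
def cluster_copies(cycles: int, max_copies: int = 1000) -> int:
    """Idealized copy number of one template after `cycles` rounds of bridge PCR."""
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    # One cycle turns each surface-bound copy into two (best case).
    # The cap is an assumed limit, not a measured value.
    return min(2 ** cycles, max_copies)


if __name__ == "__main__":
    for n in (0, 1, 5, 10, 15):
        print(f"after {n:2d} cycles: {cluster_copies(n)} copies")
```

One cycle gives two copies, matching the slide above; later cycles saturate at the assumed cap.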

Bridge PCR Lends Itself to an Optical Readout A single Illumina NextSeq flowcell will contain ~400,000,000 clusters, each grown from an individual DNA molecule

Massively Parallel DNA Sequencing [Figure: nine clusters imaged over Cycle 1, Cycle 2 and Cycle 3 (What is Base 1? What is Base 2? What is Base 3?), giving the reads AGT GAC ACG CTT GGC TAA TAG AGC CCT] One cycle: 1. Polymerase extension with fluorescent nucleotides 2. Imaging of the slide 3. Cleave & wash
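
Reading one base per cluster per cycle and then stacking the cycles can be sketched as below. The nine three-base reads are the ones on the slide; the order in which clusters are listed within each cycle is an assumption made for illustration, and `assemble_reads` is a hypothetical helper name.

```python
# Base called for each of nine clusters in each imaging cycle.
# Position i in every string refers to the same cluster (assumed ordering).
cycle_calls = [
    "AGACGTTAC",  # cycle 1: "What is Base 1?"
    "GACTGAAGC",  # cycle 2: "What is Base 2?"
    "TCGTCAGCT",  # cycle 3: "What is Base 3?"
]


def assemble_reads(calls_by_cycle):
    """Turn per-cycle base calls into one read per cluster."""
    n_clusters = len(calls_by_cycle[0])
    if any(len(calls) != n_clusters for calls in calls_by_cycle):
        raise ValueError("every cycle must report one base per cluster")
    # Read i is the i-th base of cycle 1, then cycle 2, and so on.
    return ["".join(calls[i] for calls in calls_by_cycle) for i in range(n_clusters)]


reads = assemble_reads(cycle_calls)

if __name__ == "__main__":
    print(" ".join(reads))
```

Each extra cycle adds one base to every read at once, which is where the parallelism comes from.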

10^9-plex → effective reagent volume of femtoliters per sequencing reaction

Two different strategies for parallel amplification BRIDGE PCR EMULSION PCR

Emulsion PCR to amplify DNA As with bridge PCR we start by attaching amplification adapters to DNA template molecules

Emulsion PCR to amplify DNA Next we mix these template molecules with beads containing a single surface-bound amplification primer

Emulsion PCR to amplify DNA An emulsion is formed from template and beads such that each droplet gets one bead and one template (on average)
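
What "one template per droplet (on average)" means for individual droplets can be sketched with a Poisson model. Poisson loading is an assumption here (the slides do not name a distribution), and the function names are hypothetical.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k templates in a droplet when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def droplet_fractions(lam: float) -> dict:
    """Fractions of droplets with zero, exactly one, or more than one template."""
    empty = poisson_pmf(0, lam)
    single = poisson_pmf(1, lam)
    return {"empty": empty, "single": single, "multiple": 1.0 - empty - single}


if __name__ == "__main__":
    for lam in (0.1, 0.5, 1.0):
        f = droplet_fractions(lam)
        print(f"mean {lam}: empty={f['empty']:.3f} "
              f"single={f['single']:.3f} multiple={f['multiple']:.3f}")
```

Under this assumption, a mean of one template per droplet still leaves many droplets empty and a sizeable fraction with more than one template. Droplets with several templates would give mixed, non-clonal beads.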

Emulsion PCR to amplify DNA An emulsion is a stable mixture of water droplets in oil; each droplet is like a tiny PCR tube.

Emulsion PCR to amplify DNA The droplet contains all the ingredients necessary for PCR including the other primer and the polymerase

Emulsion PCR to amplify DNA The free template anneals to a surface-bound primer, the polymerase extends the primer, and the template is melted off the new bead-bound strand. This is one cycle

Emulsion PCR to amplify DNA Multiple cycles result in clonal amplification of the single template DNA strand

Emulsion PCR to amplify DNA Finally, we break the emulsion and isolate beads with amplified product

So, what next? We now have millions of beads, each with clonally amplified template strands. What should we do next?

Bead-Bound DNA Lends Itself to Microwells One cycle: 1. Add first dNTP (e.g. dATP) 2. pH change from reaction detected 3. Wash 4. Repeat for each other dNTP
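
The add-detect-wash cycle above can be sketched as a toy simulation for a single microwell. The flow order `TACG`, the idea that the signal equals the number of bases incorporated, and both function names are assumptions for illustration; real signals are noisy.

```python
FLOW_ORDER = "TACG"  # assumed order in which nucleotides are flowed


def simulate_flows(synthesized: str, n_flows: int, flow_order: str = FLOW_ORDER):
    """Return (nucleotide, incorporations) for each flow over one well.

    `synthesized` is the strand the polymerase builds, written 5'->3'.
    """
    signals = []
    pos = 0
    for i in range(n_flows):
        base = flow_order[i % len(flow_order)]
        count = 0
        # A flowed nucleotide is incorporated as long as it matches the
        # next position, so a homopolymer run is consumed in one flow.
        while pos < len(synthesized) and synthesized[pos] == base:
            count += 1
            pos += 1
        signals.append((base, count))
    return signals


def decode_flows(signals) -> str:
    """Rebuild the sequence from idealized per-flow signals."""
    return "".join(base * count for base, count in signals)


if __name__ == "__main__":
    flows = simulate_flows("TTAGGC", n_flows=8)
    print(flows)
    print(decode_flows(flows))
```

A flow that produces no signal still carries information: it says the next base is not the one just added.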

Many ways to skin a cat Rothberg et al. (2011): clonal amplification by emulsion PCR, sequencing by pH microsensor. Fedurco et al. (2006): clonal amplification by bridge PCR, sequencing by optical imaging.

2001: $3 billion; 2007: $100 million; 2008: $1.5 million; 2015: $1,000

What can a sequencer do today? In 8 days (Illumina HiSeq 2000): paired-end 100 bp reads; >2 billion read-pairs; >400 gigabases (Gb) of total output; 1 in 1,000 error rate for most base-calls. Initial draft of human genome based on 23 Gb
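
The output figure follows from the read counts, as this short calculation shows. The read-pair count and read length are the lower bounds quoted on the slide; the human genome size of about 3.1 Gb is an assumption added here to express the output as fold coverage.

```python
read_pairs = 2_000_000_000        # ">2 billion read-pairs" (lower bound)
read_length = 100                 # bp per read; two reads per pair
total_bases = read_pairs * 2 * read_length

HUMAN_GENOME_BP = 3_100_000_000   # assumed genome size, not from the slides
coverage = total_bases / HUMAN_GENOME_BP

draft_bases = 23_000_000_000      # "initial draft ... based on 23 Gb"
runs_vs_draft = total_bases / draft_bases

if __name__ == "__main__":
    print(f"total output: {total_bases / 1e9:.0f} Gb")
    print(f"fold coverage of one human genome: {coverage:.0f}x")
    print(f"output relative to the draft genome data: {runs_vs_draft:.1f}x")
```

Under these assumptions a single run yields well over a hundred-fold coverage of one human genome, and many times the data behind the initial draft.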

What's the catch? 1. Very short, error-laden sequence reads 2. Sequence biases (e.g. G+C content) 3. Days to weeks per instrument run 4. Substantial bioinformatics burden 5. It's cheap, but it still isn't free

Outline 1. HTS technology and a bit of history 2. HTS applications: a data onslaught 3. The future

The ENCODE Project As An Example The goal of the Encyclopedia Of DNA Elements (ENCODE) was to identify all functional elements of the human genome. This was a massive and audacious undertaking that generated a ridiculous quantity of data of multiple types. The results are somewhat controversial, but a significant fraction of the human genome has some function

Where Did All This Data Come From? Every position in the genome was annotated with a huge amount of data on histone modifications, protein binding, accessibility, etc

HTS Enables High-Throughput Biology Biological phenomenon of interest We start with something we want to measure for each position in a genome

HTS Enables High-Throughput Biology Biological phenomenon of interest Sequencing-based assay Then, we develop a way to measure that phenomenon with sequencing

HTS Enables High-Throughput Biology Biological phenomenon of interest Sequencing-based assay Massively parallel way to measure the phenomenon A current-generation HT sequencer can collect ~400 million reads, each of which can be an individual measurement

HTS Enables High-Throughput Biology Biological phenomenon of interest Sequencing-based assay Massively parallel way to measure the phenomenon What are some things we might want to measure about positions in a genome?

DNA = machine-readable format for capturing biological information 1. Genetic variation in individuals 2. de novo genome assembly 3. Digital expression profiling (RNA-Seq) 4. Protein-DNA interactions (ChIP-Seq) 5. DNA accessibility (DNAse-Seq) 6. Methylation profiling (Methyl-Seq)

DNAse hypersensitivity: DNA Accessibility Why? DNA in the nucleus is not naked. Chromatin is a complex assembly of genomic DNA and proteins

DNAse hypersensitivity: DNA Accessibility We want to know where individual proteins bind to DNA

DNAse hypersensitivity: DNA Accessibility DNA is first digested with DNAse, an enzyme that cleaves DNA when it is not blocked by other proteins

DNAse hypersensitivity: DNA Accessibility DNA is first digested with DNAse, an enzyme that cleaves DNA when it is not blocked by other proteins. Before HTS, individual loci were probed using Southern blots (i.e. gels followed by hybridization of radioactively labeled probe DNA)

DNAse-Seq: DNA Accessibility A first linker is ligated to mark the DNAse sensitive sites

DNAse-Seq: DNA Accessibility The DNA is digested with another nuclease to generate small fragments

DNAse-Seq: DNA Accessibility A second linker is added and the fragments are amplified. Note, we now have a bunch of sequenceable fragments, one end of each corresponding to a DNAse-sensitive site in the genome!

DNAse-Seq: DNA Accessibility Once we sequence our fragments, what informatics step do we need to take next?

DNAse-Seq: DNA Accessibility Alignment of the reads to the genome to find where the hypersensitive sites are!
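
The idea of read alignment can be sketched with exact string matching. This is only a toy: real aligners use indexed references and tolerate mismatches, and the reference, reads and function name here are invented for illustration.

```python
def find_exact_hits(reference: str, read: str) -> list:
    """Return every 0-based start position where `read` matches exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        # Resume one base later so overlapping matches are also found.
        start = reference.find(read, start + 1)
    return hits


reference = "ACGTTAGGCTAACGTTAGC"   # made-up reference sequence
reads = ["TAGGC", "ACGTTAG", "GGGGG"]  # made-up reads

alignments = {read: find_exact_hits(reference, read) for read in reads}

if __name__ == "__main__":
    for read, hits in alignments.items():
        if not hits:
            status = "unmapped"
        elif len(hits) == 1:
            status = "unique"
        else:
            status = "ambiguous"
        print(read, hits, status)
```

The second read matches two places, which shows why short reads from repeated sequence are hard to place.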

DNAse-Seq: DNA Accessibility After read mapping we know the number of DNAse fragments at each position in the genome

DNAse-Seq: DNA Accessibility HTS reveals the number of DNAse fragments at each position in the genome. This is a digital readout that tells us how sensitive to DNAse each site in the genome is (i.e. it's quantitative)
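
Turning mapped reads into a per-position count can be sketched as follows. The cut-site positions are invented, and the threshold of two reads for flagging a site is an arbitrary assumption chosen only to show the counts being used quantitatively.

```python
from collections import Counter


def cleavage_profile(cut_sites, genome_length: int) -> list:
    """Count mapped DNAse cut sites at every position of a toy genome."""
    for pos in cut_sites:
        if not 0 <= pos < genome_length:
            raise ValueError(f"position {pos} is outside the genome")
    counts = Counter(cut_sites)
    return [counts.get(pos, 0) for pos in range(genome_length)]


# Hypothetical 0-based positions where mapped reads mark a cut.
cut_sites = [3, 3, 3, 4, 10, 10, 17]
profile = cleavage_profile(cut_sites, genome_length=20)

# Flag positions at or above an assumed, arbitrary threshold.
hotspots = [pos for pos, n in enumerate(profile) if n >= 2]

if __name__ == "__main__":
    print(profile)
    print("positions with >= 2 cuts:", hotspots)
```

Because each read is one count, doubling the cleavage at a site doubles its value in the profile.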

HTS-based Readouts Are Genomewide We can look at patterns of DNAse sensitivity across the whole genome; every position has an attached DNAse read-count data point

HTS-based Readouts Are Genomewide Based on the sequence of protected DNA we can predict what transcription factors bind in each location

HTS-based Readouts Are Exquisitely Sensitive We can actually see the effect of DNA contacts made by different transcription factors!

ChIP-Seq: Protein Binding How would you find the location of every binding site for a particular transcription factor?

ChIP-Seq: Protein Binding Crosslink DNA and proteins using formaldehyde (why?) then shear

ChIP-Seq: Protein Binding Use an antibody against the protein of interest to immunoprecipitate the protein-DNA complexes

ChIP-Seq: Protein Binding Reverse chemical crosslinks and sequence DNA fragments

CCC (Chromosome Conformation Capture) Randomly assorted Nonrandom How do you guys think chromosomes are arranged within nuclei?

CCC (Chromosome Conformation Capture) Randomly assorted Nonrandom In fact, this is an active area of investigation and the degree to which chromosome conformation is ordered is unclear

CCC (Chromosome Conformation Capture) Randomly assorted Nonrandom How would you design an assay that could reveal chromosome conformation (hint: remember that in chromosomes DNA is coated with proteins)?

CCC (Chromosome Conformation Capture) The key idea is that we crosslink the proteins so that adjacent DNA is physically linked

CCC (Chromosome Conformation Capture)

CCC (Chromosome Conformation Capture) This is the conformation of chromosomes in the yeast nucleus as determined by sequencing!
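
One way sequencing data of this kind could be summarized is a matrix counting how often two genomic regions are found linked. The slides do not describe the analysis, so this is an illustrative sketch only: the read pairs, the bin size and the function name are all assumptions.

```python
def contact_matrix(pairs, genome_length: int, bin_size: int) -> list:
    """Count crosslinked position pairs between fixed-size genomic bins."""
    n_bins = -(-genome_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for a, b in pairs:
        i, j = a // bin_size, b // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


# Hypothetical pairs: the two genomic positions joined in each fragment.
pairs = [(5, 95), (12, 88), (40, 45), (8, 91), (60, 20)]
m = contact_matrix(pairs, genome_length=100, bin_size=25)

if __name__ == "__main__":
    for row in m:
        print(row)
```

In this made-up data the first and last bins are far apart in sequence yet share three links, the kind of pattern that would suggest they sit close together in the nucleus.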

Outline 1. HTS technology and a bit of history 2. HTS applications: a data onslaught 3. The future

Single Molecule Sequencing Many of the problems of HTS stem from: 1) Massively parallel DNA amplification (errors, etc.) 2) Very short reads (mapping problems). What if we could sequence long, unamplified individual DNA molecules?

Single Molecule Sequencing: Fluorescent A zero-mode waveguide enables detection of fluorescence only very near the bottom of the nanowell

Single Molecule Sequencing: Fluorescent Each nanowell contains one DNA polymerase and one template DNA molecule

Single Molecule Sequencing: Fluorescent Sequencing is detected in real time by looking for pulses of fluorescence associated with base addition

Single Molecule Sequencing: Fluorescent Fluorophores are attached to the triphosphate group so they diffuse away from the detection volume rapidly!

Single Molecule Sequencing: Nanopores Protein nanopores enable water and ions to cross membranes

Single Molecule Sequencing: Nanopores The result is a current flow if there is a difference in potential across the membrane

Single Molecule Sequencing: Nanopores If a large molecule passes through the aperture of the pore, current flow is reduced

Single Molecule Sequencing: Nanopores Objects of different sizes cause different (and characteristic) current flows

Single Molecule Sequencing: Nanopores Sensing the different current flow induced by each different base is the key concept of nanopore sequencing
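
The key concept can be sketched as assigning each measured current to the base with the nearest characteristic level. The current values in `LEVELS` are invented, and the trace is assumed to be already split into one measurement per base. In practice several bases sit in the pore at once, so real base calling is more involved than this.

```python
# Hypothetical characteristic currents (picoamps) for each base.
LEVELS = {"A": 50.0, "C": 60.0, "G": 70.0, "T": 80.0}


def call_base(current: float) -> str:
    """Pick the base whose assumed characteristic current is closest."""
    return min(LEVELS, key=lambda base: abs(LEVELS[base] - current))


def decode_trace(currents) -> str:
    """Call one base per current measurement (assumes a pre-segmented trace)."""
    return "".join(call_base(c) for c in currents)


trace = [51.2, 79.1, 68.5, 61.0, 49.0]  # made-up noisy measurements
sequence = decode_trace(trace)

if __name__ == "__main__":
    print(sequence)
```

The closer together the characteristic levels are relative to the noise, the more often a measurement lands nearer the wrong base.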