Genome 373: High-Throughput DNA Sequencing Doug Fowler
Tasks give ML unity We learned about three tasks that are commonly encountered in ML
Models/Algorithms Give ML Diversity Classification Regression Clustering We covered a handful of the many models that can be employed to solve these tasks
Models/Algorithms Give ML Diversity Classification: Decision trees, Random forests, K-nearest neighbor, Neural networks, Naïve Bayes classifier, SVM, Perceptron. Regression: SVR, Linear regression, Kernel regression, Generalized linear models. Clustering: K-means, Expectation-max, Hierarchical, Mean-shift, Density-based, Graph-based, Biclustering. We learned about three tasks that are commonly encountered in ML
ML In Genomics Has Been Driven By an Onslaught of Data
Outline HTS technology and a bit of history HTS applications: a data onslaught The future
DNA sequencing throughput (nucleotides sequenced per day per instrument) 1996: 100's; 1998: 1,000's; 2001: 1,000,000's; 2009: 1,000,000,000's
How? New Sequencing Paradigms How does Sanger sequencing work? Sanger sequencing
How? New Sequencing Paradigms Each DNA molecule of interest is amplified and sequenced separately Sanger sequencing
How? New Sequencing Paradigms Massively parallel sequencing Sanger sequencing Millions of DNA molecules are amplified (maybe) and sequenced in parallel
Two different strategies for parallel amplification BRIDGE PCR EMULSION PCR
How Does PCR Work?
Bridge PCR To Amplify DNA Cluster generating primer #2 DNA to be sequenced Cluster generating primer #1 The DNA to be sequenced is prepared by attaching cluster generating primers to each end
Bridge PCR To Amplify DNA Single-stranded DNA molecules are attached to a glass flowcell seeded with surface-bound cluster generation primers
Bridge PCR To Amplify DNA Templates are annealed to the complementary surface-bound cluster generating primers
Bridge PCR To Amplify DNA A single-stranded bridge Templates are annealed to the complementary surface-bound cluster generating primers
Bridge PCR To Amplify DNA A double-stranded bridge DNA polymerase extends the annealed primer
Bridge PCR To Amplify DNA This completes one cycle of bridge PCR (note we now have two surface-bound copies of each template)
Bridge PCR To Amplify DNA Repeated cycles result in many copies of each clonal template per cluster The clusters form a randomly addressed DNA array
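The doubling described above (one cycle turns one surface-bound copy into two) can be sketched as simple arithmetic. This is a minimal sketch, not a model of a real flowcell: the `efficiency` parameter is a hypothetical knob, and the sketch ignores the fact that real clusters stop growing once nearby surface primers are used up.

```python
def copies_after_cycles(cycles: int, efficiency: float = 1.0) -> float:
    """Expected strands per cluster after `cycles` rounds of bridge PCR.

    Assumption: every strand is copied with probability `efficiency` in each
    cycle (1.0 means perfect doubling). Saturation of the cluster is ignored.
    """
    if cycles < 0 or not 0.0 <= efficiency <= 1.0:
        raise ValueError("cycles must be >= 0 and efficiency in [0, 1]")
    return (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    for n in (1, 10, 20):
        print(n, copies_after_cycles(n))
```

With perfect doubling, 10 cycles already give about a thousand copies of the starting molecule, which is why a cluster becomes bright enough to image.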
Bridge PCR Lends Itself to an Optical Readout A single Illumina NextSeq flowcell will contain ~400,000,000 clusters each grown from an individual DNA molecule
Massively Parallel DNA Sequencing [Figure: nine clusters imaged over three cycles, one base read per cluster per cycle, giving the reads AGT GAC ACG CTT GGC TAA TAG AGC CCT] What is Base 1? What is Base 2? What is Base 3? One cycle: 1. Polymerase extension with fluorescent nucleotides 2. Imaging of the slide 3. Cleave & wash
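To make the cycle-by-cycle readout concrete, here is a small sketch that turns per-cycle base calls into per-cluster reads. It assumes base calling from the images has already happened, and that the nine clusters are ordered the same way as the nine reads listed on the slide; the function name `reads_from_cycles` is made up for illustration.

```python
from typing import List


def reads_from_cycles(cycle_calls: List[str]) -> List[str]:
    """Turn per-cycle base calls into per-cluster reads.

    cycle_calls[c][k] is the base called for cluster k in cycle c, so each
    string is the answer to "What is Base c+1?" for every cluster at once.
    """
    if len({len(calls) for calls in cycle_calls}) > 1:
        raise ValueError("every cycle must call the same number of clusters")
    # zip(*...) transposes cycles x clusters into clusters x cycles.
    return ["".join(bases) for bases in zip(*cycle_calls)]


if __name__ == "__main__":
    cycles = ["AGACGTTAC", "GACTGAAGC", "TCGTCAGCT"]  # cycles 1, 2, 3
    print(reads_from_cycles(cycles))
```

Each imaging cycle contributes one column of every read, so read length is set by the number of cycles and the number of reads by the number of clusters.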
10^9-plex → effective reagent volume of femtoliters per sequencing reaction
Two different strategies for parallel amplification BRIDGE PCR EMULSION PCR
Emulsion PCR to amplify DNA As with bridge PCR we start by attaching amplification adapters to DNA template molecules
Emulsion PCR to amplify DNA Next we mix these template molecules with beads containing a single surface-bound amplification primer
Emulsion PCR to amplify DNA An emulsion is formed from template and beads such that each droplet gets one bead and one template (on average)
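"One template per droplet on average" still leaves many droplets empty or doubly loaded. The sketch below estimates how many, under the assumption (not stated on the slide) that templates land in droplets independently, so that the number per droplet follows a Poisson distribution.

```python
import math
from typing import Dict


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k events when the expected number is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def droplet_occupancy(mean_templates: float) -> Dict[str, float]:
    """Fraction of droplets with zero, one, or more than one template.

    Assumption: independent (Poisson) loading of templates into droplets.
    """
    p0 = poisson_pmf(0, mean_templates)
    p1 = poisson_pmf(1, mean_templates)
    return {"empty": p0, "single": p1, "multiple": 1.0 - p0 - p1}


if __name__ == "__main__":
    print(droplet_occupancy(1.0))
```

Under this assumption only droplets in the "single" class yield a clonal bead; "multiple" droplets give mixed beads and "empty" droplets give none.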
Emulsion PCR to amplify DNA An emulsion is a stable mixture of water droplets in oil; each droplet is like a tiny PCR tube.
Emulsion PCR to amplify DNA The droplet contains all the ingredients necessary for PCR including the other primer and the polymerase
Emulsion PCR to amplify DNA The free template is annealed to a surface-bound primer, the primer is extended with the polymerase, and the template strand is melted off the bead. This is one cycle
Emulsion PCR to amplify DNA Multiple cycles result in clonal amplification of the single template DNA strand
Emulsion PCR to amplify DNA Finally, we break the emulsion and isolate beads with amplified product
So, what next? We now have millions of beads, each with clonally amplified template strands. What should we do next?
Bead-Bound DNA Lends Itself to Microwells One cycle: 1. Add first dNTP (e.g. dATP) 2. pH change from reaction detected 3. Wash 4. Repeat for each other dNTP
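The flow cycle above can be sketched as a tiny simulation. The assumptions here are mine: the flow order `TACG` is hypothetical, the signal is taken to be exactly the number of bases incorporated in a flow (so a run of identical bases gives a proportionally larger signal), and `synthesized` is the strand being built rather than the template it is copied from.

```python
from typing import List, Tuple


def flowgram(synthesized: str, flow_order: str = "TACG",
             n_flows: int = 16) -> List[Tuple[str, int]]:
    """Simulate nucleotide flows over one microwell.

    Each flow offers a single nucleotide. The returned signal is the number
    of bases incorporated in that flow (0 if the next base does not match).
    """
    pos = 0
    signals = []
    for i in range(n_flows):
        base = flow_order[i % len(flow_order)]
        run = 0
        # The polymerase keeps extending while the offered base matches.
        while pos < len(synthesized) and synthesized[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
    return signals


def decode(signals: List[Tuple[str, int]]) -> str:
    """Rebuild the synthesized sequence from (base, signal) pairs."""
    return "".join(base * count for base, count in signals)


if __name__ == "__main__":
    flows = flowgram("TTAGC", n_flows=8)
    print(flows)
    print(decode(flows))
```

In this idealized sketch decoding is exact; with a real sensor the signal for long runs of one base is noisy, which is one place read errors can come from.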
Many ways to skin a cat Rothberg et al. (2011): clonal amplification by emulsion PCR, sequencing by pH microsensor. Fedurco et al. (2006): clonal amplification by bridge PCR, sequencing by optical imaging.
2001: $3 billion; 2007: $100 million; 2008: $1.5 million; 2015: $1,000
What can a sequencer do today? In 8 days (Illumina HiSeq 2000): paired-end 100 bp reads; >2 billion read-pairs; >400 gigabases (Gb) of total output; 1 in 1,000 error rate for most base-calls. Initial draft of human genome based on 23 Gb
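The numbers on this slide fit together with a little arithmetic, sketched below. It uses the lower bounds quoted above (2 billion read pairs, 100 bp per read) and treats the 1 in 1,000 error rate as if it applied to every base call, which is a simplification since the slide says "most".

```python
READ_PAIRS = 2_000_000_000          # lower bound quoted for one run
READ_LENGTH_BP = 100                # each end of a pair is 100 bp
ERROR_RATE = 1 / 1000               # simplification: applied to all calls
DRAFT_GENOME_BASES = 23_000_000_000 # 23 Gb behind the initial human draft

# Two reads per pair, READ_LENGTH_BP bases per read.
total_bases = READ_PAIRS * 2 * READ_LENGTH_BP
expected_errors = total_bases * ERROR_RATE
fold_over_draft = total_bases / DRAFT_GENOME_BASES

if __name__ == "__main__":
    print(f"total output: {total_bases / 1e9:.0f} Gb")
    print(f"expected miscalled bases: {expected_errors:.2e}")
    print(f"run output vs. draft genome data: {fold_over_draft:.1f}x")
```

So a single 8-day run produces roughly 17 times the raw sequence that went into the initial draft, while still containing on the order of hundreds of millions of miscalled bases.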
What's the catch? 1. Very short, error-laden sequence reads 2. Sequence biases (e.g. G+C content) 3. Days to weeks per instrument run 4. Substantial bioinformatics burden 5. It's cheap, but it still isn't free
Outline HTS technology and a bit of history HTS applications: a data onslaught The future
The ENCODE Project As An Example The goal of the Encyclopedia Of DNA Elements (ENCODE) was to identify all functional elements of the human genome. This was a massive and audacious undertaking that generated a ridiculous quantity of data of multiple types. The results are somewhat controversial, but a significant fraction of the human genome has some function.
Where Did All This Data Come From? Every position in the genome was annotated with a huge amount of data on histone modifications, protein binding, accessibility, etc
HTS Enables High-Throughput Biology Biological phenomenon of interest We start with something we want to measure for each position in a genome
HTS Enables High-Throughput Biology Biological phenomenon of interest Sequencing-based assay Then, we develop a way to measure that phenomenon with sequencing
HTS Enables High-Throughput Biology Biological phenomenon of interest Sequencing-based assay Massively parallel way to measure the phenomenon A current-generation HT sequencer can collect ~400 million reads, each of which can be an individual measurement
HTS Enables High-Throughput Biology Biological phenomenon of interest Sequencing-based assay Massively parallel way to measure the phenomenon What are some things we might want to measure about positions in a genome?
DNA = machine-readable format for capturing biological information 1. Genetic variation in individuals 2. de novo genome assembly 3. Digital expression profiling (RNA-Seq) 4. Protein-DNA interactions (ChIP-Seq) 5. DNA accessibility (DNAse-Seq) 6. Methylation profiling (Methyl-Seq)
DNAse hypersensitivity: DNA Accessibility Why? DNA in the nucleus is not naked. Chromatin is a complex assembly of genomic DNA and proteins
DNAse hypersensitivity: DNA Accessibility We want to know where individual proteins bind to DNA
DNAse hypersensitivity: DNA Accessibility DNA is first digested with DNAse, an enzyme that cleaves DNA when it is not blocked by other proteins
DNAse hypersensitivity: DNA Accessibility DNA is first digested with DNAse, an enzyme that cleaves DNA when it is not blocked by other proteins. Before HTS, individual loci were probed using Southern blots (i.e. gels followed by hybridization of radioactively labeled probe DNA)
DNAse-Seq: DNA Accessibility A first linker is ligated to mark the DNAse sensitive sites
DNAse-Seq: DNA Accessibility The DNA is digested with another nuclease to generate small fragments
DNAse-Seq: DNA Accessibility A second linker is added and the fragments are amplified. Note, we now have a bunch of sequenceable fragments, one end of each corresponding to a DNAse-sensitive site in the genome!
DNAse-Seq: DNA Accessibility Once we sequence our fragments, what informatics step do we need to take next?
DNAse-Seq: DNA Accessibility Alignment of the reads to the genome to find where the hypersensitive sites are!
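As a toy illustration of the alignment step, the sketch below finds every exact occurrence of each read in a reference string. This is a deliberately naive stand-in: real read mappers use an index of the genome and tolerate mismatches, and the function names here are made up for illustration.

```python
from typing import Dict, Iterable, List


def find_exact_matches(reference: str, read: str) -> List[int]:
    """Return every 0-based position where `read` occurs in `reference`."""
    if not read:
        raise ValueError("read must not be empty")
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Resume one base later so overlapping matches are also found.
        start = reference.find(read, start + 1)
    return positions


def map_reads(reference: str, reads: Iterable[str]) -> Dict[str, List[int]]:
    """Map each read to the list of positions where it matches exactly."""
    return {read: find_exact_matches(reference, read) for read in reads}


if __name__ == "__main__":
    genome = "ACGTACGTTAGC"
    print(map_reads(genome, ["ACGT", "TTAG", "GGGG"]))
```

A read that matches in more than one place (like `ACGT` here) is the toy version of the mapping ambiguity that short reads cause in repetitive genomes.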
DNAse-Seq: DNA Accessibility After read mapping we know the number of DNAse fragments at each position in the genome
DNAse-Seq: DNA Accessibility HTS reveals the number of DNAse fragments at each position in the genome. This is a digital readout that tells us how sensitive to DNAse each site in the genome is (i.e. it's quantitative)
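The "digital readout" is just a count per position. A minimal sketch, assuming each mapped read has already been reduced to the single 0-based genome position of its DNAse cut site (the input format is my assumption):

```python
from collections import Counter
from typing import Iterable, List


def cut_counts(cut_sites: Iterable[int], genome_length: int) -> List[int]:
    """Number of DNAse fragments whose cut site maps to each position.

    Positions outside [0, genome_length) are silently ignored.
    """
    counts = Counter(cut_sites)
    return [counts[pos] for pos in range(genome_length)]


if __name__ == "__main__":
    mapped_cut_sites = [2, 2, 5, 2, 7]
    print(cut_counts(mapped_cut_sites, genome_length=8))
```

Because the output is an integer count at every position, sites can be compared quantitatively rather than scored as present or absent on a gel.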
HTS-based Readouts Are Genomewide We can look at patterns of DNAse sensitivity across the whole genome: every position has an attached DNAse read-count data point
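One simple way to look for patterns in per-position counts is to sum them in a sliding window and flag windows above a cutoff. The sketch below does that; the window width and threshold are hypothetical parameters, and this is not the peak-calling method used by any particular project.

```python
from typing import List, Tuple


def window_sums(counts: List[int], width: int) -> List[int]:
    """Sum of counts in each window [i, i + width) along the genome."""
    if width <= 0 or width > len(counts):
        raise ValueError("width must be between 1 and len(counts)")
    total = sum(counts[:width])
    sums = [total]
    for i in range(width, len(counts)):
        # Slide right by one: add the entering base, drop the leaving one.
        total += counts[i] - counts[i - width]
        sums.append(total)
    return sums


def hotspots(counts: List[int], width: int,
             threshold: int) -> List[Tuple[int, int]]:
    """(window start, window sum) for every window at or above threshold."""
    return [(start, s)
            for start, s in enumerate(window_sums(counts, width))
            if s >= threshold]


if __name__ == "__main__":
    per_position = [0, 0, 3, 4, 5, 0, 0, 1]
    print(hotspots(per_position, width=3, threshold=9))
```

Windows that pass the cutoff are candidate hypersensitive regions; choosing the cutoff sensibly is the hard part and is not addressed here.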
HTS-based Readouts Are Genomewide Based on the sequence of protected DNA we can predict what transcription factors bind in each location
HTS-based Readouts Are Exquisitely Sensitive We can actually see the effect of DNA contacts made by different transcription factors!
ChIP-Seq: Protein Binding How would you find the location of every binding site for a particular transcription factor?
ChIP-Seq: Protein Binding Crosslink DNA and proteins using formaldehyde (why?) then shear
ChIP-Seq: Protein Binding Use an antibody against the protein of interest to immunoprecipitate the protein-DNA complexes
ChIP-Seq: Protein Binding Reverse chemical crosslinks and sequence DNA fragments
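Once the immunoprecipitated fragments are sequenced and mapped, binding sites show up where many fragments overlap. The sketch below piles up fragments and reports the most-covered position. It assumes fragments are given as half-open `(start, end)` intervals on a single sequence, which is my simplification of real mapped data.

```python
from typing import List, Tuple


def coverage(fragments: List[Tuple[int, int]],
             genome_length: int) -> List[int]:
    """Number of fragments covering each position.

    Fragments are half-open intervals [start, end). A difference array
    records +1 at each start and -1 at each end; a running sum then gives
    the depth at every position.
    """
    diff = [0] * (genome_length + 1)
    for start, end in fragments:
        if not 0 <= start < end <= genome_length:
            raise ValueError(f"bad fragment: {(start, end)}")
        diff[start] += 1
        diff[end] -= 1
    depth, running = [], 0
    for delta in diff[:genome_length]:
        running += delta
        depth.append(running)
    return depth


def summit(depth: List[int]) -> int:
    """Leftmost position with the highest coverage."""
    return max(range(len(depth)), key=depth.__getitem__)


if __name__ == "__main__":
    frags = [(0, 5), (3, 8), (4, 10)]
    depth = coverage(frags, genome_length=10)
    print(depth, summit(depth))
```

The summit is only a rough estimate of where the protein was bound; it says nothing about whether the enrichment is statistically meaningful.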
CCC (Chromosome Conformation Capture) Randomly assorted Nonrandom How do you guys think chromosomes are arranged within nuclei?
CCC (Chromosome Conformation Capture) Randomly assorted Nonrandom In fact, this is an active area of investigation and the degree to which chromosome conformation is ordered is unclear
CCC (Chromosome Conformation Capture) Randomly assorted Nonrandom How would you design an assay that could reveal chromosome conformation (hint: remember that in chromosomes DNA is coated with proteins)?
CCC (Chromosome Conformation Capture) The key idea is that we crosslink the proteins so that adjacent DNA is physically linked
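If crosslinked neighbors end up sequenced together, each read pair reports two genome positions that were physically close. A minimal sketch of how such pairs could be tallied into a contact matrix follows. The input format (pairs of positions on one linear coordinate), the bin size, and the function name are all assumptions for illustration.

```python
from typing import List, Tuple


def contact_matrix(pairs: List[Tuple[int, int]], genome_length: int,
                   bin_size: int) -> List[List[int]]:
    """Count contacts between genome bins.

    Each pair holds the two positions joined in one sequenced fragment.
    The matrix is symmetric: entry [i][j] is the number of pairs linking
    bin i and bin j.
    """
    n_bins = -(-genome_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1
    return matrix


if __name__ == "__main__":
    linked = [(5, 25), (7, 28), (12, 14)]
    for row in contact_matrix(linked, genome_length=30, bin_size=10):
        print(row)
```

High counts far from the diagonal are the interesting signal: they mark regions that are distant along the sequence but close together in the nucleus.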
CCC (Chromosome Conformation Capture) This is the conformation of chromosomes in the yeast nucleus as determined by sequencing!
Outline HTS technology and a bit of history HTS applications: a data onslaught The future
Single Molecule Sequencing Many of the problems of HTS stem from: 1) Massively parallel DNA amplification (errors, etc.) 2) Very short reads (mapping problems) What if we could sequence long, unamplified individual DNA molecules?
Single Molecule Sequencing: Fluorescent A zero-mode waveguide enables detection of fluorescence only very near the bottom of the nanowell
Single Molecule Sequencing: Fluorescent Each nanowell contains one DNA polymerase and one template DNA molecule
Single Molecule Sequencing: Fluorescent Sequencing is detected in real time by looking for pulses of fluorescence associated with base addition
Single Molecule Sequencing: Fluorescent Fluorophores are attached to the triphosphate group so they diffuse away from the detection volume rapidly!
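Reading bases from pulses of fluorescence can be sketched as thresholding a time series. Everything about the input here is assumed for illustration: one dictionary per camera frame giving an intensity per base-specific color channel, a fixed threshold, and at least one below-threshold frame between successive incorporations of the same base.

```python
from typing import Dict, List, Optional


def call_pulses(frames: List[Dict[str, float]], threshold: float) -> str:
    """Call one base per fluorescence pulse.

    A pulse is a run of consecutive frames in which the same channel is the
    brightest and is at or above `threshold`. Frames below threshold are
    treated as baseline (the fluorophore has diffused away).
    """
    calls: List[str] = []
    previous: Optional[str] = None
    for frame in frames:
        base, intensity = max(frame.items(), key=lambda item: item[1])
        current = base if intensity >= threshold else None
        # Start of a new pulse: a bright channel that differs from the
        # previous frame's state (baseline or another base).
        if current is not None and current != previous:
            calls.append(current)
        previous = current
    return "".join(calls)


if __name__ == "__main__":
    dark = {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.1}
    trace = [dark, {**dark, "A": 0.9}, {**dark, "A": 0.8}, dark,
             {**dark, "A": 0.9}, dark, {**dark, "G": 0.7}]
    print(call_pulses(trace, threshold=0.5))
```

The rapid diffusion of the cleaved fluorophore is what makes the signal return to baseline between pulses, which this sketch relies on to separate repeated bases.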
Single Molecule Sequencing: Nanopores Protein nanopores enable water and ions to cross membranes
Single Molecule Sequencing: Nanopores The result is a current flow if there is a difference in potential across the membrane
Single Molecule Sequencing: Nanopores If a large molecule passes through the aperture of the pore, current flow is reduced
Single Molecule Sequencing: Nanopores Objects of different sizes cause different (and characteristic) current flows
Single Molecule Sequencing: Nanopores Sensing the different current flow induced by each different base is the key concept of nanopore sequencing
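The key concept, that each base produces a characteristic current, can be sketched as nearest-level classification. The current levels below are invented numbers in arbitrary units, and the sketch assumes the raw signal has already been segmented into one measurement per base, which is a strong simplification of real nanopore data.

```python
from typing import Dict, List

# Hypothetical characteristic current levels (arbitrary units), one per base.
LEVELS: Dict[str, float] = {"A": 50.0, "C": 40.0, "G": 30.0, "T": 20.0}


def call_bases(currents: List[float],
               levels: Dict[str, float] = LEVELS) -> str:
    """Assign each measured current to the base with the closest level."""
    return "".join(
        min(levels, key=lambda base: abs(levels[base] - current))
        for current in currents
    )


if __name__ == "__main__":
    measured = [49.2, 21.0, 31.5, 38.9]
    print(call_bases(measured))
```

The closer together the characteristic levels are relative to the measurement noise, the more often this kind of assignment will pick the wrong base.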