Genome 373: High- Throughput DNA Sequencing. Doug Fowler

Size: px
Start display at page:

Download "Genome 373: High- Throughput DNA Sequencing. Doug Fowler"

Transcription

1 Genome 373: High- Throughput DNA Sequencing Doug Fowler

2 Tasks give ML unity We learned about three tasks that are commonly encountered in ML

3 Models/Algorithms Give ML Diversity Classification Regression Clustering We covered a handful of the many models that can be employed to solve these tasks

4 Models/Algorithms Give ML Diversity Classification Regression Clustering Decision trees Random forests K-nearest neighbor Neural networks Naïve Bayes classifier SVM Perceptron SVR Linear regression Kernel regression Generalized linear models K-means Expectation-max Hierarchical Mean-shift Density-based Graph-based Biclustering We learned about three tasks that are commonly encountered in ML

5 ML In Genomics Has Been Driven By an Onslaught of Data

6 Outline HTS technology and a bit of history HTS applications: a data onslaught The future

7

8 DNA sequencing throughput 's 1,000's 1,000,000's nucleotides sequenced per day per instrument 1,000,000,000's

9 How? New Sequencing Paradigms How does Sanger sequencing work? Sanger sequencing

10 How? New Sequencing Paradigms Each DNA molecule of interest is amplified and sequenced separately Sanger sequencing

11 How? New Sequencing Paradigms Massively parallel sequencing Sanger sequencing Millions of DNA molecules are amplified (maybe) and sequenced in parallel

12 Two different strategies for parallel amplification BRIDGE PCR EMULSION PCR

13 How Does PCR Work?

14 Bridge PCR To Amplify DNA Cluster generating primer #2 DNA to be sequenced Cluster generating primer #1 The DNA to be sequenced is prepared by attaching cluster generating primers to each end

15 Bridge PCR To Amplify DNA Single stranded DNA molecules are attached a glass flowcell seeded with surface-bound cluster generation primers

16 Bridge PCR To Amplify DNA Templates are annealed to the complimentary surfacebound cluster generating primers

17 Bridge PCR To Amplify DNA A single-stranded bridge Templates are annealed to the complimentary surfacebound cluster generating primers

18 Bridge PCR To Amplify DNA A double-stranded bridge DNA polymerase extends the annealed primer

19 Bridge PCR To Amplify DNA This completes one cycle of bridge PCR (note we now have two surface-bound copies of each template)

20 Bridge PCR To Amplify DNA Repeated cycles result in many copies of each clonal template per cluster The clusters form a randomly addressed DNA array

21 Bridge PCR Lends Itself to an Optical Readout A single Illumina NextSeq flowcell will contain ~400,000,000 clusters each grown from an individual DNA molecule

22 Massively Parallel DNA Sequencing A T Cycle 1 C A G G T A C Cycle 2 G A T A G G A C C A Cycle 3 T T C C C G G T AGT GAC ACG CTT GGC TAA TAG AGC CCT What is Base 1? What is Base 2? What is Base 3? One cycle: 1. Polymerase extension with fluorescent nucleotides 2. Imaging of the slide 3. Cleave & wash

23 109-plex à effective reagent volume of femotoliters per sequencing reaction

24 Two different strategies for parallel amplification BRIDGE PCR EMULSION PCR

25 Emulsion PCR to amplify DNA As with bridge PCR we start by attaching amplification adapters to DNA template molecules

26 Emulsion PCR to amplify DNA Next we mix these template molecules with beads containing a single surface-bound amplification primer

27 Emulsion PCR to amplify DNA An emulsion is formed from template and beads such that each droplet gets one bead and one template (on average)

28 Emulsion PCR to amplify DNA An emulsion is a stable mixture of water droplets in oil each droplet is like a tiny PCR tube.

29 Emulsion PCR to amplify DNA The droplet contains all the ingredients necessary for PCR including the other primer and the polymerase

30 Emulsion PCR to amplify DNA The free template is annealed to a surface bound primer, extended with the polymerase and the new strand is melted off the bead. This is one cycle

31 Emulsion PCR to amplify DNA Multiple cycles result in clonal amplification of the single template DNA strand

32 Emulsion PCR to amplify DNA Finally, we break the emulsion and isolate beads with amplified product

33 So, what next? We now have millions of beads each with clonally amplified template strands what should we do next?

34 Bead-Bound DNA Lends Itself to Microwells One cycle: 1. Add first NTP (e.g. ATP) 2. ph change from reaction detected 3. Wash 4. Repeat for each other NTP

35 Many ways to skin a cat Rothberg et al. (2011) Clonal Amp emulsion PCR Sequencing ph microsensor Fedurco et al. (2006) bridge PCR Optical imaging

36 2001:$3 billion

37 2001:$3 billion 2007:$100 million

38 2008:$1.5 million 2007:$100 million 2001:$3 billion

39 2015:$1, :$1.5 million 2007:$100 million 2001:$3 billion

40 What can a sequencer do today? In 8 days (Illumina HiSeq 2000) Paired end 100 bp reads >2 billion read- pairs >400 gigabases (Gb) of total output 1 in 1,000 error rate for most base- calls IniEal drag of human genome based on 23 Gb

41 What s the catch? 1. Very short, error-laden sequence reads 2. Sequence biases (e.g. G+C content) 3. Days to weeks per instrument run 4. Substantial bioinformatics burden 5. It s cheap, but it still isn t free

42 Outline HTS technology and a bit of history HTS applications: a data onslaught The future

43 The ENCODE Project As An Example The goal of the Encyclopedia Of DNA Elements (ENCODE) was to identify all functional elements of the human genome This was a massive and audacious undertaking that generated a ridiculous quantity of data of multiple types The results are somewhat controversial, but a significant fraction of the human genome has some function

44 Where Did All This Data Come From? Every position in the genome was annotated with a huge amount of data on histone modifications, protein binding, accessibility, etc

45 HTS Enables High- Throughput Biology Biological phenomenon of interest We start with something we want to measure for each position in a genome

46 HTS Enables High- Throughput Biology Biological phenomenon of interest Sequencing-based assay Then, we develop a way to measure that phenomenon with sequencing

47 HTS Enables High- Throughput Biology Biological phenomenon of interest Sequencing-based assay Massively parallel way to measure the phenomenon A current-generation HT sequencer can collect ~400m reads each of which can be an individual measurement

48 HTS Enables High- Throughput Biology Biological phenomenon of interest Sequencing-based assay Massively parallel way to measure the phenomenon What are some things we might want to measure about positions in a genome?

49 DNA = machine-readable format for capturing biological information 1. Genetic variation in individuals 2. de novo genome assembly 3. Digital expression profiling (RNA-Seq) 4. Protein-DNA interactions (ChIP-Seq) 5. DNA accessibility (DNAse-Seq) 6. Methylation profiling (Methyl-Seq)

50 DNAse hypersensitivity: DNA Accessibility Why? DNA in the nucleus is not naked. Chromatin is a complex assembly of genomic DNA and proteins

51 DNAse hypersensitivity: DNA Accessibility We want to know where individual proteins bind to DNA

52 DNAse hypersensitivity: DNA Accessibility DNA is first digested with DNAse an enzyme that cleaves DNA when it is not blocked by other proteins

53 DNAse hypersensitivity: DNA Accessibility DNA is first digested with DNAse an enzyme that cleaves DNA when it is not blocked by other proteins Before HTS, individual loci were probed using Southern blots (e.g. gels followed by hybridization of radioactively labeled probe DNA)

54 DNAse-Seq: DNA Accessibility A first linker is ligated to mark the DNAse sensitive sites

55 DNAse-Seq: DNA Accessibility The DNA is digested with another nuclease to generate small fragments

56 DNAse-Seq: DNA Accessibility A second linker is added and the fragments are amplified Note, we now have a bunch of sequenceable fragments, one end of each corresponding to a DNAsesensitive site in the genome!

57 DNAse-Seq: DNA Accessibility Once we sequence our fragments, what informatics step do we need to take next?

58 DNAse-Seq: DNA Accessibility Alignment of the reads to the genome to find where the hypersensitive sites are!

59 DNAse-Seq: DNA Accessibility After read mapping we know the number of DNAse fragments at each position in the genome

60 DNAse-Seq: DNA Accessibility HTS reveals the number of DNAse fragments at each position in the genome This is a digital readout that tells us how sensitive to DNAse each site in the genome is (e.g. it s quantitative)

61 HTS-based Readouts Are Genomewide We can look at patterns of DNAse sensitivity across the whole genome every position has an attached DNAse reads data point

62 HTS-based Readouts Are Genomewide We can look at patterns of DNAse sensitivity across the whole genome every position has an attached DNAse reads data point

63 HTS-based Readouts Are Genomewide We can look at patterns of DNAse sensitivity across the whole genome every position has an attached DNAse reads data point

64 HTS-based Readouts Are Genomewide Based on the sequence of protected DNA we can predict what transcription factors bind in each location

65 HTS-based Readouts Are Exquisitely Sensitive We can actually see the effect of DNA contacts made by different transcription factors!

66 ChIP-Seq: Protein Binding How would you find the location of every binding site for a particular transcription factor?

67 ChIP-Seq: Protein Binding Crosslink DNA and proteins using formaldehyde (why?) then shear

68 ChIP-Seq: Protein Binding Use an antibody against the protein of interest to immunoprecipitate the protein-dna complexes

69 ChIP-Seq: Protein Binding Reverse chemical crosslinks and sequence DNA fragments

70 ChIP-Seq: Protein Binding Reverse chemical crosslinks and sequence DNA fragments

71 CCC (Chromosome Conformation Capture) Randomly assorted Nonrandom How do you guys think chromosomes are arranged within nuclei?

72 CCC (Chromosome Conformation Capture) Randomly assorted Nonrandom In fact, this is an active area of investigation and the degree to which chromosome conformation is ordered is unclear

73 CCC (Chromosome Conformation Capture) Randomly assorted Nonrandom How would you design an assay that could reveal chromosome conformation (hint: remember that in chromosomes DNA is coated with proteins)?

74 CCC (Chromosome Conformation Capture) The key idea is that we crosslink the proteins so that adjacent DNA is physically linked

75 CCC (Chromosome Conformation Capture)

76 CCC (Chromosome Conformation Capture) This is the conformation of chromosomes in the yeast nucleus as determined by sequencing!

77 Outline HTS technology and a bit of history HTS applications: a data onslaught The future

78 Single Molecule Sequencing Many of the problems of HTS inhere to: 1) Massively parallel DNA amplification (errors, etc) 2) Very short reads (mapping problems) What if we could sequence long, unamplified individual DNA molecules?

79 Single Molecule Sequencing: Fluorescent Uses zero-mode waveguide enables detection of fluorescence only very near the bottom of the nanowell

80 Single Molecule Sequencing: Fluorescent Each nanowell contains one DNA polymerase and one template DNA molecule

81 Single Molecule Sequencing: Fluorescent Sequencing is detected in real time by looking for pulses of fluorescence associated with base addition

82 Single Molecule Sequencing: Fluorescent Fluorophores are attached to the triphosphate group so they diffuse away from the detection volume rapidly!

83 Single Molecule Sequencing: Nanopores Protein nanopores enable water and ions to cross membranes

84 Single Molecule Sequencing: Nanopores The result is a current flow if there is a difference in potential across the membrane

85 Single Molecule Sequencing: Nanopores If a large molecule passes through the aperture of the pore, current flow is reduced

86 Single Molecule Sequencing: Nanopores Objects of different sizes cause different (and characteristic) current flows

87 Single Molecule Sequencing: Nanopores Sensing the different current flow induced by each different base is the key concept of nanopore sequencing