Sequencing Theory Brett E. Pickett, Ph.D. J. Craig Venter Institute Applications of Genomics and Bioinformatics to Infectious Diseases GABRIEL Network
Agenda Sequencing Instruments Sanger Illumina Ion Torrent Oxford Nanopore PacBio
Virus (or Pathogen) Sequencing
Application of NGS to the study of virus evolution and molecular epidemiology:
Track the evolution of viruses over time
Better understand the selective pressures that drive virus evolution
Identify the origins (reservoir) of outbreak strains
Investigate transmission dynamics
Identify molecular determinants of host range
Identify evolutionarily conserved regions for targeted vaccines
Some Trivia
What year was the first whole genome sequence reported? a) 1969 b) 1977 c) 1981 d) 1985
For which organism? Bacteriophage ΦX174 (5,375 bp)
What method was used? Dideoxy chain termination with 32P (aka Sanger sequencing)
What year was the first whole genome sequence for a free-living organism reported? a) 1979 b) 1984 c) 1989 d) 1995
For which organism? Haemophilus influenzae (1.8 × 10^6 bp)
What method was used? Sanger sequencing with fluorescence
JCVI Joint Technology Core ABI 3730xl Capacity: 240,000 sequences/day or 80 million lanes/year at 24 runs per day
New JCVI Joint Technology Core Illumina NextSeq/MiSeq 800 million reads/run Oxford Nanopore MinION
Sanger vs NGS
Change in Cost 1st Generation Next Generation Next Generation w/broad adoption Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Available at: http://www.genome.gov/sequencingcosts/. Accessed 23AUG2017.
Sanger
Sanger 1 => Sanger 2
A chromatogram
Illumina
Illumina Instruments https://www.illumina.com/systems/sequencing-platforms/comparison-tool.html
Illumina Sequencing (Optics-Based)
Ion Torrent
Ion Torrent Sequencing (H+ Based) High-throughput
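A minimal sketch of why H+ based detection struggles with homopolymers, under stated assumptions: nucleotides are flowed one at a time, and the measured signal for each flow is roughly proportional to the number of bases incorporated. The function name `decode_flows`, the flow order "TACG", and the signal values are hypothetical illustrations, not the instrument's actual signal processing.

```python
def decode_flows(signals, flow_order="TACG"):
    """Turn per-flow signal intensities into a base sequence.

    Assumption: each signal is approximately the number of bases
    incorporated during that flow, so it is rounded to the nearest
    integer and the flowed base is repeated that many times.
    """
    bases = []
    for i, signal in enumerate(signals):
        base = flow_order[i % len(flow_order)]  # nucleotide flowed at step i
        count = max(0, round(signal))           # estimated homopolymer length
        bases.append(base * count)
    return "".join(bases)


# Hypothetical signals for two cycles of T, A, C, G flows.
clean = decode_flows([1.02, 0.05, 2.10, 0.00, 0.10, 0.97, 0.03, 2.90])

# A true run of six A's whose signal came out slightly low is called as five.
noisy_homopolymer = decode_flows([5.45], flow_order="A")
```

Short runs are easy to call because 1.02 and 2.10 round unambiguously. For long runs the same absolute noise moves the signal across a rounding boundary, which shows up as an insertion or deletion in the read.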
PacBio
Single molecule detection Sequencing by synthesis Single base incorporation Sequences same molecule multiple times Random error detection Easy to generate consensus
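The points above (same molecule sequenced multiple times, random errors, easy consensus) can be sketched as a per-position majority vote. This is a toy illustration that assumes the passes are already aligned, have equal length, and contain only substitution errors. Real consensus calling must also handle insertions and deletions. The function name `majority_consensus` and the example reads are hypothetical.

```python
from collections import Counter


def majority_consensus(passes):
    """Return the most common base at each position across aligned passes.

    Assumption: every pass covers the same molecule, is pre-aligned,
    and has the same length (no insertions or deletions).
    """
    if not passes:
        raise ValueError("at least one pass is required")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("all passes must have the same length")

    consensus = []
    for column in zip(*passes):
        # most_common(1) returns [(base, count)]; ties go to the base seen first
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Five hypothetical passes over the same molecule; errors fall at different positions.
passes = ["ACGTACGT", "ACGAACGT", "ACGTACCT", "TCGTACGT", "ACGTACGT"]
consensus = majority_consensus(passes)
```

Because the errors are random rather than systematic, each wrong base is outvoted by the correct base from the other passes, so the consensus is more accurate than any single pass.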
Oxford Nanopore
Oxford Nanopore sequencing DNA pushed through a nanopore in a lipid membrane Speed control provided by a Phi29 DNA polymerase Measure changes in the ionic current of an applied electric field Combined with other platforms to improve assembly quality
[Figure: Oxford Nanopore vs. PacBio reads. Panels plot read length (0 to 2e+05) against read GC content (0.00 to 1.00) and against read quality score (5 to 20), with read count histograms. Slide annotation, partly garbled in extraction: "produces high quality reads >50 kb; longest >800 kb".]
Long Read Technology Comparison
Advantages: full-length transcriptomes, including splice variants; resolution of long repeat regions in genomes; genomic structural variants; haplotype phasing
Disadvantages: high error rates; higher cost; lower throughput
Instrument Comparison
HiSeq PE run (2x75). Advantages: high throughput, lowest per-base cost. Disadvantages: short reads, long run time.
MiSeq PE run (2x300). Advantages: high throughput, low per-base cost, fast turnaround? Disadvantages: (not listed).
Ion Torrent (200 bp, 318 chip). Advantages: fast turnaround. Disadvantages: data quality, homopolymers.
Oxford Nanopore. Advantages: fast turnaround, various use cases, long reads, low-cost instrument. Disadvantages: base-calling.
PacBio. Advantages: various use cases, long reads. Disadvantages: intense library prep, instrument cost.
Sanger run (96 wells). Advantages: high-quality data, long read length. Disadvantages: high cost, low throughput.
FastQ Format
@HWUSI-EAS582_157:6:1:1:1501/1
NCACAGACACACACGAACACACAAAGACATGCCCATATGAAGAT
+
%.7786867:778556858746575058873/347777476035
@HWUSI-EAS582_157:6:1:1:1606/1
NCTGGCACCTTGATTTTGGACTTCCCAGCCTCCAGAACTGTGAG
+
%1948988888798988366898888648998788898888588
@HWUSI-EAS582_157:6:1:1:453/1
NCTGCTTGCACCCCTGAAGTCACTGATCACATTTCAGGGTCACC
+
%/868998988888867668888986644788988413488885
@HWUSI-EAS582_157:6:1:1:1844/1
NGATTGACATTGGCAAAGAGGACAACTGATTGCAAACTTCACAC
+
%-7;:::::;86499;75574586::635:62687666887879
@HWUSI-EAS582_157:6:1:1:1707/1
NAGGCTCAGGCGCACGGCCTACATCGTCGCTGTCGGCCAAGGGG
+
Header: line beginning with @. Read (sequence): second line of each record. Quality scores (phred-33): fourth line, after the + separator.
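A minimal sketch of reading this format, assuming the simple four-line record layout shown on the slide (header, sequence, "+", quality) and phred-33 encoding, where each quality character's ASCII code minus 33 is the score. The function name `parse_fastq` is hypothetical. The example uses the first record from the slide.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality_scores) from four-line FASTQ records.

    Assumption: each record is exactly four lines and qualities are
    phred-33 encoded (score = ASCII code - 33).
    """
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        seq = next(it, None)
        plus = next(it, None)
        qual = next(it, None)
        if seq is None or plus is None or qual is None:
            raise ValueError(f"truncated record: {header}")
        seq, plus, qual = seq.strip(), plus.strip(), qual.strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record: {header}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch: {header}")
        yield header[1:], seq, [ord(c) - 33 for c in qual]


record = """@HWUSI-EAS582_157:6:1:1:1501/1
NCACAGACACACACGAACACACAAAGACATGCCCATATGAAGAT
+
%.7786867:778556858746575058873/347777476035
"""

records = list(parse_fastq(record.splitlines()))
name, seq, quals = records[0]
```

The first base is an uncalled N, and its quality character % decodes to a score of 4, the lowest in the read.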
Assessing Quality: Phred scores Phred quality scores were originally produced by the Phred base-calling program using a statistical analysis of Sanger chromatogram trace files in support of the Human Genome Project. They were subsequently adapted to NGS technologies for judging the quality of sequences. $Q = -10 \log_{10} P_e$, where $P_e$ is the error probability of a given base call.
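A short worked example of the Phred relationship, converting in both directions between a quality score Q and an error probability. The function names are hypothetical and the values are plain arithmetic from the formula.

```python
import math


def phred_to_error_prob(q):
    """Error probability for a Phred score: P_e = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P_e)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


# Every 10 points of Q is a tenfold drop in error probability.
for q in (10, 20, 30, 40):
    print(f"Q{q}: P_e = {phred_to_error_prob(q):.4f}")
```

Read this way, Q20 means about 1 wrong call in 100 bases and Q30 about 1 in 1,000.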
Acknowledgements JCVI Vinita Puri William Nierman, Ph.D. Karen Nelson, Ph.D. Alan Durbin Torrey Williams Kari A. Dilley, Ph.D. Lauren Oldfield, Ph.D. Susmita Shrivastava Nadia Fedorova Mark Novotny U19AI110819 Paolo Amedeo, Ph.D. Reed S. Shabman, Ph.D. Gene Tan, Ph.D.
Questions? bpickett@jcvi.org