Sequencing Theory. Brett E. Pickett, Ph.D. J. Craig Venter Institute

Applications of Genomics and Bioinformatics to Infectious Diseases, GABRIEL Network

Agenda. Sequencing instruments: Sanger; Illumina; Ion Torrent; Oxford Nanopore; PacBio

Virus (or Pathogen) Sequencing. Application of NGS to the study of virus evolution and molecular epidemiology: track the evolution of viruses over time; better understand the selective pressures that drive virus evolution; identify the origins (reservoir) of outbreak strains; investigate transmission dynamics; identify molecular determinants of host range; identify evolutionarily conserved regions for targeted vaccines.

Some Trivia. What year was the first whole genome sequence reported? a) 1969 b) 1977 c) 1981 d) 1985. For which organism? Bacteriophage ΦX174 (5,375 bp). What method was used? Dideoxy chain termination with 32P (aka Sanger sequencing). What year was the first whole genome sequence for a free-living organism reported? a) 1979 b) 1984 c) 1989 d) 1995. For which organism? Haemophilus influenzae (1.8 x 10^6 bp). What method was used?

JCVI Joint Technology Core ABI 3730xl Capacity: 240,000 sequences/day or 80 million lanes/year at 24 runs per day

New JCVI Joint Technology Core: Illumina NextSeq/MiSeq, 800 million reads/run; Oxford Nanopore MinION

Sanger vs NGS

Change in Cost: 1st Generation; Next Generation; Next Generation w/ broad adoption. Source: Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). Available at: http://www.genome.gov/sequencingcosts/. Accessed 23 Aug 2017.

Sanger

Sanger 1 => Sanger 2

A chromatogram

Illumina

Illumina Instruments https://www.illumina.com/systems/sequencing-platforms/comparison-tool.html

Illumina Sequencing (Optics-Based)

Ion Torrent

Ion Torrent Sequencing (H+ Based): high-throughput

PacBio

Single-molecule detection; sequencing by synthesis; single base incorporation; sequences the same molecule multiple times; random error detection; easy to generate consensus.
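Why repeated passes over the same molecule make a consensus easy to generate can be illustrated with a per-position majority vote. This is only a minimal sketch: the function name `majority_consensus` is hypothetical, the passes are assumed to be already aligned and of equal length, and production consensus callers use alignment and probabilistic models rather than a simple vote.

```python
from collections import Counter
from typing import List


def majority_consensus(passes: List[str]) -> str:
    """Return the per-position majority base across aligned passes.

    Assumption (not from the slides): every pass is the same length and
    already aligned, so column i of each pass covers the same base.
    """
    if not passes:
        raise ValueError("need at least one pass")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("all passes must have the same length")

    consensus = []
    for column in zip(*passes):
        # most_common(1) picks the most frequent base in this column;
        # ties go to the base seen first, so an odd pass count is safer.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Three passes over the same molecule, each with at most one random error.
passes = ["ACGTACGT", "ACGAACGT", "ACGTACCT"]
print(majority_consensus(passes))
```

Because the errors are random, they rarely fall at the same position in different passes, so the vote recovers the underlying sequence.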

Oxford Nanopore

Oxford Nanopore sequencing: DNA is pushed through a nanopore in a lipid membrane; speed control is provided by a Phi29 DNA polymerase; changes in the ionic current through the pore are measured under an applied electric field; combined w/ other platforms to improve quality of assembly.

Figure: panels of read length versus read GC content and read length versus read quality score, with read count distributions, comparing Oxford Nanopore and PacBio reads.
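Read length and read GC content, two of the quantities on the axes of this figure, can be computed per read with a few lines of code. The sketch below is illustrative only: the helper names `gc_content` and `read_stats` are hypothetical, and the example read is made up rather than taken from the plotted data.

```python
from typing import Tuple


def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C (0.0 for an empty read)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return gc / len(seq)


def read_stats(seq: str) -> Tuple[int, float]:
    """Return (read length, GC fraction) for one read."""
    return len(seq), gc_content(seq)


# Made-up example read: 6 bases, 4 of them G or C.
length, gc = read_stats("GGCCAT")
print(length, round(gc, 2))
```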

Long Read Technology Comparison. Advantages: full-length transcriptomes, including splice variants; resolution of long repeat regions in genomes; genomic structural variants; haplotype phasing. Disadvantages: high error rates; higher cost; lower throughput.

Instrument Comparison (Platform; Advantages; Disadvantages). HiSeq PE run (2x75); MiSeq PE run (2x300); Ion Torrent (200 bp, 318 chip): high throughput, lowest per-base cost; high throughput, low per-base cost, fast turnaround?; fast turnaround; short reads, long run time; data quality, homopolymers. Oxford Nanopore; PacBio; Sanger run (96 wells): fast turnaround, various use cases, long reads, low-cost instrument; base-calling; various use cases, long reads; high-quality data, long read length; intense library prep, instrument cost; high cost, low throughput.

FastQ Format. Each record has four lines: header (starts with @), read (sequence), separator (+), and quality scores (phred-33).
@HWUSI-EAS582_157:6:1:1:1501/1
NCACAGACACACACGAACACACAAAGACATGCCCATATGAAGAT
+
%.7786867:778556858746575058873/347777476035
@HWUSI-EAS582_157:6:1:1:1606/1
NCTGGCACCTTGATTTTGGACTTCCCAGCCTCCAGAACTGTGAG
+
%1948988888798988366898888648998788898888588
@HWUSI-EAS582_157:6:1:1:453/1
NCTGCTTGCACCCCTGAAGTCACTGATCACATTTCAGGGTCACC
+
%/868998988888867668888986644788988413488885
@HWUSI-EAS582_157:6:1:1:1844/1
NGATTGACATTGGCAAAGAGGACAACTGATTGCAAACTTCACAC
+
%-7;:::::;86499;75574586::635:62687666887879
@HWUSI-EAS582_157:6:1:1:1707/1
NAGGCTCAGGCGCACGGCCTACATCGTCGCTGTCGGCCAAGGGG
+
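The four-part record layout (header, read, separator, quality scores) can be made concrete with a small parser. This is a minimal sketch, assuming strictly four-line records and phred-33 encoding as labelled on the slide; the names `parse_fastq` and `decode_phred33` are hypothetical, and real pipelines would normally use an established library instead.

```python
from typing import Iterator, List, Tuple


def parse_fastq(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from four-line FASTQ records.

    Assumption: no line wrapping inside a record. A trailing incomplete
    record (fewer than four lines) is silently skipped.
    """
    lines = [ln.strip() for ln in text.strip().splitlines()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, sep, qual = lines[i:i + 4]
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        yield header[1:], seq, qual


def decode_phred33(qual: str) -> List[int]:
    """Convert a phred-33 quality string to integer quality scores."""
    return [ord(ch) - 33 for ch in qual]


# First record from the slide.
SAMPLE = """@HWUSI-EAS582_157:6:1:1:1501/1
NCACAGACACACACGAACACACAAAGACATGCCCATATGAAGAT
+
%.7786867:778556858746575058873/347777476035
"""

for name, seq, qual in parse_fastq(SAMPLE):
    scores = decode_phred33(qual)
    print(name, len(seq), scores[:3])  # first three scores: [4, 13, 22]
```

The low first score (4, from the `%` character) lines up with the uncalled base `N` at the start of the read.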

Assessing Quality: Phred scores. Phred quality scores were originally produced by the Phred base-calling program using a statistical analysis of Sanger chromatogram trace files in support of the Human Genome Project. They were subsequently adapted to NGS technologies for judging the quality of sequences. $Q = -10 \log_{10} P_e$, where $P_e$ is the error probability of a given base call.
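The Phred relation between Q and the error probability P_e can be applied in both directions. The sketch below is a worked example only; the function names `phred_to_error_prob` and `error_prob_to_phred` are hypothetical.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 * log10(P_e) to get P_e = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p_e: float) -> float:
    """Compute Q = -10 * log10(P_e) for an error probability 0 < P_e <= 1."""
    if not 0 < p_e <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p_e)


# Each step of 10 in Q corresponds to a tenfold drop in error probability.
for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```

So Q20 corresponds to a 1 in 100 chance that a base call is wrong, and Q30 to 1 in 1,000.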

Acknowledgements. JCVI: Vinita Puri; William Nierman, Ph.D.; Karen Nelson, Ph.D.; Alan Durbin; Torrey Williams; Kari A. Dilley, Ph.D.; Lauren Oldfield, Ph.D.; Susmita Shrivastava; Nadia Fedorova; Mark Novotny; U19AI110819; Paolo Amedeo, Ph.D.; Reed S. Shabman, Ph.D.; Gene Tan, Ph.D.

Questions? bpickett@jcvi.org