Jared Simpson Ontario Institute for Cancer Research & Department of Computer Science University of Toronto

Size: px
Start display at page:

Download "Jared Simpson Ontario Institute for Cancer Research & Department of Computer Science University of Toronto"

Transcription

1 Portable DNA Sequencing: Analysis Methods and Applications Jared Simpson Ontario Institute for Cancer Research & Department of Computer Science University of Toronto Computing for Medicine November 22nd, 2016

2 Timeline of DNA Sequencing 1970s Sanger sequencing invented Manual process; low throughput 2

3 Timeline of DNA Sequencing s Automation of Sanger sequencing Megabases of data per run 3

4 Timeline of DNA Sequencing 2000s Massively Parallel Sequencing s Gigabases per run 4

5 Timeline of DNA Sequencing 2010s Centralisation of sequencing 18,000 human genomes/year 5

6 Miniature, Portable Sequencing Joseph Bore running the MinION in Guinea, Quick et al Nature (2016) 6

7

8

9 @explornaut

10 @NASA_Astronauts

11 Why use genome sequencing during an outbreak? Sequence viral genomes to calculate mutation rate Identify host adaptations Characterise response to therapy/immunization Relate cases to each other, identify chains of transmission and monitor geographic spread 11

12 Why use genome sequencing during an outbreak? Sequence viral genomes to calculate mutation rate Identify host adaptations Characterise response to therapy/immunization Relate cases to each other, identify chains of transmission and monitor geographic spread 12

13 Ebola Surveillance Ebola is passed through direct contact A 13

14 Ebola Surveillance Ebola is passed through direct contact A B C Person A transmits the virus to B and C 14

15 Ebola Surveillance Ebola is passed through direct contact A B C D E F Person A transmits the virus to B and C Person B transmits the virus to D and E Person C transmits the virus to F 15

16 Ebola Surveillance Ebola virus mutates at a rate of ~ mutations/bp/year These mutations allow us to track patterns of transmission A B C D E F Ebola-A AGTAGCCTACGATACTACGATCGACTTA Ebola-B AGTAGCCTACGATATTACGATCGACTTA Ebola-C AGTAGCCGACGATACTACGATCGACTTA Ebola-D AGTTGCCTACGATATTACGATCGACTTA Ebola-E AGTAGCCTACGATATTACGATCGACATA Ebola-F AGTAGCCGACGATACTACGATGGACTTA 16

17 Ebola Surveillance Ebola virus mutates at a rate of ~ mutations/bp/year These mutations allow us to track patterns of transmission A B 10:T>G C D E F Ebola-A AGTAGCCTACGATACTACGATCGACTTA Ebola-B AGTAGCCTACGATATTACGATCGACTTA Ebola-C AGTAGCCGACGATACTACGATCGACTTA Ebola-D AGTTGCCTACGATATTACGATCGACTTA Ebola-E AGTAGCCTACGATATTACGATCGACATA Ebola-F AGTAGCCGACGATACTACGATGGACTTA T>G here indicates C/F lineage 17

18 Ebola Surveillance Ebola virus mutates at a rate of ~ mutations/bp/year These mutations allow us to track patterns of transmission 17:C>T A 10:T>G B C D E F Ebola-A AGTAGCCTACGATACTACGATCGACTTA Ebola-B AGTAGCCTACGATATTACGATCGACTTA Ebola-C AGTAGCCGACGATACTACGATCGACTTA Ebola-D AGTTGCCTACGATATTACGATCGACTTA Ebola-E AGTAGCCTACGATATTACGATCGACATA Ebola-F AGTAGCCGACGATACTACGATGGACTTA C>T here indicates B/D/E lineage 18

19 Ebola Surveillance We can t sequence every case in the outbreak but by sampling enough cases we can build a picture of how the virus is spreading (e.g. geographically) A B C Ebola-D AGTTGCCTACGATATTACGATCGACTTA Ebola-E AGTAGCCTACGATATTACGATCGACATA Ebola-F AGTAGCCGACGATACTACGATGGACTTA D E F Sequence these cases; D and E share a mutation - possibly related? 19

20 Why use portable genome sequencing? West Africa does not have sequencing infrastructure Old process: take samples locally, send to lab in Europe/North America, sequence, return results to local epidemiologists Slow, difficult to ship samples New process: use portable sequencers directly at the location they are needed 20

21 Image credit: Genome Research Limited 21

22

23

24

25

26 Porton9Down9validation9set,989.1%9coverage Guinea9199reactions9v1,998.1%9coverage Guinea9119reactions9v1,995.9%9coverage Guinea9119reactions9v2,998.4%9coverage Coverage Depth

27 Computational Challenges of Nanopore Analysis Input: a set of nanopore reads (r1,r2,,rn) from an Ebola genome (g) Output: the sequence of the Ebola genome, g 27

28 Computational Challenges of Nanopore Analysis Main challenge: the base-calling error rate is quite high (~10%) 28

29 29 Calculating the Genome Sequence GCTACGATCGACTTACGA Test9 Possible9 Mutations ACTACGATCGACTTACGA CCTACGATCGACTTACGA TCTACGATCGACTTACGA... -CTACGATCGACTTACGA G-TACGATCGACTTACGA GC-ACGATCGACTTACGA GCT-CGATCGACTTACGA... GACTACGATCGACTTACGA GCCTACGATCGACTTACGA GGCTACGATCGACTTACGA GTCTACGATCGACTTACGA Score9using9 Probabilistic9 Model

30 Nanopore Sequencing Illustration by David Eccles 30

31 Nanopore Sequencing GCTACGATT Sample Current Current (pa) time (s) 31

32 Nanopore Sequencing GCTACGATT Sample Current 50 Current (pa) time (s) 32

33 Nanopore Sequencing GCTACGATT Sample Current Current (pa) time (s) 33

34 Nanopore Sequencing GCTACGATT Sample Current Current (pa) time (s) 34

35 Nanopore Sequencing CTACGATT Sample Current Current (pa) time (s) 35

36 Base Calling Speech recognition: predict a sentence from a recording of someone s voice Welcome,to,Computing, for,medicine Base calling: predict a nucleotide sequence from a nanopore signal R1 ACGATAGACCGTATTAG The same methods that are used to process speech (hidden Markov Models, deep neural networks) can be used for nanopore data. 36

37 Ebola Genomes We sequenced and reconstructed 142 Ebola genomes using the MinION and hidden Markov model-based analysis software (nanopolish) Ebola-1 Ebola-2 Ebola-3 Ebola-4 AGTAGCCTACGATACTACGATCGACTTA AGTAGCCTACGATATTACGATCGACTTA AGTAGCCGACGATACTACGATCGACTTA AGTTGCCTACGATATTACGATCGACTTA... Ebola-141 AGTAGCCTACGATACTACGATCGAGTTA Ebola-142 AGTAGGCTACGATACTACGATCGACTTA Next computational challenge: Arrange these genomes into a phylogenetic tree 37

38 Ebola Phylogeny 38

39 Rapid sequencing 39

40 zibraproject.org

41

42 Discussion/Future Outlook Overview of portable sequencing: new applications possible Data is challenging to work with but similar to classic machine learning problems Sequencing is likely to become a larger and larger part of diagnostics (e.g. sequencing an infection to discover antimicrobial resistance genes) Novel applications of Nanopore sequencing for human genomes: phasing variation, directly detecting methylated bases If you re interested in discussing further contact me: jared.simpson@oicr.on.ca 42

43 Acknowledgements Jonathan Dursi Matei David Phil Zuzarte Nick Loman Josh Quick