Machine Learning and Data Fusion Methods for Phenotype-based Threat Assessment of Unknown Bacteria


1 Machine Learning and Data Fusion Methods for Phenotype-based Threat Assessment of Unknown Bacteria
Dr. Paul Sheehan, Program Manager, BTO
SBIR/STTR Industry Day, DARPA Conference Center, Arlington, VA
August 2018
Distribution Statement A (Approved for Public Release, Distribution Unlimited)

2 Operational Motivation: Emerging and newly engineered pathogens
DoD Problem: Biological threats, either adversary-engineered or currently unknown, can evade our current forensics, which require a priori knowledge of a bacterium's function or biochemical makeup.
- New drug resistance mechanisms
- Emergence from animal reservoirs
- New releases in the environment
- Engineered threats from state and non-state actors
Examples:
- Unknown bacterium found in a cave 1,000 feet underground (Pawlowski AC et al., Nature Communications 2016)
- Unknown bacterium killing wild animals in West Africa (Antonation KS et al., PLOS Negl Trop Dis 2016)
- Melting permafrost in Svalbard, Norway
Image credits: Jeff Vanuga/Getty Images; CDC Public Health Image Library

3 SOA: Bioinformatic methods for threat identification
Metagenomics
[Figure: workflow panels labeled "Extract and Fragment DNA", "Sequence DNA", "Phylogenetic binning", and "Assign genomes", with four example sequence reads]
- Gene assembly and phylogenetic binning
- Provides an inventory of known bacteria present in the sample
- Only genotype, no phenotype
- Difficult to determine and identify unknown bacteria
- Takes too long, ~36 hours
Machine & Deep Learning
[Figure: Feature Extraction + Classification]
- Feature extraction and metadata compilation and integration often necessitate human curation
- Algorithms require large training data sets

4 Identifying pathogens via phenotype
The Problem
- Condensing and processing diverse phenotypic data, including discrete and continuous datasets
- Deriving decisive algorithms using limited training datasets
Examples of critical pathogen phenotypes:
Harms host
- Production of exotoxins that damage host cells
- Ability to disrupt cell membranes
- Production of endotoxins that cause septic shock
Self-preservation
- Production of proteins that block antibody function
- Presence of polysaccharide capsule coating the cell
- Production of enzymes that counter antibiotics
Niche finding
- Induction of virulence factors in host environment
- Adherence to host membrane
- Ability to scavenge iron from host
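
One way to picture the first problem on this slide (condensing discrete and continuous phenotypic data) is to encode both kinds of measurement into a single numeric vector per strain. The sketch below is only an illustration under stated assumptions: the strain names, trait names, and values are hypothetical, and min-max scaling is one simple choice among many, not a method specified in the briefing.

```python
from typing import Dict, List

# Hypothetical phenotype measurements for illustration only; not real assay data.
# Discrete traits are booleans; continuous traits are floats in arbitrary assay units.
SAMPLES: Dict[str, Dict[str, object]] = {
    "strain_A": {"has_capsule": True, "hemolysis": True, "iron_uptake_rate": 4.0},
    "strain_B": {"has_capsule": False, "hemolysis": False, "iron_uptake_rate": 1.0},
    "strain_C": {"has_capsule": True, "hemolysis": False, "iron_uptake_rate": 2.5},
}
DISCRETE = ["has_capsule", "hemolysis"]   # assumed trait names
CONTINUOUS = ["iron_uptake_rate"]         # assumed trait name


def min_max(values: List[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [0.0 if span == 0 else (v - lo) / span for v in values]


def fuse(samples: Dict[str, Dict[str, object]]) -> Dict[str, List[float]]:
    """Return one numeric vector per strain: discrete traits as 0/1, then scaled continuous traits."""
    names = sorted(samples)
    scaled = {
        trait: dict(zip(names, min_max([float(samples[n][trait]) for n in names])))
        for trait in CONTINUOUS
    }
    fused: Dict[str, List[float]] = {}
    for n in names:
        vec = [1.0 if samples[n][t] else 0.0 for t in DISCRETE]
        vec += [scaled[t][n] for t in CONTINUOUS]
        fused[n] = vec
    return fused


FUSED = fuse(SAMPLES)
for name, vec in FUSED.items():
    print(name, vec)
```

Putting every trait on a 0 to 1 scale keeps a continuous measurement with large units from dominating the binary traits when distances between strains are computed.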

5 Machine Learning approach to pathogen prediction
Deliverable: Develop an integrated computational platform that fuses multiple types of phenotypic data and applies machine learning algorithms to identify the pathogenic potential of bacterial samples based on phenotype alone
Data integration and mapping
- Challenge: Combine a variety of data types from experiments and literature to identify threats and map genotype to phenotype
Algorithm generation
- Challenge: Train mathematical models accurately using datasets with a limited number of data points
[Figure: Dataset 1, Dataset 2, ..., Dataset n]

6 Phase I Overview
Phase I Objectives:
- Describe an approach to algorithm training and provide sufficient support for any claims regarding the ability to identify previously unknown pathogens.
- Describe gaps in existing methods for phenotypic data fusion and machine learning of biological data, and the specific advances offered by the proposed approach.
- Platforms must be able to incorporate and operate on data related to, at a minimum, three main categories of pathogenic traits:
  - Niche finding (e.g., ability to adhere to host tissue)
  - Ability to harm a host (e.g., exotoxin secretion, host cell damage)
  - Self-preservation (e.g., immune system evasion)
- Demonstrate integration of data for known pathogenic and non-pathogenic strains into computational structures amenable to Machine Learning.
Think about:
- What are the novel approaches for high-throughput data curation and condensation?
- What types of training datasets should be used?
- What are the measures of performance and effectiveness?

7 Phase II Overview
Phase II Objectives:
- Incorporate methods for rapid ingestion of new data and demonstrate expandability
- Demonstrate quantifiable improvements in identification accuracy and speed
  - For known pathogens
  - For species for which there has been no definitive identification
- Establish a roadmap for platform transition
  - Accessibility to users with no data computation background
  - Intuitive GUI
Think about:
- How can indicators be identified in diverse and uncertain data?
- What critical phenotypes are necessary for pathogen identification?
- How can Machine Learning increase its effectiveness while decreasing the requisite number of data points?

8 Phase III Overview
Objectives: Possible transition to performers and government partners
- Other DARPA Programs
- Commercial and military applications
- NIH
Think about:
- Scalability
- Flexibility
- Interpretability

9 Deliverables: Machine Learning approach to pathogen prediction
Phase I
- Ability to identify well-known pathogens and non-pathogens with >90% accuracy
- Report describing the training data, data source, and algorithm implementation
- Documentation related to the underlying architecture and use of the platform
- Demonstrated understanding of the field and the needs of end users
Phase II
- Final report detailing improvements, including a quantitative comparison of performance
- Data description and results of any applications to sample species
- A plan for Phase III transition
- The algorithm itself with appropriate documentation
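
The >90% accuracy deliverable can be checked with a short evaluation routine. The sketch below is a minimal example with hypothetical labels; reporting sensitivity and specificity alongside accuracy is an assumption added here for illustration, since the briefing names only accuracy.

```python
from typing import Dict, Optional, Sequence

ACCURACY_TARGET = 0.90  # threshold taken from the ">90% accuracy" deliverable


def evaluate(truth: Sequence[str], predicted: Sequence[str],
             positive: str = "pathogen") -> Dict[str, Optional[float]]:
    """Compute accuracy, sensitivity, and specificity for a two-class problem."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("truth and predicted must be non-empty and equal in length")
    tp = fn = fp = tn = 0
    for t, p in zip(truth, predicted):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    return {
        "accuracy": (tp + tn) / len(truth),
        # None marks a rate that is undefined because that class is absent.
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
    }


# Hypothetical labels for ten well-known strains; illustration only.
TRUTH = ["pathogen"] * 5 + ["non-pathogen"] * 5
PRED = ["pathogen"] * 4 + ["non-pathogen"] + ["non-pathogen"] * 5

REPORT = evaluate(TRUTH, PRED)
MEETS_TARGET = REPORT["accuracy"] > ACCURACY_TARGET
print(REPORT, "meets target:", MEETS_TARGET)
```

In this made-up case one pathogen is missed, so accuracy is exactly 0.9 and the strict ">90%" test fails; the per-class rates show that the whole shortfall comes from a missed pathogen rather than a false alarm.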

10 Questions?