Food Safety (Bio-)Informatics
Henk C. den Bakker
Assistant Professor in Bioinformatics and Epidemiology
Center for Food Safety, University of Georgia
hcd82599@uga.edu

Overview
- Short introduction to Food Safety Informatics
- The digital immune system

Food Safety Informatics?
- The use of information and computer science to advance food safety
- A combination of different individual disciplines:
  - Statistics
  - Computer science
  - Epidemiology
  - Bioinformatics
- Using Big Data approaches and the Internet of Things

The rise of a digital immune system (DIS)
- Coined by David Lipman
- Further worked out by Michael Schatz and Adam Phillippy in 2012*
- Would work in much the same way as an adaptive, biological immune system:
  - Observe the microbial landscape
  - Detect potential threats
  - Neutralize threats before they can cause widespread harm
- Distributed sensor sequencing and bioinformatics, where a network of mobile sequencing devices serves a real-time stream of microbial genomes to a global compute cloud for analysis
* Schatz, M.C., and A. Phillippy. 2012. GigaScience 1 (1): 4. doi:10.1186/2047-217x-1-4.

What is necessary for a digital immune system?
- A catalogue of microbial diversity, so we can tell the normal from the abnormal (a potential threat)
- Centralized (genome) databases, such as NCBI, EMBL and DDBJ
- Rapid bioinformatics tools to deal with the growing amount of (real-time) data
- Sequencing devices (preferably inexpensive and portable) that can act as the sensors in a distributed, real-time sequencing network

The digital immune system http://hint.fm/wind/

Applying the digital immune system to food safety: The GenomeTrakr project
- Project spearheaded by the FDA*
- GenomeTrakr is the first distributed network of labs to utilize whole genome sequencing for pathogen identification
- Consists of 15 federal labs, 25 state health and university labs, 1 U.S. hospital lab, 2 other labs located in the U.S., 20 labs located outside of the U.S., and collaborations with independent academic researchers
- Data curation and bioinformatic analyses and support are provided by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health
- The GenomeTrakr network has sequenced more than 167,000 isolates and closed more than 175 genomes
- The network is regularly sequencing over 5,000 isolates each month
* https://www.fda.gov/food/foodscienceresearch/wholegenomesequencingprogramwgs/default.htm

The sensors and the network
- Illumina short-read sequencers, in particular the MiSeq
- Generate genome sequences as short reads, typically well over 200,000 reads per bacterial genome
https://www.illumina.com

Using whole genome sequencing (WGS) data in outbreak investigations
- WGS data give unprecedented resolution
- Genomic changes (single nucleotide polymorphisms, SNPs) can be used to infer relatedness with past and present strains
- After ~2 years of using WGS for outbreak investigations*:
  - WGS aids in finding the food vehicle for cold cases and sporadic cases, as it can phylogenetically link isolates from human cases and food
  - By sequencing both food-product and patient-derived isolates, outbreaks can be confirmed following product testing, allowing for an early association of an outbreak with a contaminated food
  - WGS can help in a rapid and precise outbreak case definition, and thus productively redirect epidemiological resources
* Jackson et al. 2016. Clin Infect Dis. 63(3):380-6.
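As an illustration of how SNPs can be used to infer relatedness, the sketch below counts pairwise SNP differences between isolates. It assumes the genomes have already been aligned to a common reference; the isolate names and sequences are hypothetical toy data, and this is not the pipeline used in the cited study.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b):
    """Count aligned positions where two sequences differ, skipping gaps and Ns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    ignore = {"-", "N"}
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in ignore and b not in ignore
    )


def pairwise_snp_distances(alignment):
    """Return {(name_1, name_2): SNP distance} for every pair of isolates."""
    return {
        (x, y): snp_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


# Hypothetical toy alignment: a food isolate, two patient isolates, one outlier.
isolates = {
    "patient_1": "ACGTACGTAC",
    "food_1": "ACGTACGTAC",
    "patient_2": "ACGTTCGTAC",
    "unrelated": "TCGAACCTAG",
}

distances = pairwise_snp_distances(isolates)
for pair, dist in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, dist)
```

Isolates separated by few or no SNPs are candidates for belonging to the same outbreak cluster; the threshold used in practice depends on the organism and the investigation.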

NCBI Pathogen detection

The database is growing

How close is the GenomeTrakr network to a digital immune system?
Close, but far from real-time:
- Still dependent on classical microbiology to isolate pathogens, which adds days to weeks to the protocol
- Sequencers are state of the art, but the sequencing procedure takes 2 to 3 days
- The database is growing to a size that is prohibitive for real-time searches

New sequencing technologies and (quasi-)metagenome sequencing
- Novel sequencing protocols that need either no or very limited steps for enrichment of target organisms
- Novel sequencing technologies, e.g., Oxford Nanopore
https://nanoporetech.com

The databases are getting larger and larger

Fortunately we can surf the Big Data wave
Source: http://www.tech-dynamics.com/wp-content/uploads/2014/02/bigdatachart.png

A rediscovery of old data structures/algorithms
- The Big Data field is years ahead of the recent surge in genomic data
- In an effort to speed up analyses and searches of genomic data, old data structures and algorithms are being rediscovered and/or reimplemented:
  - De Bruijn graph (De Bruijn, 1946): genome assembly
  - Bloom filter (Bloom, 1970)
  - MinHash (Broder, 1997): efficient comparison of datasets
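A minimal sketch of how a De Bruijn graph can be used for assembly: reads are split into k-mers, each k-mer becomes an edge between its (k-1)-mer prefix and suffix, and a contig is read off by walking unambiguous edges. The reads, the value of k and the function names are illustrative assumptions; real assemblers also handle sequencing errors, repeats and reverse complements.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a graph whose nodes are (k-1)-mers and whose edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:  # stop rather than loop forever on a cycle
            break
        contig += next_node[-1]
        visited.add(next_node)
        node = next_node
    return contig


# Hypothetical overlapping reads sampled from one short sequence.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
graph = de_bruijn_graph(reads, k=4)
contig = walk_unambiguous_path(graph, "ATG")
print(contig)
```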

MinHash: comparing large datasets with smaller sketches
- Originally developed to compare large electronic documents (Broder, 1997)
- Summarizes a document as a fixed-size subset (sketch) of its information, using a specific criterion to select the members of the subset
- Example: a sketch of a thousand words is approximately large enough to infer the similarity of a document with millions of words
- Translated to bacterial genomes, we can use the same strategy to divide genomes up into words (k-mers) and use a MinHash approach to estimate the relatedness of these genomes
Ondov, Brian D. et al. 2016. Genome Biology 17 (1): 132.
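The idea can be sketched in a few lines of Python. This is a bottom-k MinHash variant written for illustration only: the selection criterion assumed here is "keep the smallest hash values", and the hash function (truncated SHA-1), k-mer length, sketch size and test sequences are all assumptions, not the settings of any particular tool.

```python
import hashlib
import random


def kmers(sequence, k):
    """Return the set of all k-mers (words of length k) in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer using a deterministic hash."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(sequence, k, size):
    """Bottom-k sketch: the `size` smallest distinct k-mer hash values."""
    return sorted({hash_kmer(km) for km in kmers(sequence, k)})[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate the Jaccard similarity of two k-mer sets from their sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


# Hypothetical data: a random 2,000 bp sequence and a copy with 1% substitutions.
rng = random.Random(0)
genome_a = "".join(rng.choice("ACGT") for _ in range(2000))
swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
genome_b = "".join(
    swap[base] if i % 100 == 0 else base for i, base in enumerate(genome_a)
)

estimate = jaccard_estimate(
    sketch(genome_a, k=11, size=200), sketch(genome_b, k=11, size=200), size=200
)
print(f"estimated Jaccard similarity: {estimate:.3f}")
```

Only the sketches need to be stored and compared, so the cost of a comparison depends on the sketch size rather than on the genome size.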

BIGSI: Searching microbial big data
- BLAST has been the traditional search algorithm for genetic and genomic database centers such as NCBI (US) and EBI (Europe)
- However, the majority of genomic datasets (by now hundreds of thousands) are stored as unassembled genomes, each consisting of hundreds of thousands to millions of short reads
- BLAST is generally not fast enough to search these databases in real time

From Bloom filters to BIGSI
Figure credit: David Eppstein, self-made, originally for a talk at WADS 2007, Public Domain, https://commons.wikimedia.org/w/index.php?curid=2609777
Advantages:
- Small storage for large sets of elements
- Fast search
Disadvantage:
- False positives
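A minimal Bloom filter sketch showing where these properties come from: each element sets a few bit positions chosen by hash functions, so storage is fixed and lookups are fast, but unrelated elements can collide on the same bits and produce false positives. The class name, the seeded SHA-1 hashing scheme and the parameters are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size probabilistic set: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        """Yield the bit positions for an item, one per seeded hash function."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True if every position is set; unrelated items may have set them.
        return all(self.bits[pos] for pos in self._positions(item))


# Hypothetical usage: store the k-mers of a read, then test membership.
read = "ACGTACGGTCAGTTGA"
k = 5
bloom = BloomFilter(num_bits=1024, num_hashes=3)
for i in range(len(read) - k + 1):
    bloom.add(read[i:i + k])

print("ACGTA" in bloom)  # a k-mer that was added is always reported as present
```

A larger bit array or a better-tuned number of hash functions lowers the false positive rate at the cost of more storage or more hashing per lookup.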

BIGSI: extension of the Bloom filter
- Bitsliced genomic signature index (BIGSI)
- Allows for superfast search of big sequence databases
- 3 antibiotic resistance genes (MCR-1, MCR-2, MCR-3) could be searched in 1.73 seconds in a database of 447,833 viral and bacterial genomes
P. Bradley, H.C. den Bakker, E. Rocha, G. McVean, Z. Iqbal. 2017. bioRxiv 234955; doi: https://doi.org/10.1101/234955
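A rough sketch of the bit-slicing idea, under stated assumptions: every dataset gets a Bloom filter of the same size with the same hash functions, the filters are stored as columns of a bit matrix, and a query only reads the rows selected by its k-mers and ANDs them together. The class and function names, the hashing scheme and the parameters are hypothetical, and this is not the BIGSI implementation itself.

```python
import hashlib


def kmers(sequence, k):
    """Return the set of all k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def bit_positions(kmer, num_bits, num_hashes):
    """Bloom filter bit positions for a k-mer, shared by all datasets."""
    positions = []
    for seed in range(num_hashes):
        digest = hashlib.sha1(f"{seed}:{kmer}".encode("ascii")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % num_bits)
    return positions


class BitSlicedIndex:
    """Rows are Bloom filter bit positions; bit j of a row belongs to dataset j."""

    def __init__(self, num_bits, num_hashes, k):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.k = k
        self.rows = [0] * num_bits  # each row is an int used as a bit vector
        self.names = []

    def add_dataset(self, name, sequences):
        column = len(self.names)
        self.names.append(name)
        for sequence in sequences:
            for kmer in kmers(sequence, self.k):
                for pos in bit_positions(kmer, self.num_bits, self.num_hashes):
                    self.rows[pos] |= 1 << column

    def query(self, sequence):
        """Names of datasets that appear to contain every k-mer of the query."""
        query_kmers = kmers(sequence, self.k)
        if not query_kmers:
            return []
        hits = (1 << len(self.names)) - 1  # start with all datasets as candidates
        for kmer in query_kmers:
            for pos in bit_positions(kmer, self.num_bits, self.num_hashes):
                hits &= self.rows[pos]
        return [name for col, name in enumerate(self.names) if (hits >> col) & 1]


# Hypothetical toy datasets standing in for unassembled read sets.
index = BitSlicedIndex(num_bits=4096, num_hashes=3, k=5)
index.add_dataset("sample_A", ["ACGTACGGTCAGT"])
index.add_dataset("sample_B", ["TTGACCATGGCAT"])
print(index.query("GTACGGTC"))
```

Because a query touches only a handful of rows regardless of how many datasets are indexed, search time grows slowly with database size; like a single Bloom filter, the index can report false positives but not false negatives.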

Summary
- In food safety, the GenomeTrakr network is the closest thing we have to a digital immune system
- In order to use this network to detect early threats we need further improvements:
  - Sample preparation methods/culture-free methods
  - Sequencing technology (faster, easier, smaller)
  - Bioinformatics
- These improvements are coming fast