Big Data Standards and the Potential Long-Term Benefits for Research and Clinical Development

1 Big Data Standards and the Potential Long-Term Benefits for Research and Clinical Development
Eric Engelhard, Ph.D.
Director of Informatics, Mouse Biology Program, UC Davis

2 UC Davis Mouse Biology Program, KOMP, MMRRC, and MMPC
- UC Davis Mouse Biology Program (mousebiology.org)
  - A division of the Center of Comparative Medicine at UC Davis
  - Full service mouse transgenics, cryoarchiving, phenotyping, and bioinformatics
- Knockout Mouse Program / KOMP Repository / KOMP-312 / KOMP2
  - International collaborations for high throughput transgenics, phenotyping (mousephenotype.org), and repository of mouse biomedical models (komp.org)
  - Original KOMP-312 phenotyping pilot at UC Davis (kompphenotype.org)
  - Both transgenic mouse production and full phenotyping pipeline at UC Davis as part of the DTCC consortium
  - Uniquely includes RNAseq analysis of compensatory gene expression
  - Independent, selective genomic sequencing of mouse strains and embryonic stem cells
- Mutant Mouse Regional Resource Centers (mmrrc.org)
  - Repository of donor submitted mutant mouse lines with published phenotypes
  - UCD MBP serves as one of the repository centers as well as the Informatics and Customer Service Center for all four centers
- Metabolic Mouse Phenotyping Center (mmpc.org, mmpc.ucdavis.edu)
  - One of six centers providing the scientific community with standardized, high quality metabolic and physiologic phenotyping services for mouse models of diabetes, diabetic complications, obesity and related disorders

3 Big Data Opportunities and Consequences
Opportunity: Facilitates the use of very large data sets
  Specifics: Very large, complex tables and astronomically large combinations
  Consequences: Requires specialized hardware and software infrastructure, limited real time interactivity, centralization of services
Opportunity: Integration of diverse data types (whole systems)
  Specifics: Non-normalized, nonrigid data table structures across multiple modalities and domains
  Consequences: Potentially steep learning curves of nonstandard structures

4 Analytical Goals
Table columns: Stage; Author Benefits from Standards; Reader Benefits from Standards
Stages: Exploration; Proposing and Testing Hypotheses; Communicating Results; A Source for Additional Exploration
(Four cells are marked X; their positions did not survive transcription.)

5 R and Sweave: An Existing Model for Communicating Analytics
- Sweave is a method for integrating R analyses into LaTeX documents
- Facilitates the encapsulation of analytics
  - Data access
  - Analytical methods
  - Interpretation
  - Discussion
- Full communication of formalized thought

6 Big Data: Existing Implementation and Analysis Standards
- Machine images
- Data volumes
- Map/Reduce frameworks
  - Hadoop and its family of tools
- High-level query methods
  - Pig and Pig Latin
- Integrated machine learning
  - Mahout
- Keep in mind that Map/Reduce results represent data reductions that can be sent on to more common analytical tools
  - R, S, SAS, MATLAB, NumPy/SciPy, BioPerl/Python/Java
- Associated visualization methods
  - Interactive web displays
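
The point that Map/Reduce results are data reductions can be illustrated with a minimal, single-process sketch in plain Python. It does not use Hadoop or Pig; the record layout (gene, phenotype term) and the function names are assumptions made for illustration only.

```python
from collections import defaultdict

def map_phase(records):
    """Emit one (key, 1) pair per record, keyed by phenotype term."""
    for _gene, phenotype in records:
        yield phenotype, 1

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Collapse each group of values to a single count."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical records: (gene symbol, phenotype term)
records = [
    ("geneA", "abnormal gait"),
    ("geneB", "abnormal gait"),
    ("geneB", "decreased body weight"),
    ("geneC", "abnormal gait"),
]

counts = reduce_phase(shuffle(map_phase(records)))
print(counts)
```

Four input records collapse to two counts; a summary of this size is the kind of result that can be handed on to R or similar tools.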

7 Machine Learning and Mahout
- Machine learning approaches for Big Data analysis
- Mahout (mahout.apache.org)
  - Scalable machine learning libraries integrated with Hadoop
  - Open community

8 Data Structure Standards
- Genomic variation formats applied after alignment to a reference sequence
  - Requires genomic reference standard(s)
- SNPs, Indels
  - CASAVA, GATK
- Breakpoints
  - BreakDancer
- Data reduction
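
As a toy illustration of why recording variation against a reference is a data reduction, the sketch below lists only the positions where a sample differs from the reference. It is not how CASAVA or GATK work; it assumes the sample is already aligned with no gaps, so it reports substitutions only, and the sequences and function name are hypothetical.

```python
def call_substitutions(reference, aligned, start=1):
    """Return (position, ref_base, alt_base) for each mismatching base.

    Assumes `aligned` is already aligned to `reference` without gaps,
    so indels and breakpoints are out of scope for this sketch.
    """
    if len(reference) != len(aligned):
        raise ValueError("sequences must have equal length")
    return [
        (start + i, ref, alt)
        for i, (ref, alt) in enumerate(zip(reference, aligned))
        if ref != alt
    ]

# Hypothetical sequences, 10 bases each.
reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"

variants = call_substitutions(reference, sample)
print(variants)
```

Ten bases reduce to two variant records, but only because both sides agree on the same reference, which is why a reference standard is required.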

9 Controlled Vocabularies
- Enhances access for both novice and experienced users
- Facilitates understanding and acceptance by the scientific community
- Enhances programmatic access
  - Individual queries
  - Mapped vocabularies and interoperability

10 Biological Ontologies
- Controlled vocabulary of entities
- Semantic relationships between entities
- Structural constraints
  - Directed Acyclic Graphs (DAGs)
- Open Biology Ontologies
  - Links to ontologies
  - Existing mappings
  - Educational access through associated browsers
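
The DAG constraint can be made concrete with a small sketch: each term may have several parents, and a query for a term's ancestors follows every parent edge upward. The terms and edges below are hypothetical and are not taken from any published ontology.

```python
# Hypothetical is_a edges: child term -> list of parent terms.
PARENTS = {
    "abnormal gait": ["abnormal locomotor behavior"],
    "abnormal locomotor behavior": ["abnormal behavior"],
    "tremors": ["abnormal locomotor behavior", "abnormal involuntary movement"],
    "abnormal involuntary movement": ["abnormal behavior"],
    "abnormal behavior": [],
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

print(sorted(ancestors("tremors")))
```

Because a term can have more than one parent, an annotation made at a specific term also counts toward every ancestor, which is what lets broad and narrow queries share one vocabulary.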

11 Useful Ontologies for Translational & Personal Medicine
- Gene Ontology (GO)
  - Process, Function, Component
- Sequence Ontology (SO)
- Mammalian Phenotype (MP)
- Mouse Anatomy (MA)
- Human Phenotype (HPO)

12 Inferring Functionality Through GO Graph Enrichment
- R: goseq
- Web: GOrilla (cbl-gorilla.cs.technion.ac.il)
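
For readers unfamiliar with enrichment testing, the sketch below computes a plain one-sided hypergeometric p-value for a single term. This is the textbook test, not the specific procedure used by goseq (which also corrects for selection bias) or by GOrilla; the function name and the example counts are assumptions.

```python
from math import comb

def enrichment_p(population, annotated, selected, overlap):
    """P(X >= overlap) under the hypergeometric distribution.

    population: total number of genes considered
    annotated:  genes in the population carrying the term
    selected:   genes in the list of interest
    overlap:    selected genes that carry the term
    """
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 20 genes, 5 annotated, 4 selected, 3 of those annotated.
p = enrichment_p(population=20, annotated=5, selected=4, overlap=3)
print(f"p = {p:.4f}")
```

In practice the test is repeated for every term in the graph, so the resulting p-values need a multiple-testing correction before interpretation.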

13 Ontology Limitations
- Benefits to interoperability and a common language, BUT...
- Data loss
  - Ontologies represent a subset of extant knowledge at a particular point in time
  - Annotations will change
- If at all possible, include the original data and meta-data

14 Other Controlled Vocabularies and Mapped Structures
- Standardized methods lead to standardized informatics structures and finally to standard implementations
  - European Mouse Phenotyping Resource of Standardised Screens (EMPRESS)
  - European Mouse Disease Clinic (EUMODIC)
  - EMPRESS SLIM
  - UC Davis Phenotyping LIMS
- Gene homologs
- Synteny maps

15 Omics Conclusion: The Anatomy of a Single Row
- Structured data reduction (e.g. variation mapped against a reference)
- Raw data and meta-data
- Structural conventions
- Controlled vocabularies
- Ontologies
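
One way to picture such a row is as a small typed record. The sketch below is only an assumption about how the listed parts might fit together; every field name, the vocabulary, the term identifier, and the file path are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical controlled vocabulary for the kind of variation recorded.
VARIANT_TYPES = {"SNP", "indel", "breakpoint"}

@dataclass
class ReducedRow:
    sample_id: str
    reference: str        # name of the reference the data was mapped against
    position: int         # 1-based coordinate on that reference
    variant_type: str     # must come from VARIANT_TYPES
    ontology_terms: list = field(default_factory=list)  # ontology term identifiers
    metadata: dict = field(default_factory=dict)        # meta-data kept with the row
    raw_data_path: str = ""  # pointer back to the unreduced data

    def __post_init__(self):
        # Enforce the controlled vocabulary when the row is created.
        if self.variant_type not in VARIANT_TYPES:
            raise ValueError(f"unknown variant type: {self.variant_type!r}")

row = ReducedRow(
    sample_id="sample-001",
    reference="example-reference-v1",
    position=5,
    variant_type="SNP",
    ontology_terms=["EX:0000001"],
    metadata={"sex": "F", "age_weeks": 12},
    raw_data_path="raw/sample-001.bam",
)
print(row)
```

Keeping a pointer to the raw data alongside the reduced values means a row can be re-annotated later if the vocabularies or ontologies change.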

16 Beyond the Big Crunch
- Big Data tables and astronomically large combinations stay within the data center
- A sharp reduction in volume after processing means that results CAN be shipped to the edge
- Harvesting domain expertise for better data utilization
  - Data views
  - Data interactivity
    - Boolean queries
    - Visual query building
  - Data visualization
- Extending Big Data tables
  - Increase in sample size
  - Association of additional table row information: environmental variables, epigenetics, and other factors that may influence the presentation of phenotypes
- Real-time interaction
  - Intelligent caching and pre-fetching
- Data standards, APIs, and pushing Big Data capabilities to end users
- Crossing domains
  - Research to clinicians
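
Boolean queries over reduced rows, combined with caching of repeated queries, might look like the sketch below. It uses simple memoization from the Python standard library as a stand-in for "intelligent caching" and does no pre-fetching; the rows, gene names, and `query` function are hypothetical.

```python
from functools import lru_cache

# Hypothetical reduced rows: one per knockout, with its phenotype terms.
ROWS = (
    ("geneA", frozenset({"abnormal gait", "tremors"})),
    ("geneB", frozenset({"abnormal gait"})),
    ("geneC", frozenset({"decreased body weight"})),
)

@lru_cache(maxsize=256)
def query(all_of, none_of=frozenset()):
    """Genes annotated with every term in `all_of` AND NOT any term in `none_of`.

    Arguments are frozensets so that lru_cache can hash them.
    """
    return tuple(
        gene
        for gene, terms in ROWS
        if all_of <= terms and not (none_of & terms)
    )

# "abnormal gait" AND NOT "tremors"
result = query(frozenset({"abnormal gait"}), frozenset({"tremors"}))
print(result)
```

Repeating the same query is answered from the cache rather than by rescanning the rows, which matters more as the reduced table grows.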

17 Domain Specific Presentations
- Static
  - Data views
- Interactive
  - Complex queries
    - Boolean
    - Visual query building
  - Model storage
  - Data visualization
  - Fuzzy feedback

21 Co-Clustering
- Clustering gene knockouts by phenotype exceptions
- Overlay of predicted tumor suppressors from RTCGD
- Putative tumor suppressor genes Sulf2, Tmprss4, and Slc44a3 in local cluster of neurological phenotype exceptions
- Enpp5 may play a role in neuronal cell communication (SwissProt)
- Hhipl2 is a homolog to HHIPL2, a human gene deregulated in gastric carcinomas
- Figure labels: p = 5 × 10^-7; RTCGD
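
The idea of grouping knockouts by shared phenotype exceptions can be sketched with Jaccard similarity and single-linkage grouping. This is a simpler, one-sided technique than co-clustering, which groups genes and phenotypes at the same time, and it is not the method behind the slide's figure; the gene names, phenotype sets, and threshold are all hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Shared items divided by all items across both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(profiles, threshold=0.5):
    """Single-linkage grouping: join genes whose similarity >= threshold."""
    groups = {gene: {gene} for gene in profiles}
    for g1, g2 in combinations(profiles, 2):
        if jaccard(profiles[g1], profiles[g2]) >= threshold:
            merged = groups[g1] | groups[g2]
            for gene in merged:
                groups[gene] = merged
    unique = {frozenset(members) for members in groups.values()}
    return sorted(sorted(members) for members in unique)

# Hypothetical knockouts and their phenotype exceptions.
profiles = {
    "geneA": {"abnormal gait", "tremors"},
    "geneB": {"abnormal gait", "tremors", "seizures"},
    "geneC": {"decreased body weight"},
    "geneD": {"decreased body weight", "decreased fat mass"},
}

clusters = cluster(profiles)
print(clusters)
```

An external annotation, such as a list of predicted tumor suppressors, can then be overlaid on the resulting groups to look for local concentrations.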

22 Translational Medicine, Epidemiology, and Clinical Medicine

23 In Standards-Compliant Systems, Developers Embrace and Extend You
1. Controlled vocabularies
2. Standardized data structures and methods
3. Application programming interfaces (APIs)
4. Integration of your creation into someone else's local system