Biomedical Informatics in BIG DATA Era

1 Biomedical Informatics in BIG DATA Era Yang C. Fann, Ph.D. Director, Intramural IT and Bioinformatics Program National Institute of Neurological Disorders and Stroke

2 Disclaimer The opinions or assertions contained herein are the private ones of the author/speaker and are not to be construed as official or reflecting the views or policies of the author's employer (NINDS/NIH/HHS) nor the US government. Therefore, they cannot be liable or assume any responsibility for any consequence of using information presented. The speaker makes no claims of any conflict of interest with any parties mentioned in this presentation.

3 Outline Evolution of Biomedical Informatics in BIG DATA Era BIG DATA Trends and Challenges in Biomedical Research BIG DATA - New Frontier of Biomedical Informatics?

4 State of Biomedical Informatics Research Bench-to-Bedside Translational Research Genomics Proteomics Systems Biology Structural Biology Molecular Biology Bioinformatics Bioinformatics Patient Care Clinical Research Basic Research Clinical Trials GWAS Epigenetics Behavior or Nat. History Population Study ICU In/out-patient Disease outbreak Gene therapy Prevention Re-hab. Medical Informatics Interdisciplinary Research Team Research Networks, Centers of Excellence, etc.

5 Evolution of NIH Funded Research & BIG DATA Research Funding: PI Initiated Team Science Interdisciplinary Research Networks Data Sharing: Lab Silo Data Multi-site Data Sharing Data Repository

6 Evolution of Biomedical Informatics Collaboration Interoperability Enable BIG Data Science Biomedical Informatics is ahead of other research fields in working collaboratively and will play an important role in the next BIG Data era!

7 What is BIG DATA? Wikipedia: a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools. The challenges include capture, curation, storage, search, sharing, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."

8 BIG DATA is Everywhere! BIG Data trends in Biomedical Research?

9 Challenges of Biomedical Informatics Projects Vast amount of datasets? Diverse data types (e.g. genomics, imaging, clinical, EHR, sensing, mHealth)? Clearly defined project goals and aims? Literature/publications/knowledge discovery Knowing what data/resources are available? Integration of different types of data/information? Finding right tools and applications for analysis? Collaboration and data sharing Analyze data/information leading to new findings and publications Resources and Infrastructure ($$ & people) Translate research findings to biomedical discovery?

10 NIH Funding on Big Data Projects [Chart: funding in millions of dollars; vertical axis runs up to $800 million] Over 75% are Biomedical Informatics involved projects

11 NIH Big Data Funding Categories ( )

12 BIG Data Publications a). By Google Scholar b). Related to health research, indexed by PubMed, ACM, Web of Science, etc. Andreu-Perez, et al, IEEE J. Biomed. & Health Informatics, V. 19, p1193, 2015

13 BIG DATA Impacts to Translational Bioinformatics? The growth of biomedical research data is evident in many ways: GenBank (which as of August 2015 contains more than 199 billion DNA bases from more than 187 million reported sequences), plus Whole Genome Sequences (WGS) Translational and clinical research has experienced similar growth in data volume, in which gigabyte-scale digital images are common, and complex phenotypes derived from clinical data involve data extracted from millions of records with billions of observable attributes; how about genomic medicine? The biomedical research community is in an era of the thousand-dollar human genome needing a million-dollar interpretation.
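To put the GenBank figures on this slide in perspective, here is a rough back-of-the-envelope sketch. It uses only the two numbers quoted above (both stated as "more than", so they are lower bounds). The 2-bits-per-base packing is an illustrative assumption: it ignores ambiguity codes, annotations, and metadata, and is not how GenBank actually stores records. The function names are hypothetical.

```python
# Figures quoted on the slide (GenBank, August 2015); both are lower bounds.
TOTAL_BASES = 199_000_000_000
TOTAL_SEQUENCES = 187_000_000


def mean_sequence_length(bases: int, sequences: int) -> float:
    """Average number of bases per reported sequence."""
    return bases / sequences


def packed_size_gb(bases: int, bits_per_base: int = 2) -> float:
    """Size in gigabytes (10**9 bytes) if every base took `bits_per_base` bits.

    Assumption for illustration only: A/C/G/T packed into 2 bits each,
    with no ambiguity codes, headers, or annotations.
    """
    return bases * bits_per_base / 8 / 1e9


if __name__ == "__main__":
    print(f"mean length: {mean_sequence_length(TOTAL_BASES, TOTAL_SEQUENCES):.0f} bases")
    print(f"2-bit packed size: {packed_size_gb(TOTAL_BASES):.2f} GB")
```

Under these assumptions the raw bases alone are modest in size; the volume problem the slide points to comes from whole-genome reads, images, and clinical records, and from the interpretation work on top of them.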

14 Next Frontier of Biomedical Informatics Initiatives Four Scientific Areas Facilitating Broad Use of Biomedical Big Data Developing and Disseminating Analysis Methods and Software Enhancing Training for Biomedical Big Data Establishing Centers of Excellence for Biomedical Big Data

15 Facilitating Broad Use of Biomedical Big Data New Policies to Encourage Data & Software Sharing Catalog of Research Datasets to Facilitate Data Location & Citation (Data Discovery Index) Frameworks for the development of community-based standards Enabling Research Use of Clinical Data

16 Developing and Disseminating Analysis Methods and Software Software to Meet Needs of the Biomedical Research Community, both analytic software and management/processing software The creation of a Catalog of NIH-funded Software Facilitating Data Analysis: Access to Large-scale Computing

17 Enhancing Training for Biomedical Big Data Increase the Number of Computationally Skilled Biomedical Trainees Strengthen the Quantitative Skills of All Biomedical Researchers Enhance NIH Review and Program Oversight Next generation of data scientists!

18 Future Biomedical Informatics Workforce Needs Driven in part by $30B investment in EHR adoption Office-based physicians (Hsiao, 2014) Emergency departments (Jamoom, 2015) Outpatient departments (Jamoom, 2015) Non-federal hospitals (Charles, 2014)

19 As well as by opportunities in research

20 Establishing Centers of Excellence for Biomedical Big Data Advance the science of Big Data in the context of biomedical and behavioral research, and to create innovative new approaches, methods, software, and tools.

21 Initiatives The Commons is a shared and interoperable computing environment intended to take advantage of emergent public and private cloud computing platforms and existing high performance computing (HPC) resources. The Commons is intended to facilitate access and catalyze the sharing, use, reuse, interoperability and discoverability of shared digital research objects.

22 New Data Sharing Policy for BIG Data Science

23 NIH Genomic and Human Data Sharing Policy

24 NEW JOURNAL: nature.com/scientificdata

25 Making data discoverable Linking between research papers, Data Descriptors, and data records

26 Neuroscience New Dataset Data in OpenfMRI Source code in GitHub Big Data Code in GitHub

27 Stem Cells Associated Nature Article Data at figshare & NCBI GEO Integrated figshare data viewer

28 Some BIG Data Science NIH Geno- and Pheno-typing correlation NCI TCGA/CGHub NCBI dbGaP

29 Correlating Genotype and Phenotype Findings About $3 billion to analyze 500,000 subjects, relating disease and outcome phenotypes to genetic variants Variability in definitions of phenotypic data (map to common agreed-upon standards (CDE)) Utilize EHR data (too diverse)? Data sharing issues across research networks Terabytes of high-throughput NGS data reduced down to gigabytes of consensus sequences Network bandwidth (compression?) Develop reference standard genomes (<1% difference) Protection and sharing of sensitive phenotype data and biospecimens De-identification, acceptable-use and sharing policies
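Two of the points on this slide, mapping variable phenotype definitions onto common data elements (CDE) and de-identifying records before sharing, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method used by any NIH repository. The site names, field names, and CDE names are all hypothetical and are not taken from a real CDE catalog. The keyed hash only pseudonymizes the subject identifier; real de-identification also has to deal with dates, free text, and rare attribute combinations.

```python
import hashlib
import hmac

# Hypothetical mapping from each site's local field names to shared CDE names,
# with a converter into the CDE's unit. Invented for illustration.
SITE_TO_CDE = {
    "site_a": {
        "age_yrs": ("age_years", float),
        "wt_kg": ("weight_kg", float),
    },
    "site_b": {
        "AgeMonths": ("age_years", lambda v: float(v) / 12.0),
        "WeightLb": ("weight_kg", lambda v: float(v) * 0.45359237),
    },
}


def pseudonymize(subject_id: str, secret_key: bytes) -> str:
    """Replace a subject ID with a keyed hash (HMAC-SHA256).

    The digest is truncated to 16 hex characters for readability only.
    """
    digest = hmac.new(secret_key, subject_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def harmonize(site: str, record: dict, secret_key: bytes) -> dict:
    """Map one site-local record onto CDE names; unmapped fields are dropped."""
    mapping = SITE_TO_CDE[site]
    # Prefix with the site so equal local IDs at different sites do not collide.
    out = {"subject": pseudonymize(f"{site}:{record['subject_id']}", secret_key)}
    for local_name, value in record.items():
        if local_name in mapping:
            cde_name, convert = mapping[local_name]
            out[cde_name] = convert(value)
    return out


if __name__ == "__main__":
    key = b"demo-key-not-for-production"
    rec_a = {"subject_id": "001", "age_yrs": "60", "wt_kg": "70"}
    rec_b = {"subject_id": "001", "AgeMonths": 720, "WeightLb": 154.0, "name": "Jane"}
    print(harmonize("site_a", rec_a, key))
    print(harmonize("site_b", rec_b, key))
```

Dropping every field that has no CDE mapping is a deliberately conservative choice here: a field such as `name` never reaches the shared record unless someone explicitly maps it.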

30 NCI TCGA

31 NCBI dbGaP

32 BIG DATA in Biomedical Research needs a re-usable, extensible, sharable and interoperable informatics infrastructure to enable and streamline collaboration and data sharing for translational research

33 Sustainable Biomedical Informatics Infrastructure

34 Integrated Biomedical Informatics System (IBIS) Supporting the life cycle of biomedical research From quality data collection, study management, to data repository enabling big data science

35 IBIS Supported Projects Parkinson Biomarker, Federal TBI DB, CNRM Repository, Alzheimer, eyeGENE, Rare Diseases, etc.

36 Neuro-Grid Data Cloud Informatics for Translational Discovery Stroke DR PD DR Alzheimer DR Neuro DR TBI DR

37 BIG DATA Trends in US Health Care? A collaboration among government, non-profit, and private sector organizations working to foster the availability and innovative use of data to improve health and health care.

38 Opportunities to BIG DATA Science Infrastructures, policies and incentives to promote data sharing Foster the development, dissemination, and effective use of computational tools for the analysis of datasets Sustainable infrastructure and funding Integration of molecular and clinical datasets for biomedical discovery Collaboration & interoperability (common data standards) Logistical and analytical challenges How to contextualize and comprehend? Data should not be integrated for the sake of integration to become BIG, but rather as a means to address specific biomedical questions and needs

39 BIG Data Project Guiding Principle (reverse thinking) Putting funding, technology, or other roadblocks aside, if we had such BIG Data now: 1). What can we do with it (benefit)? 2). What research studies can we perform? 3). What analysis can be done? 4). What can we learn to advance science? 5). What do we currently do without this big data?

40 Summary The future innovations of biomedical research will require a multidisciplinary team empowered by data analysis, collection, management, repository and mining technology. Sustainable informatics infrastructure enables researchers to manage their studies and data effectively, and to collaborate and share in order to accelerate biomedical discovery and catalyze translational innovations. Recent advances in big data will expand our knowledge for testing new hypotheses about disease management, from diagnosis to prevention to personalized treatment. The rise of big data, however, also raises challenges in terms of privacy, security, data ownership, data stewardship, and governance.

41 Q & A fann@mail.nih.gov