BIG DATA IN HEALTH CARE AND BIOMEDICAL RESEARCH. John Quackenbush Dana-Farber Cancer Institute Harvard School of Public Health

Size: px
Start display at page:

Download "BIG DATA IN HEALTH CARE AND BIOMEDICAL RESEARCH. John Quackenbush Dana-Farber Cancer Institute Harvard School of Public Health"

Transcription

1 BIG DATA IN HEALTH CARE AND BIOMEDICAL RESEARCH John Quackenbush Dana-Farber Cancer Institute Harvard School of Public Health

2 Background and Disclosures Professor of Biostatistics and Computational Biology, Dana-Farber Cancer Institute Professor of Computational Biology and Bioinformatics, Harvard School of Public Health Many other academic titles Numerous advisory boards Co-Founder of GenoSpace, a Precision Genomic Medicine Software Company

3 Why are you not in Brisbane?

4 Big Data

5 Every revolution in science from Copernican heliocentric model to the rise of statistical and quantum mechanics, from Darwin s theory of evolution and natural selection to the theory of the gene has been driven by one and only one thing: access to data. John Quackenbush

6 Every revolution in the history science has been driven by one and only one thing: access to data. John Quackenbush Twitter version, 115 characters with spaces, including -name

7 Driving Trends in Big Data

8 The Cloud Image from

9 The iphone (and other smart phones) Images from

10 New Sources of Health and Medical Data Drug Research Social Media Patient Records Genomics Test Results Claims Data Home Monitoring Mobile Apps

11 The Body as a Source of Big Data It is estimated that by 2015, the average US hospital will generate 665TB of data annually. Medical imaging archives are increasing by 20-40% annually. Today, 80% of data is unstructured such as images, video, and notes. Graphic from NetApp

12 Transforming Medicine? New technologies are providing unprecedented data that have opened new avenues of investigation, transforming biomedical research into an information science. The challenge is to bring this information together with other information to better address fundamental problems, including a wide range of problems in health and biomedical research. In patient care, we want to increase efficiency, improve outcomes, and decrease cost.

13 The Need for Data-Driven Precision Medicine

14 US Cancer Prevalence Estimates 2010 and 2020 # People (thousands) % Site change Breast Prostate Colorectal Melanoma Lymphoma Uterus Bladder Lung Kidney Leukemia All Sites 13,772 18, From: A.B. Mariotto et al. (2011) J. Nat. Cancer Inst. 103, 117

15 Estimates of U.S. National Expenditures for Cancer Care $124 billion and projected to rise to $207 billion (66% increase) by 2020 Ini. = within 1 year of Dx; Con = continuing; Last = last year From: A. B. Mariotto et al. (2011) J. Nat. Cancer Inst. 103, 117

16 Biomarkers, Disease Subtyping and Targeted Therapy Her-2+ (Herceptin) (Perjeta) EML4-ALK (Xalkori) K-ras (Erbitux) (Vectibix) BRAF-V600 (Zelboraf) Companion Diagnostics: the Right Rx for the Right Disease (Subtype). CFTR-G551 (Kalydeco) Precision Medicine: The use of genomic data (such as gene variants) with clinical information to arrive at a therapeutic protocol likely to produce a positive response.

17 Non-Responders to Oncology Therapeutics Are Highly Prevalent and Very Costly

18 Evidence: We Still Need Research Even within NCCN, certainly the majority of decision nodes that are enshrined in NCCN [guidelines] are not supported by high level evidence. Dr. Clifford Hudis President, ASCO Interview in Cancer Letter 22 Nov. 2013, 39 NCCN = National Cooperative Cancer Center Network

19 The Solution Must Be Driven by Data

20 We Need to Invest in Big Data Research

21 National Research Council on Big Data National Research Council s 2013 Frontiers of Massive Data Analysis report concluded: The challenges associated with massive data go far beyond the technical aspects of data. The key element in meeting Big Data s challenges is the development of rigorous quantitative and statistical methods. It is quite possible to turn data into something resembling knowledge when actually it is not. Overlooking this foundation may yield results that are, at best, not useful, or harmful at worst.

22 Key Challenges in Big Data Preprocessing (Normalization) and Hot Spot Detection Need methods to compare measurements across sources and to rapidly identify salient features Data Integration Need methods that can combine data from various sources where there are hidden correlations in the data Reproducible Research Need to leverage the volume and velocity of the data to provide opportunities for validation of findings Network Methods Need to move beyond correlations in studying relationships in data Data Access and Utility Need to assure that different constituencies have access to data in a form that meets their unique needs

23 What can we learn from networks? Normal Tissue Network Chemosensitive Tumor Chemoresistant Tumor

24 Modeling Gene Regulatory Networks

25 Integrative Network Inference: PANDA

26 Regulation of Transcription regulatory sequences promoter Specific transcription factors

27 A Simple Idea: Message Passing Transcription Factor The TF is Responsible for communicating with its Target Downstream Target The Target must be Available to respond to the TF Kimberly Glass, GC Yuan

28 Message-Passing Networks: PANDA TF-Motif Scan PPI 0 Network 0 Co-regulation 0 Responsibility Availability PPI 1 Network 1 Co-regulation 1 Glass et. al. Passing Messages Between Biological Networks to Refine Predicted Interactions. PLoS One May 31;8(5):e Code and related material available on sourceforge:

29 Subtypes of Ovarian Cancer

30 PANDA: Integrative Network Models Genes Condition s Network for Angiogenic Subtype Expression data (Angiogenic) Condition s Network for Non-angiogenic Subtype Compare/Identify Differences Genes Expression data (Non-angiogenic) Kimberly Glass, GC Yuan

31 Kimberly Glass, GC Yuan

32 Kimberly Glass, GC Yuan

33 Kimberly Glass, GC Yuan

34 Inner ring: key TFs Colored by Edge Enrichment (A or N) Outer ring: genes Colored by Differential Expression (A or N) Interring Connections Colored by Subnetwork (A or N) Ticks genes annotated to angiogenesis in GO,

35 Complex Regulatory Patterns Emerge TF1 TF2 sig. # Class ARID3A PRRX2 1.16E A+ ARID3A SOX5 1.01E A+ PRRX2 SOX5 3.83E A+ ARNT MZF1 5.83E N- AHR ARNT 6.13E N- ETS1 MZF1 9.08E N- Co-regulatory TF Pairs Kimberly Glass, GC Yuan

36 Regulatory Patterns suggest Therapies Kimberly Glass, GC Yuan

37 Beyond THE Network

38 We Normally Estimate Aggregate Networks Multiple Samples/Conditions Genes Network Reconstruction Gene Expression Data One Network Is there a way to reverse engineer each of the individual networks that contribute to the aggregate network?

39 Let s Make a Simple Assumption Assume that the edges estimated in an aggregate network are a linear combination of edges corresponding to the networks of the individual samples. We can build two aggregate networks, one with and one without a given sample. By subtracting these two equations we can then solve for the individual network s contribution. N ( r ) = s ( s) ˆ ij wseij s= 1 Aggregate network using all samples e ( r ') ˆ ij e N = s s q w' s e ( s) ij Aggregate network using all but one sample

40 Network Modeling with LIONESS (Linear Interpolation to Obtain Network Estimates for Single Samples) LIONESS Linear Equation: SAMPLES Learn red sample network by estimating two networks (1) using all tan samples, and (2) using all samples (red and tan). SS network can then be calculated using LIONESS equation. e ( q) ij = Edge in Single-sample Network N s scale (ˆ e ( r ) ij eˆ Difference Between Network Estimates ( r ') ij ) + eˆ ( r ') ij Estimate (contribution) from other samples Application Using PANDA Networks

41 A Quick Test Using Yeast Cell Cycle Data Data includes 48 total expression arrays taken over a timecourse (every 5 min) on synchronized yeast cells (~2 cell cycles) Includes technical replicates (Cy3/Cy5 and Cy5/Cy3) Estimate single-sample networks using data for each replicate. Correlation in Expression 0.8 Rep1 Replicate 1 (24 Samples 24 Networks) Rep2 Correlation in Networks 0.8 Rep1 Rep1 Rep2-0.8 Replicate 2 (24 Samples 24 Networks) Rep2 Rep1 Rep2 0.5

42 Validation and Insight

43 Solving the GWAS Puzzle

44 We identified 697 variants at genome-wide significance that together explained one-fifth of heritability for adult height. By testing different numbers of variants in independent studies, we show that the most strongly associated ~2,000, ~3,700 and ~9,500 SNPs explained ~21%, ~24% and ~29% of phenotypic variance.

45 The 97 loci account for 2.7% of BMI variation, and genome-wide estimates suggest that common variation accounts for 20% of BMI variation.

46 John Platig eqtl Analysis Use genome-wide data on genetic variants (SNPs = Single Nucleotide Polymorphisms) and gene expression data together Treat gene expression as a quantitative trait Ask, Which SNPs are correlated with the degree of gene expression? Most people concentrate on cis-acting SNPs What about trans-acting SNPs?

47 eqtl Networks: A simple idea eqtls should group together with core SNPs regulating particular cellular functions Perform a standard eqtl analysis: Y = β 0 + β 1 ADD + ε where Y is the quantitative trait and ADD is the allele dosage of a genotype. John Platig

48 Which SNPs affect function? Many strong eqtls are found near the target gene. But what about multiple SNPs that are correlated with multiple genes? SNPs Can a network of SNPgene associations inform the functional roles of these SNPs? Genes John Platig

49 Results: COPD John Platig

50 What about GWAS SNPs? John Platig

51 What about GWAS SNPs? The hubs are a GWAS desert! John Platig

52 What are the critical areas? Abraham Wald: Put the armor where the bullets aren t!

53 Network Structure Matters? Are disease SNPs skewed towards the top of my SNP list as ranked by the overall out degree? No! The highest-degree SNPs are devoid of disease-related SNPs Highly deleterious SNPs that affect many processes are probably removed by evolutionary sweeps. John Platig

54 Can we use this network to identify groups of SNPs and genes that play functional roles in the cell? Try clustering the nodes into communities based on the network structure John Platig

55 Communities are groups of highly intra-connected nodes Community structure algorithms group nodes such that the number of links within a community is higher than expected by chance Formally, they assign nodes to communities such that the modularity, Q, is optimized Fraction of network links in community i John Platig Fraction of links expected by chance Newman 2006 (PNAS)

56 Communities are groups of highly intra-connected nodes Community structure algorithms group nodes such that the number of links within a community is higher than expected by chance. Bipartite networks require a different null model John Platig Newman 2006 (PNAS)

57 Communities in COPD eqtl networks John Platig

58 Communities in COPD eqtl networks We identified 35 communities, with Q = 0.77 (out of 1) Of 35 communities, 13 are enriched for GO terms (P<5x10-4 ) John Platig

59 Communities in COPD eqtl networks - Microtubule organization - Cell cycle - Centrosome - Immune response - Stress response - T cell stimulation John Platig - Chromatin Assembly - DNA conformation change - Nucleosome

60 Calculate Local Connectivity Modularity of node i Modularity of community c John Platig

61 Network Structure Matters! Are disease SNPs skewed towards the top of my SNP list as ranked by the community core score (Q ic )? Yes! John Platig

62 Disease-associated SNPs significantly contribute to community structure Is the distribution of GWAS core scores stochastically larger than a randomly sub-sampled non-gwas distribution? The median core score for GWAS SNPs is 1.7 times higher than the median for the non-gwas SNPs

63 What have we learned? Data will drive the rise of precision medicine Networks provide unique insight into biological systems. The properties of the networks themselves afford a unique opportunity to discover key factors in the networks that may control disease processes. There are many unexplored areas that are ripe for investigation, many of which may provide singular insight into biological processes.

64 Big Data Alone Are Not a Panacea Spurious Correlations:

65 The future is here. It's just not widely distributed yet. - William Gibson

66 Before I came here I was confused about this subject. After listening to your lecture, I am still confused but at a higher level. - Enrico Fermi, ( )

67 Acknowledgments Gene Expression Team Fieda Abderazzaq Aedin Culhane Jessica Mar Renee Rubio Systems Support Stas Alekseev, Sys Admin University of Queensland Christine Wells Lizzy Mason Center for Cancer Computational Biology Fieda Abderazzaq Stas Alekseev Jalil Farid Nicole Flanagan Ed Harms Alex Holman Lev Kuznetsov Brian Lawney Kshithija Nagulapalli Antony Partensky John Quackenbush Renee Rubio Yaoyu E. Wang Students and Postdocs Martin Aryee Stefan Bentink Kimberly Glass Benjamin Haibe-Kains Marieke Kuijjer Kaveh Maghsoudi Jess Mar Melissa Merritt Megha Padi John Platig Alejandro Qiuiroz J. Fah Sathirapongsasuti Administrative Support Julianna Coraccio