Applications of Big Data in Evidence-Based Medicine Carolyn Compton, MD, PhD Professor Life Sciences, Arizona State University Professor Laboratory Medicine and Pathology, Mayo Clinic Adjunct Professor of Pathology, Johns Hopkins CMO, National Biomarker Development Alliance CMO, Complex Adaptive Systems Institute Scottsdale, AZ Financial Disclosure Unpaid board positions with travel reimbursements AJCC Indivumed GmbH CloudLims Roche/Ventana BBMRI PAIGE.AI, Inc HealthTell Inc Royalties: Up-To-Date Learning Objectives Explain the difference between big data and other data Describe the interplay between big data and artificial intelligence in enabling precision oncology Recognize the pearls in big biomedical data Recognize the perils in biomedical data, big and small Predict how big data will change oncology practice
Big Data COMPUTING: Noun extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations Precision Oncology is a Big Data Domain Exhibit #1: Sequencing a Genome Human genome (normal haploid) = 3.2 billion base pairs variable Whole Genome Analysis (60x coverage) 1,000 GB/sample* *Cancer genome is more complex. ** Each biopsy sample of a cancer may have a different genome. How Big Is Extremely Large? One genome = 1,000 gigabytes = 1 terabyte 235 TB = entire contents of the Library of Congress 1,0000 genomes = 1,000 terabytes (1,000 genomes) = 1 petabyte The 1 PB zettabyte = entire era storage (1,000,000,000,000,000,000,000 capacity (hard drive) of the human bytes) brain has arrived 2 PB = content of all US research libraries 50 PB = entire written works of humankind from the beginning of time Large Hadron collider produces 100 PB per day!
The Zettabyte Era WE ARE HERE The Big Data Dilemma: It s Getting Even Bigger Variety (multi-dimensional genomic, phenotypic, clinical data - complexity will increase) Velocity (sheer rate of data - generation now exceeds Moore s Law) Veracity Is it true or Value not? Is it useful or not? Volume (unprecedented amounts - it s still early) Adapted from Laney: Gartner 2001, 2012 NSF/NIH 2012 BIG DATA Cancer Is the Best Example of Enterprise-Wide, Large-Scale Data Production and Usage Clinical data Pathology data: Cancer morphological care has become and morphometric a 4-M effort: - Multi-expert Imaging and functional imaging data - Multi-modality Molecular data: genomic, - Multi-institutional transcriptomic, epigenomic and proteomic data - Multi-million Microbiomic data Health care environmental data
Mastery of Big Data Enables Big Opportunities for Cancer Medicine Health Care Itself Is Now a Big-Business, Data-Data Enterprise Clinical data, Laboratory data, Imaging data, Outcomes data, Billing data, Operational data, Quality management data, Personnel data, Compliance data, Revenue data, Costs data, Logistics data, Malpractice data, Insurance data, etc, etc, etc. But It s Complicated: Data Quality and Data Sharing Challenges The standards needed to assure data quality and reproducibility are lacking The incentives to change this have been lacking Require time, effort and resources for standards development and enforcement We have been misled into thinking that data volume can compensate for poor or unknown data quality The incentives to share data have been weak and the incentives to hoard data have MIT Technical Review been strong (e.g., IP and privacy)
The Reality: Biomedical Data Have Far Outstripped the Cognitive Capacity of Human Beings Precision medicine is completely dependent on artificial intelligence systems that integrate and interpret huge amounts of data to aid in decision-making. >2,000,000 proteins* 10 7 SNPs in HapMap 25,000 genes *AMA Science 2014 Stead W. Beyond Expert Based Practice. Presented at: Institute Of Medicine; October 8, 2007. Much of the World s Medical Data is Unstructured IBM s Watson: Processes natural language and uses cognitive computing to make decisions Instantly reads all medical literature and medical records, makes associations to answer questions Machine learning: constantly adapting to shifts in knowledge Jeopardy 16 February 2011 Utility for medical purposes is currently under study at major cancer centers Human Brain vs. Neuromorphic Computing Current Scorecard: Wetware vs. Hardware Human brain: Excels at complex tasks with low energy usage No programming needed 100 billion neurons; 100 trillion synapses Super computer brain (in evolution): IBM s Sequoia (1.5M networked processor chips) simulates network communication in human brain but is highly energy inefficient Would require 12 gigawatts to perform at brain speed 12 gigawatts = combined power consumption of LA and NYC
Human Brain vs. Neuromorphic Computing Current Scorecard: Wetware vs. Hardware IBM s TrueNorth neuromorphic computer, (transistors wired to form 1M digital neurons with 256M synapses ) accomplishes complex tasks like pattern recognition at high speed and low energy: brain on a chip Google s Deep Mind technology: goal create intelligence by combining machine learning and neuroscience to build algorithms for decision-making What Is Artificial Intelligence? Machine Learning? Deep Learning? Artificial Intelligence (AI) is intelligence exhibited by machines A machine that mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving What Is Artificial Intelligence? Machine Learning? Deep Learning? Machine learning (ML), a fundamental part of AI, is the use of computer algorithms built on statistical techniques to improve the machine s performance automatically through experiences using data Unsupervised: ability to find patterns in a stream of data Supervised: the ability to generalize from a specific set of training data to a new set of data in a reasonable way Deep learning (ML subset): use of neural networks to understand large amounts of data and train itself to perform tasks
What Is Artificial Intelligence? Machine Learning? Deep Learning? Artificial Intelligence Equals or Outperforms Humans in Most Domains The Big Risks in AI: Variation, Vulnerability, Volition Variation The AI community is fragmented and diverse: there is no medical AI community Programmers Engineers Data Scientists Roboticists No universally employed best practices: variation reigns Strong individualism in AI research: no culture of sharing and harmonization
The Big Risks in AI: Variation, Vulnerability, Volition Vulnerability Hacking System failures (network crashes) Volition Can AI autonomously replace human decision??? Will AI ignore us? What about ethics? OpenAI (AI to benefit humanity and extension of Human Wills Is this enough?) Artificial Intelligence Equals or Outperforms Humans in Most Domains The Complexity Challenge of Cancer Dilemma: Extravagant Genomic Alterations Mutations per megabase tumor DNA (3K megabases in human genome) Average 10 per megabase for lung cancer and melanoma Copy number alterations in solid tumors
Landscape of Extreme Genomic Heterogeneity in Lung Cancer Each column is a separate cancer Malignant snowflakes : each cancer carries multiple unique mutations and other genome perturbations (such as epigenomic changes) Disturbing implications for therapeutic cure and development of new Rx Phylogenetic Profiles of Intratumoral Clonal Heterogeneity in 11 Lung Cancers: DIFFERENT CLONES IN SAME CANCER Trunk (Blue): Mutations present in all regions Branch (Green): Mutations present in some regions Private (Red): Mutations present in only one region J. Zhang et al. (2014) Science 346; 256 Malignant Snowflakes
Medicinal Complexity: Integrating the Complex Layers of Biology and Evolution as Disease Progresses Genomics Proteomics Molecular Pathways and Networks Network Regulatory Mechanisms ID of Causal Relationships Between Network Perturbations and Disease Patient-Specific Signals and Signatures of Disease or Predisposition to Disease Conclusion? Imperatives Ahead! - OpenMind May 2017 We can t do this without machines We must embrace this: learn to use machines to extend human capability We must control it We must build safety strong provisions Machines can t do this without data We must assure that the data are of high quality We must share the data: build a culture of data sharing The data are astronomically voluminous and complex This is just the beginning: more data flavors are on the way We MUST do this: patients are counting on us Applications of Big Data in Evidence-Based Medicine Carolyn Compton, MD, PhD Professor Life Sciences, Arizona State University Professor Laboratory Medicine and Pathology, Mayo Clinic Adjunct Professor of Pathology, Johns Hopkins CMO, National Biomarker Development Alliance CMO, Complex Adaptive Systems Institute Scottsdale, AZ