Data Intensive Scientific Discovery
Vijay Chandru
Hon. Professor, NIAS
Chairman, Strand Life Sciences
chandru@alum.mit.edu
The Promise
Peta (10^15) and Exa (10^18) scale Computing
  Astrophysics (Large Synoptic Survey Telescope)
  Materials Science (Nanoscale Chemistry & Physics)
  Earth Science (data assimilation for ocean, carbon cycle, etc.)
  Energy Assurance (combustion, power grids, fusion physics)
  Fundamental Science (Accelerator Physics, RHIC, LHC)
  Biology & Medicine (1000 Genomes, Real-time Biology)
  National Security (Cybersecurity, Weapons Simulations)
  Engineering Design (Communication Networks)
The Challenge
"There is a crisis in all sciences these days. We are drowning in a sea of data, and yet we are thirsty."
- Sydney Brenner, at IISc, 2008
  The IT Challenge: Storage, Computing
  The Computer Science Challenge: Algorithm design and implementation
  The Mathematics Challenge: Statistical Analysis, Systems Theory
  The Multi-Disciplinarity Challenge: Contextual problem solving
The Mathematics Challenges
  Visualization
  Statistics and Optimization
  Uncertainty Quantification
  Mumford - persistence
  Models: Statistical, Ab Initio, Simulation
The Algorithms Challenges
  Visualization
  Scalability
  Machine Learning
  Network and Graph Analysis
  Analysis of Streaming Data
  Text Mining
  Distributed Data Architectures
  Data and Dimension Reduction
Cultural Challenges
  Mathematicians and Applications Research Communities
  Computer Scientists as Intellectual Partners, not Technicians
  Problem-driven, directed funding forcing multi-disciplinary collaborations
At the end of the last millennium
"Today, the most successful craft industries are concerned with software and biotechnology."
- Freeman Dyson, The Sun, The Genome, The Internet: Tools of scientific revolutions, 1999
"Biology should keep Computer Scientists busy for at least 50 years."
- Donald Knuth, Vision for the 21st Century, 1999
"In 50 years people will assume that computers and computing were actually developed for biology."
- Buzz at Yorktown Heights, 1999
"There is a crisis in all sciences these days. We are drowning in a sea of data, and yet we are thirsty."
- Sydney Brenner, at IISc
One NGS run generates 3x the sequence data generated during the Human Genome Project over 13 years.
Size of data from 1 run: 1 TB (current), 5 TB (by 2010)
Data from these centers need to be acquired, analyzed, interpreted, viewed, managed, stored, compared and shared effectively & securely.
Genomics Data Deluge
Growth in number of bases deposited in EMBL (1982-2009)
The size in data volume and nucleotide numbers on EMBL, trace archive & SRA
The Genomes OnLine Database
Instrument currently in use, one human genome (30x coverage):
  raw data: ~90Gb; 1 billion 75-100bp raw reads
  intermediate data: 120-130Gb
  tertiary data: ~10Gb
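The raw-data figures for one genome can be sanity-checked with a short calculation. This is a minimal sketch, not a description of any instrument's software: the 90 bp read length is simply a value picked from the 75-100bp range on the slide, and the genome size of about 3.1 billion bases is an assumption that does not come from the slide.

```python
def sequencing_coverage(num_reads, read_length_bp, genome_size_bp):
    """Return (total sequenced bases, average depth of coverage)."""
    total_bases = num_reads * read_length_bp
    return total_bases, total_bases / genome_size_bp


# Assumption: a haploid human genome of roughly 3.1 billion bases.
HUMAN_GENOME_BP = 3_100_000_000

# 1 billion reads (from the slide) at an assumed 90 bp per read.
total, depth = sequencing_coverage(1_000_000_000, 90, HUMAN_GENOME_BP)
print(f"{total / 1e9:.0f} gigabases, about {depth:.0f}x coverage")
```

Under these assumptions the total is 90 gigabases at a depth just under 30x, which agrees with the slide's 30x and ~90Gb figures if Gb is read as gigabases.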
Next Gen Sequencing analysis
Reads up close
NGS challenges
  One sample could have a billion reads
  Align them against the reference (a few days)
  Analyse for SNP patterns
  Repeat the analysis for multiple disease and normal samples
  Statistically determine which SNPs are correlated with the disease
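One common way to approach the last step is a per-SNP test on a 2x2 table of allele counts. The sketch below uses a Pearson chi-square test with a Bonferroni-corrected threshold; the slide does not say which test was used, so this choice is an assumption, and the SNP names and counts are invented for illustration.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom, no continuity
    correction) for the table [[a, b], [c, d]].
    Returns (statistic, p_value)."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one count")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical allele counts per SNP:
# (disease variant, disease reference, normal variant, normal reference)
snp_counts = {
    "snp_1": (60, 40, 30, 70),
    "snp_2": (50, 50, 48, 52),
}

# Bonferroni correction: divide the significance level by the number of tests.
threshold = 0.05 / len(snp_counts)
significant = [
    name for name, counts in snp_counts.items()
    if chi_square_2x2(*counts)[1] < threshold
]
print(significant)
```

With a billion reads and many samples, the number of SNPs tested is large, so some correction for multiple testing is needed; Bonferroni is the simplest option, not necessarily the best one.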
Central Dogma of Biology
Transcription factors are proteins that bind to the DNA and trigger this sequence.
Control of a gene
Copyright 2002, Bruce Alberts, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts, and Peter Walter; Copyright 1983, 1989, 1994, Bruce Alberts, Dennis Bray, Julian Lewis, Martin Raff, Keith Roberts, and James D. Watson
Self-protection
Heat protection
Pockley, G. (2001) Heat shock proteins in health and disease, Expert Reviews in Molecular Medicine. Cambridge University Press.
Gene expression at various stages
A gene regulatory network armature for T lymphocyte specification, PNAS, December 23, 2008, vol. 105, no. 51, 20100-20105
Next Gen Sequencing (NGS)
ChIP-Seq:
  Each experiment is for one regulatory protein, x
  Analysis output is the list of DNA regions to which the protein x binds
  Can hypothesize that the genes in these regions are regulated in some manner by x
RNA-Seq:
  Determines the expression levels of all genes in the sample
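Going from the ChIP-Seq output (a list of bound regions) to candidate target genes can be sketched as a proximity lookup. This is an illustrative sketch only: the `Peak` and `Gene` types, the 1000-base window, and all coordinates and gene names are hypothetical, and real pipelines use more careful assignment rules.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int
    end: int


class Gene(NamedTuple):
    name: str
    chrom: str
    tss: int  # transcription start site


def candidate_targets(peaks, genes, window=1000):
    """Return names of genes whose TSS lies within `window` bases of a peak."""
    hits = []
    for gene in genes:
        for peak in peaks:
            if (peak.chrom == gene.chrom
                    and peak.start - window <= gene.tss <= peak.end + window):
                hits.append(gene.name)
                break  # one nearby peak is enough to flag the gene
    return hits


# Hypothetical binding regions for protein x and hypothetical gene positions.
peaks = [Peak("chr1", 10_000, 10_200), Peak("chr2", 5_000, 5_300)]
genes = [
    Gene("gene_a", "chr1", 10_900),
    Gene("gene_b", "chr1", 50_000),
    Gene("gene_c", "chr2", 4_200),
]
targets = candidate_targets(peaks, genes)
print(targets)
```

The result is only a hypothesis list: binding near a gene suggests regulation by x but does not establish it, which is why the RNA-Seq expression levels are needed as well.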
Interpreting ChIP-Seq and RNA-Seq together
            Heat on                               Heat off
ChIP-Seq    HSTFs bind near heat shock genes      CHBF binds near heat shock genes
RNA-Seq     Heat shock protein levels are up      Heat shock protein levels are down
When heat is on, HSTFs upregulate expression of heat shock proteins
When heat is off, CHBF suppresses expression of heat shock proteins
Need 4 ChIP-Seq experiments (num conditions x num TFs) and 2 RNA-Seq experiments (num conditions) to reach this conclusion
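The reasoning on this slide can be written out as a small lookup. A minimal sketch: the two "not bound" entries are assumptions implied by the slide rather than stated on it, and the `infer_role` helper and its labels are hypothetical names chosen for illustration.

```python
# (TF, condition) -> does the TF bind near the heat shock genes?
# One entry per ChIP-Seq experiment: num conditions x num TFs = 4.
binding = {
    ("HSTF", "heat_on"): True,
    ("HSTF", "heat_off"): False,  # assumed, not stated on the slide
    ("CHBF", "heat_on"): False,   # assumed, not stated on the slide
    ("CHBF", "heat_off"): True,
}

# condition -> heat shock protein level; one RNA-Seq experiment per condition.
expression = {"heat_on": "up", "heat_off": "down"}


def infer_role(tf, binding, expression):
    """Label a TF by the expression state seen whenever it is bound."""
    states = {
        expression[condition]
        for (name, condition), bound in binding.items()
        if name == tf and bound
    }
    if states == {"up"}:
        return "candidate activator"
    if states == {"down"}:
        return "candidate repressor"
    return "unclear"


tfs = sorted({name for name, _ in binding})
roles = {tf: infer_role(tf, binding, expression) for tf in tfs}
print(len(binding), "ChIP-Seq experiments,", len(expression), "RNA-Seq experiments")
print(roles)
```

The labels are deliberately hedged as "candidate": co-occurrence of binding and an expression change across two conditions is suggestive, not proof of regulation.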
Grand Idea
  For a particular condition, a ChIP-Seq experiment for TF X tells us which genes X could affect
  Conduct ChIP-Seq experiments for all TFs to know exactly which combination of TFs binds upstream of each gene
  Conduct an RNA-Seq experiment to determine the expression level of each gene
  Repeat for all conditions
  Now we know that under condition C, proteins X, Y, Z were bound upstream of gene G with expression E
  Solving all these equations will give an idea of the regulatory network across the range of conditions
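One way to read "solving all these equations" is as a least-squares fit: each observation (condition C, gene G) gives one equation relating which TFs were bound to the measured expression E. The sketch below assumes a simple additive linear model, which the slide does not specify, and uses invented observations; `solve_linear` and `fit_weights` are hypothetical helper names.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("system is singular: more conditions are needed")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_weights(rows, expression):
    """Least-squares weights via the normal equations (A^T A) w = A^T y."""
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, expression)) for i in range(k)]
    return solve_linear(ata, aty)


# Hypothetical observations. Columns: [baseline, X bound?, Y bound?]
rows = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]]
expression = [1.0, 3.0, 0.5, 2.5]  # invented expression levels E

baseline, effect_x, effect_y = fit_weights(rows, expression)
print(baseline, effect_x, effect_y)
```

In this toy data a positive weight marks X as raising expression and a negative weight marks Y as lowering it. Real regulation is combinatorial and nonlinear, so a linear fit is at best a first approximation.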
Biomarker Collaboration with IISc - Breast cancer
Goal: Breast cancer marker discovery program
  Kidwai Memorial Institute of Oncology: Patient samples & Histopathology
  Indian Institute of Science: RNA preps and Microarrays
  Strand Life Sciences: Data analysis (Putative markers)
Pathway-based analysis of known cancer targets reveals consistent up-regulation of therapy targets across multiple datasets in a rare subclass of triple negative breast cancer.
Results have been confirmed in 80 breast cancer patients of the Indian cohort.
Ongoing: testing hypotheses about pathway combination therapies to inactivate a pathway, instead of individual targets.
Triple Negative vs Rest
[Figure: signalling pathway diagram. Labels recoverable from the slide: ERBBs, PLCx, PLCxx, JAK1, JAK2, STAT3, STAT5, "Cross-talk?". Outcomes shown: Receptor degradation, Transformation, Differentiation, Apoptosis, Proliferation, Tumor survival, Cell proliferation, Oncogenesis.]
A global In Vivo Drosophila RNAi Screen Identifies NOT3 as a Conserved Regulator of Heart Function, Cell, April 2010
  Drosophila RNAi screen data
  Human Ortholog analysis
  Mouse
  Gene Ontology, KEGG, GSEA
  Find first degree neighbors and build connected network
  Heart Systems Map
Systems map of Cardiac function Find first degree neighbors and build connected network
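The "find first degree neighbors and build connected network" step can be sketched as a simple graph operation: start from the screen hits, pull in every gene that interacts directly with a hit, and keep the interactions among that set. The gene names and interaction list below are entirely hypothetical, and `first_degree_network` is an illustrative helper, not the method used in the cited study.

```python
def first_degree_network(seeds, interactions):
    """Return (nodes, edges) for the seeds plus their direct neighbours.

    `interactions` is an iterable of undirected (gene, gene) pairs.
    """
    seeds = set(seeds)
    neighbours = set()
    for a, b in interactions:
        if a in seeds:
            neighbours.add(b)
        if b in seeds:
            neighbours.add(a)
    nodes = seeds | neighbours
    # Keep only interactions whose two ends are both in the selected set.
    edges = {
        tuple(sorted((a, b)))
        for a, b in interactions
        if a in nodes and b in nodes
    }
    return nodes, edges


# Hypothetical screen hits and a hypothetical interaction list.
hits = ["hit_1", "hit_2"]
interactions = [
    ("hit_1", "gene_p"),
    ("gene_p", "hit_2"),
    ("gene_p", "gene_q"),  # gene_q is two steps from any hit: excluded
    ("gene_q", "gene_r"),
]
nodes, edges = first_degree_network(hits, interactions)
print(sorted(nodes))
print(sorted(edges))
```

Here the two hits end up connected through their shared neighbour, which is the kind of structure a systems map is meant to expose.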
Introducing Scientific Intelligence
  Business Intelligence: Put results in business context
  Scientific Context: Put results in scientific context
  Scientific Visualization: Analyze & visualize vast amounts of data
  Systems Modeling: Create mathematical models
Scientific Intelligence: Application of data integration, analysis and visualization, scientific context and modeling to effectively mine large amounts of data, from varied sources, and convert it to usable knowledge, insight and decisions
The Power of Scientific Intelligence in
  Genomics
  Proteomics
  Next Generation Sequencing
  Tox/ADME
  Clinical Decisions
  Microscopy
AVADIS: The Scientific Intelligence Platform
The AVADIS platform is a rich development platform for the management, analysis and visualization of complex scientific data
  Written in Java with Jython scripting capabilities
  Produces rich, interactive environments for data exploration
  Optimized for tackling life science-specific problems
The AVADIS Platform