Data Intensive Scientific Discovery Vijay Chandru

Data Intensive Scientific Discovery
Vijay Chandru
Hon. Professor, NIAS
Chairman, Strand Life Sciences
chandru@alum.mit.edu

The Promise
Peta (10^15) and Exa (10^18) scale Computing
Astrophysics (Large Synoptic Survey Telescope)
Materials Science (Nanoscale Chemistry & Physics)
Earth Science (data assimilation for ocean, carbon cycle, etc.)
Energy Assurance (combustion, power grids, fusion physics)
Fundamental Science (Accelerator Physics, RHIC, LHC)
Biology & Medicine (1000 Genomes, Real-time Biology)
National Security (Cybersecurity, Weapons Simulations)
Engineering Design (Communication Networks)

The Challenge
"There is a crisis in all sciences these days. We are drowning in a sea of data, and yet we are thirsty." - Sydney Brenner, at IISc, 2008
The IT Challenge: Storage, Computing
The Computer Science Challenge: Algorithm design and implementation
The Mathematics Challenge: Statistical Analysis, Systems Theory
The Multi-Disciplinarity Challenge: Contextual problem solving

The Mathematics Challenges
Visualization
Statistics and Optimization
Uncertainty Quantification
Mumford - persistence
Models: Statistical, Ab Initio, Simulation

The Algorithms Challenges
Visualization
Scalability
Machine Learning
Network and Graph Analysis
Analysis of Streaming Data
Text Mining
Distributed Data Architectures
Data and Dimension Reduction

Cultural Challenges
Mathematicians and Applications Research Communities
Computer Scientists as Intellectual Partners, not Technicians
Problem-driven, directed funding forcing multi-disciplinary collaborations

At the end of the last millennium
"Today, the most successful craft industries are concerned with software and biotechnology." - Freeman Dyson, The Sun, The Genome, The Internet: Tools of scientific revolutions, 1999
"Biology should keep Computer Scientists busy for at least 50 years." - Donald Knuth, Vision for the 21st Century, 1999
"In 50 years people will assume that computers and computing were actually developed for biology." - Buzz at Yorktown Heights, 1999

"There is a crisis in all sciences these days. We are drowning in a sea of data, and yet we are thirsty." - Sydney Brenner, at IISc
One NGS run generates 3x the sequence data generated during the Human Genome Project over 13 years.
Size of data from 1 run: 1 TB (current), 5 TB (by 2010)
Data from these centers need to be acquired, analyzed, interpreted, viewed, managed, stored, compared and shared effectively & securely.

Genomics Data Deluge
Growth in number of bases deposited in EMBL (1982-2009)
The size in data volume and nucleotide numbers on EMBL, trace archive & SRA
The Genomes OnLine Database
Instrument currently using:
One human genome (30x cov): raw data: ~90Gb; 1 Billion 75-100bp raw reads; intermediate data: 120-130Gb; tertiary data: ~10Gb
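The read count and the 30x coverage figure on this slide can be cross-checked with a line of arithmetic: average coverage is total sequenced bases divided by genome length. The sketch below assumes a human genome of roughly 3.1 billion bases; that value and the function name are not from the slides.

```python
# Assumption (not from the slides): haploid human genome of about 3.1 billion bases.
HUMAN_GENOME_BP = 3.1e9


def sequencing_depth(n_reads, mean_read_length_bp, genome_bp=HUMAN_GENOME_BP):
    """Average coverage = total sequenced bases / genome length."""
    return n_reads * mean_read_length_bp / genome_bp


# Slide figures: 1 billion raw reads of 75-100 bp each.
low = sequencing_depth(1e9, 75)
high = sequencing_depth(1e9, 100)
print(f"coverage between {low:.1f}x and {high:.1f}x")
```

Under these assumptions the 30x figure falls inside the computed range. One billion reads of about 90 bp also give about 90 billion bases, which matches the "~90Gb" raw-data figure only if that figure is read as gigabases; the slide does not say which unit is meant.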

Next Gen Sequencing analysis

Reads up close

NGS challenges
One sample could have a billion reads
Align them against the reference (a few days)
Analyse for SNP patterns
Repeat the analysis for multiple disease and normal samples
Statistically determine which SNPs are correlated with the disease
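The last step, deciding which SNPs are correlated with the disease, can be illustrated with a per-SNP test on a 2x2 table of allele counts. The slides do not name a test; the sketch below uses a two-sided Fisher exact test with a Bonferroni correction as one plausible choice. All SNP names and counts are invented for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    total = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are at least as unlikely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


def associated_snps(counts, alpha=0.05):
    """counts: SNP id -> (disease_alt, disease_ref, normal_alt, normal_ref)."""
    if not counts:
        return {}
    threshold = alpha / len(counts)  # Bonferroni correction (an assumption)
    hits = {}
    for snp, table in counts.items():
        p = fisher_exact_two_sided(*table)
        if p <= threshold:
            hits[snp] = p
    return hits


# Hypothetical allele counts, invented for illustration only.
example = {
    "snp_A": (40, 10, 12, 38),  # alternate allele enriched in disease samples
    "snp_B": (25, 25, 24, 26),  # no visible difference
}
hits = associated_snps(example)
print(sorted(hits))
```

With millions of SNPs per genome the correction for multiple testing becomes the dominant concern, which is one reason the slide stresses the need for many disease and normal samples.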

Central Dogma of Biology
Transcription factors are proteins that bind to the DNA and trigger this sequence.

Control of a gene
Copyright 2002, Bruce Alberts, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts, and Peter Walter; Copyright 1983, 1989, 1994, Bruce Alberts, Dennis Bray, Julian Lewis, Martin Raff, Keith Roberts, and James D. Watson

Self-protection

Heat protection
Pockley, G. (2001) Heat shock proteins in health and disease, Expert Reviews in Molecular Medicine. Cambridge University Press.

Gene expression at various stages
A gene regulatory network armature for T lymphocyte specification, PNAS December 23, 2008 vol. 105 no. 51 20100-20105

Next Gen Sequencing (NGS)
ChIP-Seq:
Each experiment is for one regulatory protein, x.
Analysis output is the list of DNA regions to which the protein x binds.
Can hypothesize that the genes in these regions are regulated in some manner by x.
RNA-Seq:
Determines the expression levels of all genes in the sample.
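Going from "DNA regions to which x binds" to "genes in these regions" is a simple interval lookup. The sketch below reports a gene when its transcription start site (TSS) lies within a fixed window of a binding region. The 1000 bp window, the function name, and all coordinates are assumptions for illustration, not values from the slides.

```python
def genes_near_peaks(peaks, tss_by_gene, window=1000):
    """Return genes whose TSS lies within `window` bp of any binding region.

    peaks: list of (chrom, start, end)
    tss_by_gene: gene name -> (chrom, tss_position)
    """
    hits = set()
    for gene, (gene_chrom, tss) in tss_by_gene.items():
        for chrom, start, end in peaks:
            if chrom == gene_chrom and start - window <= tss <= end + window:
                hits.add(gene)
                break  # one nearby region is enough to flag the gene
    return sorted(hits)


# Hypothetical binding regions for protein x and hypothetical gene positions.
peaks_for_x = [("chr1", 5000, 5400), ("chr2", 900, 1200)]
tss = {
    "geneA": ("chr1", 6000),  # 600 bp past the first region: within the window
    "geneB": ("chr1", 9000),  # too far from any region
    "geneC": ("chr2", 1000),  # inside the second region
}
candidate_targets = genes_near_peaks(peaks_for_x, tss)
print(candidate_targets)
```

The output is only a list of candidates. As the slide says, binding near a gene supports a hypothesis of regulation; it does not establish it.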

Interpreting ChIP-Seq and RNA-Seq together
            Heat on                               Heat off
ChIP-Seq    HSTFs bind near heat shock genes      CHBF binds near heat shock genes
RNA-Seq     Heat shock protein levels are up      Heat shock protein levels are down
When heat is on, HSTFs upregulate expression of heat shock proteins.
When heat is off, CHBF suppresses expression of heat shock proteins.
Need 4 ChIP-Seq experiments (num conditions x num TFs) and 2 RNA-Seq experiments (num conditions) to reach this conclusion.
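The experiment count on this slide follows directly from the two formulas in parentheses. A minimal sketch, with a hypothetical function name, makes the growth explicit:

```python
def experiments_needed(n_conditions, n_tfs):
    """Return (ChIP-Seq runs, RNA-Seq runs) for a full conditions x TFs design."""
    chip_seq = n_conditions * n_tfs  # one ChIP-Seq run per TF per condition
    rna_seq = n_conditions           # one RNA-Seq run per condition
    return chip_seq, rna_seq


# The heat shock example: 2 conditions (heat on, heat off), 2 TFs (HSTF, CHBF).
print(experiments_needed(2, 2))

# A hypothetical larger study, to show how quickly the ChIP-Seq count grows.
print(experiments_needed(10, 100))
```

The ChIP-Seq term grows with the product of conditions and factors, so it dominates the cost of any study that tries to cover many transcription factors.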

Grand Idea
For a particular condition, a ChIP-Seq experiment for TF X tells us which genes X could affect.
Conduct ChIP-Seq experiments for all TFs to know exactly which combination of TFs binds ahead of which gene.
Conduct an RNA-Seq experiment to determine the expression level of each gene.
Repeat for all conditions.
Now we know that under condition C, proteins X, Y, Z were bound upstream of gene G with expression E.
Solving all these equations will give an idea of the regulatory network across the range of conditions.
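The slides do not say how "these equations" would be solved. As a deliberately simple stand-in, the sketch below takes the observations for one gene (which TFs were bound, and the measured expression, per condition) and contrasts mean expression with and without each TF bound. This is a marginal comparison, not a solution of a joint system. The function name and expression values are hypothetical; the TF names come from the heat shock example.

```python
from statistics import mean


def binding_effects(observations):
    """Contrast expression with and without each TF bound, for one gene.

    observations: list of (set_of_bound_tfs, expression_level), one per condition.
    Returns TF -> mean(expression when bound) - mean(expression when not bound).
    TFs bound in every condition, or in none, are skipped: no contrast exists.
    """
    tfs = set().union(*(bound for bound, _ in observations))
    effects = {}
    for tf in sorted(tfs):
        with_tf = [expr for bound, expr in observations if tf in bound]
        without_tf = [expr for bound, expr in observations if tf not in bound]
        if with_tf and without_tf:
            effects[tf] = mean(with_tf) - mean(without_tf)
    return effects


# Hypothetical expression values for one heat shock gene.
heat_shock_gene = [
    ({"HSTF"}, 9.0),  # heat on: HSTF bound, expression up
    ({"CHBF"}, 1.5),  # heat off: CHBF bound, expression down
]
effects = binding_effects(heat_shock_gene)
print(effects)
```

A positive value suggests an activator and a negative value a repressor. With only two conditions the two effects cannot be separated from each other, which is why the slide calls for repeating the experiments across many conditions.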

Biomarker Collaboration with IISc - Breast cancer
Goal: Breast cancer marker discovery program
Kidwai Memorial Institute of Oncology: Patient samples & Histopathology
Indian Institute of Science: RNA preps and Microarrays
Strand Life Sciences: Data analysis (Putative markers)
Pathway-based analysis of known cancer targets reveals consistent up-regulation of therapy targets across multiple datasets, in a rare subclass of triple negative breast cancer.
Results have been confirmed in 80 breast cancer patients of the Indian cohort.
Ongoing: testing hypotheses about pathway combination therapies to inactivate a pathway, instead of individual targets.

ERBBs: Triple Negative vs Rest
[Pathway diagram. Nodes: ERBBs, PLCx, PLCxx, JAK1, JAK2, STAT3, STAT5 (cross-talk?). Outcomes: Receptor degradation, Transformation, Differentiation, Apoptosis, Proliferation, Tumor survival, Cell proliferation, Oncogenesis.]

A Global In Vivo Drosophila RNAi Screen Identifies NOT3 as a Conserved Regulator of Heart Function, Cell, April 2010
Drosophila RNAi screen data
Human Ortholog analysis Mouse
Gene Ontology, KEGG, GSEA
Find first degree neighbors and build connected network
Heart Systems Map

Systems map of Cardiac function
Find first degree neighbors and build connected network
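"Find first degree neighbors and build connected network" is a standard graph operation. The sketch below shows one reading of it: add every direct interaction partner of the screen hits, keep only the interactions among the resulting genes, and report the connected pieces. The interaction list and gene names are placeholders, and the slides do not specify the interaction database or the exact procedure used.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Undirected adjacency map from a list of (gene, gene) interactions."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def expand_with_first_neighbors(hits, adjacency):
    """Screen hits plus every gene that interacts directly with a hit."""
    expanded = set(hits)
    for gene in hits:
        expanded |= adjacency.get(gene, set())
    return expanded


def connected_components(nodes, adjacency):
    """Connected components of the subgraph induced on `nodes`, largest first."""
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:  # breadth-first search restricted to `nodes`
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency.get(node, set()) & nodes:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Placeholder interaction data; not taken from any real database.
edges = [("hit1", "n1"), ("hit2", "n1"), ("n1", "n2"),
         ("hit3", "n3"), ("n3", "n4")]
adjacency = build_adjacency(edges)
network_nodes = expand_with_first_neighbors({"hit1", "hit2", "hit3"}, adjacency)
components = connected_components(network_nodes, adjacency)
print(components[0])
```

In this toy example hit1 and hit2 end up in one component only because they share the neighbor n1. That is the point of the expansion step: hits with no direct interaction can still be linked through a common partner.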

Introducing Scientific Intelligence
Business Intelligence: Put results in business context
Scientific Context: Put results in scientific context
Scientific Visualization: Analyze & visualize vast amounts of data
Systems Modeling: Create mathematical models
Scientific Intelligence: Application of data integration, analysis and visualization, scientific context and modeling to effectively mine large amounts of data, from varied sources, and convert it to usable knowledge, insight and decisions

The Power of Scientific Intelligence in
Genomics
Proteomics
Next Generation Sequencing
Tox/ADME
Clinical Decisions
Microscopy

AVADIS: The Scientific Intelligence Platform
The AVADIS platform is a rich development platform for the management, analysis and visualization of complex scientific data.
Written in Java with Jython scripting capabilities
Produces rich, interactive environments for data exploration
Optimized for tackling life science-specific problems