Gene Expression Data Analysis

Gene Expression Data Analysis Bing Zhang Department of Biomedical Informatics Vanderbilt University bing.zhang@vanderbilt.edu BMIF 310, Fall 2009

Gene expression technologies (summary)
- Hybridization-based approaches
  - Printed arrays
    - cDNA arrays: customizable, high array variation
  - Synthesized oligo arrays
    - Affymetrix arrays: high density, low array variation
      - Classic arrays: probes on 3' UTR
      - Exon arrays: probes on all known exons
      - Tiling arrays: probes spread across the genomic sequence
- Sequencing-based approaches
  - Traditional Sanger sequencing-based approaches
    - Serial analysis of gene expression (SAGE): ~10 bp tag at the 3' end
  - 2nd-generation sequencing-based approaches
    - RNA-Seq: high-throughput unbiased profiling

Bioinformatics tasks
[Diagram: Biological question; Experiment design; Microarray experiment; Image analysis; Normalization; Data mining; Experimental verification; Data storage; Data integration; Data visualization; Differential expression; Clustering; Classification; Network analysis; Biological interpretation; Hypothesis]

Well begun is half done
- A clearly defined biological question
- Good control of potential sources of variation (biological and technical)
- Statistically sound microarray experimental arrangement (replicates)
- Compliance with the standard of microarray information collection (MIAME): http://www.mged.org/workgroups/miame/miame.html

Image analysis
Analysis of the image of the scanned array in order to extract an intensity for each spot or feature on the array.
- Gridding: align a grid to the spots
- Segmentation: identify the shape of each spot
- Intensity extraction: extract the intensity for each spot and, potentially, for its surrounding background
- Background correction: subtract the background signal from the spot intensity to get a more accurate estimate of the biological signal from the spot

Garbage in, garbage out
- Remove bad arrays
- Remove poor-quality spots
- Remove data points with a low signal/noise ratio
- Remove data points with too many missing values
[Figure: example of a bad array]

Normalization
The purpose of normalization is to remove systematic variation in a microarray experiment that affects the measured gene expression levels.
- Sources of systematic variation
  - Unequal quantities of starting RNA
  - Differences in labelling and detection efficiencies
  - Topographical slide variation
  - Scanner-introduced bias

Normalization methods
- Global normalization: multiply each array by a constant to make the mean (median) intensity the same for each individual array
- Quantile normalization: match the percentiles of each array
- Adjust using a nonlinear smoothing curve
- Adjust the arrays using some control or housekeeping genes that you would expect to have the same intensity level across all of the samples
- Adjust using spike-in controls
[Figure panels: No normalization; Global normalization; Quantile normalization]
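
The first two strategies above can be sketched in a few lines of plain Python. This is a minimal illustration, not the method of any particular package: the function names, the toy intensities, and the choice of the mean of the array medians as the common target are all assumptions made for the example, and ties are not given special treatment.

```python
from statistics import median

def global_normalize(arrays):
    """Scale each array by a constant so all arrays share one median."""
    medians = [median(a) for a in arrays]
    # Assumption: the common target is the mean of the per-array medians.
    target = sum(medians) / len(medians)
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]

def quantile_normalize(arrays):
    """Give every array the same distribution (mean of the sorted arrays)."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean intensity across arrays at each rank.
    reference = [sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # indices, lowest first
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = reference[rank]
        result.append(normalized)
    return result

# Toy data: two arrays (samples) with three genes each.
arrays = [[2.0, 4.0, 6.0], [10.0, 30.0, 14.0]]
global_result = global_normalize(arrays)
quantile_result = quantile_normalize(arrays)
print(global_result)
print(quantile_result)
```

Global normalization only rescales each array, so differences in the shape of the intensity distribution remain; quantile normalization forces the distributions to be identical while preserving the rank of each gene within its array.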

Get to know your data matrix
Rows are genes, columns are samples.

ID         Samp 1   Samp 2   Samp 3   ...   Samp m-1   Samp m
Gene 1       5.25     6.37     7.30   ...       6.02     7.17
Gene 2       6.96     5.01     7.23   ...       5.87     5.02
Gene 3       5.44     5.67     4.23   ...       5.33     6.34
Gene 4      12.83    10.35    12.56   ...       9.98    11.13
Gene 5       3.20     3.07     3.19   ...       3.27     3.16
Gene 6       7.74     7.66     7.12   ...       7.46     7.95
...
Gene n       6.06     6.04     6.35   ...       6.44     6.60
Gene n-1     8.92     8.52     7.62   ...       7.90     8.02

Differential Gene Expression: n-fold change
- Arbitrarily selected fold change cut-offs, usually 2-fold
- Pros
  - Intuitive and easily visualised
  - Simple and rapid
- Cons
  - Statistically inefficient
  - Magnitude does not necessarily indicate importance
  - Often too restrictive
- MVA plot
  - M: log ratio, log2(A/B)
  - A: average log intensity, (1/2) log2(A*B)
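
As a small worked example of the quantities on this slide, the sketch below computes M and A for one gene and applies a 2-fold cut-off. The function names and the intensities are hypothetical; the cut-off is applied symmetrically on the log scale so that 2-fold up and 2-fold down are treated alike.

```python
import math

def ma_values(intensity_a, intensity_b):
    """Return (M, A) for one gene measured in two conditions."""
    m = math.log2(intensity_a / intensity_b)          # log ratio
    a = 0.5 * math.log2(intensity_a * intensity_b)    # average log intensity
    return m, a

def passes_fold_change(intensity_a, intensity_b, cutoff=2.0):
    """True if the gene changes by at least `cutoff`-fold in either direction."""
    return abs(math.log2(intensity_a / intensity_b)) >= math.log2(cutoff)

m_value, a_value = ma_values(8.0, 2.0)
print(m_value, a_value)
print(passes_fold_change(8.0, 2.0), passes_fold_change(3.0, 2.0))
```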

Differential Gene Expression: statistical tests
- Test for significant change between repeated measurements of a variable in two groups or multiple groups
- Calculate a test statistic, select a cut-off value, and reject the null hypothesis if the statistic exceeds it
- Methods
  - Two independent groups
    - Student's t-test: parametric
    - Mann-Whitney U test: nonparametric
  - Two or more independent groups
    - ANOVA (analysis of variance): parametric
    - Kruskal-Wallis test: nonparametric
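
To make the "calculate a test statistic" step concrete, here is a minimal sketch of the pooled-variance Student's t statistic for one gene in two groups. The function name and the expression values are made up for illustration. Turning the statistic into a p-value requires the t distribution, which the standard library does not provide; in practice a statistics package (for example `scipy.stats.ttest_ind`) would be used.

```python
import math
from statistics import mean, variance

def student_t(group1, group2):
    """Pooled-variance t statistic and degrees of freedom for two groups."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    # Pooled sample variance, assuming equal variances in both groups.
    pooled = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    standard_error = math.sqrt(pooled * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / standard_error, df

# Hypothetical log-expression values of one gene in two groups of samples.
control = [1.0, 2.0, 3.0]
treated = [4.0, 5.0, 6.0]
t_stat, dof = student_t(control, treated)
print(t_stat, dof)
```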

Correction for multiple testing
- Why?
  - In an experiment with a 10,000-gene array in which the significance level is set at 0.05, 10,000 x 0.05 = 500 genes would be expected to be called significant even if none is differentially expressed
  - Using unadjusted p-values inflates the number of Type I errors (false positives)
- Methods
  - Control the family-wise error rate (FWER), the probability of at least one Type I error in the entire set (family) of hypotheses tested, e.g. standard Bonferroni correction: adjusted p-value = uncorrected p-value x number of genes tested (capped at 1)
  - Control the false discovery rate (FDR), the expected proportion of false positives among the rejected hypotheses, e.g. Benjamini and Hochberg correction
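
Both corrections named above can be written as short functions that return adjusted p-values. The sketch below is illustrative only: the function names and the four p-values are invented, and the Benjamini-Hochberg version uses the usual step-up form (p x m / rank, made monotone from the largest p-value downwards).

```python
def bonferroni(pvalues):
    """Adjusted p = p x number of tests, capped at 1 (controls the FWER)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (control the FDR)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping values monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Expected number of false positives on the slide: 10,000 genes at level 0.05.
expected_false_positives = 10_000 * 0.05

raw = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
print(expected_false_positives, bonf, bh)
```

Bonferroni is the more conservative of the two: every adjusted value it returns is at least as large as the corresponding Benjamini-Hochberg value.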

What is clustering
- Clustering algorithms are methods to divide a set of n objects (genes or samples) into g groups so that within-group similarities are larger than between-group similarities
- Unsupervised techniques: they do not require the incorporation of any prior knowledge in the process

Why clustering?
- Exploratory data analysis, providing rough maps and suggesting directions for further study
- Representing distances among high-dimensional expression profiles in a concise, visually effective way, such as a tree or dendrogram
- Identifying candidate subgroups in complex data, e.g. identification of novel sub-types in cancer, identification of co-expressed genes

Clustering methods
- Hierarchical clustering: generate a hierarchy of clusters going from 1 cluster to n clusters
- Partitioning: divide the data into g groups using some reallocation algorithm, e.g. K-means
- Fuzzy clustering: each object has a set of weights suggesting the probability of it belonging to each cluster
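
As an illustration of the partitioning approach, the following is a bare-bones K-means sketch. It is not taken from the lecture: the function name, the fixed number of iterations, the toy expression profiles, and the hand-picked starting centroids (real implementations usually initialise them randomly) are all assumptions.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        # New centroid = mean of the group; keep the old one if the group is empty.
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
    labels = [min(range(len(centroids)),
                  key=lambda c: math.dist(p, centroids[c])) for p in points]
    return labels, centroids

profiles = [[1.0, 1.0], [1.0, 2.0], [5.0, 5.0], [5.0, 6.0]]
labels, centers = kmeans(profiles, centroids=[[0.0, 0.0], [6.0, 6.0]])
print(labels, centers)
```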

Hierarchical clustering
- Agglomerative clustering (bottom-up)
  - Start with n groups, join the two closest, continue
- Divisive clustering (top-down)
  - Start with 1 group, split into 2, then into 3, ..., into n
- Requires a distance measurement
  - Between two objects
  - Between clusters

Between-object distance measurement
- Euclidean distance
  - Focuses on the absolute expression value
- Pearson correlation coefficient
  - Focuses on the expression profile shape
  - Parametric: assumes normally distributed data and a linear relationship
- Spearman correlation coefficient
  - Focuses on the expression profile shape
  - Non-parametric: no distributional assumption
  - Less sensitive than Pearson
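
The three measures above can be compared on toy profiles with a short standard-library sketch. The function names and gene profiles are hypothetical. Spearman is computed here as the Pearson correlation of the ranks, with tied values given their average rank.

```python
import math
from statistics import mean

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [11.0, 12.0, 13.0, 14.0]   # same shape as gene_a, shifted upwards
gene_c = [1.0, 4.0, 9.0, 16.0]      # same ordering as gene_a, nonlinear
print(euclidean(gene_a, gene_b), pearson(gene_a, gene_b))
print(pearson(gene_a, gene_c), spearman(gene_a, gene_c))
```

Here gene_b is far from gene_a in Euclidean terms but perfectly correlated with it, while gene_c is perfectly rank-correlated with gene_a although its Pearson correlation is below 1.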

Different measurement, different distance
[Figure: expression profiles of GeneA (blue), GeneB (pink), GeneC (green) and GeneD (red)]
Most similar profile to GeneA based on different distance measurements:
- Euclidean: GeneB
- Pearson: GeneC
- Spearman: GeneD

Between-cluster distance measurement
- Single linkage: the smallest distance of all pairwise distances
- Complete linkage: the maximum distance of all pairwise distances
- Average linkage: the average distance of all pairwise distances
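
Putting the bottom-up procedure and the three linkage rules together, a naive agglomerative clustering can be sketched as follows. This is an illustrative, inefficient implementation with hypothetical names and toy profiles; it uses Euclidean distance between objects and records each merge together with the distance at which it happened, which is the height at which the two branches join in a dendrogram.

```python
import math

def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters given as lists of point indices."""
    pairwise = [math.dist(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":
        return min(pairwise)
    if linkage == "complete":
        return max(pairwise)
    return sum(pairwise) / len(pairwise)  # average linkage

def agglomerative(points, linkage="average"):
    """Start with n singleton clusters and repeatedly join the two closest."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

profiles = [[1.0, 1.0], [1.0, 2.0], [5.0, 5.0], [5.0, 6.0]]
single_merges = agglomerative(profiles, "single")
complete_merges = agglomerative(profiles, "complete")
for left, right, height in single_merges:
    print(left, right, round(height, 3))
```

With these toy profiles the order of the merges is the same under every linkage rule; only the height of the final join differs.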

Hierarchical clustering
- Dendrogram
  - Output of a hierarchical clustering
  - Tree structure with the genes or samples as the leaves
  - The height of the join indicates the distance between the left branch and the right branch
- Problems
  - Hard to define distinct clusters

What is classification
- Classification algorithms are methods to classify objects into predefined classes
- Supervised techniques: they require training data and predefined classes
- Two-step process
  - Model construction: describe a set of predetermined classes using training data
  - Model application: classify new objects into predefined classes

Classification methods
- K-nearest neighbor
- Decision tree
- Support vector machine
- Naïve Bayes classifier
- Artificial neural network

Feature selection
Microarray data are characterized by large numbers of variables (genes) relative to very few observations (samples), so we need to select a subset of genes likely to be predictive (i.e. strongly associated with particular classes for classification).

Model construction
Classification algorithms take the training data and produce a classifier (model).

Training data:
Sample  GeneA  GeneB  Tumor
A       H      H      N
B       H      L      Y
C       L      L      N
D       H      L      Y
E       L      L      N
F       L      H      N

Classifier (model): IF GeneA = H AND GeneB = L THEN Tumor = yes

Model application
New objects are passed to the classifier (model).

Sample  GeneA  GeneB  Tumor
Z       H      L      ?

Classifier (model): IF GeneA = H AND GeneB = L THEN Tumor = yes
Sample Z: Tumor? Yes
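
The two steps can be mirrored in code using the slide's own training table and rule. In this sketch the rule is written by hand rather than learned by an algorithm, and the variable and function names are hypothetical; the code only checks that the rule reproduces the training labels and then applies it to sample Z.

```python
# Training data from the slide: sample -> (GeneA, GeneB, Tumor).
training = {
    "A": ("H", "H", "N"),
    "B": ("H", "L", "Y"),
    "C": ("L", "L", "N"),
    "D": ("H", "L", "Y"),
    "E": ("L", "L", "N"),
    "F": ("L", "H", "N"),
}

def classify(gene_a, gene_b):
    """Classifier (model): IF GeneA = H AND GeneB = L THEN Tumor = yes."""
    return "Y" if gene_a == "H" and gene_b == "L" else "N"

# Model construction check: the rule agrees with every training sample.
training_errors = [s for s, (a, b, label) in training.items()
                   if classify(a, b) != label]

# Model application: classify the new sample Z (GeneA = H, GeneB = L).
prediction_z = classify("H", "L")
print(training_errors, prediction_z)
```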

K-nearest neighbor
- Objects are points in an n-dimensional space
- Compute the distance between the new case and all learning cases
- Return the most common class among the k learning cases nearest to the new case
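
These three steps translate almost directly into code. The sketch below is a minimal illustration with hypothetical names and made-up two-gene expression values; it uses Euclidean distance and a simple majority vote.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, new_x, k=3):
    """Majority class among the k training cases nearest to new_x."""
    # Distance from the new case to every learning case.
    distances = sorted(
        (math.dist(x, new_x), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

train_x = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0], [5.2, 4.9]]
train_y = ["normal", "normal", "normal", "tumor", "tumor"]
prediction_near_tumor = knn_predict(train_x, train_y, [4.8, 5.1], k=3)
prediction_near_normal = knn_predict(train_x, train_y, [1.0, 1.0], k=3)
print(prediction_near_tumor, prediction_near_normal)
```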

Over-fitting and cross-validation
- Over-fitting: the classifier is very effective in classifying the training samples but not accurate enough for new samples
- Cross-validation
  - Hold-out
    - Split data into training and testing data
    - Learn with the training data and estimate true error with the testing data
  - N-fold
    - Randomly split data into training and testing data n times
    - Learn with the training data and estimate true error with the testing data in each split separately
    - Average test performance
  - Leave-one-out
    - Leave one case for testing
    - Learn with the remaining data and estimate true error with the testing case
    - Average test performance
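
The leave-one-out scheme can be sketched as a loop that holds out each case in turn. The classifier plugged in here is a 1-nearest-neighbor rule chosen only to keep the example short; the function names and the one-gene toy data are assumptions.

```python
import math

def nearest_neighbor(train_x, train_y, new_x):
    """Label of the single closest training case (1-nearest neighbor)."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], new_x))
    return train_y[best]

def leave_one_out_accuracy(xs, ys, classify):
    """Hold out each case once, train on the rest, average the results."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if classify(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)

xs = [[1.0], [1.2], [0.9], [5.0], [5.3], [4.8]]
ys = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]
accuracy = leave_one_out_accuracy(xs, ys, nearest_neighbor)
print(accuracy)
```

Because each held-out case is never seen during training, the averaged accuracy is an estimate of performance on new samples rather than on the training samples.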

Importance of biological interpretation
- Normalize, filter, cluster and visualize
  - Identification of sets of genes of potential interest
  - Numerical technique: does not reveal the biological implications encoded in expression data
- Evaluation of the functional significance of large, heterogeneous and noisy sets of genes constitutes a big challenge

Gene Ontology
- Structured, precisely defined, common, controlled vocabulary for describing the roles of genes and gene products
- Three major categories that describe the attributes of biological process, molecular function and cellular component for a gene product
- Categories of concepts are held within a Directed Acyclic Graph (DAG)
- http://geneontology.org

Gene Ontology Tree Machine (GOTM)
- A web-based tool for the analysis and visualization of sets of genes identified from high-throughput technologies
- User-friendly data navigation and visualization
- Statistical analysis suggesting biological areas that warrant further study
- http://bioinfo.vanderbilt.edu/gotm

GOTM
[Figure: overlap with the "mitotic cell cycle" category. Up-regulated genes: observed 24; random genes: expected 0.5; p = 1.92e-34. Both panels are labelled 69 and 147.]
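
One common way to compare an observed overlap with the overlap expected by chance is a hypergeometric test; the sketch below shows that calculation and makes no claim to reproduce GOTM's exact computation. Several inputs are assumptions: reading 69 as the size of the up-regulated list and 147 as the size of the mitotic cell cycle category is an interpretation of the figure labels, and the background of 20,286 genes is hypothetical, chosen only because it yields the expected value of 0.5 shown on the slide.

```python
from math import comb

def expected_overlap(list_size, category_size, background_size):
    """Overlap expected by chance between a gene list and a category."""
    return list_size * category_size / background_size

def hypergeom_upper_tail(observed, list_size, category_size, background_size):
    """P(overlap >= observed) when the list is drawn at random."""
    total = comb(background_size, list_size)
    favourable = sum(
        comb(category_size, i)
        * comb(background_size - category_size, list_size - i)
        for i in range(observed, min(list_size, category_size) + 1)
    )
    return favourable / total

LIST_SIZE = 69         # assumption: number of up-regulated genes
CATEGORY_SIZE = 147    # assumption: genes annotated to mitotic cell cycle
BACKGROUND = 20286     # hypothetical background size, not given on the slide

expected = expected_overlap(LIST_SIZE, CATEGORY_SIZE, BACKGROUND)
p_value = hypergeom_upper_tail(24, LIST_SIZE, CATEGORY_SIZE, BACKGROUND)
print(expected, p_value)
```

The resulting p-value depends on the assumed background size, so it should not be expected to match the value printed on the slide.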
