Gene Expression Data Analysis
Bing Zhang
Department of Biomedical Informatics, Vanderbilt University
bing.zhang@vanderbilt.edu
BMIF 310, Fall 2009
Gene expression technologies (summary)
- Hybridization-based approaches
  - Printed arrays
    - cDNA arrays: customizable, high array variation
  - Synthesized oligo arrays
    - Affymetrix arrays: high density, low array variation
      - Classic arrays: probes on the 3' UTR
      - Exon arrays: probes on all known exons
      - Tiling arrays: probes spread across the genomic sequence
- Sequencing-based approaches
  - Traditional Sanger sequencing-based approaches
    - Serial analysis of gene expression (SAGE): ~10 bp tag at the 3' end
  - 2nd-generation sequencing-based approaches
    - RNA-Seq: high-throughput unbiased profiling
Bioinformatics tasks
- Biological question; Experiment design; Microarray experiment; Image analysis; Normalization
- Data mining: Differential expression, Clustering, Classification, Network analysis
- Biological interpretation; Hypothesis; Experimental verification
- Data storage; Data integration; Data visualization
Well begun is half done
- A clearly defined biological question
- Good control of potential sources of variation (biological and technical)
- A statistically sound microarray experimental arrangement (replicates)
- Compliance with the standard for microarray information collection (MIAME): http://www.mged.org/workgroups/miame/miame.html
Image analysis
Analysis of the image of the scanned array to extract an intensity for each spot or feature on the array.
- Gridding: align a grid to the spots
- Segmentation: identify the shape of each spot
- Intensity extraction: extract the intensity of each spot and, potentially, of its surrounding background
- Background correction: subtract the background signal from the spot intensity to get a more accurate estimate of the biological signal from the spot
Garbage in, garbage out
- Remove bad arrays
- Remove poor-quality spots
- Remove data points with a low signal-to-noise ratio
- Remove data points with too many missing values
[Figure: example of a bad array]
Normalization
The purpose of normalization is to remove systematic variation in a microarray experiment that affects the measured gene expression levels.
Sources of systematic variation:
- Unequal quantities of starting RNA
- Differences in labelling and detection efficiencies
- Topographical slide variation
- Scanner-introduced bias
Normalization methods
- Global normalization: multiply each array by a constant so that the mean (or median) intensity is the same for every array
- Quantile normalization: match the percentiles of each array
- Adjust using a nonlinear smoothing curve
- Adjust the arrays using control or housekeeping genes that are expected to have the same intensity level across all samples
- Adjust using spike-in controls
[Figure panels: No normalization; Global normalization; Quantile normalization]
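The first two methods can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names are hypothetical, the global method here uses the median and (as an assumption) rescales every array to the mean of the per-array medians, and the quantile method breaks ties arbitrarily.

```python
from statistics import median

def global_normalize(arrays, target=None):
    """Multiply each array by a constant so that all arrays share one median."""
    medians = [median(a) for a in arrays]
    if target is None:
        # Assumption: use the mean of the per-array medians as the common target.
        target = sum(medians) / len(medians)
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]

def quantile_normalize(arrays):
    """Give every array the same distribution by matching its percentiles."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    result = []
    for a in arrays:
        # Indices of this array ordered from smallest to largest intensity.
        order = sorted(range(n), key=lambda i: a[i])
        normalized = [0.0] * n
        for rank, i in enumerate(order):
            normalized[i] = reference[rank]
        result.append(normalized)
    return result

# Two toy arrays (raw intensities for three genes each).
arrays = [[2.0, 4.0, 6.0], [10.0, 30.0, 20.0]]
print(global_normalize(arrays))
print(quantile_normalize(arrays))
```

Global normalization only rescales each array, whereas quantile normalization replaces every value by the reference value of the same rank, so after it all arrays contain exactly the same set of numbers.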
Get to know your data matrix
Rows are genes, columns are samples.
ID        Samp 1  Samp 2  Samp 3  ...  Samp m-1  Samp m
Gene 1      5.25    6.37    7.30  ...      6.02    7.17
Gene 2      6.96    5.01    7.23  ...      5.87    5.02
Gene 3      5.44    5.67    4.23  ...      5.33    6.34
Gene 4     12.83   10.35   12.56  ...      9.98   11.13
Gene 5      3.20    3.07    3.19  ...      3.27    3.16
Gene 6      7.74    7.66    7.12  ...      7.46    7.95
...
Gene n-1    8.92    8.52    7.62  ...      7.90    8.02
Gene n      6.06    6.04    6.35  ...      6.44    6.60
Differential Gene Expression: n-fold change
- Arbitrarily selected fold-change cut-offs, usually 2-fold
- Pros
  - Intuitive and easily visualised
  - Simple and rapid
- Cons
  - Statistically inefficient
  - Magnitude does not necessarily indicate importance
  - Often too restrictive
- MVA (M versus A) plot, for two intensities A and B of the same gene
  - M: log ratio, M = log2(A/B)
  - A: average log intensity, A = (1/2) log2(A x B), where A and B inside the logarithm are the two intensities
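As a minimal sketch of the fold-change approach, the following computes M and A for a pair of intensities and flags genes whose absolute log ratio reaches a cut-off. The function names and the toy intensities are hypothetical; the 2-fold default mirrors the "usually 2-fold" convention above.

```python
import math

def ma_values(a, b):
    """Return (M, A) for two raw intensities of the same gene."""
    m = math.log2(a / b)              # M: log ratio
    avg = 0.5 * math.log2(a * b)      # A: average log intensity
    return m, avg

def fold_change_calls(intensities_a, intensities_b, cutoff=2.0):
    """Flag genes changed at least `cutoff`-fold in either direction."""
    threshold = math.log2(cutoff)
    calls = []
    for a, b in zip(intensities_a, intensities_b):
        m, _ = ma_values(a, b)
        calls.append(abs(m) >= threshold)
    return calls

# Toy data: three genes measured under conditions A and B.
cond_a = [8.0, 3.0, 1.0]
cond_b = [2.0, 2.0, 4.0]
print([ma_values(a, b) for a, b in zip(cond_a, cond_b)])
print(fold_change_calls(cond_a, cond_b))
```

Working on the log2 scale makes up- and down-regulation symmetric: a 2-fold increase gives M = 1 and a 2-fold decrease gives M = -1.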
Differential Gene Expression: statistical tests
- Test for a significant change between repeated measurements of a variable in two groups or in multiple groups
- Procedure: calculate a test statistic, select a cut-off value, and decide whether to reject the null hypothesis
- Methods
  - Two independent groups
    - Student's t-test: parametric
    - Mann-Whitney U test: nonparametric
  - Two or more independent groups
    - ANOVA (analysis of variance): parametric
    - Kruskal-Wallis test: nonparametric
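To make the two-group tests concrete, here is a minimal sketch of the test statistics only, written with the standard library. It assumes the equal-variance (pooled) form of Student's t-test, and the function names and toy values are hypothetical. Turning the statistics into p-values requires the t distribution or the U distribution and is not shown here.

```python
import math
from statistics import mean, variance

def student_t(group1, group2):
    """Student's t statistic (pooled variance) and its degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    standard_error = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean(group1) - mean(group2)) / standard_error, df

def mann_whitney_u(group1, group2):
    """Mann-Whitney U statistic: the smaller of the two U values."""
    # Count the pairs in which the group1 value is larger; ties count one half.
    u1 = sum(1.0 if x > y else 0.5 if x == y else 0.0
             for x in group1 for y in group2)
    return min(u1, len(group1) * len(group2) - u1)

# Toy expression values of one gene in two groups of samples.
control = [1.0, 2.0, 3.0]
treated = [4.0, 5.0, 6.0]
print(student_t(control, treated))
print(mann_whitney_u(control, treated))
```

The t statistic uses the actual values and so depends on the parametric assumptions, while U depends only on the ordering of the values, which is why it is the nonparametric counterpart.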
Correction for multiple testing
- Why?
  - In an experiment with a 10,000-gene array in which the significance level is set at 0.05, about 10,000 x 0.05 = 500 genes would be inferred as significant even though none is differentially expressed
  - Unadjusted p-values are likely to inflate the number of Type I errors (false positives)
- Methods
  - Control the family-wise error rate (FWER), the probability that there is at least one Type I error in the entire set (family) of hypotheses tested, e.g. the standard Bonferroni correction: adjusted p-value = uncorrected p-value x number of genes tested
  - Control the false discovery rate (FDR), the expected proportion of false positives among the rejected hypotheses, e.g. the Benjamini and Hochberg correction
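Both corrections can be written as adjustments of the raw p-values. The sketch below is a minimal illustration with hypothetical function names; it caps Bonferroni-adjusted values at 1 and uses the usual step-up form of the Benjamini and Hochberg procedure, in which the p-value of rank r among m tests is multiplied by m / r and the result is kept monotone.

```python
def bonferroni(pvalues):
    """Bonferroni: multiply each p-value by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (controls the FDR)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values never
    # increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# The slide's arithmetic: expected false positives with no true signal.
print(10_000 * 0.05)

raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

A gene is then called significant when its adjusted p-value is below the chosen level; the Benjamini and Hochberg values are never larger than the Bonferroni ones, which reflects the less stringent error criterion.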
What is clustering
- Clustering algorithms are methods that divide a set of n objects (genes or samples) into g groups so that within-group similarities are larger than between-group similarities
- Clustering is an unsupervised technique: it does not require any prior knowledge to be incorporated in the process
Why clustering?
- Exploratory data analysis, providing rough maps and suggesting directions for further study
- Representing distances among high-dimensional expression profiles in a concise, visually effective way, such as a tree or dendrogram
- Identifying candidate subgroups in complex data, e.g. identification of novel sub-types in cancer, identification of co-expressed genes
Clustering methods
- Hierarchical clustering: generate a hierarchy of clusters going from 1 cluster to n clusters
- Partitioning: divide the data into g groups using some reallocation algorithm, e.g. K-means
- Fuzzy clustering: each object has a set of weights suggesting the probability of it belonging to each cluster
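As an illustration of the partitioning approach, here is a minimal K-means sketch. It is not tuned for real data: the function names are hypothetical, squared Euclidean distance is assumed, and the first k points are used as initial centers (an assumption made to keep the example deterministic; random starts are more common).

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100):
    """Return (assignment, centers) after reallocating until stable."""
    # Assumption: the first k points serve as the initial centers.
    centers = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Reallocation step: assign every point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Toy expression profiles (two samples per gene), two obvious groups.
profiles = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0)]
print(kmeans(profiles, k=2))
```

The number of groups must be chosen in advance, and different starting centers can lead to different partitions.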
Hierarchical clustering
- Agglomerative clustering (bottom-up)
  - Start with n groups, join the two closest, continue
- Divisive clustering (top-down)
  - Start with 1 group, split into 2, then into 3, ..., into n
- Requires a distance measurement
  - Between two objects
  - Between clusters
Between-object distance measurement
- Euclidean distance
  - Focuses on the absolute expression value
- Pearson correlation coefficient
  - Focuses on the shape of the expression profile
  - Parametric: assumes normally distributed data that follow a linear regression model
- Spearman correlation coefficient
  - Focuses on the shape of the expression profile
  - Non-parametric: no distributional assumption
  - Less sensitive than Pearson
Different measurement, different distance
Most similar profile to GeneA (blue) based on different distance measurements:
- Euclidean: GeneB (pink)
- Pearson: GeneC (green)
- Spearman: GeneD (red)
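The three measures can be compared directly on small profiles. The sketch below uses hypothetical function names and toy data; Spearman is computed as the Pearson correlation of the ranks, with tied values given their average rank. A correlation r is commonly turned into a distance as 1 - r, which is an assumption here rather than something stated on the slides.

```python
import math

def euclidean(x, y):
    """Euclidean distance: sensitive to absolute expression values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation: sensitive to linear similarity in shape."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

gene_a = [1.0, 2.0, 3.0, 4.0]
shifted = [11.0, 12.0, 13.0, 14.0]   # same shape, higher level
monotone = [1.0, 2.0, 3.0, 10.0]     # same ordering, not linear
print(euclidean(gene_a, shifted), pearson(gene_a, shifted))
print(pearson(gene_a, monotone), spearman(gene_a, monotone))
```

The shifted profile is far from gene_a in Euclidean terms but perfectly correlated with it, and the monotone profile has a perfect Spearman correlation while its Pearson correlation is below 1, which is why the choice of measure changes which gene looks most similar.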
Between-cluster distance measurement
- Single linkage: the smallest of all pairwise distances
- Complete linkage: the largest of all pairwise distances
- Average linkage: the average of all pairwise distances
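The three linkage rules plug into agglomerative clustering as the way of scoring a pair of clusters. The following is a minimal, unoptimized sketch with hypothetical function names; it recomputes all pairwise distances at every step, which is fine for a handful of objects but far too slow for a full gene set.

```python
def linkage_distance(cluster_a, cluster_b, dist, method="average"):
    """Distance between two clusters under the chosen linkage rule."""
    pairwise = [dist(x, y) for x in cluster_a for y in cluster_b]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage method: " + method)

def agglomerative(points, dist, method="average"):
    """Bottom-up clustering; returns the merges as (left, right, height)."""
    clusters = [[p] for p in points]   # start with n singleton groups
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the two closest clusters.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage_distance(clusters[i], clusters[j], dist, method)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# One-dimensional toy objects with absolute difference as the distance.
values = [1.0, 2.0, 6.0]
for method in ("single", "complete", "average"):
    print(method, agglomerative(values, lambda x, y: abs(x - y), method))
```

Each recorded height is the distance at which two branches were joined, which is the quantity a dendrogram draws on its vertical axis.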
Hierarchical clustering: dendrogram
- Output of a hierarchical clustering
- Tree structure with the genes or samples as the leaves
- The height of the join indicates the distance between the left branch and the right branch
- Problem: it is hard to define distinct clusters
What is classification
- Classification algorithms are methods that assign objects to predefined classes
- Classification is a supervised technique: it requires training data and predefined classes
- Two-step process
  - Model construction: describe a set of predetermined classes using training data
  - Model application: classify new objects into the predefined classes
Classification methods
- K-nearest neighbor
- Decision tree
- Support vector machine
- Naïve Bayes classifier
- Artificial neural network
Feature selection
Microarray data are characterized by a large number of variables (genes) relative to very few observations (samples). We therefore need to select a subset of genes likely to be predictive, i.e. highly related to the particular classes used for classification.
Model construction
Training data, given to a classification algorithm:
Sample  GeneA  GeneB  Tumor
A       H      H      N
B       H      L      Y
C       L      L      N
D       H      L      Y
E       L      L      N
F       L      H      N
Resulting classifier (model): IF GeneA = H AND GeneB = L THEN Tumor = yes
Model application
New object, given to the classifier (model):
Sample  GeneA  GeneB  Tumor
Z       H      L      ?
Classifier (model): IF GeneA = H AND GeneB = L THEN Tumor = yes
Result: Sample Z, Tumor? Yes
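The two steps can be traced with the training table and the rule from the slides. In this minimal sketch the rule is written by hand rather than learned, the names are hypothetical, and returning "N" whenever the rule does not fire is an assumption, since the slides only state the "yes" case.

```python
# Training data from the slides: (sample, GeneA, GeneB, tumor).
TRAINING = [
    ("A", "H", "H", "N"),
    ("B", "H", "L", "Y"),
    ("C", "L", "L", "N"),
    ("D", "H", "L", "Y"),
    ("E", "L", "L", "N"),
    ("F", "L", "H", "N"),
]

def classify(gene_a, gene_b):
    """IF GeneA = H AND GeneB = L THEN Tumor = yes."""
    # Assumption: every other combination is classified as non-tumor.
    return "Y" if (gene_a == "H" and gene_b == "L") else "N"

def training_accuracy(rows):
    """Fraction of training samples that the rule reproduces."""
    correct = sum(classify(a, b) == tumor for _, a, b, tumor in rows)
    return correct / len(rows)

# Model construction check: the rule agrees with the training data.
print(training_accuracy(TRAINING))

# Model application: new sample Z with GeneA = H and GeneB = L.
print(classify("H", "L"))
```

Agreement with the training data alone does not show that the rule will generalize to new samples.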
K-nearest neighbor
- Objects are points in an n-dimensional space
- Compute the distance between the new case and all learning cases
- Return the most common class among the k learning cases nearest to the new case
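These steps translate almost line for line into code. The sketch below assumes Euclidean distance (via `math.dist`, available in Python 3.8 and later) and uses a hypothetical function name and toy learning cases.

```python
import math
from collections import Counter

def knn_predict(learning_cases, new_case, k=3):
    """Classify new_case by majority vote among its k nearest learning cases.

    learning_cases is a list of (point, label) pairs.
    """
    # Sort the learning cases by distance to the new case and keep the k nearest.
    nearest = sorted(learning_cases,
                     key=lambda case: math.dist(case[0], new_case))[:k]
    # Return the most common label among those neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy learning cases: expression of two genes and a tumor label.
cases = [
    ((1.0, 1.0), "N"),
    ((1.2, 0.8), "N"),
    ((5.0, 5.0), "Y"),
    ((5.5, 4.5), "Y"),
    ((4.8, 5.2), "Y"),
]
print(knn_predict(cases, (5.1, 5.0), k=3))
print(knn_predict(cases, (1.0, 0.9), k=3))
```

There is no separate model-construction step: the learning cases themselves are the model, so every prediction scans all of them. Choosing an odd k avoids tied votes when there are two classes.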
Over-fitting and cross-validation
- Over-fitting
  - The classifier is very effective in classifying the training samples but not accurate enough for new samples
- Cross-validation
  - Hold-out
    - Split the data into Training and Testing data
    - Learn with the Training data and estimate the true error with the Testing data
  - N-fold
    - Randomly split the data into n folds; each fold serves once as the Testing data while the remaining folds form the Training data
    - Learn with the Training data and estimate the true error with the Testing data in each split separately
    - Average the test performance
  - Leave-one-out
    - Leave one case out for Testing
    - Learn with the remaining data and estimate the true error with the Testing case
    - Average the test performance
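A minimal sketch of n-fold cross-validation follows, assuming the usual formulation in which the shuffled data are partitioned into n disjoint folds; leave-one-out is then the special case where n equals the number of samples. All names are hypothetical, and the nearest-centroid classifier is only a toy stand-in for whatever classifier is being evaluated.

```python
import random

def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle the sample indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[f::n_folds] for f in range(n_folds)]

def cross_validate(samples, labels, n_folds, fit, predict, seed=0):
    """Average test accuracy over the folds."""
    accuracies = []
    for test_idx in k_fold_indices(len(samples), n_folds, seed):
        held_out = set(test_idx)
        train_x = [samples[i] for i in range(len(samples)) if i not in held_out]
        train_y = [labels[i] for i in range(len(samples)) if i not in held_out]
        model = fit(train_x, train_y)             # learn with Training data
        correct = sum(predict(model, samples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))  # estimate with Testing data
    return sum(accuracies) / len(accuracies)

# Toy classifier (hypothetical): assign the class with the nearest mean value.
def fit_centroids(xs, ys):
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {label: sum(v) / len(v) for label, v in groups.items()}

def predict_centroid(model, x):
    return min(model, key=lambda label: abs(model[label] - x))

expression = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
tumor = ["N", "N", "N", "Y", "Y", "Y"]
print(cross_validate(expression, tumor, 3, fit_centroids, predict_centroid))
# Leave-one-out: as many folds as samples.
print(cross_validate(expression, tumor, len(expression), fit_centroids, predict_centroid))
```

Because every sample is used for testing exactly once and never in the same round in which it was used for training, the averaged accuracy is a less optimistic estimate than the accuracy on the training data.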
Importance of biological interpretation
- Normalize, filter, cluster and visualize: identification of sets of genes of potential interest
- These are numerical techniques; they do not reveal the biological implications encoded in the expression data
- Evaluating the functional significance of large, heterogeneous and noisy sets of genes is a major challenge
Gene Ontology
- Structured, precisely defined, common, controlled vocabulary for describing the roles of genes and gene products
- Three major categories describe the attributes of a gene product: biological process, molecular function and cellular component
- The concepts in each category are held within a Directed Acyclic Graph (DAG)
- http://geneontology.org
Gene Ontology Tree Machine (GOTM)
- A web-based tool for the analysis and visualization of sets of genes identified from high-throughput technologies
- User-friendly data navigation and visualization
- Statistical analysis suggesting biological areas that warrant further study
- http://bioinfo.vanderbilt.edu/gotm
GOTM
[Figure: overlap of the "Up-regulated" gene set with the "mitotic cell cycle" category, compared with a "random" gene set; set sizes labelled 69 and 147. Observed 24, expected 0.5, p = 1.92e-34]
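An observed-versus-expected comparison of this kind can be sketched with a hypergeometric test, which is one common choice for category enrichment; whether GOTM computes its p-value in exactly this way is an assumption. The function name is hypothetical, the numbers 69, 147 and 24 are read from the figure labels under the assumption that 69 is the size of the gene list and 147 the size of the category, and the reference set size of 20,000 genes is purely hypothetical, so the printed p-value is not expected to reproduce the one in the figure.

```python
from math import comb

def enrichment(population, category_size, selected, observed):
    """Expected overlap and hypergeometric P(overlap >= observed).

    population:    number of genes in the reference set
    category_size: reference genes annotated to the category
    selected:      number of genes in the list of interest
    observed:      genes in the list that belong to the category
    """
    expected = selected * category_size / population
    total = comb(population, selected)
    largest = min(category_size, selected)
    # Sum the probabilities of seeing `observed` or more category genes
    # when drawing `selected` genes at random without replacement.
    tail = sum(
        comb(category_size, k) * comb(population - category_size, selected - k)
        for k in range(observed, largest + 1)
    )
    return expected, tail / total

# Hypothetical reference set of 20,000 genes; other numbers from the figure.
expected, p_value = enrichment(20_000, 147, 69, 24)
print(expected, p_value)
```

A gene list drawn at random would be expected to share well under one gene with the category, so an overlap of 24 corresponds to a very small tail probability.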