Advanced Analysis of Gene Expression Microarray Data
Aidong Zhang, State University of New York at Buffalo, USA
World Scientific


World Scientific: New Jersey, London, Singapore, Beijing, Shanghai, Hong Kong, Taipei, Chennai

Contents

Preface

1. Introduction
   The Microarray: Key to Functional Genomics and Systems Biology
   Applications of Microarray
   Gene Expression Profiles in Different Tissues
   Developmental Genetics
   Gene Expression Patterns in Model Systems
   Differential Gene Expression Patterns in Diseases
   Gene Expression Patterns in Pathogens
   Gene Expression in Response to Drug Treatments
   Genotypic Analysis
   Mutation Screening of Disease Genes
   Framework of Microarray Data Analysis
   Summary

2. Basic Concepts of Molecular Biology
   Introduction
   Cells
   Proteins
   Nucleic Acids
   DNA
   RNA
   Central Dogma of Molecular Biology
   Genes and the Genetic Code
   Transcription and Gene Expression
   Translation and Protein Synthesis
   Genotype and Phenotype
   Summary

3. Overview of Microarray Experiments
   Introduction
   Microarray Chip Manufacture
   Deposition-Based Manufacture
   In Situ Manufacture
   The Affymetrix GeneChip
   Steps of Microarray Experiments
   Sample Preparation and Labeling
   Hybridization
   Image Scanning
   Image Processing
   Microarray Data Cleaning and Preprocessing
   Data Transformation
   Missing Value Estimation
   Data Normalization
   Global Normalization Approaches
   Standardization
   Iterative linear regression
   Intensity-Dependent Normalization
   LOWESS: Locally weighted linear regression
   Distribution normalization
   Summary

4. Analysis of Differentially-Expressed Genes
   Introduction
   Basic Concepts in Statistics
   Statistical Inference
   Hypothesis Test
   Fold Change Methods
   k-fold Change
   Unusual Ratios
   Model-Based Methods
   Parametric Tests
   Paired t-test
   Unpaired t-test
   Variants of t-test
   Non-Parametric Tests
   Classical Non-Parametric Statistics
   Other Non-Parametric Statistics
   Bootstrap Analysis
   Multiple Testing
   Family-Wise Error Rate
   Sidak correction and Bonferroni correction
   Holm's step-wise correction
   False Discovery Rate
   Permutation Correction
   SAM: Significance Analysis of Microarrays
   ANOVA: Analysis of Variance
   One-Way ANOVA
   Two-Way ANOVA
   Summary

5. Gene-Based Analysis
   Introduction
   Proximity Measurement for Gene Expression Data
   Euclidean Distance
   Correlation Coefficient
   Pearson's correlation coefficient
   Jackknife correlation
   Spearman's rank-order correlation
   Kullback-Leibler Divergence
   Partition-Based Approaches
   K-means and its Variations
   SOM and its Extensions
   Graph-Theoretical Approaches
   HCS and CLICK
   CAST: Cluster affinity search technique
   Model-Based Clustering
   Hierarchical Approaches
   Agglomerative Algorithms
   Divisive Algorithms
   DAA: Deterministic annealing algorithm
   SPC: Super-paramagnetic clustering
   5.5 Density-Based Approaches
   DBSCAN
   OPTICS
   DENCLUE
   GPX: Gene Pattern explorer
   The Attraction Tree
   The distance measure
   The density definition
   The attraction tree
   An example of attraction tree
   Interactive Exploration of Coherent Patterns
   Generating the index list
   The coherent pattern index and its graph
   Drilling down to subgroups
   Experimental Results
   Interactive exploration of Iyer's data and Spellman's data
   Comparison with other algorithms
   Efficiency and Scalability
   Cluster Validation
   Homogeneity and Separation
   Agreement with Reference Partition
   Reliability of Clusters
   P-value of a cluster
   Prediction strength
   Summary

6. Sample-Based Analysis
   Introduction
   Selection of Informative Genes
   Supervised Approaches
   Differentially expressed genes
   Gene pairs
   Virtual genes
   Genetic algorithms
   Unsupervised Approaches
   PCA: Principal component analysis
   Gene shaving
   Class Prediction
   Linear Discriminant Analysis
   Instance-Based Classification
   KNN: k-nearest Neighbor
   Weighted voting
   Decision Trees
   Support Vector Machines
   Class Discovery
   Problem statement
   CLIFF: CLustering via Iterative Feature Filtering
   The sample-partition process
   The gene-filtering process
   ESPD: Empirical Sample Pattern Detection
   Measurements for phenotype structure detection
   Algorithms
   Experimental results
   Classification Validation
   Prediction Accuracy
   Prediction Reliability
   Summary

7. Pattern-Based Analysis
   Introduction
   Mining Association Rules
   Concepts of Association-Rule Mining
   The Apriori Algorithm
   The FP-Growth Algorithm
   The CARPENTER Algorithm
   Generating Association Rules in Microarray Data
   Rule filtering
   Rule grouping
   Mining Pattern-Based Clusters in Microarray Data
   Heuristic Approaches
   Coupled two-way clustering (CTWC)
   Plaid model
   Biclustering and δ-Clusters
   Deterministic Approaches
   pCluster
   OP-Cluster
   7.4 Mining Gene-Sample-Time Microarray Data
   Three-dimensional Microarray Data
   Coherent Gene Clusters
   Problem description
   Maximal coherent sample sets
   The mining algorithms
   Experimental results
   Tri-Clusters
   The tri-cluster model
   Properties of tri-clusters
   Mining tri-clusters
   Summary

8. Visualization of Microarray Data
   Introduction
   Single-Array Visualization
   Box Plot
   Histogram
   Scatter Plot
   Gene Pies
   Multi-Array Visualization
   Global Visualizations
   Optimal Visualizations
   Projection Visualization
   VizStruct
   Fourier Harmonic Projections
   Discrete-time signal paradigm
   The Fourier harmonic projection algorithm
   Properties of FHPs
   Basic properties
   Advanced properties
   Harmonic equivalency
   Effects of harmonic twiddle power index
   Enhancements of Fourier Harmonic Projections
   Exploratory Visualization of Gene Profiling
   Microarray data sets for visualization
   Identification of informative genes
   Classifier construction and evaluation
   Dimension arrangement
   Visualization of various data sets
   Comparison of FFHP to Sammon's mapping
   Confirmative Visualization of Gene Time-series
   Data sets for visualization
   The harmonic projection approach
   Rat kidney data set
   Yeast-A data set
   Yeast-B data set
   Summary

9. New Trends in Mining Gene Expression Microarray Data
   Introduction
   Meta-Analysis of Microarray Data
   Meta-Analysis of Differential Genes
   Meta-Analysis of Co-Expressed Genes
   Semi-Supervised Clustering
   General Semi-Supervised Clustering Algorithms
   A Seed-Generation Approach
   Seed-generation methods
   Pattern-selection rules
   The framework for the seed-generation approach
   Integration of Gene Expression Data with Other Data
   A Probabilistic Model for Joint Mining
   A Graph-Based Model for Joint Mining
   Summary

10. Conclusion

Bibliography

Index