Seven Keys to Successful Microarray Data Analysis

1 Seven Keys to Successful Microarray Data Analysis
- Experiment Design: type of experiment (two groups, time series, multiple conditions); replicates (the more the better; technical vs. biological)
- Platform Selection: platforms (cDNA, oligo; one color, two color); feature extraction software; file formats
- Data Management: databases; raw data (storing, retrieving); experiment annotation (samples, protocols); usability (intuitive, special training)
- System Access: single-user desktop; single-user server; web-based; sharing data (in the lab, collaboration)
- Differential Expression: normalization; differential expression (fold change, comparison statistics, FWER/FDR); pattern identification (clustering: visualization, partitioning)
- Biological Significance: gene annotation (UniGene, LocusLink, Gene Ontology, KEGG, OMIM); single genes (gene summaries); gene lists (ontology report, pathway report)
- Data Publication: MIAME (what is it?, publication); public databases (GEO, using public data)

2 Microarrays
[Chart: data points per experiment for Northern blot, PCR, and microarray (today); vertical axis up to 70,000, with 65,000 labeled]
Microarrays allow measurement of expression profiles for entire genomes rather than individual genes.

3 The impact of microarrays in biological research
[Figure: relationship between data and experiments, in two panels: "Traditional Biological Research" and "Microarrays in Biological Research"]

4 Microarray Challenges
Data Quantity
- 500,000+ data points per single experiment
- Data management
- Hardware requirements
Data Quality
- Poor spot quality
- Poor specificity of probes
- Reproducibility
- Sample variability
Software Complexity
- Steep learning curve
- Requires bioinformatics specialists
- Expensive, enterprise-wide
- Usability and accessibility problems
- Separate software for each step
[Chart: percentage of labs responding (0% to 80%) for each challenge: Throughput, Reliability, Software, Bioinformatics, Funding Commitments]

6 microarraysuccess.com Experiment Design
Experimental design determines what can be inferred from the data and how much confidence can be assigned to those inferences. Careful experimental design and the presence of biological replicates are essential to the successful use of microarrays.
Type of experiment
- Two groups (Control vs. Treated, Normal vs. Cancer, etc.)
- Three or more groups: time series, multiple treatments, dose response
The type of experiment and the number of groups affect the statistical methods used to detect differential expression.
Replicates
- The more the better, but at least 3
- Biological better than technical
Rigorous statistical inferences cannot be made with a sample size of one. The more replicates, the stronger the inference.
Supporting material - Experimental Design and Other Issues in Microarray Studies - Kathleen Kerr -

7 microarraysuccess.com Differential Expression
The fundamental goal of microarray experiments is to identify genes that are differentially expressed in the conditions being studied. Comparison statistics help identify differentially expressed genes; cluster analysis identifies patterns of gene expression and segregates subsets of genes based on those patterns.
Statistical Significance
- Fold change: does not address the reproducibility of the observed difference and cannot be used to determine statistical significance.
- Comparison statistics
  - Parametric: t-test, Welch's t-test, ANOVA
  - Non-parametric: Wilcoxon rank sum, Kruskal-Wallis, permutation t-test
Comparison tests require replicates and use the variability within the replicates to assign a confidence level as to whether the gene is differentially expressed.
Supporting material - Draghici S. (2002) Statistical intelligence: effective analysis of high-density microarray data. Drug Discov Today, 7(11 Suppl): S55-63.

8 microarraysuccess.com Differential Expression
Calculate the t statistic (t-test for comparison of two groups):
\[
t = \frac{\text{difference between groups}}{\text{difference within groups}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}
\]
Determine the confidence level for t (the probability that t could occur by chance), using $df = n_1 + n_2 - 2$, where $\bar{x}$ = group mean, $s^2$ = variance and $n$ = size of sample.
The larger the difference between the groups and the lower the variance, the bigger t will be and the lower p will be.
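
As a minimal sketch of the calculation on this slide, the t statistic and degrees of freedom can be computed with the Python standard library. The function names and the signal values are hypothetical, chosen only for illustration; converting t to a p-value needs the t distribution, which this sketch leaves out.

```python
import math
from statistics import mean, variance


def t_statistic(group1, group2):
    """t = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2), as on the slide."""
    n1, n2 = len(group1), len(group2)
    # statistics.variance returns the sample variance s^2 (n - 1 denominator).
    standard_error = math.sqrt(variance(group1) / n1 + variance(group2) / n2)
    return (mean(group1) - mean(group2)) / standard_error


def degrees_of_freedom(group1, group2):
    """df = n1 + n2 - 2, as given on the slide."""
    return len(group1) + len(group2) - 2


# Hypothetical log-scale signals for one gene, 4 replicates per group.
experimental = [10.0, 11.0, 12.0, 11.0]
control = [8.0, 9.0, 8.0, 9.0]

t = t_statistic(experimental, control)
df = degrees_of_freedom(experimental, control)
print(f"t = {t:.2f}, df = {df}")
```

A larger gap between the group means or a smaller within-group variance both increase t, which is the behavior the slide describes.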

9 microarraysuccess.com Differential Expression
Fold change vs. p-value
2 groups, 4 replicates each. Mean, standard deviation, fold change and p-value calculated.
[Charts: mean signal, Exp vs. Con, for each gene]
- Gene 1: fold change = 5.3, p = 0.19
- Gene 2: fold change = 5.3, p = 0.03
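
The point of this slide, that identical fold changes can carry very different levels of statistical support, can be illustrated with a small sketch. The signal values below are hypothetical, not the data behind the slide, and the p-value comes from an exact permutation test on the difference in group means rather than from the t distribution, so the numbers will not match the slide's.

```python
from itertools import combinations


def fold_change(experimental, control):
    """Ratio of the group means on the raw signal scale."""
    return (sum(experimental) / len(experimental)) / (sum(control) / len(control))


def permutation_p_value(group1, group2):
    """Two-sided exact permutation p-value for the difference in means."""
    pooled = list(group1) + list(group2)
    n1, n2 = len(group1), len(group2)
    total = sum(pooled)
    observed = abs(sum(group1) / n1 - sum(group2) / n2)
    extreme = 0
    splits = 0
    # Enumerate every way of relabeling n1 of the pooled values as group 1.
    for indices in combinations(range(len(pooled)), n1):
        s1 = sum(pooled[i] for i in indices)
        difference = abs(s1 / n1 - (total - s1) / n2)
        if difference >= observed - 1e-12:
            extreme += 1
        splits += 1
    return extreme / splits


# Hypothetical genes with the same fold change but different variability.
gene1_exp, gene1_con = [5, 10, 20, 177], [2, 6, 12, 20]    # noisy replicates
gene2_exp, gene2_con = [50, 52, 54, 56], [9, 10, 10, 11]   # tight replicates

for name, exp, con in [("Gene 1", gene1_exp, gene1_con),
                       ("Gene 2", gene2_exp, gene2_con)]:
    print(name, round(fold_change(exp, con), 2),
          round(permutation_p_value(exp, con), 3))
```

With 4 replicates per group there are only 70 possible relabelings, so the smallest two-sided p-value this test can return is 2/70. That is one concrete reason more replicates give stronger inferences.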

10 microarraysuccess.com Differential Expression
Correction for multiple testing: methods for adjusting the p-value from a comparison test based on the number of tests performed. These adjustments help to reduce the number of false positives in an experiment.
- FWER: family-wise error rate corrections adjust the p-value so that it reflects the chance of at least one false positive being found in the list. Methods: Bonferroni, Holm, Westfall and Young maxT.
- FDR: false discovery rate corrections adjust the p-value so that it reflects the frequency of false positives in the list. Methods: Benjamini and Hochberg, SAM.
The FWER is more conservative, but the FDR is usually acceptable for discovery experiments, i.e. where a small number of false positives is acceptable.
Supporting material
- Dudoit, S., et al. (2003) Multiple hypothesis testing in microarray experiments. Statistical Science 18(1):
- Reiner, A., et al. (2003) Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 19(3):
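
Three of the corrections named on this slide are simple enough to sketch directly. The functions below are illustrative standard-library implementations of the Bonferroni, Holm and Benjamini and Hochberg adjustments. The function names and example p-values are hypothetical, and Westfall and Young maxT and SAM are not covered because they require resampling the expression data.

```python
def bonferroni(p_values):
    """FWER: multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """FWER, step-down: the k-th smallest p-value is multiplied by (m - k + 1)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(p_values):
    """FDR, step-up: the k-th smallest p-value is multiplied by m / k."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        # Work down from the largest p-value, keeping the running minimum.
        running_min = min(running_min, p_values[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
print("Bonferroni:", bonferroni(raw))
print("Holm:      ", holm(raw))
print("BH (FDR):  ", benjamini_hochberg(raw))
```

For any input, the Benjamini and Hochberg values are no larger than the Holm values, which are no larger than the Bonferroni values. This matches the slide's remark that FWER control is the more conservative choice.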

11 microarraysuccess.com Differential Expression
Cluster analysis: clustering methods are descriptive or exploratory tools that can be used to identify groups within complex datasets. Clustering methods can be used to identify patterns of gene expression in microarray datasets.
- Visualization: methods such as hierarchical clustering can be used to help identify patterns in a large dataset.
- Partitioning: this type of cluster analysis can be used to separate data into discrete groups or clusters.
  - K-means
  - PAM (partitioning around medoids)
Cluster analysis is used to identify patterns of gene expression within large datasets and to segregate those genes based on these patterns.

12 microarraysuccess.com Differential Expression
Cluster analysis is used to identify groups, or clusters, of similar objects (gene expression profiles) on the basis of a set of feature vectors (expression measurements). Two general types:
- Hierarchical methods, either divisive or agglomerative. These methods provide a hierarchy of clusterings, from the smallest set, where all objects are in one cluster, through to the largest set, where each observation is in its own cluster.
- Partitioning methods. These usually require the number of clusters to be specified in advance. A mechanism for apportioning objects to clusters must then be chosen.
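
To make the partitioning idea concrete, here is a minimal k-medoids sketch. It is a simplified alternating scheme (assign each profile to its nearest medoid, then re-pick each cluster's medoid), not the full build-and-swap PAM algorithm. The starting medoids, the Euclidean distance and the expression profiles are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(profiles, medoids):
    """Label each profile with the position of its nearest medoid."""
    return [
        min(range(len(medoids)),
            key=lambda k: euclidean(profile, profiles[medoids[k]]))
        for profile in profiles
    ]


def k_medoids(profiles, k, max_iterations=100):
    # Assumption: the first k profiles serve as the starting medoids.
    medoids = list(range(k))
    for _ in range(max_iterations):
        labels = assign(profiles, medoids)
        new_medoids = []
        for cluster in range(k):
            members = [i for i, label in enumerate(labels) if label == cluster]
            if not members:
                # Keep the old medoid if a cluster ends up empty.
                new_medoids.append(medoids[cluster])
                continue
            # New medoid: the member with the smallest total distance
            # to the other members of its cluster.
            new_medoids.append(min(
                members,
                key=lambda i: sum(euclidean(profiles[i], profiles[j])
                                  for j in members),
            ))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(profiles, medoids)


# Hypothetical expression profiles across three conditions.
profiles = [
    [1.0, 2.0, 3.0],  # rising
    [1.1, 2.1, 3.2],  # rising
    [3.0, 2.0, 1.0],  # falling
    [0.9, 1.9, 2.9],  # rising
    [3.1, 2.2, 0.9],  # falling
    [2.9, 1.8, 1.1],  # falling
]
medoids, labels = k_medoids(profiles, k=2)
print(medoids, labels)
```

As the slide notes, the number of clusters (`k` here) has to be supplied up front; the hierarchical methods avoid that choice by returning the whole hierarchy.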

13 microarraysuccess.com Differential Expression
Cluster analysis example (1846 genes)
[Figure: hierarchical and partitioning cluster analysis panels]

14 microarraysuccess.com Biological Significance
Once a list of differentially expressed genes has been generated, the next task is determining the biological significance of those genes. By combining the identification of broad biological themes with the ability to focus on a particular gene, it is possible to rapidly characterize the biology involved in a particular experiment and to identify particular genes of interest from a list of potential targets.
Gene Annotation Sources
- GenBank: sequence information for particular clones and probes.
- UniGene: sets of sequences that represent unique genes, and information for those genes.
- LocusLink: curated sequence and descriptive information about genes.
- Gene Ontology: controlled vocabularies for the description of molecular function, biological process and cellular component of gene products.
- KEGG: information about both regulatory and metabolic pathways for genes.
- OMIM: a catalog of human genes and genetic disorders.
- HomoloGene: a gene homology tool that compares nucleotide sequences between pairs of organisms.
Utilizing these resources allows a researcher to determine the biological significance of a gene of interest.

15 microarraysuccess.com Biological Significance The Gene Ontology Consortium Provides controlled vocabularies for the description of gene products. These terms are to be used as attributes of gene products by collaborating databases, facilitating uniform queries across them.

16 microarraysuccess.com Biological Significance The three organizing principles of GO are molecular function, biological process and cellular component. These are the highest level groupings of genes. Definitions of the terms within all three of these ontologies are contained in a single definitions file, available from the Gene Ontology Consortium.

17 Biological Significance microarraysuccess.com

18 microarraysuccess.com Biological Significance Gene Ontology Terms Included in LocusLink Annotation

19 microarraysuccess.com Biological Significance
- Single genes (gene summaries): information from annotation sources can be used to identify targets for further study, or to prioritize the genes in a list.
- Gene lists (ontology report, pathway report): by examining the gene list as a whole, broad biological themes may be identified.

21 Accessible GeneChip Data Analysis N. Eric Olson, Ph.D. VizX Labs, LLC

22 GeneSifter Microarray Data Analysis
Convenient access: Web-based; Secure; Database; Data Annotation (MIAME); Multiple upload tools (Affymetrix (CHP, CEL, RMA); Agilent, CodeLink, etc.; Custom; GEO)
Powerful and accessible tools for determining Statistical Significance
- R-based statistics (Bioconductor)
- Comparison tests: t-test, Welch's t-test, Wilcoxon rank sum test, ANOVA
- Correction for multiple testing: Bonferroni, Holm, Westfall and Young maxT, Benjamini and Hochberg
- Unsupervised clustering: PAM, CLARA, hierarchical clustering
- Silhouettes

23 GeneSifter Microarray Data Analysis Integrated tools for determining Biological Significance One Click Gene Summary Ontology Report Pathway Report Search by ontology terms Search by KEGG terms or Chromosome

25 GeneSifter - Analysis Examples
Examples: 2 groups (Normal B-cells vs. Centroblasts); 3+ groups (Myeloid Differentiation Time Series)
- Data preparation: data upload, data management, normalization
- Differential expression (2 groups): fold change, quality, t-test, false discovery rate
- Differential expression (3+ groups): fold change, quality, ANOVA, false discovery rate
- Visualization: hierarchical clustering, PCA
- Partitioning: PAM, silhouettes
- Biological significance: gene annotation, ontology report, pathway report

26 Pairwise Analysis
Transcriptional Analysis of the B-Cell Germinal Center Reaction. Klein, U., Y. Tu, G.A. Stolovitzky, J.L. Keller, J. Haddad, Jr., V. Miljkovic, G. Cattoretti, A. Califano, and R. Dalla-Favera. Proc. Natl. Acad. Sci. U. S. A. 100:
Experimental Design
- To study the germinal center reaction, B-cell populations representing four stages of the reaction were isolated from human tonsils: naïve B-cells, centroblasts, centrocytes and memory B-cells.
- Five biological replicates for each stage.
- Data for U95Av2 GeneChips available as MAS5 text files (CHP values) or as CEL files.
Analysis
- MAS5 signal and detection calls loaded.

27 Pairwise Analysis
- 5 biological replicates
- Normalized in MAS 5
- t-test
- Quality filter 1 (filters out genes with Absent, "A", detection calls)
- 1.5 fold-change
- Benjamini and Hochberg (FDR)

28 Pairwise Analysis Gene List

29 One-Click Gene Summary

30 Pairwise Analysis Gene List

31 Ontology Report

32 Ontology Report: z-score
- R = total number of genes meeting selection criteria
- N = total number of genes measured
- r = number of genes meeting selection criteria with the specified GO term
- n = total number of genes measured with the specified GO term
Reference: Scott W Doniger, Nathan Salomonis, Kam D Dahlquist, Karen Vranizan, Steven C Lawlor and Bruce R Conklin; MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data, Genome Biology 2003, 4:R7
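
The formula itself did not survive in this transcript. As a sketch, the function below uses the z-score as it is commonly given for the cited MAPPFinder method: the observed count r minus its expected value n·R/N, divided by the standard deviation of r under sampling without replacement. Treat the exact form as an assumption to check against the reference; the function name and the counts are hypothetical.

```python
import math


def go_z_score(r, n, R, N):
    """z-score for over- or under-representation of one GO term.

    r: genes meeting the selection criteria that carry the GO term
    n: genes measured that carry the GO term
    R: genes meeting the selection criteria
    N: genes measured
    """
    expected = n * R / N
    # Variance of r when R genes are drawn from N without replacement.
    variance = n * (R / N) * (1 - R / N) * (1 - (n - 1) / (N - 1))
    return (r - expected) / math.sqrt(variance)


# Hypothetical counts: 10,000 genes measured, 500 selected,
# 100 genes carry the term, 15 of those are in the selected list.
z = go_z_score(r=15, n=100, R=500, N=10000)
print(round(z, 2))
```

A positive z-score means the term appears in the gene list more often than expected by chance; a negative one means it appears less often.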

33 Z-score Report

34 KEGG Report

36 Project Analysis
Genomic analysis of myeloid differentiation program
- Retinoic acid induced differentiation of neutrophils from mouse MPRO cell line
- 6 timepoints after retinoic acid addition: 1, 2, 3, 4, 5, 6 days
- 4 biological replicates per timepoint
- Affymetrix Mouse U74A GeneChip
- CEL files, RMA transformation
Goal: identify gene expression changes associated with neutrophil differentiation

37 Project Analysis - Three or more groups
Time series
- User-defined grouping of conditions
- Average intensities
- Data normalized (RMA)
- Data log transformed
- Set saved for further analysis

38 Project Analysis - Filtering ANOVA, 2 fold change cutoff, Benjamini and Hochberg (FDR)
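
For three or more groups, the comparison statistic is the one-way ANOVA F statistic. The sketch below computes F and a fold-change measure for a single gene with the standard library only. It assumes log2-transformed signals, so fold change is taken as 2 raised to the largest difference between group means; the function names and data are hypothetical. Turning F into a p-value needs the F distribution, which is omitted here, and the resulting p-values would then go through the Benjamini and Hochberg adjustment.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA: returns (F, df_between, df_within)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the replicates around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


def max_fold_change(groups):
    """Largest fold change between any two group means (log2 input assumed)."""
    means = [mean(g) for g in groups]
    return 2 ** (max(means) - min(means))


# Hypothetical log2 signals for one gene at three timepoints.
timepoints = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value, df1, df2 = anova_f(timepoints)
passes_cutoff = max_fold_change(timepoints) >= 2.0
print(f_value, df1, df2, passes_cutoff)
```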

39 Project Analysis Gene List Summary Hierarchical Clustering z-score report

40 Project Analysis - Partitioning
Segregation of expression patterns using PAM
[Figures: two PAM partitions, with mean silhouette widths of 0.45 and 0.71]
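
The mean silhouette width reported on this slide summarizes how well separated a partition is. For each object, a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster; the silhouette is (b - a) / max(a, b), and values near 1 indicate tight, well-separated clusters. The sketch below assumes Euclidean distance and at least two clusters, and uses hypothetical one-dimensional data.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_distance(point, indices, points):
    return sum(euclidean(point, points[j]) for j in indices) / len(indices)


def mean_silhouette_width(points, labels):
    clusters = sorted(set(labels))
    widths = []
    for i, point in enumerate(points):
        own = [j for j, label in enumerate(labels)
               if label == labels[i] and j != i]
        if not own:
            # Convention: an object alone in its cluster has silhouette 0.
            widths.append(0.0)
            continue
        a = mean_distance(point, own, points)
        # b: mean distance to the closest cluster the object is not in.
        b = min(
            mean_distance(
                point,
                [j for j, label in enumerate(labels) if label == cluster],
                points,
            )
            for cluster in clusters if cluster != labels[i]
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


# Hypothetical data: two obvious groups on a line.
points = [[0.0], [1.0], [10.0], [11.0]]
good = mean_silhouette_width(points, [0, 0, 1, 1])
poor = mean_silhouette_width(points, [0, 1, 0, 1])
print(round(good, 3), round(poor, 3))
```

Comparing the mean width across candidate partitions, as the slide does with 0.45 and 0.71, is one way to choose the number of clusters for PAM.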

41 Project Analysis Biological Significance

42 Project Analysis Biological Significance

43 Project Analysis Biological Significance Search by Ontology

44 Project Analysis Biological Significance Transcription factor activity

45 Project Analysis Biological Significance Biological processes associated with transcription factor activity

46 GeneSifter - Overview
- Accessible: web-based, easy to use
- Powerful statistics: R and Bioconductor
- Integrated with genome databases: annotation for individual genes, reports for gene lists
- Optimized for Affymetrix GeneChip arrays: easy file upload and RMA, Affymetrix-specific support

47 Thank You Trial account, tutorials, sample data and Data Center Eric Olson