The microarray data analysis process - from raw data to biological significance

1 The microarray data analysis process - from raw data to biological significance Thank you for waiting. The presentation will be starting in a few minutes at 6AM Pacific Daylight Time. During this webinar you will be in listen only mode, so if you have a question, please type it into the Question and Answer panel at the end of the presentation. Dr. Olson will try to answer as many questions as possible at the end of the presentation. We will also make the slides and a recording of this presentation available after the webinar. Please contact Dr. Olson at eric@microarraysuccess.org if you would like a copy of the slides.

2 The microarray data analysis process - from raw data to biological significance N. Eric Olson eric@microarraysuccess.org

3 The microarray data analysis process - from raw data to biological significance
General microarray data analysis workflow
- From raw data to biological significance
- Comparison statistics
- Correction for multiple testing
- Clustering
- Biological significance
- Public microarray databases
GeneSifter Overview
Analysis Examples
- Two group: Huntington's disease peripheral blood
- Time series (1 factor): Drosophila innate immune response
- Time series (2 factor): Gene expression after MI in mouse

4 The Microarray Data Analysis Process
- Experimental Design: number of groups, factors, replicates
- Data management: data, sample annotation, gene annotation, databases
- Differential Expression: comparison statistics, correction for multiple testing, clustering
- Biological significance: individual genes, biological themes
- Platform Selection: one-color, two-color, platform comparisons
- System access: ease of use, accessibility
- Making data public and using public data: MIAME, journals, GEO, meta-analysis

6 Experiment Design
Type of experiment
- Two groups: normal vs. cancer, control vs. treated
- Three or more groups, single factor: time series, dose response, multiple treatment
- Four or more groups, multiple factors: time series with control and treated cells
The type of experiment and the number of groups and factors determine the statistical methods needed to detect differential expression.
Replicates
- The more the better, but at least 3
- Biological replicates are better than technical replicates
Rigorous statistical inferences cannot be made with a sample size of one. The more replicates, the stronger the inference.
References
- Pavlidis P, Li Q, Noble WS. The effect of replication on gene expression microarray experiments. Bioinformatics Sep 1;19(13):
- Experimental Design and Other Issues in Microarray Studies - Kathleen Kerr -

7 Differential Expression
The fundamental goal of microarray experiments is to identify genes that are differentially expressed in the conditions being studied. Comparison statistics help identify differentially expressed genes. Cluster analysis identifies patterns of gene expression and segregates subsets of genes based on these patterns.
Statistical Significance
- Fold change: does not address the reproducibility of the observed difference and cannot be used to determine statistical significance.
- Comparison statistics:
  - 2 groups: t-test, Welch's t-test, Wilcoxon rank sum
  - 3 or more groups, single factor: one-way ANOVA, Kruskal-Wallis
  - 4 or more groups, multiple factors: two-way ANOVA
Comparison tests require replicates and use the variability within the replicates to assign a confidence level as to whether the gene is differentially expressed.
Supporting material: Draghici S. (2002) Statistical intelligence: effective analysis of high-density microarray data. Drug Discov Today, 7(11 Suppl): S55-63.

8 t-test for comparison of two groups
Calculate the t statistic:
$$ t = \frac{\text{difference between groups}}{\text{difference within groups}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} $$
where $\bar{x}_i$ is the mean of group $i$, $s_i^2$ is its variance, and $n_i$ is its sample size.
Determine the confidence level for t (the probability that t could occur by chance), with degrees of freedom
$$ df = n_1 + n_2 - 2 $$
The larger the difference between the groups and the lower the variance, the bigger t will be and the lower p will be.
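
The calculation on this slide can be sketched in a few lines of Python. This is only an illustration for a single gene: the expression values are hypothetical, the function name `t_statistic` is ours, and the p-value lookup (which needs the t distribution from a statistics package) is left out.

```python
import math
from statistics import mean, variance

def t_statistic(group1, group2):
    """t statistic using the denominator shown on the slide."""
    n1, n2 = len(group1), len(group2)
    s1_sq, s2_sq = variance(group1), variance(group2)  # sample variances
    t = (mean(group1) - mean(group2)) / math.sqrt(s1_sq / n1 + s2_sq / n2)
    df = n1 + n2 - 2  # degrees of freedom as stated on the slide
    return t, df

# Hypothetical log2 expression values for one gene
control = [7.1, 7.3, 6.9, 7.2]
treated = [8.0, 8.4, 8.1, 8.3]

t, df = t_statistic(treated, control)
print(f"t = {t:.2f}, df = {df}")
```

A large difference between the group means combined with small within-group variances gives a large t, which is the behaviour the slide describes.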

9 Analysis of Variance (ANOVA)
- Like the t-test, identifies genes with large differences between groups and small differences within groups
- For use with 3 or more groups
- One-way and two-way
  - One-way examines the effect of one factor on gene expression
  - Two-way examines the effects of two factors on gene expression as well as the interaction of the two factors
References
- Pavlidis P. Using ANOVA for gene selection from microarray studies of the nervous system. Methods Dec;31(4):
- Glantz S. Primer of Biostatistics. 5th Edition. McGraw-Hill.
- Glantz S, Slinker B. Primer of Regression and Analysis of Variance. McGraw-Hill.
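
As a rough illustration of "large differences between groups and small differences within groups", the one-way ANOVA F statistic for a single gene can be computed as below. The data and the function name `one_way_anova_f` are hypothetical, and converting F to a p-value (via the F distribution) is not shown.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the replicates around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical expression of one gene at three time points, 3 replicates each
time_points = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f, df_b, df_w = one_way_anova_f(time_points)
print(f"F = {f:.1f} on ({df_b}, {df_w}) degrees of freedom")
```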

10 Two-way ANOVA Example
Triple treatment in Huntington's Disease model (R6/2 mice, GSE857, Affymetrix U74Av2)
Factors: Treatment (-, +) and Disease (WT, R6/2)
[Figure: gene expression patterns across WT -, WT +, R6/2 - and R6/2 + illustrating a disease effect, a treatment effect, an interaction, and a disease and treatment effect with no interaction]

11 Two-way ANOVA compared to t-test
Triple treatment in Huntington's Disease model (R6/2 mice, GSE857, Affymetrix U74Av2)
Factors: Treatment (-, +) and Disease (WT, R6/2)
[Figure: disease differences found by t-test compared with two-way ANOVA]
Pavlidis P, Noble WS. Analysis of strain and regional variation in gene expression in mouse brain. Genome Biol. 2001;2(10):RESEARCH0042.

12 Differential Expression
Correction for multiple testing: methods for adjusting the p-value from a comparison test based on the number of tests performed. These adjustments help to reduce the number of false positives in an experiment.
- FWER: Family-Wise Error Rate corrections adjust the p-value so that it reflects the chance of at least 1 false positive being found in the list. Methods: Bonferroni, Holm, Westfall and Young maxT
- FDR: False Discovery Rate corrections adjust the p-value so that it reflects the frequency of false positives in the list. Methods: Benjamini and Hochberg, SAM
The FWER methods are more conservative, but the FDR methods are usually acceptable for discovery experiments, i.e. where a small number of false positives is acceptable.
References
- Dudoit, S., et al. (2003) Multiple hypothesis testing in microarray experiments. Statistical Science 18(1):
- Reiner, A., et al. (2003) Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 19(3):
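
To make the difference between the two families concrete, here is a minimal sketch of three of the adjustments named above (Bonferroni, Holm, and Benjamini and Hochberg), written from their standard textbook definitions. The p-values are hypothetical and the function names are ours. Westfall and Young maxT and SAM rely on permutations and are not shown.

```python
def bonferroni(pvals):
    """FWER: multiply every p-value by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    """FWER: step-down version of Bonferroni."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

def benjamini_hochberg(pvals):
    """FDR: step-up adjustment."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):  # walk from the largest p-value down
        i = order[rank]
        running_min = min(running_min, pvals[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for five genes
raw = [0.001, 0.008, 0.02, 0.03, 0.20]
for name, method in [("Bonferroni", bonferroni), ("Holm", holm),
                     ("Benjamini-Hochberg", benjamini_hochberg)]:
    kept = sum(p < 0.05 for p in method(raw))
    print(f"{name}: {kept} genes with adjusted p < 0.05")
```

On this toy input the two FWER methods keep fewer genes than the FDR method, which is the "more conservative" behaviour the slide refers to.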

13 Multiple Hypothesis Testing in Microarray Experiments
- Per comparison error rate (PCER): the probability of error for each comparison
- Family-wise error rate (FWER): the probability of at least one error across all comparisons
- False discovery rate (FDR): the expected proportion of errors among your results
An error means a false positive.
Example: 1000 genes, 50 called differentially expressed, using a cutoff of 5%
- PCER: a 5% PCER means a 5% chance of error for each comparison, so perhaps 50 errors for 1000 comparisons. This is not acceptable; you cannot be confident that any of your results are real (not errors).
- FWER: a 5% FWER means there is a 5% chance that you have at least 1 error. This is very good but a very conservative requirement; you are confident that all of your results are real.
- FDR: with a 5% FDR you would expect 2.5 errors (5% of 50). This is probably acceptable; you are confident that most of your results are real.

14 Correction Example
CodeLink Ms 10K BioArray
- Lacrimal + Placebo vs. Lacrimal + Androgen
- 3 biological replicates
- 9982 comparisons, t-test
Results
- 5% PCER: 2458 genes (estimate 499 errors)
- 5% FWER: 19 genes (5% chance of 1 error)
- 5% FDR: 904 genes (estimate 45 errors)
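
The error estimates quoted on slides 13 and 14 are simple products, which the following sketch reproduces. The helper name `expected_errors` is ours. The key point is which count the rate is applied to: all comparisons for PCER, only the selected genes for FDR.

```python
def expected_errors(rate, count):
    """Expected number of false positives for a given error rate."""
    return rate * count

# Slide 13: 1000 genes tested, 50 called differentially expressed
pcer_errors = expected_errors(0.05, 1000)  # rate applies to every comparison
fdr_errors = expected_errors(0.05, 50)     # rate applies to the selected genes

# Slide 14: 9982 comparisons, 904 genes selected at 5% FDR
codelink_pcer = expected_errors(0.05, 9982)
codelink_fdr = expected_errors(0.05, 904)

print(round(pcer_errors, 1), round(fdr_errors, 1))
print(round(codelink_pcer, 1), round(codelink_fdr, 1))
```

The CodeLink values come to roughly 499 and 45, in line with the estimates on the slide.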

16 Identification and partitioning of expression patterns
Cluster analysis: clustering methods are descriptive or exploratory tools that can be used to identify groups within complex datasets.
- Visualization: methods such as hierarchical clustering help identify patterns in a large dataset. Hierarchical methods provide a hierarchy of clusterings, from the coarsest, where all objects (gene expression profiles) are in one cluster, through to the finest, where each observation is in its own cluster.
- Partitioning: this type of cluster analysis separates data into discrete groups or clusters. Partitioning methods divide the data (list of genes) into a prespecified number (K) of mutually exclusive groups based on a feature vector (expression profile).
  - K-means
  - PAM (Partitioning around medoids)
Cluster analysis is used to identify patterns of gene expression within large datasets and to segregate genes based on these patterns.
References
- Quackenbush J. Computational analysis of microarray data. Nat Rev Genet Jun;2(6): Review.
- Kaufman L, Rousseeuw PJ: Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley; 1990.
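
The idea of partitioning profiles into K mutually exclusive groups around medoids can be sketched as follows. Note that this is a simplified alternating k-medoids iteration, not the full PAM build/swap procedure of Kaufman and Rousseeuw. The expression profiles, the naive initialisation, and the function names are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(profiles, medoids):
    """Label each profile with the index of its nearest medoid."""
    return [min(range(len(medoids)),
                key=lambda c: euclidean(p, profiles[medoids[c]]))
            for p in profiles]

def k_medoids(profiles, k, max_iter=100):
    medoids = list(range(k))  # naive start: the first k profiles
    for _ in range(max_iter):
        labels = assign(profiles, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(medoids[c])
                continue
            # The medoid is the member with the smallest total distance
            # to the other members of its cluster.
            best = min(members, key=lambda i: sum(
                euclidean(profiles[i], profiles[j]) for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(profiles, medoids), medoids

# Hypothetical expression profiles (4 time points per gene)
profiles = [
    (0.0, 1.0, 2.0, 3.0),   # rising
    (0.0, 1.1, 2.1, 2.9),   # rising
    (3.0, 2.0, 1.0, 0.0),   # falling
    (3.1, 2.0, 0.9, 0.0),   # falling
    (0.0, 0.9, 2.0, 3.2),   # rising
]
labels, medoids = k_medoids(profiles, k=2)
print(labels, medoids)
```

Unlike K-means, each cluster centre here is an actual gene profile from the data, which is what distinguishes medoid-based methods.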

17 Identification and partitioning of expression patterns
[Figure: hierarchical and partitioning cluster analysis of 1846 differentially expressed genes from the FVB heart development time series]

18 Differential Expression - Gene Lists

19 Biological Significance
Gene Annotation Sources
- UniGene: organizes GenBank sequences into a non-redundant set of gene-oriented clusters. Gene titles are assigned to the clusters, and these titles are commonly used by researchers to refer to that particular gene.
- LocusLink (Entrez Gene): provides a single query interface to curated sequence and descriptive information, including function, about genes.
- Gene Ontologies: the Gene Ontology Consortium provides controlled vocabularies for the description of the molecular function, biological process and cellular component of gene products, which can be used by databases such as Entrez Gene.
- KEGG: the Kyoto Encyclopedia of Genes and Genomes provides information about both regulatory and metabolic pathways for genes.
- Reference Sequences: the NCBI Reference Sequence project (RefSeq) provides reference sequences for both the mRNA and protein products of included genes.

20 Gene annotation for individual genes

21 Ontology reports identify biological themes

22 Ontology Report : z-score
- R = total number of genes meeting selection criteria
- N = total number of genes measured
- r = number of genes meeting selection criteria with the specified GO term
- n = total number of genes measured with the specified GO term
Reference: Scott W Doniger, Nathan Salomonis, Kam D Dahlquist, Karen Vranizan, Steven C Lawlor and Bruce R Conklin; MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data, Genome Biology 2003, 4:R7
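
The formula itself did not survive in this transcript. Assuming the report uses the z-score defined in the cited MAPPFinder paper, it compares the observed count r with the count expected by chance, n(R/N), scaled by the standard deviation under sampling without replacement:

$$ z = \frac{r - n\frac{R}{N}}{\sqrt{n\left(\frac{R}{N}\right)\left(1 - \frac{R}{N}\right)\left(1 - \frac{n - 1}{N - 1}\right)}} $$

A minimal sketch under that assumption, with hypothetical counts and a function name of our own:

```python
import math

def go_z_score(r, n, R, N):
    """z-score for one GO term, following the MAPPFinder definition."""
    expected = n * R / N
    variance = n * (R / N) * (1 - R / N) * (1 - (n - 1) / (N - 1))
    return (r - expected) / math.sqrt(variance)

# Hypothetical counts: 10000 genes measured, 500 meet the criteria,
# 100 genes carry the GO term, 15 of those meet the criteria.
z = go_z_score(r=15, n=100, R=500, N=10000)
print(f"z = {z:.2f}")
```

A positive z-score means the term is over-represented in the gene list relative to chance, and a negative one means it is under-represented.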

24 The Gene Expression Omnibus (GEO)
- Gene expression data repository (mostly microarrays)
- Over 3000 data sets
- All array platforms represented
- Searchable by platform, species, experiment annotation
- Downloadable data
Using the Gene Expression Omnibus (

26 GeneSifter Microarray Data Analysis
Accessibility
- Web-based
- Secure
Data management
- Data annotation (MIAME)
- Multiple upload tools: CodeLink, Affymetrix, Illumina, Agilent, Custom
Differential Expression - Powerful, accessible tools for determining Statistical Significance
- R based statistics, Bioconductor
- Comparison tests: t-test, Welch's t-test, Wilcoxon rank sum test, one-way ANOVA, two-way ANOVA
- Correction for multiple testing: Bonferroni, Holm, Westfall and Young maxT, Benjamini and Hochberg
- Unsupervised clustering: PAM, CLARA, hierarchical clustering
- Silhouettes

27 GeneSifter Microarray Data Analysis
Integrated tools for determining Biological Significance
- One-Click Gene Summary
- Ontology Report
- Pathway Report
- Search by ontology terms
- Search by KEGG terms or chromosome

28 The GeneSifter Data Center
- Free resource: training, research, publishing
- 6 areas: Cardiovascular, Cancer, Endocrinology, Neuroscience, Immunology, Oral Biology
- Access to: data, analysis summary, tutorials, WebEx

29 The GeneSifter Data Center

31 Analysis Workflow Examples
- 2 groups (HD and healthy blood): t-test, BH (FDR), up regulated and down regulated gene lists
- 5 groups, single factor (Drosophila innate immune response time series): one-way ANOVA, BH (FDR), clustering, gene lists
- 18 groups, two factors (gene expression after myocardial infarction in mouse): two-way ANOVA, BH (FDR), clustering, gene lists
Gene lists are then examined for:
- Individual genes of interest
- Biological themes (pathways, molecular functions, etc.)

32 Background - Data
Human blood expression for Huntington's disease versus control, CodeLink
CodeLink Human 20K Bioarray
Borovecki F, Lovrecic L, Zhou J, Jeong H, Then F, Rosas HD, Hersch SM, Hogarth P, Bouzou B, Jensen RV, Krainc D. Genome-wide expression profiling of human blood reveals biomarkers for Huntington's disease. Proc Natl Acad Sci U S A Aug 2;102(31):

33 Pairwise Analysis
- Select group 1: 14 normal
- Select group 2: 12 Huntington's

34 Pairwise Analysis
- Already normalized (median)
- t-test
- Quality filter 0.75 (filters out genes with signal less than 0.75)
- Benjamini and Hochberg (FDR)
- Log transform data

35 Pairwise Analysis Gene List

36 One-Click Gene Summary

37 Ontology Report

38 Z-score Report

39 Z-score Report

40 Pairwise Analysis - Summary
Human blood expression for Huntington's disease versus control, CodeLink
- 12 HD, 14 Control
- t-test, Benjamini and Hochberg (FDR): ~20,000 genes -> 5684 genes
- Pattern selection and z-scores:
  - 2606 increased in HD
    - Biological processes: Protein biosynthesis (104), Ubiquitin cycle (123), RNA splicing (53)
    - KEGG: Oxidative phosphorylation (35), Apoptosis (22)
  - 3078 decreased in HD
    - Biological processes: Neurogenesis (90), Cell adhesion (120), Sodium ion transport (29), G-protein coupled receptor signaling (114)
    - KEGG: Neuroactive ligand-receptor interaction (56)

41 Mouse model Huntington's Disease
- 3 WT untreated
- 3 WT treated
- 4 R6/2 untreated
- 3 R6/2 treated

42 RNA Splicing in mouse model
- t-test (3 WT untreated, 4 R6/2 untreated): 500+ genes
- Two-way ANOVA (WT untreated and treated, R6/2 untreated and treated), significant interaction: 5 RNA splicing genes

43 Pairwise Analysis
Human blood expression for Huntington's disease versus control, Affymetrix U133A Human Genome Array
MAS 5 signal
Borovecki F, Lovrecic L, Zhou J, Jeong H, Then F, Rosas HD, Hersch SM, Hogarth P, Bouzou B, Jensen RV, Krainc D. Genome-wide expression profiling of human blood reveals biomarkers for Huntington's disease. Proc Natl Acad Sci U S A Aug 2;102(31):

45 Project Analysis
Drosophila innate immune response time series
Microarray Platform: Affymetrix GeneChip Drosophila Genome Array
Reference: CEL files were obtained from http://www.fruitfly.org. De Gregorio E, Spellman PT, Rubin GM, Lemaitre B. Genome-wide analysis of the Drosophila immune response by using oligonucleotide microarrays. Proc Natl Acad Sci U S A Oct 23;98(22):
Summary - Gene expression profiles in adult fruit flies subjected to microbial infection (1.5, 3, 6 and 12 hours) were monitored using the Affymetrix GeneChip Drosophila Genome Array. CEL files were loaded and RMA was used to normalize.
Innate Immune Responses of Drosophila
[Figure caption: Cellular immune reactions consist of phagocytosis, encapsulation, and melanization. (B) A dead and melanized crystal cell phagocytosed by a plasmatocyte. (C) Encapsulation of an egg of a Drosophila parasite. (D) Clot formation occurs during wound healing. (E) Melanization occurs in response to intruding pathogens or parasites and is also observed during wound healing. (F) Humoral immune reaction. The expression of antimicrobial peptides in the larval fat body is induced by microbes. Antimicrobial peptides are released from the fat body into the hemolymph. This response is therefore systemic.]

46 Time series - Filtering ANOVA, 1.5 fold change cutoff, Benjamini and Hochberg (FDR)

47 Time series - Visualization Visualization of 624 genes

48 Time series - Partitioning Segregation of expression patterns using PAM

49 Time series - Partitioning
Silhouette widths are used to find the best number of clusters
  k    mean sil. width
  2    0.28
  4    0.39
  6    0.37
Dudoit S, Fridlyand J. A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol Jun 25;3(7):RESEARCH0036. Epub 2002 Jun 25.
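
For readers unfamiliar with silhouette widths, the sketch below computes the mean silhouette width from its standard definition: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the width is (b - a) / max(a, b). The data and function names are hypothetical, and at least two clusters are assumed.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_silhouette_width(points, labels):
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

# Hypothetical one-dimensional data with two obvious groups
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = mean_silhouette_width(points, [0, 0, 1, 1])
bad = mean_silhouette_width(points, [0, 1, 0, 1])
print(f"good partition: {good:.2f}, bad partition: {bad:.2f}")
```

In practice the partitioning is repeated for several values of k and the k with the largest mean silhouette width is preferred, as in the table on this slide.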

50 Time series Biological Significance Cluster 4

51 Time series Summary
Drosophila innate immune response time series
- 5 time points, 3 biological replicates
- ANOVA, Benjamini and Hochberg (FDR): ~14,000 genes -> 624 genes
- Hierarchical clustering
- PAM, silhouette widths
- Z-scores (Biological Process) by cluster:
  - Cluster: carbohydrate metabolism, lipid metabolism
  - Cluster: Toll signaling pathway, antifungal humoral response, I-kappaB kinase/NF-kappaB cascade
  - Cluster 3 (136 genes): protein catabolism, melanin biosynthesis
  - Cluster 4 (37 genes): antibacterial humoral response

53 CardioGenomics - Mouse Myocardial Infarction Model
- 6 time points after ligation of LAD artery (1 hr -> 8 wk)
- Three sites: sham LV, infarcted LV, non-infarcted LV
- Affymetrix Mouse U74Av2 Array
- MAS5 text files loaded (signal and detection call)
- CEL files also available (RMA or GC-RMA)

54 Project Analysis : Two-way ANOVA
- Factor One: Site (3 levels: LV, NILV, ILV)
- Factor Two: Time after ligation (6 levels: 1 hr, 4 hr, 24 hr, 48 hr, 1 wk, 8 wk)
[Figure: gene expression patterns across site and time illustrating a site effect, a time effect, and an interaction]
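
A two-factor design like this one (3 sites x 6 times = 18 groups) yields three F statistics per gene: one for each factor and one for their interaction. The sketch below shows the standard sums-of-squares calculation for a balanced design with replicates. It uses a hypothetical 2 x 2 layout rather than the full 3 x 6 design to keep the example short, the function name is ours, and p-values (from the F distribution) are not computed.

```python
from statistics import mean

def two_way_anova_f(cells):
    """F statistics for a balanced two-factor design.

    cells[i][j] holds the replicate values for level i of factor A
    and level j of factor B. Every cell must have the same size.
    """
    a, b, r = len(cells), len(cells[0]), len(cells[0][0])
    grand = mean(x for row in cells for cell in row for x in cell)
    cell_mean = [[mean(cell) for cell in row] for row in cells]
    a_mean = [mean(cell_mean[i]) for i in range(a)]
    b_mean = [mean(cell_mean[i][j] for i in range(a)) for j in range(b)]

    ss_a = b * r * sum((m - grand) ** 2 for m in a_mean)
    ss_b = a * r * sum((m - grand) ** 2 for m in b_mean)
    ss_ab = r * sum((cell_mean[i][j] - a_mean[i] - b_mean[j] + grand) ** 2
                    for i in range(a) for j in range(b))
    ss_error = sum((x - cell_mean[i][j]) ** 2
                   for i in range(a) for j in range(b) for x in cells[i][j])

    ms_error = ss_error / (a * b * (r - 1))
    return {
        "A": ss_a / (a - 1) / ms_error,
        "B": ss_b / (b - 1) / ms_error,
        "interaction": ss_ab / ((a - 1) * (b - 1)) / ms_error,
    }

# Hypothetical gene: expression rises only at the second site and late time
cells = [
    [[1.0, 3.0], [1.0, 3.0]],   # site 1: early, late
    [[1.0, 3.0], [5.0, 7.0]],   # site 2: early, late
]
f_stats = two_way_anova_f(cells)
print(f_stats)
```

A gene whose response over time differs between sites shows up in the interaction term, which is the effect filtered on in the slides that follow.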

55 Project Analysis : Two-way ANOVA

56 Project Analysis : Two-way ANOVA Identify Factors Indicate number of levels for each

57 Project Analysis : Two-way ANOVA Identify levels for each factor

58 Project Analysis : Two-way ANOVA Assign levels for each factor to cells

59 Project Analysis : Two-way ANOVA Include fold-change cutoff if desired Select effect to filter on first (you can switch later)

60 Two-way ANOVA : Interaction

61 Interaction - Visualization Visualization of 2513 genes (Interaction p < 0.001)

62 Interaction Partition Clustering

63 Interaction Cluster 1

64 Interaction Cluster 3

65 Two-way ANOVA : Summary
Gene expression following myocardial infarction
- 18 groups (3 biological replicates), 2 factors (Site and Time)
- Differential expression (two-way ANOVA, interaction): ~12,000 transcripts -> 2513 genes
- Visualization (hierarchical clustering)
- Partitioning (partitioning around medoids)
- Biological significance (Biological process and KEGG):
  - Glucan metabolism (10)
  - Oxidative phosphorylation (56)
  - Fatty acid metabolism (15)
  - Cell division (21)
  - Immune cell activation (18)
  - Regulation of actin cytoskeleton (26)
  - Cell adhesion (24)
  - Proteolysis (21)
  - TGF beta signaling (7)
  - Inflammatory response (19)
  - Regulation of cell cycle (21)
  - Toll-like receptor signaling (10)

66 Gene expression following myocardial infarction
Pathways, biological processes and molecular functions:
- Inflammatory response
- Positive regulation of cell proliferation
- Regulation of cell cycle
- Toll-like receptor signaling
- Jak-STAT signaling pathway
- Cell division
- Immune cell activation
- Small GTPase mediated signal transduction
- Regulation of actin cytoskeleton
- Leukocyte transendothelial migration
- Cell adhesion
- Cell cycle arrest
- Extracellular matrix structural constituent
- Proteolysis
- TGF beta signaling

68 Resources
Monthly Webinar Series
- Archived - Microarray Analysis of Gene Expression in Huntington's Disease Peripheral Blood - a Platform Comparison
- Archived - Using 2-way ANOVA to dissect gene expression following myocardial infarction in mice
- Archived - Using 2-way ANOVA to dissect the immune response to hookworm infection in mouse lung
- Archived - The microarray data analysis process - from raw data to biological significance
- Archived - Microarray analysis of gene expression in androgen-independent prostate cancer
- Archived - Microarray analysis of gene expression in male germ cell tumors

69 Thank You Trial account, tutorials, sample data and Data Center Eric Olson