Combining ANOVA and PCA in the analysis of microarray data


1 Combining ANOVA and PCA in the analysis of microarray data. Lutgarde Buydens, IMM, Analytical Chemistry, Radboud University Nijmegen, the Netherlands

2 Scientific staff: L. Buydens, W. Melssen, G. Postma, R. Wehrens. PhD students: T. Bloemberg, P. Krooshof, B. Üstün, 2 vacancies. External PhD: F. Lopes (VU, Amsterdam), J. Andries (Avans, Breda). Post doc: O. Othersen. Web:

3 Microarray technology: measuring expression of thousands of genes. Target DNA sequence: specific sequence for one gene

4 Microarray technology: measuring expression of thousands of genes. Figure labels: cell, mRNA, fluorescent label, hybridize

5 Data analysis for microarrays: image analysis, normalization, classification, clustering, gene selection, validation

6 Gene selection: differences in gene expression between treatment and control. Ratio cut-off (two-fold regulation); two-sample t-test per gene; multiple hypothesis testing (Bonferroni, SAM, Cyber-T test, ...)
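
The selection routes named on this slide can be sketched in a few lines. This is a minimal illustration, not the analysis from the talk: the function name `select_genes`, the log2 scale, both thresholds and the simulated data are assumptions made for the example.

```python
import numpy as np
from scipy import stats

def select_genes(treated, control, alpha=0.05, min_abs_log2_ratio=1.0):
    """Per-gene selection from two (n_genes, n_replicates) arrays of log2 data.

    Returns two boolean masks: genes passing the ratio cut-off, and genes
    passing a Bonferroni-corrected two-sample t-test.
    """
    n_genes = treated.shape[0]
    # On a log2 scale, a two-fold change is a difference of 1.
    log2_ratio = treated.mean(axis=1) - control.mean(axis=1)
    passes_ratio = np.abs(log2_ratio) >= min_abs_log2_ratio
    # One two-sample t-test per gene (row).
    _, p_values = stats.ttest_ind(treated, control, axis=1)
    # Bonferroni: multiply each p-value by the number of tests, cap at 1.
    p_bonferroni = np.minimum(p_values * n_genes, 1.0)
    passes_test = p_bonferroni < alpha
    return passes_ratio, passes_test

# Simulated data (assumption): 1000 genes, 6 replicates, first 10 genes changed.
rng = np.random.default_rng(0)
control = rng.normal(8.0, 0.1, size=(1000, 6))
treated = rng.normal(8.0, 0.1, size=(1000, 6))
treated[:10] += 3.0
passes_ratio, passes_test = select_genes(treated, control)
print(passes_ratio.sum(), passes_test.sum())
```

Bonferroni is the most conservative of the corrections listed; SAM and Cyber-T use different statistics and are not reproduced here.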

7 Performance of various methods on a known microarray dataset* (plot axes: selectivity, sensitivity). *Choe et al., Genome Biology, 2005

8 Osteogenesis data set. Aim: learn about stem cell differentiation and bone formation. Human mesenchymal stem cells; three treatments (DEX, BMP, VIT), one control; Affymetrix chips (U133A)

9 Experimental data: eleven time points (incl. t = 0), with time points close together in the beginning; three replicate measurements; hypercube of dimensions genes x 3 replicates x 11 time points x 4 treatments; preprocessed by Rosetta Resolver, log units

10 ANOVA for microarrays. Assume factors Treatment (T), Time (S) and Gene (G), fixed-effect model: $X_{ijkr} = \mu + T_i + S_j + G_k + (TS)_{ij} + (TG)_{ik} + (SG)_{jk} + (TSG)_{ijk} + \varepsilon_{ijkr}$. Dimensions: Time (10), Treatment (4), Genes (22283). M.K. Kerr et al. (2001), J. Computat. Biol., 7,
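
For a balanced design, each term of this fixed-effect model can be estimated from marginal means. The sketch below assumes a complete array with no missing values; the function name `anova_effects`, the axis order and the toy dimensions are choices made for the example, not taken from the talk.

```python
import numpy as np

def anova_effects(X):
    """Estimate the effects of the balanced three-way fixed-effect model.

    X has shape (I, J, K, R): treatment i, time j, gene k, replicate r.
    """
    cell = X.mean(axis=3)                                # cell means, (I, J, K)
    mu = cell.mean()                                     # grand mean
    T = cell.mean(axis=(1, 2), keepdims=True) - mu       # (I, 1, 1)
    S = cell.mean(axis=(0, 2), keepdims=True) - mu       # (1, J, 1)
    G = cell.mean(axis=(0, 1), keepdims=True) - mu       # (1, 1, K)
    TS = cell.mean(axis=2, keepdims=True) - mu - T - S   # (I, J, 1)
    TG = cell.mean(axis=1, keepdims=True) - mu - T - G   # (I, 1, K)
    SG = cell.mean(axis=0, keepdims=True) - mu - S - G   # (1, J, K)
    TSG = cell - mu - T - S - G - TS - TG - SG           # (I, J, K)
    resid = X - cell[..., np.newaxis]                    # epsilon, (I, J, K, R)
    return {"mu": mu, "T": T, "S": S, "G": G, "TS": TS,
            "TG": TG, "SG": SG, "TSG": TSG, "resid": resid}

# Toy data (assumption): 4 treatments, 10 times, 50 genes, 3 replicates.
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 10, 50, 3))
eff = anova_effects(X)
fitted = (eff["mu"] + eff["T"] + eff["S"] + eff["G"]
          + eff["TS"] + eff["TG"] + eff["SG"] + eff["TSG"])
print(np.allclose(X, fitted[..., np.newaxis] + eff["resid"]))
```

Each interaction array sums to zero over each of its own indices, which is why the interaction matrices can be analysed separately from the main effects.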

11 Main effects of time and treatment

12 Main effect of genes

13 ANOVA for microarrays. Assume factors Treatment (T), Time (S) and Gene (G), fixed-effect model: $X_{ijkr} = \mu + T_i + S_j + G_k + (TS)_{ij} + (TG)_{ik} + (SG)_{jk} + (TSG)_{ijk} + \varepsilon_{ijkr}$. Dimensions: Time (10), Treatment (4), Genes (22283). Interaction matrices contain interesting information. M.K. Kerr et al. (2001), J. Computat. Biol., 7,

14 PCA explained variance: GxT, GxTxS, whole dataset

15 PCA on the ANOVA interaction matrix: Gene x Treatment (22283 x 4). J. de Haan et al., Bioinformatics 23, (2007); H. F. Gollob, Psychometrika, 33, (1968)
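
A PCA of the gene-by-treatment interaction matrix can be sketched as a double centering followed by a singular value decomposition. The function name `interaction_pca` and the toy array are assumptions for the example; the input is taken to be the cell means of a balanced design.

```python
import numpy as np

def interaction_pca(cell_means):
    """PCA of the gene-by-treatment interaction matrix.

    cell_means has shape (I, J, K): treatment, time, gene.
    Returns the K x I interaction matrix, scores, loadings and the
    fraction of interaction variance explained per component.
    """
    # Average over time, then put genes in rows: shape (K, I).
    gt = cell_means.mean(axis=1).T
    # Double centering removes the grand mean and both main effects,
    # leaving the (TG)_ik estimates.
    tg = (gt - gt.mean(axis=0, keepdims=True)
             - gt.mean(axis=1, keepdims=True) + gt.mean())
    U, s, Vt = np.linalg.svd(tg, full_matrices=False)
    scores = U * s          # one row per gene
    loadings = Vt.T         # one row per treatment
    explained = s**2 / np.sum(s**2)
    return tg, scores, loadings, explained

# Toy data (assumption): 4 treatments, 10 times, 200 genes.
rng = np.random.default_rng(2)
cell_means = rng.normal(size=(4, 10, 200))
tg, scores, loadings, explained = interaction_pca(cell_means)
print(np.round(explained, 3))
```

Because the matrix is centered in both directions, at most I - 1 components carry variance (three for four treatments); the last one is zero up to rounding.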

16 Gene selection based on Hotelling $T^2$ ($\alpha = 10^{-5}$): 384 genes selected
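
A Hotelling $T^2$ selection on PCA scores could look like the sketch below. The slide does not state how many components were kept or which form of the critical limit was used, so both are assumptions here: two components, and the common F-based limit $A(n-1)/(n-A) \cdot F_{\alpha;A,n-A}$. The function name and the simulated matrix are likewise hypothetical.

```python
import numpy as np
from scipy import stats

def hotelling_t2_select(M, n_components=2, alpha=1e-5):
    """Flag rows of M whose Hotelling T^2 in PCA score space exceeds the limit."""
    n = M.shape[0]
    Mc = M - M.mean(axis=0)
    U, s, _ = np.linalg.svd(Mc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    # Variance of the scores on each retained component.
    score_var = s[:n_components] ** 2 / (n - 1)
    t2 = np.sum(scores ** 2 / score_var, axis=1)
    A = n_components
    # F-based critical limit (assumed form, see lead-in).
    limit = A * (n - 1) / (n - A) * stats.f.isf(alpha, A, n - A)
    return t2, limit, t2 > limit

# Simulated matrix (assumption): 2000 rows of noise, first 5 rows shifted.
rng = np.random.default_rng(3)
M = rng.normal(size=(2000, 4))
M[:5] += 20.0
t2, limit, selected = hotelling_t2_select(M)
print(limit, selected.sum())
```

Rows far from the origin in score space are the ones selected, which for an interaction matrix means genes with a strong treatment-specific response.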

17 PCA on the whole dataset. Unfolding: 4 treatments x 10 times. Figure labels: Time (10), Treatment (4), Genes (22283) for the data cube; Genes (22283) for the unfolded matrix
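
Unfolding turns the three-way array into an ordinary matrix with one row per gene and one column per treatment-time combination, after which a standard PCA applies. A minimal sketch, assuming the array is ordered (treatment, time, gene); the function names and toy sizes are hypothetical.

```python
import numpy as np

def unfold_by_gene(cell_means):
    """Unfold an (I, J, K) array (treatment, time, gene) into a K x (I*J) matrix.

    Column i*J + j holds treatment i at time j.
    """
    I, J, K = cell_means.shape
    return cell_means.reshape(I * J, K).T

def pca_scores(M, n_components=2):
    """Column-centered PCA via SVD; returns scores and loadings."""
    Mc = M - M.mean(axis=0)
    U, s, Vt = np.linalg.svd(Mc, full_matrices=False)
    return U[:, :n_components] * s[:n_components], Vt[:n_components].T

# Toy data (assumption): 4 treatments, 10 times, 300 genes.
rng = np.random.default_rng(4)
cell_means = rng.normal(size=(4, 10, 300))
unfolded = unfold_by_gene(cell_means)
scores, loadings = pca_scores(unfolded)
print(unfolded.shape, scores.shape, loadings.shape)
```

With this layout, each loading vector has 40 entries that can be read back as 4 time profiles of 10 points each.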

18 PCA on the whole dataset Loading plot

19 PCA on GxTxS interaction DEX

20 Gene selection. Gene x Treatment (22283 x 4). Based on Hotelling $T^2$ ($\alpha = 10^{-5}$): 384 genes selected

21 Validation of gene selection. Step 1: agreement with known biology (Gene Ontology enrichment, ROC curves). Step 2: prediction

22 Gene Ontology (GO). Table columns: identifier, GO term. GO terms listed: development; organ development; morphogenesis; skeletal development; blood vessel development; cell differentiation; ...

23 GO enrichment. For every GO category (of interest): estimate the expected number of genes in that category; calculate a p-value for finding as many genes as you do (hypergeometric test, $\chi^2$-test, z-scores, ...). Caveats: Critical level for gene selection? Multiple testing correction? Many genes have no GO categorisation (yet)
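
The two steps for a single GO category can be written out with the hypergeometric test, using only the standard library. The function name and the counts in the example are made up for illustration and are not taken from the osteogenesis data.

```python
from math import comb

def hypergeom_enrichment(n_total, n_in_category, n_selected, n_hits):
    """Expected count and one-sided hypergeometric p-value for one category.

    n_total:        genes on the array
    n_in_category:  genes annotated with the GO category
    n_selected:     genes in the selection
    n_hits:         selected genes that carry the annotation
    """
    expected = n_selected * n_in_category / n_total
    upper = min(n_in_category, n_selected)
    # P(X >= n_hits) when drawing n_selected genes without replacement.
    favourable = sum(
        comb(n_in_category, x) * comb(n_total - n_in_category, n_selected - x)
        for x in range(n_hits, upper + 1)
    )
    p_value = favourable / comb(n_total, n_selected)
    return expected, p_value

# Hypothetical counts: 1000 genes, 50 in the category, 100 selected, 15 hits.
expected, p_value = hypergeom_enrichment(1000, 50, 100, 15)
print(expected, p_value)
```

Repeating this over many categories is where the multiple-testing caveat on the slide comes in: the p-values of individual categories are not corrected here.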

24 Enrichment results: GO: development; GO: organ development; GO: morphogenesis; GO: skeletal development; GO: blood vessel development; GO: vasculature development; GO: blood vessel morphogenesis; GO: cell differentiation; GO: angiogenesis; GO: organogenesis

25 Enrichment across several thresholds ($T^2$): 25 most enriched terms (on average)

26 Plotting enrichment information in the GO tree J. de Haan et al., submitted

27 ROC curves When is an enrichment significant? (p-value) Check 30 GO terms known to be involved
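
One plausible reading of this check, stated here as an assumption since the slide gives no details, is to rank all GO terms by enrichment p-value and trace how quickly the terms known to be involved are recovered. The sketch below builds such a ROC curve and its area; the function name and the six example terms are hypothetical, and ties in the p-values are not treated specially.

```python
import numpy as np

def roc_from_pvalues(p_values, is_known):
    """ROC curve for ranking terms by ascending p-value.

    is_known marks the terms treated as true positives.
    Returns false positive rates, true positive rates and the area under the curve.
    """
    order = np.argsort(p_values, kind="stable")
    known = np.asarray(is_known, dtype=bool)[order]
    tpr = np.concatenate([[0.0], np.cumsum(known) / known.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(~known) / (~known).sum()])
    # Trapezoidal area under the curve.
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, auc

# Hypothetical example: the three known terms have the three smallest p-values.
p_values = np.array([0.001, 0.2, 0.01, 0.8, 0.03, 0.5])
is_known = np.array([True, False, True, False, True, False])
fpr, tpr, auc = roc_from_pvalues(p_values, is_known)
print(auc)
```

An area close to 1 means the known terms are enriched ahead of the others; an area near 0.5 is what a random selection would give.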

28 When is an enrichment significant? Check 30 GO terms known to be involved. 957 genes ($T^2$, $\alpha = 10^{-2}$). Right plot: 10 random subsets of 27/30 GO terms

29 ROC curves of random selections 30 random processes (A) or random genes (B)

30 Sliding window (200 genes)

31 Sliding window (200 genes) Number of enriched (expected) processes

32 Validation of gene selection. Step 1: agreement with known biology (GO enrichment, ROC curves). Step 2: prediction

33 ANOVA on selected genes (384). Average GxTxS interaction effect. Vitamin D: shift in kinetics*. *Experimentally confirmed: Piek et al., in preparation

34 Gene selection based on ANOVA-PCA. Based on Hotelling $T^2$ ($\alpha = 10^{-5}$): 384 genes selected. ANOVA assumes a normal distribution! Options: normalization of data; non-parametric ANOVA

35 Non-parametric ANOVA-PCA. Robust location (RL): replace the mean by the median. Rank transformation (RT): replace the data by their ranks, then ANOVA. Aligned rank method (AR): (data minus main effects), then rank transformation. Evaluation: Procrustes analysis, ROC curves
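
The three variants can be sketched on an array of cell means as follows. The slide does not spell out the details, so several points are assumptions: ranks are taken over the whole array at once, alignment removes the three mean-based main effects only, and the robust variant simply swaps `np.mean` for `np.median`. All function names are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata

def main_effects(cell, center=np.mean):
    """Main effects of an (I, J, K) array; center=np.median gives the robust variant."""
    mu = center(cell)
    T = center(cell, axis=(1, 2), keepdims=True) - mu
    S = center(cell, axis=(0, 2), keepdims=True) - mu
    G = center(cell, axis=(0, 1), keepdims=True) - mu
    return mu, T, S, G

def rank_transform(cell):
    """Replace every value by its rank over the whole array (ties get average ranks)."""
    return rankdata(cell.ravel()).reshape(cell.shape)

def aligned_rank_transform(cell):
    """Subtract the main effects, then rank what is left."""
    _, T, S, G = main_effects(cell)
    return rank_transform(cell - T - S - G)

# Toy data (assumption): 4 treatments, 10 times, 50 genes, one gross outlier.
rng = np.random.default_rng(5)
cell = rng.normal(size=(4, 10, 50))
ranks = rank_transform(cell)
aligned = aligned_rank_transform(cell)

spiked = np.zeros((4, 10, 50))
spiked[0, 0, 0] = 1e6
_, T_mean, _, _ = main_effects(spiked)
_, T_median, _, _ = main_effects(spiked, center=np.median)
print(T_mean[0, 0, 0], T_median[0, 0, 0])
```

The last two lines show why a robust location estimate matters: a single extreme value shifts the mean-based treatment effect but leaves the median-based one at zero.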

36 Main effects of time, genes and treatment (panels: Time, Genes, Treatments)

37 PCA on the ANOVA interaction matrix: Gene x Treatment (22283 x 4). Panels: Classical ANOVA, Robust Location ANOVA

38 ROC curves. Check 30 GO terms known to be involved; genes selected with $\alpha = 10^{-2}$

39 Procrustes analysis: classical ANOVA-PCA vs. robust ANOVA-PCA (on the GxT interaction term). Distribution of the differences: classical minus rotated robust method. Overall Procrustes error: $\sum (\text{classical} - \text{rotated})^2$. PCA on the difference matrix
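
The rotation step and the overall error can be sketched with an orthogonal Procrustes fit computed from one SVD. This version rotates only, with no scaling or translation, which is an assumption about what was done in the talk; the function name and the test matrices are hypothetical.

```python
import numpy as np

def procrustes_rotate(classical, robust):
    """Rotate `robust` onto `classical` with the best orthogonal matrix.

    Returns the rotated matrix, the difference matrix (classical - rotated)
    and the overall Procrustes error (sum of squared differences).
    """
    # The orthogonal R minimizing ||robust @ R - classical|| comes from
    # the SVD of robust.T @ classical.
    U, _, Vt = np.linalg.svd(robust.T @ classical)
    R = U @ Vt
    rotated = robust @ R
    diff = classical - rotated
    return rotated, diff, float(np.sum(diff ** 2))

# Toy check (assumption): `robust` is an exactly rotated copy of `classical`.
rng = np.random.default_rng(6)
classical = rng.normal(size=(50, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthogonal matrix
robust = classical @ Q.T
rotated, diff, error = procrustes_rotate(classical, robust)
print(error)
```

The difference matrix returned here is what the slide proposes to inspect further, either through the distribution of its entries or through a PCA.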

40 Procrustes analysis: distribution of differences. Robust Location method versus classical ANOVA

41 Procrustes analysis: overall Procrustes error compared with the classical ANOVA-PCA
Method                         Error
Robust Location                71.2
Rank Transform + CL            96.0
Aligned Rank Transform + CL    96.4

42 Procrustes analysis: PCA on difference matrix. Panels: RL vs CL, RT+CL vs CL, ART+CL vs CL

43 Conclusions. Individual variance terms (from ANOVA) can be very informative; useful for gene selection, too. Non-parametric ANOVA even better (manuscr. submitted). GO information is essential for biological validation; lack of annotation is a problem. Supplementary material available from

44 Acknowledgements: Jorn de Haan, Ron Wehrens, Ester Piek, Susanne Bauerschmidt, René van Schaijk
