Microarray Data Analysis Workshop. Preprocessing and normalization A trailer show of the rest of the microarray world.

Size: px
Start display at page:

Download "Microarray Data Analysis Workshop. Preprocessing and normalization A trailer show of the rest of the microarray world."

Transcription

1 Microarray Data Analysis Workshop MedVetNet Workshop, DTU 2008 Preprocessing and normalization A trailer show of the rest of the microarray world Carsten Friis Media glna tnra GlnA TnrA C2 glnr C3 C5 C6 K GlnR C1 C4 C7

2 What we cover and what we don t

3 Array design Probe design The DNA Array Analysis Pipeline Question Experimental Design Sample Preparation Hybridization Buy Chip/Array Image analysis Expression Index Calculation Normalization Comparable Gene Expression Data Statistical Analysis Fit to Model (time series) Advanced Data Analysis Clustering PCA Classification Promoter Analysis Meta analysis Survival analysis Regulatory Network

4 Array design Probe design The DNA Array Analysis Pipeline Question Experimental Design Sample Preparation Hybridization Buy Chip/Array Image analysis Expression Index Calculation Normalization Comparable Gene Expression Data Statistical Analysis Fit to Model (time series) Advanced Data Analysis Clustering PCA Classification Promoter Analysis Meta analysis Survival analysis Regulatory Network

5 Intensities are not just mrna concentrations Example of spatial effects on microarrays Tissue contamination Clone In theory, identification the and spatial mapping location of a given spot Image should segmentation matter little, since the RNA degradation locations were PCR randomly yield, selected. contamination Signal quantification Amplification efficiency Raw data Spatial effects Spotting efficiency Background But in correction reality, things like the Reverse transcription distribution of efficiency solvent over the array DNA-support binding surface and Hybridization the quality efficiency of washing, have and specificity their say on Other issues the related matter. to array Spatial manufacturing bias estimate

6 Two kinds of variation in the signal Global variation RNA quality Sample preparation Dye Hybridization Photodetection Gene-specific variation Spotting (size and shape) Cross-hybridization Dye Biological variation Effect Noise Systematic Stochastic

7 Sources of variation Array-specific variation: Systematic Gene-specific variation: Stochastic Similar effect on many measurements Corrections can be estimated from data Too random to be explicitly accounted for noise Normalization Statistical testing and/or Error modelling

8 Calibration = Normalization = Scaling

9 Nonlinear normalization

10 The Qspline method From the empirical distribution, a number of quantiles are calculated for each of the channels to be normalized (one channel shown in red) and for the reference distribution (shown in black) A QQ-plot is made and a normalization curve is constructed by fitting a cubic spline function to the quantiles As reference one can either use an artificial median array for a set of arrays or use a log-normal distribution, which is a good approximation

11 Spatial normalization Raw data After intensity normalization Spatial bias estimate After spatial normalization

12 Array design Probe design The DNA Array Analysis Pipeline Question Experimental Design Sample Preparation Hybridization Buy Chip/Array Image analysis Expression Index Calculation Normalization Comparable Gene Expression Data Statistical Analysis Fit to Model (time series) Advanced Data Analysis Clustering PCA Classification Promoter Analysis Meta analysis Survival analysis Regulatory Network

13 Expression index value Some microarrays have multiple probes addressing the expression of the same target However for downstream analysis we often want - Perfect to Match deal (PM) with only one value pr. gene. - MisMatch (MM) Therefore we want to collapse the intensities from many probes into one value: a gene expression index value Affymetrix GeneChips have probe pairs pr. Gene PM: CGATCAATTGCACTATGTCATTTCT MM: CGATCAATTGCAGTATGTCATTTCT

14 Expression index calculation Simplest method? Median But more sophisticated methods exists: dchip, RMA and MAS 5

15 dchip (Li & Wong) Model: PM ij = θ i φ j + ε ij Outlier removal: Identify extreme residuals Remove Re-fit Iterate Distribution of errors ε ij assumed independent of signal strength (Li and Wong, 2001)

16 RMA Robust Multi-array Average (RMA) expression measure (Irizarry et al., Biostatistics, 2003) For each probe set, re-write PM ij = θ i φ j as: log(pm ij )= log(θ i ) + log(φ j ) Fit this additive model by iteratively re-weighted least-squares or median polish

17 MAS. 5 MicroArray Suite version 5 uses Signal = TukeyBiweight{log(PM j - MM * j)} MM* is an adjusted MM that is never bigger than PM Tukey biweight is a robust average procedure with weights and outlier rejection

18 Methods compared on expression variance Standard deviation of gene measures from 20 replicate arrays Std Dev of gene measures from 20 replicate arrays Expression level RMA: Blue and Red MAS5: Green dchip: Black From Terry speed

19 The really cool thing about R......is all the nice libraries out there The BioConductor packages encompasses many very useful methods for microarray analysis Including the qspline method, and other normalization algorithms Check out the website!

20 References Li and Wong, (2001). Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biology 2:1 11. Irizarry, Bolstad, Collin, Cope, Hobbs and Speed, (2003) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research 31(4):e15 Affymetrix. Affymetrix Microarray Suite User Guide. Affymetrix, Santa Clara, CA, Version 5 edition, Gautier, Cope, Bolstad, and Irizarry, (2003). affy - an r package for the analysis of affymetrix genechip data at the probe level. Bioinformatics

21 Array design Probe design The DNA Array Analysis Pipeline Question Experimental Design Sample Preparation Hybridization Buy Chip/Array Image analysis Expression Index Calculation Normalization Comparable Gene Expression Data Statistical Analysis Fit to Model (time series) Advanced Data Analysis Clustering PCA Classification Promoter Analysis Meta analysis Survival analysis Regulatory Network

22 Demanding answers of your microarray data Typically we want to identify differentially expressed genes Example: Alcohol dehydrogenase is expressed at a higher level when alcohol is added to the media alcohol dehydrogenase without alcohol with alcohol

23 Alas, the data contain stochastic noise There is no way around it... He s going to say it Statistics

24 You can think of statistics as a black box... Noisy measurements statistic p-value...but, you still need to perform the test and understand how to interpret the results

25 What's inside the black box statistics? t-tests, ANOVAs and Volcanoes

26 The notorious p-value p-value = The chance of rejecting the null hypothesis by coincidence For gene expression analysis we can say: The chance that a gene is categorized as differential expressed by mistake

27 Why and what do we test? - We test to extract significant genes Density Fold change: > 2 Fold change: ~1 Intensity

28 The t-test The t statistic is based on the sample mean and variance Then a lookup of t in a table of the t-distribution finds the p-value

29 ANOVA = ANalysis Of VAriance Very similar to the t-test, but can test multiple categories Ex: is gene x differentially expressed between wt, mutant 1 and mutant 2 Advantage: it has more power than the t-test

30 The two way ANOVA 3 wildtype +alcohol 2 mutant +alcohol wildtype mutant 2

31 The fine print... We can rank the genes according to the p-value But, we really can t trust the p- value, in the strict statistical way! We rarely fulfill all the assumptions of the statistical tests... And we should take multiple testing into account

32 Multiple testing? In a typical microarray analysis we test thousands of genes If we use a significance level of 0.05 and we tested 1000 genes. We would expect 50 genes to be significant by chance 1000 x 0.05 = 50

33 Sometimes the impossible just happens Monday, September 5, 2005; Posted: 11:42 a.m. EDT (15:42 GMT) VIENNA, Austria -- At least nine tourists have been killed after a helicopter dropped a concrete block on a cable car in western Austria, according to police. Police said the aircraft dropped construction materials onto one cable car Monday, causing it to plummet to the ground, news agencies said. Passengers were also catapulted from other cars as they swung violently, reports said. Cable car passengers are rescued by helicopter in the Austrian skiing resort of Soelden.

34 Correction for multiple testing Bonferroni: P 0.01 N Confidence level of 99% Benjamini-Hochberg: P i N 0.01 N = number of genes i = number of accepted genes

35 But really, those methods are crude What we can trust is the ability of statistic test to rank genes accordingly to their reliability The number of genes that are needed or can be handled in downstream processes can be used to set the cutoff If we permute the samples we then can get an estimate of the False Discovery Rate (FDR) in this set

36 ...but then what do we do? P-value log2 fold change (M)

37 Conclusion Array data contains stochastic noise Statistics is needed to conclude on differential expression We can t trust the p-value But the statistics can rank genes The capacity/needs of downstream processes can be used to set cutoff FDR can be estimated e.g. Volcano Plots t-test is used for two category tests ANOVA is used for multiple categories

38 Array design Probe design The DNA Array Analysis Pipeline Question Experimental Design Sample Preparation Hybridization Buy Chip/Array Image analysis Expression Index Calculation Normalization Comparable Gene Expression Data Statistical Analysis Fit to Model (time series) Advanced Data Analysis Clustering PCA Classification Promoter Analysis Meta analysis Survival analysis Regulatory Network