Normalization. Getting the numbers comparable. DNA Microarray Bioinformatics - #27612

1 Normalization Getting the numbers comparable

2 The DNA Array Analysis Pipeline Question Experimental Design Array design Probe design Sample Preparation Hybridization Buy Chip/Array Image analysis Expression Index Calculation Normalization Comparable Gene Expression Data Statistical Analysis Fit to Model (time series) Advanced Data Analysis Clustering PCA Classification Promoter Analysis Meta analysis Survival analysis Regulatory Network

3 Intensities are not just mRNA concentrations. Contributing factors: tissue contamination; RNA degradation; RNA purification; reverse transcription; amplification efficiency; dye effect (Cy3/Cy5); spotting; DNA-support binding; other issues related to array manufacturing; background correction; image segmentation; hybridization efficiency and specificity; spatial effects. Example of spatial effects on microarrays [figure panels: raw data; spatial bias estimate]: the distribution of solvents and temperature over the array surface, and the washing procedure, may result in spatial effects.

4 Two kinds of variation
Global variation (systematic): amount of RNA in the biopsy; efficiencies of RNA extraction, reverse transcription, amplification, labeling, and photodetection.
Gene-specific variation (stochastic): spotting efficiency, spot size, and spot shape; cross-/unspecific hybridization; biological variation (effect and noise).

5 Stochastic noise: we use statistics to deal with it. [Figure: PCA plot of 34 patients; 8973 dimensions (genes) reduced to 2]

6 ...as we will see tomorrow. [Figure: PCA for the 100 most significant genes, reduced to 2 dimensions]

7 Sources of variation
Array-specific variation (systematic): similar effect on many measurements; corrections can be estimated from the data. Handled by normalization.
Gene-specific variation (stochastic): too random to be explicitly accounted for (noise). Handled by statistical testing.

8 Calibration = Normalization = Scaling
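As a minimal sketch of the simplest (linear) case, the snippet below rescales each array so that all arrays share one median intensity. The function name, the toy intensities, and the choice of the geometric mean of the per-array medians as the common target are illustrative assumptions, not something stated on the slide.

```python
import math
from statistics import median


def scale_to_common_median(arrays):
    """Multiply each array by one constant so that all arrays share a median."""
    medians = [median(a) for a in arrays]
    # Assumed target: geometric mean of the per-array medians.
    target = math.exp(sum(math.log(m) for m in medians) / len(medians))
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]


# Toy data (hypothetical): the second array carries half the signal of the first.
arrays = [
    [100.0, 200.0, 400.0, 800.0, 1600.0],
    [50.0, 100.0, 200.0, 400.0, 800.0],
]
scaled = scale_to_common_median(arrays)
```

A single scale factor per array only removes a global difference in signal; intensity-dependent differences need the nonlinear methods on the following slides.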

9 Nonlinear normalization

10 The Qspline method. From the empirical distribution, a number of quantiles are calculated for each of the channels to be normalized (one channel shown in red) and for the reference distribution (shown in black). A QQ-plot is made, and a normalization curve is constructed by fitting a cubic spline function. As reference, one can use an artificial median array for a set of arrays, or a log-normal distribution, which is a good approximation.

11 Once again qspline: accumulating quantiles. When many microarrays are to be normalized to each other, an average array can be used as the target.
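The quantile-to-quantile idea can be sketched in a few lines. Note the swap: this sketch connects the quantile pairs with straight lines instead of fitting a cubic spline, so it is a simplified stand-in for qspline, not the method itself. The function names, the number of quantiles, and the toy data are assumptions; arrays are assumed to have equal length.

```python
from bisect import bisect_right


def quantiles(values, n_q):
    """n_q evenly spaced empirical quantiles, from the minimum to the maximum."""
    s = sorted(values)
    out = []
    for k in range(n_q):
        pos = k * (len(s) - 1) / (n_q - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        out.append(s[lo] + (pos - lo) * (s[hi] - s[lo]))
    return out


def interp(x, xs, ys):
    """Piecewise-linear curve through (xs, ys); clamped outside the range of xs."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x) - 1
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])


def normalize_to_average(arrays, n_q=5):
    """Map every array onto the average array through its QQ curve."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Target: the average array, built by averaging sorted values rank by rank.
    target = [sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n)]
    target_q = quantiles(target, n_q)
    normalized = []
    for a in arrays:
        a_q = quantiles(a, n_q)
        normalized.append([interp(x, a_q, target_q) for x in a])
    return normalized


# Toy data (hypothetical log-intensities).
arrays = [[3.0, 1.0, 5.0, 2.0, 4.0], [2.0, 4.0, 6.0, 8.0, 10.0]]
normalized = normalize_to_average(arrays)
```

After normalization both toy arrays have the same set of values, and each probe keeps its rank within its own array.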

12 Lowess normalization [Figure: scatter plot of M against A] One of the most commonly used normalization techniques is the LOcally WEighted Scatterplot Smoothing (LOWESS) algorithm.
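A rough sketch of the idea for a two-channel array, assuming the usual definitions M = log2(R) - log2(G) and A = (log2(R) + log2(G)) / 2. The smoother below is a bare-bones local linear fit with tricube weights and no robustness iterations, so it is simpler than a full LOWESS implementation. The function names, the span `frac`, and the simulated dye bias are all illustrative assumptions.

```python
import math


def ma_values(red, green):
    """Log-ratio M and mean log-intensity A for each spot."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a


def lowess_fit(x, y, frac=0.5):
    """Local linear regression with tricube weights (no robustness iterations)."""
    n = len(x)
    k = max(2, int(math.ceil(frac * n)))
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Simulated spots (hypothetical): no true differential expression,
# only a dye bias that depends linearly on intensity.
a_true = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
m_bias = [0.8 - 0.05 * a for a in a_true]
red = [2 ** (a + b / 2) for a, b in zip(a_true, m_bias)]
green = [2 ** (a - b / 2) for a, b in zip(a_true, m_bias)]

m, a = ma_values(red, green)
trend = lowess_fit(a, m)
m_normalized = [mi - ti for mi, ti in zip(m, trend)]  # M with the trend removed
```

Subtracting the fitted trend from M centers the log-ratios around zero at every intensity level, which is the point of the normalization.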

13 Invariant set normalization (Li and Wong). An invariant set of probes is used: probes that do not change intensity rank between arrays. A piecewise linear median line is calculated. This curve is used for normalization.
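The selection step can be sketched as follows. This is a simplified, single-pass illustration: probes whose rank differs by at most `max_rank_diff` between a baseline array and the array to be normalized are kept, and the normalization curve is drawn as straight lines through those probes without any median smoothing. The function names, the threshold, and the toy intensities are assumptions.

```python
from bisect import bisect_right


def ranks(values):
    """Rank of each value (0 = smallest); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r


def invariant_set(baseline, array, max_rank_diff=1):
    """Indices of probes whose rank is (nearly) the same on both arrays."""
    rb, ra = ranks(baseline), ranks(array)
    return [i for i in range(len(array)) if abs(rb[i] - ra[i]) <= max_rank_diff]


def normalize_with_invariant_set(baseline, array, max_rank_diff=1):
    keep = invariant_set(baseline, array, max_rank_diff)
    # Curve through the invariant probes: array intensity -> baseline intensity.
    pts = sorted((array[i], baseline[i]) for i in keep)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]

    def curve(x):
        if x <= xs[0]:
            return ys[0]
        if x >= xs[-1]:
            return ys[-1]
        j = bisect_right(xs, x) - 1
        t = (x - xs[j]) / (xs[j + 1] - xs[j])
        return ys[j] + t * (ys[j + 1] - ys[j])

    return [curve(x) for x in array], keep


# Toy data (hypothetical): the array is roughly twice as bright as the
# baseline, except probe 4, which is strongly down-regulated.
baseline = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
array = [22.0, 41.0, 59.0, 82.0, 30.0, 121.0]
normalized, keep = normalize_with_invariant_set(baseline, array)
```

Probe 4 changes rank and is left out of the invariant set, so it does not distort the curve, and after normalization it still shows up as lower than its baseline value.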

14 Spatial normalization [Figure panels: raw data; after intensity normalization; spatial bias estimate; after spatial normalization]

15 The DNA Array Analysis Pipeline Question Experimental Design Array design Probe design Sample Preparation Hybridization Buy Chip/Array Image analysis Expression Index Calculation Normalization Comparable Gene Expression Data Statistical Analysis Fit to Model (time series) Advanced Data Analysis Clustering PCA Classification Promoter Analysis Meta analysis Survival analysis Regulatory Network

16 Expression index value. Some microarrays have multiple probes addressing the expression of the same gene. Affymetrix chips have probe pairs per gene: Perfect Match (PM) and MisMatch (MM).
PM: CGATCAATTGCACTATGTCATTTCT
MM: CGATCAATTGCAGTATGTCATTTCT

17 Expression index calculation. Simplest method? The median. But more sophisticated methods exist: dChip, RMA, and MAS 5 (from Affymetrix).

18 dChip (Li & Wong). Model: $PM_{ij} = \theta_i \phi_j + \varepsilon_{ij}$. Outlier removal: identify extreme residuals, remove them, re-fit, and iterate. The distribution of the errors $\varepsilon_{ij}$ is assumed independent of signal strength (Li and Wong, 2001).
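One way to fit the multiplicative model is alternating least squares, sketched below. It reads i as the array index and j as the probe index, and uses the constraint that the squared probe affinities sum to the number of probes to make the two factors identifiable. The outlier-removal loop from the slide is left out, and the function name and toy data are assumptions.

```python
def fit_li_wong(pm, n_iter=20):
    """Fit pm[i][j] ~ theta[i] * phi[j] by alternating least squares.

    pm[i][j] is the PM intensity of probe j on array i.
    Outlier removal is NOT performed in this sketch.
    """
    n_arrays, n_probes = len(pm), len(pm[0])
    phi = [1.0] * n_probes
    theta = [0.0] * n_arrays
    for _ in range(n_iter):
        # Least-squares update of theta with phi held fixed.
        s = sum(p * p for p in phi)
        theta = [sum(pm[i][j] * phi[j] for j in range(n_probes)) / s
                 for i in range(n_arrays)]
        # Least-squares update of phi with theta held fixed.
        s = sum(t * t for t in theta)
        phi = [sum(pm[i][j] * theta[i] for i in range(n_arrays)) / s
               for j in range(n_probes)]
        # Identifiability: rescale so that sum_j phi_j^2 = number of probes,
        # moving the scale into theta so the product theta_i * phi_j is unchanged.
        scale = (sum(p * p for p in phi) / n_probes) ** 0.5
        phi = [p / scale for p in phi]
        theta = [t * scale for t in theta]
    return theta, phi


# Toy probe set (hypothetical): 3 arrays, 4 probes, no noise.
theta_true = [100.0, 200.0, 400.0]
phi_true = [0.5, 1.0, 1.5, 1.0]
pm = [[t * p for p in phi_true] for t in theta_true]
theta, phi = fit_li_wong(pm)
```

The fitted `theta` values play the role of the expression index for each array; only their relative sizes are determined by the data, which is why a constraint on `phi` is needed.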

19 RMA. Robust Multi-array Average (RMA) expression measure (Irizarry et al., Biostatistics, 2003). For each probe set, re-write $PM_{ij} = \theta_i \phi_j$ as $\log(PM_{ij}) = \log(\theta_i) + \log(\phi_j)$. Fit this additive model by iteratively re-weighted least squares or median polish.
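Median polish on the additive model can be sketched as below: row (array) medians and column (probe) medians are swept out of the residuals in turn. This shows only the summarization step on values that are already on the log scale; the background correction and quantile normalization that RMA applies beforehand are not included. The function name and the toy numbers are assumptions.

```python
from statistics import median


def median_polish(x, n_iter=10):
    """Tukey median polish of x[i][j] (array i, probe j), already on log scale.

    Returns (expression per array, probe effects, residuals).
    """
    n_rows, n_cols = len(x), len(x[0])
    resid = [row[:] for row in x]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep the row medians out of the residuals.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep the column medians out of the residuals.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    expression = [overall + r for r in row_eff]
    return expression, col_eff, resid


# Toy probe set (hypothetical log2 values): array effects 7, 8, 9 and
# probe effects -1, 0, 1, 0, 0, plus one outlying probe on the first array.
log_theta = [7.0, 8.0, 9.0]
log_phi = [-1.0, 0.0, 1.0, 0.0, 0.0]
x = [[t + p for p in log_phi] for t in log_theta]
x[0][2] += 5.0  # outlier

expression, probe_effects, resid = median_polish(x)
```

Because medians are used throughout, the outlier ends up in the residuals instead of shifting the expression value of the first array.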

20 MAS 5. MicroArray Suite version 5 uses $\mathrm{signal} = \mathrm{TukeyBiweight}\{\log(PM_j - MM^*_j)\}$. $MM^*$ is an adjusted MM that is never bigger than PM. Tukey biweight is a robust average procedure with weights and outlier rejection.
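A one-step Tukey biweight average can be sketched as follows. The tuning constant c = 5 and the small epsilon follow the commonly described MAS 5 defaults, but treat them as assumptions here. The rule for computing the adjusted MM* is not reproduced: the toy `mm_star` values are simply assumed to be already adjusted so that PM - MM* is positive.

```python
import math
from statistics import median


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that down-weights outliers."""
    m = median(values)
    mad = median(abs(v - m) for v in values)  # median absolute deviation
    total = 0.0
    total_w = 0.0
    for v in values:
        u = (v - m) / (c * mad + eps)
        # Points further than c * MAD from the median get zero weight.
        w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0
        total += w * v
        total_w += w
    return total / total_w


# Toy probe set (hypothetical): the last probe pair is an outlier.
pm = [1200.0, 1500.0, 1350.0, 1425.0, 9000.0]
mm_star = [200.0, 300.0, 250.0, 225.0, 400.0]  # assumed already adjusted

log_diffs = [math.log2(p - m) for p, m in zip(pm, mm_star)]
signal = tukey_biweight(log_diffs)
plain_mean = sum(log_diffs) / len(log_diffs)
```

On this toy probe set the outlying pair receives zero weight, so `signal` stays close to the bulk of the probes while the plain mean is pulled upward.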

21 Methods compared on expression variance. [Figure: standard deviation of gene measures from 20 replicate arrays, plotted against expression level. Blue and red: RMA; black: dChip; green: MAS 5.0.] From Terry Speed.

22 Robustness: MAS 5.0. [Figure axes: log fold change estimate from 20 µg cRNA; log fold change estimate from 1.25 µg cRNA] (Irizarry et al., Biostatistics, 2003)

23 Robustness: dChip. [Figure axes: log fold change estimate from 20 µg cRNA; log fold change estimate from 1.25 µg cRNA] (Irizarry et al., Biostatistics, 2003)

24 Robustness: RMA. [Figure axes: log fold change estimate from 20 µg cRNA; log fold change estimate from 1.25 µg cRNA] (Irizarry et al., Biostatistics, 2003)

25 All of this is implemented in R, in the Bioconductor package affy (Gautier et al., 2003).

26 References
Li and Wong (2001). Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biology 2:1-11.
Irizarry, Bolstad, Collin, Cope, Hobbs and Speed (2003). Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research 31(4):e15.
Affymetrix. Affymetrix Microarray Suite User Guide. Affymetrix, Santa Clara, CA, version 5 edition.
Gautier, Cope, Bolstad, and Irizarry (2003). affy - an R package for the analysis of Affymetrix GeneChip data at the probe level. Bioinformatics.