Image Analysis. Based on Information from Terry Speed's Group, UC Berkeley. Lecture 3: Pre-Processing of Affymetrix Arrays. Affymetrix Terminology


Image Analysis
Lecture 3: Pre-Processing of Affymetrix Arrays
Stat 697K, CS 691K, Microbio 690K
Based on information from Terry Speed's group, UC Berkeley (Bolstad, Bioinformatics 2003; Irizarry, Biostatistics 2003).

Affymetrix Terminology
Probe: an oligonucleotide of 25 bases (a "25-mer").
Each gene or portion of a gene is represented by 9 to 22 probes that uniquely identify the gene (current standard = 11).
Perfect match (PM): a 25-mer complementary to a reference sequence of interest (e.g., part of a gene).
Mismatch (MM): the same as the PM but with a single base change at the middle (13th) base. Its purpose is to measure nonspecific binding and background noise.
Probe-pair: a (PM, MM) pair.
Probe-pair set: a collection of probe-pairs for a gene.
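As an illustration of the PM/MM definitions above, the sketch below derives an MM probe from a PM probe by changing the 13th base. The slides only say "a single base change"; replacing the base with its complement is an assumption made here, and the function name `mismatch_probe` and the example sequence are hypothetical.

```python
# Assumption: the middle base is swapped for its complement (A<->T, C<->G).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm):
    """Return the MM sequence for a 25-mer PM sequence."""
    if len(pm) != 25:
        raise ValueError("a probe must be a 25-mer")
    middle = 12  # 13th base, zero-based index
    return pm[:middle] + COMPLEMENT[pm[middle]] + pm[middle + 1:]


# Hypothetical PM sequence, for illustration only.
pm = "ACGTACGTACGTACGTACGTACGTA"
mm = mismatch_probe(pm)
probe_pair = (pm, mm)
```

The PM and MM sequences differ at exactly one position, so any signal difference between the two cells of a probe-pair is attributed to that single base.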

Image Analysis
Affymetrix arrays are processed using the MicroArray Suite (MAS) 5.1 software from Affymetrix.
Images are scanned with lasers that generate excitation light at 488 nanometers (nm).
The scanner produces one image, which is stored as a DAT file (~50 MB).

Affymetrix Chips
Each probe cell has millions of copies of the 25-mer immobilized on the slide.
On a typical chip there are ~200,000 probe cells (11 probe pairs x 10,000 genes, two cells per pair).

Image Analysis in MAS 5.1
Gridding: the MAS 5.1 software overlays a grid onto the image to locate the probe cell centers. The user can adjust the grid.
[Figures: Misaligned Grid; Adjusting the Grid]

Expression Calculation
The raw data (DAT image files) are converted to CEL files. Only the center 8x8 pixels of each probe cell are used for the signal.
Each probe cell is 10x10 pixels. Signal for each probe cell (PM and MM):
1) Remove the outer 36 pixels (reduces noise).
2) 8x8 pixels remain.
3) The probe cell signal is the 75th percentile of the 8x8 pixel values.
Background: the chip is divided into 16 regions (called sectors). The average of the lowest 2% of probe cell values in a region is taken as the background value for that region and is subtracted from the values in the region.

Affymetrix File Summary
1) DAT file: image file, ~10^7 pixels, ~50 MB.
2) CEL file: cell intensity file with probe-level PM and MM values (view with MAS 5.1, or read into BioConductor), ~7 MB.
3) CDF file: Chip Description File, ~7 MB. Contains probe locations on the chip: it describes which probes go in which probe sets and the location of the probe-pair sets. Built into R and BioConductor (homework); you just need to identify the type of chip for BioConductor.
4) EXP file: contains sample information, fluidics settings and scanner settings (small, ~1 KB).
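The probe cell signal and sector background steps described above can be sketched as follows. This is a minimal illustration, not the MAS 5.1 implementation: the function names are hypothetical, and the percentile uses linear interpolation between order statistics, which is an assumption (the slides do not say which percentile definition MAS 5.1 uses).

```python
def percentile(values, q):
    """q-th percentile (0-100), linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def probe_cell_signal(cell):
    """cell: 10x10 grid of pixel values (list of 10 rows of 10 values)."""
    # Drop the outer ring of 36 pixels, keeping the central 8x8 block.
    inner = [pixel for row in cell[1:-1] for pixel in row[1:-1]]
    return percentile(inner, 75)


def sector_background(cell_signals, fraction=0.02):
    """Average of the lowest 2% of probe cell values in one sector."""
    s = sorted(cell_signals)
    k = max(1, int(len(s) * fraction))
    return sum(s[:k]) / k


# Toy cell: bright border pixels (1000) around a uniform 8x8 interior (5).
cell = [[1000] * 10 for _ in range(10)]
for r in range(1, 9):
    for c in range(1, 9):
        cell[r][c] = 5
signal = probe_cell_signal(cell)

# Toy sector with 100 probe cell values 1..100: the lowest 2% are 1 and 2.
background = sector_background(range(1, 101))
```

In the toy cell the bright border does not affect the signal, which is the point of discarding the outer 36 pixels.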

5) RPT file: quality report file (small, ~2 KB). Main results:
a) Percent present genes: should be 40-50%, and at least greater than 25%.
b) Average background signal: should be less than 100.
c) How well the labeling reaction went.
As long as these values are similar across the chips in an experiment, the chips are good to use. For further detail, see the Affymetrix manual and website.

Affymetrix Files
We will not be analyzing the DAT files.
We will be using the probe-level CEL files in R.
The CDF files are part of BioConductor and do not need to be read in; BioConductor has built-in files for Affymetrix chips.

[Figure: log intensity for 6 Affymetrix HuGeneFL chips. Chips cannot be combined before normalization. Data from Wright et al. 2002, human fibroblasts (involved in wound repair).]

Multiple-Slide Normalization Methods
Extension of within-slide normalization.
The scale normalization step may be skipped if the chips have approximately the same distributions. There is a trade-off between the gains achieved by scale normalization and the possible increase in variability it introduces.

Why Normalize Affymetrix Data?
Total brightness differs among slides.
Background differs among slides.
Some causes of systematic measurement variation:
- Different amounts of RNA.
- The hybridization reaction may proceed more fully to equilibrium in one array than in another.
- Hybridization conditions may vary across arrays.
- Scanner settings are often different.

Types of Variation
Interesting variation: gene expression differences.
Obscuring variation:
- Sample preparation (labeling difference).
- Array manufacturing (hybridization difference).
- Array processing (scanner difference).

Multiple Affymetrix Chip Normalization
Normalize across slides in order to combine information from multiple slides.
Can treat a pair of chips as Red and Green, as in cDNA arrays, and use the same approaches as for cDNA technology.
Or, select one array as the baseline array and use it as a benchmark:
- The baseline is chosen as the best-quality chip, or as the chip with the median total intensity of all chips.
- Some packages choose the baseline array automatically (RMA).

Scale Normalization
There are many variations of this:
- Location transformations: subtract or add a constant to all values.
- Scaling individual intensities so that the median or mean intensities are the same across all arrays.
- Scaling individual intensities so that the total intensity on an array is the same across all arrays.
Built into the Bioconductor package affy, normalization method = constant (uses the mean, computed on probe-level intensities before summarization).
Mean and median centering are examples of location transformations (see the cDNA lecture).

Scale Transformations
Scale transformation = multiply all values by a constant.
Scale transformations shift the median of the distribution and change its shape (the spread is stretched or compressed).

Normalization by Scaling
When comparing multiple arrays (with one sample or multiple samples):
- Assume the overall distribution of RNA intensity values does not change much between samples.
- Most genes change very little in intensity across samples.
- The simplest approach assumes the average gene expression is the same for all arrays.
This makes sense: we start with equal quantities of RNA for the samples we are going to compare, so the average hybridization should be the same for all samples.
Source: Mark Reimers.

Scale Normalization using Affymetrix MAS 5.1 Software
1) Choose a baseline array.
2) For each array i (besides the baseline), replace each probe expression value by:
(probe value) x [(mean expression on baseline array) / (mean expression on array i)]
This results in each array having the same mean intensity as the baseline array.

Scale Normalization (background notes)
Affymetrix uses the 2% trimmed mean of the probeset values. The probeset value is the gene expression for a gene, i.e. the value after summarization of the probe-level data.
Definition: 2% trimmed mean = the mean after excluding the highest and lowest 2% of probeset values.
Built into the Bioconductor package affy, normalization method = constant (uses the mean, computed on probe-level intensities before summarization).

Affymetrix Scale Normalization using BioConductor: Example
Step 1: calculate the average intensity for each slide (Slide 1 is the baseline).
[Table: probe intensities for Slides 1-3 with column means; figure panels: before normalization, after scale normalization.]
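The scaling rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the MAS 5.1 or affy code: the function names are hypothetical, arrays are plain lists of intensities, and the baseline column values other than the first (10) are made up so that the baseline mean is 17.5 as in the slide example. The second column (50, 70, 60, 40; mean 55) is Slide 3 from the example.

```python
def mean(values):
    return sum(values) / len(values)


def trimmed_mean(values, trim=0.02):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    s = sorted(values)
    k = int(len(s) * trim)
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)


def scale_normalize(arrays, baseline=0, average=mean):
    """Scale every array so that its average equals the baseline's average."""
    target = average(arrays[baseline])
    normalized = []
    for array in arrays:
        factor = target / average(array)  # exactly 1.0 for the baseline itself
        normalized.append([value * factor for value in array])
    return normalized


baseline_slide = [10, 15, 20, 25]  # mean 17.5; values after the first are hypothetical
slide_3 = [50, 70, 60, 40]         # mean 55
scaled = scale_normalize([baseline_slide, slide_3])
```

Passing `average=trimmed_mean` instead of the default switches from the plain mean to the 2% trimmed mean described in the background notes.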

Step 2: multiply each column by the baseline average (Average 1), then divide by that column's own average before normalization. The column average intensities are now all the same.

            Slide 1 (baseline)   Slide 2            Slide 3
Probe 1     10                   25 * 17.5/40       50 * 17.5/55
Probe 2     ...                  ... * 17.5/40      70 * 17.5/55
Probe 3     ...                  ... * 17.5/40      60 * 17.5/55
Probe 4     ...                  ... * 17.5/40      40 * 17.5/55
Mean        17.5                 17.5               17.5

Disadvantages of Scale Normalization
- If a poor baseline is chosen, the results are poorer.
- The normalized chips do not all have the same distribution.

Quantile Normalization
Introduced by Bolstad et al.
Goal: to make the distribution of probe intensities the same for every chip.
The normalization distribution is calculated by averaging each quantile across chips.
Advantage: no baseline array is needed; a group of arrays is normalized at the same time without specifying any one of them as the baseline.
It works with probe-level data.
Built into the Bioconductor package affy, normalization method = quantiles.

Quantile Normalization at PM Probe Level
Columns are chips. Each gene has 11 PM probes.
Definition: a quantile is a cut point in the sorted data, e.g. the 20th percentile has 20% of the data below it.
The algorithm gives each array the same distribution by calculating the mean of each quantile across arrays and substituting it for the data value in the original dataset.
Source: Ben Bolstad.

Constant vs. Quantile Normalization
[Figure panels: before normalization; after constant normalization; after quantile normalization (distributions all the same).]

Remarks
- For quantile normalization, the distribution functions are effectively estimated by the sample quantiles.
- Quantile normalization is fast.
- Variability of expression measures across chips is reduced after quantile normalization, compared to constant normalization and no normalization.
- It removes the need to choose a baseline array (choosing a poor baseline gives poorer results).
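The quantile normalization algorithm described above (sort each chip, average each quantile across chips, substitute the averages back by rank) can be sketched as follows. This is a minimal sketch, not the affy implementation: the function name is hypothetical, all chips are assumed to have the same number of probes, and tied values are not given any special treatment.

```python
def quantile_normalize(chips):
    """chips: list of equal-length lists of probe intensities, one per chip."""
    n_probes = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean of the k-th smallest value across chips.
    reference = [
        sum(chip[k] for chip in sorted_chips) / len(chips)
        for k in range(n_probes)
    ]
    normalized = []
    for chip in chips:
        # Probe indices ordered from smallest to largest intensity.
        order = sorted(range(n_probes), key=lambda i: chip[i])
        result = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            result[probe_index] = reference[rank]
        normalized.append(result)
    return normalized


chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(chips)
# normalized == [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalization every chip contains the same set of values (here 1.5, 3.5, 5.5), only in a different probe order, which is why the distributions coincide exactly.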

Quantile Normalization Illustration
5 Affymetrix chips (version HG-U95A; HG = human genome) of human liver cell lines (Bolstad et al. 2003).
Use M vs. A plots, where M (the log-intensity difference) and A (the average log intensity) are computed for each pair of chips from probe-level data.
Plot the PM probes pairwise for each pair of the 5 chips, 10 pairs in all: \binom{5}{2} = \frac{5!}{2!\,3!} = 10.
[Figure: M vs. A plots of chip pairs before quantile normalization. Bolstad et al.]
[Figure: M vs. A plots of chip pairs after quantile normalization. Bolstad et al.]
[Figure: HG-U95 Affymetrix arrays for different dilutions of human liver tissue and central nervous system cell line. The black line is the distribution of all 27 arrays after quantile normalization, i.e. all have the same distribution. Bolstad et al.]
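The M and A values for the pairwise plots above can be computed as in the sketch below. The formulas are the usual M vs. A definitions (M = log2 ratio, A = mean log2 intensity), which the transcript does not spell out, so treat them as an assumption; the function names are hypothetical.

```python
import math
from itertools import combinations


def ma_values(chip_x, chip_y):
    """Return (M, A) lists for two chips' PM intensities, probe by probe."""
    m = [math.log2(x) - math.log2(y) for x, y in zip(chip_x, chip_y)]
    a = [0.5 * (math.log2(x) + math.log2(y)) for x, y in zip(chip_x, chip_y)]
    return m, a


def chip_pairs(n_chips):
    """All unordered pairs of chip indices."""
    return list(combinations(range(n_chips), 2))


m, a = ma_values([8.0, 4.0], [2.0, 4.0])
pairs = chip_pairs(5)
```

For 5 chips `chip_pairs` yields the 10 pairs counted above. A probe with equal intensity on both chips has M = 0, so a well-normalized pair shows a point cloud centered on the horizontal axis.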

Rank-Invariant Normalization
Idea: if a gene is differentially expressed between 2 experiments, it should have a higher rank in one array than in the other.
Select a subset of probes (or genes) that are non-differentially expressed as the basis for normalization, similar to the house-keeping gene idea.
Fit a normalization curve through the non-differentially-expressed probes.
Built into the Bioconductor package affy, normalization method = invariantset.
Introduced by: Li and Wong 2001b, Tseng et al. 2001, Schadt et al. 2001, Stuart et al.

Rank Invariant Normalization
Rank the probes in each array separately.
Probes whose ranks in the two arrays are within a threshold of each other, e.g. within 500 out of 150,000, are labeled as rank invariant.
[Figure: 2 different samples in an array set; the y-axis is the baseline. Green = rank-invariant set: not affected by the differentially expressed genes in the lower right corner, and therefore a different normalization than the yellow line (here, normalized values are based on subtracting x values). Yellow = smoothing spline (a different method, similar in idea to Lowess): affected by the lower right corner, which could be differentially expressed genes. Li & Wong, GB.]

Comparison to Other Methods
The rank-invariant method was not compared to quantile normalization, scale normalization or no normalization in Li & Wong 2001.
The rank-invariant method works well if there is a small number of differentially expressed genes in an experiment.
BioConductor: the quantiles and Li & Wong methods are the most widely used.

QQ plots (quantile-quantile plots)
A QQ plot is a graphical technique for determining whether 2 data sets come from populations with a common distribution.
It is a plot of the quantiles of the 1st data set against the quantiles of the 2nd data set.
If the 2 sets come from populations with the same distribution, the points should fall on a 45-degree line.
Built-in functions in R (qqplot, qqnorm).
[Figure: QQ plot of normalized array vs. baseline. 2 replicate arrays should have the same distribution; this is close to the 45-degree line. Li & Wong, GB.]

Other Normalization Methods: Use Stable Genes
Definition: genes that exhibit similar expression across a large number of tissues and conditions, but are allowed to deviate from this level in a small number of cases.
451 standard genes were found across tissue samples in the HuGE Index (Hsiao et al. 2001, Physiol Genomics; used the HU6800 array).
The HuGE Index (Human Gene Expression Index) is a public repository for gene expression data on normal human tissues using high-density oligonucleotide arrays.
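The QQ plot construction described above amounts to pairing up matching quantiles of the two data sets. The sketch below computes the plotted points only (no drawing). It mirrors the idea behind R's qqplot rather than its exact behavior: the function names are hypothetical, and the linear interpolation used when the two sets differ in size is an assumption.

```python
def quantile(sorted_values, p):
    """Quantile at probability p (0-1), linear interpolation on sorted data."""
    pos = p * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def qq_points(data_1, data_2):
    """Pairs (quantile of data_1, quantile of data_2) at matching probabilities."""
    s1, s2 = sorted(data_1), sorted(data_2)
    n = min(len(s1), len(s2))
    if n == 1:
        return [(s1[0], s2[0])]
    probabilities = [k / (n - 1) for k in range(n)]
    return [(quantile(s1, p), quantile(s2, p)) for p in probabilities]


# Same values in a different order: identical distributions.
points = qq_points([3, 1, 2], [2, 3, 1])
# A rescaled copy: same shape, different scale.
scaled_points = qq_points([3, 1, 2], [30, 10, 20])
```

In the first case every point has equal coordinates, i.e. it lies on the 45-degree line. In the second case the points still fall on a straight line, but with slope 10, which is the kind of difference a scale normalization removes.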

Use Stable Genes
HG_U95 and HG_U133 arrays contain 100 normalization control probe sets validated to have constant expression across tissue types (Affymetrix 2002).
More databases to obtain and validate such genes? E.g. Su et al. (2002), Large-scale analysis of the human and mouse transcriptomes, PNAS.

Credit
These slides are based in large part on lectures by Steve Qin, University of Michigan, with generous permission.
Steve Qin, Cheng Li, Jun S. Liu, Wing Wong, Robert Gentleman, Yee Hwa Yang, Sandrine Dudoit, Percy Luu, Terry Speed, Debashis Ghosh, Rafael Irizarry, Rebecca Fry, Leona Samson, David Hoyle, Mark Reimers, Ben Bolstad, Fred Wright.

Rank-Invariant Normalization (background notes)
Example from Li & Wong, Genome Biology: human brain sample microarrays.
Choose a baseline array (array 11).
Normalize each array to the baseline array, pairwise (baseline + 1 more).
HU6800 array, with ~140,000 probes.
Rank the probes in each pair of arrays separately; expect a probe for a non-differentially-expressed gene to have similar intensity ranks in the two arrays.
An iterative procedure identifies the rank-invariant set of probes (non-differentially-expressed genes).

Identifying Rank-Invariant Probes in a Pair of Arrays (background notes)
1) Rank the probes in each array separately.
2) Calculate the proportional rank difference (PRD) for each probe p:
\mathrm{PRD}_p = \frac{|r_{p,1} - r_{p,2}|}{N}, where r_{p,1} and r_{p,2} are the ranks of probe p in array 1 and array 2, and N is the total number of probes.
3) Threshold: if PRD_p < 0.003, then the probe is rank invariant.
4) Repeat steps 1)-3) iteratively until the rank-invariant set does not change.

Example of Iterative Procedure (background notes)
For a total number of probes = 140,000, a proportional rank difference of 0.003 corresponds to an absolute rank difference of 420.
Find, say, 10,000 probes with rank difference within 420.
Repeat the process for the new set of 10,000 probes.
Repeat until the number of points in the new set does not decrease.
Li & Wong, GB.

Variations on Method (background notes)
Threshold:
- PRD < 0.003 (rank difference of 420) for low-ranking genes (based on average rank), e.g. rank 50 on array 1, rank 70 on array 2.
- PRD < 0.007 (rank difference of 980) for high-ranking genes, e.g. rank 136,000 on array 1, rank 136,200 on array 2. The larger threshold, 980, is used for high-ranking genes because there are fewer points at high intensity.
- The threshold is interpolated in between for genes ranked between low and high.

Rank Invariant Normalization Curve (background notes)
A piecewise linear median line is fit through the rank-invariant probes, similar to Lowess curve fitting.
The normalization curve is subtracted from each probe in the non-baseline array, similar to Lowess normalization.
The baseline array is not changed; all other arrays are changed in comparison to the baseline.
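The iterative selection of rank-invariant probes described above can be sketched as follows. This is an illustration, not Li & Wong's code: the function names are hypothetical, a single threshold is used rather than the interpolated low/high thresholds, ties are ranked by position, and recomputing both the ranks and the probe count within the current subset at each pass is an assumption about how the iteration is meant to work.

```python
def ranks(values):
    """Rank of each value (1 = smallest); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def rank_invariant_set(array_1, array_2, threshold=0.003):
    """Indices of probes whose proportional rank difference stays below threshold."""
    current = list(range(len(array_1)))
    while current:
        r1 = ranks([array_1[i] for i in current])
        r2 = ranks([array_2[i] for i in current])
        n = len(current)
        kept = [
            probe
            for probe, a, b in zip(current, r1, r2)
            if abs(a - b) / n < threshold
        ]
        if len(kept) == n:  # the set no longer changes: done
            return kept
        current = kept
    return current


# Toy data: probe 0 is strongly up in array 2, the others keep their order.
array_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
array_2 = [100, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# A loose threshold is used because the toy set has only 10 probes.
invariant = rank_invariant_set(array_1, array_2, threshold=0.2)
```

In the toy data the first pass drops probe 0 (rank 1 vs. rank 10) and the second pass keeps all nine remaining probes, so the procedure stops. The normalization curve would then be fit through the selected probes only.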