Improvements to the RMA Algorithm for Gene Expression Microarray Background Correction

Size: px
Start display at page:

Download "Improvements to the RMA Algorithm for Gene Expression Microarray Background Correction"

Transcription

1 Improvements to the RMA Algorithm for Gene Expression Microarray Background Correction Monnie McGee & Zhongxue Chen Department of Statistical Science Southern Methodist University MSU Seminar November 18, 2005 p.1/44

2 Acknowledgments SMU - UTSW Microarray Analysis Group (SMUMAG) Faculty Jennifer Cai Jing Cao Tony Ng Richard Scheuermann William Schucany Jyoti Shaw Burke Squires Xinlei Wang Students Zhongxue Chen Kinfe Gedif Drew Hardin Jobayer Hossain Julia Kozlitina Feng Luo Data supplied by Boland Lab at Baylor. MSU Seminar November 18, 2005 p.2/44

3 Outline Motivation Central Dogma of Biology Types of Microarray Technologies Central Dogma of Microarray Analysis Robust Multi-Chip Average Improvements to RMA Other Ongoing and Future Work MSU Seminar November 18, 2005 p.3/44

4 Colon Cancer Cell Line Data Microarrays of four cell lines HCT116: Microsatellite Instability Model HCT111 Plus 3: MSI plus a corrective gene SW48: CIMP line (silencing of genes) SW480: Chromosomal Instability (CIN) line Four treatments to each line (including no treatment) Two control cell lines (RKO & HT29) Total of 18 microarrays MSU Seminar November 18, 2005 p.4/44

5 Colon Cancer Cell Line Data Microarrays of four cell lines HCT116: Microsatellite Instability Model HCT111 Plus 3: MSI plus a corrective gene SW48: CIMP line (silencing of genes) SW480: Chromosomal Instability (CIN) line Four treatments to each line (including no treatment) Two control cell lines (RKO & HT29) Total of 18 microarrays Question: What genes are differentially expressed among the various cell lines? MSU Seminar November 18, 2005 p.4/44

6 DNA and Base Pairs Image Courtesy of ExploreMore Television MSU Seminar November 18, 2005 p.5/44

7 Central Dogma of Biology Image Courtesy of BioCoach MSU Seminar November 18, 2005 p.6/44

8 Transcription Image Courtesy of BioCoach Movie of Complete Transcription MSU Seminar November 18, 2005 p.7/44

9 Measuring Gene Expression Gene expression can be quantified by measuring either mrna or protein. mrna Measures Quantitative Northern blot, qpcr, qrt-pcr, short or long oligonucleotide arrays, cdna arrays, EST sequencing, SAGE, MPSS, MS, bead arrays, etc. Protein Measures Quantitative Western blots, ELISA, 2D-gels, gas or liquid chromatography, mass-spec, etc. MSU Seminar November 18, 2005 p.8/44

10 Why Microarray Analysis? Large-scale study of biological processes Activity in cell at a certain point in time Account for differences in phenotypes on a large-scale genetic level Sequences are important, but genes have effect through expression MSU Seminar November 18, 2005 p.9/44

11 Why Microarray Analysis? Large-scale study of biological processes Activity in cell at a certain point in time Account for differences in phenotypes on a large-scale genetic level Sequences are important, but genes have effect through expression Rough measurement on a grand scale which has utility MSU Seminar November 18, 2005 p.9/44

12 Measuring Gene Expression Basic idea: Quantify concentration of a gene s mrna transcript in a cell at a given time MSU Seminar November 18, 2005 p.10/44

13 Measuring Gene Expression Basic idea: Quantify concentration of a gene s mrna transcript in a cell at a given time How? Immobilize DNA probes onto glass (or other medium) Hybridize labeled target mrna with probes Wash away excess target Measure how much binds to each probe (i.e. forms DNA) MSU Seminar November 18, 2005 p.10/44

14 Microarray Measurements All raw measurements are fluorescence intensities Target cdna (or mrna) is fluorescently labeled Molecules in dye are excited using a laser Measurement is a count of the photons emitted Entire slide or chip is scanned, and the result is a digital image Image is processed to locate probes and assign intensity measurements to each probe MSU Seminar November 18, 2005 p.11/44

15 Analysis Tasks Identify up- and down-regulated genes. Find groups of genes with similar expression profiles. Find groups of experiments (tissues) with similar expression profiles. Find genes that explain observed differences among tissues (feature selection). MSU Seminar November 18, 2005 p.12/44

16 Microarray Technologies Two Channel Spotted Arrays Robotic Microspotting Probes are 300 to 3000 base pairs in length Long-oligo arrays: probes are uniformly 60 to 90 bp Commerical arrays using inkjet technology Single-channel Arrays High-density short oligo (25 bp) arrays (Affymetrix, Nimblegen) MSU Seminar November 18, 2005 p.13/44

17 Two-Channel Spotted Arrays Diagram courtesy of Columbia Department of Computer Science MSU Seminar November 18, 2005 p.14/44

18 Yeast Array Image Yeast Array courtesy of Russ Altman, Stanford University MSU Seminar November 18, 2005 p.15/44

19 The Affymetrix Chip Human Genome U133 Plus 2.0 Array Courtesy of Affymetrix Some Definitions Probes = 25 bp sequences Probe sets = 11 to 20 probes corresponding to a particular gene or EST Chip contains 54K probe sets MSU Seminar November 18, 2005 p.16/44

20 In situ Synthesis of Probes Image Courtesy of Affymetrix MSU Seminar November 18, 2005 p.17/44

21 mrna Hybridizes to Probes Image Courtesy of Affymetrix MSU Seminar November 18, 2005 p.18/44

22 Image of E. Coli Gene Chip Image Courtesy of Lee Lab at Cornell University MSU Seminar November 18, 2005 p.19/44

23 Perfect Match vs. Mismatch PM Probe = 25 bp probe perfectly complementary to a specific region of a gene MM Probe = 25 bp probe agreeing with a PM apart from the middle base. The middle base is a transition (A G, C G) of that base MSU Seminar November 18, 2005 p.20/44

24 Perfect Match vs. Mismatch PM Probe = 25 bp probe perfectly complementary to a specific region of a gene MM Probe = 25 bp probe agreeing with a PM apart from the middle base. The middle base is a transition (A G, C G) of that base Image Courtesy of Affymetrix MSU Seminar November 18, 2005 p.20/44

25 PM and MM Example Target Transcript for Human reca gene: ctcagcttaagtcatggaattctagaggatgtatctcacaagtaggatcaag c t c a g c t t a a g t c a t g g a a t t c t a g PM1 c t c a g c t t a a g t g a t g g a a t t c t a g MM1 t c a g c t t a a g t c a t g g a a t t c t a g a PM2 t c a g c t t a a g t c t t g g a a t t c t a g a PM2 a t t c t a g a g g a t g t a t c t c a c a a g t PM3 a t t c t a g a g g a t c t a t c t c a c a a g t MM3 a g g a t g t a t c t c a c a a g t a g g a t c a PM4 a g g a t g t a t c t c t c a a g t a g g a t c a MM4 MSU Seminar November 18, 2005 p.21/44

26 PM and MM Example Target Transcript for Human reca gene: ctcagcttaagtcatggaattctagaggatgtatctcacaagtaggatcaag c t c a g c t t a a g t c a t g g a a t t c t a g PM1 c t c a g c t t a a g t g a t g g a a t t c t a g MM1 t c a g c t t a a g t c a t g g a a t t c t a g a PM2 t c a g c t t a a g t c t t g g a a t t c t a g a PM2 a t t c t a g a g g a t g t a t c t c a c a a g t PM3 a t t c t a g a g g a t c t a t c t c a c a a g t MM3 a g g a t g t a t c t c a c a a g t a g g a t c a PM4 a g g a t g t a t c t c t c a a g t a g g a t c a MM4 Morals: Large overlap of sequences and variable GC content Source: Naef and Magnesco (2003) MSU Seminar November 18, 2005 p.21/44

27 Other Sources of Variation Systematic Amount of RNA in biopsy extraction, Efficiencies of RNA extraction, reverse transcription, labeling, photodetection, GC content of probes Similar effect on many measurements Corrections can be estimated from data Calibration corrections MSU Seminar November 18, 2005 p.22/44

28 Other Sources of Variation Systematic Amount of RNA in biopsy extraction, Efficiencies of RNA extraction, reverse transcription, labeling, photodetection, GC content of probes Similar effect on many measurements Corrections can be estimated from data Calibration corrections Stochastic PCR yield, DNA quality, Spotting efficiency, spot size, Non-specific hybridization, Stray signal Too random to be explicitly accounted for in a model Noise components & Schmutz MSU Seminar November 18, 2005 p.22/44

29 Quality Control We wish to find and eliminate problem probes before analyzing the data Problems may be local (scratch on the array, inadequate washing) or global (background set too high) Look at image plots, histograms, MA plots, boxplots, etc. MSU Seminar November 18, 2005 p.23/44

30 Colon Cancer Cell Line Data Microarrays of four cell lines HCT116: Microsatellite Instability Model HCT111 Plus 3: MSI plus a corrective gene SW48: CIMP line (silencing of genes) SW480: Chromosomal Instability (CIN) line Four treatments to each line (including no treatment) Two control cell lines (RKO & HT29) Total of 18 microarrays Question: What genes are differentially expressed among the various cell lines? MSU Seminar November 18, 2005 p.24/44

31 Log Base 2 SW 480 Intensities MSU Seminar November 18, 2005 p.25/44

32 Exploratory Data Analysis MSU Seminar November 18, 2005 p.26/44

33 Images of Raw Cell Line Data Scanner Effect? MSU Seminar November 18, 2005 p.27/44

34 Central Dogma of MA Analysis Computing Expression Values for each probe set requires three steps: Background correction (image correction for cdna) Normalization Summarization MSU Seminar November 18, 2005 p.28/44

35 Central Dogma of MA Analysis Computing Expression Values for each probe set requires three steps: Background correction (image correction for cdna) Normalization Summarization Approaches: Microarray Analysis Suite 5.0 (MAS Affymetrix, 2001, 2003) Model Based Expression Index (MBEI - Li and Wong, 2001a,b) Robust Multichip Analysis (RMA - Irizarry et. al., 2003) Generalized Logarithm Average (GLA - Zhou and Rocke, 2005) MSU Seminar November 18, 2005 p.28/44

36 Background Correction in RMA where X = S + Y X = observed probe level intensity S E(α) = true signal Y TN(µ,σ 2 ) = background noise Reference: Irizarry et. al., Biostatistics, 2003 MSU Seminar November 18, 2005 p.29/44

37 RMA for the Right Brained... Image courtesy of Terry Speed MSU Seminar November 18, 2005 p.30/44

38 Parameter Estimation Background Corrected intensity is E ij = E(S ij X ij ), where i = 1...G, and j = 1,...,J. We need to estimate µ, σ, and α. MSU Seminar November 18, 2005 p.31/44

39 Parameter Estimation Background Corrected intensity is E ij = E(S ij X ij ), where i = 1...G, and j = 1,...,J. We need to estimate µ, σ, and α. How does RMA estimate the parameters? µ = Mode of observations to the left of the overall mode σ = Sample standard deviation for observations to left of overall mode α = Mode of observations to the right of the overall mode MSU Seminar November 18, 2005 p.31/44

40 Parameter Estimation Background Corrected intensity is E ij = E(S ij X ij ), where i = 1...G, and j = 1,...,J. We need to estimate µ, σ, and α. How does RMA estimate the parameters? µ = Mode of observations to the left of the overall mode σ = Sample standard deviation for observations to left of overall mode α = Mode of observations to the right of the overall mode Shown to perform better than more principled approaches (Hein, et. al., 2005). MSU Seminar November 18, 2005 p.31/44

41 Simulation Experiment 100 replications for n = 100, 000. True parameter values of µ = 50, 100, σ = 10, 20, and α = 50, 250. Estimate of σ is the same as RMA Four methods for estimating α: Mean, Median, 75 th percentile, and th percentile of PM values larger than overall mode Five methods of estimating µ MSU Seminar November 18, 2005 p.32/44

42 Estimating µ Estimate µ with 1. Affy method 2. Overall mode (s) of PM intensities 3. Mode of data to the left of 2s 4. Either of 2 or 3 plus a one-step correction, defined by the formula: ( s µ φ σ ) ασ = ασ [ ( s µ Φ σ )] ασ MSU Seminar November 18, 2005 p.33/44

43 Results MSE for α, when µ = 50, σ = 10, α = 50 Using RMA: 1754 MSU Seminar November 18, 2005 p.34/44

44 Results MSE for α, when µ = 50, σ = 10, α = 50 Using RMA: 1754 ˆµ ˆα Given By Mean Median 75% 99.95% s s s s MSU Seminar November 18, 2005 p.34/44

45 Performance on Cell Line Data PM intensities compared to original curve for ˆµ = 2s + 1 and various estimates of α. Data: SW 480 cell line with short term treatment. MSU Seminar November 18, 2005 p.35/44

46 Performance on Spike-In Data MSU Seminar November 18, 2005 p.36/44

47 Ongoing and Future Work Fourier or Bootstrap Deconvolution (Hall & Qiu 2005, Cordy & Thomas 1997) MSU Seminar November 18, 2005 p.37/44

48 Ongoing and Future Work Fourier or Bootstrap Deconvolution (Hall & Qiu 2005, Cordy & Thomas 1997) Nonparametric Approach Find smallest k 1 % of PM intensities Obtain k 2 % of corresponding MM intensities MM intensities are an estimate of background noise MSU Seminar November 18, 2005 p.37/44

49 Ongoing and Future Work Fourier or Bootstrap Deconvolution (Hall & Qiu 2005, Cordy & Thomas 1997) Nonparametric Approach Find smallest k 1 % of PM intensities Obtain k 2 % of corresponding MM intensities MM intensities are an estimate of background noise Model PM intensities as Nonstandard Mixtures (Statistical Science, 1989) X = S + Y, where S (1 p)δ 0 + pf(x) MSU Seminar November 18, 2005 p.37/44

50 Some Preliminary Results Nonparametric Correction with k 1 = and k 2 = vs. RMA Correction Data: SW480 Cell Line with Short-Term Treatment MSU Seminar November 18, 2005 p.38/44

51 A New Model Model X, the observed intensity, by X pm = BK pm + NS pm + CH pm + S pm X mm = BK mm + NS mm + CH mm where BK is the background, NS is non-specific hybridization, and CH is cross-hybridization. Assumptions: NS mm = αsh pm + ONS mm BK pm BK mm CH pm CH mm MSU Seminar November 18, 2005 p.39/44

52 More Work To Do... Spatial Correlation in Affymetrix GeneChip Arrays Use of Moran s I statistic to detect spatial correlation Test for Scanner effects within an array Methods of Visualizing Microarray Data Methods of Simulating Microarray Data Creating pseudo-replicate arrays MSU Seminar November 18, 2005 p.40/44

53 Spatial Correlation Measures Raw Image File Image using Moran s I MSU Seminar November 18, 2005 p.41/44

54 Statistical Challenges Enormous amount of Data Current methods are somewhat ad-hoc Data integration and visualization Data has variable specificity Dynamic nature of data Multiple Comparisons MSU Seminar November 18, 2005 p.42/44

55 References 1. Affymetrix, Inc (2001). "Statistical Algorithms Reference". Data Analysis Fundamentals Technical Manual, Chapter Affymetrix Technical Note: Design and Performance of the GeneChip Human Genome U133 Plus 2.0 and Human Genome U133A Plus 2.0 Arrays (2003) Affymetrix, Inc (2002). Statistical Algorithms Description Document Bolstad BM (2004). Low Level Analysis of High-density Oligonucleotide Array Data: Background, Normalization and Summarization. Dissertation. University of California, Berkeley. 5. Cordy CB and Thomas DR (1997). Deconvolution of a distribution function. Journal of the American Statistical Association, 92, Hall P and Qiu P (2005). Discrete-transform approach to deconvolution problems. Biometrika, 92, Hein AK, Richardson S, Causton H, Ambler GK, and Green PJ (2005). BGX: a fully Bayesian integrated approach to the analysis of Affymetrix GeneChip data. Biostatistics, 6, MSU Seminar November 18, 2005 p.43/44

56 References Continued 8. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, and Speed TP (2003). Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research, 31 (4) e Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, and Speed TP (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, Li C and Wong HW (2001). Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proceedings of the National Academy of Sciences, 98 (1): Li C and Wong HW (2001). Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biology, 8 (8): research , Naef F and Magnasco MO (2003). Solving the riddle of the bright mismatches: Labeling and effective binding in oligonucleotide arrays. Physical Review, Panel on Nonstandard Mixtures of Distributions (1989). Statistical Models and Analysis in Auditing. Statistical Science, 4, Zhou L and Rocke DM (2005). An expression index for Affymetrix GeneChips based on the generalized logarithm. Bioinformatics, 21 (21): MSU Seminar November 18, 2005 p.44/44