Rafael A Irizarry, Department of Biostatistics JHU
1 Getting Usable Data from Microarrays: it's not as easy as you think. Rafael A Irizarry, Department of Biostatistics, JHU, rafa@jhu.edu
2 Acknowledgements: Sandrine Dudoit, Terry Speed, Ben Bolstad, Yee-Hwa Yang (UCB); Leslie Cope (JHU); Francois Collin (GeneLogic); Bridget Hobbs (WEHI); Gene Brown's group at Wyeth/Genetics Institute and Uwe Scherf's Genomics Research & Development Group at Gene Logic, for generating the spike-in and dilution data; Gene Logic and Affymetrix, for permission to use their data.
3 Outline: Scientific questions; Quick review of technology; Role of statistics; Preprocessing; Case study
4 Scientific Questions. Expression: to understand gene function, it is helpful to know when and where it is expressed. Differential expression: under what circumstances the expression level is affected. Expression pattern: questions concerning functional pathways and how cellular components work together to regulate and carry out cellular processes. Lipshutz et al. (1999) Nature Genetics (supplement), 21, pp
5 What do Microarrays do? Interrogate labeled nucleic acid samples. How do they do it? Labeled targets, probes.
6 Role of Statistics. Biological question (differentially expressed genes, sample class prediction, etc.); Experimental design; Microarray experiment; Image analysis (quantify expression); Normalization (preprocessing); Estimation, Testing, Clustering, Discrimination; Biological verification and interpretation
7 cDNA microarrays
8 Preprocessing. Image analysis: What is a spot? What is background? What do we do about background? Expression measure: How do we summarize pixel information to represent expression? Normalization: Can arrays from the same batch be directly compared?
9 Segmentation: Adaptive segmentation, SRG; fixed circle segmentation. Spots usually vary in size and shape.
10 Spatial effect: Cy3 background intensity; Cy5 background intensity
11 Local background: GenePix, QuantArray, ScanAnalyze. Spot uses morphological opening
12 Quantification of Expression: For each spot on the slide we calculate Red intensity = $R_{fg} - R_{bg}$ and Green intensity = $G_{fg} - G_{bg}$ (fg = foreground, bg = background), and combine them in the log (base 2) ratio $\log_2(\text{Red intensity} / \text{Green intensity})$. We now have one differential expression value for each gene on each array.
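The calculation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular image analysis package: the function name `log_ratio`, the example intensities, and the choice to return `None` for non-positive corrected intensities are all assumptions made here.

```python
import math

def log_ratio(r_fg, r_bg, g_fg, g_bg):
    """Return log2(Red / Green) after subtracting local background.

    Returns None when either background-corrected intensity is not
    positive, because the log ratio is undefined there. How such spots
    should be handled is a choice this sketch does not make.
    """
    red = r_fg - r_bg      # Red intensity = Rfg - Rbg
    green = g_fg - g_bg    # Green intensity = Gfg - Gbg
    if red <= 0 or green <= 0:
        return None
    return math.log2(red / green)

# Hypothetical spot: foreground and background for each channel.
m = log_ratio(r_fg=1100.0, r_bg=100.0, g_fg=600.0, g_bg=100.0)
print(m)  # 1000 / 500 = 2, so the log ratio is 1.0
```

A spot whose background exceeds its foreground shows why background handling matters: the corrected intensity goes negative and no log ratio exists.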
13 What is an MvA plot? $M = \log_2 R - \log_2 G$ vs. $A = (\log_2 R + \log_2 G)/2$
14 45 degree rotation of scatter plot: $M = \log_2 R - \log_2 G$ vs. $A = (\log_2 R + \log_2 G)/2$
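The link between the MvA plot and a rotated scatter plot can be checked numerically. The sketch below assumes the scatter plot has log2 G on the horizontal axis and log2 R on the vertical axis (the slide does not state the orientation); the helper names `m_and_a` and `rotate_45` are made up for this illustration.

```python
import math

def m_and_a(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    x = math.log2(green)
    y = math.log2(red)
    return y - x, (x + y) / 2

def rotate_45(x, y):
    """Rotate the point (x, y) clockwise by 45 degrees about the origin."""
    c = math.sqrt(0.5)  # cos(45 degrees) = sin(45 degrees)
    return c * (x + y), c * (y - x)

# Hypothetical spot with R = 8 and G = 2.
M, A = m_and_a(8.0, 2.0)
x_rot, y_rot = rotate_45(math.log2(2.0), math.log2(8.0))

# The rotated coordinates are A and M up to constant scale factors.
print(math.isclose(x_rot, math.sqrt(2) * A))  # True
print(math.isclose(y_rot, M / math.sqrt(2)))  # True
```

So the MvA plot shows the same points as the rotated scatter plot, with each axis rescaled by a constant.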
15 Background matters Morphological opening Local background
16 Why do we normalize?
17 Why do we normalize?
18 The red-green ratios can be spatially biased. Top 2.5% of ratios red, bottom 2.5% of ratios green
19 MvA-plot by print-tip-group
20 Case Study Improving expression measures provided by Affymetrix software
21 PM MM
22 Statistical Problem: Summarize probe intensity pairs (PM and MM) to give a measure of expression for a probe set. There are various summaries: MAS 4.0 (AvDiff), MAS 5.0, Li and Wong's MBEI, RMA, and many others.
23 Default until 2002: GeneChip software uses Avg.diff, $\text{Avg.diff} = \frac{1}{|A|} \sum_{j \in A} (PM_j - MM_j)$, with $A$ a set of suitable pairs chosen by software. Obvious problems: many negative expression values; no log transform.
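A small sketch of the average difference makes the first problem concrete. How the software chooses the set A is not described on the slide, so the `pairs` argument here is simply supplied by the caller, and all intensities are invented for illustration.

```python
def avg_diff(pm, mm, pairs=None):
    """Average of PM_j - MM_j over the probe pairs j in A.

    `pairs` stands in for the set A. The rule the software uses to
    choose A is not reproduced here; by default all pairs are used.
    """
    if pairs is None:
        pairs = range(len(pm))
    pairs = list(pairs)
    return sum(pm[j] - mm[j] for j in pairs) / len(pairs)

# Hypothetical probe set with four PM/MM pairs.
pm = [120.0, 90.0, 300.0, 80.0]
mm = [100.0, 110.0, 100.0, 95.0]
print(avg_diff(pm, mm))  # (20 - 20 + 200 - 15) / 4 = 46.25

# When MM tends to exceed PM, the "expression" comes out negative.
print(avg_diff([50.0, 60.0], [70.0, 80.0]))  # -20.0
```

A negative value cannot be log transformed, which ties the two problems listed on the slide together.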
24 Why we take log2
25 What is the evidence? Lockhart et al. Nature Biotechnology 14 (1996)
26 What we do: four steps. We use only PM, and ignore MM. Also, we (1) adjust for background on the raw intensity scale; (2) carry out quantile normalization of PM-*BG with chips in suitable sets, and call the result n(PM-*BG); (3) take $\log_2$ of normalized background-adjusted PM; (4) carry out a robust multi-chip analysis (RMA) of the quantities $\log_2 n(PM -^{*} BG)$. We call our approach RMA.
27 Data used for next few slides: Spike-in data set A: 11 control cRNAs spiked in, all at the same concentration, which varies across 12 chips.
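The quantile normalization step can be illustrated with a short sketch. It assumes background adjustment has already been applied (that step is not reproduced here), that every chip has the same number of probes, and that ties are broken by probe order; the function name and the numbers are hypothetical.

```python
def quantile_normalize(chips):
    """Give every chip the same empirical distribution of intensities.

    `chips` is a list of equal-length lists, one list per chip. The
    k-th smallest value on each chip is replaced by the mean of the
    k-th smallest values across all chips. Ties are broken by probe
    order, which is a simplification made for this sketch.
    """
    n = len(chips[0])
    sorted_chips = [sorted(c) for c in chips]
    # Reference distribution: mean of the k-th order statistic across chips.
    reference = [sum(s[k] for s in sorted_chips) / len(chips) for k in range(n)]
    normalized = []
    for c in chips:
        order = sorted(range(n), key=lambda i: c[i])  # probe indices by rank
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized

# Two hypothetical chips with three background-adjusted PM values each.
chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(chips))  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After this step each chip contains the same set of values, so only the ranks of probes within a chip differ from chip to chip. Taking log base 2 of the result gives the quantities analyzed in the final step.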
28 Why we ignore the MM values
29 Why we ignore the MM values
30 Why we ignore the MM values
31 Why and how we remove background White arrows mark the means
32 Why and how we normalize
33 Why we write $\log_2 n(PM -^{*} BG)$ = chip effect + probe effect. Because: probe effects are additive on the log scale.
34 Why we carry out a Robust Multi-chip Analysis. Why multi-chip? To put each chip's values in the context of a set of similar values. Why robust? In the old human and mouse series, perhaps 10%-15% of probe level values are outliers. How? We base our analysis on the linear model $\log_2 n(PM_{ij} -^{*} BG) = m + a_i + b_j + \varepsilon_{ij}$, where $i$ labels chips and $j$ labels probes.
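One robust way to fit an additive model of this form is Tukey's median polish. The slide does not say which fitting procedure is used, so treat the sketch below as an illustration of the idea rather than the authors' implementation; the function name, the fixed iteration count, and the data are assumptions.

```python
from statistics import median

def median_polish(y, n_iter=10):
    """Fit y[i][j] ~ m + a[i] + b[j] by alternately sweeping out medians.

    Rows are chips (index i), columns are probes (index j).
    Returns (m, a, b, residuals).
    """
    n_rows, n_cols = len(y), len(y[0])
    resid = [row[:] for row in y]
    m, a, b = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(n_iter):
        for i in range(n_rows):              # sweep row medians into a
            d = median(resid[i])
            a[i] += d
            resid[i] = [v - d for v in resid[i]]
        d = median(b)                        # move the common part of b into m
        m += d
        b = [v - d for v in b]
        for j in range(n_cols):              # sweep column medians into b
            d = median(resid[i][j] for i in range(n_rows))
            b[j] += d
            for i in range(n_rows):
                resid[i][j] -= d
        d = median(a)                        # move the common part of a into m
        m += d
        a = [v - d for v in a]
    return m, a, b, resid

# Hypothetical log2 values: m = 8, chip effects (-1, 0, 1),
# probe effects (-2, 0, 2), plus one outlier of +6 on chip 0, probe 2.
y = [[5.0, 7.0, 15.0],
     [6.0, 8.0, 10.0],
     [7.0, 9.0, 11.0]]
m, a, b, resid = median_polish(y)
print(m, a, b)      # 8.0 [-1.0, 0.0, 1.0] [-2.0, 0.0, 2.0]
print(resid[0][2])  # 6.0, the outlier is left in the residual
```

In this example the outlier does not move the estimates at all, which is the behavior the slide asks for when 10%-15% of probe-level values may be outliers. The expression estimate for chip i would be m + a[i].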
35 Comparisons: We study the trade-off of bias/variance (accuracy/precision), or false positives/true positives. To place ourselves on the spectrum, we need some truth. This is often hard to come by, but we have some special data sets from GeneLogic and Affymetrix. We begin by looking at variability (SD) across replicates.
36 RMA has smaller SD, especially for low intensities
37
38 Summary: Simple data exploration is a useful tool for quality assessment. Statistical thinking is helpful for interpretation. Statistical models may help find signals in noise.
39 The End
40 cDNA Arrays [Figure: workflow diagram. cDNA clones (probes); PCR product amplification and purification; printing (0.1 nl/spot) onto the microarray; mRNA target; hybridize target to microarray; scanning (laser 1, laser 2; excitation, emission); overlay images and normalize; analysis]
41 High Density Oligonucleotide Arrays [Figure: GeneChip probe array (1.28 cm) and a hybridized probe cell (24 µm). Single-stranded, labeled RNA target; oligonucleotide probe; millions of copies of a specific oligonucleotide probe; >200,000 different complementary probes; image of hybridized probe array. Compliments of D. Gerhold]
42 Normalization at Probe Level
43 Normalization at Probe Level
44 Next comparison uses part of Spike-in Data B [Table: columns Probe Set, Conc 1, Conc 2, Rank; probe sets BioB-5, BioB-3, BioC-5, BioB-M, BioDn-3, DapX-3, CreX-3, CreX-5, BioC-3, DapX-5, DapX-M]. Later we consider many different combinations of concentrations.
45 Observed ranks [Table: columns Gene, AvDiff, MAS, Li&Wong, AvLog(PM-BG); rows for the spiked-in probe sets (BioB-5, BioB, BioC, BioB-M, BioDn, DapX, CreX, CreX, BioC, DapX, DapX-M) and a Top row]
46 Differential Expression
47 Differential expression
48 Differential expression
49 Differential expression
50
51
52
53 Comparisons using FC, N = 2, 3, 4, 6 and 12.