Rafael A Irizarry, Department of Biostatistics JHU


1 Getting Usable Data from Microarrays: it's not as easy as you think. Rafael A Irizarry, Department of Biostatistics, JHU, rafa@jhu.edu

2 Acknowledgements: Sandrine Dudoit, Terry Speed, Ben Bolstad, Yee-Hwa Yang (UCB); Leslie Cope (JHU); Francois Collin (GeneLogic); Bridget Hobbs (WEHI); Gene Brown's group at Wyeth/Genetics Institute and Uwe Scherf's Genomics Research & Development Group at Gene Logic, for generating the spike-in and dilution data; Gene Logic and Affymetrix, for permission to use their data.

3 Outline: Scientific questions; Quick review of technology; Role of statistics; Preprocessing; Case study

4 Scientific Questions. To understand gene function, it is helpful to know when and where it is expressed (expression) and under what circumstances the expression level is affected (differential expression). Expression patterns address questions concerning functional pathways and how cellular components work together to regulate and carry out cellular processes. Lipshutz et al. (1999) Nature Genetics (supplement), 21, pp

5 What do Microarrays do? Interrogate labeled nucleic acid samples. How do they do it? Labeled targets; probes.

6 Role of Statistics. [Flow diagram: Biological question (differentially expressed genes, sample class prediction, etc.) -> Experimental design -> Microarray experiment -> Image analysis (quantify expression) -> Preprocessing: Normalization -> Estimation, Testing, Clustering, Discrimination -> Biological verification and interpretation]

7 cDNA microarrays

8 Preprocessing Image Analysis: What is a spot? What is background? What do we do about background? Expression measure: How do we summarize pixel information to represent expression? Normalization: Can arrays from same batch be directly compared?

9 Segmentation. [Panels: adaptive segmentation (SRG); fixed circle segmentation.] Spots usually vary in size and shape.

10 Spatial effect. [Panels: Cy3 background intensity; Cy5 background intensity]

11 Local background. [Figure legend: GenePix; QuantArray; ScanAnalyze.] Spot uses morphological opening.

12 Quantification of Expression. For each spot on the slide we calculate Red intensity = $R_{fg} - R_{bg}$ and Green intensity = $G_{fg} - G_{bg}$ (fg = foreground, bg = background), and combine them in the log (base 2) ratio $\log_2(\text{Red intensity} / \text{Green intensity})$. We now have one differential expression value for each gene on each array.
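
The calculation on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `log_ratio`, the choice to return `None` for non-positive corrected intensities, and the spot values are assumptions, not part of the original slides.

```python
import math

def log_ratio(r_fg, r_bg, g_fg, g_bg):
    """Return log2(Red / Green) for one spot after background subtraction.

    Returns None when either background-corrected intensity is not
    positive, because the log ratio is undefined in that case
    (an assumed convention, not one stated on the slide).
    """
    red = r_fg - r_bg      # Red intensity = Rfg - Rbg
    green = g_fg - g_bg    # Green intensity = Gfg - Gbg
    if red <= 0 or green <= 0:
        return None
    return math.log2(red / green)

# Hypothetical spot: red 1100 over background 100, green 600 over background 100.
example = log_ratio(1100, 100, 600, 100)  # log2(1000 / 500) = 1.0
```

A value of 1.0 means the red channel is twice as bright as the green channel after background subtraction. Spots where the background exceeds the foreground have no defined log ratio, which is one reason the choice of background estimate matters.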

13 What is an MvA plot? $M = \log_2 R - \log_2 G$ vs. $A = (\log_2 R + \log_2 G)/2$

14 45 degree rotation of scatter plot: $M = \log_2 R - \log_2 G$ vs. $A = (\log_2 R + \log_2 G)/2$
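
A small sketch of why the MvA plot is a rotated scatter plot, assuming the scatter plot has $\log_2 G$ on the horizontal axis and $\log_2 R$ on the vertical axis. The function names and the intensities are hypothetical. Rotating a point clockwise by 45 degrees gives coordinates that equal A and M up to constant factors of $\sqrt{2}$.

```python
import math

def mva(r, g):
    """Return (M, A) for positive, background-corrected intensities r and g."""
    log_r, log_g = math.log2(r), math.log2(g)
    return log_r - log_g, (log_r + log_g) / 2

def rotate_45_clockwise(x, y):
    """Rotate the point (x, y) clockwise by 45 degrees about the origin."""
    c = math.sqrt(0.5)  # cos(45 deg) = sin(45 deg)
    return c * (x + y), c * (y - x)

# Hypothetical spot with R = 1024 and G = 256, so log2 R = 10 and log2 G = 8.
m, a = mva(1024, 256)                       # M = 2, A = 9
x_rot, y_rot = rotate_45_clockwise(8, 10)   # point (log2 G, log2 R), rotated

# The rotated coordinates are A and M rescaled by sqrt(2):
a_from_rotation = x_rot / math.sqrt(2)
m_from_rotation = y_rot * math.sqrt(2)
```

Because the rescaling is the same for every spot, the shape of the point cloud is that of the rotated scatter plot, with the line R = G mapped onto M = 0.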

15 Background matters. [Panels: morphological opening; local background]

16 Why do we normalize?

17 Why do we normalize?

18 The red-green ratios can be spatially biased. Top 2.5% of ratios red, bottom 2.5% of ratios green.

19 MvA-plot by print-tip-group

20 Case Study: Improving expression measures provided by Affymetrix software

21 PM MM

22 Statistical Problem: Summarize probe intensity pairs (PM and MM) to give a measure of expression for a probe set. There are various summaries: MAS 4.0 (AvDiff); MAS 5.0; Li and Wong's MBEI; RMA; many others.

23 Default until 2002. GeneChip software uses Avg.diff:
$$\text{Avg.diff} = \frac{1}{|A|} \sum_{j \in A} (PM_j - MM_j)$$
with $A$ a set of suitable pairs chosen by the software. Obvious problems: many negative expression values; no log transform.
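
A minimal sketch of the average difference, to show how negative expression values arise. The function name `avg_diff`, its `pairs` argument, and the probe intensities are hypothetical; how the software actually chooses the set A is not described on the slide, so here A defaults to all pairs.

```python
def avg_diff(pm, mm, pairs=None):
    """Average of PM_j - MM_j over the probe-pair indices in `pairs`.

    `pairs` plays the role of the set A; by default every pair is used
    (an assumption made for illustration).
    """
    if pairs is None:
        pairs = range(len(pm))
    pairs = list(pairs)
    return sum(pm[j] - mm[j] for j in pairs) / len(pairs)

# Hypothetical probe set with four probe pairs; MM exceeds PM in three of them.
pm = [120, 95, 210, 80]
mm = [150, 110, 180, 130]

all_pairs = avg_diff(pm, mm)             # (-30 - 15 + 30 - 50) / 4 = -16.25
subset = avg_diff(pm, mm, pairs=[0, 2])  # (-30 + 30) / 2 = 0.0
```

Whenever MM tends to exceed PM, the summary is negative, and a negative value cannot be log-transformed.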

24 Why we take log2

25 What is the evidence? Lockhart et al. Nature Biotechnology 14 (1996)

26 What we do: four steps. We use only PM, and ignore MM. Also, we:
1. Adjust for background on the raw intensity scale;
2. Carry out quantile normalization of $PM -^{*} BG$ with chips in suitable sets, and call the result $n(PM -^{*} BG)$;
3. Take $\log_2$ of the normalized, background-adjusted PM;
4. Carry out a robust multi-chip analysis (RMA) of the quantities $\log_2 n(PM -^{*} BG)$.
We call our approach RMA.
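
The quantile normalization and log steps can be sketched as below. This assumes the usual definition of quantile normalization (each chip's k-th smallest value is replaced by the mean of the k-th smallest values across chips); the slide does not spell out the algorithm. The function name, the tie handling (ties broken by position), and the toy intensities are hypothetical, and the input is taken to be already background-adjusted.

```python
import math

def quantile_normalize(chips):
    """Give every chip the same empirical distribution.

    chips: list of equal-length lists of intensities, one list per chip.
    Returns new lists; the input is left unchanged.
    """
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean across chips at each rank.
    reference = [sum(s[k] for s in sorted_chips) / len(chips) for k in range(n)]
    normalized = []
    for chip in chips:
        # Probe indices ordered from smallest to largest intensity on this chip.
        order = sorted(range(n), key=lambda i: chip[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized

# Two hypothetical chips with three background-adjusted PM values each.
chips = [[5, 2, 3], [4, 1, 9]]
normalized = quantile_normalize(chips)   # [[7.0, 1.5, 3.5], [3.5, 1.5, 7.0]]

# Log step: take log2 of the normalized values.
log_normalized = [[math.log2(v) for v in chip] for chip in normalized]
```

After normalization every chip contains the same set of values, so differences between chips reflect only the ranks of the probes within each chip.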

27 Data used for next few slides: spike-in data set A. 11 control cRNAs spiked in, all at the same concentration, which varies across 12 chips.

28 Why we ignore the MM values

29 Why we ignore the MM values

30 Why we ignore the MM values

31 Why and how we remove background. White arrows mark the means.

32 Why and how we normalize

33 Why we write $\log_2 n(PM -^{*} BG)$ = chip effect + probe effect: because probe effects are additive on the log scale.

34 Why we carry out a Robust Multi-chip Analysis. Why multi-chip? To put each chip's values in the context of a set of similar values. Why robust? In the old human and mouse series, perhaps 10%-15% of probe level values are outliers. How? We base our analysis on the linear model:
$$\log_2 n(PM_{ij} -^{*} BG) = m + a_i + b_j + \varepsilon_{ij}$$
where $i$ labels chips and $j$ labels probes.
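
One robust way to fit an additive model of this form is Tukey's median polish, sketched below. The slide says only that the analysis is robust, so the choice of median polish here, the function name, the fixed iteration count, and the toy data are assumptions for illustration. Rows are chips (i) and columns are probes (j).

```python
from statistics import median

def median_polish(y, n_iter=10):
    """Fit y[i][j] ~ m + a[i] + b[j] by alternately removing row and column medians.

    Returns (m, a, b, residuals); at every stage
    y[i][j] == m + a[i] + b[j] + residuals[i][j].
    """
    rows, cols = len(y), len(y[0])
    resid = [list(row) for row in y]
    m, a, b = 0.0, [0.0] * rows, [0.0] * cols
    for _ in range(n_iter):
        # Move each row (chip) median out of the residuals into a[i].
        for i in range(rows):
            d = median(resid[i])
            a[i] += d
            resid[i] = [v - d for v in resid[i]]
        d = median(b)
        m += d
        b = [v - d for v in b]
        # Move each column (probe) median out of the residuals into b[j].
        for j in range(cols):
            d = median(resid[i][j] for i in range(rows))
            b[j] += d
            for i in range(rows):
                resid[i][j] -= d
        d = median(a)
        m += d
        a = [v - d for v in a]
    return m, a, b, resid

# Hypothetical log2 values: 8 + chip effect (-1, 0, 1) + probe effect (-2, 0, 2),
# with one outlier of +6 added to chip 0, probe 0.
y = [[11, 7, 9],
     [6, 8, 10],
     [7, 9, 11]]
m, a, b, resid = median_polish(y)
expression = [m + a_i for a_i in a]  # one summary value per chip
```

In this toy example the outlier ends up entirely in the residual for chip 0, probe 0, and the chip and probe effects are recovered unchanged, which is the behavior a robust fit is meant to provide.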

35 Comparisons We study the trade-off of Bias/variance (accuracy/precision), or False positives/true positives. To place ourselves on the spectrum, we need some truth. Often hard to come by, but we have some special data sets from GeneLogic and Affymetrix. We begin looking at variability (SD) across replicates.
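
Looking at variability across replicates amounts to computing, for each gene, the mean and SD of its log-scale expression values over replicate arrays. A minimal sketch follows; the function name, gene labels, and values are hypothetical.

```python
from statistics import mean, stdev

def replicate_summary(log_expr):
    """Map each gene to (mean, SD) of its log2 expression across replicates.

    log_expr: dict of gene name -> list of log2 expression values,
    one value per replicate array (at least two replicates).
    """
    return {gene: (mean(values), stdev(values)) for gene, values in log_expr.items()}

# Hypothetical log2 expression values on three replicate arrays.
log_expr = {
    "gene_low": [7.0, 7.5, 8.0],      # low intensity, noisier
    "gene_high": [10.0, 10.0, 10.0],  # high intensity, stable
}
summary = replicate_summary(log_expr)
# gene_low: mean 7.5, SD 0.5; gene_high: mean 10.0, SD 0.0
```

Plotting SD against mean for every gene, once per expression measure, gives a direct comparison of precision across the intensity range.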

36 RMA has smaller SD, especially for low intensities

38 Summary: Simple data exploration is a useful tool for quality assessment. Statistical thinking is helpful for interpretation. Statistical models may help find signals in noise.

39 The End

40 cDNA Arrays. [Figure: cDNA microarray workflow. Labels: cDNA clones (probes); PCR product amplification; purification; printing (0.1 nl/spot); microarray; mRNA target; hybridize target to microarray; scanning with laser 1 and laser 2 (excitation, emission); overlay image and normalize; analysis]

41 High Density Oligonucleotide Arrays. [Figure: GeneChip probe array and hybridized probe cell. Labels: single stranded, labeled RNA target; oligonucleotide probe; 24 µm; 1.28 cm; millions of copies of a specific oligonucleotide probe; >200,000 different complementary probes; image of hybridized probe array. Compliments of D. Gerhold.]

42 Normalization at Probe Level

43 Normalization at Probe Level

44 Next comparison uses part of Spike-in Data B. [Table: columns Probe Set, Conc 1, Conc 2, Rank. Probe sets: BioB-5, BioB-3, BioC-5, BioB-M, BioDn-3, DapX-3, CreX-3, CreX-5, BioC-3, DapX-5, DapX-M.] Later we consider many different combinations of concentrations.

45 Observed ranks. [Table: columns Gene, AvDiff, MAS, Li&Wong, AvLog(PM-BG). Rows: BioB-5, BioB, BioC, BioB-M, BioDn, DapX, CreX, CreX, BioC, DapX, DapX-M, Top.]

46 Differential Expression

47 Differential expression

48 Differential expression

49 Differential expression

53 Comparisons using FC, N = 2, 3, 4, 6 and 12.