
UNIVERSITY OF WISCONSIN
DEPARTMENT OF BIOSTATISTICS AND MEDICAL INFORMATICS

Technical Report # 183
July 2004

To Pool or Not to Pool: A Question of Microarray Experimental Design

C. Kendziorski (1), R.A. Irizarry (2), K. Chen (3), J.D. Haag (3), and M.N. Gould (3)

(1) Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison
(2) Department of Biostatistics, Johns Hopkins University
(3) McArdle Laboratory for Cancer Research, University of Wisconsin-Madison

K6/446 Clinical Science Center, 600 Highland Avenue, Madison, WI

1 Abstract

Over 10% of the data sets catalogued in the Gene Expression Omnibus Database involve messenger RNA samples that have been pooled prior to hybridization. Pooling affects data quality and inference, but the exact effects are not yet known, as pooling has not been systematically studied in the context of microarray experiments. Here we report on the results of an experiment designed to evaluate the utility of pooling and its impact on identifying differentially expressed genes. We find that inference for most genes is not adversely affected by pooling, and we recommend that pooling be done when fewer than three arrays are used in each condition. For larger designs, pooling does not significantly improve inferences if few subjects are pooled. The realized benefits in this case do not outweigh the price paid for the loss of individual specific information. Pooling is beneficial when many subjects are pooled, provided independent samples contribute to multiple pools.

2 Introduction

Messenger RNA samples are often pooled in a microarray experiment out of necessity [1, 2] or in an effort to reduce the effects of biological variation [3-6]. The idea behind the latter motivation is that differences due to subject-to-subject variation will be minimized, making substantive features easier to find. This is often desirable when primary interest is not in the individual (e.g. making a prognosis or diagnosis), but rather in characteristics of the population from which certain individuals are obtained (e.g. identifying biomarkers or expression patterns common across individuals). Since pooled designs allow for measurement of groups of individuals using relatively few arrays, they have the potential to decrease costs when arrays are expensive relative to samples. In spite of the potential advantages, pooled designs are often discouraged due to concerns regarding the inability to identify and appropriately transform or remove aberrant subjects

and the inability to estimate within population variation. These concerns are uncontroversial when all subjects in a study are pooled and only technical replicates are obtained. However, there are pooling strategies that compromise between pooling everything and considering only individual biological samples on individual arrays. Theoretical advantages of such designs have been established [10, 12]. A brief review is given here.

A microarray experiment to estimate gene expression levels consists of extraction and labeling of mRNA from $n_s$ subjects, hybridization to $n_a$ arrays, followed by scanning and image processing. Assuming that sufficient data pre-processing has been done to remove artifacts within and across a set of arrays, the gene expression measurements for a gene, denoted by $x_1, x_2, \ldots, x_{n_a}$, are considered independent and identically distributed samples from a distribution with mean $\theta$, the quantity of interest. The average of the measurements is used to estimate $\theta$. Since the processed measurements are affected primarily by biological and technical variation, with variances denoted $\sigma^2_\epsilon$ and $\sigma^2_\xi$, respectively, the $x_i$ are given by

$$x_i = \theta + \epsilon_i + \xi_i = T_i + \xi_i .$$

Assuming that mRNAs average out when pooled (biological averaging),

$$T_i = \frac{1}{r_s} \sum_{k=1}^{r_s} T_{ik},$$

where $r_s$ denotes the number of subjects contributing mRNA to a pool and $T_{ik}$ is the $k$th subject's contribution to the $i$th pool. Note that individual subjects contribute to one and only one pool, and in this way the pools contain biological replicates. Through biological averaging, the variability of $\epsilon_i$ is reduced to $\sigma^2_\epsilon / r_s$. The variance of the estimator for $\theta$ is then given by

$$\frac{1}{n_p} \left( \frac{\sigma^2_\epsilon}{r_s} + \frac{\sigma^2_\xi}{r_a} \right),$$

where $n_p$ is the total number of pools and $r_a$ the number of arrays probing each pool. The precision with which expression levels are estimated and the power to identify differentially expressed genes are inversely related to this variance, which is reduced by pooling since $r_s > 1$ for a pooled design.
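As a rough numerical illustration of the variance expression above, the following Python sketch evaluates it for a non-pooled and a pooled design on the same number of arrays. The function name `estimator_variance` and the variance components are hypothetical, chosen only for illustration; they are not estimates from this study.

```python
def estimator_variance(sigma2_bio, sigma2_tech, n_pools,
                       subjects_per_pool=1, arrays_per_pool=1):
    """Variance of the estimator of theta under the simple model above:
    (1 / n_p) * (sigma2_eps / r_s + sigma2_xi / r_a)."""
    if min(n_pools, subjects_per_pool, arrays_per_pool) < 1:
        raise ValueError("counts must be at least 1")
    return (sigma2_bio / subjects_per_pool
            + sigma2_tech / arrays_per_pool) / n_pools


# Hypothetical variance components (assumption, not from the report).
SIGMA2_BIO, SIGMA2_TECH = 0.04, 0.01

# "6 on 6": six individuals, one per array (r_s = 1).
var_individuals = estimator_variance(SIGMA2_BIO, SIGMA2_TECH, n_pools=6)

# "12 on 6": six pools of two subjects each (r_s = 2).
var_pooled = estimator_variance(SIGMA2_BIO, SIGMA2_TECH, n_pools=6,
                                subjects_per_pool=2)

print(f"6 on 6 : {var_individuals:.4f}")
print(f"12 on 6: {var_pooled:.4f}")
```

With these made-up values the pooled design has the smaller variance; the size of the gain depends entirely on how large the biological component is relative to the technical one.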
The larger the biological variability relative to the technical variability, the larger the overall variance reduction and benefit of the pooled

design. The utility of pooling in practice depends on the extent to which the assumed conditions hold for microarray data. Previous studies provide limited support for biological averaging [6] and variance reduction in pools [11], but key questions remain unresolved:

1. To what extent is variability reduced by pooling?
2. Is biological variability larger than technical variability for most genes?
3. To what extent does biological averaging hold?
4. Are inferences regarding differential expression comparable for pooled and non-pooled designs?

To address these questions, we report here on the results of a large Affymetrix experiment with individuals and pools of varying sizes in two biological conditions (see M1).

3 Results

3.1 Pooling Reduces Overall Variability

Gene specific sample variances were calculated across conditions for individual samples, pools of 2 and 3, and technical replicates. Figure 1 shows nearly perfect stochastic ordering, with decreasing variability from individuals to technical replicates. In the population of genes, therefore, there is decreasing variability overall from individuals to pools to technical replicates. This trend does not hold, however, for each gene. Variability across technical replicates was larger than variability across individuals for 31% of the genes, indicating negligible biological variability and therefore negligible gains from pooling for these genes.

3.2 Pooling Results in Biological Averaging for Most Genes

To investigate whether mRNA samples average out when pooled (biological averaging), mathematical averages (referred to hereinafter as averages) across individuals were compared to the corresponding pools. Some distortion is expected, as the original mRNA abundance undergoes a series of possibly non-linear transformations during the measurement process. In the absence of pooling, individual samples are transformed separately, whereas when pooling is done, the transformation applies to the pool rather than to each individual sample. Any averaging that takes place on the scale of raw mRNA abundance is transformed during data generation and processing and, as a result, measurements from a pool may not correspond to an average of the individual samples that comprise the pool.

Figure 2 gives an example of distortion caused by pooling. Shown is one gene for which the averages on the raw intensity scale are quite similar across the individuals and pools. Following a log transformation, as often recommended for microarray data [13-18], the averages in the individuals are smaller than those in the pools. Such differences are exacerbated when there are outliers (e.g. arrays 3 and 10 in Figure 2). Part of the reason for this is that the log transformation of each individual has a stronger attenuating effect than that same transformation applied to the average of the individuals (see M2). Figure 3a shows that a similar artifact affects approximately 25% of the genes.

In spite of distortions, we find that pools are more similar to averages of the contributing samples than they are to other pools (Figure 4). Further, cluster plot WS1 shows that pools of 2 cluster closely with their averages. The same holds for pools of 3. When pools of 2 and 3 are clustered together along with the averages, there is less than perfect, but still good, concordance among the pairs (Figure WS2).
These plots suggest that biological averaging occurs for most, but not all, genes. Furthermore, for the genes where biological averaging does not occur, similar amounts of distortion are often observed in both control and treatment conditions and, as a result, the identification of DE genes is not grossly affected (Figure 3b).
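The attenuation described in this section (the average of log values falling below the log of the average, more so when an outlier is present) can be illustrated with a short Python sketch. The intensity values and function names below are hypothetical and serve only to show the direction and rough size of the effect; they are not data from this experiment.

```python
import math
from statistics import mean


def log2_of_mean(values):
    """Log of the raw-scale average: what an idealized pool would report."""
    return math.log2(mean(values))


def mean_of_log2(values):
    """Average of individually log-transformed values."""
    return mean(math.log2(v) for v in values)


def distortion(values):
    """Gap between the two summaries; non-negative because log is concave."""
    return log2_of_mean(values) - mean_of_log2(values)


# Hypothetical raw intensities for four individuals (assumption).
tight = [100.0, 110.0, 95.0, 105.0]
with_outlier = [100.0, 110.0, 95.0, 800.0]

print(f"no outlier  : {distortion(tight):.3f}")
print(f"with outlier: {distortion(with_outlier):.3f}")
```

The gap is close to zero when the individuals are similar and grows when one value is far from the rest, which matches the pattern described for Figure 2.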

3.3 Identifying Differentially Expressed Genes

Designs without Biological Replicates: Without biological replication, outliers cannot be found and appropriate variance components cannot be estimated. Figure 5 gives an example of how this can adversely affect DE inferences. Each affected gene appears to be DE when only technical replicates are considered. It is clear when biological replicates are available that the first DE call is due to an outlier and the second DE call is due to an underestimation of the gene specific variance. The last gene shown is not affected by an outlier. Correct inferences regarding DE would be made in this case using only technical replicates. Figure 6a shows that this latter case is most representative of the full data set, as similar lists of DE genes are identified whether true biological replicates or technical replicates are used (e.g. 12 on 4 vs. 12 on 1 x 4; see M1 for terminology). In fact, accuracy is only slightly reduced when single arrays with pools of 12 (12 on 1) are considered. This is not the case when pooling is not done. An analysis using individuals on single arrays (1 on 1) reduces accuracy by approximately 50%.

Pooled Designs with Biological Replication: When the mRNA from an individual contributes to one and only one pool, and many pools are constructed, the pools can be used to properly assess biological variation. Pooling extra subjects onto a fixed number of arrays decreases variability across experiments and provides a representative list of genes with accuracy similar to the representative lists obtained without pooling (Figure 6a). For example, the representative lists from an analysis of 3 on 3, 6 on 3, or 9 on 3 have similar accuracy. However, the variability across the 9 on 3 experiments is reduced, resulting in a list for a particular experiment that has properties near those of the representative list. Similar results are found for other comparisons.
Reducing the number of arrays in an experiment without increasing the number of subjects decreases accuracy (Figure 6a).

False Discovery Rate Control: Designs in Figure 6a are compared based on agreement among the top ranking genes for lists with fixed size. For these comparisons, no consideration

of the false discovery rate (FDR) associated with each list is made. Results regarding accuracy are very similar when comparisons are instead made among lists with fixed FDR. There is one major difference: when biological replicates are not available and technical replicates are used, estimates of the variance required for FDR specification are incorrect, resulting in many DE calls with very low accuracy (Figure 6b). Figure 6b also provides insight into how the number of genes identified as DE using an FDR based criterion varies among the designs. As the number of arrays decreases for a fixed number of subjects, fewer genes are identified as DE. This is expected: to obtain comparable results while decreasing the number of arrays, one must increase the number of subjects appropriately to maintain a so-called equivalent design.

Equivalent Designs: Given the biological and technical variability in this study, the following designs are equivalent (see M4): 100 on 100 vs. 160 on 80; 25 on 25 vs. 42 on 21; 7 on 7 vs. 12 on 6. Figure WS4 shows that there is little difference between the genes identified from the last pair of designs. However, although the designs are equivalent on average, gene specific DE calls are affected by varying biological and technical variability and, as a result, pooling while reducing the number of arrays will be useful for identifying some genes, at the expense of not identifying others.

4 Discussion

Experimental designs utilizing pooled mRNA samples are often used out of necessity or in an effort to reduce the effects of biological variation, making substantive differences easier to find. Pooled designs are attractive as they have the potential to decrease cost, since a large number of individual samples can be evaluated using relatively few arrays. We have here considered fundamental properties of various pooling designs to evaluate their performance and to determine whether the basic conditions required for pooling to be useful hold.
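The idea of an equivalent design used in the Results can be sketched under the simple single-gene variance model of the Introduction, with one array per pool. The Python function below finds the smallest number of pools whose estimator variance does not exceed that of a non-pooled design. This is only an illustration: the function name and the variance components are hypothetical, and the sketch is not the formula from references 10 or 19, so it is not expected to reproduce the specific pairs of designs reported for this data set.

```python
import math


def equivalent_pools(n_unpooled, sigma2_bio, sigma2_tech, subjects_per_pool):
    """Smallest number of pools n_p (one array per pool) such that
    (sigma2_bio / r_s + sigma2_tech) / n_p
        <= (sigma2_bio + sigma2_tech) / n_unpooled."""
    total = sigma2_bio + sigma2_tech
    per_pool = sigma2_bio / subjects_per_pool + sigma2_tech
    ratio = n_unpooled * per_pool / total
    # Small tolerance so exact integer ratios are not rounded up by float noise.
    return math.ceil(ratio - 1e-9)


# Hypothetical variance components: biological three times technical.
n_pools = equivalent_pools(n_unpooled=7, sigma2_bio=3.0, sigma2_tech=1.0,
                           subjects_per_pool=2)
print(f"7 on 7 is matched by {n_pools * 2} on {n_pools}")
```

Under these made-up values, two fewer arrays are needed at the cost of three additional subjects. When the technical component dominates, the function returns a number close to `n_unpooled` and pooling saves few or no arrays.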

The most basic condition is the assumption of biological averaging. Many investigators conjecture that mRNA abundance levels average out when pooled. This may be true; however, due to non-linearities introduced in the generation and processing of microarray data, an average on the scale of raw mRNA abundance will not necessarily correspond to the average of the same highly processed mRNA measurements. The effects of the log transformation were demonstrated (see M5); there may also be implicit transformations that could affect the measurements, but none of these are substantial. Most expression measurements from mRNA pools are similar to averages of the individuals that comprise the pool. For the majority of genes where there was a large difference, the difference was similar across biological conditions, resulting in comparable DE inferences from individuals and pools.

Because of this, pooled designs were never found to perform significantly worse than non-pooled designs. For very small designs in which only one or two arrays are available in each biological condition, pooling dramatically improves accuracy. Although valid, the results in such small designs are limited, as gene specific variance components cannot be well estimated and outliers cannot be identified. Designs involving technical replicates only are similarly limited; and when used with statistical methods that require estimates of both biological and technical variability (e.g. FDR based methods), they can be very misleading. Without biological replicates (in the individuals or pools), most FDR based methods will incorrectly identify a large number of DE genes due to misestimation of the appropriate variance components. When the goal of the experiment is to identify DE genes, investigators should favor biological, not technical, replicates.

For larger designs where biological replicates are employed, pooling is not always advantageous.
Accuracy was similar across designs in both the rank and FDR based analyses when the number of arrays was fixed and the number of subjects varied. There is a slight decrease in the variability among the lists of DE genes identified as the number of subjects

pooled increased; but the modest reduction resulting from pooling two to three subjects per array is generally not worth the price paid for the loss of individual specific information. A greater advantage can be gained by pooling a larger number of subjects.

For a fixed number of subjects, reducing the number of arrays results in decreasing accuracy and fewer genes with small FDR. This is expected: to maintain equivalent properties, the number of subjects pooled must be increased appropriately. An optimal method for specifying equivalent designs will depend on a number of factors, including the method used for finding DE genes, the distributions of biological and technical variability across genes, and the underlying model describing expression values. Early attempts defined equivalence in terms of the precision with which underlying intensity could be estimated for a single gene [10] and on average across genes [19]. These models are greatly simplified, as most non-linearities are not considered. To the extent that these simplified models hold, we found the following designs to be equivalent for this data set: 100 on 100 vs. 160 on 80; 25 on 25 vs. 42 on 21; 7 on 7 vs. 12 on 6. Evidence for equivalence was given in the last case (7 on 7 vs. 12 on 6). The financial benefit of pooling for this particular case is minimal at best. However, the results demonstrate the potential of equivalent design specification. For larger designs, the realized benefit of decreasing the number of arrays can be substantial. Furthermore, in studies of humans, biological variability is much larger relative to technical variability [20], the condition that yields a greater reduction in the number of arrays required.

5 Methods

1. Data Collection: Thirty female Wistar-Furth (WF) rats were obtained from Harlan Sprague-Dawley, Inc. at 5 weeks of age. The animals were group housed and provided Teklad lab diet (8604) and acidified water ad libitum. All rats were maintained with a

light/dark cycle of 12 hours. After a two-week period of acclimation, all animals were randomly assigned to a control diet or a diet containing the retinoid X receptor (RXR) agonist LG100268, and were pair-fed every 24 hours for 14 days. The experimental diets were prepared by adding LG100268, 100 mg per kg diet meal (w/w), to the appropriate amount of ground rat meal diet and mixing for 15 minutes in an industrial size mixer in the chemical fume hood. All diets were double bagged and stored at -20 °C for one week. During the administration of the experimental diets, body weight was monitored by weighing at two or three day intervals. After consuming the diets for 14 days, all rats were removed from the study for tissue collection.

Messenger RNA samples were obtained and Affymetrix Rat230A chips were used to measure gene expression for 15,923 genes in control and RXR conditions (15 animals in each condition). The best 12 from each condition were chosen to construct twelve pools of pairs, eight pools of triples, and two pools of twelve subjects. To obtain mRNA samples, total RNA was extracted from individual mammary glands; 5 µg of each RNA sample was reverse transcribed, synthesized to double-stranded cDNA, and then transcribed to biotin-labeled cRNA targets, which were then fragmented. For the individuals, each chip was hybridized using 10 µg of the fragmented cRNA targets; for the pools, equal amounts of fragmented targets were combined from the individuals to give a total of 10 µg. Within each pool group, the processing of samples was randomized across treatments and Affymetrix scanners. To estimate technical variability, selected samples were analyzed using additional microarrays. These samples are: A3 and B3, with three technical replicates; AQ and BQ, with four technical replicates. Figure 7 gives a schematic of the experiment. The probe level data were processed using RMA (see M6).
A single individual sample hybridized onto a single array is referred to as "1 on 1"; a pool of $n$ subjects hybridized onto a single array is "$n$ on 1"; $M = n_p \times n$ unique subjects hybridized onto $n_p$ arrays (a pool of $n$ onto each array) will be referred to as

"$M$ on $n_p$". $N$ technical replicates of an "$M$ on $n_p$" pool are denoted by "$M$ on $n_p$ x $N$".

2. This is well known as Jensen's inequality [23].

3. Evaluation of Experimental Designs: Student t-tests were done on the full set of 12 individuals in each condition as a reference. For each design, moderated t-statistics [21] are obtained if more than three arrays are available. With fewer than three arrays, a fold change is calculated. For the rank plots, the statistics are ranked from largest to smallest. The top N genes make up a list of size N. For the FDR plots, the FDR of a list is estimated using the q-value approach [22]. Each color represents one potential design and reports the percentage of DE calls in common between that design and the reference for lists of fixed size or fixed FDR. The percentage is referred to as accuracy. Each vertical tick on the FDR plot marks 100 genes identified at the specified level of FDR. For some designs, a number of subsets could be chosen to generate a list. For example, with a 3 on 3 design, many different subsets of individuals can be chosen from those available. For each design, we considered 100 subsets at random. The solid line gives an average performer across the 100; the dashed line gives the worst case performer. Similar results are obtained when a Wilcoxon statistic or a statistic measuring the posterior odds of differential expression [14] is considered for evaluation or reference.

4. Specification of Equivalent Designs: Formulas exist specifying the number of subjects and arrays required in a pooled design so that gene specific [10] or average [19] estimation efficiency remains comparable to that of a design where pooling is not done and additional arrays are used. Similar formulas hold when equivalent power is considered.

5. Some distortion also appears on the raw intensity scale. This is not considered here, as data are almost always transformed prior to analysis.
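The accuracy measure described in M3 for lists of fixed size can be sketched as follows. This is a minimal Python illustration, not the authors' code: the gene identifiers and statistics are made up, the function names are hypothetical, and ranking by absolute value is an assumption made here so that large negative statistics also count as evidence of differential expression.

```python
def top_n(stats, n):
    """Gene ids with the n largest statistics (by absolute value; assumption)."""
    if n < 1 or n > len(stats):
        raise ValueError("n must be between 1 and the number of genes")
    ranked = sorted(stats, key=lambda gene: abs(stats[gene]), reverse=True)
    return set(ranked[:n])


def accuracy(design_stats, reference_stats, n):
    """Percentage of the design's top-n genes also in the reference top-n."""
    shared = top_n(design_stats, n) & top_n(reference_stats, n)
    return 100.0 * len(shared) / n


# Hypothetical statistics for four genes (assumption, not study data).
reference = {"g1": 5.0, "g2": -4.0, "g3": 0.5, "g4": 3.0}
design = {"g1": 4.5, "g2": 0.2, "g3": 3.5, "g4": -3.9}

print(accuracy(design, reference, n=2))  # one of two top genes is shared
```

In the study, this comparison is repeated over 100 randomly chosen subsets per design, and the average and worst-case values are reported.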

6. RMA: Robust Multi-Array Analysis (RMA) was used to pre-process the raw Affymetrix GeneChip data [24]. RMA fits a linear model to the log probe intensities for each probeset. The linear model includes a sample effect (the parameter of interest), a probe effect, and an error term. Figure WS5 shows standard error estimates of the sample effects derived from the residuals, obtained from fitting the linear model, for each array. The standard errors are normalized so that each probeset has the same median standard error across arrays. RMA has been demonstrated to provide improved precision over the default algorithms provided by Affymetrix to generate MAS Signals [25]. We observe similar findings for this data set, where cluster plots of MAS Signal data show little structure (Figure WS6).

Acknowledgements

The authors thank Gary Churchill, Michael Newton, and Terry Speed for useful discussions. This work was supported by NIH R03-CA and NIH R01-CA

References

1. Jin, W., et al. The contributions of sex, genotype and age to transcriptional variance in Drosophila melanogaster. Nature Genetics, 29, (2001).
2. Saban, M.R., et al. Time course of LPS-induced gene expression in a mouse model of genitourinary inflammation. Physiological Genomics, 5, (2001).
3. Chabas, D., et al. The influence of the proinflammatory cytokine, Osteopontin, on autoimmune demyelinating disease. Science, 294, (2001).
4. Waring, J.F., et al. Clustering of hepatotoxins based on mechanism of toxicity using gene expression profiles. Toxicology and Applied Pharmacology, 175, 28-42, (2001).
5. Enard, W., et al. Intra- and interspecific variation in primate gene expression patterns. Science, 296, (2002).
6. Agrawal, D., et al. Osteopontin identified as lead marker of colon cancer progression, using pooled sample expression profiling. Journal of the National Cancer Institute, 94, (2002).
7. Churchill, G.A. and Oliver, B. Sex, flies and microarrays. Nature Genetics, 29, (2001).
8. Simon, R.M. and Dobbin, K. Experimental Design of DNA Microarray Experiments. BioTechniques, 34, (2003).
9. Churchill, G.A. Fundamentals of Experimental Design for cDNA microarrays. Nature Genetics Supplement, (2002).
10. Kendziorski, C.M., Zhang, Y., Lan, H. and Attie, A. The efficiency of mRNA pooling in microarray experiments. Biostatistics, 4, (2003).

11. Han, E.S., et al. Reproducibility, sources of variability, pooling, and sample size: important considerations for the design of high-density oligonucleotide array experiments. Journal of Gerontology, 59A(4), (2004).
12. Peng, X., et al. Statistical implications of pooling RNA samples for microarray experiments. BMC Bioinformatics, 4:26, (2003).
13. Kendziorski, C.M., Newton, M.A., Lan, H., and Gould, M.N. On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Statistics in Medicine, 22, (2003).
14. Kerr, M.K., Martin, M. and Churchill, G.A. Analysis of variance for gene expression microarray data. Journal of Computational Biology, 7, (2000).
15. Lonnstedt, I. and Speed, T.P. Replicated microarray data. Statistica Sinica, 12, 31-46, (2002).
16. Quackenbush, J. Computational analysis of microarray data. Nature Reviews Genetics, 2, (2001).
17. Wolfinger, R.D., et al. Assessing gene significance from cDNA microarray expression data via mixed models. Journal of Computational Biology, 8, (2001).
18. Yang, Y.H., et al. Normalization for cDNA microarray data: A composite method addressing local, global, and multiple slide systematic variation. Nucleic Acids Research, 30(4), e15, (2002).
19. Kendziorski, C. Pooling biological samples in microarray studies. In DNA Microarrays and Statistical Genomics Techniques: Design, Analysis, and Interpretation of Experiments. Eds. D.B. Allison, G. Page, T.M. Beasley, J.W. Edwards. Marcel Dekker, New York, New York, (2004).

20. Cheung, V., et al. Natural variation in human gene expression assessed in lymphoblastoid cells. Nature Genetics, 33, (2003).
21. Smyth, G.K. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3(1), Article 3.
22. Storey, J.D., and Tibshirani, R. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 100, (2003).
23. Jensen, J. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Mathematica, 30, (1906).
24. Irizarry, R.A., et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, (2003).
25. Irizarry, R.A., et al. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research, 31(4), e15, (2003).

[Figure 1 plot: cumulative distribution P(X <= x) against x = log(variance) for Individuals, Pools of 2, Pools of 3, and Technical replicates.]

Figure 1: Gene specific sample variances: Cumulative distribution functions of gene specific sample variances are calculated by combining estimates across biological conditions. Density estimates are shown in the Web Supplement (Figure WS3).

[Figure 2 plot: Expression against number of samples in pool.]

Figure 2: Distorted gene. Expression values are shown for individuals, pools, and technical replicates. The + (x) indicates the mathematical average of the raw (log) data; the m indicates the median of the values. The numbers refer to arrays (control condition).

[Figure 3 plots: panel (a), difference between pools of 2 and individuals against standard deviation estimate; panel (b), FC difference between pools of 2 and individuals.]

Figure 3: Effects of distortion within and between conditions. Panel (a) shows the mean difference between the pools of 2 and the corresponding averages across individuals (control condition) as a function of standard deviation (SD) estimated within the control condition. The percentiles of SD are shown (bottom) along with the percentage of genes (top) having values in the pools of 2 that are larger than the corresponding average across individuals. For the 25% of the genes with largest SD, over 80% have values larger in the pools of 2. The treatment condition and pools of 3 give similar results. Panel (b) shows the difference between the fold change (FC) values (control/treatment) calculated from the pools of 2 and the individuals. Distortion affects both control and treatment and largely cancels out when FCs are considered, resulting in similar FC values in the individuals and pools.

Figure 4: M vs. A plots of pools and averages: The left and middle columns compare averages to the corresponding pools. The right panel shows averages plotted against a non-corresponding pool.

[Figure 5 plot: Expression for three genes, labeled Distorted, High Var, and Large t; the first two are marked Affected and the last Unaffected.]

Figure 5: DE inferences without biological replication. Expression values from three genes are shown. By considering technical replicates only (first and third columns), the first two genes appear to be DE (FC > 1.5); when biological replicates are considered (second and fourth columns), it is obvious that the large FC is due to three outliers (first gene) and underestimation of the biological variance (second gene). DE calls for the last gene are valid when considering only technical replicates. +, (x), and m are defined in Figure 2.

[Figure 6 plots: panel (a), accuracy against size of list; panel (b), accuracy against FDR; one curve per design.]

Figure 6: Design accuracy. Panel (a) compares lists of fixed size; panel (b) compares lists with fixed FDR (see M3).

Figure 7: Schematic of the experiment. Each box represents one array; X's indicate arrays that were not used in construction of the pools. Here, A = control and B = treatment.

Web Supplement

[Figure WS1 plot: cluster dendrogram (Distance) of pools of 2 and the corresponding averages.]

Figure WS1: Clustering of pools of 2 and averages across individuals. Common colors are given to pools and their corresponding averages.

[Figure WS2 plot: cluster dendrogram (Distance) of pools of 2, pools of 3, and the corresponding averages.]

Figure WS2: Clustering of pools of 2, pools of 3, and averages across individuals. Common colors are given to pools and their corresponding averages.

[Figure WS3 plot: Frequency against log(variance) for Individuals, Pools of 2, Pools of 3, and Technical replicates.]

Figure WS3: Density estimates of values shown in Figure 1.

[Figure WS4 plot: Agreement against FDR for 7 on 7 and 12 on 6 designs.]

Figure WS4: Comparison of 12 on 6 vs. 7 on 7 designs (see M3 and M4).

[Figure WS5 plot: Normalized Standard Errors for each array (individuals, pools of 2, pools of 3, and technical replicates in conditions A and B).]

Figure WS5: Standard error estimates of the sample effects derived from the linear model described in M6. The standard errors are normalized so that each probeset has the same median standard error across arrays.

[Figure WS6 plot: cluster dendrogram (Distance) of pools and the corresponding averages for MAS Signal values.]

Figure WS6: Clustering of pools of 2, pools of 3, and averages across individuals for MAS Signal values (see M6). Common colors are given to pools and their corresponding averages.