Detection and Restoration of Hybridization Problems in Affymetrix GeneChip Data by Parametric Scanning


Genome Informatics 17(2): (2006)

Tomokazu Konishi
Faculty of Bioresource Sciences, Akita Prefectural University, Shimo-Shinjyo, Akita, Japan

Abstract

Gene expression microarray data often include problems caused by uneven hybridization and dust contamination. Such problems should be removed prior to analysis to prevent degradation of analytical accuracy and false positive results. This paper presents a parameter-scanning algorithm that detects such defects on the basis of the character of data distributions. The cell data are thoroughly scanned using a window algorithm, and windows with an index value greater than a threshold are recognized as defects and removed from the array data. The index is found from the differences between the target and an ideal standard of hybridization obtained as a trimmed mean among experiments, and represents the statistical center of the differences in each section. The threshold is derived from a screening level designated by the operator, but has only a limited effect on the effectiveness of data cancellation. The validity of the algorithm and the effects of data cancellation are tested using GeneChip data obtained from a series of experiments. The algorithm is demonstrated to greatly improve the reproducibility of measurements while removing only a small number of faultless data.

Keywords: microarray data analysis, removal of errors, type I error, false positive, flag, data filter

1 Introduction

Hybridization is the basis of gene expression microarray analysis and, while widely used, is not free from technical problems. For example, some hybridizations form a doughnut-like geometric pattern around the center of chip images [11]. Such patterns often result in reduced signals from certain areas of the chip, appearing similar to surface scratching that may be attributed to the entrainment of dust.
Although analytical programs that identify such problems have been proposed, these methods are destructive: when large defects are present, the data for the entire array chip are cancelled [7, 14]. The dchip package [10] implements several automated algorithms for recognizing and removing outliers during model-based data normalization [6]. The algorithms find patterns in the responses among perfect match (PM) and mismatch (MM) probes for each gene, and cells and probe sets that disagree with the resultant patterns are identified as outliers. However, this approach is based on a series of mathematical models that are derived from a very simplified view of both the biological fundamentals and the composition of the data. Furthermore, the appropriateness of the models and the calculation methods is difficult to check rigorously, as there is no objective indicator of how well the models, which inevitably contain parameters for handling noise, describe the experimental system.

One of the reasons why the recognition of hybridization flaws remains ad hoc is that such problems, even if they occupy a large proportion of the chip area, are believed to be harmless to the signal or the scaled probe value, which reflects the transcript level. In a GeneChip, a transcript is measured by approximately ten pairs of adjacent PM and MM cells, with the pairs dispersed across the chip [12]. Thus, a failure will simultaneously ruin both the PM and MM probes of the relevant pair, but will not ruin more than one probe pair for a gene. The signal is found by several calculation

algorithms based on different philosophies, and most pay attention to outliers caused by such probe failures. For example, Affymetrix MAS5 [12] finds the signal as a weighted trimmed mean among probe pairs, while RMA [3] finds the signal by a median polish of PM values.

It is desirable, however, that problems be recognized and removed from the data prior to analysis in order to prevent loss of accuracy in the signal data. Trimmed means and medians are robust only if the outliers occur in both directions (i.e. positive and negative) at the same frequency. This is rarely the case in practice, as problems often produce outliers that reflect their cause. For example, bright spots will appear if the problem is caused by fluorescent material, while dark spots will appear if the chip surface has been damaged. These types of defects will affect the results by breaking the robustness of the calculations. Such defects also have a direct effect on analyses when the target is not the gene signal but the cell data, as in the case of analyzing processing variants of mRNA. Microarray preparation problems thus present a barrier to progress in advanced analyses of GeneChip data.

This article introduces a method that detects such problems as local tendencies in the cell data when each array is compared with an ideal standard of hybridization. Cells at the identified problem locations are cancelled before data normalization. The cancellations will not affect the original distribution of the array data, since they are independent of signal intensity. Consequently, the remaining data can be used for analyses. The following section explains the algorithm that finds and removes the problems. Problems are distinguished from biological effects by means of the data distribution. The algorithm is based on several verifiable assumptions, the appropriateness of which is tested with GeneChip data in the Results section.
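The claim that trimmed means and medians lose their robustness when outliers fall on one side only can be illustrated with a small numerical sketch. The ten probe values, the 20% trimming fraction, and the three darkened probes below are hypothetical numbers chosen for illustration; they are not taken from MAS5, RMA, or the data in this paper.

```python
import statistics

def trimmed_mean(values, trim_fraction=0.2):
    """Mean after discarding trim_fraction of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    return statistics.fmean(ordered[k:len(ordered) - k])

# Ten hypothetical probe-level log signals for one gene; the true center is 0.
clean = [-0.9, -0.6, -0.4, -0.2, -0.1, 0.1, 0.2, 0.4, 0.6, 0.9]

# Symmetric contamination: one dark and one bright outlier.
symmetric = list(clean)
symmetric[0], symmetric[-1] = -5.0, 5.0

# One-sided contamination: three probes darkened, e.g. by surface damage.
one_sided = list(clean)
one_sided[7] = one_sided[8] = one_sided[9] = -5.0

for name, data in [("clean", clean), ("symmetric", symmetric), ("one-sided", one_sided)]:
    print(name, round(statistics.median(data), 3), round(trimmed_mean(data), 3))
```

In this toy case both estimators stay at the true center under symmetric contamination, but both are pulled toward negative values when all the outliers are dark.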
2 Methods

2.1 Algorithm

The proposed parametric scanning algorithm for identifying microarray problems is as follows. A standard, ideal array is selected, and indices representing the size of distinct regions in each chip are determined. Regions with indices larger than a threshold value in reference to the standard are recognized as problem areas.

The standard is found as a set of trimmed means among hybridizations. The experiments are simply normalized by dividing by the respective median values (including both PM and MM cells) and taking logarithms. The trimmed means of the data for each cell in the array are then calculated, and the resulting set of means is adopted as the ideal standard of hybridization. If the means are calculated using a sufficiently large number of arrays, the values can be considered stable and suitable for a standard. No particular distribution is expected in the ideal standard.

Differences between the simply normalized array data and the standard are then found for each cell. These differences may represent both biological responses and experimental noise. The distribution of the differences is expected to be approximately normal, since the logarithms of biological changes, when appropriately measured and normalized, obey a normal distribution [4, 5]. The differences are therefore z-normalized using robust estimators of the distribution parameters, and the distributions are checked on quantile-quantile (QQ) plots.

The indices are found by using the medians of the z-normalized differences among neighboring cells on an array. The matrix of the differences is rearranged to reflect the physical order of the chip, and data are collected via a moving window that simulates scanning through a pseudo image of the chip to find the medians. The window median is robust to biological responses, since neighboring cells on a chip do not have biological relationships.
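The steps described so far (simple normalization, ideal standard, robust z-normalization, window medians) can be sketched as follows. This is a minimal illustration on synthetic data, not the R function from the data supplement. The 20% trimming fraction, the use of the median and interquartile range as the robust estimators, the square 5 x 5 window moved one cell at a time, and all function names are assumptions made for the example.

```python
import math
import random
import statistics

def median_normalize_log(cells):
    """Simple normalization: divide by the array median, then take logs."""
    med = statistics.median(cells)
    return [math.log(v / med) for v in cells]

def trimmed_mean(values, trim_fraction=0.2):
    """Mean after dropping trim_fraction of the values at each end (assumed 20%)."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    return statistics.fmean(ordered[k:len(ordered) - k])

def ideal_standard(arrays):
    """Cell-wise trimmed mean across all normalized arrays."""
    return [trimmed_mean(column) for column in zip(*arrays)]

def robust_z(diffs):
    """z-normalize with the median and IQR (one possible robust estimator)."""
    q1, q2, q3 = statistics.quantiles(diffs, n=4)
    scale = (q3 - q1) / 1.349  # the IQR of a standard normal is about 1.349
    return [(d - q2) / scale for d in diffs]

def window_medians(z, n_rows, n_cols, size=5):
    """Median of every size x size window; z is stored in row-major chip order."""
    out = []
    for r in range(n_rows - size + 1):
        for c in range(n_cols - size + 1):
            window = [z[(r + i) * n_cols + (c + j)]
                      for i in range(size) for j in range(size)]
            out.append(statistics.median(window))
    return out

# Synthetic demonstration: six hypothetical 20 x 20 "chips" sharing one pattern.
rng = random.Random(0)
n_rows = n_cols = 20
pattern = [rng.gauss(5.0, 1.0) for _ in range(n_rows * n_cols)]
raw = [[math.exp(p + rng.gauss(0.0, 0.2)) for p in pattern] for _ in range(6)]

# Darken a 7 x 7 patch on the last chip to imitate a surface defect.
for r in range(5, 12):
    for c in range(5, 12):
        raw[-1][r * n_cols + c] *= 0.3

normalized = [median_normalize_log(chip) for chip in raw]
standard = ideal_standard(normalized)
diffs = [x - s for x, s in zip(normalized[-1], standard)]
medians = window_medians(robust_z(diffs), n_rows, n_cols)

per_row = n_cols - 5 + 1
inside = medians[6 * per_row + 6]   # window lying wholly inside the patch
outside = medians[0]                # window in the untouched top-left corner
print(f"inside patch: {inside:.2f}, outside patch: {outside:.2f}")
```

In this toy setting the window inside the darkened patch yields a strongly negative median, while the window in the clean corner stays near zero.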
In contrast, experimental problems that hide or add signals within the window will affect the window median. The window medians will obey a normal distribution in a strict sense, according to the effect described by the central limit theorem. Although this model does not expect particular distributions for problems, affected windows will produce outliers in the normal distribution of the matrix medians. The indices are found by normalizing the matrix

medians. There is a difficulty in this normalization: the width of the distribution of matrix medians is not robust to problems. Indeed, the width may increase with the number of problems, so that if the distribution is simply z-normalized, the number of recognized problems will be reduced. However, this effect can be readily avoided by deriving the width from that of the distribution of the differences among cells. In principle, a width of 0.25 was predicted in the present study for the medians of a window of 25 cells (see the simulation section of the data supplement [16]). Here, the width of the distribution of cell differences is robust with respect to problems, since large problems will produce outliers that do not affect the distribution at the central quantiles. In practice, the distributions for cells are not perfectly normal, having long tails possibly due to systematic additive noise in the data. However, the proper width can be estimated robustly from the proper quantiles. Consequently, the effect of the problems can be excluded by estimating the width of the distribution of indices according to the distribution for cells.

Systematic noise as well as hybridization problems may change the compensation value of 0.25 to somewhat larger values. In this article, a constant of 0.31 was used, obtained as the mode in actual measurements and being smaller than many other values that may have been affected by many problems (Figure 1). All indices were adjusted by dividing by this constant.

The threshold is derived from a test level decided prior to the operation, similar to screening levels in other statistical tests. The parametric nature of the data handling makes it possible to estimate how many indices will be larger (and smaller) than the threshold among half a million results. The program will ask the operator how many windows should be expected.
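The predicted width of 0.25 for medians of 25 cells can be checked independently with a short simulation. This sketch is not the simulation from the data supplement; it assumes that the cell values within a window are independent standard normal z-scores, and compares the simulated standard deviation with the large-sample approximation sqrt(pi / (2n)) for the median of n normal values.

```python
import math
import random
import statistics

def median_sd(window_cells=25, trials=5000, seed=1):
    """Standard deviation of medians of independent standard normal windows."""
    rng = random.Random(seed)
    medians = []
    for _ in range(trials):
        window = [rng.gauss(0.0, 1.0) for _ in range(window_cells)]
        medians.append(statistics.median(window))
    return statistics.stdev(medians)

simulated = median_sd()
asymptotic = math.sqrt(math.pi / (2 * 25))  # large-sample approximation
print(f"simulated: {simulated:.3f}, asymptotic: {asymptotic:.3f}")
```

Both values come out close to 0.25, consistent with the prediction in the text. Overlapping windows on a real chip are correlated with each other, but this does not change the width of the distribution of individual window medians.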
If an array is problem-free, the expected number of windows will still be recognized, owing to the chance adjacency of biological responses on the chip. In practice, the affected indices will not obey the normal distribution and will more likely take values that exceed the threshold.

2.2 Program

A program for the parametric scanning method is available in the data supplement [16] in the form of a function for R [8]. The function requires the library "affy" [1], which is available from BioC [14]. An outsourcing service is available as a part of data normalization [17].

Figure 1: A histogram of standard deviations for medians of moving windows. The mode is 0.31, larger than the expected value.

Figure 2: Coincidences between two sets of ideal standards for leaves analyzed by two different laboratories.

2.3 Data Source and Data Processing

Arabidopsis GeneChip data were obtained from TAIR [13]. Leaf data from two research groups were used in the comparison of the ideal standard of hybridization: 15 arrays for the rosette leaf used in drawing expression maps [9], and 18 arrays of day-old control plants in infection experiments by Dr. F. Ausubel's group [13]. Human data [2] were obtained from the public domain resource at RCAST, University of Tokyo [15]. PM data for the arrays were normalized according to the three-parameter method [4].

3 Results

3.1 Verification of Assumptions

Stability of Hybridization Standard

The method compares each datum with the ideal standard of hybridization, which should represent a stable pattern of the sample tissue. If the pattern is truly stable, it will coincide with that of other standards determined using different sets of data on identical tissue. To confirm this coincidence, the standards obtained using data from two research groups were compared. Both groups determined the transcriptome of leaves, one as part of an atlas of plants, and the other as a control for infection experiments. Standards were obtained as trimmed means of the median-normalized log data. The results were compared on a scatter plot with 1,000 corresponding cell data (Figure 2). The coincidence between laboratories was thus confirmed. Some other examples of inter- and intralaboratory comparisons are presented in the data supplement [16], showing similar correspondences. Such coincidences cannot be obtained by chance; for example, standards found from different tissues have different tendencies, which appear as wide scatter in the plot (data supplement [16]).
Such tendencies may show a tissue-dependence of the standard, and attention should be paid to this in practical use of the program (see Discussion).

Normality of Differences between Array Data and the Standard

The proposed method assumes that the differences between each datum and the ideal standard of hybridization will be distributed normally in a rough sense. This assumption was confirmed by means of QQ plots for the data distribution. The distributions had long tails, which may reflect the systematic additive noise of measurement. However, all of the distributions were coincident with the theoretical values from -1.5 to 1.5 (Figure 3 and data supplement [16]), indicating that more than 85% of the data obeyed the normal distribution. As problems and noise influence the distribution, hybridizations with large problems had a narrower range of coincidence, as observed in the case shown in Figure 3 (ATGE 14C).

Normality of Distribution of Indices

The method also assumes that the indices, which are derived from the medians of the moving windows, will be distributed normally when large problems are not present. This assumption was also confirmed by means of QQ plots (Figure 4 and data supplement [16]). The distributions observed were roughly normal, as expected from the central limit theorem. The standard deviation of 0.31, determined from many hybridizations (Figure 1), afforded good compensation for the width of the distribution and the slope of the plot (Figure 4, panels at the left and the center). As expected, the width of the distribution increased with the severity of the problems (Figure 4, ATGE 14C, right).
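The percentages attached to the QQ plots follow from the standard normal distribution and can be reproduced in a few lines. The sketch below computes the share of data between z-scores of -1.5 and 1.5, and the upper-tail shares beyond z-scores of 2, 3 and 4 that are quoted in the caption of Figure 3. The helper names are illustrative only.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def central_fraction(z):
    """Share of a normal population lying between -z and +z."""
    return std_normal.cdf(z) - std_normal.cdf(-z)

def upper_tail(z):
    """Share of a normal population lying above +z."""
    return 1.0 - std_normal.cdf(z)

print(f"within +/-1.5: {central_fraction(1.5):.1%}")
for z in (2, 3, 4):
    print(f"above z = {z}: {upper_tail(z):.4%}")
```

The central share is about 86.6%, which matches the statement that more than 85% of the data obeyed the normal distribution.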

Figure 3: Distribution of differences between hybridizations and standards. Straight line at y = x denotes the normal distribution. Data are denser at the center of the plots. Only 2.3%, 0.1%, and 0.003% of data have z-scores of 2, 3 and 4.

Figure 4: Distribution of index values.

3.2 Confirmation of Method

Improvement of Reproducibility in Repeated Experiments

If parametric scanning effectively eliminates problems from the data, it will reduce the fluctuations found in duplicate experiments. This effect was checked using sets of repeated measurements on Arabidopsis leaves [13]. Before and after cancellation, PM data were normalized using the SuperNORM algorithm [17], which is based on a three-parameter method [4]. The resultant z-scores were compared on scatter plots (Figure 5), from which it is clear that the proposed method eliminated the diffusions found in the plots (Figure 5, left) and achieved the expected reproducibility (center).

As in other statistical tests, some clean and faultless data were also cancelled by parametric scanning. In a sense, this is a cost required to find something by means of statistical tests. However, in this algorithm, the number of cancelled clean data was not large. The nature of the cancelled data was checked from the reproducibility of experiments (Figure 5, right). The number of data on the plots increased as the quality of hybridization decreased. The cancelled data did not concentrate narrowly on the y = x line, but were instead dispersed (Figure 5, right). Coincidence was observed only when many cell data were cancelled (Figure 5, lower right), and the data concentrated on the y = x line were only a limited part of the cancelled data.

Some of the fluctuations found in the examples shown in Figure 5 were critically large. Such examples were not exceptions among the many examinations. Figure 6 compares the numbers of

cancelled data under different expectations. It is obvious that the extreme examples have not been taken from outliers. Other examples are available in the data supplement [16].

The improvement of reproducibility was further checked from the reductions in the standard deviations of the differences in z-scores between the corresponding PM cells of paired hybridizations. To minimize the effect of additive noise and saturation of measurements, standard deviations were calculated using normalized values (0 to 1). The effect was checked on a scatter plot (Figure 7), which clearly shows that parametric scanning reduces the standard deviation of the differences among the obtained z-scores.

Figure 5: Reproducibility in repeated experiments. A combination of experiments is shown in each row. Left: original data. Center: remaining data. Right: cancelled data. PM data (n = 10,000) randomly selected from the indicated pairs of arrays are shown. The expectation value for the cancellation was 2 windows.
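The reproducibility measure used here, the standard deviation of the differences in z-scores between paired hybridizations, can be sketched as below. The data are synthetic and the defect (a downward shift applied to 10% of the cells in one replicate) is hypothetical. The sketch does not reproduce the SuperNORM normalization or the restriction to normalized values between 0 and 1 mentioned in the text.

```python
import random
import statistics

def sd_of_differences(z_a, z_b, keep=None):
    """SD of cell-wise differences; keep is an optional mask of retained cells."""
    diffs = [a - b for i, (a, b) in enumerate(zip(z_a, z_b))
             if keep is None or keep[i]]
    return statistics.stdev(diffs)

# Two hypothetical replicate hybridizations of the same sample.
rng = random.Random(3)
truth = [rng.gauss(0.0, 1.0) for _ in range(2000)]
rep_a = [t + rng.gauss(0.0, 0.1) for t in truth]
rep_b = [t + rng.gauss(0.0, 0.1) for t in truth]

# Imitate a defect: the first 200 cells of replicate B lose signal.
for i in range(200):
    rep_b[i] -= 1.5

# Mask standing in for the output of the scan (cells inside flagged windows).
keep = [i >= 200 for i in range(2000)]

sd_all = sd_of_differences(rep_a, rep_b)
sd_kept = sd_of_differences(rep_a, rep_b, keep)
print(f"before cancellation: {sd_all:.3f}, after cancellation: {sd_kept:.3f}")
```

In this toy case the standard deviation after cancellation falls back to the level set by measurement noise alone, which is the kind of reduction summarized in Figure 7.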

Figure 6: Numbers of cancelled cells at expectations of 2 and 20 windows (50 and 500 cells, respectively). Data sources: rectangles [9], circles [13] and triangles [2].

Figure 7: Standard deviations of differences among cell data in reproducibility measurements.

Comparison with Other Algorithms

Rather than using new experimental data, the method was evaluated against the same sets of arrays treated with other automated methods in the dchip package [10]. All the spikes and outliers recognized by dchip were cancelled using the PM-only model, and the data were normalized in an identical manner. As shown in Figure 8, dchip gave lower reproducibility (Figure 8, left), showing weaker detection power. This does not necessarily mean that dchip preserves faultless data; it cancelled the complete set of cells for certain genes ( % of the total), while no gene was totally cancelled by the parametric scan (see data supplement [16]). For such genes, no information will be retained for analysis.

Sensitivity of Threshold Parameter

The number of data actually cancelled in each hybridization was not clearly dependent on the threshold parameter, which is a test level decided by the operator. The number of cancelled data was much larger than the expectation estimated from the threshold parameter (Figure 6), reaching as high as a quarter of the total number of cells (tens of thousands), even when the expectation was 50 cells (2 windows). However, the number of cancelled cells did not increase tenfold when the expectation was increased from 2 to 20 windows. The relationship between the expectation and the actual number of cancellations became poorer as the number of cancelled data increased. Processing of data obtained from three different laboratories suggested a stable relationship between cancelled windows at the two expectations (Figure 7).
It should be noted that the expected numbers, which appear at (1.7, 2.7) in the plot, roughly satisfy the extrapolated relationship (Figure 7). The number of cancelled data may depend on the quality of hybridization, as the number of cancellations was observed to be higher when major problems were found (Figure 5).

The cancelled windows often formed clusters in the chip, suggesting a single cause within the cluster (Figure 9). Such clusters were found regardless of the value of the expectation parameter. The frequencies and area of cancellation differed among data from the different laboratories (Figure 6). The values for data measured in one particular laboratory (triangles in the figure) were clearly larger than those from the other laboratories.

Many of the clusters may represent polishing of the chip surface or uneven hybridization. It is likely that the differences in the frequencies of problems are due to differences in protocols and skills in wet experiments, which will differ according to the laboratory and the time of preparation. These problems were highlighted by high index values, producing many cancelled windows in the case of severely defective cells even when the expectation was rather small. The results above are considered evidence of the insensitivity of parametric scanning to the value of the expectation parameter; that is, the proposed method appears to have good fidelity with respect to problem detection. Such insensitivity implies objectivity in the algorithm, since the threshold is the only parameter subject to operator selection.

Figure 8: Reproducibility in data treated by the dchip package. Results using the PM-only models are shown. The corresponding original data are presented in Figure 5 (left). Left: remaining data. Right: cancelled data. PM data (n = 10,000) randomly selected from the indicated pairs of arrays are shown. Results using PM-MM models are presented in the data supplement [16].

4 Discussion

On the basis of the observations above, the proposed method is recommended for practical use on all GeneChip expression data prior to normalization. The assumptions in the approach were validated through analysis of data distributions, and the only arbitrary parameter was shown to have a limited effect on the results. Furthermore, through tests in many additional experiments (not shown), the parametric scanning method has been found to be very effective in eliminating hybridization problems. The appropriateness of the method can be checked in every analysis, with the data required for the checking process supplied by the software (data supplement [16]).

The numbers of cancelled data are always larger than the expectation, suggesting that most hybridizations have problems of some sort. The problems detected had patterns indicative of surface polishing, uneven hybridization, and errors in the fabricated cell structure. Symmetric patterns of clusters surrounding the center of the chip (Figure 9, lower right) can be identified as polishing artifacts [11]. In such a case, the signals in the affected area are always distinctively lower, and their detection is thus insensitive to the expectation value. Cases with an advanced degree of surface polishing will form the common doughnut-like cluster pattern. In contrast, clusters of indefinite shape are more likely indicative of uneven hybridization. Within the cluster, data tend to increase or decrease, producing diffusion in the scatter plots of experimental reproducibility (Figure 5). Such unevenness can be derived from several sources, and some of the distinctive regions are insensitive to the expectation value while some are not (Figure 9, ATGE_14_C). The differences in sensitivity correspond to the differences in the magnitude of the defect. Defects detected as smaller clusters or isolated windows may have been formed by dust. Again, some of these features are distinct while others are not.

Errors in the chip structure can be identified as repeated clusters in the same parts of multiple chips, forming regular shapes often surrounded by straight lines. Many such defects are not problems but control cells designed and placed on the

chip, although some may be caused by problems, appearing in all chips with similar batch numbers (i.e. the same manufacturing lot). Such problems might be caused by product errors that have not been detected in quality controls, and can result in serious problems. In the case shown in Figure 5, the huge upward diffusion is attributed to this sort of failure (Figure 9, lower left).

The proposed method will reduce false positives in microarray data analyses. Such errors are not unique to microarray analyses, but the multiplicity of tests conducted using microarrays increases the seriousness of the errors. Multiplicity arises from the comprehensiveness of the microarray and other post-genomic analyses, which generate distinctively different targets of analysis compared with conventional methods, which measure only a limited number of gene products. In such hyper-multiple comparisons, a large number of false positives will hinder analysis, producing both intra- and interlaboratory contradictions in the observations. For example, permitting type I error at a probability of 1%, half a million double-sided tests will produce 10,000 errors. Ignoring hybridization problems will greatly increase this expectation (Figure 5). Additionally, such problems will affect data normalization and the summarized data for genes. Consequently, hybridization problems should be detected and eliminated before normalization.

The proposed method will rescue clean data from the failure-free regions of a hybridization, and the data remaining after cancellation can be normalized and used for further analysis.

Figure 9: Positions of cancelled windows in a chip. Four typical examples at the indicated expectations are shown. Upper left: hybridization with relatively small numbers of cancellations. Upper right: uneven hybridization. Lower left: regular shapes with straight boundaries. Lower right: clusters at symmetric positions.
The resultant data set showed fair coincidence with the corresponding pairs in reproducibility experiments (Figure 5, center). The total cost of experiments will be reduced in comparison to an ad hoc approach to cancellation of genes in arrays and/or entire arrays. The R program in the data supplement [16] will be affected by the tissue effect [10] in discovery of

the ideal standard of hybridization. That is, the standards will differ according to the differentiation of cells in the sample. Such an effect will occur when a small number of arrays is treated together with a large number of arrays from a different tissue. Additionally, treating data from fewer than four arrays is not encouraged, since the standard cannot be considered stable. The stability of the standard can be checked using the approach shown in Figure 2, and the tissue effect can be noticed as a marked increase in cancellations without the clusters of cancelled windows found in Figure 9. Such problems can be avoided by finding the standard separately from the recognition process. In practice, two alternative ways can be employed to obtain the ideal standard: using randomly selected samples among various tissues from many arrays, or finding tissue-specific standards and using these for the corresponding arrays.

References

[1] Gautier, L., Cope, L., Bolstad, B. M., and Irizarry, R. A., affy analysis of Affymetrix GeneChip data at the probe level, Bioinformatics, 20: ,
[2] Ge, X., Yamamoto, S., Tsutsumi, S., Midorikawa, Y., Ihara, S., Wang, S., and Aburatani, H., Interpreting expression profiles of cancers by genome-wide survey of breadth of expression in normal tissues, Genomics, 86: ,
[3] Irizarry, R. A., Bolstad, B. M., Collin, F., Cope, L. M., Hobbs, B., and Speed, T. P., Summaries of Affymetrix GeneChip probe level data, Nucleic Acids Res., 31:e15,
[4] Konishi, T., Three-parameter lognormal distribution ubiquitously found in cDNA microarray data and its application to parametric data treatment, BMC Bioinformatics, 5:5,
[5] Konishi, T., A thermodynamic model of transcriptome formation, Nucleic Acids Res., 33: ,
[6] Li, C. and Wong, W., Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection, Proc. Natl. Acad. Sci. USA, 98:31-36,
[7] Psarros, M., Heber, S., Sick, M., Thoppae, G., Harshman, K., and Sick, B., RACE: Remote Analysis Computation for gene Expression data, Nucleic Acids Res., 33:W638-W643,
[8] R Development Core Team, R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria,
[9] Schmid, M., Davison, T. S., Henz, S. R., Pape, U. J., Demar, M., Vingron, M., Scholkopf, B., Weigel, D., and Lohmann, J., A gene expression map of Arabidopsis development, Nat. Genet., 37: ,
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
