Statistical power in mass-spectrometry proteomic studies


Thomas Jouve 1,2,3, Delphine Maucort-Boulch 1,2,3, Patrick Ducoroy 4, Pascal Roy 1,2,3

Abstract

Background: Time-of-flight mass-spectrometry (MS) (MALDI or SELDI) is a promising tool for the identification of new biomarkers, or differentially expressed proteins, in various clinical situations. The ability to recognize a biomarker as such, or statistical power, cannot be assessed through classical experiments because the true protein content is unknown.

Methods: We provide a simulation study to investigate the statistical power of MS experiments under FDR control. Virtual mass spectra are created from virtual individuals belonging to one of two groups. We examine the effects of i) the sample size n, ii) the group repartition p_s, and iii) the differential expression level, three critical parameters in any clinical setting. The power study is conducted before and after inclusion of instrumental variability. We also evaluate an alternative power measure that can prove useful in MS experiments, called relaxed power.

Results: We show that small sample sizes of about n = 100 spectra offer a power equivalent to that seen in group-label permuted data, about 5-10%. Increasing n, p_s, or the differential effect allows a much higher power to be reached. The power loss incurred through FDR control is high when instrumental variability is high (as in MS data), whereas this power loss is negligible on data free of instrumental variability.

Conclusions: The high instrumental variability encountered in MS, together with FDR control, builds a detrimental synergy leading to low statistical power at the sample sizes usual in MS studies. This detrimental synergy is an issue specific to MS studies and should be compensated for by increasing sample sizes in MS-based biomarker discovery experiments, but also by lowering MS instrumental variability.
Background

As a high-throughput method, mass-spectrometry (MS) allows the exploration of a wide feature space, with simultaneous measurements of hundreds or thousands of peptides or proteins in a sample. It is thus used to screen for new biomarkers of human diseases in easily sampled biological fluids (blood, urine, etc.). Numerous studies aimed at detecting new cancer biomarkers [1], and some potential biomarkers have been brought to light, though they still require proof of disease specificity [2]. The ability of a technique to detect a biomarker is closely linked to the concept of statistical power. Sample size calculation is the most common aspect of that search for power. Power studies already performed in the field of transcriptomics [3, 4] have shown the need for improved experimental designs in terms of sample sizes, number of replicates, etc. The added value of new transcriptomic biomarkers versus classical laboratory biomarkers has started to be investigated as well [5]. In proteomics, however, especially in mass-spectrometry, power considerations are still poorly investigated, although a recent work [6] provided a method for sample size calculation specifically applied to MS experiments.

Corresponding author: thomas.jouve@chu-lyon.fr
1 Hospices Civils de Lyon, Service de Biostatistique, Lyon, France
2 Université Claude Bernard Lyon 1, Université de Lyon, Villeurbanne, France
3 Laboratoire Biostatistique Santé, UMR CNRS 5558, Pierre-Bénite, France
4 Clinical and Innovation Proteomic Platform, CHU Dijon, France

Indeed, the concept of statistical power applied to mass-spectrometry data raises two important questions: i) instrumental variability must be considered specifically and ii) the number of tests to perform is large. Classical power studies do not answer these two questions, because they mainly target experiments with one single biomarker that consider variability as a whole, without dissecting it. Even in a simple situation with one single biomarker, experimental MS data integrate different sources of variability. Biological variability reflects differences in protein patterns or expression levels between individuals or groups of individuals. This variability defines biomarkers as molecules showing different expression patterns between groups of individuals (e.g. diseased and undiseased). Instrumental variability reflects variability in the physical process and differences in signal pre-processing, i.e., the transformation of raw spectra into series of values. Both variabilities may obscure the discovery of true biomarkers. In mass spectrometry, awareness of variability led to reproducibility studies [7, 8]; these showed relatively low levels of instrumental variability, i.e., a good reproducibility, but this does not guarantee statistical power. Besides variability, the large amount of raw data in proteomics requires statistical strategies adapted to multiple testing. Transcriptomic studies already offered some insights into statistical power in the presence of biological variability and multiple testing [3]. The results of these studies cannot be applied to proteomics because they did not take into account instrumental variability. A study of power in the field of proteomics is therefore needed.
Given the scarcity of data sets providing a controlled measure of their content, one step towards investigating power may rely on simulations that involve the three determinants of statistical power in proteomics: biological variability, instrumental variability, and multiple testing. In the present article, we used simulations to investigate the statistical power of studies using MALDI-TOF or SELDI-TOF MS to identify new biomarkers. We generated spectra mimicking MS experiment results and considered a single arbitrary pre-processing scheme. We studied the effects of different parameters on statistical power while controlling for the number of false discoveries.

Methods

The methods used in this article take into account the different issues of MS data analysis: biological and instrumental variabilities, described in the Simulation part, as well as multiple-testing specificities, described in the Analysis part.

Simulation

To reflect the two sources of variability, two simulation steps were considered: generation of virtual samples and transformation into virtual mass spectra. Virtual samples build the concentration data, while virtual mass spectra build the intensity data. A sample from a given subject (no duplicates) was a list of m concentration measurements of different proteins or peptides. Two groups of subjects were considered: the risk group and the reference group. Each experiment for marker identification involved a group of n samples, thus n × m concentration measurements. Power measurements are based on collections of experiments. Table 1 (top) summarizes the parameters that should be specified to fully describe the virtual biological samples, i.e., the parameters of biological variability. These were extracted from a set of 192 real spectra from plasmas of 32 subjects with Hodgkin lymphoma (each with 6 replicates) obtained using a Bruker TOF-SIM MALDI spectrometer in the protein/peptide mass range 1 to 10 kDa (Source: the Clinical

and Innovation Proteomic Platform (Clipp)).

Table 1: Elements of a virtual experiment: models for biological variability and instrumental variability.

Biological variability
  DEP                  Gaussian     m/z: arbitrarily defined
                                    Mean: N(µ_M, σ_M) + δ_i
                                    Standard deviation: N(µ_S, σ_S)            -> φ(t)
  NDEP                 Gaussian     log(m/z): N(µ_m/z, σ_m/z)
                                    Mean: N(µ_M, σ_M)
                                    Standard deviation: N(µ_S, σ_S)            -> φ(t)

Instrumental variability
  Baseline             Gamma        Shape: N(µ_shape, σ_shape)
                                    Scale: N(µ_scale, σ_scale)
                                    Maximum: N(µ_max, σ_max)
                                    Maximal position: N(µ_argmax, σ_argmax)    -> b(t)
  Random noise         Gaussian     Standard deviation: N(0, σ_rn)             -> r(t)
  Total concentration  Log-normal   Standard deviation: logN(0, θ)             -> a

DEP = Differentially Expressed Proteins, NDEP = Non-DEP; δ_i describes the group of subject i. Note that the link between proteins and their spectral component φ is made by the virtual mass-spectrometer.

The strategy used to generate protein concentrations is based on three layers. The two fundamental parameters are the mean and the standard deviation of all possible protein or peptide concentrations, building the first layer. Because of the choice of Gaussian distributions for these concentrations, 6 parameters were eventually required. Parameters µ_M and µ_S represent respectively the mean values for the means and the standard deviations of protein or peptide concentrations. Similarly, σ_M and σ_S represent respectively the standard-deviation parameters for the mean and standard deviation of protein or peptide concentrations. Finally, p_s represents the splitting of subjects into a risk and a reference group, and Δ_j the difference in mean concentration of protein or peptide j between the risk and the reference group; thus, Δ_j = 0 for Non-Differentially-Expressed Proteins (NDEP) and Δ_j ≠ 0 for Differentially Expressed Proteins (DEP). In case of m measurements, there are m_0 NDEP and m_1 DEP, with m_0 + m_1 = m.
For each collection of experiments, we considered m_0 = 60 and m_1 = 8. For each protein or peptide j of the m proteins and peptides, M_j is the mean concentration and S_j the intra-group standard deviation (Equation (0.1)), where N denotes the Gaussian distribution. This equation represents protein or peptide concentration distributions in the whole population, used throughout a whole experiment.

  M_j ~ (1 - p_s) N(µ_M, σ_M) + p_s N(µ_M + Δ_j, σ_M)
  S_j ~ N(µ_S, σ_S)                                          (0.1)

Parameters M_j and S_j build the second layer of the simulation strategy. Individual concentrations

are drawn from Gaussian distributions using parameters M_j and S_j. They build the last layer of the simulation strategy. Each protein or peptide was given a mass label m/z as its identifier, under the assumption of a single ionization, i.e. z = 1. For all collections of experiments, mass labels for DEP were arbitrarily defined within the 1-10 kDa range. Mass labels for NDEP are defined once for each experiment. They derive from a Gaussian distribution of the logarithm of mass grounded on two fundamental parameters, as in Equation (0.2).

  log(m/z) ~ N(µ_m/z, σ_m/z)        (0.2)

A spectrum was derived from a given biological sample using the virtual mass-spectrometer described by Morris et al. [10], run with the R software [9]. With an input of a list of protein or peptide concentrations and their associated masses, it outputs a spectrum φ(t) free of the two classical MS instrument noises. Let t be the time-of-flight of a protein or peptide and φ(t) the signal of interest generated by the virtual mass-spectrometer, to which a baseline noise b(t) and a random noise r(t) are added to create a realistic spectrum. With a multiplicative coefficient a accounting for the variability in sample deposit and ionization, a spectrum can be written as in Equation (0.3).

  I(t) = a (φ(t) + b(t) + r(t))        (0.3)

Table 1 (bottom) summarizes the parameters for the instrumental variability. Once for each spectrum, coefficient a is drawn from a log-normal distribution with standard deviation parameter θ (Table 1). Baseline b(t) is generated as a rescaled gamma distribution whose parameters µ_shape, σ_shape, µ_scale, σ_scale are drawn from Gaussian distributions. Parameters µ_max, σ_max, µ_argmax, σ_argmax are used to rescale the gamma density according to a classical baseline, both on the m/z axis (argmax parameters) and on the intensity axis (max parameters). The baseline is obtained once for each spectrum by sampling from distributions using the eight above-cited parameters.
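The two simulation steps above (Equations (0.1) and (0.3)) can be sketched as follows. This is a minimal illustration, not the authors' code: all numeric parameter values are hypothetical, and a simple sum of Gaussian peaks stands in for the output φ(t) of the Morris et al. virtual mass-spectrometer.

```python
import numpy as np
from math import gamma as gamma_fn

rng = np.random.default_rng(0)

# --- Biological variability, Equation (0.1); all numeric values are illustrative ---
mu_M, sigma_M = 10.0, 3.0     # hyper-parameters of the protein mean concentrations
mu_S, sigma_S = 1.0, 0.2      # hyper-parameters of the intra-group standard deviations
p_s = 0.5                     # proportion of subjects in the risk group
m0, m1, n = 60, 8, 100        # NDEP count, DEP count, subjects per experiment

m = m0 + m1
delta = np.zeros(m)
delta[:m1] = 0.3 * mu_S       # differential effect for the m1 DEP (Delta_j/S_j around 0.3)

M = rng.normal(mu_M, sigma_M, size=m)            # second layer: per-protein means M_j
S = np.abs(rng.normal(mu_S, sigma_S, size=m))    # second layer: per-protein sds S_j
group = rng.random(n) < p_s                      # True = risk group
conc = rng.normal(M + np.outer(group, delta), S)  # third layer: n x m concentrations

# --- Instrumental variability, Equation (0.3) ---
t = np.linspace(0.0, 1.0, 2000)                  # normalized time-of-flight axis

# Crude stand-in for phi(t): one Gaussian peak per protein, with heights taken
# from the concentrations of the first subject (the real phi(t) comes from the
# virtual mass-spectrometer of Morris et al. [10]).
centers = rng.uniform(0.05, 0.95, size=m)
phi = np.zeros_like(t)
for c, h in zip(centers, conc[0]):
    phi += max(h, 0.0) * np.exp(-((t - c) / 0.004) ** 2)

shape, scale, b_max = 2.0, 0.15, 5.0             # illustrative baseline parameters
b = t ** (shape - 1) * np.exp(-t / scale) / (gamma_fn(shape) * scale ** shape)
b = b / b.max() * b_max                          # rescale the gamma density to its maximum

r = rng.normal(0.0, 0.5, size=t.size)            # random noise r(t) with sd sigma_rn
a = rng.lognormal(0.0, 0.1)                      # deposit/ionization coefficient a

I = a * (phi + b + r)                            # one virtual spectrum, Equation (0.3)
print(I.shape)
```

The three layers appear in order: hyper-parameters, per-protein (M_j, S_j), then per-subject concentrations; the instrumental part then assembles one spectrum from the signal, baseline, and noise components of Table 1.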
Noise r(t) is generated through a Gaussian distribution with standard deviation parameter σ_rn. Simulation parameters related to instrumental variability were extracted from a set of real spectra, as were the above-described parameters of biological variability, from a Bruker TOF-SIM MALDI spectrometer. Noise intensity was compared to the mean signal intensity to calibrate r(t). Baseline properties (maximum intensity and position on the m/z axis) were extracted from the Bruker spectra to calibrate b(t). Besides, physical parameters such as the drift tube length, the voltages, and the ion focus delay (not described here but required for virtual MS) were set as for a Bruker TOF-SIM MALDI spectrometer as well. Nine collections of experiments of 400,000 spectra each were simulated (400 experiments with 1,000 spectra). Each collection of experiments bears two characteristics: p_s and Δ_j/S_j. Sub-experiments of sizes n = 100 and n = 500 were drawn from experiments of size n = 1000 to eventually study the effect of n, p_s, and Δ_j/S_j. These nine collections of experiments were simulated with the R software on an AMD Athlon X computer with 2 GB RAM. On average, each of the nine collections of experiments required 20 hours and 30 GB.

Analysis

Preprocessing

The pre-processing strategy used here is described in Figure 1. Baseline subtraction was carried out by smoothing of successive local minima (the PROcess package [10]). An undecimated wavelet transform (UDWT) was used for random noise r(t) reduction in simulated spectra [11] (Rice Wavelet Toolbox (rwt)

R-package).

Figure 1: Denoising strategy for virtual experiments: from raw spectra to a matrix of peak intensities (baseline subtraction, noise reduction, normalization, then peak picking on the mean spectrum).

The UDWT threshold was 3.6 times the mean absolute deviation (MAD) of the signal. All values lower than this threshold were set to zero (hard thresholding). As recommended by Morris et al. [12], peak localization was performed using the pre-processed mean spectrum, i.e., the mean of all raw spectra. Noise filtering of the mean spectrum used a high filtration threshold. All local maxima found in the pre-processed mean spectrum were initially considered as peaks. However, because many of these peaks had very low intensities (close to random-noise intensity fluctuations), the initially detected peaks were filtered using a threshold on the signal-to-noise ratio. Noise was defined as the part of the signal filtered out by the rwt algorithm. The local maxima and their neighbouring minima on both sides were kept to define the widths of the peaks. Because the exact position of a given peak in low-resolution spectra (like MALDI-TOF spectra) cannot be precisely defined, a 0.3% tolerance around each m/z value was used to consider a found and a simulated peak as a single entity, i.e., representative of a single protein or peptide. A count of detected peaks was collected from all experiments to obtain a measure of the Peak Detection Ability (PDA). As usually done in SELDI/MALDI pre-treatment, normalization of peak intensities was done by dividing all intensities of a spectrum by the Total Ion Current (TIC), as estimated by the area under the spectrum.
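The hard-thresholding and TIC-normalization steps can be sketched as follows. As a simplification, the threshold is applied to the signal itself rather than to the UDWT wavelet coefficients, and the data are synthetic; this illustrates the rule, not the authors' pipeline.

```python
import numpy as np

def hard_threshold(signal, k=3.6):
    # Zero out everything below k times the mean absolute deviation of the
    # signal; the paper applies this rule to UDWT wavelet coefficients.
    mad = np.mean(np.abs(signal - np.mean(signal)))
    out = signal.copy()
    out[np.abs(out) < k * mad] = 0.0
    return out

def tic_normalize(spectrum, dt=1.0):
    # Divide all intensities by the Total Ion Current, estimated here by a
    # simple Riemann sum of the area under the spectrum.
    return spectrum / (spectrum.sum() * dt)

rng = np.random.default_rng(1)
noisy = rng.normal(0.0, 1.0, size=500)
noisy[100] = 40.0                        # one strong synthetic "peak"

denoised = hard_threshold(noisy)         # noise mostly zeroed, peak kept
normalized = tic_normalize(denoised)
```

After thresholding, only the peak and a handful of extreme noise values survive, which is the intended behaviour before peak picking on the mean spectrum.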
After normalization, peak intensities were estimated through the Area Under the Peak (AUP). An idea of the variability parameters of the simulations was given by comparing the standard deviation of intensities within a given spectrum with the maximum intensity of that spectrum, in both real (used for calibration) and simulated spectra. This stdev/max ratio allows comparing variability between different spectra with arbitrary units. Another comparison of variability was made between extracted intensities on a local scale. It used the coefficient of variation (CV), defined as the standard deviation of intensities for a given peak over an experiment divided by the mean intensity for this peak. The same pre-processing was applied to real and simulated spectra, then the CVs were compared.

Statistical analysis

Univariate logistic models were used in this study to predict the probability of belonging to the risk group given the measurement of each protein or peptide. The Wald test applied to the regression coefficient of each model was used to test for different protein or peptide expression levels between groups. The q-value was used to control the number of false positive conclusions [13] (the positive False Discovery Rate). In multiple testing, the q-value is the equivalent of the classical p-value. It expresses the collective

type-1 error as the highest q-value associated with a protein or peptide for which H0 is rejected. This FDR-controlled approach was called the FDR approach. In the setting of multiple testing, power estimation should be carefully considered. Here, two measures of power were used: individual power and relaxed power. Individual power can be written 1 − β_1 (β_1 being the individual type-2 error). In our setting, considering identical and independent power for all biomarkers, it is easy to show that 1 − β_1 = E(S)/m_1 (where S is the number of DEP for which H0 is rejected). Here, E(S)/m_1 is referred to as the average power and has the same value as the usual individual power; it reflects the average proportion of true discoveries among all DEP and quantifies the ability of the study to detect a proportion of the biomarkers. In high-throughput studies, it can be useful to estimate the probability of detecting all biomarkers. Lee et al. [3] proposed to use the type-2 error β_F (as a family definition of power). Under the hypothesis of independence between tests, 1 − β_F = (1 − β_1)^m_1. Because this seemed too strong a constraint for calibrating future studies, power was also considered as P(S > k), the probability of detecting more than k biomarkers in a single experiment. Relaxed power is the special case where k = 0, i.e. P(S > 0), the probability of detecting at least one biomarker. It may be written 1 − β_RP = 1 − β_1^m_1. The precision of power values is provided by the 95% confidence interval for a binomial distribution B(400, P), where P is the estimated power over the 400 experiments of each collection of experiments. For each covariate, the number of rejections of H0 (using a threshold for q-values) was calculated over all experiments from a single collection of experiments. This allowed an easy derivation of statistical power.
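The power measures above, and the rejection counting they rest on, can be sketched as follows. The sketch is illustrative only: it replaces the paper's per-peak logistic regressions with Wald tests by two-sample z-tests, and Storey's q-value [13] by the simpler Benjamini-Hochberg step-up rule; the synthetic intensity matrix contains no true signal, so rejections occur by chance alone, and the β_1 value fed to the power formula is hypothetical.

```python
import numpy as np
from math import comb, erf

def power_at_least(beta1, m1, k=0):
    # P(S > k) when each of the m1 DEP is detected independently with
    # probability 1 - beta1; k = 0 gives the relaxed power 1 - beta1**m1.
    p = 1.0 - beta1
    return 1.0 - sum(comb(m1, s) * p**s * (1.0 - p)**(m1 - s) for s in range(k + 1))

def pvalues(X, y):
    # Per-peak two-sample z-test (normal approximation); a stand-in for the
    # paper's univariate logistic regressions with Wald tests.
    a, b = X[y], X[~y]
    se = np.sqrt(a.var(ddof=1, axis=0) / len(a) + b.var(ddof=1, axis=0) / len(b))
    z = (a.mean(axis=0) - b.mean(axis=0)) / se
    return np.array([2.0 * (1.0 - 0.5 * (1.0 + erf(abs(v) / 2**0.5))) for v in z])

def bh_reject_count(p, q=0.05):
    # Benjamini-Hochberg step-up rule, a simpler analogue of the q-value.
    p_sorted = np.sort(p)
    below = p_sorted <= q * np.arange(1, p.size + 1) / p.size
    return 0 if not below.any() else int(np.nonzero(below)[0].max() + 1)

rng = np.random.default_rng(2)
n, m = 100, 68                       # subjects and peaks, as in the simulations
X = rng.normal(size=(n, m))          # intensity matrix with no true signal
y = rng.random(n) < 0.5              # group labels (True = risk group)

p = pvalues(X, y)
naive_hits = int((p < 0.05).sum())   # naive approach: about 5% of m by chance
fdr_hits = bh_reject_count(p)        # FDR approach: usually 0 under a global H0

relaxed = power_at_least(0.32, 8)    # e.g. individual power 68% and m1 = 8 DEP
print(naive_hits, fdr_hits, round(relaxed, 4))
```

With an individual power of 68% and m_1 = 8, the relaxed power 1 − 0.32^8 is very close to 1, which is why relaxed power can look reassuring even when individual power is modest.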
A count of H0 rejections without FDR control (using a naive 0.05 threshold on p-values) was kept in order to estimate power before multiple-testing error control. This was called the naive approach; it is virtual in that it ignores the number of tests performed. The experimental design developed here allows studying the impact of n, p_s, and Δ_j/S_j on these different powers. Permutations of group labels (diseased vs. undiseased) over the different spectra were also performed in order to create a generalized H0. In this setting, no protein or peptide is truly differentially expressed. The apparent power obtained with permuted data gives an idea of the power that may be expected by chance. Comparing the powers estimated with true vs. permuted data helps recognizing truly informative experiments.

Results

Table 2 (top) shows individual and relaxed powers with the concentration data (as opposed to the intensity data). In these data, we considered biological variability and ignored instrumental variability. Good individual power values (≥ 80%) were reached with large sample sizes; with fewer samples, individual power dropped rapidly. The distribution of the subjects between the risk and the reference group affected power substantially: there was a 10 to 50% individual power loss when p_s decreased from 0.5 to 0.15. Finally, the value of Δ_j/S_j had a dramatic effect. This is best seen for n = 500 and p_s = 0.15, where the individual power drops by 76% from Δ_j/S_j = 0.75 to Δ_j/S_j = 0.3. Altogether, it should be noticed that power increases are not uniform but change with each n, p_s, and Δ_j/S_j combination. With these concentration data, the relaxed power reached 100% in many instances and the pattern of change was similar to that of the individual power, i.e., important decreases of power with decreasing sample sizes and a non-negligible effect of subject distribution. The impact of Δ_j/S_j had nearly the same amplitude as that of the sample size.
However, by definition, relaxed power is greater than individual power. Experiments with n = 500, p_s = 0.5, and differential effect Δ_j/S_j = 0.3 would most probably lead to a biomarker detection (relaxed power of 99%), whereas individual power remains at a disappointing 68%. Power was also assessed on the concentration data with the naive approach (data not shown). The general

Table 2: Individual and relaxed power evaluated on the concentration data (top) and on the intensity data (bottom), FDR controlled at 5%; n is the total number of subjects, Δ_j/S_j is the differential effect, p_s is the proportion of cases.

pattern was a power increase with increasing values of sample size n, group distribution p_s, and differential effect Δ_j/S_j. In the most difficult setting (n = 100, p_s = 0.15, and Δ_j/S_j = 0.3), individual power was 14% but reached 68% with n = 500, whereas a balanced design (p_s = 0.5) led to a 29% power. Here again, an increase in sample size had a greater effect on power than an increase in p_s, but using an adequate p_s allowed power to be doubled. A differential effect Δ_j/S_j of 0.5 achieved a 41% power with n = 100 and p_s = 0.15, and even a 67% power with p_s = 0.5. In all other cases, power was 80% or higher. The stdev/max ratio had a median of 0.062 with real spectra and a median of 0.099 with simulated spectra. The standard deviation for real spectra, relative to the maximum intensity of these spectra, is therefore slightly smaller than for simulated spectra. The coefficient of variation of the peak intensity values had a median of 5.918 with real spectra but only 0.390 with simulated spectra. Summarizing, the variability of full spectra appears somewhat higher in simulated spectra for raw intensities, but detected peaks in simulated spectra have a smaller variability than in real spectra. The PDA reached 97% with 100 samples, but reaching 99% required 1000 samples. A comparison of the PDA with the total number of simulated proteins or peptides (about 70) shows that all peaks are most often detected. Varying p_s did not affect PDA: with n = 500, PDA = 98% with p_s = 0.15 or p_s = 0.5. This confirms the ability of the mean spectrum to detect almost all peaks [12]. Table 2 (bottom) displays the progression of power in the intensity data according to sample size, group distribution, and differential effect. This data set involves both biological and instrumental variability.
Whereas a 100% individual power was possible with the concentration data, a maximum of 78% could be reached using the intensity data with n = 1000 samples, p_s = 0.5, and the largest differential effect. The general pattern of change is the same as with the concentration data. Individual power decreased with decreasing n, p_s, and Δ_j/S_j values, down to 5% in the worst scenario (n = 100, p_s = 0.15, and Δ_j/S_j = 0.3). Relaxed power is not greater with n = 100, whereas it doubles with larger n. This shows that the chance of detecting at least one biomarker with n = 100 or less is less than 20%. Using group-label permutations on the intensity data, the apparent individual power reached 5 to 10% in the same scenario, which marks a non-informative experiment. Figure 2 compares results obtained with and without FDR control. FDR control led to a loss of power that depended on the differential effect and on the nature of the data. With the concentration data, FDR control decreased power only slightly. The introduction of instrumental variability with FDR control led to an important power decrease. In the case of a small differential effect (Δ_j/S_j = 0.3), FDR control caused an almost negligible loss of power on the concentration data, whereas power was reduced to about a third (50% to 18%) with the intensity data. Figure 3 examines the relationship between power and type-1 error. The graph represents the change of power with different FDR control levels. As expected, the more false positive conclusions are accepted, the better the power. The shapes of the curves were very similar across the various differential effects. Besides, the graph shows the major effects of sample size and differential effect on power regardless of the accepted FDR level. Considering a change of parameter n or of the accepted FDR level, power would be better increased by increasing n than by increasing the accepted FDR.
Discussion

The present simulation study gives insight into the ability to identify biomarkers in the frame of mass-spectrometry proteomics. The results focus on the statistical power of MS experiments before and after mass-spectrometry, with different sample sizes and subject distributions between diseased and non-diseased, and with and without FDR control.

Figure 2: Power comparison for different analytical steps (sample size n = 1000, prevalence p_s = 0.5). Naive refers to the lack of any multiple-testing type-1 error control (classical 5% cutoff); samples refers to the use of the concentration data (perfect knowledge of concentrations); spectra refers to the use of the intensity data. 95% confidence intervals are shown.

We show that statistical power in that setting is very low. Most previous experiments dealt with about 100 subjects at most. In such conditions, the probability of identifying a given biomarker is low (at most 10%). Results from group-label permuted data showed similar powers. Experiments with low sample sizes and low differential effects are not informative. Indeed, trying to apply that probability to the detection of all 8 simulated DEP yields a 0.1^8 = 10^-8 power; thus, it is virtually impossible to detect all 8 DEP. Sample sizes should be increased, but equal group sizes also allow a less costly gain of power. Balanced designs would achieve the highest possible power, all other parameters being equal. This is in agreement with results from non-omic studies and with other omic studies such as that of Tsai et al. [14], who showed in transcriptomics clear relationships between power measures and sample size estimations. Our results revealed an increasing loss of power along the successive simulation steps. When instrumental variability was set aside, power reached 80% or more in many settings and introducing FDR control did not lead to a high power loss; the ratio of powers obtained with the naive and the FDR approach was close to 1. Under some circumstances, instrumental variability halved power (a 98% to 50% drop for Δ_j/S_j = 0.3, n = 1000, and p_s = 0.5).
Whenever both components of variability are taken into account, the use of FDR control induces a greater loss of power than with the concentration data: power then falls to 18% (about a third of its previous value) and the ratio of powers obtained with the naive and the FDR approach is then less than 0.4. This reveals a novel detrimental synergy between instrumental variability and FDR control. Jointly, biological variability, instrumental variability, and multiple testing generate a dramatic power loss that should be carefully considered in MS studies. Thus, a general strategy to improve power is to reduce variability through: i) increasing sample sizes; ii) using matched experimental designs whenever possible; iii) performing experiment replications. The

Figure 3: Evolution of individual power with the FDR level (q-value correction), for different sample sizes n (different point shapes) and different differential effects Δ_j/S_j, with p_s = 0.5. Solid lines correspond to Δ_j/S_j = 0.3, dashed lines to Δ_j/S_j = 0.5, and dotted lines to Δ_j/S_j = 0.75.

use of mixed models to refine biological variability estimations allows discriminating between biological and instrumental variability [6] and might help improve power. Finally, there is also room for technology improvements in mass-spectrometry, which would also reduce instrumental variability. The very definition of statistical power is crucial with MS data. Here, we used two simple measures: individual power and relaxed power. The first is the most demanding. It is also equivalent to E(S)/m_1 in our setting, where the probabilities of spotting the various biological markers are all equal within a single collection of experiments. Together, these measures extract some knowledge from an MS dataset. Experiments with low individual power are still able to identify at least one biomarker, but surely not all. However, further research is required to compare power measures in multiple-testing contexts. By definition, the differential effect affects the statistical power. Here, we have chosen a rather low differential effect (0.3) because we did not expect newly identified biomarkers to show larger differential effects. We believe this small value should be taken as a reference in study designs. The use of this value explains the low powers seen in the present simulation study. Indeed, more optimistic results were obtained in a recent study by Cairns et al. [6], who used higher differential effects (mostly Δ_j/S_j > 1). They therefore sought other classes of biomarkers, outside the expected range of differential effects for new biomarkers. Besides, Cairns's study shared two identical sources of variability with ours but did not investigate their respective effects on power. Here, this investigation led to strategies to improve power. The power issue is a major concern in all MS technologies, but this article focused on MALDI- and SELDI-TOF instruments because most current studies rely on these instruments.
Electrospray ionization (ESI) combined with liquid chromatography MS (LC-MS) is increasingly used in biomedical proteomics and offers an alternative to the former instruments. We believe the general frame of our work is applicable to explore the value of LC-MS technology. Simulations imply various hypotheses concerning peptide or protein concentrations, sample sizes, and mass-spectrometry technology. The choice of a Gaussian variability to simulate concentrations might seem simplistic. However, previous experiments have shown that simple algorithms such as Linear Discriminant Analysis [15] have good spectra-clustering performance, making the Gaussian hypothesis a reasonable assumption. Also, spectra with properties similar to those of real mass-spectrometry spectra were simulated. However, the present simulations did not consider the variability of peak positions within an experiment and did not require alignment algorithms in the pre-processing steps. Spectrum alignment is an essential step of pre-processing that can be improved at the data acquisition step. The present power results apply to well-aligned data, so that poor power should be expected from poorly aligned spectra.

Conclusions

The present power results explain, at least partially, our inability to identify the same sets of biomarkers from one experiment to another. Our key result for clinicians and experimenters is that most current experimental designs do not ensure enough power to identify several, or even a few, biomarkers per experiment. Therefore, each experiment finds a different part of a given set of biomarkers. This was already reported in transcriptomics [16, 17]. The frequent identification of acute-phase inflammation proteins [18, 19, 20] is not surprising because these proteins exhibit very large differential effects between diseased and non-diseased groups and, thus, require lower sample sizes.
Our key result for statisticians involved in study designs is the negative synergy between instrumental variability and FDR control, which explains why most experiments conducted so far were not calibrated to identify specific biomarkers with small differential effects around 0.3. Ideally, sample sizes should be increased and instrumental variability decreased to the greatest possible extent.
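The gap between the two power measures discussed above can be made concrete with a small simulation. The sketch below is purely illustrative and is not the simulation design of the study: the peak count, group sizes, unit-variance z-tests, and the Benjamini-Hochberg step-up procedure (used here in place of the FDR estimation of [13]) are all simplifying assumptions. It estimates individual power as E(S)/m_1 (expected fraction of true biomarkers declared significant) and relaxed power as P(S >= 1) (at least one true biomarker found).

```python
import math
import numpy as np

def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure; returns a boolean rejection mask."""
    m = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= q * np.arange(1, m + 1) / m
    k = (np.nonzero(below)[0].max() + 1) if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected

def simulate_powers(n_per_group=50, m=100, m1=10, delta=0.3,
                    q=0.05, reps=500, seed=0):
    """Estimate (individual, relaxed) power for m peaks, of which the
    first m1 are true biomarkers with standardized effect delta."""
    rng = np.random.default_rng(seed)
    s_counts = np.empty(reps)
    for r in range(reps):
        a = rng.normal(0.0, 1.0, (n_per_group, m))
        b = rng.normal(0.0, 1.0, (n_per_group, m))
        b[:, :m1] += delta                       # differential effect on true biomarkers
        # Two-sample z-statistic with known unit variance (a simplification)
        z = (b.mean(0) - a.mean(0)) / math.sqrt(2.0 / n_per_group)
        # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
        p = np.array([math.erfc(abs(zi) / math.sqrt(2.0)) for zi in z])
        s_counts[r] = bh_reject(p, q)[:m1].sum()  # S = true positives found
    return s_counts.mean() / m1, (s_counts >= 1).mean()

ind, rel = simulate_powers()
print(f"individual power ~ {ind:.2f}, relaxed power ~ {rel:.2f}")
```

By construction, relaxed power is always at least as large as individual power (S/m_1 never exceeds the indicator of S >= 1); with a small effect such as 0.3 and about 100 spectra, both remain modest, consistent with the low powers reported above.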

Authors' contributions

PR and PD designed the study. TJ performed the simulations, analyzed the data, and drafted the manuscript. PR, DMB, and TJ participated in drafting the manuscript. All authors read and approved the final version.

Acknowledgements

The authors wish to thank Caroline Truntzer for valuable discussions and Jean Iwaz for a thorough editing of the manuscript.

References

[1] L. C. Whelan, K. A. R. Power, D. T. McDowell, J. Kennedy, and W. M. Gallagher. Applications of SELDI-MS technology in oncology. J Cell Mol Med, 12(5A): ,
[2] M. A. Karpova, S. A. Moshkovskii, I. Y. Toropygin, and A. I. Archakov. Cancer-specific MALDI-TOF profiles of blood serum and plasma: biological meaning and perspectives. J Proteomics, Sep
[3] Mei-Ling Ting Lee and G. A. Whitmore. Power and sample size for DNA microarray studies. Stat Med, 21(23): , Dec
[4] Yudi Pawitan, Stefano Calza, and Alexander Ploner. Estimation of false discovery proportion under general dependence. Bioinformatics, 22(24): , Dec
[5] Caroline Truntzer, Delphine Maucort-Boulch, and Pascal Roy. Comparative optimism in models involving both classical clinical and gene expression information. BMC Bioinformatics, 9:434,
[6] David A. Cairns, Jennifer H. Barrett, Lucinda J. Billingham, Anthea J. Stanley, George Xinarianos, John K. Field, Phillip J. Johnson, Peter J. Selby, and Rosamonde E. Banks. Sample size determination in clinical proteomic profiling experiments using mass spectrometry for class comparison. Proteomics, 9(1):74–86, Jan
[7] Jakob Albrethsen. Reproducibility in protein profiling by MALDI-TOF mass spectrometry. Clin Chem, 53(5): , May
[8] Catherine Mercier, Caroline Truntzer, Delphine Pecqueur, Jean-Pascal Gimeno, Guillaume Belz, and Pascal Roy. Mixed-model of ANOVA for measurement reproducibility in proteomics. J Proteomics, 72(6): , Aug
[9] R Development Core Team. R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria, ISBN
[10] Xiaochun Li. PROcess: Ciphergen SELDI-TOF Processing. R package version
[11] Kevin R. Coombes, Spiridon Tsavachidis, Jeffrey S. Morris, Keith A. Baggerly, Mien-Chie Hung, and Henry M. Kuerer. Improved peak detection and quantification of mass spectrometry data acquired from surface-enhanced laser desorption and ionization by denoising spectra with the undecimated discrete wavelet transform. Proteomics, 5(16): , Nov 2005.

[12] Jeffrey S. Morris, Kevin R. Coombes, John Koomen, Keith A. Baggerly, and Ryuji Kobayashi. Feature extraction and quantification for mass spectrometry in biomedical applications using the mean spectrum. Bioinformatics, 21(9): , May
[13] J. D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B (Methodological), 64:479,
[14] Chen-An Tsai, Sue-Jane Wang, Dung-Tsa Chen, and James J. Chen. Sample size for gene expression microarray experiments. Bioinformatics, 21(8): , Apr
[15] Bart J. A. Mertens. Proteomic diagnosis competition: design, concepts, participants and first results. J Proteomics, 72(5): , Jul
[16] Liat Ein-Dor, Itai Kela, Gad Getz, David Givol, and Eytan Domany. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics, 21(2): , Jan
[17] Liat Ein-Dor, Or Zuk, and Eytan Domany. Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc Natl Acad Sci U S A, 103(15): , Apr
[18] Eleftherios P. Diamandis. Mass spectrometry as a diagnostic and a cancer biomarker discovery tool: opportunities and potential limitations. Mol Cell Proteomics, 3(4): , Apr
[19] Eleftherios P. Diamandis and Da-Elene van der Merwe. Plasma protein profiling by mass spectrometry for cancer diagnosis: opportunities and limitations. Clin Cancer Res, 11(3): , Feb
[20] Glen L. Hortin. The MALDI-TOF mass spectrometric view of the plasma proteome and peptidome. Clin Chem, 52(7): , Jul 2006.