BIOINFORMATICS ORIGINAL PAPER


Vol. 25 no. 22 2009, pages 2969-2974
doi:93/bioinformatics/btp5

Data and text mining

Improving peptide identification with single-stage mass spectrum peaks

Zengyou He and Weichuan Yu
Laboratory for Bioinformatics and Computational Biology, Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, China

Received on March 8, 2009; revised on August 2, 2009; accepted on August 3, 2009
Advance Access publication August 8, 2009
Associate Editor: Thomas Lengauer

ABSTRACT

Motivation: Database searching is the major peptide identification method in shotgun proteomics. It searches tandem mass spectrometry (MS/MS) spectra against a protein database to identify target peptides. The success of such a database searching method relies on a scoring algorithm that can accurately evaluate the quality of peptide-spectrum matches (PSMs). However, current scoring algorithms frequently generate inaccurate assignments because of variation and noise in the MS/MS spectra. To address this issue, we aim to improve peptide identification by using additional information from other data sources.

Results: Single-stage MS data are complementary to MS/MS data in the sense that they provide broader mass coverage but less sequence information. In this article, we show that single-stage MS data can be used to re-rank PSMs. The proposed method uses a linear combination of MS-based and MS/MS-based scores to perform re-ranking. Experimental results on real data show that such a re-ranking strategy improves identification performance significantly.

Availability:
Contact: eezyhe@ust.hk
Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

In shotgun proteomics, complex protein mixtures are first digested by enzymes (such as trypsin) to generate peptide mixtures. The peptide mixtures are then separated with liquid chromatography (LC), and the eluting peptides are introduced into a mass spectrometer (MS).
In tandem mass spectrometry (MS/MS or MS2) approaches, some individual peptides are chosen for further fragmentation via collision-induced dissociation (CID). The identification of peptides by searching MS/MS spectra against a protein database is one of the key computational problems in current proteomics research. A peptide-spectrum match (PSM) scoring algorithm compares an experimental MS/MS spectrum with a theoretical spectrum derived from a peptide sequence. If the similarity score is larger than a predefined threshold, we consider that the MS/MS spectrum corresponds to the theoretical peptide.

To date, MASCOT (Perkins et al., 1999), SEQUEST (Eng et al., 1994) and X!Tandem (Craig and Beavis, 2004) are probably the most widely used PSM scoring algorithms. These PSM scoring algorithms use information exclusively from one single MS/MS spectrum to infer the peptide.

To whom correspondence should be addressed.

1.1 Motivation of re-ranking

Identifying the best peptide for each spectrum is only the first step in peptide identification. We still need to determine whether the resulting PSM is correct. A perfect scoring function should rank the correct PSMs higher than the incorrect PSMs. However, existing scoring functions generally fail to achieve such a separation for various reasons, including poor spectrum quality and post-translational modifications (PTMs) of proteins. The objective of PSM re-ranking is to adjust the initial scores of PSMs so as to obtain a better separation.

1.2 Existing re-ranking methods

The idea of re-ranking PSMs has been discussed in many research papers. For instance, machine learning techniques have been widely used to build re-ranking models (Frank, 2009; Kall et al., 2007; Klammer et al., 2008; Lin et al., 2008). The success of these algorithms can be attributed to the large volume of MS/MS data available for training. Some other approaches use additional information in LC-MS/MS data to facilitate re-ranking. For instance, Klammer et al. (2007) utilized the retention time
as an additional feature to improve peptide identification. The key issue in this method is to build an accurate regression model for retention time prediction using existing MS/MS spectra.

Unlike standard PSM scoring algorithms, these re-ranking methods utilize information from multiple MS/MS spectra rather than one single MS/MS spectrum. The analysis of multiple spectra can identify common patterns that distinguish correct PSMs from incorrect ones. However, it is still very difficult to obtain a discriminative model that is universally applicable to different platforms and experimental conditions. Moreover, these methods use MS/MS spectra exclusively, ignoring other rich sources of information such as single-stage MS data.

1.3 MS-based re-ranking

Nowadays, high-accuracy mass spectrometers are widely used in MS data generation. This creates the possibility of separating peptides with only subtle mass differences using MS data. More importantly, MS data have broader mass coverage than MS2 data. Combining MS data into the PSM re-ranking process therefore has the potential to improve the identification of peptides.

In the direction of data fusion, several research groups have demonstrated the advantages of combining MS2 and MS3 spectra from the same peptide to improve identification performance (Bandeira et al., 2008; Ulintz et al., 2008). We would like to point out that the generation of MS3 data requires additional effort, whereas MS data come at no extra cost since they naturally precede MS2 data in the data acquisition process.

1.4 Our re-ranking strategy

The combination of MS2 and MS data has been used by Lu et al. (2008) to facilitate protein identification. Their main objective was to identify more proteins from the data. They used MS2 data and MS data independently, so the original PSM scores remain intact. One major limitation of the method of Lu et al. (2008) is therefore that it cannot improve PSMs. Here, we describe a fundamentally different approach and show that MS2 and MS data can be combined in an interactive manner to improve the identification of peptides. Experimental results in Section 3.3 also show that improved peptide identification leads to better protein identification performance.

The rest of the article is organized as follows: Section 2 describes our re-ranking strategy in detail; Section 3 presents the experimental results; Section 4 concludes the article.

© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

2 METHODS

The basic idea of our method is very simple: if one PSM is correct, there exists at least one protein that contains the corresponding peptide. As a consequence, this protein will produce many other peptides that correspond to peaks in the preceding MS spectrum, although these peptides may not be sequenced by the tandem MS. The re-ranking strategy is described in Figure 1. It consists of the following steps:

(1) PSM scoring: scoring PSMs with an existing peptide identification algorithm. Here, we use $S_i^{(2)}$ to denote the i-th peptide
score using MS2 data.
(2) Peptide-protein mapping: mapping each peptide to the proteins that it belongs to. Let $U(i)$ denote the set of proteins that contain the i-th peptide.
(3) Protein scoring: using MS peaks to rank each protein. For simplicity, we use the normalized shared peak count (nspc) as the score, i.e. the ratio between the number of matched MS peaks and the number of peaks in the theoretical spectrum of the corresponding protein. The MS-based score of the j-th protein is denoted by $A_j^{(1)}$.
(4) PSM re-ranking: re-ranking PSMs by combining the initial MS2-based scores from Step (1) and the MS-based scores from Step (3).

Fig. 1. Overview of the re-ranking strategy using the MS information. (Figure labels: Protein 1, Protein 2, ..., Protein k-1, Protein k; Peak 1, Peak 2, ..., Peak i, ..., Peak j, ..., Peak t.)

Since one peptide may belong to multiple proteins, we define the MS-based identification score of the i-th peptide as:

$$S_i^{(1)} = \frac{\sum_{j \in U(i)} A_j^{(1)}}{|U(i)|} \qquad (1)$$

As $S_i^{(1)}$ and $S_i^{(2)}$ may have different ranges, we first transform them into the interval [0, 1] using min/max normalization. The optimal combination of the $S_i^{(1)}$ and $S_i^{(2)}$ scores is a typical multicriteria decision-making problem (Dyer et al., 1992), a branch of a general class of operational research models that deal with decision problems in the presence of a number of decision criteria. To solve such a problem, multi-attribute value theory (MAVT) is widely used because of its simplicity in both concept and computation. One of the most popular methods in MAVT is the linear combination approach:

$$S_i = \lambda S_i^{(1)} + (1 - \lambda) S_i^{(2)}, \qquad (2)$$

where $S_i$ is the fused score and $\lambda \in [0, 1]$ is a regularization parameter controlling the relative importance of $S_i^{(1)}$ and $S_i^{(2)}$:
- When λ = 0, no re-ranking is performed.
- When 0 < λ < 1, we use the fused score to perform re-ranking.
- When λ = 1, we ignore the MS2-based score and rely entirely on the MS-based score to re-rank PSMs.

In practice, we suggest using λ = 0.5 (equal weights) as the default parameter setting. Moreover, it is also possible to determine the weight
parameter automatically by analyzing the vectors of scores. Shannon's entropy-based weighting method (Zeleny, 1982) is one of the widely used heuristic approaches for automatic weight determination. This method first performs the score transformation:

$$x_i^{(k)} = S_i^{(k)} \Big/ \sum_{t=1}^{n} S_t^{(k)}, \qquad (3)$$

where $k = 1, 2$ and $n$ is the number of spectra, i.e. the length of each score vector. Then, the λ value is computed as:

$$\lambda = \frac{\ln n + \sum_{t=1}^{n} \ln x_t^{(1)}}{2 \ln n + \sum_{t=1}^{n} \ln\left(x_t^{(1)} x_t^{(2)}\right)} \qquad (4)$$

In Section 3.4, we show experimental results on the sensitivity of our method to different λ values.

3 RESULTS

3.1 Data and experimental design

In the experiments, we use two publicly available high-resolution MS datasets:
- The ISB standard mixture dataset (Klimek et al., 2008)¹
- The ABRF sPRG2006 mixture dataset²

¹ The second replicate of raw data of mixture 2 on the QSTAR platform.
² (Lane/62Yrasprg525-ct5RAW)
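To make the re-ranking steps of Section 2 concrete, the following is a minimal Python sketch of Step (3) and Step (4): the normalized shared peak count, the peptide-level MS score of Equation (1), min/max normalization and the linear fusion of Equation (2). It is an illustration under stated assumptions, not the authors' implementation: all function names and the toy values are hypothetical, peak matching is assumed to use a symmetric ppm window around each theoretical mass, and the MS2-based scores are assumed to be oriented so that higher is better (an E-value would first need a transformation such as a negative logarithm, which the text does not specify).

```python
from statistics import mean


def nspc(theoretical_masses, observed_masses, tol_ppm):
    """Normalized shared peak count for one protein (Step 3).

    Fraction of the protein's theoretical peptide masses that have at
    least one observed MS peak within +/- tol_ppm (assumed matching rule).
    """
    if not theoretical_masses:
        return 0.0
    matched = 0
    for m in theoretical_masses:
        window = m * tol_ppm * 1e-6
        if any(abs(obs - m) <= window for obs in observed_masses):
            matched += 1
    return matched / len(theoretical_masses)


def peptide_ms_scores(peptide_to_proteins, protein_scores):
    """Equation (1): mean MS-based score over the proteins in U(i)."""
    return [mean(protein_scores[p] for p in proteins)
            for proteins in peptide_to_proteins]


def min_max(scores):
    """Min/max normalization into [0, 1]; a constant vector maps to zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(ms_scores, ms2_scores, lam=0.5):
    """Equation (2): lam * S1 + (1 - lam) * S2 on normalized scores."""
    s1, s2 = min_max(ms_scores), min_max(ms2_scores)
    return [lam * a + (1 - lam) * b for a, b in zip(s1, s2)]


# Toy data (hypothetical): three peptides, the third maps only to a decoy.
protein_scores = {"P1": 0.8, "P2": 0.2, "DECOY_1": 0.0}
peptide_to_proteins = [["P1"], ["P1", "P2"], ["DECOY_1"]]
ms2_scores = [2.0, 1.0, 3.0]

s1 = peptide_ms_scores(peptide_to_proteins, protein_scores)
fused = fuse(s1, ms2_scores, lam=0.5)
ranking = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
```

In this toy example the decoy peptide has the best MS2-based score, but its MS-based score is zero, so with λ = 0.5 it drops from the first to the second rank. With `lam=0.0` the fused scores equal the normalized MS2-based scores, which corresponds to the no-re-ranking case.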

Fig. 2. Effect of re-ranking when X!Tandem is used as the baseline ranker. We use X!Tandem to rank PSMs according to their E-values. Our method re-ranks PSMs by adjusting the E-values with MS-based protein evaluation scores. Here, we set λ to 0.5. (a) ROC curves on the ISB standard mixture data. (b) ROC curves on the ABRF sPRG2006 data. (Horizontal axis: False Positive Rate (1 - Specificity); curves: X!Tandem and Our Method.)

Fig. 3. Effect of re-ranking when Crux is used as the baseline ranker. We use Crux to rank PSMs according to their XCorr scores. Our method re-ranks PSMs by adjusting the XCorr scores with MS-based protein evaluation scores. Here, we set λ to 0.5. (a) ROC curves on the ISB standard mixture data. (b) ROC curves on the ABRF sPRG2006 data. (Horizontal axis: False Positive Rate (1 - Specificity); curves: Crux and Our Method.)

We use X!Tandem (version 2772; Craig and Beavis, 2004) and Crux (version 2; Park et al., 2008) as the baseline methods for peptide identification. Crux is a re-implementation of the widely used database search program SEQUEST (Eng et al., 1994). We use the E-value and the XCorr score to rank PSMs in X!Tandem and Crux, respectively.

In database searching, we use the Swiss-Prot database (release 566) as the target database and create a decoy database by shuffling each target protein sequence. The shuffled decoy database contains the same number of protein sequences as the target database, and each decoy protein sequence is generated by randomly permuting the residues of the corresponding target protein sequence.

The parameters used for peptide identification are: mono-isotopic masses, mass tolerance of 2 Da for precursor, mass tolerance of Da for fragment ion, fixed modification (carboxamidomethyl, +57 Da) on Cys and one missed cleavage site; only b and y fragment ions are taken into account.
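The decoy construction described above (one decoy per target protein, obtained by randomly permuting its residues) can be sketched in a few lines of Python. This is only an illustration: the function name, the `DECOY_` prefix, the fixed seed and the toy sequences are assumptions made for the example and are not taken from the paper.

```python
import random


def shuffled_decoys(targets, seed=0):
    """Build one decoy per target by randomly permuting its residues.

    targets: dict mapping protein name -> amino acid sequence.
    Returns a dict with the same number of entries; each decoy keeps the
    length and residue composition of its target.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    decoys = {}
    for name, sequence in targets.items():
        residues = list(sequence)
        rng.shuffle(residues)
        decoys["DECOY_" + name] = "".join(residues)  # hypothetical naming scheme
    return decoys


# Toy target database (hypothetical sequences).
targets = {"P1": "MKTAYIAKQRQISFVK", "P2": "GAVLIPFMWSTCYNQ"}
decoys = shuffled_decoys(targets, seed=1)
```

Because each decoy preserves the amino acid composition and length of its target, the target and decoy databases have the same size, which matches the description in the text.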
The criterion for filtering PSMs is E-value for X!Tandem and XCorr 3 for Crux, respectively. We use Decon2LS (Jaitly et al., 2009) and VIPER (Monroe et al., 2007) to identify MS peaks from the raw data with their default parameter settings. The parameters used for MS-based protein scoring are: mono-isotopic peaks, one missed cleavage site, fixed modification on Cys and mass tolerance of 5 ppm.

In performance evaluation, a peptide-spectrum pair is labeled as a false positive if the peptide appears in the decoy database. In this context, we can plot the receiver operating characteristic (ROC) curve for any given ranked PSM list. We also use the area under the ROC curve (AUC) as a single numeric indicator of overall performance.

3.2 Peptide identification

In Figures 2 and 3, we use X!Tandem and Crux, respectively, as the baseline method to test the effectiveness of our method. Our re-ranking strategy improves identification consistently on both datasets, regardless of which peptide identification algorithm is used. Moreover, when the false positive rate is relatively small (e.g. %), our method achieves a significantly higher true positive rate than the baseline methods. This demonstrates that the proposed re-ranking technique can identify more true positives when false identifications are strictly controlled.

The effectiveness of our method owes a great deal to the fact that ground-truth proteins generate many MS peaks, even though most of them are not selected for MS2 sequencing. False positives do not have this property. As a result, ground-truth proteins achieve better identification scores with respect to MS data, which distinguishes true positives from false positives. Indeed, we observe in the experimental results that the MS-based scores of some decoy peptides are zero.

Table 1. Protein identification performance comparison of different methods

Dataset     Platform    No. of identifications (Target/Decoy)
                        Baseline method    Method in Lu et al. (2008)    Our method
ISB data    X!Tandem    4/                 45/                           97/
ISB data    Crux        38/                43/                           7/
ABRF data   X!Tandem    7/                 27/8                          272/
ABRF data   Crux        69/                285/8                         424/

In the baseline method and our method, the numbers of correct and incorrect protein identifications are obtained with probability greater than or equal to 0.95. In the method of Lu et al. (2008), we use a random database of synthetic proteins to generate the distribution of unique m/z hits. In this random database, the length of each protein is fixed to 5. In database searching, the mass tolerance threshold is set to ppm for the z-score-based PMF algorithm.

3.3 Protein identification

Here, we conduct experiments to show that the proposed method can also lead to better protein identification. We first use PeptideProphet (Keller et al., 2002) to transform the ranking scores into peptide identification probabilities. Then, we calculate the protein identification probability using the method of MacCoss et al. (2002), which has also been used in ProteinProphet (Nesvizhskii et al., 2003) as the main estimate of protein probability. To evaluate the performance of protein identification, we take those
confident and non-decoy proteins as true positives. In other words, a reported non-decoy protein is regarded as correct if its identification probability is not less than a given probability threshold. In this setting, we are able to compare the performance of the baseline method with that of our re-ranking method.

Lu et al. (2008) proposed three different approaches to improve protein identification by combining MS2 and MS data. Among these, the third approach is most closely related to our method since it also uses peptide mass fingerprinting (PMF) to perform MS-based protein identification. Hence, we choose this method for comparison. In database searching, we use their z-score-based PMF algorithm to search against the database. To generate PMF identifications, we sort all candidate proteins in descending order of their z-scores and continue to accept protein identifications until the false discovery rate (FDR) is >5%. Here, FDR is defined as the number of identifications from the decoy database divided by the number of identifications from the target database above a given threshold. The MS-based protein identifications are then combined with the MS2-based identifications of the baseline method as the final result of the method of Lu et al. (2008).

Table 1 shows the number of correct and incorrect protein identifications achieved by the different methods. Compared with the baseline method, our method identifies more confident proteins under the same probability threshold. Compared with the method of Lu et al. (2008), our method reports more target proteins and fewer decoy proteins at the same time.

3.4 Effect of parameters

To test the sensitivity of our algorithm to the regularization parameter, we vary λ from 0 to 1 and plot the AUC values in Figure 4. It shows that λ ranging from 0.3 to 0.7 yields better results. In Figure 4a, the increase of λ from 0 to 0.3 leads to a noticeable performance gain on the ISB standard mixture data. This is mainly because some decoy PSMs have good MS2-based
identification scores before re-ranking. When we place more weight on MS data, the fused scores of these false positives decrease. In contrast, the performance of our method is less sensitive to λ on the ABRF sPRG2006 data. This is because most false positives have already been assigned low ranks before re-ranking. Thus, putting more weight on MS data in the re-ranking process does not help much.

We apply the entropy-based method to determine λ. For both datasets, we obtain λ = 0.49 using X!Tandem scores and λ = 0.48 using Crux scores. We have several comments on the results:
- The entropy-based method is a good estimator for λ and is of practical use in automatic parameter determination.
- There is no guarantee that the entropy-based method always provides the optimal value of λ, as shown in Figure 4a.

So far, we have tested the identification performance using a single set of search conditions. It is also necessary to check how the performance varies when different database search parameters are used. The number of possible parameter combinations is huge, making it infeasible to check all of them. Here, we focus on the mass tolerance parameter for MS peak matching in our re-ranking method. The reasons for choosing this parameter are the following:
- MS-based protein identification is more sensitive to the mass tolerance threshold since it only utilizes m/z information.
- Since we use MS2-based peptide identification results as the baseline, the improvement in peptide identification mainly depends on the success of the MS-based score adjustment.
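The parameter study in this section can be illustrated with a short Python sketch that sweeps λ, computes the AUC of the fused ranking from target/decoy labels, and evaluates the entropy-based weight. Several points are assumptions rather than details from the paper: the function names and toy scores are hypothetical; both score vectors are assumed to be already normalized to [0, 1] with higher meaning better; the AUC is computed through its rank-statistic form (the probability that a target PSM outscores a decoy PSM, ties counted as one half); and `entropy_lambda` transcribes Equation (4) as it is printed in this text, with a small floor `eps` added so that a zero score does not produce log(0).

```python
import math


def auc(scores, is_decoy):
    """Area under the ROC curve, treating decoy PSMs as false positives."""
    targets = [s for s, d in zip(scores, is_decoy) if not d]
    decoys = [s for s, d in zip(scores, is_decoy) if d]
    if not targets or not decoys:
        raise ValueError("need at least one target and one decoy PSM")
    wins = sum((t > d) + 0.5 * (t == d) for t in targets for d in decoys)
    return wins / (len(targets) * len(decoys))


def sweep_lambda(s1, s2, is_decoy, steps=10):
    """AUC of lam * s1 + (1 - lam) * s2 for lam = 0, 1/steps, ..., 1."""
    results = []
    for k in range(steps + 1):
        lam = k / steps
        fused = [lam * a + (1 - lam) * b for a, b in zip(s1, s2)]
        results.append((lam, auc(fused, is_decoy)))
    return results


def entropy_lambda(s1, s2, eps=1e-12):
    """Equations (3) and (4) as printed above; eps is an added safeguard."""
    n = len(s1)
    if n < 2 or len(s2) != n or sum(s1) <= 0 or sum(s2) <= 0:
        raise ValueError("need two equally long score vectors with positive sums")

    def log_sum(scores):
        total = sum(scores)  # Equation (3): x_t = S_t / sum of all S
        return sum(math.log(max(v / total, eps)) for v in scores)

    a, b = log_sum(s1), log_sum(s2)
    return (math.log(n) + a) / (2 * math.log(n) + a + b)


# Toy data (hypothetical): the third PSM is a decoy with a high MS2 score.
s1 = [1.0, 0.625, 0.0, 0.9]   # MS-based scores
s2 = [0.5, 0.0, 1.0, 0.8]     # MS2-based scores
is_decoy = [False, False, True, False]
curve = sweep_lambda(s1, s2, is_decoy, steps=10)
```

On this toy input the AUC rises from 0 at λ = 0 (the decoy is ranked first by the MS2-based score alone) to 1 at λ = 1, which mirrors the qualitative argument made for the ISB data; real data would not behave this cleanly.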

Fig. 4. Effect of the regularization parameter on the peptide identification performance in terms of AUC. Here λ ranges from 0 to 1. (a) Baseline method: X!Tandem. (b) Baseline method: Crux. (Horizontal axis: The Regularization Parameter: λ.)

Fig. 5. Effect of the mass tolerance threshold for MS peak matching on the peptide identification performance in terms of AUC. Here λ is fixed to 0.5 during the experiments. (a) Baseline method: X!Tandem. (b) Baseline method: Crux. (Horizontal axis: Mass Tolerance Threshold (ppm).)

In Figure 5, we vary the mass tolerance threshold from ppm to ppm to check the change in AUC values. Our method is more sensitive to this parameter on the ISB mixture data. The main reason is that the quality of the MS peaks extracted from the ISB mixture is not very good. The most straightforward evidence is provided in Table 1: the method of Lu et al. (2008) can identify more non-decoy proteins from the ABRF sPRG2006 data using PMF search at the same FDR threshold. The performance fluctuation on the ISB mixture data indicates that we should be cautious about using a stricter mass tolerance threshold when the quality of the MS peaks is not very good. In practice, we may use a larger mass tolerance threshold (e.g. 5 ppm) to achieve stable performance.

This experiment also shows that high-quality MS peaks are critical to the success of MS-based identification methods. This is the main reason that we selected only the ISB mixture data and the ABRF sPRG2006 data. The quality of the extracted MS peaks is highly dependent on the preprocessing tools used. The development of more accurate peak picking algorithms will help to overcome this limitation.

Since the quality of mass spectra may vary broadly across experiments, it is also necessary to conduct a performance comparison using spectra of different intrinsic quality. However, it is non-trivial to measure and control the quality of real mass spectra, making it
difficult to perform such an experimental study quantitatively and conclusively.

4 CONCLUSIONS

MS and MS2 data are complementary to each other: the former provides broader mass coverage, while the latter provides more sequence information. A seamless integration of both types of data enables us to achieve better peptide identification performance. This article proposes a re-ranking strategy that adjusts the original peptide identification scores using MS-based protein identification scores. The effectiveness of this combination method is verified experimentally.

Our proposed linear combination strategy is probably the simplest way to combine multiple scores. Many other combination methods could be used for the same purpose.

In our future work, we will focus on developing an integrated optimization model to identify proteins from the combination of MS and MS/MS data.

ACKNOWLEDGEMENTS

The comments and suggestions from the anonymous reviewers greatly improved the article. We thank Dr Henry LAM for valuable discussions.

Funding: This work was supported by the general research fund 6277 from the Hong Kong Research Grant Council, a research proposal competition award RPC7/8EG25 and a postdoctoral fellowship from the Hong Kong University of Science and Technology.

Conflict of Interest: none declared.

REFERENCES

Bandeira,N. et al. (2008) Multi-spectra peptide sequencing and its applications to multistage mass spectrometry. Bioinformatics, 24, i46 i423.
Craig,R. and Beavis,R.C. (2004) Tandem: matching proteins with tandem mass spectra. Bioinformatics, 20,
Dyer,J.S. et al. (1992) Multiple criteria decision making, multiattribute utility theory: the next ten years. Manage. Sci., 38,
Eng,J.K. et al. (1994) An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom., 5,
Frank,A. (2009) A ranking-based scoring function for peptide-spectrum matches. J. Proteome Res., 8,
Jaitly,N. et al. (2009) Decon2LS: an open-source software package for automated processing and visualization of high resolution mass spectrometry data. BMC Bioinformatics, 10, 87.
Kall,L. et al. (2007) Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods, 4,
Keller,A. et al. (2002) Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem., 74,
Klammer,A.A. et al. (2007) Improving tandem mass spectrum identification using peptide retention time prediction across diverse chromatography conditions. Anal. Chem., 79, 6 68.
Klammer,A.A. et al. (2008) Modeling peptide fragmentation with dynamic Bayesian networks for peptide identification. Bioinformatics, 24, i348-i356.
Klimek,J. et al. (2008) The standard protein
mix database: a diverse dataset to assist in the production of improved peptide and protein identification software tools. J. Proteome Res., 7, 96 3.
Lin,Y. et al. (2008) A fragmentation event model for peptide identification by mass spectrometry. In Vingron,M. and Wong,L. (eds) Proceedings of the 12th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2008), Singapore, Vol. 4955 of LNBI. Springer, pp.
Lu,B. et al. (2008) Improving protein identification sensitivity by combining MS and MS/MS information for shotgun proteomics using LTQ-Orbitrap high mass accuracy data. Anal. Chem., 80,
MacCoss,M.J. et al. (2002) Probability-based validation of protein identifications using a modified SEQUEST algorithm. Anal. Chem., 74,
Monroe,M.E. et al. (2007) VIPER: an advanced software package to support high-throughput LC-MS peptide identification. Bioinformatics, 23,
Nesvizhskii,A.I. et al. (2003) A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem., 75,
Park,C.Y. et al. (2008) Rapid and accurate peptide identification from tandem mass spectra. J. Proteome Res., 7,
Perkins,D.N. et al. (1999) Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20,
Ulintz,P.J. et al. (2008) Investigating MS2/MS3 matching statistics: a model for coupling consecutive stage mass spectrometry data for increased peptide identification confidence. Mol. Cell Proteomics, 7, 7 87.
Zeleny,M. (1982) Multiple Criteria Decision Making. McGraw-Hill, New York.