Small, Standardized Protein Database Provides Rapid and Statistically Significant Peptide Identifications for Targeted Searches Using Percolator

Shadab Ahmad 1, Amol Prakash 1, David Sarracino 1, Bryan Krastins 1, Jennifer Sutton 1, Michael Athanas 1, Maryann Vogelsang 1, Alejandra Garces 1, MingMing Ning 2, Mary F Lopez 1
1 Thermo Fisher Scientific, Cambridge, MA; 2 Massachusetts General Hospital, Boston, MA

Overview

Purpose: To build a small, standardized protein database for rapid and statistically significant targeted protein identification and characterization.

Methods: We built a small, standardized human database that contains 881 commonly expressed human proteins. This database can be appended to any human targeted protein sequence. It gives a peptide validation algorithm (such as Percolator 1) a sufficient number of peptide-spectral matches (PSMs) to train its model and produce statistically significant results. To evaluate the performance of this database, two different kinds of human samples (liver tissue and plasma) were analyzed using a hybrid ion trap-Orbitrap™ mass spectrometer. The data were analyzed with specialized software for the analysis of proteomics data using the SEQUEST search engine coupled with the Percolator algorithm.

Results: The standardized database (appended with the respective targeted protein sequences) identified the targeted proteins correctly. Comparison with a complete database search demonstrated that peptide identification through the small database was statistically as robust as identification through the complete database. Moreover, the small database search is much faster and requires less computational power.

Introduction

In recent years, mass spectrometry has become an established method for protein identification and characterization. Protein identification based on mass spectrometry typically involves searching protein databases irrespective of individual needs or interests in particular targeted proteins. With the improvement of sequencing techniques, these databases have grown at a phenomenal rate. Each search requires large amounts of time and computing power, which becomes even more challenging and impractical when there is a need to explore different enzymatic products or multiple modifications (such as post-translational modifications, PTMs) in large numbers of samples. Moreover, in any such identification workflow it is equally important to validate the identified peptides.
Different search engines (such as Mascot™ and SEQUEST) return multiple scores for each identified peptide, but these scores are hard to interpret statistically and do not directly yield a false discovery rate (FDR) for the peptide. Percolator 1 is a state-of-the-art algorithm that uses these scores jointly as features in a semi-supervised machine learning model, assigning a weight to each feature in order to calculate the false discovery rate. 2,3 Using a peptide validation algorithm coupled with a search engine provides a highly desirable workflow for high-confidence protein identification. However, this combination cannot be used for a targeted database (consisting of few proteins): because of the limited database size, these algorithms fail to build the statistical model needed to calculate the FDR. To solve this problem, we created a standardized database that can be appended to targeted protein sequences and that allows rapid and statistically significant Percolator-validated peptide/protein identification and characterization for human proteins.

Methods

Creation of the Small Standard Database

We created a standardized database consisting of 881 of the most frequently identified human proteins. To build this database, we started with the list of the most commonly identified proteins from The Global Proteome Machine 4 as well as frequently identified proteins seen in our own experiments. The database was assembled so that it consists of highly expressed and frequently identified proteins from almost all cellular compartments, the cytoskeleton, and organelles, and contains housekeeping and plasma proteins (Figure 1). This database can be appended to one or more targeted protein sequences and thus can be used universally to identify and characterize peptides or proteins from almost all human tissue and plasma samples. The primary role of the database is to provide a sufficient number of true hits and false hits that a validation algorithm (such as Percolator) can use to build a robust and sound statistical model to calculate the FDR.
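The appending step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' tooling: the function names (`parse_fasta`, `append_targets`, `to_fasta`) are hypothetical, both inputs are assumed to be FASTA text, and the first whitespace-delimited token of each header is assumed to be a unique accession.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def append_targets(standard: Iterable[Record],
                   targets: Iterable[Record]) -> List[Record]:
    """Append targeted sequences to the standard set, skipping any
    accession (assumed: first token of the header) already present."""
    merged = {}
    for header, seq in list(standard) + list(targets):
        accession = header.split()[0]
        merged.setdefault(accession, (header, seq))
    return list(merged.values())


def to_fasta(records: Iterable[Record], width: int = 60) -> str:
    """Serialize records back to FASTA text with wrapped sequence lines."""
    out = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```

In use, the standard 881-protein file and a file holding the targeted sequences would each be read with `parse_fasta`, combined with `append_targets`, and written out with `to_fasta` before being handed to the search engine. Skipping duplicate accessions matters when a targeted protein is already part of the standard set.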
Sample Preparation

To evaluate the performance of the database, we took two different kinds of human samples: (a) liver biopsy samples (three biological replicates) and (b) plasma samples. Human plasma and tissue samples were collected with full consent and approval.

The tissue samples were lysed and subjected to reduction with dithiothreitol followed by alkylation with iodoacetamide. The reduced and alkylated samples were then digested with trypsin. The same protocol of reduction, alkylation, and digestion was followed for the plasma samples.

FIGURE 1. Features of the standard database (881 proteins). Categories shown: glycolytic enzymes, plasma, housekeeping, chromosomal, mitochondrial, nucleus, cytoskeletal.

Liquid Chromatography and Mass Spectrometry

The digested samples were separated on a C18 column (15 cm x 75 μm) with a 5% to 45% acetonitrile gradient in 0.1% formic acid using the Thermo Scientific EASY-nLC nanoflow LC system. The gradient was run for 90 minutes for tissue samples and 140 minutes for plasma samples. The samples were analyzed with a Thermo Scientific LTQ Orbitrap Velos hybrid ion trap-Orbitrap mass spectrometer. Collision-induced dissociation (CID) was used for the fragmentation of all three tissue samples. For the plasma sample, CID, higher-energy collisional dissociation (HCD), and electron-transfer dissociation (ETD) were all used for fragmentation.

Data Analysis

The data were searched against a complete human protein database (UniProt) and a small database using the SEQUEST search engine coupled with Percolator for peptide validation. For tissue, the small database consisted of the aforementioned 881 sequences appended with seven targeted protein sequences (apolipoprotein A-IV, apolipoprotein E, Ig lambda-2 chain C regions, serum amyloid A protein, serum amyloid P-component, transthyretin, and IGK@ protein). For plasma, five targeted proteins (coagulation factor II (prothrombin), pregnancy zone protein, clusterin, apolipoprotein A1, and albumin) were appended to the standard database.

Results

Tissue Samples

Three post-translational modifications (methylation and formylation at lysine and phosphorylation at serine) were included in both the complete database search and the small database search, so that the performance of both searches could also be evaluated in the presence of PTMs.
The results obtained from both searches were compared at the protein level as well as the peptide level. All the targeted proteins that were identified with high confidence in the complete database search were also identified at the same confidence level in the small database search (Table 1). Moreover, all the peptides (with and without PTMs) that were identified in the complete database search with high confidence (FDR threshold of 0.01) were also identified with high confidence in the small database search in all three samples. The Venn diagrams of high-confidence peptide identifications from the complete database search and the small database search are shown in Figure 2. In addition to the peptides that were found in both searches, the small database search was also able to identify a few more high-confidence peptides.

Thermo Scientific Poster Note PN63590_E 06/12S
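The peptide-level comparison behind such Venn diagrams reduces to set operations on the high-confidence identifications from each search. The Python sketch below is illustrative only: the input layout (a mapping from a sequence and modification pair to a q-value) and the names `confident` and `venn_counts` are assumptions, not the export format of any particular software.

```python
from typing import Dict, Set, Tuple

Peptide = Tuple[str, str]  # (sequence, modification string); assumed key


def confident(ids: Dict[Peptide, float], q_max: float = 0.01) -> Set[Peptide]:
    """Return the peptides whose q-value is at or below the threshold."""
    return {pep for pep, q in ids.items() if q <= q_max}


def venn_counts(complete: Dict[Peptide, float],
                small: Dict[Peptide, float],
                q_max: float = 0.01) -> Dict[str, int]:
    """Count high-confidence peptides unique to each search and shared."""
    a = confident(complete, q_max)
    b = confident(small, q_max)
    return {
        "complete_only": len(a - b),
        "shared": len(a & b),
        "small_only": len(b - a),
    }
```

Keying on the sequence together with its modifications keeps a modified and an unmodified form of the same peptide as separate identifications, which matches how the peptide tables in this note list them.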

Searching the small database took one-third of the time needed for the complete database. Furthermore, if a sample contains a large number of candidate peptides, the complete database search time will increase greatly, but the small database search time will not be affected much (as search time depends upon the number of proteins in a sample and the size of the database).

TABLE 1. Identified tissue proteins from the complete database search and the small database search, with their respective numbers of identified peptides (complete / small).

Targeted Proteins | Tissue Sample 1 | Tissue Sample 2 | Tissue Sample 3
Apolipoprotein E | | |
Serum amyloid A protein | Not identified / Not identified | Not identified / Not identified | 8 / 8
Serum amyloid P-component | | |
Transthyretin | | |
Apolipoprotein A-IV | | |
Ig lambda-2 chain C regions | | |
IGK@ protein | | |

FIGURE 2. Venn diagrams of high-confidence peptide identifications of targeted tissue proteins by the complete database (Comp. database) and the standardized small database (Small database): (a) Tissue 1, (b) Tissue 2, (c) Tissue 3.

Plasma Sample

Experiments with human plasma using different fragmentation techniques (CID, HCD, and ETD) also strongly support the use of a standardized small database instead of the complete database for targeted protein searches and for faster protein identification and characterization. The results obtained from the complete database search and the small database search were compared for each of the experiments at the protein level and the peptide level. All the low- and high-abundance targeted proteins (intentionally selected for comparison) that were identified with the complete database were also identified with the small database with high confidence. The number of identified peptides for each protein was very similar in both searches in each of the three experiments with CID, HCD, and ETD (Table 2). In addition to the peptides that were found in both searches, the small database search was also able to identify a few more high-confidence peptides (FDR threshold of 0.01). The high-confidence identified peptides (including the peptides with PTMs) from the complete database search and the small database search were compared through Venn diagrams (Figure 3). Table 3 shows the similarities between both searches with respect to identified peptides, post-translational modifications, and confidence levels (Percolator q values). The search with the small database was three times faster than the search with the complete database.
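The q values compared here come from Percolator, which rescores PSMs with a trained model. As a much simpler illustration of the underlying target-decoy idea (references 2 and 3), the Python sketch below estimates q-values directly from a ranked list of target and decoy PSMs. It is not Percolator's implementation; the function name `target_decoy_qvalues` and the input layout are assumptions made for illustration.

```python
from typing import List, Tuple


def target_decoy_qvalues(
    psms: List[Tuple[float, bool]]
) -> List[Tuple[float, bool, float]]:
    """Estimate q-values from (score, is_decoy) pairs.

    Higher scores are assumed to be better. The FDR at each rank is
    estimated as decoys / targets among all PSMs scoring at least as
    well; the q-value is the minimum FDR at that rank or any lower rank.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    fdrs = []
    targets = decoys = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # Walk from the worst score upward so each q-value is monotone.
    qvals = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvals.append(min(running_min, 1.0))
    qvals.reverse()

    return [(s, d, q) for (s, d), q in zip(ranked, qvals)]
```

The sketch also shows why a database of only a handful of proteins is a problem: with N accepted target PSMs, the smallest non-zero FDR estimate is 1/N, so a 0.01 threshold cannot be resolved until roughly a hundred target PSMs are available. Appending the targeted sequences to the 881-protein standard database supplies those additional matches.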

TABLE 2. Identified plasma proteins from a complete database search and a small database search, with their respective numbers of identified peptides with CID, HCD, and ETD fragmentation.

Targeted Proteins | Plasma CID | Plasma HCD | Plasma ETD
Apolipoprotein A-I | | |
Clusterin | | |
Pregnancy zone protein | | |
Prothrombin | | |
Serum albumin | | |

FIGURE 3. Venn diagrams of high-confidence peptide identifications of targeted plasma proteins by the complete database (Comp. database) and the standardized small database (Small database): (a) Plasma CID, (b) Plasma HCD, (c) Plasma ETD.

TABLE 3. Representative table (from the plasma HCD experiment) showing similarities between both searches with respect to identified peptides, post-translational modifications, and confidence levels (Percolator q values). Lowercase letters in a sequence mark modified residues.

Sequence | Modifications | Comp. | Small | Comp. q value | Small q value
AEFAEVSk | K8(Methyl) | YES | YES | 0 | 0
AEFAEVSK | | YES | YES | 0 | 0
AFQPFFVELTMPYSVIR | | YES | YES | 0 | 0
AKPALEDLR | | YES | YES | 0 | 0
ASSIIDELFQDR | | YES | YES | 0 | 0
ATEHLSTLSEK | | YES | YES | 0 | 0
ATEHLSTLSEk | K11(Methyl) | YES | YES | 0 | 0
ATVLNYLPK | | YES | YES | 0 | 0
AVmDDFAAFVEK | M3(Oxidation) | YES | YES | 0 | 0
AVMDDFAAFVEK | | YES | YES | 0 | 0
AVMDDFAAFVEk | K12(Methyl) | YES | YES | 0 | 0
DLATVYVDVLk | K11(Methyl) | YES | YES | 0 | 0
DLATVYVDVLK | | YES | YES | 0 | 0
DLGEENFK | | YES | YES | |
DLGEENFk | K8(Methyl) | YES | YES | |
DVFLGMFLYEYAR | | YES | YES | 0 | 0
DVFLGmFLYEYAR | M6(Oxidation) | YES | YES | 0 | 0
DYVSQFEGSALGK | | YES | YES | 0 | 0
EQLGPVTQEFWDNLEK | | YES | YES | 0 | 0
EQLGPVTQEFWDNLEk | K16(Methyl) | YES | YES | 0 | 0
ETAASLLQAGYK | | YES | YES | 0 | 0
ETYGEmADccAK | M6(Oxidation); C9(Carbamidomethyl); C10(Carbamidomethyl) | NO | YES | NA |
FKDLGEENFK | | YES | YES | 0 | 0
FMETVAEK | | YES | YES | |
FQNALLVR | | YES | YES | 0 | 0
HPDYSVVLLLR | | YES | YES | 0 | 0
HPYFYAPELLFFAk | K14(Methyl) | YES | YES | 0 | 0
IDSLLENDR | | YES | YES | 0 | 0
KQTALVELVK | | YES | YES | 0 | 0
KVPQVSTPTLVEVSR | | YES | YES | 0 | 0
kVPQVSTPTLVEVSR | K1(Formyl) | YES | YES | 0 | 0
LLDNWDSVTSTFSK | | YES | YES | 0 | 0
LSPLGEEMR | | YES | YES | 0 | 0
LSPLGEEmR | M8(Oxidation) | YES | YES | |
LVAASQAALGL | | NO | YES | NA |
LVNEVTEFAk | K10(Methyl) | YES | YES | 0 | 0

TABLE 3. (continued)

Sequence | Modifications | Comp. | Small | Comp. q value | Small q value
LVNEVTEFAK | | YES | YES | 0 | 0
LVRPEVDVmcTAFHDNEETFLK | M9(Oxidation); C10(Carbamidomethyl) | YES | YES | 0 | 0
LVRPEVDVMcTAFHDNEETFLk | C10(Carbamidomethyl); K22(Formyl) | NO | YES | NA | 0
LVRPEVDVmcTAFHDNEETFLkK | M9(Oxidation); C10(Carbamidomethyl); K22(Formyl) | YES | YES | |
LVTDLTK | | YES | YES | |
LVTDLTk | K7(Methyl) | YES | YES | |
MVSGFIPLKPTVK | | YES | YES | 0 | 0
mVSGFIPLKPTVK | M1(Oxidation) | YES | YES | 0 | 0
QTALVELVK | | YES | YES | 0 | 0
RHPYFYAPELLFFAK | | YES | YES | 0 | 0
SHcIAEVENDEmPADLPSLAADFVESK | C3(Carbamidomethyl); M12(Oxidation) | NO | YES | NA |
SLHTLFGDK | | YES | YES | 0 | 0
SLHTLFGDk | K9(Methyl) | YES | YES | |
SSGSLLNNAIK | | YES | YES | 0 | 0
TATSEYQTFFNPR | | YES | YES | 0 | 0
THLAPYSDELR | | YES | YES | 0 | 0
TLLSNLEEAK | | YES | YES | 0 | 0
TYETTLEK | | YES | YES | 0 | 0
TYETTLEk | K8(Methyl) | YES | YES | |
VFDEFKPLVEEPQNLIk | K17(Methyl) | YES | YES | 0 | 0
VFDEFKPLVEEPQNLIK | | YES | YES | 0 | 0
VFDEFkPLVEEPQNLIK | K6(Formyl) | YES | YES | |
VPQVSTPTLVEVSR | | YES | YES | 0 | 0
VQPYLDDFQK | | YES | YES | 0 | 0
VQPYLDDFQk | K10(Methyl) | YES | YES | 0 | 0
VSFLSALEEYTK | | YES | YES | 0 | 0
VSFLSALEEYTk | K12(Methyl) | YES | YES | 0 | 0
VTTVASHTSDSDVPSGVTEVVVK | | YES | YES | 0 | 0
WQEEMELYR | | YES | YES | 0 | 0
WQEEmELYR | M5(Oxidation) | YES | YES | 0 | 0
YGAATFTR | | YES | YES | |
YGFYTHVFR | | YES | YES | |
YLYEIAR | | YES | YES | |

Conclusion

Targeted search for proteins using the standardized small database is faster than the complete database search.
Targeted search for the proteins using the standardized small database identifies proteins with confidence comparable to that of the complete database search.
The number of Percolator-validated, high-confidence peptides in the small database search exceeded the number in the large database search.
Because the standardized database demands less time and computational power, it enables parallel workflows, including exploration of multiple post-translational modifications at the same time.
This standardized database can be appended with any targeted human protein (from different sources such as different tissues, plasma, urine, sub-cellular organelles like mitochondria and nucleolus, as well as cell lines of human origin) for faster and statistically significant protein identification and characterization.

References

1. Käll, L.; Canterbury, J. D.; Weston, J.; Noble, W. S. & MacCoss, M. J. Semisupervised learning for peptide identification from shotgun proteomics datasets. Nat Methods.
2. Elias, J. E. & Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods, 2007, 4.
3. Käll, L.; Storey, J. D.; MacCoss, M. J. & Noble, W. S. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J Proteome Res, 2008, 7.
4. The Global Proteome Machine.

SEQUEST is a registered trademark of the University of Washington. Mascot is a trademark of Matrix Science Ltd. All other trademarks are the property of Thermo Fisher Scientific and its subsidiaries. This information is not intended to encourage use of these products in any manner that might infringe the intellectual property rights of others.
