Supplementary Tables. Note: Open-pFind is embedded as the default open search workflow of the pfind tool. Nature Biotechnology: doi: /nbt.


Supplementary Tables

Supplementary Table 1. Detailed information for the six datasets used in this study

| Dataset | Mass spectrometer | # Raw files | # MS2 scans | Reference |
| Dong-Ecoli-QE | Q Exactive | 5 | 202,452 | / |
| Xu-Yeast-QEHF | Q Exactive HF | | ,301 | / |
| Mann-Human-Velos | LTQ Orbitrap Velos | 3 | 64,112 | [1] |
| Gygi-Human-QE | Q Exactive | 24 | 1,121,149 | [2] |
| Mann-Mouse-QEHF | Q Exactive HF | 4 | 746,116 | [3] |
| Pandey-Human-Elite a | LTQ Orbitrap Elite | | ,913 | [4] |

Note: a Only the 24 raw files whose names begin with Adult_CD8Tcells_Gel_Elite were chosen. In the entrapment analysis shown in Supplementary Fig. 8, one RAW file was used for each of the four published datasets, namely, _Velos2_AnMi_QC_wt_HCD_iso4_swG for Mann-Human-Velos, b1906_293t_proteinid_01a_qe3_ for Gygi-Human-QE, _QEp8_KiSh_SA_Cerebellum_P05_Singleshot1 for Mann-Mouse-QEHF, and Adult_CD8Tcells_Gel_Elite_44_f01 for Pandey-Human-Elite.

Supplementary Table 2. The eight search engines used in this study

| Search engine | Type | Version |
| Open-pFind | Open search | 1.0 |
| PEAKS | Open search | 7.5 |
| MODa | Open search | 1.23 |
| MSFragger | Open search | v |
| pFind | Restricted search | |
| Comet | Restricted search | |
| MS-GF+ | Restricted search | v10072 |
| Byonic | Restricted search | 2.10 |

Note: Open-pFind is embedded as the default open search workflow of the pFind tool.

Supplementary Table 3. Parameters for database searches

| Items | Settings |
| Database | Target + Decoy a |
| Enzyme | Trypsin |
| Digestion | Fully specific for restricted engines and MSFragger; non-specific for Open-pFind, MODa and PEAKS |
| Max. missed cleavage sites | 3 |
| Mass tolerance of precursor ions | ± 20 ppm |
| Mass tolerance of fragment ions | ± 20 ppm (± 0.02 Da if the ppm unit is not supported for the search engines, e.g., PEAKS) |
| Modifications for restricted search engines | Fixed: carbamidomethylation (C). Variable: oxidation (M), Gln->pyro-Glu (N-termini of peptides) and acetylation (N-termini of proteins) |
| Modifications for open search engines | Open-pFind and MODa: no modifications. MSFragger and PEAKS: the same modifications as restricted search engines b |
| Max. modifications per peptide | 4 |

Note: a The human protein database was downloaded from UniProt ( ) for Mann-Human-Velos, Gygi-Human-QE and Pandey-Human-Elite. The mouse protein database was downloaded from UniProt ( ) for Mann-Mouse-QEHF. Both reviewed and unreviewed proteins were used in this study by default. The E. coli protein database for the K-12 substrain MG1655 was downloaded from NCBI on for Dong-Ecoli-QE. The six-frame-translated database was used as the target database for Xu-Yeast-QEHF (the database generation is described in detail in Online Methods). b For PEAKS, the modifications were set as those for restricted search engines in the PEAKS DB step, and then the built-in modification list was used in PEAKS PTM for modification detection.

Supplementary Table 4. The average number of protein-unique peptides per protein in the proteins co-identified by the eight search engines for the Dong-Ecoli-QE dataset

| Search engine | # Protein-unique peptides per protein |
| Open-pFind | 17.3 |
| PEAKS | 15.3 |
| MSFragger | 14.0 |
| MODa | 10.1 |
| Byonic | 9.6 |
| pFind | 9.2 |
| Comet | 9.0 |
| MS-GF+ | 8.9 |

Note: A protein-unique peptide is defined by its amino acid sequence and is mapped to only one protein in the given database.
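The definition of a protein-unique peptide in the note to Supplementary Table 4 can be illustrated with a minimal sketch. It assumes that mapping is done by plain substring matching against each protein sequence and that the average is taken over proteins with at least one unique peptide; the function names and the toy sequences are hypothetical and are not taken from any of the search engines.

```python
from collections import defaultdict


def protein_unique_peptides(peptides, proteins):
    """Group peptides that map to exactly one protein.

    peptides: iterable of peptide sequences (strings).
    proteins: dict mapping a protein ID to its sequence.
    Assumption: a peptide "maps" to a protein if it is a substring of it.
    """
    unique = defaultdict(set)
    for pep in set(peptides):
        hits = [pid for pid, seq in proteins.items() if pep in seq]
        if len(hits) == 1:  # shared peptides (2+ proteins) are skipped
            unique[hits[0]].add(pep)
    return dict(unique)


def mean_unique_per_protein(peptides, proteins):
    """Average number of protein-unique peptides per supported protein."""
    unique = protein_unique_peptides(peptides, proteins)
    if not unique:
        return 0.0
    return sum(len(peps) for peps in unique.values()) / len(unique)


# Toy (hypothetical) database and peptide list.
toy_proteins = {"P1": "MKTAYIAKQR", "P2": "GGTAYIAKLLE"}
toy_peptides = ["TAYIAK", "MKTAYIAK", "AKLLE"]
print(protein_unique_peptides(toy_peptides, toy_proteins))
```

In the toy input, TAYIAK occurs in both proteins and is therefore not counted for either of them.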

Supplementary Table 5. Real search times (in min.) of the eight search engines for the six datasets

Columns: pFind, Byonic, MS-GF+, Comet, MSFragger, PEAKS, MODa, Open-pFind a
Xu-Yeast-QEHF ,269 4, (158)
Dong-Ecoli-QE (32)
Mann-Human-Velos (78)
Mann-Mouse-QEHF ,178 52, (1,210)
Gygi-Human-QE ,013 27, (903)
Pandey-Human-Elite ,880 12, (414)

Note: All MS/MS data were analyzed using a standard desktop computer (8-core 2.90 GHz and 32-GB RAM), in which six threads were specified for Open-pFind, MSFragger, pFind, Comet, MS-GF+ and Byonic (Multicore: Normal). MODa performed single-thread searches because multithreading was not supported in this version. PEAKS used its built-in strategy (about 6-8 threads by observation from the task manager of the operating system). a The single-threaded search time is shown in parentheses.

Supplementary Table 6. The analysis of a single LC-MS/MS run consisting of 41,820 MS/MS spectra in the Gygi-Human-QE dataset

Columns: Fully Specific (Time, # PSM), Semi-Specific (Time, # PSM), Non-Specific (Time, # PSM)
MODa , , ,748
PEAKS , , ,194
MSFragger 16 22, ,239 2,466 18,898
Open-pFind (Default), Open-pFind (Unimod-2), Open-pFind (Blind): 8 36, , , , , , , , ,304

Note: The raw file is named b1906_293t_proteinid_01a_qe3_ raw (PXD in ProteomeXchange). The three workflows, namely Default, Unimod-2, and Blind, are introduced in Online Methods. The running time is measured in minutes.

Supplementary Table 7. The results of three open search engines with the T. tengcongensis dataset

Columns: Fully specific digestion (Time (min.), # PSM), Non-specific digestion (Time (min.), # PSM)
PEAKS , ,521
MODa 33 26, ,941
MSFragger 8 38, ,794
Open-pFind 4 48, ,829

Note: The dataset contains 113,531 tandem mass spectra and was published by Chi et al. in 2015 ( referred to as TTE-65 in this manuscript); ~38.5% of the total peptides are semi- or non-specifically digested. The T. tengcongensis database was downloaded from UniProt ( ), containing both reviewed and unreviewed proteins. The other parameters were the same as those for the other analyses in this study.

Supplementary Table 8. The running time and the number of identified PSMs with different tag lengths for the four published datasets

| Dataset | Tag length | Time a (Relative change b) | Identified PSMs (Relative change) |
| Mann-Human-Velos | 3-tag | 7,758 (647.4%) | 74,772 (-0.6%) |
| Mann-Human-Velos | 4-tag | 2,603 (150.7%) | 75,032 (-0.2%) |
| Mann-Human-Velos | 5-tag | 1,038 (0.0%) | 75,203 (0.0%) |
| Mann-Human-Velos | 6-tag | 602 (-42.0%) | 74,516 (-0.9%) |
| Gygi-Human-QE | 3-tag | 117,444 (848.9%) | 985,916 (1.1%) |
| Gygi-Human-QE | 4-tag | 34,228 (176.5%) | 990,940 (1.6%) |
| Gygi-Human-QE | 5-tag | 12,377 (0.0%) | 975,629 (0.0%) |
| Gygi-Human-QE | 6-tag | 7,008 (-43.4%) | 939,966 (-3.7%) |
| Mann-Mouse-QEHF | 3-tag | 152,945 (911.7%) | 683,530 (-0.2%) |
| Mann-Mouse-QEHF | 4-tag | 46,262 (206.0%) | 687,070 (0.3%) |
| Mann-Mouse-QEHF | 5-tag | 15,117 (0.0%) | 684,977 (0.0%) |
| Mann-Mouse-QEHF | 6-tag | 8,992 (-40.5%) | 679,067 (-0.9%) |
| Pandey-Human-Elite | 3-tag | 53,676 (931.4%) | 388,482 (0.6%) |
| Pandey-Human-Elite | 4-tag | 15,170 (191.5%) | 388,934 (0.7%) |
| Pandey-Human-Elite | 5-tag | 5,204 (0.0%) | 386,280 (0.0%) |
| Pandey-Human-Elite | 6-tag | 3,411 (-34.5%) | 380,884 (-1.4%) |

Note: a The running time is measured in seconds. b The relative changes are calculated with respect to the 5-tag results, which is the default setting in the Open-pFind workflow. For example, for the Mann-Human-Velos dataset, if 4-tag is used in the open search step, the running time is 2,603 seconds, which is 150.7% more than that of the 5-tag database search.

Supplementary Table 9. The tag frequency and tag-index storage space with different tag lengths

Columns: Tag length, Average frequency, Storage space (MB)

Note: The frequency of a tag denotes the number of positions in the protein database that are exactly matched by this tag. For example, 6-length tags appeared 5.2 times in the database on average. Reviewed and unreviewed human proteins (152,493 in total) were downloaded from UniProt and used in this study.
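The relative changes in Supplementary Table 8 follow from a simple ratio against the 5-tag baseline. The sketch below recomputes them for the Mann-Human-Velos running times; the function name is hypothetical, and because the inputs are the rounded seconds printed in the table, a recomputed percentage may differ from the published one in the last digit.

```python
def relative_change(value, baseline):
    """Relative change of `value` with respect to `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Running times (seconds) for Mann-Human-Velos, copied from Supplementary Table 8.
times = {"3-tag": 7758, "4-tag": 2603, "5-tag": 1038, "6-tag": 602}
baseline = times["5-tag"]  # 5-tag is the default setting

for tag, seconds in times.items():
    print(f"{tag}: {relative_change(seconds, baseline):+.1f}%")
```

A negative value means the setting is faster (or identifies fewer PSMs) than the 5-tag default.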

Supplementary Table 10. The number of identified proteins and genes in Kim data

Columns: Min. pep.; FDR (%); Proteins; Genes; Olfactory receptor; Average coverage (%) (All pep., Unique pep.); Low coverage (< 10%) proteins (All pep., Unique pep.)
1 19, , ,282 8, , , ,564 3, , , ,231 1, , , , , , , , , , , , , , ,

Note: Min. pep. denotes the minimum number of protein-unique peptides required to support the identification of one protein (2 by default in the main text). The coverage of one protein is defined as the fraction of amino acids supported by at least one peptide among all amino acids in this protein sequence. For the coverage calculation, All pep. means that all peptides were used, and Unique pep. means that only the protein-unique peptides were used. Only peptides with lengths equal to or greater than 9 are considered in this analysis.
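The coverage definition in the note to Supplementary Table 10 can be sketched as follows. This is an illustration only: it assumes peptides are located by exact substring matching, the function name and the toy sequence are hypothetical, and the length filter of 9 is the one stated in the note.

```python
def protein_coverage(protein_seq, peptides, min_len=9):
    """Fraction of residues in `protein_seq` covered by at least one peptide.

    Peptides shorter than `min_len` are ignored. Every occurrence of a
    peptide in the protein counts toward coverage.
    """
    if not protein_seq:
        return 0.0
    covered = [False] * len(protein_seq)
    for pep in peptides:
        if len(pep) < min_len:
            continue
        start = protein_seq.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein_seq.find(pep, start + 1)
    return sum(covered) / len(protein_seq)


# Toy (hypothetical) protein of 20 residues and three peptides.
toy_protein = "ACDEFGHIKLMNPQRSTVWY"
toy_peps = ["ACDEFGHIK", "GHIKLMNPQR", "VWY"]  # the last one is too short
print(protein_coverage(toy_protein, toy_peps))
```

Passing only protein-unique peptides to the same function gives the "Unique pep." variant of the coverage.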

Supplementary Notes

Supplementary Note 1

Using the metabolic labeling technique to estimate the error rates of search engines.

NaN ratios can be used to estimate the error rates of different engines independently of the target-decoy strategy. The error rate of one search engine is defined as the fraction of incorrect PSMs among all PSMs reported by this engine. First, we investigated the relationship between decoy PSMs and NaN-ratio PSMs based on the Open-pFind results obtained from the Dong-Ecoli-QE dataset. Fig. S1 shows the increase in the number of decoy and NaN-ratio PSMs along with the number of target PSMs (all PSMs were sorted in ascending order of their scores). The trends of the three curves were quite consistent, and the tails (where nearly all PSMs were incorrect) showed that the proportions of both decoy and NaN-ratio PSMs were stable.

Fig. S1. The relationship between the number of target PSMs and the number of PSMs from the decoy database (green) or with NaN ratios of 15N/14N (red) or 13C/12C (blue) at each score threshold in the Dong-Ecoli-QE dataset. Initially, all PSMs are sorted in ascending order by their scores (e.g., the best PSM ranked at the first place). The subplot shows the linear property of the tails of the three curves.

The number of data points (N) used for determining the R^2 values is 53,225 (located at the tail of the curves after 180,000).

Therefore, the percentage of NaN-ratio PSMs is useful for estimating the error rates of the results of metabolically labeled datasets, which is similar to but independent of the traditional target-decoy strategy. Given M as the number of total PSMs and N as the number of NaN-ratio PSMs, we get the equation

$$M \cdot e \cdot r_1 + M \cdot (1 - e) \cdot r_2 = N, \qquad (1)$$

where e denotes the error rate to be estimated, r_1 denotes the percentage of NaN-ratio PSMs in incorrect matches (e.g., target PSMs distributed at the tail of the curves in Fig. S1) and r_2 denotes the percentage of NaN-ratio PSMs in correct matches. r_1 is simply calculated using the linear least-squares method, and r_2 is estimated based on the intersection of the results of different engines because a PSM is more likely to be correct if it is consistently reported by multiple search engines, resulting in a lower probability of being a NaN-ratio PSM (Fig. S2). In this study, the intersecting results of all eight search engines were used to estimate the value of r_2. Finally, the error rate e is estimated using the following formula:

$$e = \frac{N - M \cdot r_2}{M \cdot (r_1 - r_2)}, \qquad (2)$$

and the precision of the given result set is equal to 1 - e. This formula also shows that if r_1 and r_2 are correctly estimated based on the same dataset, then a smaller percentage of NaN-ratio results indicates a lower error rate, i.e., a higher precision.
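As a minimal sketch of formula (2), the function below solves equation (1) for the error rate e. The function name is hypothetical, r_1 and r_2 are taken as already-estimated fractions (not percentages), and the numbers in the example are invented to check that the round trip through equation (1) recovers e; they are not results from the study.

```python
def estimate_error_rate(total_psms, nan_psms, r1, r2):
    """Estimate the error rate e from formula (2).

    total_psms: M, the number of reported PSMs.
    nan_psms:   N, the number of PSMs with a NaN ratio.
    r1: fraction of NaN-ratio PSMs among incorrect matches.
    r2: fraction of NaN-ratio PSMs among correct matches.
    """
    if total_psms <= 0:
        raise ValueError("total_psms must be positive")
    if r1 == r2:
        raise ValueError("r1 and r2 must differ, otherwise e is undetermined")
    return (nan_psms - total_psms * r2) / (total_psms * (r1 - r2))


# Invented example: build N from equation (1) with e = 0.02, then recover e.
M, r1, r2, true_e = 10_000, 0.5, 0.01, 0.02
N = M * true_e * r1 + M * (1 - true_e) * r2
e = estimate_error_rate(M, N, r1, r2)
print(f"error rate = {e:.4f}, precision = {1 - e:.4f}")
```

For fixed r_1 > r_2, the estimate grows linearly with N, which is the monotonic relationship stated above: fewer NaN-ratio PSMs imply a lower error rate.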

Fig. S2. The proportions of NaN-ratio PSMs distributed in all of the possible intersections of the eight result sets from Open-pFind, PEAKS, MODa, MSFragger, MS-GF+, Byonic, Comet and pFind. The number of intersections (N) for each boxplot is 8, 28, 56, 70, 56, 28, 8, 1. For example, the number of intersections from any three result sets is $\binom{8}{3} = 56$. Box-plot elements: center line, median; box limits, first and third quartile (Q1 and Q3); whiskers, from Q1 - 1.5 IQR to Q3 + 1.5 IQR; dots, outlier data points.
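The intersection counts quoted in the caption of Fig. S2 are binomial coefficients, which can be checked with a short sketch. The engine names are taken from the caption; enumerating the subsets with `itertools.combinations` is only an illustration of how the intersections could be listed, not a description of the authors' procedure.

```python
from itertools import combinations
from math import comb

engines = ["Open-pFind", "PEAKS", "MODa", "MSFragger",
           "MS-GF+", "Byonic", "Comet", "pFind"]

# Number of ways to choose k of the 8 result sets, for k = 1..8.
counts = [comb(len(engines), k) for k in range(1, len(engines) + 1)]
print(counts)

# The same counts obtained by explicit enumeration of the subsets.
enumerated = [sum(1 for _ in combinations(engines, k))
              for k in range(1, len(engines) + 1)]
```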

Fig. S3. Comparison of estimated precision of consistently and separately identified PSMs between every two search engines using the Dong-Ecoli-QE dataset. 15N- and 13C-labeled peptides are used for estimation, and the final precision is calculated from the average of the two estimates for the same resulting PSMs. Each decimal denotes the estimated precision of the consistently or separately identified PSMs. a) Only the PSMs with common modification types (the four that are specified in the restricted search engines) are considered. b) All PSMs are considered.

In the Dong-Ecoli-QE dataset, the newly estimated precision of the identified PSMs varied within % for different engines when considering only the peptides in the restricted search space (Fig. S3). For the separately identified results, the estimated precision of Open-pFind remained close to 99%, which was significantly higher than that of the other search engines. Generally, when only peptides with no or only common modifications were considered, all open search engines reported more accurate results than the restricted engines, because the peptides from the restricted search space had survived in a significantly larger space containing a huge number of competing peptide candidates. However, when all identified peptides were considered, the precision of the open search engines decreased to varying degrees. Open-pFind remained at a high global precision of 98.9%, while the precision of the other three open search engines dropped to 93.5% at best and 86.6% at worst. The potential of the metabolic labeling approach merits further exploration.


Supplementary Note 2

Using the metabolic labeling technique to examine the search engine results.

The metabolic labeling technique is helpful in revealing why spectra are misidentified by different search engines and in improving search engine precision. Generally, a spectrum with a NaN-ratio peptide reported by one search engine may be identified as a different normal-ratio peptide by another search engine. As described above, the normal-ratio peptide is more likely to be a correct identification. Thus, for the former search engine, such cases could be used to optimize the scoring function. Of all NaN-ratio PSMs from Open-pFind, fewer than 10% were revived by other engines, i.e., identified as normal-ratio peptides (Fig. S4). In contrast, Open-pFind revived ~40% of NaN-ratio PSMs reported by other search engines.

Fig. S4. The proportions of NaN-ratio PSMs obtained from one engine but revived by others in the Dong-Ecoli-QE dataset. a) Comparison between every two search engines. Each decimal denotes the percentage of PSMs revived by the search engine in the row (leftmost) among the total NaN-ratio PSMs from the search engine in the column (topmost). Only peptides with common modifications are considered. b) Similar to a), but all PSMs including all types of modifications are considered. The 15N-labeled peptides and the unlabeled (common) peptides are used to calculate the quantitative values.

Table S1. The fraction of spectra assigned with overlapping peptides among the revived spectra from different engines in the Dong-Ecoli-QE dataset

Columns: Search engine; # Total peptides (from revived spectra); # Overlapping peptides a; # Overlapping peptides / # Total peptides (%) b
MSFragger 4,161 3,
PEAKS 1,221 1,
MODa 3,277 2,
pFind
MS-GF+
Comet
Byonic

Note: a Two peptides are called overlapping peptides if one peptide sequence is a substring of the other. For example, GCEHVAK and C(+carbamidomethyl)EHVAK are overlapping peptides. b The fraction of overlapping peptides among all peptides reported by each search engine. For example, a total of 3,669 spectra identified by MSFragger were assigned with overlapping peptides of those reported by Open-pFind, which accounted for 88.2% of the total spectra identified by MSFragger.

For the open search engines, Open-pFind reported an overlapping peptide to the one reported by the other engine for ~90% of the revived spectra (Table S1); that is, for two peptide sequences identified by Open-pFind and the other engine, one sequence is a substring of the other (e.g., GCEHVAK/C(+carbamidomethyl)EHVAK is a pair of overlapping peptides, or we can say that each one is an overlapping peptide to the other). In other words, these peptide sequences reported by the other open search engines were partially correct, while Open-pFind confirmed the exact termini of the peptides and modification types, as well as the precise precursor information. For example, Open-pFind reported a C-terminal-specific peptide carbamyl-GAAGGIGQALALLLK with an N-terminal carbamylation (P1) for one spectrum (Fig. S5a), while MSFragger reported an overlapping tryptic peptide VAVLGAAGGLGQALALLLK with a mass shift of Da (P2). However, the actual mass difference of these two peptides (P2 - P1) was Da.
This result implied that the mass shift of Da reported by MSFragger did not represent a real modification because a ~2 Da mass difference existed between the initially exported precursor ion and the actual one confirmed by Open-pFind (Fig. S5b). This finding also demonstrated that exact precursor ions were very important for the confirmation of modification types.
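The overlapping-peptide criterion from Table S1 (one sequence is a substring of the other) can be sketched in a few lines. Two details are assumptions made for this illustration and are not stated in the source: modifications are written in parentheses after a residue, as in C(+carbamidomethyl), and the isobaric residues I and L are optionally treated as identical. The function names are hypothetical.

```python
import re


def strip_modifications(peptide):
    """Remove parenthesized modification annotations, e.g. 'C(+carbamidomethyl)'."""
    return re.sub(r"\([^)]*\)", "", peptide)


def are_overlapping(pep_a, pep_b, equate_il=True):
    """Return True if one bare sequence is a substring of the other.

    equate_il: if True, isoleucine (I) and leucine (L) are treated as the
    same residue because they have identical masses (an assumption here).
    """
    a = strip_modifications(pep_a)
    b = strip_modifications(pep_b)
    if equate_il:
        a, b = a.replace("I", "L"), b.replace("I", "L")
    return a in b or b in a


print(are_overlapping("GCEHVAK", "C(+carbamidomethyl)EHVAK"))
print(are_overlapping("GAAGGIGQALALLLK", "VAVLGAAGGLGQALALLLK"))
```

With `equate_il=False`, the second pair is not reported as overlapping because the two sequences differ at one I/L position.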

Fig. S5. Two example spectra showing how the metabolic labeling technique distinguishes the correct PSMs. +, o and x denote the monoisotopic m/z values of the unlabeled, 15N- and 13C-labeled precursor ions, respectively. The first example is from 3,669 similar results in the result comparison between Open-pFind and MSFragger, and the second example is from 811 similar results in the result comparison between Open-pFind and MS-GF+.
a) Ecoli-1to1to1-un-C13-N15-60mM dta, which is identified by Open-pFind as a semi-tryptic peptide, GAAGGIGQALALLLK, with a carbamylation at the N-terminus (m/z = ). MSFragger reported another peptide, VAVLGAAGGLGQALALLLK (m/z = , Hyperscore= ), with few b-ions matched. If the precursor ion m/z was changed to for MSFragger (the same as that used in Open-pFind) and semi-tryptic peptides were allowed in the search, a new peptide GAAGGLGQALALLLK was reported with a mass shift of Da (the monoisotopic mass of carbamylation), whose Hyperscore was
b) The MS1 information corresponding to the PSM shown in a).
c) Ecoli-1to1to1-un-C13-N15-30mM dta, which is identified by Open-pFind as a peptide, ALTEANGDIELAIENMR, with a deamidation of N at the 6th position.
d) The same spectrum as c), which is identified by Comet and MS-GF+ as a peptide, ALTEANGDIELAIENMR, without any modifications.

e) The same spectrum as c), which is identified by Byonic as a peptide, ELGDADHGLNMNRGFSK, without any modifications.
f) The MS1 information corresponding to the PSMs shown in c)-e).

In terms of the restricted search engines, over 90% of revived peptides reported by MS-GF+ and Comet were partially correct, which was similar to the behavior of the open search engines (Table S1). However, this number was lower for Byonic and pFind. Byonic adopted a different protein FDR control strategy, under which a few low-quality PSMs from reliable proteins might be reported (Online Methods). Another example shows the differences between Open-pFind and the restricted search engines (Fig. S5c-e). For the same spectrum, Open-pFind reported a tryptic peptide with a deamidation, while MS-GF+ and Comet reported the unmodified form of this peptide, which matched visibly fewer fragment ions. Byonic reported a completely different peptide, which matched few peaks in the spectrum. The isotopic envelopes of the unlabeled peptide reported by Open-pFind, as well as the corresponding 15N- and 13C-labeled forms shown in MS1, matched the theoretical values precisely. In contrast, the monoisotopic precursor ions of the other two identifications had larger mass deviations, which resulted in invalid quantitation values (Fig. S5f). This example indicated again that peptides reported by Open-pFind were more accurate and, more importantly, that the metabolic labeling technique is extremely helpful for distinguishing correct individual PSMs, which will facilitate the improved design of search engines.

Supplementary Note 3

Analysis based on the entrapment strategy showed the robustness of the design of Open-pFind.

To analyze four published datasets, two types of entrapment databases were downloaded from the UniProt database and then used in this study: a) a small database of the reviewed proteins of Arabidopsis thaliana (8.7 MB, 15,423 protein sequences) and b) a large database of the reviewed proteins of all organisms (261.8 MB, 555,100 protein sequences). Each entrapment database was appended to the original database files. The other database search parameters were the same as those shown in Supplementary Table 3. Intuitively, when the entrapment database is considered in the database search, the identification rate should decrease because more random peptide candidates are involved in the search space, but few of them are the answers to any spectra. Generally, the decrease was more remarkable when a larger entrapment database was considered (Fig. S6). The Open-pFind identification rate was more stable in both situations than that of pFind. For example, the average decrease in the identification rates of Open-pFind and pFind was 1.6 and 4.2, respectively (Fig. S6b). The reason was that Open-pFind adopted a two-step workflow and the proteins to be retrieved in the restricted search were automatically learned in the previous open search step, so that most random peptide candidates that potentially interfere with the correct candidates were eliminated at this time. Furthermore, of all PSMs reported by Open-pFind that matched the entrapment sequences, fewer than 5% were revived by pFind, i.e., pFind identified sequences in the original database for those spectra; in contrast, the corresponding percentages for entrapment PSMs reported by pFind varied from 20% to 60% (Fig. S7).
This result again indicated that Open-pFind reported more accurate peptides, which matched the authentic protein sequences rather than the entrapment sequences, although the same FDR threshold was applied.

Fig. S6. Decreased identification rates caused by the entrapment strategy for the four datasets. a) Proteins from Arabidopsis thaliana were used as the entrapment database. b) Proteins from all organisms recorded in UniProt were used as the entrapment database.

Fig. S7. Open-pFind revived more spectra than pFind. The orange curves denote the proportion of PSMs from the entrapment database. a) Proteins from Arabidopsis thaliana were used as the entrapment database. b) Proteins from all organisms recorded in UniProt were used as the entrapment database.


We also used the entrapment strategy to evaluate the precision of search engines with the Dong-Ecoli-QE dataset (the reviewed human database downloaded from UniProt was used as the entrapment database), and the performance of Open-pFind was similar to that on the four large-scale datasets. When searching against the target and entrapment databases, Open-pFind reported the highest number of PSMs with the smallest proportion of those matched with the entrapment proteins (Fig. S8a). Similar to the analysis shown above, less than 10% of entrapment PSMs from Open-pFind were revived by the other engines, while 22-56% of entrapment PSMs from other engines were revived by Open-pFind (Fig. S8b).

Fig. S8. Entrapment analysis of the Dong-Ecoli-QE dataset. a) The number of identified PSMs (the blue bars, including PSMs from both original and entrapment protein databases) and the percentage of PSMs from the entrapment database (the orange curve). b) The number and proportion of PSMs identified with entrapment peptides from one engine and revived by the other engine. For example, 359 entrapment PSMs were identified by PEAKS and revived by Open-pFind, which accounted for 49.8% of the total entrapment PSMs identified by PEAKS.

Supplementary Note 4

Nearly 100% of high-quality spectra in the four published datasets are identified within a comprehensive search space.

We also investigated why a few spectra remained uninterpretable for Open-pFind. First, spectra are classified according to the lengths of their longest tags, which are treated as a feature related to spectral quality. For example, a 0-length tag indicates that no mass difference between any two peaks is equal to the mass of any amino acid residue within a given fragment ion tolerance. A spectrum with a longer tag is more likely to have been formed by a real peptide because more fragmentation information is provided. Generally, the identification rates of spectra with longer tags were higher for all engines (Fig. S9). For all four datasets, the identification rate of Open-pFind was always greater than 90% and even close to 100% for spectra with tags longer than ten, suggesting that the search space of Open-pFind is close to complete for routine MS/MS data analysis. Additionally, the scoring scheme of Open-pFind effectively distinguishes correct peptides from random peptides, even in such an ultra-large search space. The identification rates of Byonic sharply decreased when spectra with longer tags were considered in the Mann-Mouse-QEHF dataset (Fig. S9c), likely because more large-mass peptides were present in this dataset, and their precursor ions were not accurately exported. Among all PSMs identified via Open-pFind in this dataset, 55.0% of their precursor ions were larger than 1,500 Da, of which only 50.1% were initially exported by the vendor's software. However, in the other datasets, the proportion of precursor ions larger than 1,500 Da was markedly smaller, for example, only 38.8% for the Pandey-Human-Elite dataset, of which 82.1% were extracted initially by the vendor's software.
We also tested pFind using the precursor ions extracted by the vendor software rather than pParse, and the distribution of identification rates was similar to that of Byonic (Fig. S10), which again indicated that extracting accurate precursor ions is very important for search engine design.
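The notion of a spectrum's longest tag used in this note can be illustrated with a small dynamic-programming sketch. It is a simplification under stated assumptions and not the tag extraction used by Open-pFind: all peaks are treated as singly charged fragments of one ion series, only unmodified residues are considered (I and L share one mass), and an absolute tolerance in Da is used. The function name and the toy peak list are hypothetical; the residue masses are standard monoisotopic values.

```python
# Standard monoisotopic residue masses (Da); L also stands for I.
RESIDUE_MASSES = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def longest_tag_length(peaks, tol=0.02):
    """Length of the longest chain of peaks linked by residue-mass gaps.

    peaks: iterable of fragment m/z values (assumed singly charged).
    tol:   absolute tolerance in Da for matching a gap to a residue mass.
    Returns 0 if no pair of peaks differs by a residue mass.
    """
    mz = sorted(peaks)
    masses = list(RESIDUE_MASSES.values())
    best = [0] * len(mz)  # best[i]: longest tag ending at peak i
    for i in range(len(mz)):
        for j in range(i):
            gap = mz[i] - mz[j]
            if any(abs(gap - m) <= tol for m in masses):
                best[i] = max(best[i], best[j] + 1)
    return max(best, default=0)


# Toy peak list: a ladder spelling G, A, S plus one unrelated peak.
toy_peaks = [100.0, 157.02146, 228.05857, 315.0906, 500.0]
print(longest_tag_length(toy_peaks))
```

The quadratic loop is adequate for a sketch; a production implementation would restrict the inner loop to peaks within the largest residue mass of the current peak.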


Fig. S9. Analyses of the unidentified spectra with different maximum tag lengths in the four datasets. The curves denote the identification rates of the spectra with different maximum tag lengths, and the histograms denote the distribution of the number of the total spectra at each tag length.

Fig. S10. The distribution of the identification rates of Byonic and pFind at different maximum tag lengths extracted from the spectra. Two modes are adopted for pFind, and the only difference is whether pParse is used to calibrate the precursor ions.

Supplementary Note 5

Comprehensive analysis of the Kim data.

The average identification rate was 62.5% for all 85 samples, and over 70% of spectra were identified for the in-gel digested samples analyzed on an LTQ Orbitrap Velos (Fig. S11a). The results obtained with Open-pFind demonstrated that the characteristics of MS/MS data vary according to different methods for sample preparation and LC-MS/MS. In terms of modifications, although several common modifications, e.g., carbamidomethylation, oxidation and Gln->pyro-Glu, were always abundant in all datasets, many unexpected modifications appeared in only one or two types of datasets (Fig. S11b). For example, propionamides of cysteines were hardly detected in the bRPLC fractionation samples but appeared as one of the most abundant modifications for cysteines in all peptides from in-gel digested samples (Supplementary Data 3), which was consistent with a previous study by Sechi et al. [5]. On the other hand, the percentages of fully tryptic peptides were stable among the four types of datasets with different experimental conditions (97-99%, concluded from Fig. S11c). In terms of co-eluting peptide identification, the LTQ Orbitrap Elite tended to produce more mixed spectra than the LTQ Orbitrap Velos, likely due to its higher sensitivity, allowing less-abundant peptides to be detected and identified via Open-pFind (Fig. S11d). The different characteristics of these datasets again showed that specifying an appropriate search space for each individual dataset based on expert experience is always difficult, and uniformly considering a comprehensive search space for different experimental conditions is essential for today's search engines. In addition, biological modifications and mutations were effectively discovered by Open-pFind. For example, Laminin subunit gamma-1 was identified by different types of peptides, all of which were supported by over ten PSMs (Fig. S11e).
The N-terminal cleavage site of QAAMDECTDEGGRPQR was confirmed by the signal peptide recorded in UniProt. In addition, two amino acid mutations were discovered by Open-pFind, and one of them, R1121Q, was verified previously (rs20559 in dbSNP [6]). Identification results from the extended search space were also valuable for other biological discoveries. For example, a total of 9,559 semi-tryptic peptides were identified as being located in the


N-terminal regions of proteins (the C-terminal amino acid of each peptide located before the 60th amino acid of the corresponding protein), of which 34.1% had complete ion series (at least one b or y ion was detected at each peptide bond), and 66.4% had at most two peptide linkages in which both the b and y ions were missing. These semi-tryptic peptides provide valuable clues for identifying signal peptides, and 694 of them were already verified in UniProt (Supplementary Data 4). The score distributions of these 9,559 peptides and the total 548,371 peptides (Fig. S12) indicated that although these semi-tryptic peptides were from a much larger search space (Supplementary Fig. 10), their confidence was still comparable to that of the total results.
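The two ion-series criteria mentioned above (a complete ion series, and at most two peptide linkages lacking both the b and y ion) can be expressed as a short sketch. The representation is an assumption made for illustration: matched fragments are given as sets of ion indices, and for a peptide of n residues the bond after residue k is taken to be supported by b_k or by y_(n-k). The function names are hypothetical.

```python
def unsupported_bonds(peptide_length, b_ions, y_ions):
    """Return the peptide bonds with neither a b ion nor a y ion matched.

    peptide_length: number of residues n (bonds are numbered 1..n-1).
    b_ions: set of k for which b_k was matched.
    y_ions: set of k for which y_k was matched.
    Bond k is cleaved to give b_k and y_(n-k).
    """
    n = peptide_length
    return [k for k in range(1, n)
            if k not in b_ions and (n - k) not in y_ions]


def has_complete_ion_series(peptide_length, b_ions, y_ions):
    """True if every peptide bond is supported by at least one b or y ion."""
    return not unsupported_bonds(peptide_length, b_ions, y_ions)


# Invented example for a 5-residue peptide.
print(unsupported_bonds(5, b_ions={1, 2}, y_ions={1, 2}))  # every bond supported
print(unsupported_bonds(5, b_ions={1}, y_ions={1}))        # bonds 2 and 3 lack ions
```

The second criterion in the text corresponds to `len(unsupported_bonds(...)) <= 2`.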

Fig. S11. Profiling the Kim data using Open-pFind.
a) The distribution of identification rates of each RAW file. Each boxplot denotes the distribution for each type of the experimental settings (brp_velos, brp_elite, Gel_Velos and Gel_Elite; N = 338, 775, 585, 514 for the number of raw files in the four boxplots shown from left to right, respectively). Box-plot elements: center line, median; box limits, first and third quartile (Q1 and Q3); whiskers, from Q1 - 1.5 IQR to Q3 + 1.5 IQR; grey dots, outlier points.
b) The distribution of highly abundant modifications. Each number in one cell denotes the percentage of modified amino acids among all amino acids that appeared among the identified peptides. For example, 79.7% of cysteines were modified by carbamidomethylation in the identified peptides from an LTQ Orbitrap Velos MS fractionated by bRPLC.
c) The distribution of the fraction of semi- and non-specific peptides under different experimental conditions. Each boxplot denotes the distribution for each type of the experimental settings (brp_velos, brp_elite, Gel_Velos and Gel_Elite; N = 338, 775, 585, 514 for the number of raw files in the four boxplots shown from left to right, respectively). Box-plot elements: center line, median; box limits, first and third quartile (Q1 and Q3); whiskers, from Q1 - 1.5 IQR to Q3 + 1.5 IQR.
d) The distribution of peptide numbers identified from one spectrum. For example, 7.5% of the identified spectra from an LTQ Orbitrap Velos MS fractionated by bRPLC each contribute two peptides.
e) The identified peptides in Laminin subunit gamma-1. Red numbers in the brackets denote how many PSMs correspond to each peptide.

Fig. S12. The score distributions from the 9,559 semi-tryptic peptides and the scores of all 548,371 peptides identified in Kim data.

Supplementary References

1. Michalski, A. et al. Mass spectrometry-based proteomics using Q Exactive, a high-performance benchtop quadrupole Orbitrap mass spectrometer. Mol Cell Proteomics 10, M (2011).
2. Chick, J.M. et al. A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides. Nat Biotechnol 33, (2015).
3. Sharma, K. et al. Cell type- and brain region-resolved mouse brain proteome. Nat Neurosci 18, (2015).
4. Kim, M.S. et al. A draft map of the human proteome. Nature 509, (2014).
5. Sechi, S. & Chait, B.T. Modification of cysteine residues by alkylation. A tool in peptide mapping and protein identification. Anal Chem 70, (1998).
6. Sherry, S.T., Ward, M. & Sirotkin, K. dbSNP-database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome Res 9, (1999).


Supplementary Figure 1. The workflow of Open-pFind. The MS data are first preprocessed by pParse, and then the MS/MS data are searched by the open search module. Next, the MS/MS data are re-searched by