Supplementary Tables. Note: Open-pFind is embedded as the default open search workflow of the pFind tool. Nature Biotechnology: doi: /nbt.


Supplementary Tables

Supplementary Table 1. Detailed information for the six datasets used in this study

Dataset                 Mass spectrometer    # Raw files  # MS2 scans  Reference
Dong-Ecoli-QE           Q Exactive           5            202,452      /
Xu-Yeast-QEHF           Q Exactive HF        22           526,301      /
Mann-Human-Velos        LTQ Orbitrap Velos   3            64,112       [1]
Gygi-Human-QE           Q Exactive           24           1,121,149    [2]
Mann-Mouse-QEHF         Q Exactive HF        4            746,116      [3]
Pandey-Human-Elite (a)  LTQ Orbitrap Elite   24           406,913      [4]

Note: (a) Only the 24 raw files whose names begin with Adult_CD8Tcells_Gel_Elite were chosen. In the entrapment analysis shown in Supplementary Fig. 8, one RAW file was used for each of the four published datasets, namely, 20100825_Velos2_AnMi_QC_wt_HCD_iso4_swG for Mann-Human-Velos, b1906_293t_proteinid_01a_qe3_122212 for Gygi-Human-QE, 20141202_QEp8_KiSh_SA_Cerebellum_P05_Singleshot1 for Mann-Mouse-QEHF, and Adult_CD8Tcells_Gel_Elite_44_f01 for Pandey-Human-Elite.

Supplementary Table 2. The eight search engines used in this study

Type               Search engine  Version
Open search        Open-pFind     1.0
Open search        PEAKS          7.5
Open search        MODa           1.23
Open search        MSFragger      v20170103
Restricted search  pFind          3.1.0
Restricted search  Comet          2016012
Restricted search  MS-GF+         v10072
Restricted search  Byonic         2.10

Note: Open-pFind is embedded as the default open search workflow of the pFind tool.

Supplementary Table 3. Parameters for database searches

Items                                        Settings
Database                                     Target + Decoy (a)
Enzyme                                       Trypsin
Digestion                                    Fully specific for restricted engines and MSFragger;
                                             non-specific for Open-pFind, MODa and PEAKS
Max. missed cleavage sites                   3
Mass tolerance of precursor ions             ± 20 ppm
Mass tolerance of fragment ions              ± 20 ppm (± 0.02 Da if the ppm unit is not supported
                                             by the search engine, e.g., PEAKS)
Modifications for restricted search engines  Fixed: carbamidomethylation (C)
                                             Variable: oxidation (M), Gln → pyro-Glu (N-termini of
                                             peptides) and acetylation (N-termini of proteins)
Modifications for open search engines        Open-pFind and MODa: no modifications
                                             MSFragger and PEAKS: the same modifications as
                                             restricted search engines (b)
Max. modifications per peptide               4

Note: (a) The human protein database was downloaded from UniProt (2016-4-20) for Mann-Human-Velos, Gygi-Human-QE and Pandey-Human-Elite. The mouse protein database was downloaded from UniProt (2016-11-29) for Mann-Mouse-QEHF. Both reviewed and unreviewed proteins were used in this study by default. The E. coli protein database for the K-12 substrain MG1655 was downloaded from NCBI on 2015-10-14 for Dong-Ecoli-QE. The six-frame-translated database was used as the target database for Xu-Yeast-QEHF (the database generation is described in detail in Online Methods). (b) For PEAKS, the modifications were set as those for restricted search engines in the PEAKS DB step, and then the built-in modification list was used in PEAKS PTM for modification detection.

Supplementary Table 4. The average number of protein-unique peptides per protein in the proteins co-identified by the eight search engines for the Dong-Ecoli-QE dataset

Search engine  # Protein-unique peptides per protein
Open-pFind     17.3
PEAKS          15.3
MSFragger      14.0
MODa           10.1
Byonic         9.6
pFind          9.2
Comet          9.0
MS-GF+         8.9

Note: A protein-unique peptide is defined by its amino acid sequence and is mapped to only one protein in the given database.
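The definition of a protein-unique peptide in the note to Supplementary Table 4 can be illustrated with a minimal sketch. The function names, the toy sequences and the plain substring matching (which ignores digestion rules, modifications and the I/L ambiguity) are assumptions made for illustration only; this is not the procedure used by any of the search engines.

```python
from collections import defaultdict


def protein_unique_peptides(peptides, proteins):
    """Group peptides that occur in exactly one protein of the database.

    peptides: iterable of peptide sequences (modifications ignored).
    proteins: dict mapping a protein ID to its amino acid sequence.
    Returns a dict: protein ID -> set of its protein-unique peptides.
    """
    unique = defaultdict(set)
    for pep in set(peptides):
        # Plain substring search; a real pipeline would use an index.
        hits = [pid for pid, seq in proteins.items() if pep in seq]
        if len(hits) == 1:
            unique[hits[0]].add(pep)
    return dict(unique)


def mean_unique_per_protein(unique_map, protein_ids):
    """Average number of protein-unique peptides over the given proteins."""
    protein_ids = list(protein_ids)
    if not protein_ids:
        return 0.0
    return sum(len(unique_map.get(pid, ())) for pid in protein_ids) / len(protein_ids)


# Hypothetical toy database and peptide list.
toy_db = {"P1": "MKTAYIAKQR", "P2": "MKTAYLLGER"}
toy_peptides = ["TAYIAK", "MK", "LLGER"]
toy_unique = protein_unique_peptides(toy_peptides, toy_db)
```

In this toy case, MK occurs in both proteins and is therefore discarded, while TAYIAK and LLGER each support a single protein.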

Supplementary Table 5. Real search times (in min.) of the eight search engines for the six datasets

Dataset             pFind  Byonic  MS-GF+  Comet  MSFragger  PEAKS   MODa    Open-pFind (a)
Xu-Yeast-QEHF       73     144     85      118    5,071      1,269   4,896   41 (158)
Dong-Ecoli-QE       4      36      41      22     130        448     91      8 (32)
Mann-Human-Velos    9      20      19      26     747        167     1,291   20 (78)
Mann-Mouse-QEHF     101    274     605     623    6,260      12,178  52,228  263 (1,210)
Gygi-Human-QE       94     347     383     664    12,137     17,013  27,469  210 (903)
Pandey-Human-Elite  61     135     186     483    4,368      5,880   12,440  92 (414)

Note: All MS/MS data were analyzed using a standard desktop computer (8-core CPU @ 2.90 GHz and 32-GB RAM), in which six threads were specified for Open-pFind, MSFragger, pFind, Comet, MS-GF+ and Byonic (Multicore: Normal). MODa performed single-thread searches because multithreading was not supported in this version. PEAKS used its built-in strategy (about 6 to 8 threads, as observed in the task manager of the operating system). (a) The single-threaded search time is shown in parentheses.

Supplementary Table 6. The analysis of a single LC-MS/MS run consisting of 41,820 MS/MS spectra in the Gygi-Human-QE dataset

                       Fully specific   Semi-specific    Non-specific
Search engine          Time   # PSM     Time   # PSM     Time   # PSM
MODa                   136    14,370    179    19,593    249    19,748
PEAKS                  123    23,578    164    26,300    324    26,194
MSFragger              16     22,768    453    20,239    2,466  18,898
Open-pFind (Default)   8      36,369    9      37,895    9      37,854
Open-pFind (Unimod-2)  12     35,929    21     37,577    31     37,487
Open-pFind (Blind)     7      36,075    11     36,421    18     36,304

Note: The raw file is named b1906_293t_proteinid_01a_qe3_122212.raw (PXD001468 in ProteomeXchange). The three workflows, namely Default, Unimod-2, and Blind, are introduced in Online Methods. The running time is measured in minutes.

Supplementary Table 7. The results of three open search engines with the T. tengcongensis dataset

               Fully specific digestion  Non-specific digestion
Search engine  Time (min.)  # PSM        Time (min.)  # PSM
PEAKS          205          40,850       268          69,521
MODa           33           26,084       60           35,941
MSFragger      8            38,004       291          44,794
Open-pFind     4            48,564       6            70,829

Note: The dataset contains 113,531 tandem mass spectra and was published by Chi et al. in 2015 (https://doi.org/10.1016/j.jprot.2015.05.009, referred to as TTE-65 in this manuscript); ~38.5% of the total peptides are semi- or non-specifically digested. The T. tengcongensis database was downloaded from UniProt (2017-05-04) and contains both reviewed and unreviewed proteins. The other parameters were the same as those for the other analyses in this study.

Supplementary Table 8. The running time and the number of identified PSMs with different tag lengths for the four published datasets

Dataset             Tag length  Time (a) (relative change (b))  Identified PSMs (relative change)
Mann-Human-Velos    3-tag       7,758 (647.4%)                  74,772 (-0.6%)
                    4-tag       2,603 (150.7%)                  75,032 (-0.2%)
                    5-tag       1,038 (0.0%)                    75,203 (0.0%)
                    6-tag       602 (-42.0%)                    74,516 (-0.9%)
Gygi-Human-QE       3-tag       117,444 (848.9%)                985,916 (1.1%)
                    4-tag       34,228 (176.5%)                 990,940 (1.6%)
                    5-tag       12,377 (0.0%)                   975,629 (0.0%)
                    6-tag       7,008 (-43.4%)                  939,966 (-3.7%)
Mann-Mouse-QEHF     3-tag       152,945 (911.7%)                683,530 (-0.2%)
                    4-tag       46,262 (206.0%)                 687,070 (0.3%)
                    5-tag       15,117 (0.0%)                   684,977 (0.0%)
                    6-tag       8,992 (-40.5%)                  679,067 (-0.9%)
Pandey-Human-Elite  3-tag       53,676 (931.4%)                 388,482 (0.6%)
                    4-tag       15,170 (191.5%)                 388,934 (0.7%)
                    5-tag       5,204 (0.0%)                    386,280 (0.0%)
                    6-tag       3,411 (-34.5%)                  380,884 (-1.4%)

Note: (a) The running time is measured in seconds. (b) The relative changes are calculated with respect to the 5-tag results (the rows with 0.0%), which is the default setting in the Open-pFind workflow. For example, for the Mann-Human-Velos dataset, if 4-tag is used in the open search step, the running time is 2,603 seconds, which is 150.7% more than that of the 5-tag database search.

Supplementary Table 9. The tag frequency and tag-index storage space with different tag lengths

Tag length  Average frequency  Storage space (MB)
2           88,621.2           0.003
3           4,954.3            0.06
4           273.6              1.2
5           19.0               24.4
6           5.2                488.3

Note: The frequency of a tag denotes the number of positions in the protein database that are exactly matched by this tag. For example, a tag of length 6 appears 5.2 times in the database on average. Reviewed and unreviewed human proteins (152,493 in total) were downloaded from UniProt and used in this study.
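The arithmetic behind Supplementary Tables 8 and 9 can be sketched as follows. The relative change follows the definition in note (b). The storage estimate is only an observation: assuming a 20-letter amino acid alphabet and 8 bytes per possible tag (both assumptions, not stated in the source) happens to reproduce the storage column, which may or may not reflect how the index is actually laid out. Function names are hypothetical.

```python
def relative_change(value, baseline):
    """Relative change in percent of `value` with respect to `baseline`."""
    return (value - baseline) / baseline * 100.0


def tag_index_size_mb(tag_length, alphabet_size=20, bytes_per_entry=8):
    """Size in MB (1 MB = 2**20 bytes) of a table with one fixed-size entry
    per possible tag. alphabet_size and bytes_per_entry are assumptions."""
    return alphabet_size ** tag_length * bytes_per_entry / 2 ** 20


# Mann-Human-Velos running times from Supplementary Table 8 (seconds).
change_3tag = relative_change(7758, 1038)   # about 647.4
change_4tag = relative_change(2603, 1038)   # about 150.8 from the rounded times
change_6tag = relative_change(602, 1038)    # about -42.0

# Estimated storage for the tag lengths listed in Supplementary Table 9.
sizes = {k: tag_index_size_mb(k) for k in range(2, 7)}
```

The 4-tag value computed from the rounded times in the table is about 150.8%, slightly above the reported 150.7%; the small gap presumably comes from rounding of the published running times.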

Supplementary Table 10. The number of identified proteins and genes in Kim data

                                                          Average coverage (%)   Low-coverage (< 10%) proteins
Min. pep.  Proteins  FDR (%)  Genes   Olfactory receptor  All pep.  Unique pep.  All pep.  Unique pep.
1          19,067    5.63     15,153  34                  32.0      27.9         7,282     8,762
2          14,064    1.05     12,723  2                   41.5      36.5         2,564     3,608
3          12,239    0.43     11,536  0                   46.4      41.2         1,231     1,948
4          11,168    0.22     10,708  0                   49.5      44.3         707       1,184
5          10,387    0.14     10,069  0                   51.8      46.7         436       775
6          9,718     0.07     9,494   0                   53.7      48.6         276       525
7          9,148     0.07     8,980   0                   55.3      50.2         188       367
8          8,682     0.06     8,549   0                   56.7      51.7         132       268
9          8,273     0.02     8,162   0                   57.9      53.0         93        194
10         7,899     0.01     7,799   0                   59.1      54.1         69        146

Note: Min. pep. denotes the minimum number of protein-unique peptides required to support the identification of one protein (2 by default in the main text). The coverage of one protein is defined as the fraction of amino acids supported by at least one peptide among all amino acids in this protein sequence. In the protein coverage calculation, All pep. means that all peptides were used to calculate the protein coverage, and Unique pep. means that only the protein-unique peptides were used. Only peptides with lengths equal to or greater than 9 are considered in this analysis.
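The coverage definition in the note above can be written as a short sketch. The function name, the toy sequence and the use of exact substring matching are assumptions for illustration; only the definition itself (fraction of residues covered by at least one peptide, peptides of length 9 or more) comes from the note.

```python
def protein_coverage(protein_seq, peptides, min_len=9):
    """Fraction of residues in protein_seq covered by at least one peptide.

    Peptides shorter than min_len are ignored, following the note to
    Supplementary Table 10. Every occurrence of a peptide is counted.
    """
    if not protein_seq:
        return 0.0
    covered = [False] * len(protein_seq)
    for pep in peptides:
        if len(pep) < min_len:
            continue
        start = protein_seq.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein_seq.find(pep, start + 1)
    return sum(covered) / len(protein_seq)


# Hypothetical 20-residue protein with two overlapping peptides and one
# peptide that is too short to be considered.
toy_protein = "ACDEFGHIKLMNPQRSTVWY"
toy_coverage = protein_coverage(toy_protein, ["ACDEFGHIK", "GHIKLMNPQR", "STV"])
```

Passing all peptides gives the All pep. coverage; passing only the protein-unique peptides of that protein gives the Unique pep. coverage.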

Supplementary Notes

Supplementary Note 1

Using the metabolic labeling technique to estimate the error rates of search engines.

NaN ratios can be used to estimate the error rates of different engines independently of the target-decoy strategy. The error rate of one search engine is defined as the fraction of incorrect PSMs among all PSMs reported by this engine. First, we investigated the relationship between decoy PSMs and NaN-ratio PSMs based on the Open-pFind results obtained from the Dong-Ecoli-QE dataset. Fig. S1 shows the increase in the numbers of decoy and NaN-ratio PSMs along with the number of target PSMs (all PSMs were sorted in ascending order of their scores). The trends of the three curves were quite consistent, and the tails (where nearly all PSMs were incorrect) showed that the proportions of both decoy and NaN-ratio PSMs were stable.

Fig. S1. The relationship between the number of target PSMs and the number of PSMs from the decoy database (green) or with NaN ratios of 15N/14N (red) or 13C/12C (blue) at each score threshold in the Dong-Ecoli-QE dataset. Initially, all PSMs are sorted in ascending order by their scores (i.e., the best PSM is ranked first). The subplot shows the linear property of the tails of the three curves.

The number of data points (N) used for determining the R^2 values is 53,225 (located at the tail of the curves after 180,000).

Therefore, the percentage of NaN-ratio PSMs is useful for estimating the error rates of the results of metabolically labeled datasets, in a way that is similar to but independent of the traditional target-decoy strategy. Given M as the total number of PSMs and N as the number of NaN-ratio PSMs, we get the equation

$$M e r_1 + M (1 - e) r_2 = N, \qquad (1)$$

where e denotes the error rate to be estimated, r_1 denotes the percentage of NaN-ratio PSMs in incorrect matches (e.g., target PSMs distributed at the tail of the curves in Fig. S1) and r_2 denotes the percentage of NaN-ratio PSMs in correct matches. r_1 is simply calculated using the linear least-squares method, and r_2 is estimated based on the intersection of the results of different engines because a PSM is more likely to be correct if it is consistently reported by multiple search engines, resulting in a lower probability of being a NaN-ratio PSM (Fig. S2). In this study, the intersecting results of all eight search engines were used to estimate the value of r_2. Finally, the error rate e is estimated using the following formula:

$$e = \frac{N - M r_2}{M (r_1 - r_2)}, \qquad (2)$$

and the precision of the given result set is equal to 1 - e. This formula also shows that if r_1 and r_2 are correctly estimated based on the same dataset, then a smaller percentage of NaN-ratio results indicates a lower error rate, i.e., a higher precision.
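As a minimal sketch of how formula (2) could be applied, the functions below evaluate equation (1) in the forward direction and solve it for e. The function names and the numbers in the example are hypothetical and are not values from this study; r_1 and r_2 are taken as given fractions rather than estimated from data.

```python
def expected_nan_count(n_total, error_rate, r1, r2):
    """Equation (1): expected number N of NaN-ratio PSMs among M = n_total PSMs."""
    return n_total * error_rate * r1 + n_total * (1.0 - error_rate) * r2


def estimate_error_rate(n_total, n_nan, r1, r2):
    """Formula (2): e = (N - M * r2) / (M * (r1 - r2)).

    n_total: M, total number of PSMs.
    n_nan:   N, number of NaN-ratio PSMs.
    r1:      fraction of NaN-ratio PSMs among incorrect matches.
    r2:      fraction of NaN-ratio PSMs among correct matches.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if r1 == r2:
        raise ValueError("r1 and r2 must differ for e to be identifiable")
    return (n_nan - n_total * r2) / (n_total * (r1 - r2))


# Hypothetical example: 100,000 PSMs, r1 = 0.9, r2 = 0.02, true e = 0.01.
M_EXAMPLE = 100_000
n_nan_example = expected_nan_count(M_EXAMPLE, 0.01, 0.9, 0.02)  # about 2,880
e_example = estimate_error_rate(M_EXAMPLE, n_nan_example, 0.9, 0.02)
precision_example = 1.0 - e_example
```

The round trip recovers the error rate of 0.01 used to generate the count, which is a quick consistency check between equations (1) and (2).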

Fig. S2. The proportions of NaN-ratio PSMs distributed in all of the possible intersections of the eight result sets from Open-pFind, PEAKS, MODa, MSFragger, MS-GF+, Byonic, Comet and pFind. The number of intersections (N) for each boxplot is 8, 28, 56, 70, 56, 28, 8, 1. For example, the number of intersections from any three result sets is $\binom{8}{3} = 56$. Box-plot elements: center line, median; box limits, first and third quartile (Q1 and Q3); whiskers, from Q1 - 1.5 × IQR to Q3 + 1.5 × IQR; dots, outlier data points.
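The intersection counts quoted in the caption of Fig. S2 are binomial coefficients, which can be checked with a few lines. The variable names are illustrative only.

```python
from itertools import combinations
from math import comb

ENGINES = ["Open-pFind", "PEAKS", "MODa", "MSFragger",
           "MS-GF+", "Byonic", "Comet", "pFind"]

# Number of ways to pick k result sets out of eight, for k = 1..8.
intersection_counts = [comb(len(ENGINES), k) for k in range(1, len(ENGINES) + 1)]

# The same count for k = 3, obtained by explicit enumeration.
triples = list(combinations(ENGINES, 3))
```

The list evaluates to 8, 28, 56, 70, 56, 28, 8, 1, matching the caption, and the explicit enumeration yields 56 triples.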

Fig. S3. Comparison of estimated precision of consistently and separately identified PSMs between every two search engines using the Dong-Ecoli-QE dataset. 15N- and 13C-labeled peptides are used for estimation, and the final precision is calculated from the average of the two estimates for the same resulting PSMs. Each decimal denotes the estimated precision of the consistently or separately identified PSMs. a) Only the PSMs with common modification types (the four that are specified in the restricted search engines) are considered. b) All PSMs are considered.

In the Dong-Ecoli-QE dataset, the newly estimated precision of the identified PSMs varied from 95.7% to 99.2% for different engines when considering only the peptides in the restricted search space (Fig. S3). For the separately identified results, the estimated precision of Open-pFind remained close to 99%, which was significantly higher than that of the other search engines. Generally, when considering only peptides with no or only common modifications, all open search engines reported more accurate results than the restricted engines, because the peptides from the restricted search space had survived in a significantly larger space containing a huge number of competing peptide candidates. However, when all identified peptides were considered, the precision of the open search engines decreased to varying degrees. Open-pFind maintained a high global precision of 98.9%, while the precision of the other three open search engines dropped to 93.5% at best and 86.6% at worst. The potential of the metabolic labeling approach deserves further exploration.

Supplementary Note 2

Using the metabolic labeling technique to examine the search engine results.

The metabolic labeling technique is helpful in revealing why spectra are misidentified by different search engines and in improving search engine precision. Generally, a spectrum with a NaN-ratio peptide reported by one search engine may be identified as a different normal-ratio peptide by another search engine. As described above, the normal-ratio peptide is more likely to be a correct identification. Thus, for the former search engine, such cases could be used to optimize the scoring function. Of all NaN-ratio PSMs from Open-pFind, fewer than 10% were revived by other engines, i.e., identified as normal-ratio peptides (Fig. S4). In contrast, Open-pFind revived ~40% of NaN-ratio PSMs reported by other search engines.

Fig. S4. The proportions of NaN-ratio PSMs obtained from one engine but revived by others in the Dong-Ecoli-QE dataset. a) Comparison between every two search engines. Each decimal denotes the percentage of PSMs revived by the search engine in the row (leftmost) among the total NaN-ratio PSMs from the search engine in the column (topmost). Only peptides with common modifications are considered. b) Similar to a), but all PSMs including all types of modifications are considered. The 15N-labeled peptides and the unlabeled (common) peptides are used to calculate the quantitative values.

Table S1. The fraction of spectra assigned with overlapping peptides among the revived spectra from different engines in the Dong-Ecoli-QE dataset

Search engine  # Total peptides        # Overlapping  # Overlapping peptides /
               (from revived spectra)  peptides (a)   # Total peptides (%) (b)
MSFragger      4,161                   3,669          88.2
PEAKS          1,221                   1,091          89.4
MODa           3,277                   2,950          90.0
pFind          40                      25             62.5
MS-GF+         811                     770            94.9
Comet          856                     807            94.3
Byonic         197                     21             10.7

Note: (a) Two peptides are called overlapping peptides if one peptide sequence is a substring of the other one. For example, GCEHVAK and C(+carbamidomethyl)EHVAK are overlapping peptides. (b) The fraction of overlapping peptides among all peptides reported by each search engine. For example, a total of 3,669 spectra identified by MSFragger were assigned with peptides overlapping those reported by Open-pFind, which accounted for 88.2% of the total spectra identified by MSFragger.

For the open search engines, Open-pFind reported a peptide overlapping the one reported by the other engine for ~90% of the revived spectra (Table S1); that is, of the two peptide sequences identified by Open-pFind and the other engine, one sequence is a substring of the other (e.g., GCEHVAK/C(+carbamidomethyl)EHVAK is a pair of overlapping peptides, or we can say that each one is an overlapping peptide to the other). In other words, the peptide sequences reported by the other open search engines were partially correct, while Open-pFind confirmed the exact termini of the peptides and modification types, as well as the precise precursor information. For example, Open-pFind reported a C-terminal-specific peptide carbamyl-GAAGGIGQALALLLK with an N-terminal carbamylation (P1) for one spectrum (Fig. S5a), while MSFragger reported an overlapping tryptic peptide VAVLGAAGGLGQALALLLK with a mass shift of 337.3114 Da (P2). However, the actual mass difference between these two peptides (P2 - P1) was 339.2522 Da. This result implied that the mass shift of 337.3114 Da reported by MSFragger did not represent a real modification, because a ~2 Da mass difference existed between the initially exported precursor ion and the actual one confirmed by Open-pFind (Fig. S5b). This finding also demonstrated that exact precursor ions were very important for the confirmation of modification types.
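The ~2 Da discrepancy described above can be reproduced with a short calculation. This is a sketch under stated assumptions: the residue and carbamylation masses are standard monoisotopic values that are not given in the source, the precursor charge of 2 is inferred from the ".2.dta" file name in Fig. S5, and the variable names are illustrative.

```python
# Standard monoisotopic residue masses (Da) for the residues that differ
# between the two peptides; these constants are not taken from the source.
RESIDUE_MASS = {"V": 99.06841, "A": 71.03711, "L": 113.08406}
CARBAMYL_MASS = 43.00581  # monoisotopic mass added by carbamylation

# P2 (unmodified VAVL + shared core) minus P1 (carbamyl + shared core):
# the shared core GAAGG(I/L)GQALALLLK cancels out because I and L are isobaric.
extra_residues = sum(RESIDUE_MASS[aa] for aa in "VAVL")
p2_minus_p1 = extra_residues - CARBAMYL_MASS

# Gap between the true difference and the mass shift reported by MSFragger.
REPORTED_SHIFT = 337.3114
gap_from_masses = 339.2522 - REPORTED_SHIFT

# The same gap seen from the two precursor m/z values quoted in Fig. S5a,
# assuming a charge state of 2.
ASSUMED_CHARGE = 2
gap_from_precursors = (699.3906 - 698.4203) * ASSUMED_CHARGE
```

Both routes give a gap of about 1.94 Da, consistent with the precursor ion having been exported roughly 2 Da away from the ion confirmed by Open-pFind.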

Fig. S5. Two example spectra showing how the metabolic labeling technique helps to distinguish the correct PSMs. +, o and x denote the monoisotopic m/z values of the unlabeled, 15N- and 13C-labeled precursor ions, respectively. The first example is from 3,669 similar results in the result comparison between Open-pFind and MSFragger, and the second example is from 811 similar results in the result comparison between Open-pFind and MS-GF+. a) Ecoli-1to1to1-un-C13-N15-60mM-20150823.42526.42526.2.dta, which is identified by Open-pFind as a semi-tryptic peptide, GAAGGIGQALALLLK, with a carbamylation at the N-terminus (m/z = 698.4203). MSFragger reported another peptide, VAVLGAAGGLGQALALLLK (m/z = 699.3906, Hyperscore = 13.5427), with few b-ions matched. If the precursor ion m/z was changed to 698.4203 for MSFragger (the same as that used in Open-pFind) and semi-tryptic peptides were allowed in the search, a new peptide GAAGGLGQALALLLK was reported with a mass shift of 43.0074 Da (the monoisotopic mass of carbamylation), whose Hyperscore was 35.7024. b) The MS1 information corresponding to the PSM shown in a). c) Ecoli-1to1to1-un-C13-N15-30mM-20150823.35791.35791.2.dta, which is identified by Open-pFind as a peptide, ALTEANGDIELAIENMR, with a deamidation of N at the 6th position. d) The same spectrum as c), which is identified by Comet and MS-GF+ as a peptide, ALTEANGDIELAIENMR, without any modifications. e) The same spectrum as c), which is identified by Byonic as a peptide, ELGDADHGLNMNRGFSK, without any modifications. f) The MS1 information corresponding to the PSMs shown in c)-e).

In terms of the restricted search engines, over 90% of revived peptides reported by MS-GF+ and Comet were partially correct, which was similar to the behavior of the open search engines (Table S1). However, this number was lower for Byonic and pFind. Byonic adopted a different protein FDR control strategy, under which a few low-quality PSMs from reliable proteins might be reported (Online Methods). Another example shows the differences between Open-pFind and the restricted search engines (Fig. S5c-e). For the same spectrum, Open-pFind reported a tryptic peptide with a deamidation, while MS-GF+ and Comet reported the unmodified form of this peptide, which obviously matched fewer fragment ions. Byonic reported a completely different peptide, which matched few peaks in the spectrum. The isotopic envelopes of the unlabeled peptide reported by Open-pFind, as well as the corresponding 15N- and 13C-labeled forms shown in MS1, matched the theoretical values precisely. In contrast, the monoisotopic precursor ions of the other two identifications had larger mass deviations, which resulted in invalid quantitation values (Fig. S5f). This example indicated again that peptides reported by Open-pFind were more accurate and, more importantly, that the metabolic labeling technique is extremely helpful for distinguishing correct individual PSMs, which will facilitate the improved design of search engines.

Supplementary Note 3

Analysis based on the entrapment strategy showed the robustness of the design of Open-pFind.

To analyze the four published datasets, two types of entrapment databases were downloaded from the UniProt database and used in this study: a) a small database of the reviewed proteins of Arabidopsis thaliana (8.7 MB, 15,423 protein sequences) and b) a large database of the reviewed proteins of all organisms (261.8 MB, 555,100 protein sequences). Each entrapment database was appended to the original database files in turn. The other database search parameters were the same as those shown in Supplementary Table 3. Intuitively, when the entrapment database is included in the database search, the identification rate should decrease because more random peptide candidates are added to the search space, but few of them are the answers to any spectra. Generally, the decrease was more remarkable when a larger entrapment database was considered (Fig. S6). The Open-pFind identification rate was more stable in both situations than that of pFind. For example, the average decrease in the identification rates of Open-pFind and pFind was 1.6 and 4.2, respectively (Fig. S6b). The reason was that Open-pFind adopted a two-step workflow in which the proteins to be retrieved in the restricted search were automatically learned in the preceding open search step, so that most random peptide candidates that could interfere with the correct candidates were eliminated at that point. Furthermore, of all PSMs reported by Open-pFind that matched the entrapment sequences, fewer than 5% were revived by pFind, i.e., pFind identified sequences in the original database for those spectra; in contrast, the corresponding percentages for entrapment PSMs reported by pFind varied from 20% to 60% (Fig. S7). This phenomenon showed again that Open-pFind reported more accurate peptides that matched the authentic protein sequences rather than the entrapment sequences, although the same FDR threshold was applied.

Fig. S6. Decreased identification rates caused by the entrapment strategy for the four datasets. a) Proteins from Arabidopsis thaliana were used as the entrapment database. b) Proteins from all organisms recorded in UniProt were used as the entrapment database.

Fig. S7. Open-pFind revived more spectra than pFind. The orange curves denote the proportion of PSMs from the entrapment database. a) Proteins from Arabidopsis thaliana were used as the entrapment database. b) Proteins from all organisms recorded in UniProt were used as the entrapment database.

We also used the entrapment strategy to evaluate the precision of search engines with the Dong-Ecoli-QE dataset (the reviewed human database downloaded from UniProt was used as the entrapment database), and the performance of Open-pFind was similar to that on the four large-scale datasets. When searching against the target and entrapment databases, Open-pFind reported the highest number of PSMs with the smallest proportion of PSMs matched to the entrapment proteins (Fig. S8a). Similar to the analysis shown above, less than 10% of entrapment PSMs from Open-pFind were revived by the other engines, while 22% to 56% of entrapment PSMs from other engines were revived by Open-pFind (Fig. S8b).

Fig. S8. Entrapment analysis of the Dong-Ecoli-QE dataset. a) The number of identified PSMs (the blue bars, including PSMs from both original and entrapment protein databases) and the percentage of PSMs from the entrapment database (the orange curve). b) The number and proportion of PSMs identified with entrapment peptides from one engine and revived by the other engine. For example, 359 entrapment PSMs were identified by PEAKS and revived by Open-pFind, which accounted for 49.8% of the total entrapment PSMs identified by PEAKS.

Supplementary Note 4. Nearly 100% of high-quality spectra in the four published datasets are identified within a comprehensive search space.

We also investigated why a few spectra remained uninterpretable for Open-pFind. First, spectra were classified according to the length of their longest tag, which was treated as a feature related to spectral quality. For example, a 0-length tag indicates that no mass difference between any two peaks equals the mass of any amino acid residue within the given fragment ion tolerance. A spectrum with a longer tag is more likely to have been produced by a real peptide because it provides more fragmentation information. In general, the identification rates of spectra with longer tags were higher for all engines (Fig. S9). For all four datasets, the identification rate of Open-pFind was always greater than 90%, and close to 100%, for spectra with tags longer than ten, suggesting that the search space of Open-pFind is close to complete for routine MS/MS data analysis. Additionally, the scoring scheme of Open-pFind effectively distinguishes correct peptides from random peptides, even in such an ultra-large search space.

The identification rates of Byonic decreased sharply when spectra with longer tags were considered in the Mann-Mouse-QEHF dataset (Fig. S9c), likely because more large-mass peptides were present in this dataset and their precursor ions were not accurately exported. Among all PSMs identified by Open-pFind in this dataset, 55.0% had precursor ions larger than 1,500 Da, of which only 50.1% were initially exported by the vendor's software. In the other datasets, the proportion of precursor ions larger than 1,500 Da was markedly smaller; for example, it was only 38.8% for the Pandey-Human-Elite dataset, of which 82.1% were initially extracted by the vendor's software. We also tested pFind using the precursor ions extracted by the vendor's software rather than by pParse, and the distribution of identification rates was similar to that of Byonic (Fig. S10), which again shows that extracting accurate precursor ions is very important for search engine design.
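The longest-tag feature described in this note can be illustrated with a short Python sketch. It is not the tag extraction used by Open-pFind; it assumes singly charged fragment peaks, unmodified monoisotopic residue masses, and an absolute tolerance in Da (the parameter `tol` and the function name are hypothetical). Each peak keeps the length of the longest chain of residue-mass differences ending at it.

```python
from bisect import bisect_left, bisect_right

# Monoisotopic residue masses in Da (unmodified; Leu/Ile share one mass).
RESIDUE_MASSES = [
    57.02146, 71.03711, 87.03203, 97.05276, 99.06841, 101.04768,
    103.00919, 113.08406, 114.04293, 115.02694, 128.05858, 128.09496,
    129.04259, 131.04049, 137.05891, 147.06841, 156.10111, 163.06333,
    186.07931,
]


def longest_tag_length(peaks, tol=0.02):
    """Length (in residues) of the longest sequence tag in a peak list.

    peaks: fragment m/z values, assumed singly charged.
    tol:   absolute mass tolerance in Da (an assumption of this sketch).
    Returns 0 when no pair of peaks differs by a residue mass.
    """
    mz = sorted(peaks)
    best = [0] * len(mz)  # best[i]: longest tag ending at peak i
    for i, m in enumerate(mz):
        for residue in RESIDUE_MASSES:
            # Peaks lighter than m by one residue mass, within tolerance.
            lo = bisect_left(mz, m - residue - tol, 0, i)
            hi = bisect_right(mz, m - residue + tol, 0, i)
            for j in range(lo, hi):
                best[i] = max(best[i], best[j] + 1)
    return max(best, default=0)


# Toy spectrum: consecutive gaps of Gly, Ala and Leu/Ile give a 3-residue tag.
toy_peaks = [200.0, 257.02146, 328.05857, 441.14263]
print(longest_tag_length(toy_peaks))
```

A production implementation would also need to handle multiply charged fragments, modified residues and a ppm tolerance, all of which are left out here for brevity.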

Fig. S9. Analyses of the unidentified spectra with different maximum tag lengths in the four datasets. The curves denote the identification rates of the spectra with different maximum tag lengths, and the histograms denote the distribution of the total number of spectra at each tag length.

Fig. S10. The distribution of the identification rates of Byonic and pFind at different maximum tag lengths extracted from the spectra. Two modes are adopted for pFind, and the only difference is whether pParse is used to calibrate the precursor ions.

Supplementary Note 5. Comprehensive analysis of the Kim data.

The average identification rate was 62.5% across all 85 samples, and over 70% of spectra were identified for the in-gel digested samples analyzed on an LTQ Orbitrap Velos (Fig. S11a). The results obtained with Open-pFind demonstrated that the characteristics of MS/MS data vary with the methods used for sample preparation and LC-MS/MS. In terms of modifications, although several common modifications, e.g., carbamidomethylation, oxidation and Gln→pyro-Glu, were abundant in all datasets, many unexpected modifications appeared in only one or two types of datasets (Fig. S11b). For example, propionamide modifications of cysteines were hardly detected in the bRPLC fractionation samples but were among the most abundant cysteine modifications in peptides from in-gel digested samples (Supplementary Data 3), which is consistent with a previous study by Sechi et al. (ref. 5). On the other hand, the percentages of fully tryptic peptides were stable among the four types of datasets with different experimental conditions (97–99%, as concluded from Fig. S11c). In terms of co-eluting peptide identification, the LTQ Orbitrap Elite tended to produce more mixed spectra than the LTQ Orbitrap Velos, likely because its higher sensitivity allowed less-abundant peptides to be detected and identified by Open-pFind (Fig. S11d). The different characteristics of these datasets again show that specifying an appropriate search space for each individual dataset based on expert experience is difficult, and that uniformly considering a comprehensive search space across different experimental conditions is essential for today's search engines.

In addition, biological modifications and mutations were effectively discovered by Open-pFind. For example, Laminin subunit gamma-1 was identified by different types of peptides, all of which were supported by over ten PSMs (Fig. S11e).
The N-terminal cleavage site of QAAMDECTDEGGRPQR was confirmed by the signal peptide recorded in UniProt. In addition, two amino acid mutations were discovered by Open-pFind, and one of them, R1121Q, had been verified previously (rs20559 in dbSNP, ref. 6). Identification results from the extended search space were also valuable for other biological discoveries. For example, a total of 9,559 semi-tryptic peptides were identified as being located in the N-terminal regions of proteins (the C-terminal amino acid of each peptide located before the 60th amino acid of the corresponding protein), of which 34.1% had complete ion series (at least one b or y ion was detected at each peptide bond), and 66.4% had at most two peptide linkages in which both the b and y ions were missing. These semi-tryptic peptides provide valuable clues for identifying signal peptides, and 694 of them were already verified in UniProt (Supplementary Data 4). The score distributions of these 9,559 peptides and of the total 548,371 peptides (Fig. S12) indicated that, although these semi-tryptic peptides came from a much larger search space (Supplementary Fig. 10), their confidence was still comparable to that of the total results.
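The ion-series criterion quoted above (at least one b or y ion at each peptide bond) can be made concrete with a small Python sketch. It assumes the usual fragment numbering, in which bond i of a peptide of length n is covered by b_i or by y_(n-i); the function names and the input format (sets of matched ion indices) are hypothetical and not part of Open-pFind.

```python
def uncovered_bonds(peptide_length, b_ions, y_ions):
    """Peptide bonds for which neither the b nor the y ion was matched.

    peptide_length: number of residues n (bonds are numbered 1..n-1).
    b_ions, y_ions: sets of matched ion indices, e.g. {1, 2} for b1 and b2.
    Bond i is covered by b_i or by the complementary y_(n-i).
    """
    n = peptide_length
    return [
        i for i in range(1, n)
        if i not in b_ions and (n - i) not in y_ions
    ]


def has_complete_ion_series(peptide_length, b_ions, y_ions):
    """True if every peptide bond has at least one matched b or y ion."""
    return not uncovered_bonds(peptide_length, b_ions, y_ions)


# Toy example: a 5-residue peptide with b1, b2 and y1 matched.
missing = uncovered_bonds(5, b_ions={1, 2}, y_ions={1})
print(missing, has_complete_ion_series(5, {1, 2}, {1, 2}))
```

Under this numbering, a peptide counted in the 66.4% group above would be one for which `uncovered_bonds` returns a list of length at most two.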

Fig. S11. Profiling the Kim data using Open-pFind. a) The distribution of identification rates of each RAW file. Each boxplot denotes the distribution for one type of experimental setting (brp_velos, brp_elite, Gel_Velos and Gel_Elite; N = 338, 775, 585 and 514 raw files in the four boxplots shown from left to right, respectively). Box-plot elements: center line, median; box limits, first and third quartiles (Q1 and Q3); whiskers, from Q1 − 1.5 × IQR to Q3 + 1.5 × IQR; grey dots, outlier points. b) The distribution of highly abundant modifications. Each number in a cell denotes the percentage of modified amino acids among all amino acids of that type in the identified peptides. For example, 79.7% of cysteines were modified by carbamidomethylation in the identified peptides from samples fractionated by bRPLC and analyzed on an LTQ Orbitrap Velos. c) The distribution of the fraction of semi- and non-specific peptides under different experimental conditions. Boxplot groups, sample sizes and box-plot elements are as in a), without outlier points. d) The distribution of the number of peptides identified from one spectrum. For example, 7.5% of the identified spectra from samples fractionated by bRPLC and analyzed on an LTQ Orbitrap Velos each contribute two peptides. e) The identified peptides in Laminin subunit gamma-1. Red numbers in brackets denote how many PSMs correspond to each peptide.

Fig. S12. The score distributions of the 9,559 semi-tryptic peptides and of all 548,371 peptides identified in the Kim data.
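The box-plot convention stated in the Fig. S11 caption (whiskers from Q1 − 1.5 × IQR to Q3 + 1.5 × IQR, with points outside drawn as outliers) can be reproduced with the Python standard library. This is a minimal sketch: the quartile interpolation method ("inclusive") is an assumption, since the caption does not say which one was used, and the input values are toy numbers rather than identification rates from the study.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Quartiles, fences, whisker ends and outliers for one boxplot.

    Fences are Q1 - 1.5*IQR and Q3 + 1.5*IQR; whiskers end at the most
    extreme data points that still lie within the fences.
    Requires at least two values.
    """
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if lower <= v <= upper]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "fences": (lower, upper),
        "whiskers": (inside[0], inside[-1]),
        "outliers": [v for v in data if v < lower or v > upper],
    }


# Toy values only; 100 falls outside the upper fence and is flagged.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(summary)
```

Different plotting libraries use slightly different quartile definitions, so the fences computed here may differ marginally from those in the published figure.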

Supplementary References

1. Michalski, A. et al. Mass spectrometry-based proteomics using Q Exactive, a high-performance benchtop quadrupole Orbitrap mass spectrometer. Mol Cell Proteomics 10, M111.011015 (2011).
2. Chick, J.M. et al. A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides. Nat Biotechnol 33, 743-749 (2015).
3. Sharma, K. et al. Cell type- and brain region-resolved mouse brain proteome. Nat Neurosci 18, 1819-1831 (2015).
4. Kim, M.S. et al. A draft map of the human proteome. Nature 509, 575-581 (2014).
5. Sechi, S. & Chait, B.T. Modification of cysteine residues by alkylation. A tool in peptide mapping and protein identification. Anal Chem 70, 5150-5158 (1998).
6. Sherry, S.T., Ward, M. & Sirotkin, K. dbSNP-database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome Res 9, 677-679 (1999).