PeptideShaker enables reanalysis of mass spectrometry-derived proteomics datasets


Marc Vaudel 1,2, Julia M. Burkhart 1, René P. Zahedi 1, Eystein Oveland 2,3,4, Frode S. Berven 2,4,5, Albert Sickmann 1, Lennart Martens 6,7,* and Harald Barsnes 2,8

1 Leibniz-Institut für Analytische Wissenschaften ISAS e.V., Dortmund, Germany
2 Proteomics Unit, Department of Biomedicine, University of Bergen, Bergen, Norway
3 Department of Clinical Medicine, University of Bergen, Bergen, Norway
4 The KG Jebsen Centre for MS-research, Department of Clinical Medicine, University of Bergen, Bergen, Norway
5 The Norwegian Multiple Sclerosis Competence Centre, Department of Neurology, Haukeland University Hospital, Bergen, Norway
6 Department of Medical Protein Research, VIB, Ghent, Belgium
7 Department of Biochemistry, Ghent University, Ghent, Belgium
8 Computational Biology Unit, University of Bergen, Norway

* Correspondence: Prof. Dr. Lennart Martens, A. Baertsoenkaai 3, B-9000 Gent, lennart.martens@ugent.be

Table of Contents

Introduction
Installation and Hardware Requirements
Creating a New Project
PRIDE Reanalysis
File Import
Results Navigation
Validating Proteins, Peptides and PSMs
PTM Scoring
Reports, Follow Up Analyses and Submissions to PRIDE
Command Line Use
Documentation, Help, Support and Updates
References

Introduction

The interpretation of protein identification results from identification algorithms can be conducted in several software environments 1, freeware or commercial, some delivered with the instrument by the vendor. MaxQuant 2, the Trans-Proteomic Pipeline 3-4, OpenMS 5-7, and IDPicker 8 are examples of the efforts of the scientific community to provide free solutions for protein identification. Here, we present PeptideShaker, an interface to assemble and inspect results from tandem mass spectra identification algorithms. PeptideShaker allows intuitive interpretation of peptide and protein identification results based on mass spectrometry. In combination with SearchGUI 9, a user-friendly graphical user interface to conduct proteomics searches, it provides a full identification solution for both locally generated datasets and publicly available data in PRIDE 10 via ProteomeXchange 11. For every identified protein, peptide and spectrum, PeptideShaker delivers useful information such as identification confidence, modification site(s) and external information via resources like Ensembl 12 or PDB 13. All results can be exported in various reports, either for further follow up analyses or for submission to PRIDE. Notably, the use of PeptideShaker does not require extensive knowledge in bioinformatics. PeptideShaker is fully documented and comes with contextual help and extended tutorials. By covering every step from data loading to results display, the present supplementary material describes how the peptides and proteins are inferred from search engine results and how these are scored, displayed and connected with rich resources for protein identification.
Unless stated otherwise, the data and illustrations here were obtained on the PeptideShaker example dataset, a standard measurement of HeLa cell lysate as detailed elsewhere 14 and freely available in the ProteomeXchange 15 consortium via the PRIDE 10 partner repository under the accession PXD. If you have any questions about PeptideShaker, please do not hesitate to contact the developers at the PeptideShaker Google Group.

1.0 - Installation and Hardware Requirements

PeptideShaker is an open source project developed in Java under the very permissive Apache2 open source license. The complete source code, cross-platform executables and additional information are available online. The software does not require installation beyond unzipping and then double clicking the downloaded file, and works on Windows, Linux and Mac platforms. (For the first execution on newer Mac platforms, control-click on the file icon and then select "Open". This will provide the option to run the file regardless of its (unidentified) source. Help on usage on Mac can be found on the PeptideShaker website.)

The only prerequisite to run PeptideShaker is that Java is installed. However, due to the large size of modern proteomics datasets and databases, the software performance will depend on the amount of available memory: the more memory, the better the performance. When creating a new project, it is recommended to provide at least 4 GB of memory for smaller projects (<100,000 spectra, <100,000 protein sequences), while bigger projects will be more memory demanding. Working with less than 4 GB of memory is supported; import time will however be substantially extended. Memory settings can be edited from the interface directly under Edit > Java Options or via the Welcome Dialog under Settings and Help. Note that Java 32-bit does not support high memory settings. It is therefore strongly recommended to work on 64-bit machines, the standard for all recent computers. Java 32-bit is often installed by default on 64-bit machines, and it is then preferable to instead install Java 64-bit. Help with Java installation and usage is available online.

When not all information can be loaded into memory, the tool will interact with locally stored data. In general, read/write operations are the main speed limiting steps; it is thus advised to operate on SSD discs, which substantially speed up this process.
Although the creation of a PeptideShaker project is computationally demanding, its opening and viewing does not require equally powerful hardware capabilities. It is thus possible for mass spectrometry labs to create the project on a high performance machine, save it and share it with end users who can then inspect it on standard desktop computers. It is also possible to create projects automatically on servers and clusters via the command line options of the tool (see Command Line Use).
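As a minimal sketch of the memory advice above, the following Python helper assembles a Java launch command with an explicit maximum heap size. `-Xmx` is the standard JVM option for the maximum heap; the function name and the jar path are hypothetical placeholders, and the 4 GB default simply mirrors the recommendation for smaller projects. It is not the tool's documented command line interface.

```python
def build_launch_command(jar_path, memory_gb=4):
    """Return an argument list that starts a jar with a maximum heap of memory_gb GB."""
    if memory_gb < 1:
        raise ValueError("memory_gb must be at least 1")
    # -Xmx sets the maximum Java heap size; 32-bit Java cannot honour large values.
    return ["java", f"-Xmx{memory_gb}G", "-jar", jar_path]


# "PeptideShaker.jar" is a placeholder for the jar file in the unzipped download.
print(" ".join(build_launch_command("PeptideShaker.jar", memory_gb=8)))
```

The list form can be passed directly to `subprocess.run` on a machine where a 64-bit Java is on the path.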


PeptideShaker is started by double clicking the jar file in the unzipped downloaded file. The PeptideShaker Welcome Dialog is then displayed, see Supplementary Figure 1. From this dialog it is possible to: (1) Create a New Project; (2) Open a Saved Project; (3) Start a Search using SearchGUI; (4) Reshake a PRIDE Project, i.e., reprocess a dataset in PRIDE; (5) Open an Example Project; and (6) See the Getting Started Mini Tutorial.

Supplementary Figure 1: PeptideShaker Welcome Dialog. This dialog is displayed when starting the tool and can be used to start the processing of a dataset (see text for details).

2.0 - Creating a New Project

After selecting New Project, the user sets up the new project as displayed in Supplementary Figure 2.

Supplementary Figure 2: New Project Dialog where the user defines the new PeptideShaker project.

The setup includes: (i) annotation of the project, very useful for later reuse and sharing; (ii) selection of the input files; and (iii) editing of processing parameters. The processing parameters include: (a) the settings used for the search; (b) the import filters; and (c) the import preferences. These parameters are important as they impact the protein and peptide result set and can thus not be modified after the project has been created. The search settings are the settings used for the search, and for SearchGUI results these are automatically loaded when selecting the search result files. Import filters allow the removal of Peptide to Spectrum Matches (PSMs). Given that identification quality is known to depend on sequence length 16, it is possible to filter out short/long peptides. Filters on precursor mass deviation as suggested by Beausoleil et al. 17 are also supported. Finally, some modifications are not recognized by all search engines, e.g., PTMs located on protein termini or PTMs targeting motifs of several amino acids. In such cases it is possible to use a comprehensive search targeting all termini or a single amino acid and refine the results a posteriori by retaining only the modifications of interest. This is achieved by filtering out the PTMs not matching the PeptideShaker PTM definition and is activated by clicking Exclude Unknown PTMs.
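The length and precursor mass deviation filters described above can be illustrated with a short Python sketch. This is not PeptideShaker's code: the `Psm` class, the function names and the default thresholds (8 to 30 residues, 10 ppm) are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Psm:
    sequence: str          # peptide sequence, one letter per amino acid
    measured_mz: float     # observed precursor m/z
    theoretical_mz: float  # m/z computed from the peptide sequence


def precursor_error_ppm(psm):
    """Precursor mass deviation in parts per million."""
    return (psm.measured_mz - psm.theoretical_mz) / psm.theoretical_mz * 1e6


def passes_import_filters(psm, min_length=8, max_length=30, max_ppm=10.0):
    """Keep a PSM only if peptide length and precursor deviation are in range."""
    if not (min_length <= len(psm.sequence) <= max_length):
        return False
    return abs(precursor_error_ppm(psm)) <= max_ppm


psms = [
    Psm("PEPTIDEK", 500.002, 500.0),  # about 4 ppm, length 8: kept
    Psm("PEPK", 500.002, 500.0),      # too short: removed
    Psm("PEPTIDEK", 500.02, 500.0),   # about 40 ppm: removed
]
kept = [p for p in psms if passes_import_filters(p)]
```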

The processing parameters include initial False Discovery Rate (FDR) validation thresholds which can be altered after the project has been created (see Validating Proteins, Peptides and PSMs) and PTM scoring options (see PTM Scoring). Clicking Load Data! starts the processing of the files and the creation of the PeptideShaker project.


3.0 - PRIDE Reanalysis

The PRIDE Reshake option allows any scientist to easily reprocess datasets in PRIDE without requiring advanced bioinformatics skills. After clicking the PRIDE Reshake button in the Welcome Dialog, the user can choose to reanalyze either public or private datasets. (Private datasets require the input of username and password details.) In both cases the user will see the list of available projects with the associated assays and files as shown in Supplementary Figure 3.

Supplementary Figure 3: PRIDE data selection. At the top the project to reanalyze is selected. The assays and associated data files are shown in the tables below.

The user can search for specific projects using the advanced Find feature as shown in Supplementary Figure 4.

When the user has located the project to reanalyze, the "Reshake PRIDE Data" button is clicked. This will open the Reshake Settings Dialog shown in Supplementary Figure 5, where the user can customize the properties of the reanalysis.

After clicking the "Start the Reshaking!" button, PeptideShaker starts downloading the data file(s) and extracts the spectra and search settings. (Missing information can be manually added later in SearchGUI.) PeptideShaker then starts SearchGUI where the user can edit the search parameters as displayed in Supplementary Figure 6.

Supplementary Figure 4: It is possible to edit the inferred parameters to accurately reproduce the original search or completely change the context of the dataset.

The search can now be started and the import in PeptideShaker is directly triggered.

4.0 - File Import

Search engines assign peptide candidates derived from the protein database to every spectrum, providing a score illustrating the quality of the match: generally an e-value. PeptideShaker supports different types of import formats: X!Tandem 18 t.xml files implemented in the BioML format 19, OMSSA 20 omx files, MS Amanda 21 csv files, Mascot 22 dat files, and the PSI mzIdentML 23 format. The latter, notably, allows importing search results from MS-GF+ 24, and from virtually any identification algorithm if the minimal information required for import in PeptideShaker is present in the file. Details on mzIdentML and Mascot file requirements can be found on the PeptideShaker website.

First, the search engine results are parsed using open source Java parsers that are all published and actively maintained. The results are then loaded into an open source search engine independent structure 29, allowing PeptideShaker to manipulate, save and open this large amount of information. PeptideShaker takes advantage of the target/decoy strategy 16 to convert the scores of the search engines into Posterior Error Probability (PEP) values as commonly done in proteomics 30. Note that peptides mapping to both target and decoy proteins are excluded from the import. For every peptide candidate, the product of the search engine PEPs is given as score and the best, i.e., the lowest, scoring peptide is picked as the best candidate. Given that search engines are known to encounter difficulties at localizing PTMs, a peptide is here defined by its amino acid sequence and the number of PTMs without accounting for their location. Amino acids with a mass difference lower than the fragment ion tolerance are considered as indistinguishable.
If two peptide candidates score equally, they are discriminated by: (i) the occurrence of their parent protein in the dataset; (ii) the number of search engines supporting the peptide; (iii) the number of fragment ions annotated in the spectrum; and (iv) the precursor mass error. In order to improve the identification rate, the list of PSMs resulting from the search engine combination is separated into groups according to the run (i.e., spectrum file) and identified charge. Groups are created for every spectrum file and for charges from the lowest to the highest only if the group size does not compromise statistical accuracy 31: a group size is considered sufficient if more than 100 target hits are present before the first decoy hit and if the estimated PEP resolution is lower than 1%. A PEP is then estimated for every PSM based on the target and decoy hit distributions in the charge specific group. From these PSMs, a list of peptides is established. When two peptides differ only in the PTM localization, they are considered as separate peptide identification entities only if the PTMs are

confidently localized (see PTM Scoring). A peptide score, the product of the PSM PEPs, is attached to every peptide:

\[ S_{\mathrm{peptide}} = \prod_{i} \mathrm{PEP}_{\mathrm{PSM},i} \qquad (1) \]

where \(\mathrm{PEP}_{\mathrm{PSM},i}\) is the estimated PEP of the i-th PSM identifying the considered peptide. The peptides are then grouped according to their modification status, again, only when the size of the groups allows it. The size criterion is the same as for the PSM groups: more than 100 target hits before the first decoy hit and the estimated PEP resolution lower than 1%. A peptide level PEP is estimated based on the target and decoy hit distributions in the modification specific group.

Using the FASTA file provided by the user, every peptide is mapped to the parent protein sequences. Here again, indistinguishable amino acids are considered as such based on the fragment ion accuracy. Moreover, when a protein sequence contains X's, the mapping will be ignored if X's make up more than 25% of the peptide sequence. Subsequently, protein ambiguity groups are created based on peptide unicity as introduced by Nesvizhskii 32, such that peptides are unique to a group. PeptideShaker scores the protein groups using the product of the estimated peptide PEPs:

\[ S_{\mathrm{protein\ group}} = \prod_{i} \mathrm{PEP}_{\mathrm{peptide},i} \qquad (2) \]

where \(\mathrm{PEP}_{\mathrm{peptide},i}\) is the estimated PEP of the i-th peptide identifying the considered protein group. When an ambiguity group presents a subset (for example group "Proteins A or B or C" is identified as well as group "Proteins A or B"), the complete group ("Proteins A or B or C" in the example) is considered as unlikely and ignored if: (i) the additional proteins in the complete group ("Protein C" in the example) are only supported by non-enzymatic peptides (when searching with an enzyme); (ii) the additional proteins are uncharacterized proteins or proteins with lower evidence (UniProt accessions only); or (iii) the subset ("Proteins A or B" in the example) scores better and is hence more likely to be found. In these cases, the peptides are assigned to the subset.
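The product scores and the group size check described above can be sketched in a few lines of Python. The function names are illustrative assumptions, the sketch only covers the "more than 100 target hits before the first decoy hit" part of the size criterion (the PEP resolution check is not modelled), and it is not taken from the PeptideShaker source.

```python
import math


def product_score(peps):
    """Score of a peptide (or protein group): product of the PEPs of its matches."""
    return math.prod(peps)


def targets_before_first_decoy(is_decoy_sorted):
    """Count target hits preceding the first decoy in a list sorted best score first."""
    count = 0
    for is_decoy in is_decoy_sorted:
        if is_decoy:
            break
        count += 1
    return count


def group_is_large_enough(is_decoy_sorted, min_targets=100):
    """Size criterion: more than min_targets target hits before the first decoy hit."""
    return targets_before_first_decoy(is_decoy_sorted) > min_targets


# Two PSMs with PEPs of 0.1 and 0.2 give a peptide score of about 0.02.
peptide_score = product_score([0.1, 0.2])
```

Because the score is a product of probabilities, every additional confident match lowers it, so lower scores indicate better supported peptides or protein groups.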
Finally, the PEP of every protein group is estimated by comparing the target and decoy distributions; a representative protein is selected for every group based on the peptide enzymaticity (when searching with an enzyme) and the protein evidence (UniProt accessions only) and description; and the peptides of complete groups ("Proteins A or B or C" in the example) are linked to all subsets ("Proteins A or B" in the example). During the processing of the data, the progress, including tips and warnings, is shown to the user as displayed in Supplementary Figure 7.
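As a simplified illustration of the protein ambiguity groups described in this section, the sketch below groups peptides by the exact set of proteins they map to, scores each group with the product of the peptide PEPs, and lists the group pairs where one is a subset of the other. The data layout and function names are assumptions made for this example; the rules deciding which group of a pair is retained (enzymaticity, protein evidence, score) are not implemented.

```python
import math
from collections import defaultdict


def build_protein_groups(peptide_matches):
    """Map each distinct protein set to the product of its peptide PEPs.

    peptide_matches: dict of peptide -> (set of protein accessions, peptide PEP).
    """
    peps_by_group = defaultdict(list)
    for proteins, pep in peptide_matches.values():
        # Peptides mapping to exactly the same proteins share one ambiguity group.
        peps_by_group[frozenset(proteins)].append(pep)
    return {group: math.prod(peps) for group, peps in peps_by_group.items()}


def find_subset_groups(groups):
    """Return (subset, complete) pairs where one group is strictly contained in another."""
    return [(small, large) for small in groups for large in groups if small < large]


matches = {
    "PEPTIDEA": ({"A", "B"}, 0.01),
    "PEPTIDEB": ({"A", "B"}, 0.1),
    "PEPTIDEC": ({"A", "B", "C"}, 0.5),
}
groups = build_protein_groups(matches)
subset_pairs = find_subset_groups(groups)
```

In this toy input the group "A or B" scores better (lower) than "A or B or C", which corresponds to case (iii) in the text.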

Supplementary Figure 5: The Waiting Dialog displays the progress of the project creation process and also displays tips and warnings.


5.0 - Results Navigation

When a project has been created, PeptideShaker's main interface displays the identification results in a clear and intuitive fashion, allowing the user to easily navigate even large datasets. The interface is divided into nine interconnected tabs. By default, the results are displayed in the Overview tab as shown in Supplementary Figure 8.

Supplementary Figure 6: The interface of PeptideShaker consists of nine interconnected tabs corresponding to different use cases. The Overview tab displays extended information of the identification matches and allows the user to intuitively navigate the identification results.

At the top of the Overview tab, a table displays detailed information about the protein ambiguity groups identified. The Protein Inference (PI) column informs the user about the protein inference status of the protein ambiguity group using different colors: (i) green for single proteins; (ii) yellow for groups of related proteins; (iii) orange for groups of related and unrelated proteins; and (iv) red for groups of unrelated proteins only. Proteins are considered as unrelated if their associated genes (UniProt accessions only) or descriptions differ. When clicking the colored rectangle, the user can inspect the protein inference status of the protein group as displayed in Supplementary Figure 9.
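The color code of the Protein Inference column can be mimicked with a small Python function. This is a simplification under stated assumptions: two proteins are treated as related only when their gene names are identical (the description comparison mentioned in the text is left out), and the function name is hypothetical.

```python
from itertools import combinations


def inference_color(genes):
    """Classify a protein ambiguity group from the gene name of each member protein."""
    if not genes:
        raise ValueError("a protein group needs at least one protein")
    if len(genes) == 1:
        return "green"   # single protein
    related = [a == b for a, b in combinations(genes, 2)]
    if all(related):
        return "yellow"  # only related proteins
    if not any(related):
        return "red"     # only unrelated proteins
    return "orange"      # mix of related and unrelated proteins


print(inference_color(["HLA-A", "HLA-A", "ACTB"]))
```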


Supplementary Figure 7: The Protein Inference Dialog allows the user to inspect the protein ambiguity groups, here consisting of 53 proteins considered as related by PeptideShaker (Histocompatibility antigens). The first table displays the proteins matched with information about the gene, chromosome, protein evidence and peptide enzymaticity. The dialog also displays any unique hits which can be related to this group, as here Q29960 which was found with a very low score, and other protein groups related to this group. Note that the user can alter both the protein group label and the representative protein.

The next three columns in the Overview protein table provide information about the selected protein representative: protein accession, description and chromosome number. Chromosome number is available when the species is selected and is obtained from Ensembl 12. The species can be selected when creating the project or can be set later via the Edit menu. Clicking the chromosome number provides additional information about the gene associated by UniProt to the representative protein of the protein ambiguity group: Ensembl Gene ID, Gene Name, Chromosome, and GO annotation as displayed in Supplementary Figure


Supplementary Figure 8: When clicking the chromosome number in the protein table, gene information about the selected protein is extracted from Ensembl and displayed to the user.

Next, the protein coverage is displayed. The detected coverage (colored) is compared to the expected observable coverage (grey). The latter is estimated based on the size distribution of the identified peptides and the maximal size allowed for enzymatic peptides. The two following columns represent the number of identified and validated peptides and PSMs for the protein group (see Validating Proteins, Peptides and PSMs). Note how PeptideShaker takes full advantage of so-called sparklines (wiki/sparklines) to make the coverage and the results of the statistical analysis easier to interpret. Sparklines are used throughout the PeptideShaker tables using our open source JSparklines library, greatly enhancing the results inspection.

Subsequently, a spectrum counting index is displayed. Although PeptideShaker is dedicated to identification, it does provide spectrum counting metrics that allow for a rough estimation of protein abundances directly from identification results 33. PeptideShaker comes with a version of the emPAI index 34 and an improved version of the NSAF index 35, chosen as the default option for its accuracy 36. NSAF is a simple and efficient method where the number of spectra for a given protein, \(N_{\mathrm{spectra}}\), is normalized by the protein length, \(L_{\mathrm{protein}}\):

\[ \mathrm{NSAF} = \frac{N_{\mathrm{spectra}}}{L_{\mathrm{protein}}} \qquad (3) \]

It should be noted that a major issue with most spectrum counting indexes is that they do not take into account protein inference issues and cleavage efficiency. PeptideShaker thus implements an improved version of the NSAF index, where the contribution of the i-th validated PSM is weighted by a protein inference coefficient \(1/n_i\), where \(n_i\) is the number of protein ambiguity groups where the matched peptide is included. Note that if a peptide is redundant in the sequence of a representative protein, \(n_i\) is increased accordingly. Moreover, the observable length of the protein, \(L_{\mathrm{observable}}\), as used for the observable coverage, is used, thus discarding all domains of the sequence which cannot generate detectable peptides:

\[ \mathrm{NSAF}^{+} = \frac{\sum_{i} 1/n_i}{L_{\mathrm{observable}}} \qquad (4) \]

Towards the right, the molecular weight (MW) of the representative protein is shown, and finally the confidence attached to the protein group and its validation status is displayed.

When a protein group is selected, the peptides mapping to the selected group are displayed. As for the proteins, the protein inference status of the peptide is color coded and detailed information is accessible by clicking the colored rectangle. The peptide sequence, followed by the peptide's location in the representative protein of the protein group, is shown next. Note how the graphical representation of the peptide localization in the protein sequence allows for intuitive interpretation and that PTMs are color coded in the sequence. The use of white font on a colored background indicates a confident PTM localization, while a colored font on a white background indicates a non-confident PTM localization. The PTM color coding can be edited in the search parameters. Finally, the results of the target/decoy statistical processing are displayed as in the protein table. Similarly, when a peptide is selected, the PSMs mapping to the peptide are displayed.
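The two spectrum counting indexes discussed above can be written as a short Python sketch. It assumes the plain index is the spectrum count divided by the protein length, and that the improved index weights each validated PSM by one over the number of protein ambiguity groups containing its peptide and divides by the observable length. The function names and example numbers are hypothetical.

```python
def nsaf(n_spectra, protein_length):
    """Plain index: number of spectra normalized by the protein length."""
    return n_spectra / protein_length


def weighted_nsaf(groups_per_psm, observable_length):
    """Improved index under the stated assumptions.

    groups_per_psm: for every validated PSM, the number of protein ambiguity
    groups that contain the matched peptide (1 for a peptide unique to a group).
    """
    return sum(1.0 / n for n in groups_per_psm) / observable_length


# Four validated PSMs: two unique, one shared by two groups, one shared by four.
plain = nsaf(4, 400)
improved = weighted_nsaf([1, 1, 2, 4], 350)
```

Shared peptides contribute less than one spectrum each, so a protein identified mostly through shared peptides receives a lower index than the plain count would suggest.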
The color coded column (SE) shows the agreement of the search engines for the given spectrum, and clicking the colored rectangles will show the details for each search engine for the given spectrum. Subsequently, detailed information about the match is displayed: sequence with color coded PTMs, charge, precursor mass error, confidence, and validation status. The spectrum corresponding to the selected PSM is displayed at the bottom right with user customizable fragment ion annotation. The three spectrum sub plots above the spectrum make it easier to assess the quality of the match: (i) the intensity of every fragment ion is displayed relative to the peptide sequence for forward (blue, down) and rewind (red, up) ions; (ii) a histogram of the intensities of the annotated (green) and the non-annotated (grey) peaks; and (iii) the fragment ion m/z error is plotted against the peak m/z for forward (blue) and rewind (red) ions. The latter allows for straightforward detection of calibration issues. The user can


fully customize the spectrum annotation and display, such as the color and appearance of the peaks, benefit from the advanced annotation as shown in Supplementary Figure 11, and export the plots in publication-grade quality.

Supplementary Figure 9: A spectrum as displayed in the Overview tab. The intensity of every fragment ion is displayed relative to the peptide sequence at the top left for forward (blue, down) and rewind (red, up) ions. In the top middle, a histogram of the intensities of the annotated (green) and the non-annotated (grey) peaks is displayed. On the top right, the fragment ion m/z error is plotted against the peak m/z for forward (blue) and rewind (red) ions. The spectrum is displayed with fully customizable annotation as exemplified here by the overlay of automated de novo sequencing using the selected PSM.

PeptideShaker also provides visualizations of multiple PSMs. When selecting different PSMs simultaneously, spectra are displayed in a so-called planetary system view where the x-axis represents the m/z of the spectrum, the y-axis the m/z error and the size of the data point represents the intensity of the peak. This display makes it easy to spot outliers or mass calibration issues as displayed in Supplementary Figure


Supplementary Figure 10: The planetary system view allows the comparison of multiple measurements (PSMs) of the same peptide within the same plot. For every fragment ion of every PSM matched in the respective spectrum, the x-axis represents the m/z of the annotated peak, the y-axis the m/z error and the size of the data point represents the intensity of the peak. This display makes it easy to inspect the spectrum reproducibility and detect outliers or mass deviation issues. Here the second PSM in blue seems to deviate from the others, showing a typical ppm error.

The annotated peaks can also be visualized in a table as displayed in Supplementary Figure 13.

Supplementary Figure 11: For a given PSM, the fragment ion matches can be plotted in a table conveniently showing the m/z values of the detected ions.


Selecting multiple PSMs shows the reproducibility of the spectrum acquisition in terms of intensities, normalized using the summed spectrum intensity, as displayed in Supplementary Figure 14.

Supplementary Figure 12: When selecting multiple PSMs, the intensities of the matched peaks can be displayed with error bars indicating the variability of spectrum acquisitions.

At the bottom of the Overview panel, the sequence of the representative protein of the selected protein group is displayed, where colors represent the areas of the sequence covered by the experiment, and grey the coverable areas of the sequence according to the chosen enzyme and the size distribution of the identified peptides. Green, yellow and red indicate the areas of the sequence covered by confident, doubtful and not validated peptide matches (see Validating Proteins, Peptides and PSMs). As displayed in Supplementary Figure 15, the PTMs are also localized and color coded, and clicking on an area of the sequence can be used to select the given peptide.

Supplementary Figure 13: The protein coverage panel displays the representative protein of the selected protein group, as demonstrated here with the protein of accession P49588 of the example dataset of PeptideShaker (968 amino acids). Here 17.56% of an expected 91.84% coverage is observed as displayed in color and grey, respectively. 12.6%, 3.2%, and 1.76% of the coverage was achieved using confident, doubtful, and not validated peptide matches (green, yellow and red areas), respectively. The currently selected peptide is displayed in blue and the modifications are color coded (here blue for oxidation of methionine). When clicking in the sequence, as done here between amino acids 880 and 899, the identified peptides can be selected. Here the peptide MHSPQTSAMLFTVDNEAGK was found with and without oxidation of methionine 9.
According to the PTM localization scores (see PTM Scoring), the localization of the oxidation is confident as indicated by the colored background.
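The coverage percentages quoted in the protein coverage panel follow from a simple calculation, sketched here in Python. The function name and the peptide positions are illustrative assumptions; positions are taken as 1-based and inclusive, and overlapping peptides are counted once.

```python
def sequence_coverage(protein_length, peptide_ranges):
    """Percentage of residues covered by at least one peptide.

    peptide_ranges: iterable of (start, end) positions, 1-based and inclusive.
    """
    covered = set()
    for start, end in peptide_ranges:
        covered.update(range(start, end + 1))
    return 100.0 * len(covered) / protein_length


# Overlapping peptides 1-10 and 5-20 plus peptide 91-100 cover 30 of 100 residues.
print(sequence_coverage(100, [(1, 10), (5, 20), (91, 100)]))
```

As a plausibility check on the figure legend, a hypothetical 170 covered residues in a 968 amino acid protein would give the reported 17.56%.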

The other PeptideShaker tabs cover other use cases with the same focus on intuitive navigability of the identification results: (i) the Spectrum IDs tab (Supplementary Figure 16) allows the user to compare the results of the different search engines; (ii) the Fractions tab (Supplementary Figure 17) displays the contributions of different fractions to the final results; (iii) the Modifications tab (Supplementary Figure 18) makes it possible to browse peptides carrying certain variable modifications and inspect the results of the localization scoring algorithms; (iv) the 3D Structures tab (Supplementary Figure 19) maps the identification results onto 3D structures from PDB; (v) the Annotation tab (Supplementary Figure 20) connects the identification results to various external resources like Reactome 37 or STRING 38; (vi) the GO Analysis tab (Supplementary Figure 21) displays Gene Ontology statistics of the dataset; (vii) the Validation tab (Supplementary Figure 22) allows the inspection of the target/decoy results and adapting the validation threshold; and (viii) the QC Plots tab (Supplementary Figure 23) displays different quality control metrics on the identified proteins, peptides and PSMs. For further details on the tabs see the figure legends below the figures on the following pages.


Supplementary Figure 14: The Spectrum IDs tab allows the user to compare the results of the different search engines. The table at the top shows the identification results retained by PeptideShaker for a given spectrum file with extended information on the PSM retained for every spectrum. The center-left panel shows the match retained by PeptideShaker followed by a table listing all matches from all identification algorithms including secondary hits. The confidence a secondary hit would have obtained if retained as first hit is displayed, allowing the comparison of hits across search engines. The center-left panel displays the spectrum of the selected PSM; it is thus possible to inspect the spectrum annotation of a secondary hit. At the bottom, several plots display the performance of the identification algorithms with respect to the results displayed in PeptideShaker, allowing straightforward control of the performance of all algorithms. From left to right: (1) the number of PSMs a given algorithm would provide if used alone, (2) the number of PSMs of the merged dataset which can be ascribed to a single algorithm only, (3) the number of unassigned spectra a given algorithm would lead to if used alone, and (4) the identification yield a given algorithm would lead to if used alone. Note that these metrics are biased toward search engine agreement by the strategy used to select the best match of every spectrum.
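The four per-algorithm metrics listed in the legend above can be approximated with set operations, as in the following Python sketch. It assumes each algorithm is represented simply by the set of spectrum identifiers it identified and ignores validation status and confidence; the function and key names are hypothetical.

```python
from collections import Counter


def engine_metrics(ids_by_engine, total_spectra):
    """Per-algorithm counts from a dict of engine name -> set of identified spectrum ids."""
    # Number of engines identifying each spectrum.
    support = Counter(s for ids in ids_by_engine.values() for s in ids)
    metrics = {}
    for engine, ids in ids_by_engine.items():
        metrics[engine] = {
            "psms_alone": len(ids),
            "unique_psms": sum(1 for s in ids if support[s] == 1),
            "unassigned_alone": total_spectra - len(ids),
            "yield_alone": len(ids) / total_spectra,
        }
    return metrics


results = engine_metrics({"X!Tandem": {1, 2, 3}, "OMSSA": {2, 3, 4, 5}}, total_spectra=10)
```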


Supplementary Figure 15: If loading multiple spectrum files, the Fractions tab indicates in which spectrum files the different proteins would be validated if not found in the other files. Various plots are available to visualize the distribution of each protein across the spectrum files, e.g., the number of peptides per spectrum file as shown here. It should be noted that the plots only give an indication of the protein distribution across the spectrum files, mainly for inspecting fractionated data, and should not be used as a comparison of distinct experiments where a more advanced processing is required.

Supplementary Figure 16: The Modifications tab supports the browsing of the peptides carrying variable modifications. The modification of interest is selected in the top left and the modified peptides of the selected category are displayed in the top right table. When a modified peptide is selected, PeptideShaker looks for related peptides, i.e., peptides that carry other modifications or present different cleavage sites. Here, a phosphorylated peptide is selected, and the same peptide with no modification and with an oxidation of methionine is found. The confidence in the PTM localization is color coded and the results of the localization scores are displayed on the sequence. More information on the localization is available when clicking in the PTM column or by selecting the desired score tab under the spectrum.

Supplementary Figure 17: The 3D Structures tab maps the identified peptides onto the 3D structure of the protein from the PDB 13 using Jmol, an open-source Java viewer for chemical structures in 3D. Identified peptides are shown in green, the selected peptide in blue, and the PTMs are color coded and annotated on the structure.

Supplementary Figure 18: The Annotation tab allows querying various external resources with results from PeptideShaker. Using the protein accession number, the external resources can be directly searched, as illustrated here with a Reactome pathway and a STRING interaction network.

Supplementary Figure 19: The GO Analysis tab provides Gene Ontology statistics on the validated proteins, highlighting the terms that are significantly more or less frequent in the given dataset compared to the annotation of the species in Ensembl using a hypergeometric test.

Supplementary Figure 20: The Validation tab allows the user to inspect the target/decoy results for proteins, peptides and PSMs as selected in the top left table. False positive and false negative rates are displayed and can be optimized in an intuitive cost/benefit way.

Supplementary Figure 21: The QC Plots tab allows the user to quality control several parameters of the protein, peptide and PSM identification. Here, for example, the distribution of the precursor mass error is displayed, from which the instrument resolution is readily apparent.

6.0 - Validating Proteins, Peptides and PSMs

PeptideShaker takes full advantage of the target/decoy strategy to allow the user to extensively control the quality of the identifications. As already mentioned, the confidence in identification matches is displayed as the complement of the PEP:

$$\text{confidence} = 1 - \text{PEP} \tag{5}$$

As an example, 100 matches with a confidence of 90% are expected to contain 10 false positives. In proteomics, identification results are typically validated at a given False Discovery Rate (FDR). The FDR indicates the share of false positive matches in the result set:

$$\text{FDR} = \frac{n_{FP}}{N} \tag{6}$$

where $n_{FP}$ represents the count of target false positives and $N$ the number of retained target hits. As an example, a result set of 100 matches with an FDR of 1% will contain only one false positive. The number of false positives can be equivalently estimated via the number of decoy hits, $n_{decoy}$, or using the expectation value of the PEP. PeptideShaker hence provides two estimations of the FDR:

$$\text{FDR}_{classical} = \frac{n_{decoy}}{N} \tag{7}$$

$$\text{FDR}_{probabilistic} = \frac{\sum_{i=1}^{N} \text{PEP}_i}{N} \tag{8}$$

By default the classical estimator of equation (7) is used. The user can switch to the probabilistic FDR in the Validation tab to estimate the FDR using the posterior error probability, equation (8). Similarly, the number of false negatives, $n_{FN}$, can be estimated by integrating the confidence. Complementarily to the FDR, the False Negative Rate (FNR) of the identification process is hence also estimated:

$$\text{FNR} = \frac{n_{FN}}{n_{TP}} \tag{9}$$

where $n_{TP}$ is the estimated number of true positives in the imported dataset. As an example, when 100 hits are validated at an FNR of 20%, one can expect a total of 125 true positive hits, among which 25 were rejected by the threshold. Likewise, with an FNR threshold of 1%, one covers 99% of the possible true positive identifications loaded in PeptideShaker.
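The estimators described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not PeptideShaker's implementation: the function names and the input format (plain lists of PEP values for retained and rejected target hits) are hypothetical, and the false negatives are estimated here by summing the confidence (1 - PEP) of the rejected hits.

```python
def classical_fdr(n_decoy, n_target):
    """Decoy-based FDR estimate: decoy hits stand in for target false positives."""
    return n_decoy / n_target if n_target else 0.0


def probabilistic_fdr(target_peps):
    """PEP-based FDR estimate: expected false positives = sum of the PEPs."""
    return sum(target_peps) / len(target_peps) if target_peps else 0.0


def false_negative_rate(retained_peps, rejected_peps):
    """FNR estimate: share of the expected true positives that were rejected.

    Assumption of this sketch: the confidence (1 - PEP) of each hit is summed
    to estimate the true positives among retained and rejected target hits.
    """
    tp_retained = sum(1.0 - p for p in retained_peps)
    fn = sum(1.0 - p for p in rejected_peps)
    total_tp = tp_retained + fn
    return fn / total_tp if total_tp else 0.0


# 100 retained target hits, each with a confidence of 90% (PEP = 0.1)
retained = [0.1] * 100
print(round(probabilistic_fdr(retained), 3))

# One decoy hit for 100 retained target hits
print(classical_fdr(1, 100))

# 100 validated hits and 25 rejected true positives, as in the FNR example
print(false_negative_rate([0.0] * 100, [0.0] * 25))
```

The last call reproduces the worked example of the text: 25 rejected hits out of 125 expected true positives.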

PeptideShaker allows the user to intuitively edit the validation thresholds in the Validation tab, and whether a match is validated or not is shown throughout the display. The validation process, a balance between quality and quantity, between FDR and FNR, is crucial for all downstream analyses.

The accuracy of the false positive rate estimations was benchmarked by searching a Pyrococcus furiosus dataset against a database consisting of the concatenation of Pyrococcus furiosus sequences with the eukaryota complement of the UniProt/SwissProt database 40, downloaded on the 21st of October 2013, 181,026 (target) sequences, including the reversed version of every sequence as decoy proteins. In this setup, eukaryote sequences (excluding known contaminants) can be considered as false identifications while Pyrococcus furiosus sequences can be considered as correct matches, hence allowing the verification of target/decoy derived error rates 41.

Peak lists obtained in 41 were searched using OMSSA 20 version 2.1.9, X!Tandem 18 version Sledgehammer ( ), MS Amanda 21 version and MS-GF+ 24 version Beta (v10024) (5/9/2014). The search was conducted using SearchGUI 9 version The identification settings were as follows: Trypsin with a maximum of 2 missed cleavages; 10 ppm as MS1 and 0.5 Da as MS2 tolerances; fixed modifications: Carbamidomethylation of Cys ( Da), variable modifications: Oxidation of Met ( Da), Phosphorylation of Ser, Thr and ( Da). All algorithm-specific settings were left at the SearchGUI defaults. The mass spectrometry data along with the identification results have been deposited to the ProteomeXchange Consortium 15 via the PRIDE partner repository 10 with the dataset identifier PXD

As demonstrated in Supplementary Figures 22 to 24, the error rate estimated by PeptideShaker accurately tracks the actual error rate for PSMs, peptides and proteins. As discussed in the literature 41-43, the marginal underestimation of the error rate can be ascribed to the second refinement procedure of X!Tandem.
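The logic of this benchmark can be illustrated with a short sketch. It assumes that every hit is reduced to a score and a label ("decoy", "entrapment" for eukaryote sequences, or "target" for Pyrococcus furiosus sequences) and that higher scores are better; the data layout and the function name are hypothetical. Decoy and entrapment hits are counted as hits are accumulated from the best score downwards; if the target/decoy estimate is accurate, the two counts stay close to the y=x line.

```python
def decoy_vs_entrapment(hits):
    """Return cumulative (n_entrapment, n_decoy) pairs, best score first.

    hits: iterable of (score, label) with label in
    {"target", "entrapment", "decoy"}; higher scores are assumed better.
    """
    n_decoy = 0
    n_entrapment = 0
    curve = []
    for score, label in sorted(hits, key=lambda h: h[0], reverse=True):
        if label == "decoy":
            n_decoy += 1
        elif label == "entrapment":
            n_entrapment += 1
        curve.append((n_entrapment, n_decoy))
    return curve


# Made-up scores and labels, for illustration only
example = [
    (9.1, "target"), (8.7, "target"), (6.2, "entrapment"),
    (5.9, "decoy"), (5.1, "target"), (3.3, "decoy"), (3.0, "entrapment"),
]
curve = decoy_vs_entrapment(example)
print(curve[-1])  # cumulative counts once every hit is retained
```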

[Plot: Number of Decoy Proteins (y-axis) against Number of Eukaryota Proteins (x-axis); series: Proteins, y=x]

Supplementary Figure 22: A Pyrococcus furiosus dataset was searched against a concatenation of eukaryota and Pyrococcus furiosus sequences. The number of retained decoy proteins is plotted against the number of identified eukaryote proteins at increasing protein score. In this setup, the number of eukaryota proteins indicates the number of false identifications.

[Plot: Number of Decoy Peptides against Number of Eukaryota Peptides; series: unmodified peptides, modified peptides, peptides, y=x]

Supplementary Figure 23: As in Supplementary Figure 22, the number of retained decoy peptides is plotted against the number of identified eukaryote peptides at increasing peptide score. In order to increase the identification rate, peptides were separated into two groups: modified and unmodified peptides. If a category of modified peptides is substantially enriched, a standalone group is automatically created by PeptideShaker (see main text for details).

[Plot: Number of Decoy PSMs (y-axis) against Number of Eukaryota PSMs (x-axis); series: PSMs per charge group, PSMs, y=x]

Supplementary Figure 24: As in Supplementary Figure 23, the number of retained decoy PSMs is plotted against the number of identified eukaryote PSMs at increasing PSM score. In order to increase the identification rate, PSMs were separated into three groups: PSMs identified with a charge of 2+, 3+ and with charge >3+ (see main text for details).

In addition to the statistical validation, one generally expects proteins to be identified with at least two confident peptides in large scale proteomic shotgun experiments (note that this does not apply to datasets enriched for specific species like modified peptides). Similarly to Peptizer for PSMs 44, PeptideShaker therefore inspects all the validated matches using quality filters, and doubtful matches are flagged by a yellow warning icon throughout the display. Quality filters are applied at the PSM, peptide and protein levels. Since confident peptides require confident PSMs, and in turn confident proteins require confident peptides, the quality of the identification propagates from the PSM to the peptide and protein levels, providing the user with stringently quality controlled results. When the database or the dataset does not allow for reliable statistical estimation (e.g., when searching small databases or identifying a low number of proteins), the validation status is similarly marked as doubtful. As displayed in Supplementary Figure 25, clicking on the yellow warning icon opens a dialog with details on the validation status and allows the user to set the status of a validated match as confident or doubtful.
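The two-peptide expectation described above can be written as a small rule. The sketch below is only an illustration: the function name, the status strings and the default threshold are assumptions, and PeptideShaker's actual quality filters are not reproduced here.

```python
def protein_status(peptide_statuses, min_confident_peptides=2):
    """Classify a statistically validated protein as 'confident' or 'doubtful'.

    peptide_statuses: validation status of each peptide supporting the protein.
    The two-peptide minimum mirrors the expectation stated in the text; it is
    a parameter here because it does not apply to enriched datasets.
    """
    n_confident = sum(1 for s in peptide_statuses if s == "confident")
    return "confident" if n_confident >= min_confident_peptides else "doubtful"


print(protein_status(["confident", "doubtful"]))   # one confident peptide only
print(protein_status(["confident", "confident"]))  # two confident peptides
```

Because the protein status is derived from the peptide statuses, the same pattern can be applied one level down, from PSMs to peptides, which is how quality propagates between levels in the description above.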

Supplementary Figure 25: In addition to the statistical validation, PeptideShaker inspects all identification matches using quality filters. Doubtful matches are marked with a yellow warning icon. Clicking on the icon opens a dialog allowing the inspection of the validation procedure for this match. The dialog provides details concerning the database, the target/decoy scoring and results, and the inspection by quality filters. Here, the protein identification was supported by only one confident peptide and only one confident spectrum and was thus marked as doubtful.

7.0 - PTM Scoring

In proteomics, there are two main paradigms for PTM localization scoring 45: (i) using the identification algorithm score difference between assumptions carrying PTMs at different sites as a proxy to infer the quality of the localization; or (ii) using probabilistic scores to estimate the probability that a site is actually modified. The latter scores are inferred from the original spectra and are independent of the identification algorithm. PeptideShaker implements the widely used MD-score 46 and its multiple search engine equivalent, the D-score 47. Complementarily, two probabilistic scores are implemented, the A-score 17 and PhosphoRS 48. Although originally designed for phosphorylation only, these scores are estimated for every variable modification which can take different sites in a peptide. Notably, the peak annotation method used for these scores is the same as the one used to annotate the spectra in the interface, thus allowing visual inspection of site determining ions in spectra. During the scoring, neutral losses of the same mass as the PTM itself are not taken into account for spectrum annotation. In our experience, disabling all neutral losses (e.g. H2O and NH3) increases the discrimination power of the probabilistic scores.

As mentioned in the project creation section, the selection of the probabilistic score is done in the processing parameters dialog. There, it is also possible to disable neutral loss annotation for the scores and to set a score threshold. Finally, one can enable an automated threshold which will be calibrated to a 99% agreement with the D-score. Whenever a peptide presents more modification sites than detected modifications and the probabilistic score passes the threshold, the modification is marked as Confident. It is labeled as Doubtful if it does not pass the threshold and as Random if different modification sites score equally. If no probabilistic score is calculated, a D-score of 95% is used as the threshold.

Note that the A-score was only defined for singly modified peptides. The A-score was also established for spectra searched with a tolerance of ±0.5 m/z, thus using a subdivision in windows of 100 m/z: the window size equals 100 times the fragment mass resolution. In our implementation, we kept the window size and tolerance to this original ratio. As a result, for data searched with a tolerance of 0.02 m/z the spectrum will be subdivided into windows of 4 m/z and not 100 m/z as done in PhosphoRS 48. PhosphoRS was implemented according to its original publication.
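The window scaling and the Confident/Doubtful/Random labeling can be sketched as follows. This is an illustrative reading of the text rather than the PeptideShaker code: the function names are hypothetical, "passes the threshold" is taken to mean greater than or equal, and the tie check is assumed to take precedence over the threshold check.

```python
def ascore_window_size(fragment_tolerance_mz):
    """Window width keeping the original ratio: 100 m/z for a 0.5 m/z tolerance."""
    return 100.0 * fragment_tolerance_mz / 0.5


def localization_confidence(site_scores, threshold):
    """Label a modification localization as Confident, Doubtful or Random.

    site_scores: non-empty list with the probabilistic score of each
    candidate site (higher is better).
    """
    ranked = sorted(site_scores, reverse=True)
    if len(ranked) > 1 and ranked[0] == ranked[1]:
        return "Random"  # the best candidate sites score equally
    return "Confident" if ranked[0] >= threshold else "Doubtful"


print(ascore_window_size(0.5))   # tolerance of the original A-score setup
print(ascore_window_size(0.02))  # high resolution fragment tolerance
print(localization_confidence([42.0, 3.5], threshold=19.0))
```

The threshold of 19.0 in the usage line is an arbitrary placeholder; in PeptideShaker the threshold is set by the user or calibrated automatically, as described above.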

8.0 - Reports, Follow Up Analyses and Submissions to PRIDE

All the generated data can be exported to text files which can subsequently be imported into external tools like Excel or Perseus. Contextual export options are additionally available from most tables or displays. Identification details can also be exported under the Export > Identification Features menu, where the user can select the protein, peptide, PSM or search engine specific results to be exported to a text file. A phosphorylation oriented summary is also available. Finally, fully customizable reports can export virtually any information from PeptideShaker. Five reports are available by default: (i) Certificate of Analysis: all search and processing parameters with statistics on the dataset, a crucial feature for service providers and publications; (ii) Default PSM Report: all identification information at the PSM level; (iii) Default Peptide Report: all identification information at the peptide level; (iv) Default Protein Report: all identification information at the protein level; and (v) Default Hierarchical Report: all identification results presented in a hierarchical manner: protein > peptide > PSM. Note that users can create and customize their own reports as displayed in Supplementary Figure 26. Documentation describing the report content can also be exported for every report.

Supplementary Figure 26: In the Report Dialog the user can select the identification features to export. These include Annotation Settings, Input Filters, Proteins, Peptides, PSMs, PTMs, Fragment Ion Information, Project Details, Search Parameters, Spectrum Counting Settings and Target/Decoy Validation Summary. From each of these categories, the user can select the desired elements, as displayed here for proteins, ranging from high level information (accession, chromosome, sequence coverage, PTM mapping, etc.) to detailed mass spectrometry results, like detected fragment ion m/z error.

PeptideShaker also offers various exports for post processing of the identifications, as displayed in Supplementary Figure 27. These are available under the Export > Follow Up Analysis menu.

Supplementary Figure 27: The Follow Up Analysis options in PeptideShaker include: (i) export of the spectra of non-validated matches or recalibrated spectra (at the MS1 and/or MS2 level) of the validated matches; (ii) export of accessions and sequences of validated or non-validated proteins; (iii) export to the popular Progenesis LC-MS label free quantification software; (iv) export to graph databases like Cytoscape 49; (v) export of inclusion/exclusion lists for various instruments; and (vi) export of libraries in the SWATH format as specified by the manufacturer. Notably, the recalibration feature can be used when calibration issues are detected (as mentioned in the Results Navigation section), as is often the case with Time Of Flight (TOF) instruments. Also note that the export to graph databases provides a unique intuitive view on complex datasets with a graphical approach to the protein inference problem as displayed in Supplementary Figure

Supplementary Figure 28: By supporting export to graph database formats like Cytoscape, PeptideShaker allows for an intuitive visualization of protein inference problems, here represented by two examples from the example dataset, with proteins in red and peptides in blue. On the left, an ideal case where different peptides map uniquely to a single protein accession. On the right, individual proteins are supported both by unique and shared peptides.

PeptideShaker offers the possibility to export a summary of the methods used for identification under the Export > Methods Section menu. A text is automatically generated and can serve as a basis for inclusion in the methods section of manuscripts, as illustrated in Supplementary Figure 29 (see section Validating Proteins, Peptides and PSMs for an example of application).

Supplementary Figure 29: PeptideShaker automatically generates a draft of the methods used for protein identification. This text can serve as a starting point for manuscript writing.

The PeptideShaker project can be exported in various forms via the Export > PeptideShaker Project As menu: (i) Zip File: a zip file containing the saved project and all related files, (ii) mzIdentML: the identification results in the PSI 50 mzIdentML 23 format, and (iii) PRIDE XML: the peak lists and identification results in the standard PRIDE XML format. When exporting the entire project, the zip file can be shared and opened in the interface upon unzipping. Both mzIdentML and PRIDE XML files allow direct submission to PRIDE via ProteomeXchange within a few clicks. When creating an mzIdentML file, a dialog allows the annotation of the project as displayed in Supplementary Figure 30. Subsequently, a valid, fully annotated and very close to MIAPE 51 compliant mzIdentML file is created.

Supplementary Figure 30: The mzIdentML Export Dialog makes it easy to annotate and export mzIdentML files that can readily be submitted to PRIDE via ProteomeXchange.

Similarly, when creating a PRIDE XML file, the user is guided through the addition of the required metadata annotation by a user friendly interface, as displayed in Supplementary Figure 31. The generated file is comprehensively annotated with information available in PeptideShaker and with user input using controlled vocabulary, facilitated by the Ontology Lookup Service 52.

Supplementary Figure 31: The PRIDE Export Dialog makes it easy to create well-annotated PRIDE XML files that can readily be submitted to PRIDE via ProteomeXchange.

9.0 - Command Line Use

PeptideShaker can also be used via the command line and hence run in automated batch mode. Different command line modes are available: (i) PeptideShakerCLI: processes identification files and saves the project; (ii) ReportCLI: takes a saved project as input and exports the results as default reports or as custom reports; (iii) FollowUpCLI: takes a saved project as input and exports the previously described follow up features; (iv) MzidCLI: takes a saved project as input with the relevant annotation and exports an mzIdentML file. Note that all command line options for ReportCLI, FollowUpCLI and MzidCLI can also be used in the PeptideShakerCLI mode. Detailed information about the parameters can be found on the PeptideShaker website. Notably, the PeptideShaker command line version has already been included in Galaxy 53 by an independent third party.
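For batch processing, the command line modes can be wrapped in a small script. The Python sketch below only assembles and launches the Java command; it rests on several assumptions that should be checked against the PeptideShaker website: the Java package of the command line classes, the single-dash option style, and the jar file name are all placeholders, and no actual option names are hard-coded.

```python
import subprocess

# Assumed Java package of the command line classes; check the PeptideShaker
# website for the class path matching your version.
CLI_PACKAGE = "eu.isas.peptideshaker.cmd"
MODES = ("PeptideShakerCLI", "ReportCLI", "FollowUpCLI", "MzidCLI")


def build_command(jar_path, mode, options):
    """Build a java command line for one of the PeptideShaker CLI modes.

    options: mapping of option name (without leading dash) to value, taken
    from the online documentation; no option names are hard-coded here.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    command = ["java", "-cp", jar_path, f"{CLI_PACKAGE}.{mode}"]
    for name, value in options.items():
        command += [f"-{name}", str(value)]
    return command


def run(jar_path, mode, options):
    """Run the command and raise if the process exits with an error code."""
    subprocess.run(build_command(jar_path, mode, options), check=True)


# Hypothetical jar name and placeholder option, shown without executing anything
print(build_command("PeptideShaker.jar", "ReportCLI", {"some_option": "value"}))
```

Keeping the command construction separate from its execution makes it possible to log or inspect the exact command before a long batch run is started.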

10.0 - Documentation, Help, Support and Updates

Contextual help is available everywhere in the interface in the form of question marks as displayed in Supplementary Figure 32.

Supplementary Figure 32: Question marks are present everywhere in the interface, triggering contextual help.

The help links to external resources, publications and additional information on the PeptideShaker website. Beginners can also have a look at our general protein identification tutorials 14. All these resources are kept up-to-date with the development process of the software. For other questions there is also an active discussion group: peptide-shaker. Creating bug reports is easy via the Bug Report Dialog, Help > Bug Report, as displayed in Supplementary Figure 33. Please use the issue tracker at the PeptideShaker web page to report issues.

Supplementary Figure 33: If encountering a problem, the user is directed to online help directly from the interface, or a bug report with details can be sent to the developers for faster bug fixing.

New versions are regularly released, including bug fixes and new features. If an internet connection is available, the user is notified and an auto-update is proposed. Changes are documented for every version on the PeptideShaker website in the Release Note wiki and announced on the mailing list.
