PRIDE Inspector: a tool to visualize and validate MS proteomics data


Rui Wang, Antonio Fabregat, Daniel Ríos, David Ovelleiro, Joseph M. Foster, Richard G. Côté, Johannes Griss, Attila Csordas, Yasset Perez-Riverol, Florian Reisinger, Henning Hermjakob, Lennart Martens & Juan Antonio Vizcaíno

Supplementary Information

Document Contents
1. PRIDE Inspector Technical Implementation
1.1. General Information
1.2. Design and Implementation Details
1.3. New open source libraries made available with PRIDE Inspector
1.4. Updated open source libraries made available with PRIDE Inspector
1.5. PRIDE Inspector Java Web Start
2. PRIDE Inspector feature list
3. PRIDE Inspector charts documentation
4. Theoretical isoelectric point documentation
5. Protein status documentation
6. Abbreviations
7. References
Supplementary Figures

1. PRIDE Inspector Technical Implementation

1.1. General Information

System Requirements:
- Java: JRE 1.6+
- CPU: 1 gigahertz (GHz) or faster 32-bit or 64-bit processor
- Memory: 1 gigabyte (GB) RAM
- Hard Disk: 50 MB available
- Platform: tested on Mac OS X, Linux, and Windows

Additional Requirements:
- Internet access is needed to connect to the PRIDE MySQL public instance, download private PRIDE experiments, and get the protein details from the different protein databases supported.

Website and Source Code:

1.2. Design and Implementation Details

PRIDE Inspector is a standalone graphical user interface (GUI) application written in Java and released under the Apache 2 open source license. Figure 1 illustrates its overall design architecture, which contains five loosely coupled modules: the access module, the service module, the data model module, the GUI module and the application management framework.

The access module controls data retrieval from the underlying data source. It has a flexible design and was created with extensibility in mind. By default it comes bundled with three access libraries, one for each currently supported data source: two XML file access libraries, one for PRIDE XML and one for the community standard mzML (ref. 1), and a library to directly query a publicly accessible PRIDE MySQL database.

For PRIDE XML files, PRIDE-XML-JAXB, a new library developed by the PRIDE team, is used. It uses the Java Architecture for XML Binding (JAXB) API (Application Programming Interface) to generate a data model based on the PRIDE XML schema, which represents the data content in memory. It also uses XXIndex, a library that generates an XPath-based index of an XML file for efficient and flexible data retrieval. The PRIDE-XML-JAXB library is written in 100% Java, is open source, and is released with the PRIDE Inspector.

For mzML files, the jmzML library (ref. 2), previously developed in the team, is used [http://code.google.com/p/jmzml/]. It is based on the same techniques employed by the PRIDE-XML-JAXB library.

For direct access to the PRIDE MySQL public database instance, the access module provides an implementation using direct JDBC access with integrated connection pool management. Direct JDBC access allows for highly customizable and rapid data reads, whereas the use of c3p0 as a connection pool manager reduces the overhead of creating database connections.

Figure 1: PRIDE Inspector Design Architecture

At the heart of PRIDE Inspector is the service module. It defines a common set of APIs, which facilitate the communication between the access module at the lower level and the GUI module at the higher level. In addition, these APIs hold configurable in-memory data caches, which significantly reduce the number of duplicated data retrievals and make the GUI module more responsive. Although this module is designed to suit the particular needs of the PRIDE Inspector, it can be easily adapted to other projects that require generic access to different MS data file formats.
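The caching idea behind the service module can be illustrated with a small least-recently-used cache. This is only a sketch under stated assumptions: the class name `LruCache`, its methods, and the eviction policy are hypothetical and are not taken from the PRIDE Inspector code base, and it uses `java.util.function.Function`, which requires a newer Java version than the JRE 1.6 minimum listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical least-recently-used cache; not part of the PRIDE Inspector code base. */
class LruCache<K, V> {
    private final Map<K, V> entries;
    private int loads = 0; // how many times the underlying data source was queried

    LruCache(final int capacity) {
        // accessOrder = true moves an entry to the tail every time it is read,
        // so the head of the map is always the least recently used entry.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns the cached value, calling the loader only on a cache miss. */
    V get(K key, Function<K, V> loader) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);
            loads++;
            entries.put(key, value);
        }
        return value;
    }

    int loadCount() { return loads; }

    int size() { return entries.size(); }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.get("spectrum-1", id -> "peaks of " + id); // miss: loader is called
        cache.get("spectrum-1", id -> "peaks of " + id); // hit: served from memory
        System.out.println("loads = " + cache.loadCount());
    }
}
```

A GUI view that asks for the same spectrum twice would trigger only one read from the file or database, which is the effect the text attributes to the in-memory caches.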

The data model module is an abstraction layer between the data and the data representation. It is implemented using plain Java objects, which are the core objects used to handle information across input formats. Different formats often do not have a unified way of representing the same information, and they may also contain different aspects of experimental details. For instance, spectrum-related metadata in mzML and PRIDE XML are formatted differently. The data model represents a standardised view of the information in the underlying data sources. While reading, the raw content from the data source is first converted into objects from the data model module before the GUI module consumes them. This process of data transformation naturally depends on the input data source, and a utility API is provided to facilitate the extraction of the information from the original data source.

The GUI module of the PRIDE Inspector is organized as several views, each focusing on a particular aspect of the data. There are currently six views:
- Overview view: shows experimental metadata.
- Protein view: shows protein identifications.
- Peptide view: shows the peptides used to generate the protein identifications.
- Spectrum view: shows spectra and chromatograms.
- Quantification view: shows quantitative data for both proteins and peptides.
- Summary charts view: provides data charts for assessing data quality.

To enable maximum reusability, each view is implemented as an independent component using Java Swing. This way, more views can be added easily in the future. mzGraph Browser, the component responsible for visualizing spectra and chromatograms, is released as a self-contained Java library and distributed as part of the PRIDE Inspector. Both this component and the statistical charts in the Summary Charts view are implemented using the JFreeChart API.

Alongside the modules described above, a generic application management framework maintains the context and information shared by the whole environment. It consists of features frequently required by medium or large Java Swing based applications, such as lifecycle services for background tasks, in-memory caching, an event bus, user property management and error handling. The framework is independent of the PRIDE Inspector and can therefore be reused for other rich client applications.

1.3. New open source libraries made available with PRIDE Inspector

The following libraries are released with the PRIDE Inspector for the first time. They are open source libraries that can make it easier for developers to create MS-related software.

PRIDE XML JAXB
Website & Source Code:
Description: PRIDE XML JAXB is a library for indexing and parsing PRIDE XML 2.1 files. Unlike the conventional way of loading XML files, this library does not load the whole file into memory up-front. Instead, it employs an XML indexing technique to index the file on the fly, which results in fast access and a small memory footprint. Additionally, all entities from a PRIDE XML file are mapped into objects, and the internal references between the objects are resolved automatically. This gives direct access in the object model to entities that are only referenced by ID in the actual XML file.
License: Apache 2 open source license
Language: Java
External libraries:
- XXIndex: indexing XML files.
- JAXB: parsing XML snippets into an object model.
Table 1: PRIDE XML JAXB library

PRIDE Chart
Website & Source Code:
Description: The purpose of the PRIDE Quality Chart library is to provide a tool for creating charts to assess the quality of MS experiments. Currently, the library provides eight different charts, but more are under development:
1. Delta m/z: shows a relative frequency distribution of theoretical precursor ion mass - experimental precursor ion mass.
2. Number of peptides identified per protein: shows a bar chart with the number of peptides identified per protein for a single experiment.
3. Number of missed tryptic cleavages: shows the number of missed tryptic cleavages.
4. Average MS/MS spectrum: shows an average of all the spectra included in a single experiment.
5. Precursor ion charge distribution: shows a bar chart of precursor ion charge for a single PRIDE experiment.
6. Precursor ion masses distribution: shows a frequency distribution of precursor ion m/z for different precursor ion charges.
7. Number of peaks per spectrum: shows a histogram of the number of peaks per MS/MS spectrum in a single experiment.
8. Peak intensity distribution: shows a histogram of ion intensity vs. frequency for all MS/MS spectra in a single experiment.
License: Apache 2 open source license
Language: Java
External libraries:
- JFreeChart: creating static charts.
Table 2: PRIDE Chart library

PRIDE mzGraph Browser
Website & Source Code:
Description: PRIDE mzGraph Browser is a library for interactively visualizing and annotating mass spectra and chromatograms. It includes features such as:
- Zoom in/out.
- Export peak values.
- Save/print a spectrum or chromatogram as an image.
- Save a spectrum or chromatogram as a PDF.
- Highlight peak m/z and intensity values.
- Highlight mass differences.
- Display fragment ion annotations.
- Automatic annotation of the amino acid sequence in the spectra.
- Show/hide mass differences for amino acid annotation.
- Configurable fragment ion mass tolerance.
- Filtering on ion series.
- Filtering on annotation series.
License: Apache 2 open source license
Language: Java
External libraries:
- JFreeChart: creating charts.
- Java Swing library: creating the graphical user interface.
Table 3: PRIDE mzGraph Browser library

Nature Biotechnology: doi: /nbt.2112

1.4. Updated open source libraries made available with PRIDE Inspector

These are open source libraries previously developed by the PRIDE team. They were updated with new features during the development of PRIDE Inspector.

jmzML
Website & Source Code:
Description: jmzML provides a portable and lightweight JAXB-based implementation of the full mzML 1.1 standard format. mzML files are effectively indexed on the fly and used as swap files, with only the requested snippets of data loaded from a file when accessing it. Additionally, internal references in the mzML XML can be resolved automatically by jmzML, giving direct access in the object model to entities that are only referenced by ID in the actual XML file. Apart from reading indexed and non-indexed mzML files, jmzML also allows the writing of non-indexed mzML files.
License: Apache 2 open source license
Language: Java
External libraries:
- XXIndex: indexing XML files.
- JAXB: parsing XML snippets into an object model.
Table 4: jmzML library (ref. 2)

XXIndex
Website & Source Code:
Description: The XXIndex library was originally part of the PSI-DEV project. It addresses the need for quick data retrieval from large XML files. The key features of the library include:
- Easy to use: XXIndex uses XPath-like expressions to identify XML entities.
- Small memory footprint: unlike some XML indexing libraries, XXIndex does not need to load the whole file, large or small, into memory. Only the indices are stored, which keeps the memory requirement to a minimum.
- Quick indexing: although the speed of indexing depends on the complexity of the input XML file (the higher the XML tag density, the slower the indexing), in our experience a 500 MB XML file usually takes only around 50 seconds to index.
License: Apache 2 open source license
Language: Java
Table 5: XXIndex library
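The general idea of offset-based XML indexing, as described for XXIndex, PRIDE XML JAXB and jmzML, can be sketched as follows. This is a deliberately simplified illustration and not the XXIndex API: the class `XmlOffsetIndex` and its methods are hypothetical, the document is held in a `String` instead of being read from disk by byte offset, and nested elements of the same name and self-closing elements are not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of an element offset index; this is not the XXIndex API. */
class XmlOffsetIndex {
    /** Start (inclusive) and end (exclusive) offsets of one indexed element. */
    static final class Range {
        final int start;
        final int end;
        Range(int start, int end) { this.start = start; this.end = end; }
    }

    private final String xml;
    private final List<Range> ranges = new ArrayList<>();

    /** Scans the document once and records where each element named tag begins and ends. */
    XmlOffsetIndex(String xml, String tag) {
        this.xml = xml;
        String open = "<" + tag;
        String close = "</" + tag + ">";
        int from = 0;
        while (true) {
            int start = xml.indexOf(open, from);
            if (start < 0) break;
            int after = start + open.length();
            if (after >= xml.length()) break;
            char next = xml.charAt(after);
            // Skip longer tag names that merely start with the requested name.
            if (!Character.isWhitespace(next) && next != '>') {
                from = after;
                continue;
            }
            int end = xml.indexOf(close, after);
            if (end < 0) break;
            end += close.length();
            ranges.add(new Range(start, end));
            from = end;
        }
    }

    int count() { return ranges.size(); }

    /** Returns only the requested snippet, which could then be handed to an XML parser. */
    String snippet(int i) {
        Range r = ranges.get(i);
        return xml.substring(r.start, r.end);
    }

    public static void main(String[] args) {
        String xml = "<spectrumList><spectrum id=\"a\">1</spectrum>"
                + "<spectrum id=\"b\">2</spectrum></spectrumList>";
        XmlOffsetIndex index = new XmlOffsetIndex(xml, "spectrum");
        System.out.println(index.count() + " elements; second = " + index.snippet(1));
    }
}
```

Only the list of offsets needs to stay in memory; with a file on disk, each snippet would be read on demand from its recorded position and unmarshalled on its own.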

1.5. PRIDE Inspector Java Web Start

In addition to the desktop application, PRIDE Inspector has also been integrated into the PRIDE website in the form of a Java Web Start application. The benefits of this alternative are:
1. It does not require manual installation: users can launch it directly from PRIDE's home page (Figure 2) and start viewing PRIDE experiments straightaway.
2. The Web Start version offers a more intuitive way of browsing the data already in the PRIDE database. For instance, one can use the search facility of the PRIDE website to identify a subset of experiments, which can then be cherry-picked for viewing in the PRIDE Inspector (Figure 3).
3. It also enables a one-click experience for journal reviewers and editors. Traditionally, to review experiments in PRIDE, they were given a reviewer user name and password and then needed to log in to PRIDE manually to review the data. With the Web Start, only a single URL is needed. Once it is clicked, users can choose either to launch the PRIDE Inspector or to continue to the normal PRIDE website (Figure 4).

Figure 2: Screenshot that shows how to launch PRIDE Inspector from PRIDE's home page: when users access the home page of the PRIDE website, PRIDE Inspector can be started there directly.


Figure 3: Screenshot that shows how to launch the PRIDE Inspector from PRIDE's search results: on the search result page, users can also use the Web Start. There are always two options: 1. View a single experiment. 2. View multiple selected experiments.

Figure 4: Journal reviewer/editor access: only a single URL is required. Once it is clicked, a page is shown that offers users the choice of either using the PRIDE Inspector Web Start or logging in to the PRIDE website.

2. PRIDE Inspector feature list

A. Key Features
1. Rapid loading of mzML and PRIDE XML files
2. Search, access and download PRIDE's public experiments through a PRIDE public MySQL instance
3. Distinct views on spectra, chromatograms, proteins, peptides, quantification and experiment details
4. Visualize all spectra and chromatograms with their annotations
5. Download additional protein details, such as the protein name and the most up-to-date protein sequence (for the following protein sequence databases: UniProt, UniParc, ENSEMBL, IPI and NCBI nr database)
6. Visualize protein sequences and highlight different features in the protein sequence such as identified peptides and PTMs
7. A decoy identification filter is available on both the protein and peptide tabs
8. Visualize quantitative data for both protein and peptide identifications
9. Display a short summary of key measurements on experiment quality
10. Ability to perform an initial data quality assessment using a statistical view with different charts
11. User-friendly download facility for private PRIDE experiments (suitable for journal reviewers and editors)
12. Rich documentation on usage and features

B. Spectrum and Chromatogram Related Features
1. Automatic annotation for spectra using MS2 fragment ions (if ion assignments are submitted)
2. Support for various ion types, including co-eluting and immonium ions
3. Amino acid annotation for different fragment ion series within a spectrum

4. Filtering based on types of fragment ions and amino acid annotations
5. Ability to show/hide mass differences for amino acid annotations
6. Configurable fragment ion mass error tolerance for performing the amino acid annotations
7. Manual annotation of peaks in a spectrum
8. Save a spectrum or chromatogram as an image. Supported formats include SVG, PNG, JPEG and GIF
9. Save a spectrum or chromatogram as a PDF file
10. Print a spectrum or chromatogram as an image
11. Zoom in/out on each spectrum or chromatogram
12. Show background grid for both spectrum and chromatogram
13. Highlight peaks with m/z and intensity values in a spectrum
14. Export m/z and intensity values for the selected spectrum into a tab-delimited file
15. Show mass differences for selected peaks
16. Automatic suggestion of possible amino acids and charges based on mass differences of the selected peaks
17. Highlight peak with its m/z and intensity value
18. Show categorized metadata for each spectrum or chromatogram
19. Batch loading for experiments containing a large number of spectra
20. Ability to show/hide the side information panel for spectra and chromatograms
21. Easy adjustment of the display size of spectra and chromatograms
22. Show MS level and precursor details for each spectrum
23. Calculate the total intensity for each spectrum
24. Count the number of peaks for each spectrum

C. Protein and Peptide Related Features
1. Show all protein identifications, together with the peptides and the spectra used for identification

2. Show general details about protein identifications (such as search engine, search database, etc.)
3. Automatic correction of poorly formatted protein accessions
4. Use web services to retrieve protein names, protein status and protein sequences from different protein databases (for UniProt, UniParc, ENSEMBL, IPI and the NCBI nr database)
5. Calculate the protein sequence coverage based on the protein sequence and identified peptide sequences
6. Display and highlight whether the originally identified peptide sequences can fit in the most recent version of the protein sequence (for UniProt, UniParc, ENSEMBL, IPI and the NCBI nr database)
7. Calculate the theoretical pI for both protein and peptide identifications
8. Show the number of peptides and unique peptides for each protein identification
9. Show the total number of PTMs for each protein or peptide
10. Show all peptides, together with PTMs and spectra used for identification
11. Colour coding of the PTM-modified amino acids for each peptide
12. Show a summary of mass differences and modified amino acids for all existing PTMs
13. Calculate delta m/z for each peptide (difference between the theoretical and the experimental m/z). Highlighting is also done according to the chosen threshold (4 Da)
14. Show precursor details for each peptide
15. Show the number of fragment ions for each peptide
16. Display the search engine peptide scores (currently available for Mascot, Sequest, OMSSA, X!Tandem and SpectrumMill)
17. Show the length, start and stop position of each peptide
18. Hyperlink all valid protein accessions and PTM accessions
19. Show additional details on both proteins and peptides

D. Protein Sequence Viewer Features
1. Display protein details, such as protein accession and name
2. Display peptide matching details, such as the number of peptides matched and the number of distinct peptides
3. Display protein coverage
4. Highlight peptides in the protein sequence
5. Highlight PTMs in the protein sequence
6. Highlight the overlapping peptides

E. Decoy Filter Features
1. Set a decoy filter based on the position of the decoy filter string for each protein identification
2. Show only decoy protein and peptide identifications
3. Show only non-decoy protein and peptide identifications
4. Undo decoy filter

F. Experiment Detail Related Features
1. Display experiment details, including sample, protocol and instrument configurations. The overview tab is split into three different views: Experiment General, Sample and Protocol, and Instrument and Processing
2. Hyperlink contact s
3. Hyperlink PubMed ID for each reference
4. Hyperlink Tranche hashes as download links

G. Experiment Summary Related Features
1. Check the availability of spectra, proteins and peptides
2. Check the availability of quantitative data
3. Check the number of identified spectra that are missing

4. Display distinct PTMs within an experiment
5. Show Tranche links if available
6. Show the reported peptide FDR if available
7. Show ratios of both decoy peptide identification

H. Quantification Related Features
1. Show the quantification method that has been used
2. Show a summary of sample details for each quantification reagent
3. Show quantification ratios for both proteins and peptides based on a selected control sample
4. Set a different control sample for the quantification, recalculating the ratios
5. Display bar charts based on quantification ratios
6. Filter and export quantification results
7. Map filtered protein identifications to the Ensembl web Karyotype Viewer
8. Compare protein quantification values in a histogram
9. Save/print the protein quantification comparison histogram
10. Show the corresponding spectrum for each peptide

I. Summary Chart Related Features
1. Delta m/z distribution for all peptides
2. Histogram on number of peptides per protein identification (percentage of controversial identifications)
3. Histogram on the distribution of missed tryptic cleavages for all peptides
4. Average MS/MS spectrum (for all spectra, or filtering between identified and unidentified spectra)
5. Histogram on the distribution of precursor ion charge for identified spectra

6. Precursor ion masses distribution for all spectra (for all, or filtering between identified and unidentified spectra). It is also possible to visualize human, mouse and PRIDE reference curves
7. Histogram on number of peaks per spectrum
8. Histogram on peak intensity for all spectra (for all, or filtering between identified and unidentified spectra)
9. Histogram on number of peaks per spectrum
10. Save/print all charts as images
11. Zoom in/out ability for all the charts
12. Easy navigation between charts

J. Data Export Features
1. Export all PRIDE public experiments to PRIDE XML file format
2. Export all spectra to Mascot Generic File (MGF) format
3. Export spectrum related details for all spectra (such as precursor ion details, number of peaks and sum of peak intensity)
4. Export protein related details for all protein identifications (such as number of peptides and number of PTMs)
5. Export protein to peptide mappings
6. Export peptide related details for all peptides (such as modified peptide sequence, delta m/z and number of fragment ions)

K. Other Important Features
1. Multiple table row selection for export
2. Sorting for all table columns
3. Show/hide table columns
4. Search content within a table
5. Show/hide the side panel

6. Open compressed (gzipped) PRIDE XML and mzML files
7. Full support for PRIDE XML files generated by the Waters ProteinLynx Global Server (PLGS) software

L. Web Start Related Features
1. One click to launch PRIDE Inspector directly from the home page of PRIDE, without the need for manual download or installation
2. Select and open single or multiple experiments based on search results in PRIDE Inspector
3. Capable of opening both public and private experiments (log in required)
4. Single URL to access all the experiments belonging to the same data submitter or journal reviewer
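The protein sequence coverage listed among the protein and peptide features (C.5) can be illustrated with a short sketch. The class and method names are hypothetical, and the approach shown (exact substring matching of unmodified peptide sequences, counting every occurrence) is an assumption made for illustration rather than a description of how PRIDE Inspector computes the value.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a sequence coverage calculation. */
class SequenceCoverage {
    /** Fraction (0.0 to 1.0) of protein residues covered by at least one peptide. */
    static double coverage(String protein, List<String> peptides) {
        if (protein.isEmpty()) return 0.0;
        boolean[] covered = new boolean[protein.length()];
        for (String peptide : peptides) {
            if (peptide.isEmpty()) continue;
            // Mark every occurrence of the peptide, including overlapping ones.
            int pos = protein.indexOf(peptide);
            while (pos >= 0) {
                for (int i = pos; i < pos + peptide.length(); i++) {
                    covered[i] = true;
                }
                pos = protein.indexOf(peptide, pos + 1);
            }
        }
        int count = 0;
        for (boolean c : covered) {
            if (c) count++;
        }
        return (double) count / protein.length();
    }

    public static void main(String[] args) {
        List<String> peptides = Arrays.asList("TAYIAK", "AKQR");
        System.out.println(coverage("MKTAYIAKQR", peptides));
    }
}
```

Overlapping peptides are counted once per residue, so two peptides sharing a stretch of sequence do not inflate the coverage.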


3. PRIDE Inspector charts documentation

Delta m/z
A relative frequency distribution of experimental precursor ion mass (m/z) - theoretical precursor ion mass (m/z). Mass deltas close to zero reflect more accurate identifications and also indicate that the amino acid modifications and charges have been reported accurately. This plot can highlight systematic bias if it is not centred on zero. Other distributions can reflect modifications not being reported properly.
In this example we can clearly see that the distribution is centred close to zero with very little spread. Peptide sequences, charges and modifications have been accurately reported and the instrument calibration was fine.

Peptides per Protein
A bar chart displaying the percentage of protein identifications in the whole experiment according to the total number of peptides used to report the identification. Proteins supported by more peptide identifications can constitute more confident results.
Note: To investigate further, one can sort the proteins by number of peptide identifications in the Protein view.
In this experiment 50% of the proteins were identified through one peptide only. The rest of the protein identifications, especially the ones with higher peptide numbers, can be considered more reliable.
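The delta m/z value described at the start of this section can be sketched for an unmodified peptide as follows, using the experimental minus theoretical convention given there. The class and method names are hypothetical. The residue masses are standard monoisotopic values rounded to five decimals, and modifications, which a real calculation would have to add to the theoretical mass, are ignored in this sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: theoretical precursor m/z of an unmodified peptide and its delta m/z. */
class DeltaMz {
    private static final double WATER = 18.01056;  // monoisotopic mass of H2O in Da
    private static final double PROTON = 1.00728;  // mass of a proton in Da
    private static final Map<Character, Double> RESIDUE = new HashMap<>();

    static {
        // Monoisotopic residue masses in Da, listed in the same order as the letters.
        String letters = "GASPVTCLINDQKEMHFRYW";
        double[] masses = {
            57.02146, 71.03711, 87.03203, 97.05276, 99.06841,
            101.04768, 103.00919, 113.08406, 113.08406, 114.04293,
            115.02694, 128.05858, 128.09496, 129.04259, 131.04049,
            137.05891, 147.06841, 156.10111, 163.06333, 186.07931
        };
        for (int i = 0; i < letters.length(); i++) {
            RESIDUE.put(letters.charAt(i), masses[i]);
        }
    }

    /** m/z of the peptide carrying the given number of protons. */
    static double theoreticalMz(String peptide, int charge) {
        double mass = WATER;
        for (char c : peptide.toCharArray()) {
            Double residue = RESIDUE.get(c);
            if (residue == null) {
                throw new IllegalArgumentException("Unknown residue: " + c);
            }
            mass += residue;
        }
        return (mass + charge * PROTON) / charge;
    }

    /** Experimental minus theoretical precursor m/z. */
    static double deltaMz(double experimentalMz, String peptide, int charge) {
        return experimentalMz - theoreticalMz(peptide, charge);
    }

    public static void main(String[] args) {
        System.out.println(theoreticalMz("GG", 1));
        System.out.println(deltaMz(133.07, "GG", 1));
    }
}
```

A peptide whose modification was not reported would show up here as a delta close to the modification mass divided by the charge, which is the kind of off-centre signal the chart is meant to expose.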

Number of Missed Tryptic Cleavages
A histogram representing the percentage of peptides in the experiment with each number of missed tryptic cleavages. This graph is only applicable to experiments where trypsin is used. Two assumptions were made for these calculations: first, the enzyme used in the experiment is trypsin; second, the enzyme cleaves on the C-terminal side of K or R, except if P is C-terminal to the K or R.
This chart can be used to compare several experiments in which the same number of missed cleavages has been used as a search parameter and the same experimental conditions were used. A dramatic change in the shape of the chart could then mean a change in the efficiency of the trypsin used (though many other factors can also be the reason, such as a change in the parameters of the search engine, database size and other experimental causes).
In practice, this chart has two immediate applications. First, it allows checking that the search engine is working correctly and that the number of missed cleavages found in the identified peptides matches the "missed cleavages" parameter used in the search engine. Second, knowing the distribution in this chart, the researcher can adjust the number of missed cleavages used in future searches: for example, using 4 missed cleavages instead of 1 may produce only a 0.1% increase in peptide identifications while making searches 10 times longer.
This is an example with only about 10% of the peptides containing one missed cleavage.

Average MS/MS spectrum
This graph is obtained by adding all the MS/MS spectra in the experiment. The result is an averaged spectrum. The highest peaks reflect abundant and intense peaks in the overall set of MS/MS spectra. Only peaks that are both intense and ubiquitous will be displayed here: contaminants, reagents used in the experiment, and frequent fragmentations from highly common peptides.
The next chart shows an example of a public experiment in PRIDE that uses iTRAQ reagents for quantification. The zoom has been used to show the highlighted information in detail.
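The cleavage rule stated for the missed tryptic cleavages chart above (cleavage C-terminal to K or R, except when the next residue is P) translates directly into a counting function. The following is a minimal sketch with hypothetical class and method names; it also shows how the per-peptide counts could be turned into the percentages plotted in the histogram.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of the missed tryptic cleavage count and its distribution. */
class MissedCleavages {
    /** Counts internal K or R residues that are not followed by P. */
    static int count(String peptide) {
        int missed = 0;
        // The last residue is excluded: a K or R there is the cleavage that did happen.
        for (int i = 0; i < peptide.length() - 1; i++) {
            char c = peptide.charAt(i);
            if ((c == 'K' || c == 'R') && peptide.charAt(i + 1) != 'P') {
                missed++;
            }
        }
        return missed;
    }

    /** Percentage of peptides for each number of missed cleavages. */
    static Map<Integer, Double> distribution(List<String> peptides) {
        Map<Integer, Double> percentages = new TreeMap<>();
        if (peptides.isEmpty()) return percentages;
        double share = 100.0 / peptides.size();
        for (String peptide : peptides) {
            percentages.merge(count(peptide), share, Double::sum);
        }
        return percentages;
    }

    public static void main(String[] args) {
        List<String> peptides = Arrays.asList("PEPTIDEK", "AKDEFR", "AKPDEFR", "AKRDEFK");
        System.out.println(distribution(peptides));
    }
}
```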

Precursor Ion Charge
A bar chart representing the distribution of the precursor ion charges for the whole experiment. This information can be used to identify potential ionization problems, including many 1+ charges from an ESI ionization source or an unexpected distribution of charges. MALDI experiments are expected to contain almost exclusively 1+ charged ions. An unexpected charge distribution may furthermore be caused by specific search engine parameter settings, such as limiting the search to specific ion charges.
In this ESI experiment there are no singly charged ions, only doubly and triply charged ones.

Precursor Ion Masses
A relative frequency distribution of precursor ion masses for the experiment (red curve) against a reference (if selected by the user). It is possible to filter the information for all, identified and unidentified spectra. Three references are available to users:
1- Empirically derived precursor ion mass distributions from all PRIDE experiments that have a single tryptic digest step annotation associated with them, with upper and lower quartiles. This reference aims to provide a species-independent distribution.
2- Reference obtained in an analogous way from PRIDE human experiments.

3- Reference obtained in an analogous way from PRIDE mouse experiments.
Experiments that only contained peptides without missed cleavages were ignored, as such results are caused by specific search engine parameters and do not reflect the biological background. These peptides are generally shorter, and thus these experiments would shift the overall distribution towards lower masses.
A (red) curve that lies to the left of the empirical distribution (shown in a different colour) indicates a disproportionate number of lower mass peptides being identified or fragmented. Analogously, a (red) curve that lies to the right of the empirical distribution indicates a disproportionate number of higher mass peptides being identified or fragmented. Such alterations may be caused by the general amino acid composition of the organism being investigated or by the digestion protocol used (non-tryptic), and do not necessarily indicate a problem in your experiment. For human, the average tryptic peptide mass is 1,100 Da.
This distribution should encompass this average. The shift to the right in this distribution is expected because missed cleavages result in higher mass peptides.

Peaks per MS/MS spectrum
A histogram representing the number of peaks per MS/MS spectrum in the whole experiment. This chart assumes centroid data. Too few peaks can indicate poor fragmentation or a detector fault, whereas a large number of peaks can indicate very noisy spectra. This chart depends heavily on the pre-processing steps applied to the spectra (centroiding, deconvolution, peak picking approach, ...).

Peak Intensity Distribution
A histogram representing the ion intensity vs. frequency for all MS2 spectra in the whole experiment. It is possible to filter the information for all, identified and unidentified spectra. This plot can give a general estimate of the noise level of the spectra. Generally, one should expect a high number of low intensity noise peaks and a low number of high intensity signal peaks. A disproportionate number of high signal peaks may indicate heavy spectrum pre-filtering or potential experimental problems. In the case of data reuse, this plot can be useful in identifying whether the spectra require preprocessing prior to any downstream analysis.
The quality of identifications is not linked to this data, as most search engines perform internal spectrum preprocessing before matching the spectra. Thus, the spectra reported in a PRIDE experiment are not necessarily preprocessed, since the search engine may have applied the preprocessing step internally. This preprocessing is not necessarily reported in the experimental metadata.

4. Theoretical Isoelectric Point (pI) Documentation

In the PRIDE Inspector, the theoretical isoelectric point is calculated by default for proteins and peptides. The value of the isoelectric point can be used as a filtering technique to validate peptide identifications (refs. 3, 4). The algorithm used to estimate the theoretical isoelectric point was published by Bjellqvist and coworkers (ref. 5). The pI is calculated using the pK values of the amino acids described in that work. These values were defined by examining polypeptide migration between pH 4.5 and 7.3 in an immobilized pH gradient gel environment with 9.2 M and 9.8 M urea. The authors reported a standard deviation of 0.2 units for the entire pH range. Most recent isoelectric point calculators implement the Bjellqvist algorithm (refs. 6-8).
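The general shape of a pK-based pI calculation can be sketched as a search for the pH at which the net charge of the sequence is zero. The sketch below is not the PRIDE Inspector implementation. The class name, the bisection search, and in particular the pK constants are assumptions chosen for illustration; they are not a transcription of the published table, which also contains terminus values that depend on the terminal residue, and should be replaced with the published values for any real use.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a pK-based isoelectric point estimate. */
class IsoelectricPoint {
    // ASSUMPTION: illustrative pK values only, not the published Bjellqvist table.
    private static final double PK_N_TERM = 7.5;
    private static final double PK_C_TERM = 3.55;
    private static final Map<Character, Double> POSITIVE = new HashMap<>();
    private static final Map<Character, Double> NEGATIVE = new HashMap<>();

    static {
        POSITIVE.put('K', 10.0);
        POSITIVE.put('R', 12.0);
        POSITIVE.put('H', 5.98);
        NEGATIVE.put('D', 4.05);
        NEGATIVE.put('E', 4.45);
        NEGATIVE.put('C', 9.0);
        NEGATIVE.put('Y', 10.0);
    }

    /** Charge contribution of a basic group (between 0 and +1). */
    private static double positive(double pK, double pH) {
        return 1.0 / (1.0 + Math.pow(10, pH - pK));
    }

    /** Charge contribution of an acidic group (between -1 and 0). */
    private static double negative(double pK, double pH) {
        return -1.0 / (1.0 + Math.pow(10, pK - pH));
    }

    /** Net charge of the sequence at the given pH (Henderson-Hasselbalch). */
    static double netCharge(String sequence, double pH) {
        double charge = positive(PK_N_TERM, pH) + negative(PK_C_TERM, pH);
        for (char residue : sequence.toCharArray()) {
            Double pK = POSITIVE.get(residue);
            if (pK != null) charge += positive(pK, pH);
            pK = NEGATIVE.get(residue);
            if (pK != null) charge += negative(pK, pH);
        }
        return charge;
    }

    /** Bisection on pH 0 to 14; the net charge decreases monotonically with pH. */
    static double isoelectricPoint(String sequence) {
        double lo = 0.0;
        double hi = 14.0;
        while (hi - lo > 1e-4) {
            double mid = (lo + hi) / 2.0;
            if (netCharge(sequence, mid) > 0) {
                lo = mid;
            } else {
                hi = mid;
            }
        }
        return (lo + hi) / 2.0;
    }

    public static void main(String[] args) {
        System.out.println(isoelectricPoint("GG"));
        System.out.println(isoelectricPoint("KKKK"));
    }
}
```

For a sequence without ionizable side chains, the result is simply the midpoint of the two terminus pK values, which is a convenient sanity check for whichever pK table is plugged in.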

5. Protein status documentation

In MS proteomics based experiments, potentially identified proteins are reported using the searched database's proprietary identifiers. These identifiers are unstable: they can change or even be deleted over time. Deletion happens if, for instance, hypothetical proteins are removed when gene prediction algorithms are updated or new biological evidence becomes available. In a recent paper we investigated the impact of changing protein identifiers on stored proteomics data over time 9. We found that in several cases 10-20% of the reported identifiers were no longer valid only a year after the experimental results had been published.

To highlight this problem to the user and to keep the reported data usable, PRIDE Inspector can automatically check the status of the reported identifications. To do this we integrated specific components that access the identification's source database and retrieve the current identifier status. If the identifier was only updated, the new accession is automatically displayed in the protein table and the updated sequence is retrieved. In some cases, even though a protein's identifier did not change, its underlying sequence was altered in the protein database. Therefore, PRIDE Inspector automatically fetches a protein's current sequence and checks whether the reported peptides still fit this identification.

When using the "Obtain Protein Details" feature in the PRIDE Inspector, the status of the protein according to the original database is downloaded in addition to the protein name and protein sequence. The status is one of the following:

Active: the protein still exists in the original database, and the details remain unchanged.
Unknown: the protein does not exist in the original database.
Deleted: the protein has been removed from the original database.
Merged: the protein has been merged with other proteins to form a new protein.
Demerged: the protein has been split into two or more proteins.
Changed: there have been some changes to this protein, but the type of the change is unknown.
Error: there is an error associated with this protein.

To summarize, there are three main results for a protein's status: active, changed, and deleted. For UniProtKB, changed identifiers are subdivided into merged and demerged identifiers. The main reason for demerging is that new identifiers were created for every species a protein was identified in, as well as for the various genes a protein can come from. Merging mainly happens when, based on new gene prediction algorithms, proteins that were previously believed to be distinct are considered to come from the same gene.

The International Protein Index (IPI) database has been discontinued since September. Therefore, PRIDE Inspector can only report whether a given identifier was still active in the last IPI release but cannot report on changed or deleted identifiers.
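The status values and the peptide check described in this section can be sketched as follows. This is a minimal illustration rather than the PRIDE Inspector code: the type and method names are hypothetical, and the "fit" test is assumed here to be a plain case-insensitive substring match of each reported peptide against the protein's current sequence, ignoring subtleties such as isobaric I/L residues.

```java
/** Status values listed in this section (hypothetical enum, for illustration). */
enum ProteinStatus {
    ACTIVE, UNKNOWN, DELETED, MERGED, DEMERGED, CHANGED, ERROR;

    /** True for the "changed" family: merged, demerged or an unspecified change. */
    boolean isChanged() {
        return this == MERGED || this == DEMERGED || this == CHANGED;
    }
}

final class PeptideFitSketch {

    /**
     * ASSUMPTION: a peptide "fits" when it occurs as a contiguous stretch
     * of the protein's current sequence (case-insensitive exact match).
     */
    static boolean fits(String proteinSequence, String peptide) {
        if (proteinSequence == null || peptide == null || peptide.isEmpty()) {
            return false;
        }
        return proteinSequence.toUpperCase().contains(peptide.toUpperCase());
    }

    /** Counts reported peptides that no longer match the current sequence. */
    static int countUnmatched(String proteinSequence, String[] peptides) {
        int unmatched = 0;
        for (String peptide : peptides) {
            if (!fits(proteinSequence, peptide)) {
                unmatched++;
            }
        }
        return unmatched;
    }
}
```

A non-zero result from `countUnmatched` would indicate that the sequence behind an unchanged identifier has been altered since the identification was reported.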

6. Abbreviations

API: Application Programming Interface
CPU: Central Processing Unit
GUI: Graphical User Interface
IPI: International Protein Index
JAXB: Java Architecture for XML Binding
JDBC: Java DataBase Connectivity
JRE: Java Runtime Environment
mgf: Mascot Generic File
PRIDE: PRoteomics IDEntifications (database)
PSI: Proteomics Standards Initiative
PTM: Post-Translational Modification
URL: Uniform Resource Locator

7. References

1. Martens, L. et al. mzML--a community standard for mass spectrometry data. Mol Cell Proteomics 10, R (2011).
2. Cote, R.G., Reisinger, F. & Martens, L. jmzML, an open-source Java API for mzML, the PSI standard for MS data. Proteomics 10, (2010).
3. Cargile, B.J., Talley, D.L. & Stephenson Jr., J.L. Immobilized pH gradients as a first dimension in shotgun proteomics and analysis of the accuracy of pI predictability of peptides. Electrophoresis 25, (2004).
4. Heller, M. et al. Added Value for Tandem Mass Spectrometry Shotgun Proteomics Data Validation through Isoelectric Focusing of Peptides. Journal of Proteome Research 4, (2005).
5. Bjellqvist, B. et al. The focusing positions of polypeptides in immobilized pH gradients can be predicted from their amino acid sequences. Electrophoresis 14, (1993).
6. Skoog, B. & Wichman, A. Calculation of the isoelectric points of polypeptides from the amino acid composition. Trends in Analytical Chemistry 5, (1986).
7. Gasteiger, E. et al. ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res 31, (2003).
8. Gauci, S., van Breukelen, B., Lemeer, S.M., Krijgsveld, J. & Heck, A.J.R. A versatile peptide pI calculator for phosphorylated and N-terminal acetylated peptides experimentally tested using peptide isoelectric focusing. Proteomics 8, (2008).
9. Griss, J., Cote, R.G., Gerner, C., Hermjakob, H. & Vizcaíno, J.A. Published and perished? The influence of the searched protein database on the long-term storage of proteomics data. Mol Cell Proteomics 10, M (2011).

8. Supplementary Figure Legends

Supplementary Figure 1. PRIDE Data Content Growth. Growth of data content in PRIDE from September 2006 until September. The data included in the graph are the number of spectra, number of protein identifications, number of peptide identifications and number of unique peptides.

Supplementary Figure 2. Data formats currently supported by PRIDE Inspector.

Supplementary Figure 3: PRIDE XML Overview: Experiment General View. The PRIDE XML Overview: Experiment General tab contains basic metadata about an experiment file: experiment and project titles, contact information, software used for file generation and original file format, amongst others.

Supplementary Figure 4: PRIDE XML Overview: Sample & Protocol View. The PRIDE XML Overview: Sample & Protocol tab contains metadata about the sample (species, disease status) and the experimental protocol used in the experiment.

Supplementary Figure 5: PRIDE XML Overview: Instrument & Processing View. The PRIDE XML Overview: Instrument & Processing tab contains metadata about the instrument configuration and software used.

Supplementary Figure 6: mzML Overview: Instrument & Processing View. The mzML Overview: Instrument & Processing tab contains metadata about the instrument configuration and software used.

Supplementary Figure 7: Protein View with Spectrum Viewer. The Protein View tab contains three parts: details on the submitted proteins can be observed in the upper section. Information about the corresponding peptide identifications is available in the second section. In the bottom window the Spectrum Viewer is displayed.

Supplementary Figure 8: Protein View with Sequence Viewer. The Protein View tab contains three parts: details on the submitted proteins can be observed in the upper section. Information about the corresponding peptide identifications is available in the second section. In the bottom window the Sequence Viewer is displayed, showing useful statistics.

Supplementary Figure 9. Spectrum Viewer Filtering Options. In this example only b ion fragment ions and b ion amino acid annotations are enabled.

Supplementary Figure 10: The Peptide View tab contains three parts: the upper part contains peptide details with the sequences and the corresponding protein identifications, the calculated delta m/z value, the corresponding PTMs and the peptide scores assigned by the search engines, if available. The second section is focused on PTMs: PSI-MOD accession number, location, modified residues and mass information. In the bottom window the Spectrum Viewer is displayed.

Supplementary Figure 11: The Peptide View tab with the Sequence Viewer. The chosen peptide is highlighted.

Supplementary Figure 12: Spectrum View. The Spectrum View tab shows all mass spectra (identified + unidentified) included in the experiment. Spectrum details and the related scan and precursor information are accessible from the upper right window. It is possible to perform de novo sequencing in the Spectrum Viewer (not shown).

Supplementary Figure 13. mzML Total Ion Chromatogram. If chromatograms are included in the mzML file, PRIDE Inspector can display them in the Spectrum View tab. Note that chromatograms cannot be included in PRIDE XML files.

Supplementary Figure 14: PRIDE XML Summary Charts View. The PRIDE XML Summary Charts tab currently displays up to nine different charts that can help to assess the quality of the data: Delta m/z, Peptides per Protein, Missed Tryptic Cleavages, Average MS/MS Spectrum, Precursor Ion Charge, Precursor Ion Masses, Peaks per MS/MS Spectrum and Peak Intensity.

Supplementary Figure 15. Delta m/z chart Help Page. The Help page of the Delta m/z chart is available via the information button on the Summary Charts view. It is shown as an example of the Help functionality integrated in the PRIDE Inspector application.

Supplementary Figure 16. Delta m/z chart. By clicking the zoom button on any of the charts included in the Summary Charts tab (see Supplementary Figure 14), the chart can be viewed at full screen resolution. In this case, the Delta m/z chart is shown as an example of this feature.

Supplementary Figure 17. Number of Peptides Identified per Protein chart. This bar chart displays the number of peptides identified per protein in a single experiment.

Supplementary Figure 18. mzML Summary Charts View. The mzML Summary Charts tab currently displays up to four charts: Average MS/MS Spectrum, Precursor Ion Masses, Peaks per MS/MS Spectrum, and Peak Intensity.

Supplementary Figure 19: PRIDE Inspector Quantification view. If available, the Quantification view provides differential expression level details of proteins across different samples, depending on the labeling method used (in the example: isobaric tags, iTRAQ).

Supplementary Figure 20: PRIDE Inspector Search PRIDE Panel. The PRIDE Inspector Search PRIDE panel gives access to the public database, and the public experiments can be searched flexibly based on their annotation. Experiments can be viewed by clicking the "+" sign.

Supplementary Figure 21: PRIDE Inspector Search PRIDE Panel Search Result. Search result for the term "mito" in the PRIDE Inspector Search PRIDE panel.

Supplementary Figure 22. PRIDE Inspector Export Options. The following export options are available: the spectra can be exported in Mascot Generic File (mgf) format and as a spectrum table. Additionally, the identifications can be exported as Protein-Peptide Mappings, and as detailed protein or peptide tables in a tab-delimited format.

Supplementary Figure 23. PRIDE Inspector Welcome Screen. The PRIDE Inspector Welcome Screen gives the user a quick overview of the tool and highlights its main functionalities.

Supplementary Figure 24: PRIDE Inspector Export tab. PRIDE XML files of searched public experiments (see Supplementary Figures 20-21) can be downloaded.

Supplementary Figure 1: PRIDE Data Content Growth. Growth of data content in PRIDE from September 2006 until September. The data included in the graph are the number of spectra, number of protein identifications, number of peptide identifications and number of unique peptides.

* The presence of these elements is not enforced in the file format specification.
Supplementary Figure 2: Data formats currently supported by PRIDE Inspector.

Supplementary Figure 3: PRIDE XML Overview: Experiment General View. The PRIDE XML Overview: Experiment General tab contains basic metadata about an experiment file: experiment and project titles, contact information, software used for file generation and original file format, amongst others.

Supplementary Figure 4: PRIDE XML Overview: Sample & Protocol View. The PRIDE XML Overview: Sample & Protocol tab contains metadata about the sample (species, disease status) and the experimental protocol used in the experiment.

Supplementary Figure 5: PRIDE XML Overview: Instrument & Processing View. The PRIDE XML Overview: Instrument & Processing tab contains metadata about the instrument configuration and software used.

Supplementary Figure 6: mzML Overview: Instrument & Processing View. The mzML Overview: Instrument & Processing tab contains metadata about the instrument configuration and software used.

Supplementary Figure 7: Protein View with Spectrum Viewer. The Protein View tab contains three parts: details on the submitted proteins can be observed in the upper section. Information about the corresponding peptide identifications is available in the second section. In the bottom window the Spectrum Viewer is displayed.

Supplementary Figure 8: Protein View with Sequence Viewer. The Protein View tab contains three parts: details on the submitted proteins can be observed in the upper section. Information about the corresponding peptide identifications is available in the second section. In the bottom window the Sequence Viewer is displayed, showing useful statistics.

Supplementary Figure 9. Spectrum Viewer Filtering Options. In this example only b ion fragment ions and b ion amino acid annotations are enabled.

Supplementary Figure 10: The Peptide View tab contains three parts: the upper part contains peptide details with the sequences and the corresponding protein identifications, the calculated delta m/z value, the corresponding PTMs and the peptide scores assigned by the search engines, if available. The second section is focused on PTMs: PSI-MOD accession number, location, modified residues and mass information. In the bottom window the Spectrum Viewer is displayed.

Supplementary Figure 11: The Peptide View tab with the Sequence Viewer: the chosen peptide is highlighted.

Supplementary Figure 12: Spectrum View. The Spectrum View tab shows all mass spectra (identified + unidentified) included in the experiment. Spectrum details and the related scan and precursor information are accessible from the upper right window. It is possible to perform de novo sequencing in the Spectrum Viewer (not shown).

Supplementary Figure 13: mzML Total Ion Chromatogram. If chromatograms are included in the mzML file, PRIDE Inspector can display them in the Spectrum View tab. Note that chromatograms cannot be included in PRIDE XML files.

Supplementary Figure 14: PRIDE XML Summary Charts View. The PRIDE XML Summary Charts tab currently displays up to nine different charts that can help to assess the quality of the data: Delta m/z, Peptides per Protein, Missed Tryptic Cleavages, Average MS/MS Spectrum, Precursor Ion Charge, Precursor Ion Masses, Peaks per MS/MS Spectrum and Peak Intensity.

Supplementary Figure 15: Delta m/z chart Help Page. The Help page of the Delta m/z chart is available via the information button on the Summary Charts view. It is shown as an example of the Help functionality integrated in the PRIDE Inspector application.

Supplementary Figure 16: By clicking the zoom button on any of the charts included in the Summary Charts tab (see Supplementary Figure 14), the chart can be viewed at full screen resolution. In this case, the Delta m/z chart is shown as an example of this feature.

Supplementary Figure 17: Number of Peptides Identified per Protein chart. This bar chart displays the number of peptides identified per protein in a single experiment.

Supplementary Figure 18: mzML Summary Charts View. The mzML Summary Charts tab currently displays up to four charts: Average MS/MS Spectrum, Precursor Ion Masses, Peaks per MS/MS Spectrum, and Peak Intensity.

Supplementary Figure 19: PRIDE Inspector Quantification view. If available, the Quantification view provides differential expression level details of proteins across different samples, depending on the labeling method used (in the example: isobaric tags, iTRAQ).

Supplementary Figure 20: PRIDE Inspector Search PRIDE Panel. The PRIDE Inspector Search PRIDE panel gives access to the public database, and the public experiments can be searched flexibly based on their annotation. Experiments can be viewed by clicking the "+" sign.

Supplementary Figure 21: PRIDE Inspector Search PRIDE Panel Search Result. Search result for the term "mito" in the PRIDE Inspector Search PRIDE panel.

Supplementary Figure 22: PRIDE Inspector Export Options. The following export options are available: the spectra can be exported in Mascot Generic File (mgf) format and as a spectrum table. Additionally, the identifications can be exported as Protein-Peptide Mappings, and as detailed protein or peptide tables in a tab-delimited format.

Supplementary Figure 23: PRIDE Inspector Welcome Screen. The PRIDE Inspector Welcome Screen gives the user a quick overview of the tool and highlights its main functionalities.

Supplementary Figure 24: PRIDE Inspector Export tab: PRIDE XML files of searched public experiments (see Supplementary Figures 20-21) can be downloaded.