Supporting Information for Comprehensive HCP Profiling by Targeted and Untargeted Analysis of DIA Mass Spectrometry Data with PRM Verification

Simion Kreimer 1, Yuanwei Gao 1, Somak Ray 1, Mi Jin 2,3, Zhijun Tan 2, Nesredin A. Mussa 2, Li Tao 2, Zhengjian Li 2, Alexander R. Ivanov 1, and Barry L. Karger 1*

1) Barnett Institute and Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA
2) Bristol-Myers Squibb, Biologics Process and Product Development, 38 Jackson Road, Devens, MA
3) Present address: TEVA Biopharmaceuticals, 145 Brandywine Highway, West Chester, PA
*Author for inquiries.

This supporting information contains a detailed description of the DIA data processing workflows used in the manuscript for HCP characterization. The data processing was carried out by two scripts, and the step-by-step operations of these scripts are provided in sufficient detail for other researchers to replicate the workflow. Additionally, Table S-1 contains the sequences and quantitative measurements of the peptides used to evaluate the presented quantitative strategy.

This supplement describes two scripts that were developed for data processing in the HCP analysis workflow by connecting open source packages. The Targeted Assay Library Assembler performs a protein sequence database search on multiple DDA data files and generates a retention time (RT) normalized MS assay library, appended with decoy assays, in the OpenSWATH TraML format (Figure S1). The DIA Data Analysis script performs a combined targeted and untargeted search of the DIA data and generates a list of putative peptides formatted as a QExactive inclusion list (Figure S2).

The scripts were written in Perl (version 5.18 or higher) and combine the functionalities of msconvert from the ProteoWizard suite (1), SearchGUI (2), DIA-Umpire (3), OpenSWATH (4) from the OpenMS (5) suite, and PyProphet (6). The scripts use an in-house developed mzIdentML (7) parser, which converts the output of SearchGUI's PeptideShaker (8) to an assay library in an OpenSWATH-compatible tab-separated values (TSV) format. The mzIdentML parser was initially necessary because PeptideShaker did not export results in the pepXML format. Current versions of PeptideShaker have this function, and hence SpectraST can be used to generate the assay libraries (9). All of the operations can be completed using the listed software tools individually, but the scripts connect them and thus accelerate data processing.

The supplement concludes with the measured intensities of 10 SIL peptides that were used to evaluate the label free quantitative strategy presented in the manuscript.

Targeted Assay Library Assembler

Required Software:
Msconvert (ProteoWizard version or higher)
OpenSWATH (OpenMS version 2.1 or higher)
SearchGUI (version or higher, requires Java version 1.7 or higher) and PeptideShaker (version or higher)
pyprophet (version or higher, requires Python version or higher)

Script Input:
Raw DDA data files from analysis of LC-MS runs containing the specified retention time (RT) standards.
SearchGUI parameter file.
Protein sequence database in FASTA format.
Targeted Assay Library Assembler parameter file (Figure S3).

Script Output:
Primary Output:
Database search results from PeptideShaker in mzIdentML format.
OpenSWATH-compatible, retention time normalized targeted assay library appended with decoy assays.
OpenSWATH-compatible retention time standards assay library for targeted analysis alignment.

Secondary Output:
Result files from each search engine for each data file.
List of peptides identified at the specified FDR threshold.

Figure S1. Flow diagram for the targeted assay library generation script.

Script Procedure:
1) Each data file is converted into centroided 64-bit Mascot Generic Format (MGF) using msconvert.
2) A database search is performed on each MGF file by SearchGUI using the parameters indicated in the SearchGUI parameter file.
3) The database search results from all data files are combined in PeptideShaker, and a peptide list is generated in which false positive identifications are limited to the specified FDR (e.g. 1%) across the combined results, so that false positives do not accumulate as search results from multiple runs are merged.
4) The database search results are processed in PeptideShaker individually for each MGF file and exported in the mzIdentML format.
5) All mzIdentML files are processed using the mzIdentML parser as follows:
   a. The measured retention times of the specified RT standards are averaged across all runs.
   b. A linear regression curve is generated for each result file to normalize the retention times across all runs to the averages calculated in 5a.
   c. The 10 highest intensity b- or y-ion transitions are extracted for each peptide (in each charge state) identified above the set PeptideShaker confidence score (e.g. 100% confidence for the targeted library) and copied into an OpenSWATH-compatible TSV file.
   d. The retention times for each assay are normalized based on the calibration curves generated in step 5b.
   e. Redundant identifications are removed, retaining only the match with the highest PeptideShaker raw score for each identified peptide charge state.
6) The assay library is filtered against the peptide list from step 3, and only peptides identified at the true FDR threshold are retained.
7) The assays for the RT standards are copied into a separate TSV file to be used for retention time calibration during DIA data analysis. At this step, a different set of peptides can be selected; in our case, peptides from the mAb were used for RT normalization.
8) The targeted assay library and the RT standards library are converted from the TSV format into the OpenSWATH TraML (10) format using the ConvertTSVtoTraML and the AssayGenerator scripts from the OpenSWATH toolset.
9) The assay library is appended with decoy assays using the OpenSWATH DecoyGenerator script and is then ready to be used in targeted DIA data analysis.
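The retention time normalization in steps 5a, 5b, and 5d can be illustrated with a short Python sketch. This is not the in-house Perl parser; it is a minimal illustration that assumes ordinary least squares is an acceptable stand-in for the linear regression described above. The function names, the input layout (a dictionary of runs mapping RT standard peptides to measured retention times), and the example values are hypothetical.

```python
from statistics import mean


def average_standard_rts(runs):
    """Step 5a: average the measured RT of each standard across all runs.

    `runs` maps a run name to {standard peptide: measured RT} (assumed layout).
    """
    collected = {}
    for standard_rts in runs.values():
        for peptide, rt in standard_rts.items():
            collected.setdefault(peptide, []).append(rt)
    return {peptide: mean(rts) for peptide, rts in collected.items()}


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct retention times")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def calibrate_runs(runs):
    """Step 5b: one regression line per run, mapping run RT to the average RT."""
    reference = average_standard_rts(runs)
    models = {}
    for run, standard_rts in runs.items():
        peptides = sorted(standard_rts)
        measured = [standard_rts[p] for p in peptides]
        target = [reference[p] for p in peptides]
        models[run] = fit_line(measured, target)
    return models


def normalize_rt(rt, model):
    """Step 5d: apply a run's calibration line to an assay retention time."""
    slope, intercept = model
    return slope * rt + intercept


# Hypothetical example: run_b elutes 2 minutes later than run_a throughout.
runs = {
    "run_a": {"STD1": 10.0, "STD2": 20.0, "STD3": 30.0},
    "run_b": {"STD1": 12.0, "STD2": 22.0, "STD3": 32.0},
}
models = calibrate_runs(runs)
print(normalize_rt(15.0, models["run_a"]))  # peptide observed at 15.0 min in run_a
print(normalize_rt(17.0, models["run_b"]))  # same peptide observed at 17.0 min in run_b
```

In this example the two observations of the same peptide map to the same normalized retention time, which is what allows identifications from different runs to be merged into a single assay library.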

DIA Data Analysis

Required Software:
Msconvert (ProteoWizard version or higher)
OpenSWATH (OpenMS version 2.1 or higher)
SearchGUI (version or higher, requires Java version 1.7 or higher) and PeptideShaker (version or higher)
pyprophet (version or higher, requires Python version or higher)
DIA-Umpire (version 2.0 or higher)

Script Input:
Raw DIA data files.
Targeted assay library (generated by the Targeted Assay Library Assembler).
RT standards assay library (generated by the Targeted Assay Library Assembler).
SearchGUI parameter file (use the same file as in the Targeted Assay Library Assembler).
Protein sequence database in FASTA format (use the same file as in the Targeted Assay Library Assembler).
DIA Data Analysis script parameter file (Figure S4).
DIA-Umpire parameter file.

Script Output:
Primary Output:
QExactive-compatible inclusion list containing HCP peptides.
DIA data converted into pseudo-MS2 files (3 MGF files for each DIA data file).
OpenSWATH Workflow result files for the targeted and untargeted libraries.
SearchGUI database search results of the pseudo-DDA data.
Untargeted assay library.
Secondary Output:
Result files from individual search engines.

Figure S2. Flow diagram for the DIA data analysis script.

Script Procedure:
1) Each DIA data file is converted into 64-bit mzXML format (the MS1 data is kept in profile mode) by msconvert.
2) The DIA files in mzXML format are processed with the OpenSWATH Workflow using the targeted assay library and the RT standards library.
3) The DIA files are then converted into pseudo-DDA files (MGF format) by DIA-Umpire using the parameters indicated in the DIA-Umpire parameter file.
4) Each generated MGF file is searched in SearchGUI and PeptideShaker individually, and the results are exported as separate mzIdentML files.
5) The mzIdentML files are processed using the mzIdentML parser, as in the Targeted Assay Library Assembler, except that a lower PeptideShaker confidence threshold, and hence a higher FDR (e.g. 5%), is used to filter identifications. The retention times are calibrated using the RT standards library.
6) The untargeted assay library generated by the mzIdentML parser is converted using the ConvertTSVtoTraML, AssayGenerator, and DecoyGenerator scripts, as in the Targeted Assay Library Assembler.
7) The DIA files in mzXML format are processed with the OpenSWATH Workflow using the untargeted assay library and the RT standards library.
8) The results from the untargeted and targeted OpenSWATH searches are combined and scored using pyprophet, which identifies the best peak for each assay.
9) All matched peaks from pyprophet are exported to produce a non-redundant list of putative peptides in all detected charge states.
10) The list is filtered to exclude a set of proteins specified in the script parameter file (e.g. trypsin and the therapeutic protein).
11) The putative peptide list is manually pasted into a PRM method file for PRM verification and quantitation.
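Steps 9 and 10 amount to collapsing the scored peaks into one entry per peptide charge state and dropping excluded proteins. The Python sketch below illustrates that logic only. The field names (`peptide`, `charge`, `protein`, `mz`, `rt`, `score`) are hypothetical and do not reproduce pyprophet's actual output columns, and keeping the higher-scoring entry when the targeted and untargeted searches both report the same peptide charge state is an assumption, not a documented behavior of the script.

```python
def build_putative_peptide_list(peaks, excluded_proteins):
    """Return one best-scoring peak per (peptide, charge), minus excluded proteins.

    `peaks` is an iterable of dicts with hypothetical keys:
    'peptide', 'charge', 'protein', 'mz', 'rt', 'score' (higher score is better).
    """
    excluded = set(excluded_proteins)
    best = {}
    for peak in peaks:
        # Step 10: skip proteins listed for exclusion (e.g. trypsin, the mAb).
        if peak["protein"] in excluded:
            continue
        # Step 9: keep each peptide charge state only once.
        key = (peak["peptide"], peak["charge"])
        if key not in best or peak["score"] > best[key]["score"]:
            best[key] = peak
    # Order by retention time, then m/z, for easier review before PRM setup.
    return sorted(best.values(), key=lambda p: (p["rt"], p["mz"]))


# Hypothetical scored peaks from the targeted and untargeted searches.
targeted = [
    {"peptide": "AAAK", "charge": 2, "protein": "HCP_1", "mz": 180.6, "rt": 12.0, "score": 3.1},
    {"peptide": "CCCK", "charge": 2, "protein": "TRYPSIN", "mz": 228.6, "rt": 8.0, "score": 4.0},
]
untargeted = [
    {"peptide": "AAAK", "charge": 2, "protein": "HCP_1", "mz": 180.6, "rt": 12.1, "score": 2.5},
    {"peptide": "AAAK", "charge": 3, "protein": "HCP_1", "mz": 120.7, "rt": 12.0, "score": 1.9},
]

for row in build_putative_peptide_list(targeted + untargeted, {"TRYPSIN"}):
    print(row["peptide"], row["charge"], row["mz"], row["rt"])
```

The resulting rows still have to be transferred into the instrument's PRM method by hand, as described in step 11; the sketch does not attempt to reproduce the QExactive inclusion list file layout.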

Figure S3. Targeted Assay Library Assembler parameter file. The listed parameters match those used in the investigation.

Figure S4. DIA Data Analysis parameter file. The listed parameters match those used in the investigation.

Evaluation of Quantitative Accuracy and Linearity with Spiked Peptide Standards

The accuracy and linearity of the quantitative strategy were evaluated using 10 stable isotope labeled (SIL) peptides, which were spiked into the purified mAb at levels ranging from 2.5 to 20 fmol/injection. Their concentrations were estimated from the combined intensity of the top 3 fragment ions and a calibration curve constructed from 4 spiked-in protein standards. The results are linear, and the observed range of slopes suggests a potential measurement error of 2- to 3-fold.

Table S-1. Stable Isotope Labeled Peptides used for Assessment of Label Free Quantitation.

SIL Peptide      2.5 fmol   5 fmol   10 fmol   20 fmol   Slope   R^2
VSAGLSVPADGPK
QGAFLVNAAR                                                       >0.999
SVLLDAASGQLR
FEEILQEAGSR                                                      >0.999
GETLGLIGFGR
ANFYYLEGER
VTSFSLAK
VHSFPDTIK                                                        >0.999
LSLGSGSCSAIIK
TWNDPSVQQDIK

References
(1) Kessner, D.; Chambers, M.; Burke, R.; Agus, D.; Mallick, P. Bioinformatics 2008, 24.
(2) Vaudel, M.; Barsnes, H.; Berven, F. S.; Sickmann, A.; Martens, L. Proteomics 2011, 11.
(3) Tsou, C. C.; Avtonomov, D.; Larsen, B.; Tucholska, M.; Choi, H.; Gingras, A. C.; Nesvizhskii, A. I. Nat Methods 2015, 12.
(4) Rost, H. L.; Rosenberger, G.; Navarro, P.; Gillet, L.; Miladinovic, S. M.; Schubert, O. T.; Wolski, W.; Collins, B. C.; Malmstrom, J.; Malmstrom, L.; Aebersold, R. Nat Biotechnol 2014, 32.
(5) Bertsch, A.; Gropl, C.; Reinert, K.; Kohlbacher, O. Methods Mol Biol 2011, 696.
(6) Teleman, J.; Rost, H. L.; Rosenberger, G.; Schmitt, U.; Malmstrom, L.; Malmstrom, J.; Levander, F. Bioinformatics 2015, 31.
(7) Jones, A. R.; Eisenacher, M.; Mayer, G.; Kohlbacher, O.; Siepen, J.; Hubbard, S. J.; Selley, J. N.; Searle, B. C.; Shofstahl, J.; Seymour, S. L.; Julian, R.; Binz, P. A.; Deutsch, E. W.; Hermjakob, H.; Reisinger, F.; Griss, J.; Vizcaino, J. A.; Chambers, M.; Pizarro, A.; Creasy, D. Mol Cell Proteomics 2012, 11.
(8) Vaudel, M.; Burkhart, J. M.; Zahedi, R. P.; Oveland, E.; Berven, F. S.; Sickmann, A.; Martens, L.; Barsnes, H. Nat Biotechnol 2015, 33.
(9) Schubert, O. T.; Gillet, L. C.; Collins, B. C.; Navarro, P.; Rosenberger, G.; Wolski, W. E.; Lam, H.; Amodei, D.; Mallick, P.; MacLean, B.; Aebersold, R. Nat Protoc 2015, 10.
(10) Deutsch, E. W.; Chambers, M.; Neumann, S.; Levander, F.; Binz, P. A.; Shofstahl, J.; Campbell, D. S.; Mendoza, L.; Ovelleiro, D.; Helsens, K.; Martens, L.; Aebersold, R.; Moritz, R. L.; Brusniak, M. Y. Mol Cell Proteomics 2012, 11, 1-6.