UC San Diego UC San Diego Electronic Theses and Dissertations


UC San Diego
UC San Diego Electronic Theses and Dissertations

Title: Protein Identification via Assembly of Tandem Mass Spectra
Author: Guthals, Adrian Lewis
Peer reviewed | Thesis/dissertation

escholarship.org, Powered by the California Digital Library, University of California

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Protein Identification via Assembly of Tandem Mass Spectra

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science

by

Adrian Lewis Guthals

Committee in charge:
Professor Nuno Bandeira, Chair
Professor Vineet Bafna, Co-Chair
Professor Steven Briggs
Professor Sanjoy Dasgupta
Professor Pavel Pevzner

2015

The Dissertation of Adrian Lewis Guthals is approved and is acceptable in quality and form for publication on microfilm and electronically:

Co-Chair
Chair

University of California, San Diego
2015

DEDICATION

This manuscript is dedicated to my loving wife, parents, and family, who brought me through 9 years of personal development.

TABLE OF CONTENTS

Signature Page
Dedication
Table of Contents
List of Figures
List of Tables
Acknowledgements
Vita
Abstract of the Dissertation
Introduction
Chapter 1 The generating function approach for peptide identification in spectral networks
    Introduction
    Methods
        Spectral probabilities and notation
        Pairing of spectra
        Star Probabilities
        Processing real spectra
        Generating candidate PSMs
    Results
    Discussion
    Acknowledgements
Chapter 2 Shotgun Protein Sequencing with meta-contig assembly
    Introduction
    Methods
        prot Data Acquisition
        Spectrum Preprocessing and Notation
        Shotgun Protein Sequencing
        Spectral Alignment
        Meta-Assembly
    Results
    Discussion
    Acknowledgements

Chapter 3 Sequencing-grade de novo analysis of MS/MS triplets (CID/HCD/ETD) from overlapping peptides
    Introduction
    Methods
        MS/MS Acquisition
        PepNovo+ Training
        CID/HCD/ETD Merging
    Results
    Discussion
    Acknowledgements
Chapter 4 De novo sequencing of polyclonal antibodies from serum
    Preliminary Results
    Problem Formulation
    Sequencing
    Acknowledgements
Bibliography

LIST OF FIGURES

Figure 1.1. Illustration of P_ovlp and the overlapping mass range between overlapping spectra
Figure 1.2. Spectral and star probability distributions of observed p-values
Figure 1.3. Reduction of star probability with respect to optimality of starting spectral probability
Figure 1.4. Overlap of unique peptides identified at 1% peptide-level false-discovery rate
Figure 2.1. Meta-SPS Procedures
Figure 2.2. Annotation of contigs and meta-contigs with MS-GFDB spectrum identifications
Figure 2.3. De novo sequencing length, coverage, and accuracy
Figure 2.4. Mapped Meta-contigs
Figure 3.1. Updated Meta-SPS pipeline
Figure 3.2. MS/MS ion statistics and performance of CID/HCD/ETD PRM scoring and merging
Figure 3.3. Assembled meta-contig of CID/HCD/ETD triplets
Figure 3.4. De novo sequencing coverage of six target proteins at κ
Figure 4.1. Intact mass measurement and relative abundances of antigen-responsive pAbs
Figure 4.2. Preliminary results from sequencing an unknown pAb sample

LIST OF TABLES

Table 1.1. Spectrum- and Peptide-Level Identification Rate of Paired Peptide Spectrum Matches at 1% False-Discovery Rate
Table 1.2. Spectrum- and Peptide-Level Identification Rate of All (Paired and Unpaired) Peptide Spectrum Matches
Table 2.1. Definitions of contig alignment terminology
Table 3.1. De novo Sequencing Length, Coverage, and Accuracy for Alternative Minimum Meta-contig Size (κ) Cutoffs
Table 3.2. De novo Sequencing and Database Search Results by Enzyme

ACKNOWLEDGEMENTS

I would like to acknowledge Professor Nuno Bandeira for his support as the chair of my committee. Through multiple projects and many long nights, his guidance has proved to be invaluable.

Chapter 1, in full, is a reprint of the material as it appears in the Journal of Computational Biology: The generating function approach for peptide identification in spectral networks. J Comput Biol Nov 25. [Epub ahead of print] PubMed PMID: [39]

Chapter 2, in full, is a reprint of the material as it appears in Molecular and Cellular Proteomics: Shotgun protein sequencing with meta-contig assembly. Mol Cell Proteomics Oct;11(10): [40]

Chapter 3, in full, is a reprint of the material as it appears in the Journal of Proteome Research: Sequencing-grade de novo analysis of MS/MS triplets (CID/HCD/ETD) from overlapping peptides. J Proteome Res Jun 7;12(6): [41]

Chapter 4, in part, is currently being prepared for submission for publication. Guthals, Adrian; Yutian, Gan; Sandoval, Wendy; Bandeira, Nuno. The dissertation author was the primary investigator and author of this material.

VITA

Research Assistant, University of California, San Diego
2010 Bachelor of Science, University of California, San Diego
2013 Masters of Science, University of California, San Diego
2014 Research and Development Intern, Genentech, South San Francisco
2015 Doctor of Philosophy, University of California, San Diego

PUBLICATIONS

The generating function approach for peptide identification in spectral networks. RECOMB [Epub ahead of print] PubMed PMID: (ref [39])
Sequencing-grade de novo analysis of MS/MS triplets (CID/HCD/ETD) from overlapping peptides. J Proteome Res Jun 7;12(6): (ref [41])
Shotgun protein sequencing with meta-contig assembly. Mol Cell Proteomics Oct;11(10): (ref [40])
The spectral networks paradigm in high throughput mass spectrometry. Mol Biosyst Oct;8(10): (ref [42])
Peptide identification by tandem mass spectrometry with alternate fragmentation modes. Mol Cell Proteomics Sep;11(9):550-7 (ref [38])
Neutron-encoded signatures enable product ion annotation from tandem mass spectra. Mol Cell Proteomics Dec;12(12): (ref [86])

FIELDS OF STUDY

Major Field: Computer Science (Bioinformatics)
Studies in Computational Proteomics and Mass Spectrometry
Professor Nuno Bandeira

ABSTRACT OF THE DISSERTATION

Protein Identification via Assembly of Tandem Mass Spectra

by

Adrian Lewis Guthals

Doctor of Philosophy in Computer Science
University of California, San Diego, 2015
Professor Nuno Bandeira, Chair
Professor Vineet Bafna, Co-Chair

High-throughput proteomics is made possible by a combination of modern mass spectrometry instruments capable of generating many millions of tandem mass (MS² or MS/MS) spectra on a daily basis and the increasingly sophisticated associated software for their automated identification. Despite the growing accumulation of collections of identified spectra and the regular generation of MS² data from related peptides, the mainstream approach for peptide identification is still the nearly two-decade-old approach of matching one MS² spectrum at a time against a database of protein sequences. These traditional approaches fail for the identification of spectra from unknown proteins such

as antibodies or proteins from organisms with un-sequenced genomes. Furthermore, attempts to identify MS/MS spectra against large databases (e.g., the human microbiome or 6-frame translation of the human genome) face a search space that is times larger than the human proteome, where it becomes increasingly challenging to separate true from false peptide matches. First, we describe techniques to utilize networks of spectra from related peptides to rigorously compute the joint spectral probability of multiple spectra being matched to peptides with overlapping sequences, thus improving peptide identification by 30-62% against large search spaces. We then introduce methods that dramatically improve de novo sequencing of unknown proteins using novel spectral network assembly algorithms and incorporating alternative MS/MS acquisition protocols. Finally, we describe an interesting end-goal biological problem for which the described advances in de novo sequencing can usher in a new era of therapeutic drug discovery.

Introduction

The success of tandem mass spectrometry (MS² or MS/MS) approaches to peptide identification is partly due to advances in computational techniques allowing for the reliable interpretation of MS² spectra. Mainstream computational techniques mainly fall into two categories: database search approaches that score each spectrum against peptides in a sequence database [26, 80, 18, 105] and de novo techniques that directly reconstruct the peptide sequence from each spectrum [68, 31, 28, 75]. The combination of these methods with advances in high-throughput MS² has promoted accelerated growth of spectral libraries - collections of peptide MS² spectra whose identifications were validated by accepted statistical methods [53, 25] and often also manually confirmed by mass spectrometry experts. A similar concept of spectral archives was also recently proposed to denote spectral libraries including interesting non-identified spectra [30] (i.e., unidentified recurring spectra with good de novo reconstructions). The growing availability of these large collections of MS² spectra has reignited the development of alternative peptide identification approaches based on spectral matching [19, 35, 60] and alignment [4, 90, 6] algorithms.

The dominant paradigm for high-throughput protein identification is based on trypsin digestion of extracted proteins to produce peptides, followed by tandem mass spectrometry to generate single-peptide MS² spectra that are then computationally matched one spectrum at a time against protein sequence databases to finally obtain peptide and protein identifications. This paradigm has been the basis of nearly all large-scale

proteomics studies to date despite its typically low spectrum identification rate of only 15-30% because enzymatic digestion generates multiple peptides per protein and, in the extreme, only one peptide needs to be identified per protein (though more are usually preferred) to enable protein-level quantification and comparison across multiple tissues or experimental conditions. However, the serious downside of this low identification rate is that it consistently leads to missing information on non-tryptic peptides and yields very low protein sequence coverage, thus substantially limiting the chances of detecting alternative splicing or of identifying and localizing post-translational modifications (PTMs). In fact, the limitations of PTM search are so dire that most labs still only allow for 4-6 PTMs per search (about half of which are due to sample handling procedures) even though more than 500 PTMs are known and listed in UniMOD.

We argue that overcoming the identification bottleneck will require new ways of thinking about MS² spectra in order to develop new ways of interpreting them. In particular, we describe how the spectral networks paradigm differs from the current mainstream paradigm and illustrate its potential with applications where current paradigms perform poorly or completely fail. By finding spectra from related peptides even before considering their possible identifications, and by using these spectra to determine consensus identifications from sets of spectra from related peptides instead of separately attempting to identify one spectrum at a time, the spectral networking paradigm is capable of addressing many of the pitfalls of mainstream spectrum identification paradigms. In addition to improving identification by significantly increasing signal-to-noise ratios and deconvoluting MS² ion types, spectral networks further open up new computational avenues for analysis of proteins for which the amino acid sequence is unknown.

Shotgun Protein Sequencing

Current approaches to proteomics focus on the reliable identification of proteins under the assumption that all proteins of interest are known and present in a database. However, the limited availability of sequenced genomes and multiple mechanisms of protein variation often refute this assumption. Well-known mechanisms of protein diversity include variable recombination and somatic hypermutation of immunoglobulin genes [36]. The vital importance of some of these novel proteins is directly reflected in the success of monoclonal antibody drugs such as Rituxan™, Herceptin™ and Avastin™ [112, 45, 3], all of which are derived from proteins that are not directly inscribed in any genome. Similarly, multiple commercial drugs have been developed from proteins obtained from species whose genomes are not known. In particular, peptides and proteins isolated from venom have provided essential clues for drug design [64, 84] - examples include drugs for controlling blood coagulation [52, 101, 58] and therapeutic treatments for breast [100, 79] and ovarian [73] cancer. Despite this vital importance of novel proteins, the mainstream method for protein sequencing is still initiated by restrictive and low-throughput Edman degradation [115, 77] - a task made difficult by protein purification procedures, post-translational modifications and blocked protein N-termini. These problems gain additional relevance when one considers the unusually high level of variability and post-translational modifications in venom proteins [9, 85].

Conceptually, sequencing a protein from a set of MS² spectra can be described by a simple analogy. Imagine a jewelry box with many identical copies of a specific model of bead necklaces.
Although all the beads are identical, this model is characterized by having irregular distances between consecutive beads - the set of inter-bead distances is initially chosen by the designer and all necklaces are then made using exactly the same specification. Now assume that one day you open your jewelry box and realize

that someone has vandalized all the necklaces by cutting them into fragments at randomly chosen bead positions. Can you recover the original design of this model of necklaces, as specified by the set of consecutive inter-bead distances? In this allegory, inter-bead distances correspond to amino acid masses and beads correspond to MS² fragmentation points (between consecutive amino acids). MS² data add more than a few difficulties to this necklace assembly problem; for example, most peaks in MS² spectra do not correspond to any fragment ions (extra beads) and many fragment ions do not result in any peaks (missing beads).

Shotgun Protein Sequencing (SPS) is a de novo sequencing approach [4] that utilizes multiple MS² spectra from overlapping peptides generated using non-specific proteases or multiple proteases with different specificities [51, 59, 27, 69, 83]. The original approach was based on the overlap-layout-consensus approach to assembly and was shown to be efficient for the assembly of a single purified unmodified protein. However, practical applications (like sequencing snake venoms) require applicability to mixtures of modified proteins. In fact, most MS² samples contain both modified and unmodified versions of many peptides, including biological and chemical modifications both native and introduced during sample preparation. Sequence variations and post-translational modifications present a formidable algorithmic challenge for assembly algorithms, as the performance of the original SPS approach [4] degraded steeply as soon as even a small percentage of the spectra were from modified peptides. To use the beads analogy, the necklace puzzle becomes very difficult if, in addition to the canonical necklaces (non-modified proteins), the jewelry box also contains some necklaces that deviate from the designer's specification (modified proteins).
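The necklace analogy can be made concrete with a toy sketch. It is only an illustration of suffix-prefix overlap merging in the noise-free case, not the SPS algorithm: the function names `distances`, `longest_overlap`, and `merge` are hypothetical, the residue masses are nominal integer values chosen for the example, and extra or missing beads (noise peaks, missing fragment ions) and modifications are ignored entirely.

```python
# Nominal integer residue masses for the amino acids used in this example
# (an assumption for illustration; real analyses use accurate masses).
RESIDUE_MASS = {"P": 97, "E": 129, "T": 101, "I": 113, "D": 115, "S": 87}


def distances(peptide):
    """Inter-bead distances of a necklace fragment: one mass per amino acid."""
    return [RESIDUE_MASS[aa] for aa in peptide]


def longest_overlap(left, right, min_len=2):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_len` distances exists.
    """
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge(left, right, min_len=2):
    """Merge two fragments that overlap suffix-to-prefix, or return None."""
    k = longest_overlap(left, right, min_len)
    if k == 0:
        return None
    return left + right[k:]


# Two fragments cut from the same necklace design share five distances.
merged = merge(distances("PEPTIDE"), distances("PTIDES"))
print(merged == distances("PEPTIDES"))
```

With real MS² data the comparison cannot be an exact list match, which is why the spectral alignment and consensus steps described in the text are needed.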
Building on spectral networks algorithms for analysis of post-translational modifications based on alignment of spectra from modified and unmodified peptide variants [5, 6], these alignments were integrated into Shotgun Protein Sequencing to derive a completely new form of

spectral assembly. This utilized a generalized notion of A-Bruijn graphs (originally proposed in the context of DNA fragment assembly [82]) for the assembly of MS² spectra from overlapping, modified and unmodified peptides into contigs (sets of aligned spectra from overlapping peptides), where each contig then capitalizes on the corroborating evidence from the assembled spectra to yield a high-quality consensus de novo sequence. As a result, SPS consensus de novo sequences were found to be twice as accurate as sequences derived from single spectra (1 mistake per 10 vs. 1 per 5 amino acid predictions) while yielding sequences that were much longer than a single peptide/spectrum could support (up to 24 AA long).

Recently this paradigm was extended in two distinct directions. First, Bandeira et al. capitalized on homology between SPS long/accurate de novo sequences and known sequences to deliver the first automated full-length protein sequencing approach (Comparative SPS [3]) and demonstrated it with database-assisted de novo sequencing of two monoclonal antibodies. Spectral networks also underlie the related work of Castellana et al. [13, 11], who proposed an effective method for sequencing monoclonal antibodies with database-guided iterative alignment and assembly of spectra from overlapping peptides. Both of these methods rely upon the existence of a homologous database. To reduce this dependence, we have since developed Meta-SPS (see Chapters 2, 3) algorithms for assembling SPS contigs into meta-contigs (sets of overlapping contigs). These methods now deliver de novo sequences over 200 AA long at sequencing error rates as low as 1 mistake per 100 predicted amino acids without requiring homology to known sequences, which demonstrates the feasibility of fully automated de novo protein sequencing with unidentified MS² spectra.

Novel Contributions

The spectral networks paradigm is founded on two core principles beyond mainstream approaches: 1) it is more efficient to match unidentified spectra to reference or other unidentified spectra than to reference sequences and 2) consensus interpretation of sets of related spectra is more reliable than identification of one spectrum at a time. This provides the foundation of the novel work described here. First, I describe how we utilized networks of spectra from overlapping peptides to improve the peptide-level identification rate by 40-60% over state-of-the-art database search tools (Chapter 1). Then I go into methods for improving de novo sequencing of unknown proteins. I will introduce the multi-stage assembly approach of Meta-SPS (Chapter 2) and describe techniques for combining CID/ETD/HCD spectra from each peptide to further improve de novo sequencing (Chapter 3). Finally, I describe an interesting end-goal biological problem for which the described advances in de novo analysis can usher in a new era of therapeutic drug discovery (Chapter 4).

Chapter 1

The generating function approach for peptide identification in spectral networks

Tandem mass (MS/MS) spectrometry has become the method of choice for protein identification and has launched a quest for the identification of every translated protein and peptide. However, computational developments have lagged behind the pace of modern data acquisition protocols and have become a major bottleneck in proteomics analysis of complex samples. As it stands today, attempts to identify MS/MS spectra against large databases (e.g., the human microbiome or 6-frame translation of the human genome) face a search space that is times larger than the human proteome, where it becomes increasingly challenging to separate true from false peptide matches. As a result, the sensitivity of current state-of-the-art database search methods drops by nearly 38% to such low identification rates that almost 90% of all MS/MS spectra are left unidentified. We address this problem by extending the generating function approach to rigorously compute the joint spectral probability of multiple spectra being matched to peptides with overlapping sequences, thus enabling the confident assignment of higher significance to overlapping peptide spectrum matches (PSMs). We find that these joint spectral probabilities can be several orders of magnitude more significant than individual

PSMs, even in the ideal case when perfect separation between signal and noise peaks could be achieved per individual MS/MS spectrum. After benchmarking this approach on a typical lysate MS/MS dataset, we show that the proposed intersecting spectral probabilities for spectra from overlapping peptides improve peptide identification by 30-62%.

1.1 Introduction

The leading method for protein identification by tandem mass spectrometry (MS/MS) involves digesting proteins into peptides, generating an MS/MS spectrum per peptide, and obtaining peptide identifications by individually matching each MS/MS spectrum to putative peptide sequences from a target database. Many computational approaches have been developed for this purpose, such as SEQUEST [26], Mascot [80], Spectrum Mill (Agilent Technologies), and more recently MS-GFDB [56], yet they all address the same two problems: given an MS/MS spectrum S and a collection of possible peptide sequences, (1) find the peptide P that most likely produced spectrum S, and (2) report the statistical significance of the peptide-spectrum match (P, S) (denoted as PSM) while searching many MS/MS spectra against multiple putative peptide sequences from a target database. Problem 1 is typically addressed by maximizing a scoring function proportional to the likelihood that peptide P generated spectrum S, while solving problem 2 involves choosing a score threshold that yields an experiment-wide 1% false-discovery rate (FDR) [76], usually based on an estimated distribution of PSM scores for incorrect PSMs [25]. Yet a major limitation comes from ambiguous interpretations of MS/MS fragmentation, where the true peptide match for a given spectrum S may only be the 2nd or 100,000th highest scoring over all possible PSMs for the same spectrum [55]. We address this issue as it relates to problem 2, where false peptides matching S with high scores can become common when searching large databases,

particularly for meta-proteomics [16] and 6-frame translation [12] searches, thus leading to higher-scoring false matches and stricter significance thresholds, resulting in as little as 2% of all spectra being identified [48] since only the highest scoring PSMs become statistically significant even at 5% FDR.

Identifying peptides from a large database is less of a challenge than de novo sequencing, where the target database contains all possible peptide sequences. Yet, recent advances in de novo sequencing have demonstrated 97-99% sequencing accuracy (percent of amino acids in matched peptides that are correct) at nearly the same level of coverage (percent of amino acids in target peptides that were matched) as that of database search for small mixtures of target proteins (Chapters 2, 3). At the heart of this approach is the pairing of spectra from overlapping peptides (i.e., peptides that have overlapping sequences) to construct spectral networks [42, 4], where a node represents an individual spectrum (or a consensus spectrum from a clustered set of spectra from the same precursor [30]) and edges denote pairs of spectra from peptides with overlapping sequences. It is then shown that de novo sequences assembled by simultaneous interpretation of multiple spectra from overlapping peptides are much more accurate than individual per-spectrum interpretations (Chapters 2, 3). Use of multiple enzyme digestions and strong cation exchange (SCX) (Edelmann, 2011) fractionation is becoming more common in MS/MS protocols to generate broader coverage of protein sequences and yield wider distributions of overlapping peptides, but current statistical methods still ignore the peptide sequence overlaps and separately compute the significance of individual peptides matched to individual spectra [99].
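As a rough illustration of the network structure described above, the following sketch pairs idealized prefix-mass peak lists by counting peaks that coincide under a single mass shift. This is an assumption-laden stand-in for spectral alignment, not the published procedure: the names `prefix_masses`, `best_shift`, and `build_network`, the `min_shared` threshold, and the nominal integer residue masses are all hypothetical, and the peptide labels are used only to generate noise-free toy spectra (real spectra would be unidentified at this stage).

```python
from collections import Counter
from itertools import accumulate, combinations

# Nominal integer residue masses (illustrative assumption).
RESIDUE_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101,
                "L": 113, "I": 113, "D": 115, "K": 128, "E": 129}


def prefix_masses(peptide):
    """Cumulative masses of all prefixes of a peptide (idealized peak list)."""
    return list(accumulate(RESIDUE_MASS[aa] for aa in peptide))


def best_shift(peaks_a, peaks_b):
    """Shift d maximizing the number of pairs with peak_a == peak_b + d.

    Ties are broken in favor of the shift with the smallest magnitude.
    """
    counts = Counter(a - b for a in peaks_a for b in peaks_b)
    shift, shared = max(counts.items(), key=lambda kv: (kv[1], -abs(kv[0])))
    return shift, shared


def build_network(spectra, min_shared=4):
    """Edges between spectra sharing at least `min_shared` shifted peaks.

    Returns a dict mapping (name_a, name_b) to the best mass shift.
    """
    edges = {}
    for (name_a, pa), (name_b, pb) in combinations(spectra.items(), 2):
        shift, shared = best_shift(pa, pb)
        if shared >= min_shared:
            edges[(name_a, name_b)] = shift
    return edges


spectra = {pep: prefix_masses(pep) for pep in ("PEPTIDE", "PTIDES", "GAVLK")}
print(build_network(spectra))
```

In this toy input only the two spectra from overlapping peptides are connected, and the recovered shift equals the mass of the two residues by which one peptide is offset from the other.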
Given that the set of all possible protein sequences is orders of magnitude larger than the human six-frame translation (or any other database), application of these de novo techniques to database search should substantially improve peptide identification rates, especially for large databases. Since the original generating function approach

showed how de novo algorithms can be used to estimate the significance of PSMs for individual spectra, it is expected that advances in de novo sequencing should consequently translate into better estimation of PSM significance. It has already been shown that spectral networks can be used to improve the ranking of database peptides against paired spectra [6], but it is still unclear how to accurately evaluate the statistical significance of peptides matched to multiple overlapping spectra. Intuitively, if it is known that these overlapping spectra yield more accurate de novo sequencing, then the probability of observing multiple incorrect high-scoring PSMs with overlapping sequences should be lower than the probability of single incorrect peptides matching single spectra with high scores. To this end, we introduce StarGF, a novel approach for peptide identification that accurately models the distribution of all peptide sequences against pairs of spectra from overlapping peptides. We demonstrate its performance on a typical lysate mass spectrometry dataset and show that it can improve peptide-level identification by up to 62% compared to a state-of-the-art database search tool.

1.2 Methods

Spectral probabilities and notation

We describe a method to assess the significance of overlapping PSMs based on the generating function approach for computing the significance of individual PSMs [55]. Although traditional methods for scoring PSMs incorporate prior knowledge of N/C-terminal ions, peak intensities, charges, and mass inaccuracies, these terms are avoided here for simplicity of presentation, and later we describe how these features were considered for real spectra. Let a peptide $P$ of length $n$ be a string of amino acids $a[1 \ldots n]$ with parent mass $|P| = \sum_{i=1}^{n} |a[i]|$, where each $a[i]$ is one of the standard amino acids, $a[i] \in A$. For clarity

of presentation, we define amino acid masses $|a[i]|$ to be integer valued and each MS/MS spectrum to be an integer vector $S[1 \ldots |S|] = s[1] \ldots s[|S|]$, where $s[i] > 0$ if there is a peak at mass $i$ (having intensity $s[i]$), and $s[i] = 0$ otherwise (denote $|S|$ as the parent mass of $S$). Let $\mathrm{Spectrum}(P)$ be a spectrum with parent mass $|P|$ such that $s[i] = 1$ if $i$ is the mass of a prefix of $P$. We define the match score between spectra $S = s[1 \ldots |S|]$ and $S' = s'[1 \ldots |S'|]$ as $\sum_{i=1}^{|S|} s[i] \cdot s'[i]$. Thus, the match score $\mathrm{Score}(P,S)$ between a peptide $P$ and a spectrum $S$ is equivalent to the match score between $\mathrm{Spectrum}(P)$ and $S$ if both spectra have the same parent mass (otherwise, $\mathrm{Score}(P,S) = -\infty$). The problem faced by peptide identification algorithms is to find a peptide $P$ from a database of known protein sequences that maximizes $\mathrm{Score}(P,S)$, and then assess the statistical significance of each top-scoring PSM.

Given a PSM $(P,S)$ with score $\mathrm{Score}(P,S) = T$, the spectral probability introduced by MS-GF [55] computes the significance of the match as the aggregate probability that a random peptide $P$ achieves $\mathrm{Score}(P,S) \ge T$, otherwise termed $\mathrm{Prob}_T(S)$. The probability of a peptide $P = a[1 \ldots n]$ is defined as the product of probabilities of its amino acids $\prod_{i=1}^{n} p(a[i])$, where each amino acid $a \in A$ has a fixed probability of occurrence of $1/|A|$ (or could be set to the observed frequencies in a target database). In MS-GF, computing $\mathrm{Prob}_T(S)$ is done in polynomial time by filling in the dynamic programming matrix $SP(i,t)$, which denotes the aggregate probability that a random peptide $P$ with mass $|P| = i$ achieves $\mathrm{Score}(P,S[1 \ldots i]) = t$. The $SP$ matrix is initialized to $SP(0,0) = 1$, zero elsewhere, and updated using the following recursion [55].

$$SP(i,t) = \sum_{a \in A:\; i \ge |a|,\; t \ge s[i]} SP(i - |a|,\, t - s[i]) \cdot p(a) \qquad (1.1)$$
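The recursion in Eq. 1.1, together with the tail sum over scores of at least $T$ that defines $\mathrm{Prob}_T(S)$, can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the MS-GF implementation: the function name `spectral_probability` is hypothetical, the spectrum is a plain list whose index is the integer mass (index 0 unused), amino acids are uniformly likely, and a two-letter toy alphabet with masses 1 and 2 is used so the result can be checked by hand.

```python
from collections import defaultdict


def spectral_probability(spectrum, threshold, aa_masses):
    """Aggregate probability that a random peptide scores >= threshold.

    spectrum   -- list of non-negative ints; spectrum[i] is the peak score at
                  integer mass i, and len(spectrum) - 1 is the parent mass
    threshold  -- the score T of the PSM being assessed
    aa_masses  -- list of integer amino acid masses, one entry per amino acid
                  (each has probability 1 / len(aa_masses))
    """
    parent_mass = len(spectrum) - 1
    p = 1.0 / len(aa_masses)

    # sp[i][t] = aggregate probability of peptides with mass i and score t.
    sp = [defaultdict(float) for _ in range(parent_mass + 1)]
    sp[0][0] = 1.0

    for i in range(1, parent_mass + 1):
        for mass in aa_masses:
            if i - mass < 0:
                continue
            # Appending an amino acid of this mass creates a prefix ending at
            # mass i, which adds spectrum[i] to the score (Eq. 1.1).
            for t_prev, prob in sp[i - mass].items():
                sp[i][t_prev + spectrum[i]] += prob * p

    # Tail sum over all scores t >= T at the parent mass.
    return sum(prob for t, prob in sp[parent_mass].items() if t >= threshold)


# Toy alphabet with masses 1 and 2; peaks at masses 1 and 3, parent mass 3.
toy_spectrum = [0, 1, 0, 1]
print(spectral_probability(toy_spectrum, threshold=2, aa_masses=[1, 2]))
```

In the toy case, the peptides with mass 3 are (1,1,1), (1,2) and (2,1), with probabilities 1/8, 1/4 and 1/4; the first two reach score 2, so the printed value is 0.375. The sparse per-mass dictionaries are a convenience of this sketch; a dense matrix indexed by mass and score would follow the text more literally.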

$\mathrm{Prob}_T(S)$ is calculated from the $SP$ matrix as follows:

$$\mathrm{Prob}_T(S) = \sum_{t \ge T} SP(|S|, t) \qquad (1.2)$$

Pairing of spectra

A pair of overlapping PSMs is defined as a pair $(P,S)$ and $(P',S')$ such that (1) both spectra are matched to the same peptide ($P = P'$) or (2) the spectra are matched to peptides with partially overlapping sequences: either one peptide is a substring of the other, or a prefix of one peptide matches a suffix of the other. We also enforce that partially overlapping peptide sequences exist in the target database. For example, given the peptide pair PEPTIDE and PTIDES, we enforce that PEPTIDES is a substring of at least one protein in the database; otherwise, the pair is discarded. As mentioned above, spectral pairs can be detected using spectral alignment without explicitly knowing which peptide sequences produced each spectrum (as described previously [81, 2]). Intersecting spectral probabilities (described below) are calculated for all pairs of spectra with overlapping PSMs. In addition, we use all neighbours of each paired spectrum to calculate the star probability for the center nodes in each subcomponent defined by $S$ and all of its immediate neighbours.

Star Probabilities

In the simplest case of a pair of overlapping PSMs $(P,S)$ and $(P',S')$, where $P = P'$, we want to find the aggregate probability that a random peptide matches $S$ with score $\ge T$ and matches $S'$ with score $\ge T'$ (denoted the intersecting spectral probability $\mathrm{Prob}_{T,T'}(S,S')$). A naive solution is to simply take the product of $\mathrm{Prob}_T(S)$ and $\mathrm{Prob}_{T'}(S')$, but this approach fails to capture the dependence between the two matches induced by the similarity between $S$ and $S'$. Intuitively, a high similarity between $S$ and

$S'$ should correlate with a high probability that both spectra get matched to the same peptide, regardless of whether it is a correct match. $\mathrm{Prob}_{T,T'}(S,S')$ can be computed efficiently by adding an extra dimension to the dynamic programming recursion $SP$, yielding a three-dimensional matrix $ISP_s(i,t,t')$ that tracks the aggregate probability that a random peptide $P$ with mass $i$ matches $S[1 \ldots i]$ with score $t$ and matches $S'[1 \ldots i]$ with score $t'$. The $ISP_s$ matrix is initialized to $ISP_s(0,0,0) = 1$, zero elsewhere, and computed as follows.

$$ISP_s(i,t,t') = \sum_{a \in A:\; i \ge |a|,\; t \ge s[i],\; t' \ge s'[i]} ISP_s(i - |a|,\, t - s[i],\, t' - s'[i]) \cdot p(a) \qquad (1.3)$$

$\mathrm{Prob}_{T,T'}(S,S')$ is calculated from the $ISP_s$ matrix as follows:

$$\mathrm{Prob}_{T,T'}(S,S') = \sum_{t \ge T} \sum_{t' \ge T'} ISP_s(|S|, t, t') \qquad (1.4)$$

To generalize intersecting spectral probabilities to include pairs of spectra from partially overlapping peptides, we define $ISP(i,t,t')$ to address the case where $S'$ is shifted in relation to $S$ (Figure 1.1) by a given mass shift $\lambda$, which may be positive or negative. The shift $\lambda$ defines an overlapping mass range between the spectra; in spectrum $S$, the range starts at mass $b = \max(0, \lambda)$ and ends at mass $e = \min(|S|, |S'| + \lambda)$, while in spectrum $S'$ the range starts at mass $b' = \max(0, -\lambda)$ and ends at mass $e' = \min(|S'|, |S| - \lambda)$. Since partially overlapping spectra may originate from different peptides ($\lambda \ne 0$ or $|S| \ne |S'|$), the probabilities of peptides matching $S$ must be processed differently from those matching $S'$. If one considers a peptide $P$ matching $S$, only the portion of $P$ from $b$ to $e$ (denoted as $P_{ovlp}$) can be matched against $S'[b' \ldots e'] = s'[b'] \ldots s'[e']$. For example, in

Figure 1.1, $P_{ovlp}$ is equal to the peptide PTIDE.

Figure 1.1. Illustration of $P_{ovlp}$ and the overlapping mass range between overlapping spectra $S$ and $S'$ matched to peptides (PEPTIDE, PTIDES) (left) and (PTIDES, PEPTIDE) (right), respectively.

First, $ISP(i,t,t')$ is defined to hold the aggregate probability that a random peptide $P$ with mass $i$ achieves $\mathrm{Score}(P, S[1 \ldots i]) = t$ such that $\mathrm{Score}(P_{ovlp}, S'[b' \ldots \min(e', i - \lambda)]) = t'$. In cases where $i$ is less than $b$ (i.e., when $\lambda > 0$), $P_{ovlp}$ is empty and is defined to have zero score against $S'$. The base case for $ISP(i,t,t')$ is the same as the base case for $ISP_s$, but the recursion must be separated into three cases depending on whether $i \le b$, $b < i \le e$, or $i > e$. If $i \le b$, then $ISP(i,t,t')$ is tracking peptides matching $S[1 \ldots i]$ with score $t$, but score 0 against $S'$.

If $i \le b$ ($t' = 0$):

$$ISP(i,t,0) = \sum_{a \in A:\; |a| \le i,\; t \ge s[i]} ISP(i - |a|,\, t - s[i],\, 0) \cdot p(a) \qquad (1.5)$$

When $i$ is inside the overlapping mass range of $S$, the matrix tracks peptides

matching $S[1 \ldots i]$ with score $t$ that contain a suffix matching $S'[b' \ldots i-\lambda]$ with score $t'$. If $b < i \le e$:

\[ ISP(i,t,t') = \sum_{a \,:\, a \le i,\; t \ge s[i],\; t' \ge s'[i-\lambda],\; i-a \ge b} ISP(i-a,\; t-s[i],\; t'-s'[i-\lambda]) \cdot p(a) \tag{1.6} \]

When $e < i \le |S|$ and, thus, $i$ is outside the overlapping mass range, $ISP(i,t,t')$ is extending peptides $P$ matching $S[1 \ldots i]$ with score $t$ where $P_{ovlp}$ has score $t'$ against $S'[b' \ldots e']$. If $i > e$:

\[ ISP(i,t,t') = \sum_{a \,:\, a \le i,\; t \ge s[i],\; i-a \ge e} ISP(i-a,\; t-s[i],\; t') \cdot p(a) \tag{1.7} \]

If $P$ matches $S$ with score $T$ and $P_{ovlp}$ matches $S'[b' \ldots e']$ with score $T'$, the probability of both events is computed as given below.

\[ \mathrm{Prob}_{T,T'}(S, S'[b' \ldots e']) = \sum_{t \ge T} \sum_{t' \ge T'} ISP(|S|,t,t') \tag{1.8} \]

Note that since $\lambda$ may be positive or negative, the intersecting probability of a peptide $P'$ matching $S'$ with score $T'$ and $P'_{ovlp}$ matching $S[b \ldots e]$ with score $T$ is computed by simply setting $\lambda = -\lambda$ before calculating $\mathrm{Prob}_{T',T}(S', S[b \ldots e])$. The term star is defined as the set of all spectra directly connected with spectrum $S$ in the spectral network [6]. We are interested in the minimum $\mathrm{Prob}_{T,T'}(S, S'[b' \ldots e'])$
over all $S'$ in the star of $S$, otherwise termed the star probability of $S$. Computation of the star probability is more precisely defined in the pseudocode below.

Algorithm 1. Computation of star probability
1: procedure STARPROBABILITY(P, S)
2:     T ← Score(P, S)
3:     starP ← Prob_T(S)
4:     for all (S, S′) ∈ star of S do
5:         λ ← mass shift of S′ in relation to S
6:         T′ ← Score(P_ovlp, S′[b′...e′])
7:         if Prob_{T,T′}(S, S′[b′...e′]) > 0 then
8:             starP ← min(starP, Prob_{T,T′}(S, S′[b′...e′]))
9:     return starP

Processing real spectra

Each MS/MS spectrum was transformed into a prefix-residue mass (PRM) spectrum [20] with integer-valued masses and likelihood intensities $s_1 \ldots s_{|S|}$ using the PepNovo+ probabilistic scoring model [32]. PepNovo+ interprets MS/MS fragmentation patterns and converts MS/MS spectra into PRM spectra where peak intensities are replaced with log-likelihood scores and peak masses are replaced by PRMs (cumulative amino acid masses of putative N-terminal prefixes of the peptide sequence). PRM scores combine evidence supporting peptide breaks: observed cleavages along the peptide backbone supported by either N- or C-terminal fragments. To minimize rounding errors, floating point peak masses returned by PepNovo+ were converted to integer values as in MS-GF[55], where cumulative peak mass rounding errors were reduced by multiplying by a scaling constant before rounding to integers (amino acid masses were also rounded to integer values). High-resolution peak masses could also be supported by using a larger multiplicative constant (e.g., 100.0) prior to rounding. Peak intensities were first normalized so that each spectrum contained a maximum total score of $\sigma = 150$ and were then rounded to integers (peaks with score < 0.5 were effectively removed). With
these parameters, the time complexity of computing individual and intersecting spectral probabilities is approximately $O(|S| \cdot \sigma \cdot |A|)$ and $O(|S| \cdot \sigma^2 \cdot |A|)$, respectively. In practice, we implemented the intersecting spectral probability calculation in C++ and achieved an average running time of less than 0.01 seconds per pair. It is conceivable to further generalize star probabilities to include $m > 2$ networked PSMs by adding $m - 1$ more dimensions to the dynamic programming table (ISP) used to calculate intersecting spectral probabilities, but this would of course yield an exponential running time of $O(|S| \cdot \sigma^m \cdot |A|)$. Thus, it is possible that the results of the StarGF approach would further improve if further implementation efforts and compute time were invested into ways to approximate this calculation for larger components of networked spectra.

Generating candidate PSMs

A published set of ion-trap CID spectra acquired from the model organism Saccharomyces cerevisiae was used to benchmark this approach [99]. To aid in the acquisition of spectra from overlapping peptides, 12 SCX fractions were obtained for each of five enzyme digests. Three technical replicates were also run for each digest, but only spectra from the second replicate were used here. Thermo RAW files were converted to mzXML using ProteoWizard[54] (version ) with peak-picking enabled and clustered using MSCluster[30] (version 2.0, release ) to merge repeated spectra, yielding 255,561 clusters of one or more spectra. MS-GFDB[56] (version 7747) was used to match spectra against candidate peptides from target and decoy protein databases. Two sets of target+decoy databases (labelled small and large) were used to evaluate the performance of individual versus StarGF spectral probabilities when searching databases of different size. The small target database consisted of all reference S. cerevisiae protein sequences downloaded from UniProt[1] (4 MB on 09/27/2013), while the large database contained all reference fungi
UniProt protein sequences (130 MB on 09/27/2013). The large database (32 times larger) was used to represent searches against large search spaces, such as meta-proteomics[16] or 6-frame translation[12] searches. Separate small and large decoy databases were generated by randomly shuffling protein sequences from the target database [25]. The 255,561 cluster-consensus spectra were separately searched against the small target, small decoy, large target, and large decoy databases with MS-GFDB[56] configured to report the top 10 PSMs for each spectrum. The no-enzyme model was selected along with 30 ppm parent mass tolerance, the Low-res LCQ/LTQ instrument ID, one 13C, two allowed nonenzymatic termini, and amino acid probabilities set to 0.05 (the same amino acid probabilities used by StarGF). Target and decoy PSMs were then merged by an in-house program that discarded decoy PSMs whose peptides were also found in the target database (allowing for I/L, Q/K, and M+16/F ambiguities). Although variable post-translational modifications (PTMs) were permitted in each initial search to reproduce typical search parameters (oxidized methionine and deamidated asparagine/glutamine), spectra assigned to modified PSMs were removed from consideration at this stage (the incorporation of PTMs into intersecting spectral probabilities is not considered here). The top-scoring peptide match for each remaining spectrum was then set to the target or decoy PSM with the highest matching score to the PRM spectrum. Each set of unfiltered target+decoy PSMs was evaluated at 1% FDR[76] using star probabilities. To benchmark StarGF, each set of MS-GFDB results was separately evaluated at 1% FDR using MS-GFDB's spectral probability[56] while allowing MS-GFDB to report the top-scoring PSM per spectrum.
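As a rough illustration of how a set of merged target+decoy PSMs can be filtered at a fixed FDR, the sketch below ranks PSMs by p-value and keeps the largest prefix whose estimated FDR stays within the limit. The estimator used here (number of decoys divided by number of targets above the cutoff) and the function name `accept_at_fdr` are assumptions made for illustration; the text does not spell out the exact estimator, and ties in p-value are not treated specially.

```python
def accept_at_fdr(psms, max_fdr=0.01):
    """Return the target PSMs accepted at the given FDR.

    psms: iterable of (p_value, is_decoy) tuples, one per spectrum.
    FDR at a cutoff is estimated as #decoys / #targets among PSMs
    whose p-value is at or below the cutoff (an assumed estimator).
    """
    ranked = sorted(psms, key=lambda psm: psm[0])  # most significant first
    targets = decoys = 0
    keep = 0  # length of the longest ranked prefix that satisfies the FDR limit
    for k, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            keep = k
    return [psm for psm in ranked[:keep] if not psm[1]]


# Toy data (hypothetical): 150 targets and 6 decoys.
toy = ([(1e-9, False)] * 100 + [(1e-5, True)] +
       [(1e-4, False)] * 50 + [(1e-3, True)] * 5)
print(len(accept_at_fdr(toy)))
```

Because the cutoff is chosen on the ranked list, lowering the p-values of true PSMs relative to decoys (as star probabilities aim to do) directly increases the number of accepted targets at the same FDR.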
X!Tandem Cyclone ( )[18] was also run on the same set of MS/MS spectra in a separate search against each database, and results were filtered at 1% spectrum- and peptide-level FDR using the same target-decoy approach. X!Tandem search parameters consisted of 0.5 Da peak tolerance, 30 ppm parent mass tolerance, multiple 13C, and nonspecific enzyme cleavage (remaining parameters
were set to their default values). All raw and clustered MS/MS spectra associated with this study have been uploaded to the MassIVE public repository (massive.ucsd.edu) while StarGF can be obtained from proteomics.ucsd.edu.

1.3 Results

Two sets of pairwise alignments were used to demonstrate the effectiveness of StarGF: (1) the set of pairs obtained by spectral alignment in the spectral network[6], and (2) to simulate the situation when maximal pairwise alignment sensitivity is achieved, pairs were also obtained using sequence-based alignment of the top-scoring peptide matches returned by the MS-GFDB searches. A pair of overlapping PSMs was retained if they shared at least seven overlapping residues and at least three matching theoretical PRM masses from the overlapping sequence. Networks of paired PSMs were generated using either one of these two pairing strategies, leading to two different star probability calculations for each PSM: one in which the star probability was selected as the minimum intersecting probability over all spectrum-based pairs (method 1), and the other where the star probability was selected as the minimum intersecting probability over all sequence-based pairs (method 2). To eliminate the possibility of pairing unique peptides from different proteins, each target PSM pair was required to have at least one target protein containing the full sequence supported by the pair (e.g., the pair (PEPTIDE,PTIDES) must be supported by a protein containing the substring PEPTIDES). Unless otherwise stated, results are reported after applying the sequence-based pairing strategy to 40,926 unmodified target PSMs from the small database (separately identified by MS-GFDB at 1% spectrum-level FDR), yielding 32,777 paired spectra in the network. Using these parameters, less than 1% of pairs contained at least one decoy PSM, while 5% of paired PSMs were decoys for the large database set. The significance
of each PSM $(P,S)$ was reported as the star probability of $S$. To evaluate the utility of intersecting probabilities, we separately assessed intersecting spectral probabilities for same-peptide pairs and partially overlapping pairs: we computed a same-peptide star probability (equal to the minimum $\mathrm{Prob}_{T,T'}(S, S'[b' \ldots e'])$ such that $P = P'$) and a partially overlapping star probability (equal to the minimum $\mathrm{Prob}_{T,T'}(S, S'[b' \ldots e'])$ such that $P \ne P'$) for each spectrum in the network. Figure 1.2 illustrates the substantial separation between individual spectral probabilities, same-peptide star probabilities, and partially overlapping star probabilities (top panel). Same-peptide star probabilities can be further separated into those where the minimum intersecting probability was selected for a pair of PSMs with equal precursor charge (higher correlation between MS/MS fragmentation patterns[103]), and those where the minimum was selected for a pair with different precursor charge states (less-correlated MS/MS fragmentation). Due to repeated instrument acquisition of multiple spectra from the same peptide and charge state, it was expected that individual spectral probabilities would be approximately the same as intersecting probabilities for most same-peptide/same-charge pairs since duplicate spectra often have high similarity [103]. Nevertheless, star probabilities for same-peptide/same-charge pairs still prove valuable in improving spectral probabilities by an average of 2 orders of magnitude (Figure 1.2, bottom left), while same-peptide/different-charge and partially overlapping pairs enable an even greater improvement in spectral probabilities by an average of 8 orders of magnitude. The distributions of decoy spectral probabilities in the bottom right panel of Figure 1.2 illustrate the effect of star probabilities on paired decoy PSMs.
It was rare for decoy PSMs to pair with others in the network (only 919 of 37,522 decoy PSMs were detected in a spectral pair), and those that did had their spectral probabilities improve by an average of 2 orders of magnitude, which is significantly less than that observed for correct PSM pairs. Also shown is the distribution of decoy star probabilities as computed by the product of probabilities ($\mathrm{Prob}_{T,T'}(S,S') = \mathrm{Prob}_T(S) \cdot \mathrm{Prob}_{T'}(S')$). As expected, the product of spectral probabilities ignores the dependencies between the spectra and underestimates the true intersecting spectral probability by several orders of magnitude. This would likely lead to increased sampling of false-positive PSMs at any given star probability cutoff and thus result in an overall reduced number of identifications by requiring stricter probability thresholds to achieve the same 1% FDR. This effect can be explained intuitively for a given pair of PSMs $(P,S)$ and $(P',S')$, where $S = S'$ and $P = P'$: if a random peptide matches $S$ with a high score, then with probability 1 the same random peptide also matches $S'$ with an equally high score. Thus, in this special case, $\mathrm{Prob}_{T,T'}(S,S')$ should equal $\mathrm{Prob}_T(S) = \mathrm{Prob}_{T'}(S')$, not the product of the individual spectral probabilities.

Figure 1.2. Spectral and star probability distributions of observed p-values. (Top) Distribution of the spectral, same-peptide star, and partially overlapping star probabilities for peptide-spectrum matches (PSMs) with at least one same-peptide pair and at least one partially overlapping pair. (Bottom left) Distribution of spectral, same-charge star, and unequal-charge star probabilities for PSMs from at least one same-peptide pair. (Bottom right) Distribution of spectral and star probabilities for all 919 small-database decoy PSMs found in the network where 480 had a same-peptide pair and 450 had a partially overlapping pair (11 had more than one pair). Also shown is the distribution of the product of individual spectral probabilities for the same decoys (where $\mathrm{Prob}_{T,T'}(S,S')$ is computed as $\mathrm{Prob}_T(S) \cdot \mathrm{Prob}_{T'}(S')$) to illustrate how it would substantially underestimate $\mathrm{Prob}_{T,T'}(S,S')$ by ignoring the dependencies between repeated MS/MS spectra acquisitions from the same peptide with the same charge state.

Figure 1.3 compares every PSM's star probability to its optimal spectral probability, which is defined as the spectral probability of the same peptide matched against the subset of peaks from the spectrum that correspond to true PRM masses (i.e., a noise-free version of the spectrum). In general, star probabilities improved the least for spectral probabilities that were already close to optimal. But the vast majority of star probabilities improved past optimal, particularly for stars with same-peptide/unequal-charge and partially overlapping pairs. Star probabilities can improve past optimal when missing PRMs from one spectrum $S$ are present in the overlapping region of the spectrum that $S$ is paired with, thus enforcing that high-scoring peptide matches contain prefix masses that would otherwise be missed.
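The identical-spectra special case discussed above can be checked numerically with a small sketch of the same-peptide recursion (Equations 1.3 and 1.4). Everything concrete here is an assumption made for illustration: a toy alphabet of two "amino acids" with integer masses 2 and 3 and probability 0.5 each, short hand-made PRM spectra stored as lists where index `i` holds the integer score at mass `i`, and sparse dictionaries in place of the dense matrices of the C++ implementation mentioned in the text.

```python
from collections import defaultdict

# Toy alphabet (assumption): integer residue mass -> probability p(a).
AMINO = {2: 0.5, 3: 0.5}


def spectral_prob(s, T):
    """Prob_T(S): aggregate probability of peptides of mass |S| scoring >= T."""
    n = len(s) - 1                       # parent mass |S|
    sp = [defaultdict(float) for _ in range(n + 1)]
    sp[0][0] = 1.0                       # base case SP(0, 0) = 1
    for i in range(1, n + 1):
        for a, p in AMINO.items():
            if i >= a:
                for t, prob in sp[i - a].items():
                    sp[i][t + s[i]] += prob * p
    return sum(prob for t, prob in sp[n].items() if t >= T)


def intersecting_prob(s, s2, T, T2):
    """Prob_{T,T'}(S,S') for two spectra with the same parent mass."""
    assert len(s) == len(s2)
    n = len(s) - 1
    isp = [defaultdict(float) for _ in range(n + 1)]
    isp[0][(0, 0)] = 1.0                 # base case ISP_s(0, 0, 0) = 1
    for i in range(1, n + 1):
        for a, p in AMINO.items():
            if i >= a:
                for (t, t2), prob in isp[i - a].items():
                    isp[i][(t + s[i], t2 + s2[i])] += prob * p
    return sum(prob for (t, t2), prob in isp[n].items()
               if t >= T and t2 >= T2)


S = [0, 0, 1, 2, 0, 3, 0, 4]             # toy PRM scores at masses 0..7
single = spectral_prob(S, 8)
joint = intersecting_prob(S, S, 8, 8)    # S paired with an identical copy
print(single, joint, single * single)
```

For this toy spectrum only the mass orderings 2+3+2 and 3+2+2 reach a score of 8 or more, so the joint probability for the identical pair equals the individual probability rather than its square, which is the dependency that the product of probabilities ignores.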
The comparison to optimal spectral probabilities in Figure 1.3 demonstrates that StarGF probabilities can improve on spectral probabilities by orders of magnitude even if perfect separation between signal and noise peaks could be achieved for any given spectrum.

Figure 1.3. Reduction of star probability (y-axis) with respect to optimality of starting spectral probability (x-axis). Each red dot denotes either a same-peptide (left, middle) or partially overlapping (right) star probability. Values on the x-axis that approach zero indicate a starting spectral probability that approaches optimal while larger values indicate suboptimal starting spectral probabilities (by orders of magnitude) due to the presence of unexplained PRM masses in the spectrum. Values on the y-axis that approach zero indicate star probabilities that did not improve substantially over the original spectral probabilities, while larger values indicate star probabilities that are orders of magnitude smaller than spectral probabilities. The blue line is shown to indicate star probabilities that equal their optimal spectral probability; any data point above the blue line indicates a star probability that is more significant than optimal (see text for a detailed explanation). Red numbers next to the lines indicate the percentage of data points above and below each blue line.

Star probabilities of unfiltered target+decoy PSMs were evaluated at 1% FDR using both paired and unpaired PSMs (spectral probabilities were computed for unpaired PSMs). Paired PSMs that were identified by StarGF against the large database were verified to have an FDR of 1% (both at the spectrum level and peptide level) by considering any peptide identified against the fungi database to be a false positive if it was not present in the yeast database (allowing for I/L and Q/K ambiguities). Table 1.1 shows how many paired PSMs were identified by MS-GFDB[56] and StarGF using either spectral alignments or sequence-based PSM alignments. Although sequence-based alignment was effective here, it may prove difficult to pair spectra by top-scoring PSMs from very large databases (e.g., meta-proteomics databases or six-frame translations) where the highest-scoring PSMs are much less likely to be correct due to the increased search space. For these applications, spectral alignment may prove more effective at detecting pairs and using them to re-rank matching PSMs[6] before computing PSM significance by StarGF. Results for sequence-based alignments thus indicate the upper bound of improvement when perfect pairwise sensitivity is achieved by spectral alignment. The 37% drop in MS-GFDB peptide identification rate of paired PSMs from the small to large database is expected since the larger search space allows decoy peptides and false target peptides to randomly match individual spectra with higher scores, thus decreasing the overall number of detected spectra/peptides at a fixed FDR. Using the same set of unfiltered PSMs as MS-GFDB, however, StarGF only lost 20% of paired peptides from the small database as it could identify 36-66% more spectra and 29-62% more peptides by significantly improving the significance of true overlapping PSMs while only marginally increasing the significance of decoy overlapping PSMs (see Table 1.1). Note that, as described here, StarGF could not identify any spectra that were matched to decoy peptides, but only re-rank them by their star probability.
The drop in StarGF identification rate from the small to the large database is explained by this effect; of the 10,648 spectra identified in the small database search but missed in the large database, only 6% were assigned the same peptide from the large database and had their preferred
neighbor (the paired PSM from which the lowest intersecting probability was selected) matched to the same peptide. The remaining PSMs were either matched to a different peptide (75%) or had their preferred neighbors matched to different peptides (19%). Thus, the majority (94%) of PSMs lost by StarGF from the small to the large database search could potentially be recovered by re-ranking candidate peptides against paired spectra (as done before in spectral networks using de novo sequence tags [6]).

Table 1.1. Spectrum- and Peptide-Level Identification Rate of Paired Peptide Spectrum Matches at 1% False-Discovery Rate. The Small database column indicates results using the UniProt reference yeast protein database (4 MB), while results on the right are from searching the larger UniProt reference fungi protein database (130 MB). Rows separate results by the type of alignment used to capture overlapping peptide spectrum matches (PSMs): Aligned spectra indicates pairing by spectral alignment and Aligned seqs. indicates pairing by PSM sequence similarity. Bold numerals indicate the increased percentage of PSMs/peptides captured by StarGF.

                   Small database                 Large database
                   MS-GFDB   StarGF   % Incr.     MS-GFDB   StarGF   % Incr.
Aligned spectra
  Spectra          13,305    18,                  ,799      13,
  Peptides         9,653     12,                  ,439      9,
Aligned seqs.
  Spectra          32,777    44,                  ,251      33,
  Peptides         26,422    34,                  ,525      26,

Although the results in Table 1.1 are over paired PSMs, StarGF still significantly improved spectrum- and peptide-level identification rate for all spectra since a large portion (89%) of all PSMs were paired (Table 1.2). Considering both paired and unpaired (unmodified) PSMs when searching against the small database, MS-GFDB was able to identify 40,926 spectra (34,165 peptides), while StarGF identified 50,310 spectra (35,521 peptides). However, when searching against the large database, MS-GFDB could identify
only 27,128 spectra (22,782 peptides, 33% loss from the small-database search), while StarGF could identify 40,269 spectra (32,891 peptides, 16% loss from the small-database search) using PSM sequence alignments, an overall improvement over MS-GFDB of 48% more identified spectra (44% more identified peptides). This reveals StarGF to be nearly as sensitive when searching a 32 times larger database as MS-GFDB is when searching a small database.

Table 1.2. Spectrum- and Peptide-Level Identification Rate of All (Paired and Unpaired) Peptide Spectrum Matches at 1% False-Discovery Rate Using the Sequence-Based Pairing Strategy. The Small database column indicates results using the UniProt reference yeast protein database (4 MB), while results in the Large database column are from searching the larger UniProt reference fungi protein database (130 MB). (Top, Middle) Identification rates of all three search tools; numbers in bold indicate the increased percentage of IDs retained by StarGF compared to X!Tandem and MS-GFDB. (Bottom) Percent of PSMs and peptides lost by each search tool at 1% false-discovery rate as they moved from the small to large search space.

Small database     X!Tandem   MS-GFDB   StarGF   % Incr.
  Spectra          28,923     40,926    50,
  Peptides         23,957     34,165    39,
Large database     X!Tandem   MS-GFDB   StarGF   % Incr.
  Spectra          13,847     27,128    40,
  Peptides         11,483     22,782    32,
% Lost from larger search space
                   X!Tandem   MS-GFDB   StarGF
  Spectra
  Peptides

Figure 1.4 illustrates the overlap between peptides identified by MS-GFDB against the small database and peptides identified by StarGF. The majority (74%) of
peptides identified by StarGF against the small database were also identified by MS-GFDB. The remaining peptides that MS-GFDB did not identify were predominantly found in PSM pairs (96%), and thus assigned higher significance by StarGF. Of the peptides identified by StarGF against the large database, nearly all were rescued from the sets of peptides identified against the small database by either MS-GFDB or StarGF.

Figure 1.4. Overlap of unique peptides identified at 1% peptide-level false-discovery rate. The top circle denotes peptides identified by MS-GFDB against the small database, while the left and right circles denote peptides identified by StarGF against the small and large databases, respectively. Peptides that only differed by I/L or K/Q ambiguities were counted as the same. Figure is not drawn to exact scale.

1.4 Discussion

While MS-GF[55] demonstrated how de novo sequencing techniques could be used to greatly improve the state of the art in peptide identification by rigorously computing the score distribution of all peptides against every spectrum, it still misses as many as 38% (= ((26,689 - 16,525)/26,689) × 100) of identifiable (unmodified) peptides
when searching large databases by ignoring the significance of overlapping PSMs (see Table 1.1). By now extending this principle using a multi-spectrum approach to compute the probability distribution of PSM scores for all peptides against every pair of overlapping spectra, StarGF is able to assign higher significance p-values to true PSMs while only marginally increasing the significance of false PSMs. Thus, where traditional database search loses sensitivity in searching larger databases, we now show that it is possible to regain nearly all peptides that are lost by MS-GFDB when searching a database 32 times the size. Although StarGF performs best when paired with MS/MS protocols that maximize acquisition of spectra from partially overlapping peptides, our results indicate that significant gains in identification rate can still be made by utilizing commonly observed pairs of spectra from the same peptide, particularly pairs of spectra with different precursor charge states. Previous applications of multiple enzyme digestions have demonstrated significant gains in proteome coverage, but did not address how they could be used to improve peptide identification rates against larger search spaces [99]. The results presented in Figure 1.2 particularly demonstrate how independent MS/MS acquisitions of the same peptide sequence, whether they are from different charge states or overlapping peptides, dramatically reduce the probability of random peptides matching both spectra. This should give greater value toward the application of multiple enzyme digestions and further offset the elevated experimental costs associated with their application.
Although StarGF significantly outperforms a state-of-the-art database search tool (MS-GFDB)[56] in identifying tandem mass spectra at an empirically validated FDR of 1% (confirmed here using matches to non-yeast peptides in the large fungi database), it would be useful to thoroughly assess the limitations of the target/decoy approach when estimating FDR for searches against small databases, as previously done for MS-GFDB searches [49]. In some cases, the enforcement of overlapping PSMs may
result in so few decoy PSMs that it becomes difficult to accurately estimate FDR [37]. A similar situation can also occur in searches with highly accurate parent masses since the number of high-scoring decoy peptides with a given parent mass becomes minuscule with decreasing parent mass tolerance. While the generating function described here supports only unmodified peptides, it can be extended to analyse modified peptides by considering modified amino acid mass edges (as shown before[56]). Further improvements are foreseeable with additional support for high-resolution MS/MS peak masses and incorporation of alternative fragmentation modes (e.g., HCD, ETD) to improve the quality of PRM spectra, especially those from highly charged precursors [38]. Given that MS-GFDB supports multiple fragmentation modes and that we utilize PepNovo+ to transform MS/MS spectra to PRM spectra, it is possible for this approach to support any fragmentation mode since PepNovo+ can be trained to process new types of spectra (Chapter 3).

1.5 Acknowledgements

Chapter 1, in full, is a reprint of the material as it appears in the Journal of Computational Biology: The generating function approach for peptide identification in spectral networks. RECOMB Nov 25. [Epub ahead of print] PubMed PMID: [39]

Chapter 2 Shotgun Protein Sequencing with meta-contig assembly

Full-length de novo sequencing from tandem mass (MS/MS) spectra of unknown proteins such as antibodies or proteins from organisms with unsequenced genomes remains a challenging open problem. Conventional algorithms designed to individually sequence each MS/MS spectrum are limited by incomplete peptide fragmentation or low signal-to-noise ratios and tend to result in short de novo sequences at low sequencing accuracy. Our shotgun protein sequencing (SPS) approach was developed to ameliorate these limitations by first finding groups of unidentified spectra from the same peptides (contigs) and then deriving a consensus de novo sequence for each assembled set of spectra (contig sequences). But whereas SPS enables much more accurate reconstruction of de novo sequences longer than can be recovered from individual MS/MS spectra, it still requires error-tolerant matching to homologous proteins to group smaller contig sequences into full-length protein sequences, thus limiting its effectiveness on sequences from poorly annotated proteins. Using low and high resolution CID and high resolution HCD MS/MS spectra, we address this limitation with a Meta-SPS algorithm designed to overlap and further assemble SPS contigs into Meta-SPS de novo contig sequences extending as long as 100 amino acids at over 97% accuracy without requiring any
knowledge of homologous protein sequences. We demonstrate Meta-SPS using distinct MS/MS data sets obtained with separate enzymatic digestions and discuss how the remaining de novo sequencing limitations relate to MS/MS acquisition settings.

2.1 Introduction

Database search tools, such as Sequest[114], Mascot[80], and InsPecT[105], are the most frequently used methods for reliable protein identification in tandem mass (MS/MS) spectrometry based proteomics. These operate by separately matching each MS/MS spectrum to peptide sequences from reference protein databases where all proteins of interest are presumably contained. But this assumption often does not hold true as many important proteins, such as monoclonal antibodies, are not contained in any database because mechanisms of antibody variation (including genetic recombination and somatic hyper-mutation[23]) constantly create new proteins with novel unique sequences. These mechanisms of variation are the foundation of adaptive immune systems and have enabled highly successful antibody-based therapeutic strategies [71, 45]. Nevertheless, such variation also means that antibody MS/MS spectra are typically impossible to identify via standard database search techniques whenever the corresponding sequences are not known in advance. An inherent drawback of database search strategies is that they are only as good as the database(s) being searched and incomplete databases often result in proteins being misidentified or left unidentified [24]. Despite the importance of novel protein identification, few high-throughput methods have been developed for de novo sequencing of unknown proteins. Low-throughput Edman degradation is a well-known de novo sequencing approach that can accurately call amino acid sequences in N/C-terminal regions of unknown proteins but has drawbacks that make it unsuitable for sequencing proteins longer than 50 amino acids or proteins with post-translational modifications [108, 113].
Many have recognized the potential of
tandem mass spectrometry for protein sequencing. For example, in 1987 Johnson and Biemann[51] manually sequenced a complete protein from rabbit bone marrow. Meanwhile, automated de novo sequencing methods that rely on interpretations of individual MS/MS spectra are limited in that they typically cannot reconstruct long (8+ AA) sequences without mis-predicting 1 in 5 AA on average for low accuracy collision-induced dissociation (CID) spectra [31, 68]. Recent advances in de novo peptide sequencing have improved sequencing accuracy to over 95% for high resolution higher energy collisional dissociation (HCD) spectra [15], but at limited sequence coverage (Chi et al. report only 55% sequence coverage of peptides identified by database search). In fact, all current per-spectrum de novo sequencing strategies face a significant tradeoff between sequencing accuracy and coverage as spectra exhibiting complete peptide fragmentation rarely cover entire target proteins, yet are required to accurately reconstruct full-length peptide sequences. An alternative approach to separately sequencing individual spectra is to simultaneously interpret multiple MS/MS spectra from overlapping peptides. This Shotgun Protein Sequencing (SPS) paradigm differs from traditional algorithms by deriving consensus sequences from contigs, which are sets of multiple MS/MS spectra from distinct peptides with overlapping sequences [2, 4]. Because SPS aggregates multiple spectra from overlapping peptides, protein sequences extending beyond the length of enzymatically digested peptides can be extracted from spectra with incomplete peptide fragmentation. Furthermore, SPS has been found to generate sequences that frequently cover % of the target protein sequence(s) while mis-predicting only 1 out of every 20 amino acids on high resolution MS/MS spectra [3].
But a remaining limitation of SPS is that it still generates fragmented sequences that do not singularly cover large regions of the target protein sequences, much less complete proteins: SPS sequences have an average length of amino acids (depending on input data) and the longest recovered SPS de novo sequence is less than 45 amino acids long [2].

The considerable limitations of de novo sequencing strategies have typically been addressed by attempting to circumvent them using error-tolerant matching to known protein sequences. One such strategy[72] is to generate short de novo sequence tags and then match them exactly to protein databases without requiring matching the N/C-terminal flanking masses (to allow for unexpected polymorphisms or post-translational modifications). Short sequence tags are usually derived from parts of the spectrum with high signal-to-noise ratios and typically have higher sequencing accuracy than full-length de novo sequences [33]. This approach was later extended in MS-Shotgun[46] and continues to be a popular technique for speeding up database search tools [105, 57, 21, 97]. Homology matching of full-length de novo sequences was first explored in CIDentify[106] and later in MS-BLAST[96] by searching de novo sequences using FASTA and WU-BLAST2 (respectively) to find homologous matches to sequences of related proteins; FASTS[70] also approached the problem using a modified version of FASTA. However, common de novo sequencing errors tend to produce sequences that are heavily penalized in pure sequence homology searches. For example, missing peaks in MS/MS spectra may easily cause GA subsequences to be reconstructed as Q or AG (same-mass sequences), thus making subsequent BLAST searches unlikely to succeed. This issue was partially considered in CIDentify and more thoroughly addressed in SPIDER[43] by explicitly modelling de novo sequencing errors together with BLOSUM scores in MS/MS-based sequence homology searches. In addition, OpenSea[92] further explored database matching of de novo sequences for analysis of unexpected post-translational modifications (PTMs). Finally, Shen et al.[93] used short unique de novo sequence tags, called UStags, to discover protein-localized PTMs.
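The same-mass ambiguity mentioned above (GA, AG, and Q) can be verified with a few lines of arithmetic. The sketch below uses standard monoisotopic residue masses for four amino acids; the helper name `residue_mass` is introduced here purely for illustration.

```python
# Standard monoisotopic residue masses in Da (only the residues needed here).
MONO = {"G": 57.02146, "A": 71.03711, "Q": 128.05858, "K": 128.09496}


def residue_mass(seq):
    """Sum of monoisotopic residue masses for an amino acid string."""
    return sum(MONO[aa] for aa in seq)


for seq in ("GA", "AG", "Q", "K"):
    print(seq, round(residue_mass(seq), 5), round(residue_mass(seq)))
```

GA and AG have identical mass and differ from Q by roughly 0.00001 Da, so no realistic fragment mass accuracy separates them when the peak between G and A is missing. Q and K differ by about 0.036 Da, which is why they are also treated as ambiguous at low resolution.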
Recent approaches to homology matching of de novo sequences have built on genome assembly and sequencing techniques to achieve database-assisted full-length sequencing of unknown proteins. Comparative Shotgun Protein Sequencing (cSPS)

complemented SPS assembly techniques with error-tolerant matching of de novo sequences to find overlapping SPS de novo sequences that are then further assembled into full-length protein sequences [3]. cSPS was designed to support the sequencing of highly divergent proteins that have regions close enough in homology to transfer matches from a reference. cSPS was shown to enable de novo sequencing of monoclonal antibodies at 95+% sequencing accuracy, while simultaneously tolerating and identifying unexpected PTMs [6]. In contrast to cSPS, Champs [66] de novo sequences individual spectra to obtain putative peptide sequences, which are then mapped to homologous proteins to correct sequencing errors and reconstruct protein sequences with 100% accuracy and 99% coverage. However, Champs is designed to map only peptides that differ from the reference sequence by one or two amino acids, and it does not handle PTMs. As such, its sequencing accuracy is not directly comparable to that of cSPS, because Champs was not designed to sequence highly divergent proteins (such as monoclonal antibodies) with multiple PTMs, insertions, deletions, and/or recombinations. GenoMS [13] extended the approaches in cSPS/Champs by explicitly modeling protein splice variants as paths in splice graphs where nodes represent translated exon regions [12]. MS/MS spectra are first searched for exact sequence matches against all possible protein isoforms. The remaining unidentified MS/MS spectra are then aligned to the matched peptides and de novo sequenced to extend the matched sequences into novel regions. Reported sequences are 97-99% accurate and cover 96-99% of target proteins, depending on sequence similarity between the novel and reference sequences [13]. However, GenoMS de novo sequences are usually extended by fewer than 3 amino acids beyond matched peptides because sequencing accuracy degrades as sequences are extended, which prevents the consistent extension of long (10+ AA) sequences.
Altogether, the use of homology matching approaches for full-length de novo protein sequencing continues to be limited by 1) requiring the previous knowledge of closely related protein sequences and 2) the inherent

difficulties in statistically significant homology-tolerant matching of error-prone short de novo sequences.

The Meta-SPS approach proposed here seeks to de novo sequence complete proteins, or long protein regions, without any use of a database. Meta-SPS builds upon SPS by treating SPS de novo sequences (contig sequences) as input spectra and further assembling them into longer de novo sequences (meta-contig sequences). We show that Meta-SPS extends de novo sequences to lengths over 100 AA while boosting sequencing accuracy to only 1 mistake per 40 amino acid predictions, thus enabling database-free de novo sequencing of completely novel proteins while also allowing error-tolerant matching approaches to support higher-divergence homologies (by searching longer, more accurate de novo sequences). Meta-SPS algorithms are demonstrated on CID and HCD MS/MS spectra, and their limitations are discussed in relation to the underlying limitations of bottom-up tandem mass spectrometry.

2.2 Methods

The Meta-SPS workflow is illustrated in Figure 2.1. In brief, because Meta-SPS relies upon the interpretation of MS/MS spectra from overlapping peptides, sample proteins were digested with multiple enzymes. Following MS/MS acquisition, MS/MS Charge Deconvolution was performed to convert all MS/MS fragment peaks to charge one (see supplemental Materials - MS/MS Charge Deconvolution) and Shotgun Protein Sequencing (SPS) [2] was used to assemble unidentified MS/MS spectra into contigs: sets of aligned spectra from peptides with overlapping sequences. SPS contigs were then aligned to each other using Spectral Alignment and further assembled into meta-contig sequences in the Meta-Assembly step. Two data sets were used to develop and benchmark Meta-SPS: a mixture of 6 known proteins (6-prot) and a previously described data set from a purified monoclonal antibody raised against the B- and T- lymphocyte attenuator

molecule (aBTLA) [3]. Briefly, the aBTLA data set consisted of 44,985 MS/MS spectra from the heavy chain and 39,135 MS/MS spectra from the light chain acquired on a Thermo LTQ XL instrument either in the linear trap (low MS/MS mass accuracy) or in the Orbitrap (high MS/MS mass accuracy). Heavy-chain samples were prepared using five different protease digestions (trypsin, chymotrypsin, pepsin, Glu-C, and AspN) and light-chain samples were prepared with four different protease digestions (trypsin, chymotrypsin, pepsin, and AspN).

2.2.1 6-prot Data Acquisition

For the 6-prot sample, an equimolar mixture of six proteins was first prepared. After reduction and alkylation of cysteines, aliquots were digested by different means to produce sets of overlapping peptides. Bovine aprotinin (6.5 kDa, catalog # A-4529) purified from lung, recombinant murine leptin (16 kDa, catalog # L-3772) expressed in E. coli, horse heart myoglobin (17 kDa, catalog # M-1882) purified from heart, and horseradish peroxidase (39 kDa, catalog # P-6782) purified from horseradish roots were purchased from Sigma-Aldrich. E. coli GroEL (57 kDa, catalog # G8976) purified from an E. coli strain overexpressing GroEL was purchased from United States Biological (Swampscott, MA). Human prostate-specific antigen, also known as kallikrein-related peptidase (29 kDa, catalog # P0725), purified from seminal fluid was purchased from Scripps Laboratories (San Diego, CA). The 252 µg total protein mixture was prepared in 100 mM NH4HCO3 and then reduced with 5 mM dithiothreitol, and the cysteines were alkylated with 20 mM iodoacetamide. The proteins that had not already precipitated were further precipitated with 60% ice-cold ethanol. After centrifugation, the supernatant was removed and discarded. The pellet was washed several times with 95% cold ethanol and then resuspended in 0.04% Rapigest (Waters Corp., Milford, MA), an acid-labile SDS-like detergent. Seven 32 µg aliquots were created. Three aliquots were diluted to 0.085%

Figure 2.1. Meta-SPS Procedures. A) Green arrows denote procedures previously described in [2] and red arrows denote procedures described here. The SPS step involves spectral clustering by MSCluster [30], PepNovo+ PRM scoring [32], and assembly of mass spectra into contigs [2]. B) An alignment between two PRM spectra is represented as the shift of the second spectrum wrt the first that yields the highest possible score. The displayed scoring function takes the minimum matched/overlapping intensity ratio and multiplies it by the number of matching peaks (denoted by $MP(A)$ for alignment $A_{S_i,S_j}$ between contig PRM spectra $S_i$ and $S_j$). Matched and overlapping intensities for each spectrum are displayed as red and blue boxed regions, respectively. Sequences are not known in advance; they are shown only for illustration purposes. C) Aligned SPS contigs are assembled into meta-contigs by iteratively merging the highest scoring alignment until the remaining alignments have a low score. By merging the highest scoring alignment at every iteration, it is guaranteed that all inconsistent alignments that were removed have a lower score. D) Green arrows denote merged alignments and numbers correspond to the order in which the alignments are merged. Initially, every contig was in its own meta-contig. The 6 meta-contigs were then merged by five alignments, yielding a single meta-contig PRM spectrum and its meta-contig sequence.

Rapigest at pH 8.0 in 100 mM NH4HCO3 and digested for 6 h with trypsin 1:150, Lys-C 1:300, or Glu-C 1:150. Three aliquots were diluted to 0.01% Rapigest at pH 8.0 in 100 mM NH4HCO3 and digested for 6 h with Asp-N 1:300, chymotrypsin 1:150, or Arg-C 1:150. Digestions were stopped, and the detergent was cleaved, by acidifying with 1% trifluoroacetic acid (TFA), pH 2. The seventh aliquot was acidified and precipitated with 60% ice-cold ethanol, washed with 95% cold ethanol, dried, and digested with cyanogen bromide (CNBr) using 70% TFA for 36 h before drying in a SpeedVac and resuspending in 0.1% TFA. Digests were stored at -80 °C prior to LC-MS/MS.

Digests were analyzed with an automated nano LC-MS/MS system, consisting of an Agilent 1100 nano-LC system (Agilent Technologies, Wilmington, DE) coupled to an LTQ-Orbitrap Fourier transform mass spectrometer (Thermo Fisher Scientific, San Jose, CA) equipped with a nanoflow ionization source (James A. Hill Instrument Services, Arlington, MA). Peptides were eluted from a 10 cm column (Picofrit 75 µm ID, New Objectives) packed in-house with ReproSil-Pur C18-AQ 3 µm reversed phase resin (Dr. Maisch, Ammerbuch, Germany) using a 95 min acetonitrile/0.1% formic acid gradient at a flow rate of 200 nL/min to yield 20 s peak widths. Solvent A was 0.1% formic acid and solvent B was 90% acetonitrile/0.1% formic acid. The elution portion of the LC gradient was 3-7% solvent B in 1 min, 7-37% in 60 min, 37-90% in 6 min, and held at 90% solvent B for 5 min. Data-dependent LC-MS/MS spectra were acquired in 3 s cycles; each cycle was of the following form: one full Orbitrap MS scan at 60,000 resolution followed by five MS/MS scans in the Orbitrap at 7,500 or 15,000 resolution on the most abundant precursor ions using an isolation width of 2.0 or 2.5 m/z. Dynamic exclusion was enabled with a mass width of 25 ppm, a repeat count of 1, and an exclusion duration of 8 s.
Charge state screening was enabled along with monoisotopic precursor selection and nonpeptide monoisotopic recognition to prevent triggering of MS/MS on precursor ions with unassigned charge or a charge state of 1. For CID fragmentation the normalized

collision energy was set to 30 with an activation Q of 0.25 and activation time of 30 ms. For HCD fragmentation the normalized collision energy was set to 60 (first generation, software HCD with 1 segment of black restrictor capillary tubing removed to elevate the ion gauge operating pressure to 1.6e-5 Torr).

2.2.2 Spectrum Preprocessing and Notation

A total of high resolution CID and high resolution HCD 6-prot spectra were obtained after quality filtering by SpectrumMill. All 6-prot spectra were then deconvoluted using MS/MS Charge Deconvolution (see supplemental Materials) and searched with MS-GFDB [56] against the six target proteins and known contaminants with a spectrum-level false discovery rate of 1%; the resulting peptide IDs covered 87% of the target proteins. See supplemental Materials for the parameters used for SpectrumMill, MS/MS Charge Deconvolution, and MS-GFDB. High resolution aBTLA MS/MS spectra were also deconvoluted using our approach, and repeated spectra were detected and converted to consensus spectra using MS-Cluster [30] separately for low and high resolution spectra. This resulted in 8,328 high resolution and 13,863 low resolution clustered CID spectra from the aBTLA light chain, as well as 13,261 high resolution and 14,424 low resolution clustered CID spectra from the aBTLA heavy chain. Spectra were then searched using MS-GFDB at 1% spectrum-level FDR and the resulting peptide identifications covered 99% of the aBTLA protein sequence. We note that peptide identifications were only used for benchmarking the accuracy and coverage of de novo sequences.

The following notation is used below: a peptide MS/MS spectrum $S$ is defined as a collection of peaks where each peak $p \in S$ corresponds to an ion with mass $m[p]$, charge $z[p]$, intensity $i[p]$, and where $p = m[p]/z[p]$. The parent mass $PM[S]$ is the cumulative mass of all residues in the peptide sequence plus the mass of H2O, and the precursor charge $Z[S]$ is the charge of the peptide ion.

2.2.3 Shotgun Protein Sequencing

SPS uses MS-Cluster [30] to cluster deconvoluted spectra from the same peptide and uses PepNovo+ [32] to convert clustered MS/MS spectra into PRM (prefix residue mass) spectra where peak intensities are replaced with log-likelihood scores. Ideal PRM spectra have peaks only at prefix residue masses (PRMs, cumulative amino acid masses of N-term prefixes of the peptide sequence), with peak scores that combine evidence supporting the presence of b/y-ions, such as peak intensity, neutral losses (e.g. loss of H2O), and b/y-ion complementarities, and contrast it with the estimated level of noise [31, 20]. In practice, however, PRM scoring procedures cannot perfectly differentiate between prefix residue masses and suffix residue masses (SRMs, cumulative amino acid masses of C-term suffixes of the peptide sequence plus the mass of H2O) when complementary b and y ion series are present in a spectrum. PRM and SRM peaks typically receive high scores relative to other peaks, whereas PRM peaks usually explain a higher percentage of a spectrum's total score. SPS then aligns PRM spectra to each other in an all-to-all comparison. For each pair of overlapped spectra, PRM and SRM peaks are separated by two complementary alignments, which can be visualized as complementary paths in an alignment matrix [2]. PRM spectrum alignments are retained if their scores are above a certain threshold: SPS fits a Gaussian distribution to spectral alignment scores and chooses score thresholds corresponding to a given p-value (0.045); an alignment between two spectra is retained if it passes the significance threshold for both aligned spectra. Because MS/MS spectra from different acquisition modes have different ion statistics, PRM spectra from different acquisition modes were run separately through SPS.
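The PRM/SRM definitions above can be illustrated with a short Python sketch. It assumes an ideal, noise-free peptide and uses standard monoisotopic masses; the peptide PEPTIDE and the function names are hypothetical and chosen only for illustration.

```python
import math

WATER = 18.01056  # monoisotopic mass of H2O (Da)

# Standard monoisotopic residue masses (Da) for the example peptide.
RESIDUE_MASS = {
    "P": 97.05276,
    "E": 129.04259,
    "T": 101.04768,
    "I": 113.08406,
    "D": 115.02694,
}


def parent_mass(peptide):
    """PM[S]: cumulative mass of all residues plus the mass of H2O."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def prm_ladder(peptide):
    """Cumulative masses of the proper N-term prefixes."""
    masses, total = [], 0.0
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        masses.append(total)
    return masses


def srm_ladder(peptide):
    """Cumulative masses of the proper C-term suffixes plus H2O."""
    masses, total = [], WATER
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        masses.append(total)
    return masses


peptide = "PEPTIDE"
pm = parent_mass(peptide)
prms = prm_ladder(peptide)
srms = srm_ladder(peptide)

# Every PRM p has a complementary SRM at PM - p, so subtracting a PRM ladder
# from the parent mass yields exactly the SRM ladder (and vice versa).
mirrored = sorted(pm - p for p in prms)
is_complementary = all(
    math.isclose(a, b, abs_tol=1e-6) for a, b in zip(mirrored, srms)
)
```

This mirror symmetry is the reason a scorer that sees both b and y ions produces two complementary ladders in the same PRM spectrum.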
Because the b/y-ion and PRM/SRM complementarities make the alignments symmetric, SPS cannot tell which peaks are PRMs and which are SRMs; it can only separate the two sets from each other.

Therefore, contig sequences can assemble either aligned PRM peaks or aligned SRM peaks, with the majority (70%) of sequences assembling PRM peaks as they typically receive higher scores than SRM peaks [2]. Contig sequences assembling SRM peaks must be reversed to match the target protein sequence in the correct orientation. Finally, SPS assembles aligned PRM spectra into contigs, which are sets of aligned spectra from overlapping peptides [2]. Each contig has a corresponding de novo contig sequence, which is the sequence of amino acids and mass gaps (masses that do not match the mass of a single amino acid) that best explains the overlapping peaks in the assembled spectra [2]. Each contig sequence returned by SPS is represented as a contig PRM spectrum, which is a spectrum $S$ with $PM[S]$ equal to the cumulative mass of all residues and gaps in the contig sequence. Each prefix of the contig sequence corresponds to a contig PRM peak, and the score of each contig PRM is the summed score of its assembled spectrum PRMs.

2.2.4 Spectral Alignment

Overlaps between contig PRM spectra were computed using a modified version of the spectral alignment technique introduced in SPS [2]. An alignment between two PRM spectra $S_i$ and $S_j$ is a set of matched PRM pairs imposed by the shift $A_{S_i,S_j}$ (defined below) such that for each matched PRM pair $(p_i, p_j)$, $p_i \in S_i$, $p_j \in S_j$, and $p_i = p_j + A_{S_i,S_j}$. Because some contig sequences may be reversed wrt each other, the highest scoring alignment of $S_i$ and $S_j$ may be between $S_i$ and the reversed orientation of $S_j$. Reversing the orientation of a PRM spectrum $S$ involves simply converting all of $S$'s masses to SRMs by subtracting each PRM mass from the parent mass. Thus, $S^R$ represents the reversed orientation of spectrum $S$ with PRMs $\{p' = PM[S] - p,\ p \in S\}$. The definitions in Table 2.1 (below) are illustrated in Figure 2.1B. For each unique pair of contig PRM spectra $(S_i, S_j)$, all possible shifts of $S_j$

wrt $S_i$ and of $S_j^R$ wrt $S_i$ that yielded at least 6 matching peaks were considered, and the shifts $A_{S_i,S_j}$ and $A_{S_i,S_j^R}$ were set. Of these two shifts, the one with the higher score was reported for the pair. If $\mathrm{score}(A_{S_i,S_j^R}) > \mathrm{score}(A_{S_i,S_j})$, the reverse state of $A$, $R[A]$, was set to true in order to indicate that $S_j$ should be reversed wrt $S_i$ ($R[A] \leftarrow$ false otherwise). Given an input minimum score τ, alignments were then discarded if $\mathrm{score}(A) < \tau$. The parameter τ is also enforced in Meta-Assembly and was separately trained for low mass accuracy contig PRM spectra (0.5 Da peak tolerance) and high mass accuracy contig PRM spectra (0.05 Da peak tolerance).

Table 2.1. Definitions of contig alignment terminology
$A_{S_i,S_j}$: Mass shift (in Da) of $S_j$ wrt $S_i$ that yields the maximum $\mathrm{score}(A)$
$MP(A)$: Number of matched peak pairs between $S_i$ and $S_j$
$MI(A)_i$: Summed intensity of all peaks in $S_i$ that match peaks in $S_j$
$MI(A)_j$: Summed intensity of all peaks in $S_j$ that match peaks in $S_i$
$OI(A)_i$: Summed intensity of all peaks in the m/z range of $S_i$ that overlaps with the aligned m/z range of $S_j$
$OI(A)_j$: Summed intensity of all peaks in the m/z range of $S_j$ that overlaps with the aligned m/z range of $S_i$
$\mathrm{score}(A)$: $\min\left(\frac{MI(A)_i}{OI(A)_i}, \frac{MI(A)_j}{OI(A)_j}\right) \cdot MP(A)$

2.2.5 Meta-Assembly

Similar to the SPS assembly of aligned PRM spectra into contigs, Meta-Assembly groups aligned contig PRM spectra into meta-contigs. Just as every contig has a contig PRM spectrum, every meta-contig has a meta-contig PRM spectrum. Each meta-contig initially contains one contig PRM spectrum. As illustrated in Figure 2.1C, Meta-Assembly then iterates over the following steps: (1) Recruit (2) Reverse (3) Re-sequence (4) Re-score. Step 1 finds the highest scoring aligned pair

of meta-contigs $A^*_{M_i,M_j}$ and stops if the score is below threshold τ; Step 2 reverses $M_j$ if required by the alignment; Step 3 merges $M_i$ and $M_j$ into $M_i'$ and determines the updated meta-contig PRM spectrum; Step 4 transfers and re-scores alignments from $M_i$ and $M_j$ to $M_i'$ and returns to Step 1.

The problem addressed by Meta-Assembly is set in the context of an overlap graph [4], where each vertex is a meta-contig $M_i$ initialized to SPS contig $S_i$, and meta-contig vertices are connected by scored alignment edges labelled with shifts $A_{M_i,M_j}$, scores $\mathrm{score}(A_{M_i,M_j})$, and reverse states $R[A_{M_i,M_j}]$, all initialized using alignments between the corresponding contigs, as described above. In a perfect graph, all connected meta-contigs can be aggregated by merging every alignment edge. However, even though contig PRM spectral alignments are much more reliable than alignments between PRM spectra derived directly from MS/MS spectra, there are still incorrect edges in the graph. There are two types of incorrect edges: inconsistent edges disagree on the shift of meta-contigs wrt each other, and incoherent edges disagree on the orientation of meta-contigs wrt each other. For example, there may be three alignment edges $A^1_{M_i,M_j}$, $A^2_{M_j,M_k}$, and $A^3_{M_i,M_k}$ between three meta-contigs $M_i$, $M_j$, and $M_k$ such that $A^1 + A^2 \neq A^3$. Here the path from $M_i$ to $M_k$ following $A^1$ and $A^2$ imposes a transitive shift (a shift imposed by two or more pair-wise alignments) between $M_i$ and $M_k$ that is not consistent with $A^3$. It may also be the case that $R[A^3]$ = true while $R[A^1] = R[A^2]$ = false. Here the edges are incoherent because $A^1$ and $A^2$ indicate that $M_i$, $M_j$, and $M_k$ are in the same orientation whereas $A^1$ and $A^3$ indicate that $M_k$ is reversed wrt $M_i$ and $M_j$.
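The two failure modes can be expressed as simple checks on a triangle of edges. The sketch below is only an illustration under stated assumptions: the `Edge` class, the function names, and the 0.5 Da tolerance are hypothetical, and the shift check is written for the case where no edge on the path is reversed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    shift: float    # mass shift in Da (A)
    reversed: bool  # reverse state R[A]


def is_coherent(a1, a2, a3):
    """Orientation implied by the path a1 -> a2 must agree with a3.

    Following a1 then a2 flips the orientation only if exactly one of
    the two edges is reversed (an XOR of the reverse states).
    """
    return (a1.reversed != a2.reversed) == a3.reversed


def is_consistent(a1, a2, a3, tol=0.5):
    """Transitive shift a1 + a2 must agree with the direct shift a3."""
    return abs(a1.shift + a2.shift - a3.shift) <= tol


# M_i -> M_j, M_j -> M_k, and the direct edge M_i -> M_k.
a1 = Edge(shift=100.0, reversed=False)
a2 = Edge(shift=50.0, reversed=False)

good = Edge(shift=150.0, reversed=False)
bad_shift = Edge(shift=170.0, reversed=False)   # inconsistent
bad_orient = Edge(shift=150.0, reversed=True)   # incoherent
```

In this toy triangle `good` passes both checks, `bad_shift` fails only the shift check, and `bad_orient` fails only the orientation check, mirroring the two examples in the text.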
The meta-contig assembly problem is that of finding and merging the maximal scoring subset of consistent and coherent alignment edges such that every contig PRM spectrum can be aligned to its meta-contig PRM spectrum with score at least τ. It has been shown that finding the maximal scoring subset of consistent and coherent edges is a hard problem [2]. Thus, we propose an iterative algorithm to approach the optimal solution. See supplemental

materials for a detailed description of Meta-Assembly steps.

In step 1, we recruit the highest scoring edge $A^*_{M_i,M_j}$ between any two meta-contigs $M_i$ and $M_j$. If $\mathrm{score}(A^*) < \tau$, then all remaining edges have a score below the threshold and the merging process ends. Otherwise, $M_i$ and $M_j$ are merged in steps 2-4.

In step 2, $M_j$ is reversed if $R[A^*]$ = true. As described in Spectral Alignment, some alignments between contig PRM spectra are in different orientations. Thus, if aligned contig PRM spectra are to be assembled into coherent meta-contigs, some of them will need to be reversed. Reversing $M_j$ to $M_j^R$ when $R[A^*]$ = true ensures that the spectra inside $M_i$ and inside $M_j$ are in the same orientation before the meta-contigs are merged. The reversed meta-contig $M_j^R$ is obtained from $M_j$ by reversing all of its assembled contig PRM spectra and their relative alignments. Given an alignment shift $A_{S_a,S_b}$, its reversed alignment shift $A^R_{S_a^R,S_b^R}$ is equal to $PM[S_a] - A - PM[S_b]$. The final step in reversing $M_j$ is to update the reverse state of the alignment edges connected to it. For all alignment edges $A^k_{M_j,M_k}$ connecting $M_j$ to other meta-contigs, $A^k$ is also reversed and $R[A^k] \leftarrow \neg R[A^k]$ to indicate whether $M_k$ also needs to be reversed if it is to be merged with $M_i$ and $M_j$ in a subsequent iteration (only $M_j$ is reversed in this iteration).

In step 3, $M_i'$ is created as the union of $M_i$ and $M_j$ and the meta-contig PRM spectrum of $M_i'$ is determined. $A^*$ is used as the shift to connect contig PRM spectra in $M_i$ to contig PRM spectra in $M_j$. So after $M_i' \leftarrow M_i \cup M_j$, every contig PRM spectrum $S_x \in M_i$ is connected to every contig PRM spectrum $S_y \in M_j$ by the transitive shift $A_{S_x,S_y} = A_{S_x,S_i} + A^* + A_{S_j,S_y}$, where $S_i$ and $S_j$ were the first contig PRM spectra in $M_i$ and $M_j$, respectively.
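The reversal rule in step 2 can be checked numerically together with the score from Table 2.1. The following sketch is a minimal illustration, not the Meta-SPS implementation: spectra are plain lists of (mass, score) pairs, the aligned range of each spectrum is assumed to be [0, PM], and the toy spectra, function names, and tolerance are hypothetical.

```python
def reverse(spectrum, parent_mass):
    """Reverse a PRM spectrum: every mass p becomes PM - p."""
    return sorted((parent_mass - mass, score) for mass, score in spectrum)


def alignment_score(s_i, pm_i, s_j, pm_j, shift, tol=0.05):
    """min(MI_i / OI_i, MI_j / OI_j) * MP for S_j shifted by `shift` wrt S_i."""
    shifted_j = [(mass + shift, score) for mass, score in s_j]
    pairs = [(a, b) for a in s_i for b in shifted_j if abs(a[0] - b[0]) <= tol]
    if not pairs:
        return 0.0
    # Overlap of the two spectrum ranges [0, PM_i] and [shift, shift + PM_j].
    lo, hi = max(0.0, shift), min(pm_i, shift + pm_j)

    def in_overlap(mass):
        return lo - tol <= mass <= hi + tol

    mi_i = sum(score for _, score in {a for a, _ in pairs})
    mi_j = sum(score for _, score in {b for _, b in pairs})
    oi_i = sum(score for mass, score in s_i if in_overlap(mass))
    oi_j = sum(score for mass, score in shifted_j if in_overlap(mass))
    return min(mi_i / oi_i, mi_j / oi_j) * len(pairs)


# Toy contig PRM spectra as (mass, score) pairs, with their parent masses.
s_i, pm_i = [(100.0, 1.0), (250.0, 2.0), (400.0, 2.0), (550.0, 1.0)], 700.0
s_j, pm_j = [(50.0, 1.0), (200.0, 2.0), (350.0, 1.0), (500.0, 4.0)], 600.0

shift = 200.0
forward = alignment_score(s_i, pm_i, s_j, pm_j, shift)

# Reversed shift from the text: A^R = PM[S_a] - A - PM[S_b].
reversed_shift = pm_i - shift - pm_j
backward = alignment_score(
    reverse(s_i, pm_i), pm_i, reverse(s_j, pm_j), pm_j, reversed_shift
)
```

For these toy spectra three peaks match in both orientations and `forward` equals `backward`, which is what the reversed-shift formula is meant to guarantee: reversing both spectra mirrors the alignment without changing which peaks match.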
Because only one shift is used to connect contig PRM spectra in $M_i$ and $M_j$, all assembled alignments between spectra in $M_i'$ are guaranteed to be consistent because $M_i$ and $M_j$ are internally consistent. The contig PRM spectra and their spectral alignments in $M_i'$ are then used as input to SPS to determine its meta-contig PRM spectrum as in [2].
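The consistency guarantee can be seen in a deliberately simplified sketch of the recruit-and-merge loop. This is not the Meta-Assembly implementation: it ignores reversal, re-sequencing, and re-scoring, and it uses a union-find structure with stored offsets, which is a swapped-in technique chosen for brevity. The function name, the edge tuples, and the tolerance are hypothetical.

```python
def greedy_merge(n_contigs, edges, tau, tol=0.5):
    """Merge contigs along the highest scoring edges first.

    edges: iterable of (i, j, shift, score), meaning contig j is placed
    `shift` Da after contig i. Returns the layout of every meta-contig
    (root -> {contig: position}) and the list of rejected edges.
    """
    parent = list(range(n_contigs))
    offset = [0.0] * n_contigs  # position of a contig relative to its parent

    def find(x):
        """Return the root of x and the position of x relative to it."""
        pos = 0.0
        while parent[x] != x:
            pos += offset[x]
            x = parent[x]
        return x, pos

    rejected = []
    for i, j, shift, score in sorted(edges, key=lambda e: -e[3]):
        if score < tau:
            break  # every remaining edge scores below the threshold
        root_i, pos_i = find(i)
        root_j, pos_j = find(j)
        if root_i != root_j:
            # A single shift joins the two groups, so all positions inside
            # the merged group remain mutually consistent.
            parent[root_j] = root_i
            offset[root_j] = pos_i + shift - pos_j
        elif abs((pos_j - pos_i) - shift) > tol:
            rejected.append((i, j))  # contradicts a higher scoring path

    layout = {}
    for contig in range(n_contigs):
        root, pos = find(contig)
        layout.setdefault(root, {})[contig] = pos
    return layout, rejected


edges = [
    (0, 1, 100.0, 5.0),
    (1, 2, 50.0, 4.0),
    (0, 2, 170.0, 3.5),  # inconsistent with 100 + 50
    (2, 3, 30.0, 2.0),   # below the threshold
]
layout, rejected = greedy_merge(4, edges, tau=2.8)
```

In this toy graph contigs 0, 1, and 2 end up in one group at positions 0, 100, and 150 Da, the lower scoring edge (0, 2) is rejected as inconsistent, and contig 3 stays alone because its only edge scores below the threshold.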

In step 4, alignment edges connected to $M_i$ and $M_j$ are re-scored and moved to $M_i'$. For every $M_k$ connected to $M_i$ through some alignment edge $A^1_{M_i,M_k}$, $A^1_{M_i',M_k} \leftarrow A^1$ is used to connect $M_i'$ to $M_k$. If $M_k$ is connected to $M_j$ through some alignment edge $A^2_{M_j,M_k}$, $A^2_{M_i',M_k} \leftarrow A^* + A^2$ is used to connect $M_i'$ to $M_k$. If $M_k$ is connected to both $M_i$ and $M_j$ through $A^1$ and $A^2$, $A^1$ is used if $\mathrm{score}(A^1) > \mathrm{score}(A^2)$ and $A^2$ is used otherwise. After all edges are transferred from $M_i$ and $M_j$ to $M_i'$, $M_i$ and $M_j$ are removed from the graph. Then the scores of all edges connected to $M_i'$ are updated for recruitment in step 1 of the next iteration.

Figure 2.1D illustrates how this approach aggregates contigs connected by high scoring alignments before considering contigs with less reliable alignments. An important benefit of this property is that meta-contig sequences are reliably extended and updated (by merging high scoring alignment edges first) before they are used to re-score less reliable alignments. An alternative approach to further capitalize on this property by discovering new alignments between updated meta-contigs could be to add a step between Re-score and Recruit that re-aligns $M_i'$ to every other meta-contig in the overlap graph. This was attempted, but it significantly increased the running time of the implementation without yielding longer meta-contig sequences. After iterative merging of meta-contigs, only meta-contigs that assemble at least two contig PRM spectra are reported. Also, contigs and meta-contigs were required to yield an amino acid sub-sequence of at least five consecutive residues.

2.3 Results

The performance of Meta-SPS and SPS was assessed in reference to target protein sequences and compared to determine the effectiveness of these additions to the SPS workflow. Two separate procedures were used to evaluate the performance of SPS and Meta-SPS, which was mainly measured in terms of de novo sequencing length,

coverage, and accuracy. First, PRM spectra identified by MS-GFDB at 1% spectrum-level FDR were used to annotate contig PRM spectra (described in Figure 2.2) and determine de novo sequencing accuracy. If a contig assembled at least one identified PRM spectrum, the contig itself was labeled identified. Peptide IDs were then mapped to their corresponding protein IDs and used to annotate peaks in identified PRM spectra as PRMs or SRMs. Mass differences between consecutive peaks in contig PRM spectra (i.e. sequence calls or gaps) were labeled using peaks from the annotated PRM spectra they assembled (Figure 2.2). A contig sequence call was labeled annotated if its flanking peaks each assemble a mass from the same identified PRM spectrum. An annotated sequence call was labeled correct if its flanking peaks assemble spectrum masses in the same ion series in the same identified spectrum (i.e. both are identified PRMs or both are identified SRMs) on the same protein. Annotated sequence calls not labeled correct were labeled incorrect. Because meta-contigs assemble contigs, every peak in a meta-contig PRM spectrum also assembles a set of PRM masses. Therefore, meta-contigs are annotated in the same manner as contigs.

The graph displayed in Figure 2.3A demonstrates that sequencing errors are localized toward the ends of sequences and are not distributed randomly. This occurs because more PRM spectra often overlap toward the middle of contig and meta-contig sequences, which gives a stronger consensus sequence. Given that sequence calls at the first or last residue of every meta-contig sequence were 20% less accurate than sequence calls two or more positions in from both ends, we truncated every meta-contig and contig sequence by one sequence call from each end. This post-processing step increased sequencing accuracy by roughly 2% over all contig and meta-contig sequences at a limited loss in sequencing coverage.
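The labeling rules and the end-truncation step can be sketched as follows. This is a minimal illustration under stated assumptions: each contig peak is represented as a dictionary from identified-spectrum ID to the ion series of the assembled mass, the same-protein condition is assumed to hold whenever the spectrum is the same, and all names and the toy data are hypothetical.

```python
def label_call(left_peak, right_peak):
    """Label the sequence call between two consecutive contig peaks.

    Each peak maps identified-spectrum ID -> "PRM" or "SRM", i.e. the ion
    series of the mass that the spectrum contributed to the peak.
    """
    shared = left_peak.keys() & right_peak.keys()
    if not shared:
        return "un-annotated"
    if any(left_peak[s] == right_peak[s] for s in shared):
        return "correct"
    return "incorrect"


def accuracy(peaks, trim=0):
    """Fraction of annotated calls that are correct, after dropping
    `trim` sequence calls from each end of the sequence."""
    labels = [label_call(a, b) for a, b in zip(peaks, peaks[1:])]
    kept = labels[trim:len(labels) - trim]
    correct = kept.count("correct")
    incorrect = kept.count("incorrect")
    if correct + incorrect == 0:
        return None
    return correct / (correct + incorrect)


# Toy contig with five peaks, hence four sequence calls.
peaks = [
    {"s1": "SRM"},
    {"s1": "PRM", "s2": "PRM"},
    {"s1": "PRM", "s2": "PRM"},
    {"s2": "PRM"},
    {"s3": "PRM"},
]
labels = [label_call(a, b) for a, b in zip(peaks, peaks[1:])]
untrimmed = accuracy(peaks, trim=0)
trimmed = accuracy(peaks, trim=1)
```

In this toy contig the only incorrect call sits at the first position, so trimming one call from each end raises the measured accuracy, which is the effect the truncation step relies on.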
Meta-contigs were then 94% accurate (1 error per 18 AA) over all 6-prot proteins and 97% (1 error per 35 AA) accurate over the abtla antibody (Figure 2.3B) whereas SPS contigs were 88% accurate (1 error per

Figure 2.2. Annotation of contigs and meta-contigs with MS-GFDB spectrum identifications. The annotation of an SPS contig is shown here but the same procedure applies for meta-contigs. Above the contig PRM spectrum are all sequence calls that align to the reference. Below the contig PRM spectrum are all spectra from overlapping peptides that were assembled to yield the contig PRM spectrum. Only assembled peaks are shown in each assembled PRM spectrum. For a sequence call to be labeled correct, it must be flanked by at least one pair of annotated PRM or SRM peaks in the same ion series that map to the same protein. If a sequence call that is not labeled correct is flanked by at least one pair of peaks from an identified spectrum then it is labeled incorrect. If a sequence call is not flanked by at least one pair of peaks from an identified spectrum then it is labeled un-annotated.

8 AA) over all 6-prot proteins and 96% accurate (1 error per 25 AA) over the aBTLA antibody (supplemental materials). MS-GFDB IDs could also have been used to evaluate sequencing coverage and length, but because less than 45% of spectra assembled into contigs and meta-contigs were identified in both data sets, such an approach would ignore many contigs that assemble unidentified spectra. Thus, contig and meta-contig PRM spectra were also directly mapped to reference proteins to evaluate de novo sequencing coverage and length. Contig spectra were aligned to protein sequences using an algorithm similar to MS-Alignment [81, 110]. The protein sequences were first converted to perfect, unmodified PRM spectra, which were then aligned (as in [2]) to contig and meta-contig PRM spectra, requiring at least seven matching peaks. Alignments of contig PRM spectra were allowed one modification to capture PTMs, and alignments of meta-contig PRM spectra were allowed at most two modifications because of their increased length. A contig or meta-contig that was aligned to a reference protein in this manner is termed mapped. Roughly 50% more SPS contigs were mapped than were identified over both data sets, which is expected as many contigs assemble low-quality MS/MS spectra that are often left unidentified at 1% FDR. Only about 10% more meta-contigs were mapped than were identified, which is also expected as five times more spectra were assembled per meta-contig than per SPS contig. To evaluate the accuracy of the alignment mappings, the mapped residue locations of aligned contig and meta-contig PRM peaks were compared with those of assembled annotated peaks in MS-GFDB identified spectra. Over all aligned contig and meta-contig PRM peaks that assembled at least one mass from an identified spectrum, greater than 95% were aligned to the same residue as at least one of their assembled masses.
593 of 666 (89%) 6-prot SPS contigs were mapped to target or contaminant proteins (482 mapped to target proteins) whereas for 6-prot, all 68 meta-contigs were mapped (64 mapped to target proteins). Similarly, 290 of 329 (88%) abtla SPS contigs were

Figure 2.3. De novo sequencing length, coverage, and accuracy. A) The x axis plots the minimum distance (κ) a sequence call or gap is from one end of a meta-contig sequence and the y axis plots the average sequencing accuracy over all annotated calls at each κ-distance. Over all annotated calls reported more than 8 positions from their closest end, there were a total of 3 incorrect sequence calls at κ = 20, 21, 22 of a single meta-contig aligned to the aBTLA heavy chain (discussed in the Results section of Supplementary Materials). B) Protein identifiers are: P1 - leptin precursor, P2 - kallikrein-related peptidase, P3 - GroEL, P4 - myoglobin, P5 - aprotinin, P6 - peroxidase, P7 - aBTLA light chain, and P8 - aBTLA heavy chain. Protein Length is the length of each reference protein in amino acid residues. Spectrum Coverage is the percent of each protein covered by peptides identified by MS-GFDB at 1% FDR. Coverage is taken over all mapped contigs and Accuracy is taken over all identified meta-contigs. Mapped meta-contigs must be aligned to a reference protein as described in the text whereas identified meta-contigs must assemble at least one identified spectrum whose peptide sequence is a substring of a reference protein. Sequencing Coverage is the percent of amino acids in each protein covered by at least one mapped meta-contig sequence. Coverage Redundancy is the average number of mapped meta-contig sequences covering each amino acid residue that is covered by at least one meta-contig sequence. Spectra Per Meta-contig is the average number of spectra assembled by each mapped meta-contig whereas Peptides Per Meta-contig is the average number of peptides (spectra with distinct parent masses) assembled by each mapped meta-contig. Average Seq. Length is the average number of amino acid residues covered by each mapped meta-contig and Longest Sequence is the maximum number of amino acid residues covered by a mapped meta-contig.
Correct Sequence Calls is the percentage of annotated sequence calls that were correct in identified meta-contigs. Un-annotated Seq. Calls is the percentage of sequence calls that were un-annotated in identified meta-contigs.

mapped (192 mapped to the antibody sequence) whereas all 43 aBTLA meta-contigs were mapped (27 mapped to the antibody sequence).

Figures 2.4A and 2.4B illustrate the resulting meta-contig coverage for kallikrein-related peptidase and aBTLA light chain, respectively. Supplemental materials illustrate meta-contig coverage for the remaining 6-prot proteins as well as the aBTLA heavy chain. The largest meta-contig in Figure 2.4A (colored red) corresponds to a 91 AA meta-contig sequence covering more than one third of the protein. The yellow meta-contig in Figure 2.4A appears to have sufficient overlap with the neighboring blue and purple meta-contigs to combine them, but the ends of the three meta-contig sequences contained too many gaps (seven missing PRMs) and incorrect sequence calls (two incorrect PRMs) to meet the current acceptance threshold of sharing six or more matching peaks (supplemental materials). Such gaps and errors stem from incomplete MS/MS peptide fragmentation. In the discussion section we describe foreseeable data acquisition and algorithmic adjustments that could generate data with higher sequence content and/or enable reducing the acceptance threshold without diminishing sequencing accuracy. In Figure 2.4B, the largest meta-contig (colored orange) corresponds to a 106 AA meta-contig sequence covering more than one half of the target protein. See Figure 2.3B for meta-contig coverage statistics on all proteins and see supplemental Materials for SPS contig coverage statistics in the same format. De novo sequencing gave 83% of MS-GFDB coverage between both data sets, and we observe much higher sequence coverage of the purified aBTLA antibody (89%) compared with 6-prot proteins (42-83%).
Because the heavy and light chains of the abtla antibody were purified prior to MS/MS analysis, higher abtla sequencing coverage is expected as more spectra from distinct peptides were identified by MS-GFDB per target protein in the abtla sample compared with the 6-prot sample (Figure 2.3B). This is not an algorithmic limitation of our approach, but rather limitations of subcellular protein processing and MS/MS data

acquisition. The lack of coverage of certain regions of the kallikrein-related peptidase in Figure 2.4A is expected. The commercially obtained protein used in these studies was purified from human seminal fluid. Thus, it can be expected to lack the N-terminal region 1-24 because of prior cleavage of the signal peptide, residues 1-17, and activation by cleavage of the propeptide, residues 18-24. Furthermore, N-linked glycosylation is known to occur at residue 69. The subsequent sugar micro-heterogeneity at that position should render any individual proteolytically generated peptide containing that residue much less concentrated in the digestion mixture and, if subjected to MS/MS, much less likely to yield interpretable fragmentation.

SPS contig alignments were also used to train the minimum spectrum alignment score τ to impose in Spectral Alignment and Meta-Assembly. τ was trained such that at least 97% of transitive alignments (alignments induced by two or more pair-wise alignments) between mapped contig PRM spectra in the same meta-contig were correct (a correct alignment is one whose observed shift matches the theoretical shift within the mass of a PTM). Over both data sets, 91% of all correct alignments were retained between pairs of mapped contig PRM spectra with at least 6 matching peaks. τ was trained to be 2.8 for 6-prot data and 3.0 for abtla, and can be estimated for any data set using a subset of identified spectra. After alignments with scores less than τ were removed (just prior to Meta-Assembly), 99% of all pair-wise alignments between mapped contig PRM spectra with at least 6 matching peaks were reported at 90% accuracy. But if transitive alignments were also considered, only 23% of alignments were correct because, among the incorrect alignments reported by Spectral Alignment, those between components of multiple aligned contigs induced many more incorrect transitive alignments.
The iterative merging procedure of Meta-Assembly was effective at discarding such incorrect alignments as 97% of transitive alignments ultimately reported were correct (supplemental materials). The efficiency of Meta-SPS merging of SPS contigs is indicated by the decrease

Figure 2.4. Mapped Meta-contigs. Meta-contig PRM spectra were aligned to reference proteins to evaluate de novo sequencing coverage. Every colored row corresponds to a contig PRM spectrum as separately mapped to the target protein sequence (information not used by Meta-SPS). Every set of overlapping contigs of the same color corresponds to a meta-contig; sets of contigs of the same color with no overlap indicate separate meta-contigs. Below each coverage map is the longest meta-contig sequence of the boxed meta-contig for the corresponding protein. Purple gaps correspond to mapped sequence calls with PTMs verified by MS-GFDB; blue gaps correspond to mapped gaps that span 2 or more residues in the reference. Remaining un-colored residues represent sequence calls that map to reference amino acid masses. A) Meta-contig coverage of kallikrein-related peptidase from the 6-prot sample is displayed here; 8 meta-contigs covered 78% of the 261 AA protein with the longest sequence spanning 94 AA. B) Meta-contig coverage of the abtla light chain is displayed here; 9 meta-contigs covered 87% of the 219 AA protein with the longest sequence spanning 107 AA.


in coverage redundancy from 3.7 to 1.1 (Figure 2.3B) as contigs covering the same regions were aggregated into meta-contigs. But because meta-contigs must assemble at least two contigs, meta-contigs do not cover regions missed by SPS contigs (i.e., coverage can only decrease from contigs to meta-contigs). Meta-contigs covered roughly 10% less of the 6-prot proteins than SPS contigs and 2% less of the abtla antibody. Thus, we generally observed a drop in coverage as a trade-off for Meta-SPS's higher sequencing accuracy. Coverage can be recovered by using leftover SPS contigs that were not merged by Meta-SPS, although lower sequencing accuracy is to be expected for certain applications. Meta-SPS also had the effect of doubling the average length of SPS contig sequences (to 20 AA in 6-prot meta-contigs and 25 AA in abtla meta-contigs) and tripling their maximum length (to 91 AA over 6-prot meta-contigs and 106 AA over abtla meta-contigs). Furthermore, the longest meta-contigs yielded the highest sequencing accuracy, as the 91 AA and 106 AA de novo sequences displayed in Figures 2.4A and 2.4B, respectively, were 100% annotated and correct. Although one peak in the 91 AA sequence incorrectly assembled masses mapping to different residues, this error was not reflected in the final sequence because the majority of the peak's assembled masses mapped to the correct residue. The running time of Spectral Alignment and Meta-Assembly was found to be minor (< 9 min for the 6-prot data set) in comparison to that of SPS, which requires an all-to-all alignment of PRM spectra (see supplemental materials for a more detailed description). All SPS contigs, meta-contigs, input MS/MS spectra, identified spectra, and annotated de novo sequences associated with this paper may be downloaded from massive.ucsd.edu at ftp://msv :a@massive.ucsd.edu/.
This link also contains de novo sequencing reports that visualize how MS/MS spectra from each data set were used to generate de novo protein sequences. A subset of these reports detailing all 6-prot

meta-contigs can also be found directly at 6-prot meta-contigs/index.html. In supplemental materials we provide a description of how to interpret these reports in relation to algorithmic steps outlined in Figure 2.1. Although the 6-prot sample contained a mixture of proteins, applications of de novo protein sequencing are often targeted toward specific proteins within a larger mixture. To test how Meta-SPS performance might be impacted by such samples, we combined the 6-prot CID MS/MS spectra with the high resolution CID MS/MS spectra from the abtla sample and executed the algorithmic steps outlined in Figure 2.1A on the combined set of MS/MS spectra. Here, the proteins of interest were the heavy and light chain of the abtla antibody and the background mixture was represented by the 6-prot data. Because the high resolution CID spectra from the abtla and 6-prot samples were acquired on a similar model of instrument (LTQ Orbitrap XL and LTQ Orbitrap, respectively), the low-resolution abtla spectra were excluded from this experiment to better simulate high resolution data acquisition of an abtla/6-prot mixture sample. Although this does not rigorously simulate the loss in MS/MS coverage one might expect from such a mixture (because of incomplete peptide sampling by the instrument), it is still a fair approximation of the algorithmic challenges associated with sequencing a small subset of proteins within a background of higher complexity. In practice, one would simply extend the LC gradient time or collect the data on a faster scanning instrument in order to maintain adequate peptide sampling.
Compared with sequencing results on the abtla high resolution spectra, Meta-SPS produced comparable sequencing accuracy (98.1% compared with 98.6%) and average length (18 AA compared with 17 AA) for the abtla antibody from the combined abtla/6-prot set of MS/MS spectra, at the cost of reduced sequencing coverage (58% compared with 71%) and shorter maximum sequence length (35 AA compared with 45 AA). Compared with SPS, Meta-SPS generated de novo sequences 100% longer on average from the combined set with 2x as many correct

sequence calls per incorrect sequence call. We note that in a real MS experiment mixing the 6-prot and abtla samples, the absence of a faster spectral acquisition rate and/or extended peptide separation time could diminish protein sequence coverage by MS/MS spectra and thus further limit the overall sequencing length and coverage.

2.4 Discussion

Shotgun protein sequencing with meta-contig assembly is a modification-tolerant method for de novo protein sequence reconstruction. We demonstrate that extensive and accurate protein sequencing can be achieved without the use of a database, meaning more can be gained from experimental MS/MS data before mapping to a reference database. Compared with any other automated approach, our method provides the longest and most accurate de novo sequences without requiring any sequence homology steps. Furthermore, we demonstrate that de novo sequences which extend beyond 90 amino acids can be assembled with 100% accuracy. In the shorter sequences we report sequencing errors that are not distributed randomly, but located overwhelmingly toward the ends of sequences (Figure 2.3A). Meta-SPS offers an effective improvement to Shotgun Protein Sequencing by doubling the average length of SPS de novo sequences, tripling their maximum sequence length, reducing sequence coverage redundancy 4x, and increasing sequencing accuracy 4-5%. There was only one protein, myoglobin, whose meta-contig sequences were less accurate (by 3%) than its SPS contig sequences (supplementary materials). In this case there were no sequencing errors introduced in myoglobin's meta-contig sequences that were not already present in its SPS contig sequences. Rather, there was little overlap between incorrect and correct SPS sequence calls. When incorrect SPS contig sequence calls overlap with multiple correct contig sequences at multiple positions in the protein sequence, Meta-SPS can repair the incorrect sequence calls in

meta-contigs at those positions if the correct calls are the consensus. But if such overlaps occur with limited frequency, as in the case of myoglobin, the reduced percentage of correct sequence calls (because of SPS contig redundancy) is greater than the reduced percentage of incorrect sequence calls in meta-contig sequences. This has the effect of lowering the observed percentage of correct calls from contig to meta-contig sequences. Although Meta-SPS fell short of fully reconstructing a protein sequence in either data set, it assembled de novo sequences up to 91 AA long for a protein mixture and 106 AA long for a purified antibody, which are the longest confirmed de novo sequences ever obtained from the automated analysis of unidentified MS/MS spectra. Furthermore, 11 sequences from 6-prot and 6 sequences from abtla were extended beyond 40 AA. Sequencing accuracy was 95% for abtla and 6-prot samples, whereas the 91 AA and 106 AA sequences were 100% accurate. If we remove the first and last two residues or gaps of every sequence (where there was weaker consensus on average), sequencing accuracy improves to 96% over 6-prot proteins and 98% over the abtla antibody. Increased accuracy and reduced coverage redundancy of meta-contigs compared with SPS contigs were achieved at the cost of reduced meta-contig coverage (10% less coverage of 6-prot proteins and 2.5% less coverage of the abtla antibody). Full reconstruction of the protein sequence encoded by the genome is subject to limitations of subcellular protein processing and post-translational modification. When a protein is purified from its biological source it can be expected to have N- and C-terminal signal peptides and pre-pro activation sequences already cleaved off [63].
Although these can be predicted from a gene sequence, when a protein isolated from an organism with an un-sequenced genome is sequenced by the process described here, one would not be certain of having obtained the protein termini, unless they were chemically labelled prior to digestion [104]. Furthermore, in higher organisms N-linked glycosylation can occur at NX(S/T) motifs, particularly for secreted and extracellular membrane proteins [62].
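To make the NX(S/T) motif concrete, the following minimal sketch scans a sequence for candidate N-linked glycosylation sites. The helper name `find_sequons` and the toy sequence are hypothetical, and the exclusion of proline at the X position is a commonly cited refinement that is assumed here rather than stated in the text.

```python
import re
from typing import List

# N-X-(S/T) motif; excluding Pro at position X is an assumption (see lead-in).
# A zero-width lookahead is used so that overlapping motifs are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence: str) -> List[int]:
    """Return 1-based positions of Asn residues that begin an N-X-(S/T) motif."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence.upper())]


# Toy sequence (hypothetical; not one of the proteins analyzed in this chapter).
toy = "MKNGSAANPTLLNNTSQ"
print(find_sequons(toy))
```

Peptides spanning any reported position would be the ones expected to be under-represented unless the sample is de-glycosylated first.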

Unless the protein is de-glycosylated with an enzyme like PNGase-F prior to proteolytic digestion, the sugar micro-heterogeneity at those sites should render any individual proteolytically generated peptides containing the Asn residue from the motif much less concentrated in the digestion mixture and, if subjected to MS/MS, much less likely to yield interpretable fragmentation. As with SPS sequencing, Meta-SPS also faces limitations related to proteomics mass spectrometry, such as incomplete enzyme digestion, peptide sampling bias, and ambiguous amino acid masses [72]. Cleaved peptides are not equally sampled by MS/MS instrumentation (because of differences in, e.g., hydrophobicity, ionizability, and location of basic residues), leading to biased peptide coverage of target proteins. Furthermore, certain combinations of amino acids have identical masses and may lead to ambiguity in the final sequences (Ile = Leu = 113, GG = Asn = 114, and GA = Gln = 128). Because we require large sets of MS/MS spectra from overlapping peptides covering an entire protein to generate long sequences, we also face limitations when analyzing complex mixtures of proteins and proteins with related sequences. Our method is currently optimized for small mixtures of unrelated proteins or purified proteins, as we observe that coverage and sequence length degrade as fewer quality spectra are acquired per protein in the sample (Figure 2.3B). Nonetheless, even in the background of the 6-prot MS/MS spectra, Meta-SPS still improved upon SPS sequencing accuracy (from 97% to 98%), average sequence length (from 11 AA to 20 AA), and maximum sequence length (from 25 AA to 35 AA) for the abtla antibody. Analyzing more complex mixtures with greater effectiveness may require faster spectral acquisition rates or extended peptide separations to generate enough spectra to cover all proteins in a sample.
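The mass ambiguities noted above (Ile = Leu, GG = Asn, GA = Gln) can be checked numerically. The sketch below uses standard monoisotopic residue masses; the helper names `mass` and `indistinguishable` are introduced here for illustration only, and the 0.05 Da default mirrors the fixed fragment tolerance mentioned in this chapter.

```python
# Standard monoisotopic amino acid residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146,
    "A": 71.03711,
    "N": 114.04293,
    "Q": 128.05858,
    "L": 113.08406,
    "I": 113.08406,
}


def mass(residues: str) -> float:
    """Sum of residue masses for a short string of one-letter codes."""
    return sum(RESIDUE_MASS[r] for r in residues)


def indistinguishable(a: str, b: str, tol_da: float = 0.05) -> bool:
    """True if two residue combinations differ by no more than tol_da."""
    return abs(mass(a) - mass(b)) <= tol_da


for a, b in [("I", "L"), ("GG", "N"), ("GA", "Q")]:
    print(f"{a:>2} vs {b}: {mass(a):.5f} vs {mass(b):.5f} Da, "
          f"ambiguous = {indistinguishable(a, b)}")
```

These pairs share the same elemental composition, so tightening the tolerance does not separate them; only additional fragmentation evidence (for example, a peak between the two glycines) can.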
To enable assembling longer meta-contigs and achieve higher protein coverage a few adjustments to both data acquisition and algorithmic strategies are currently foreseeable. Compared with the use of CID and/or HCD fragmentation, it has been shown

that electron transfer dissociation (ETD) can yield more interpretable MS/MS spectra from more unique peptides and greatly increase the number of interpretable spectra from longer peptides (with precursor charge 3+ or higher) [34, 44]. The high resolution CID and HCD spectra described were collected in separate LC-MS/MS runs on a first generation LTQ Orbitrap that is not equipped with ETD. However, the duty cycle on newer LTQ Velos Orbitrap instruments is more than twice as fast. Thus, in nearly equivalent chromatographic run time, the newer instruments can subject each precursor ion to CID, HCD, and ETD fragmentation in 3 consecutive high resolution MS/MS spectra, providing information that is not only overlapping and complementary but can also be directly attributed to the same peptide sequence. To support the combined processing of ETD, CID, and HCD spectra, spectral alignment steps in SPS will have to support alignments between b/c and y/z-type ions all the way from detection of pairs of spectra from overlapping peptides, through assembly of pairwise alignments into multiple alignments, and finally during the consensus interpretation of assembled ABruijn contigs. Furthermore, high resolution MS/MS spectra allow for more accurate determination of true amino acid mass differences between MS/MS peaks and help distinguish those from incorrect amino acid predictions in de novo sequencing applications [15]. Although most results described here were achieved with high resolution MS/MS spectra acquired with ±15 ppm fragment mass accuracy, a fixed 0.05 Da fragment tolerance was imposed as SPS was not originally designed to support ppm tolerance.
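To illustrate how a relative ±15 ppm window compares with the fixed 0.05 Da window, the sketch below converts ppm to Da at a few fragment masses. The function names `ppm_to_da` and `peaks_match` and the chosen example masses are assumptions for illustration; this is not how SPS or Meta-SPS implement peak matching.

```python
def ppm_to_da(mass_da: float, ppm: float) -> float:
    """Absolute tolerance in Da corresponding to a relative tolerance in ppm."""
    return mass_da * ppm * 1e-6


def peaks_match(m1: float, m2: float, ppm: float = 15.0) -> bool:
    """True if two fragment masses agree within a ppm window.

    The window is scaled to the mean of the two masses (an assumed convention).
    """
    return abs(m1 - m2) <= ppm_to_da((m1 + m2) / 2.0, ppm)


FIXED_TOL_DA = 0.05  # fixed fragment tolerance stated in the text

# Example masses are arbitrary illustration points.
for m in (100.0, 1000.0, 4000.0):
    print(f"mass {m:7.1f} Da: 15 ppm = {ppm_to_da(m, 15.0):.4f} Da "
          f"(fixed window = {FIXED_TOL_DA} Da)")
```

Under these assumptions the ppm window is more than thirty times tighter than the fixed window at mass 100 Da and only exceeds it near 4000 Da, which is why ppm matching is expected to reject more random alignments of low-mass noise peaks.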
The 15 ppm is equivalent to 0.0015 Da at mass 100 and 0.06 Da at mass 4000. Implementing ppm tolerance in the Meta-SPS pipeline will impose much tighter tolerances in the mid-low mass range ( Da) and allow alignments of N- and C-terminal fragment peaks in MS/MS spectra from overlapping peptides to be more reliably separated from random alignments of noise peaks. This should also improve the separation of correct and incorrect contig/contig alignments in the Meta-Assembly step. In particular, Meta-SPS currently requires six or more matching peaks to confidently align PRM spectra of overlapping peptides, but implementation of ppm tolerance could enable decreasing this threshold without diminishing sequencing accuracy. The six matching peak requirement further translates into a five consecutive amino acid minimum overlap requirement in pair-wise peptide alignments. However, the proteolytic enzymes currently in common usage have overlapping specificities. For example, trypsin cleaves at the C-terminal side of Lys and Arg, whereas Lys-C cleaves only at Lys and Arg-C cleaves only at Arg. Thus, in a combined data set, peptide triplets often result where two shorter peptides are present that, when concatenated, are the equivalent of a longer peptide that is also present, but our current algorithmic approach makes only pairwise comparisons. We therefore expect to better capitalize on the enzyme specificity by introducing a step that attempts to concatenate the PRM spectra of 2 smaller peptides prior to comparison to the PRM spectrum of a larger peptide when the sum of the 2 precursor masses matches the larger one after adjusting for precursor charge and the mass difference due to terminal groups added upon peptide bond cleavage. Consequently, we foresee that these data acquisition and algorithmic strategy improvements will most likely yield longer, more accurate meta-contig sequences and higher protein coverage.

2.5 Acknowledgements

Chapter 2, in full, is a reprint of the material as it appears in Molecular and Cellular Proteomics: Shotgun protein sequencing with meta-contig assembly. Mol Cell Proteomics Oct;11(10): [40]

Chapter 3 Sequencing-grade de novo analysis of MS/MS triplets (CID/HCD/ETD) from overlapping peptides

Full-length de novo sequencing of unknown proteins remains a challenging open problem. Traditional methods that sequence spectra individually are limited by short peptide length, incomplete peptide fragmentation, and ambiguous de novo interpretations. We address these issues by determining consensus sequences for assembled tandem mass (MS/MS) spectra from overlapping peptides (e.g., by using multiple enzymatic digests). We have combined electron-transfer dissociation (ETD) with collision-induced dissociation (CID) and higher-energy collisional dissociation (HCD) fragmentation methods to boost interpretation of long, highly charged peptides and take advantage of corroborating b/y/c/z ions in CID/HCD/ETD. Using these strategies, we show that triplet CID/HCD/ETD MS/MS spectra from overlapping peptides yield de novo sequences of average length 70 AA and as long as 200 AA at up to 99% sequencing accuracy.

3.1 Introduction

In most proteomics studies, proteins are identified by digesting sample proteins into peptides (with an enzyme such as trypsin), generating a tandem mass (MS/MS)

spectrum for each peptide precursor, and identifying the peptide sequence of each MS/MS spectrum with a database search tool, such as SEQUEST [26], Mascot [80], MS-GFDB [56], or Spectrum Mill (Agilent Technologies). Protein IDs are then inferred from unique peptide sequence identifications. The utility of protein identification by database search depends upon the existence of a reference database that contains all peptides of interest. But due to mechanisms of sequence variation (such as genetic recombination and somatic hyper-mutation in monoclonal antibodies [23]) and the existence of unsequenced genomes, many protein sequences remain unknown. Nevertheless, the characterization of monoclonal antibodies and venoms from unsequenced species remains a key step in many therapeutic drug development pipelines [71, 45, 64]. Historically, only a few low-throughput strategies have been available for de novo protein sequencing. As far back as 1987, Johnson and Biemann manually sequenced a complete protein from rabbit bone marrow using mass spectrometry [51]. Edman degradation is another established approach for sequencing novel proteins, but it has experimental bottlenecks that make it unsuitable for sequencing mixtures of proteins, proteins longer than 50 amino acids (AA), or post-translationally modified proteins [108, 113]. Consequently, many current applications of de novo sequencing still rely upon manual curation of MS/MS spectra and/or Edman degradation [10, 74]. Fully automated de novo strategies that interpret MS/MS spectra individually have been less successful compared to database search, in part because they are limited by ambiguous interpretations of MS/MS fragmentation [32]. Even if both approaches use the same function for scoring peptide-spectrum matches (PSMs), the top scoring peptide in the database for a given MS/MS spectrum may be the second or th highest scoring peptide over all possible de novo peptides, even if it is correct.
Thus, de novo peptide sequencing algorithms typically report a ranked list of candidate PSMs for each spectrum where top-scoring PSMs have an accuracy of 80-90% for low-resolution CID

spectra [31, 68] and 90-92% for high-resolution CID spectra [32] (whereas database search results can typically be validated with 1% false discovery rate, FDR [76]). To yield these levels of accuracy, de novo tools face a significant trade-off between sequencing accuracy and protein sequence coverage, as spectra exhibiting complete peptide fragmentation rarely cover entire proteins yet are required to reconstruct accurate sequences. De novo peptide sequencing approaches are also limited compared to low-throughput Edman methods in that they can only generate sequences as long as enzymatically digested peptides (8-20 AA) and thus cannot fully sequence protein(s) of interest. An alternative approach to sequencing individual spectra is to simultaneously interpret multiple MS/MS spectra from overlapping peptides [4]. This Shotgun Protein Sequencing (SPS) paradigm has two distinct advantages over per-spectrum strategies. First, the alignment of spectra from overlapping peptides separates true N- and C-terminal ions from noise and leads to more accurate de novo sequences ( 95% for high-resolution CID spectra) at almost full sequence coverage (95%) [2]. Second, the assembly of multiple aligned spectra allows for the extension of longer de novo sequences (up to 40 AA for high-resolution CID spectra) [2]. Remaining limitations of per-spectrum and SPS-based computational strategies have been addressed by incorporating imperfect databases of known proteins that are homologous to those in the sample. Depending upon the level of similarity between reference and target, an imperfect database can be used to correct de novo sequencing errors and anchor sequences to the reference (as done with Champs [65]), extend de novo sequences from known to unknown regions (as done with GenoMS [13]), or reorder de novo sequences to enable nearly full-length sequencing (as done with Comparative SPS, cSPS [3]).
De novo sequencing techniques have also been improved by utilizing multiple fragmentation modes. Compared to CID, alternative fragmentation strategies such as higher-energy collisional dissociation (HCD [78]) and electron transfer dissociation

(ETD [102]) are known to improve fragmentation and identification of long, highly charged peptides [38]. HCD in particular has been shown to improve de novo peptide sequencing accuracy to 95% and boost interpretations of long peptides, albeit at only 55% sequence coverage of peptides identified by database search [15]. When high-resolution CID and HCD spectra were processed with an updated SPS assembly algorithm (called Meta-SPS (Chapter 2)), de novo protein sequences were extended to 100 AA at the maximum and 20 AA on average at 94% sequencing accuracy/65% sequence coverage for a 6-protein sample mixture and 97% sequencing accuracy/89% sequence coverage for a purified monoclonal antibody. ETD has also been shown to improve per-spectrum sequencing length and accuracy [67], but the benefits of ETD for de novo sequencing are perhaps better utilized when it is paired with CID. In this approach, a CID spectrum and an ETD spectrum are acquired for every precursor such that each pair of CID/ETD can be attributed to the same peptide. It is well known that CID and ETD exhibit complementary fragmentation patterns that, when paired with each other, can yield much richer N/C-terminal ion ladders for a greater variety of peptides [38]. Although the decreased scan rate of ETD means fewer MS/MS spectra can be acquired per aliquot of sample material, ETD significantly increases the fraction of identifiable spectra for both database search [56] and per-spectrum de novo sequencing [22, 89], particularly when used in conjunction with enzymes such as LysC and GluC to acquire spectra from a greater variety of longer peptides (>20 AA) [99]. However, per-spectrum interpretation of paired fragmentation methods still cannot produce sequences longer than enzymatically digested peptides (13-20 AA depending on the digestion parameters) and has not achieved levels of sequencing accuracy/coverage greater than 95%/65% for high-resolution MS/MS [89].
Furthermore, published de novo sequencing tools capable of processing paired CID, HCD, or ETD spectra have not been made publicly available. Advances in MS/MS instrumentation have enabled fast acquisition of a CID

spectrum, HCD spectrum, and ETD spectrum per precursor such that each triplet of CID/HCD/ETD can be attributed to the same peptide. For example, an LTQ Velos Orbitrap instrument can acquire 5 triplets of CID/HCD/ETD MS/MS in a cycle of 1 MS in approximately the same amount of time as a cycle of 1 MS and 5 CID-only MS/MS spectra on a prior generation LTQ-Orbitrap instrument. To take advantage of this capability, we describe a fully automated de novo protein sequencing approach that utilizes CID/HCD/ETD triplets from overlapping peptides to yield sequences as long as 200 AA (70 AA on average) at 99% sequencing accuracy and 71% sequencing coverage. To this end we updated algorithmic steps of the Meta-SPS (Chapter 2) pipeline to process any combination of high-resolution CID, HCD, and ETD spectra from each peptide. Investigations into separate acquisition of CID, HCD, and ETD have shown promise for database search [95, 94, 34] but, to the best of our knowledge, this is the first application of triplet CID/HCD/ETD acquisition for de novo protein sequencing. We demonstrate that corroborating evidence of peptide fragmentation observed in CID/ETD pairs and CID/HCD/ETD triplets from overlapping peptides enables near-full length de novo protein sequencing at nearly perfect accuracy.

3.2 Methods

Since Shotgun Protein Sequencing [2] interprets spectra from overlapping peptides, sample proteins were digested with multiple enzymes. High-resolution MS/MS CID/HCD/ETD triplets were then acquired on a Thermo LTQ-Orbitrap Velos and run through the updated Meta-SPS pipeline illustrated in Figure 3.1. To enable support for CID/HCD/ETD spectra we updated our prealignment steps to process and merge any combination of CID/HCD/ETD spectra from each precursor by adding two new stages to the Meta-SPS workflow. First, PepNovo+ [32] was trained to score high resolution CID, HCD, and ETD MS/MS spectra (see section PepNovo+ Training). Since PepNovo+

Figure 3.1. Updated Meta-SPS pipeline. Green arrows denote procedures previously described in Chapter 2 and [2] while red arrows denote updated procedures described here.

cannot analyze multiple spectra from the same precursor, a procedure was developed to merge scored CID/HCD/ETD spectra and take advantage of corroborating evidence (see section CID/HCD/ETD Merging).

MS/MS Acquisition

To benchmark and test this approach, 21,901 CID/HCD/ETD triplets (65,703 total MS/MS spectra) were separately acquired from aliquots of 7 digests of a mixture of 6 known proteins. An equimolar mixture of 6 commercially purified proteins containing 252 µg of total protein was prepared. Cysteines were reduced with dithiothreitol (DTT) and alkylated with iodoacetamide. Seven 32 µg aliquots were created and used for 7 different digests with Trypsin, Chymotrypsin, Lys-C, Arg-C, Glu-C, Asp-N, or CNBr. The 6 proteins with accompanying molecular weights and Swiss-Prot accession numbers are bovine aprotinin (6.5 kDa, P00974), murine leptin (16 kDa, P41160), horse heart myoglobin (17 kDa, P68082), horseradish peroxidase (39 kDa, P00433), E. coli GroEL (57 kDa, P0A6F5), and human kallikrein-related peptidase (29 kDa, P07288). Details of sample preparation have been described previously (Chapter 2). Aliquots of each digest (0.5 µg) were analyzed with an automated nano LC-MS/MS system, consisting of an Agilent 1200 nano-LC system (Agilent Technologies, Wilmington, DE) coupled to an LTQ-Orbitrap Velos Fourier transform mass spectrometer (Thermo Fisher Scientific, San Jose, CA) equipped with generation 2 ion optics (Velos Pro) and a nanoflow ionization source (James A. Hill Instrument Services, Arlington, MA). Peptides were eluted from a 10 cm column (PicoFrit 75 µm ID, New Objective) packed in-house with ReproSil-Pur C18-AQ 3 µm reversed phase resin (Dr. Maisch, Ammerbuch, Germany) using a 95 min acetonitrile/0.1% formic acid gradient at a flow rate of 200 nL/min to yield 20 s peak widths. Solvent A was 0.1% formic acid and solvent B was 90% acetonitrile/0.1% formic acid. The elution portion of the LC gradient was 3-6% solvent B in 1 min, 6-31% in 50 min, 31-60% in 13 min, 60-90% in 1 min, and held at 90% solvent B for 5 min. Data-dependent LC-MS/MS spectra were acquired in 3 s cycles; each cycle was of the following form: one full Orbitrap MS scan at resolution followed by 15 MS/MS scans in the Orbitrap at resolution using an isolation width of 3.0 m/z. The top 5 most abundant precursor ions were each sequentially subjected to CID, HCD, and ETD dissociation. Dynamic exclusion was enabled with a mass width of 20 ppm, a repeat count of 1, and exclusion duration of 12 s. Charge state screening was enabled along with monoisotopic precursor selection and nonpeptide monoisotopic recognition to prevent triggering of MS/MS on precursor ions with unassigned charge or a charge state of 1. For CID, the normalized collision energy was set to 30 with an activation Q of 0.25 and activation time of 30 ms. For HCD, the normalized collision energy was set to 45. For ETD, fluoranthene was used as the ETD reagent with an anion AGC target of ions, supplemental activation was enabled, and the reaction time was dependent on the precursor charge state (precursor charge state - reaction time in msec: , , +4 50, +5 40, , etc). All MS/MS spectra were collected with an AGC target ion setting of ions.
The instrument control software does not currently allow for separate AGC targets for each dissociation mode. Optimal AGC targets would be closer to 30,000 ions for CID and HCD, and 200,000

ions for ETD [34]. All mass spectra associated with this paper may be downloaded from

Spectrum Preprocessing and Notation

Thermo RAW files were converted to mzXML with ProteoWizard [54] (version ). To validate de novo sequencing accuracy, all combinations of CID/HCD/ETD pairs/triplets as well as individual CID, HCD, and ETD spectra were searched with MS-GFDB [56] against the 6 target proteins and known contaminants with a spectrum-level false discovery rate of 1% (see Supporting Information for parameters used for MS-GFDB). As part of the Meta-SPS pipeline, high-resolution MS/MS peaks were first deconvoluted such that all peaks were converted to charge one. The following notation is used below: a peptide MS/MS spectrum S is defined as a collection of peaks where each peak p ∈ S has mass m[p] and intensity i[p]. The parent mass M[S] is the cumulative mass of all amino acids in the peptide sequence and the precursor charge Z[S] is the charge of the peptide precursor ion.

PepNovo+ Training

Rather than processing MS/MS spectra directly, Meta-SPS uses PepNovo+16 to interpret MS/MS fragmentation patterns and convert MS/MS spectra into PRM spectra, where peak intensities are replaced with log-likelihood scores and peak masses are replaced by PRMs [20], or Prefix-Residue Masses (cumulative amino acid masses of N-terminal prefixes of the peptide sequence). Peak scores combine evidence supporting peptide breaks (observed cleavages along the peptide backbone, supported by either N- or C-terminal fragments). N/C-terminal fragments may be observed as b/y ions in CID/HCD and as c/z /z ± H [88] ions in ETD. Because complementarity between b/y and c/z ions can cause C-terminal MS/MS ions to be misinterpreted as

83 70 N-terminal ions, PRM spectra also typically contain many SRMs, or Suffix-Residue Masses (cumulative amino acid masses of C-terminal suffixes of the peptide sequence). This approach considers peaks in PRM spectra as both PRMs and SRMs because some spectra may contain predominantly SRMs and on average they make up 30 40% of all true PRMs or SRMs. In previous work, high-resolution CID and HCD MS/MS spectra were scored with a PepNovo+ scoring model that was not trained to process deconvoluted (Chapter 2) spectra and there was no PepNovo + scoring model for ETD. In training the new models, we deconvoluted the training spectra because PepNovo + was optimized to analyze charge 2 and 3 tryptic CID spectra, and thus does not give enough weight to MS/MS peaks of charge 3 or higher in spectra from precursors of charge 3. Here we trained three new scoring models for deconvoluted high-resolution CID, HCD, and ETD MS/MS spectra using multiple data sets. These new models can only be used to generate PRM spectra, not de novo peptide sequences. Although PepNovo+ PRM models were trained automatically with PSMs from 3,000 unique peptides per precursor charge state, training the rank-boosting[29] models needed for peptide sequencing required too many PSMs from unique peptides (>100,000) as well as more extensive modification of PepNovo + source code. Due to the limited availability of large sets of annotated CID, HCD, and ETD high-resolution MS/MS spectra from multiple enzymes at the time of this study, only tryptic spectra were used to train the CID model while tryptic and Lys-C spectra were combined to train each of the HCD and ETD models. The first data set consists of high-resolution CID, HCD, and ETD MS/MS spectra from tryptic peptides [34]. Another 175,595 tryptic HCD MS/MS spectra were provided by the Zubarev lab at the Karolinska Institute. 
The third data set consists of high-resolution ETD and HCD MS/MS spectra from Lys-C digestion and SCX fractionation of a yeast lysate collected in conjunction with the 2011 ABRF-iPRG study (see Supporting Information for description) [17]. All

84 71 raw MS/MS spectra then were identified by MS-GFDB at 1% spectrum-level FDR to yield the set of training PSMs. PepNovo + used these PSMs to automatically learn ion types, intensity ranks, and noise models for each type of spectra and output models which can be used to score unidentified MS/MS spectra of the same type. See Supporting Information for details regarding the MS-GFDB searches and the specific PepNovo + training procedure CID/HCD/ETD Merging Given a CID (S CID = {c 1,...,c n }), HCD (S HCD = {h 1,...,h m }), and/or ETD (S ET D = {e 1,...,e q }) PRM spectrum from the same precursor, the merging procedure generates a single merged PRM spectrum (S = {p 1,..., p r }) (with the same parent mass M[S]) for all available spectra. Using the set of training PSMs, the objective is to maximize observed breaks, which is the percentage of all breaks observed as PRMs/SRMs at correct N/C-terminal masses (a measure of sensitivity), while also maximizing explained score, which is the percentage of score in correct PRMs/SRMs relative to the score of all PRMs/SRMs in the same spectrum (a measure of accuracy). PRM spectra typically contain many C-terminal SRM masses along with N-terminal PRM masses. While PRM peaks have no offset from the summed amino acid masses, C-terminal peaks are offset by +18 Da (mass of H 2 O) from SRMs in CID and HCD spectra [20, 107]. In ETD spectra, C-terminal peaks are offset by 15 Da (mass of NH) from SRMs [56]. Given a PRM or SRM mass m, one can locate the complementary SRM or PRM mass in CID and HCD spectra with the formula twin CID (m,s) = twin HCD (m,s) = M[S] m + 18, while complementary masses in ETD can be found with twin ET D (m,s) = M[S] m 15. Using these offsets, one can locate corroborating peaks from CID/ETD and HCD/ETD pairs that support the same peptide break, which are much more likely to explain true peptide breaks than individual PRMs. For example, we found that 92% of the score in peaks from

85 72 identified ETD PRM spectra with matching peaks at the same (or complementary) mass in CID or HCD spectra was found in true PRMs/SRMs. In contrast, only 70 80% explained score is typically found in individual PRM spectra. Since PepNovo + does not currently recognize CID/HCD + ETD corroborating evidence when assigning log-likelihood scores, we postprocessed the scores of corroborating PRMs/SRMs into combined scores in the merged PRM spectrum. However, since corroborating PRMs/SRMs only account for 47% of all peptide breaks in identified CID/HCD/ETD triplets, peaks without corroborating evidence must also be added to the merged spectrum. Since 80% explained score was found to yield high de novo sequencing accuracy (97%) in a previous application of Meta-SPS (Chapter 2), steps were developed to maximize the percentage of observed breaks at 80% explained score for all precursor charge states. First, corroborating PRMs and SRMs from CID/ETD and HCD/ETD pairs were extracted from PRM spectra and the corresponding combined PRMs were inserted into the merged spectrum. This was done in a series of steps to reduce the chances of misinterpreting SRMs as PRMs. But since steps 14 only captured PRMs and SRMs explaining 47% of all peptide breaks, the remaining peaks from CID, HCD, and ETD were also added to the merged spectrum in step 5 to bring the percentage of observed breaks to 94%. While this improved sensitivity, it also combined the noise between all three spectra such that the percentage of explained score was only 59% (instead of 91% for PRMs with corroborating evidence). Thus, local rank-based filtering was applied in step 6 to yield 86% observed breaks at 80% explained score over all precursor charge states (Figure 3.2b). We describe this procedure for merging CID/ETD pairs, but the

Figure 3.2. MS/MS ion statistics and performance of CID/HCD/ETD PRM scoring and merging. A) Observed MS/MS ions: Percentage of peptide breaks observed by N-terminal ions (b ions in CID/HCD and c ions in ETD) and/or C-terminal ions (y ions in CID/HCD and z•/z + H ions [88] in ETD) over all MS/MS CID/HCD/ETD triplets identified by MS-GFDB (considering a 10 ppm peak tolerance). To filter out low-intensity noise peaks, a peak was counted if and only if its intensity was ranked in the top seven over all neighbouring peak intensities within a ±56 Da radius. Rows separate baseline PSMs by precursor charge of identified triplets. B) Performance of PRM scoring: Percentage of observed peptide breaks and percentage of explained score (the summed score of all true PRMs over the sum of all scores in the spectrum × 100) were counted over all combinations of merged/unmerged PRM spectra (without clustering) with identified MS/MS spectra. Peaks at N/C-terminal masses indicated peptide breaks in all cases. Each combination of PRM spectra was benchmarked by MS-GFDB IDs of the same combination of MS/MS spectra [56] (CID/HCD/ETD PRMs were benchmarked with CID/HCD/ETD IDs, CID/ETD PRMs with CID/ETD IDs, HCD PRMs with HCD IDs, etc.). Also indicated is the performance gained by retraining PepNovo+ to individually score high-resolution CID, HCD, and ETD spectra. C) Identified spectra and peptides: The numbers of identified spectra and unique peptides are shown for each combination of MS/MS spectra used to benchmark PRM scores in (B). As expected, incorporation of ETD significantly improves identification rates of spectra from highly charged precursors.
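The two benchmark quantities in panel B can be illustrated with a small Python sketch. It assumes a PRM spectrum is a list of (mass, score) pairs with non-negative scores and that the true peptide is known. The function names, the toy spectrum, and the defaults (a 0.04 Da tolerance and a nominal +18 Da C-terminal offset, as for CID/HCD) are illustrative assumptions; the residue masses are standard monoisotopic values.

```python
# Standard monoisotopic residue masses (Da) for a few amino acids.
MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
        "L": 113.08406, "K": 128.09496, "E": 129.04259, "R": 156.10111}


def parent_mass(peptide):
    """M[S]: cumulative mass of all residues in the peptide."""
    return sum(MONO[aa] for aa in peptide)


def prefix_masses(peptide):
    """PRMs: cumulative residue masses of all proper N-terminal prefixes."""
    masses, total = [], 0.0
    for aa in peptide[:-1]:
        total += MONO[aa]
        masses.append(total)
    return masses


def benchmark(spectrum, peptide, c_term_offset=18.0, tol=0.04):
    """Return (observed_breaks_pct, explained_score_pct) for one PRM spectrum."""
    m_total = parent_mass(peptide)
    # A break is supported by a peak at its PRM mass or at the
    # complementary C-terminal mass M - PRM + offset.
    breaks = [(prm, m_total - prm + c_term_offset)
              for prm in prefix_masses(peptide)]

    def supports(mass, targets):
        return any(abs(mass - t) <= tol for t in targets)

    observed = sum(1 for targets in breaks
                   if any(supports(m, targets) for m, _ in spectrum))
    true_score = sum(s for m, s in spectrum
                     if any(supports(m, targets) for targets in breaks))
    total_score = sum(s for _, s in spectrum)
    return 100.0 * observed / len(breaks), 100.0 * true_score / total_score


# Toy example: two of the three breaks of GAVK are observed, and half of
# the total score sits in a noise peak at 300 Da.
toy = [(57.02, 2.0), (227.13, 3.0), (300.0, 5.0)]
observed_pct, explained_pct = benchmark(toy, "GAVK")
```

For an ETD PRM spectrum the same function could be called with `c_term_offset=-15.0`, following the offsets given in the Methods.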


same method can also be applied to HCD/ETD pairs.

1. Consider all PRM/PRM matches: Find all pairs of peaks with the same mass ($c_i, e_k : m[c_i] = m[e_k]$) and add a peak $s$ to the merged spectrum $S$ with PRM mass $m[s] = m[e_k]$. Whenever a peak is added to the merged spectrum, it only defines a new mass if that mass does not already exist in the merged spectrum within peak tolerance (otherwise the new peak's score is just added to the existing peak). Also find any complementary SRMs from the set $\{c_x, e_z : m[c_x] = \mathrm{twin}_{CID}(m[s], S) \lor m[e_z] = \mathrm{twin}_{ETD}(m[s], S)\}$. For all of these peaks that were found, assign $s$ the merged score $i[s] = 2 \times (i[c_i] + i[c_x] + i[e_k] + i[e_z])$ and remove $c_i$, $c_x$ (or $h_j$, $h_y$ for an HCD/ETD pair), $e_k$, and $e_z$ from $S_{CID}$ (or $S_{HCD}$) and $S_{ETD}$, respectively.

2. Consider all SRM/SRM matches with at least one PRM: Find all pairs of SRM peaks with mass difference ($c_x, e_z : m[c_x] = m[e_z] + 33$) and where at least one PRM from the set $\{c_i, e_k : m[e_k] = \mathrm{twin}_{ETD}(m[e_z], S) \lor m[c_i] = m[e_k]\}$ is found from any spectrum (CID or ETD) for these SRMs. Then add a peak $s$ to the merged spectrum $S$ with the PRM mass $m[s] = m[e_k]$, remove all of these peaks from $S_{CID}$ and $S_{ETD}$, and assign $s$ the merged score by the same formula as in stage 1.

3. Consider all PRM/SRM and SRM/PRM pairs: Find all pairs of PRM/SRM peaks ($c_i \in S_{CID}, e_z \in S_{ETD} : m[c_i] = \mathrm{twin}_{ETD}(m[e_z], S)$) or SRM/PRM peaks ($c_x \in S_{CID}, e_k \in S_{ETD} : m[c_x] = \mathrm{twin}_{CID}(m[e_k], S)$). Add a peak $s$ to the merged spectrum with the PRM mass ($m[s] = m[c_i]$ for PRM/SRM pairs or $m[s] = m[e_k]$ for SRM/PRM pairs), remove all of its supporting peaks from $S_{CID}$ and $S_{ETD}$, and assign $s$ the merged score by the same formula as in stage 1.

4. Consider all SRM/SRM matches without PRMs: Find all pairs of SRM peaks with mass difference ($c_x, e_z : m[c_x] = m[e_z] + 33$). Then add a peak $s$ to the merged spectrum with the PRM mass $m[s] = \mathrm{twin}_{ETD}(m[e_z], S)$, remove all of its supporting peaks from $S_{CID}$ and $S_{ETD}$, and assign $s$ the merged score by the same formula as in stage 1.

5. Add leftover peaks from $S_{CID}$ and $S_{ETD}$ to $S$ without changing their scores.

6. Filter out peaks with low scores in $S$: a peak is retained if and only if its score is ranked in the top three over all neighboring PRM scores within a ±56 Da mass range.

The MS/MS spectra were acquired under conditions yielding mass measurement errors of ±10 ppm. But since PepNovo+ incorporates the parent mass error when assigning PRM masses from C-terminal fragment masses, a fixed 0.04 Da tolerance was used. This corresponds to 400 ppm at m/z 100, 40 ppm at m/z 1000, and 10 ppm at m/z 4000. Merged PRM spectra from the same peptide were then clustered by an approach similar to MSCluster [30] (see Supporting Information for description). 21,901 CID/HCD/ETD triplets were combined into 11,325 clusters, each containing one or more triplets. A cluster contains only triplets sharing the same parent mass M[S]. Thus, triplets derived from the same peptide, but in different precursor charge states, were still merged. Replicate triplet spectra exist in the data set for two major reasons. First, given the small number of proteins in the sample and the rapid acquisition rate of the mass spectrometer, the dynamic exclusion time for triggering repeat acquisition of a particular precursor m/z was set to 1/2 the chromatographic peak width to maximize the chance of collecting MS/MS near each peptide's chromatographic apex. Second, some of the same peptides can be produced by digestion with two different enzymes. For example, some tryptic peptides are also produced by Lys-C or Arg-C digestion. The clustered set of merged PRM spectra was then run through the Meta-SPS pipeline illustrated in Figure 3.1, which involves two stages of alignment/assembly. PRM spectra were first aligned and assembled into contigs (sets of spectra from overlapping peptides) [2], which were further connected to form meta-contigs (sets of overlapping contigs) (Chapter 2). Figure 3.3 illustrates a resulting de novo protein sequence extracted from the highest-scoring consensus interpretation of a meta-contig. This updated Meta-SPS pipeline, along with the newly trained PepNovo+ scoring models, is available at

3.3 Results

The performance of Meta-SPS on CID/HCD/ETD triplets was assessed in terms of de novo sequencing length, coverage, and accuracy. Coverage and length were determined via modification-tolerant alignment of de novo sequences to the reference protein sequences [3]. Sequencing accuracy was also computed as done previously (Chapter 2): MS-GFDB peptide-spectrum matches were transferred to PRM spectra and then to meta-contigs. A sequence call (mass of one or more possibly modified amino acids) was labeled correct if its consecutive flanking peaks are annotated by an MS-GFDB peptide match in the same ion series in the same identified spectrum (i.e., both are annotated as PRMs or SRMs from MS-GFDB's peptide match). All other sequence calls from identified spectra are labeled incorrect. Remaining sequence calls whose flanking peaks are not from identified spectra are labeled unannotated. See Supporting Information for details regarding the MS-GFDB searches used to compute performance metrics in Figure 3.2 and Table 3.1. Figure 3.2a shows MS/MS ion statistics over all identified CID/HCD/ETD triplets and Figure 3.2c shows the numbers of identified spectra and peptides for all combinations of CID/HCD/ETD. Table 3.1a details the spectrum coverage by MS-GFDB (percent of protein sequence covered by identified peptides) for different combinations of fragmentation methods and Table 3.1b details coverage of all six proteins. Since Meta-SPS sequencing errors are usually distributed toward the ends of


Figure 3.3. Assembled meta-contig of CID/HCD/ETD triplets. The topmost sequence is the myoglobin sequence as it is aligned to the de novo sequence below it. Each row denotes a merged PRM spectrum from one or more CID/HCD/ETD triplets where peaks not aligned to other merged PRM spectra from overlapping peptides are removed [2]. Red peaks indicate PRMs supporting the de novo sequence and green arrows between red peaks denote 1–2 AA mass differences supporting the consensus de novo sequence. Red vertical dotted lines connect assembled PRMs to each de novo sequence call; black peaks were not assembled into the consensus. Blue bars denote spectrum end points (at mass 0 and parent mass M[S]). The height of each peak corresponds to the merged PRM score from CID, HCD, and ETD. The red labels [+0.98] and [+16.00] indicate post-translational modification masses that were tolerated during alignment/assembly (without knowing of them in advance). All de novo sequence calls, except the R at the end, were verified by database search.
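The merged PRM scores plotted as peak heights come from the staged merging procedure in the Methods. A minimal Python sketch of two of those stages for a CID/ETD pair is given below: stage 1 (PRM/PRM matches together with their complementary SRMs) and stage 6 (local rank filtering). Spectra are assumed to be lists of (mass, score) tuples, and the helper names (`twin_cid`, `twin_etd`, `find`, `merge_stage1`, `rank_filter`) are illustrative and not part of Meta-SPS. Stages 2 through 5 and the folding of a new peak into an existing merged peak of the same mass are omitted, so this is a sketch of the idea under those simplifications, not the published implementation.

```python
TOL = 0.04  # fixed peak tolerance in Da, as stated in the text


def twin_cid(m, parent_mass):
    """Complementary mass in CID/HCD PRM spectra: M[S] - m + 18."""
    return parent_mass - m + 18.0


def twin_etd(m, parent_mass):
    """Complementary mass in ETD PRM spectra: M[S] - m - 15."""
    return parent_mass - m - 15.0


def find(spectrum, mass, skip=(), tol=TOL):
    """Index of the first unused peak within tol of mass, or None."""
    for idx, (m, _) in enumerate(spectrum):
        if idx not in skip and abs(m - mass) <= tol:
            return idx
    return None


def merge_stage1(cid, etd, parent_mass):
    """Stage 1: merge PRM/PRM matches plus any complementary SRM peaks.

    Returns (merged, cid_left, etd_left), where used peaks are removed
    from the leftover spectra.
    """
    used_c, used_e, merged = set(), set(), []
    for i, (mass_c, score_c) in enumerate(cid):
        if i in used_c:
            continue
        k = find(etd, mass_c, skip=used_e)
        if k is None:
            continue
        mass_s = etd[k][0]                # m[s] = m[e_k]
        total = score_c + etd[k][1]
        used_c.add(i)
        used_e.add(k)
        x = find(cid, twin_cid(mass_s, parent_mass), skip=used_c)
        if x is not None:                 # complementary CID SRM
            total += cid[x][1]
            used_c.add(x)
        z = find(etd, twin_etd(mass_s, parent_mass), skip=used_e)
        if z is not None:                 # complementary ETD SRM
            total += etd[z][1]
            used_e.add(z)
        merged.append((mass_s, 2.0 * total))
    cid_left = [p for i, p in enumerate(cid) if i not in used_c]
    etd_left = [p for i, p in enumerate(etd) if i not in used_e]
    return merged, cid_left, etd_left


def rank_filter(spectrum, top=3, radius=56.0):
    """Stage 6: keep a peak only if fewer than `top` peaks within
    +/- radius Da have a strictly higher score."""
    kept = []
    for mass, score in spectrum:
        higher = sum(1 for m, s in spectrum
                     if abs(m - mass) <= radius and s > score)
        if higher < top:
            kept.append((mass, score))
    return kept
```

In this sketch, peaks with tied scores are all retained by `rank_filter`; the text does not say how ties are handled, so that choice is an assumption.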

Table 3.1. De novo Sequencing Length, Coverage, and Accuracy for Alternative Minimum Meta-contig Size (κ) Cutoffs. A) Sequencing results per combination of fragmentation modes: Spectrum Coverage is the percent of amino acids in all proteins covered by peptides identified by MS-GFDB at 1% FDR. Sequencing Coverage is the percent of amino acids in all proteins covered by at least one aligned de novo sequence. Average Seq. Length is the average number of amino acids covered by each aligned de novo sequence and Longest Sequence is the maximum number of amino acids covered by a single de novo sequence. Sequencing Accuracy is the percentage of all annotated sequence calls that were labeled correct. Un-annotated Seq. Calls is the percentage of sequence calls that were un-annotated. Each column indicates which combination of MS/MS spectra was used as input to Meta-SPS and database search. B) Sequencing results per protein (using CID/HCD/ETD fragmentation): The same metrics as in panel A are shown for each protein in the CID/HCD/ETD data set (cumulative results over all six proteins are shown in the first column of panel A).
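The coverage and accuracy metrics defined in this caption reduce to simple counting. The Python sketch below assumes that each aligned de novo sequence is summarized as a half-open interval of reference positions and that each sequence call carries one of the labels `correct`, `incorrect`, or `unannotated`. These data layouts and the function names are illustrative assumptions, not the format used by Meta-SPS.

```python
def sequencing_coverage(protein_length, aligned_intervals):
    """Percent of reference positions covered by at least one interval.

    aligned_intervals: iterable of (start, end) with end exclusive.
    """
    covered = set()
    for start, end in aligned_intervals:
        covered.update(range(max(start, 0), min(end, protein_length)))
    return 100.0 * len(covered) / protein_length


def average_and_longest(aligned_intervals):
    """Average and maximum number of residues covered per aligned sequence."""
    lengths = [end - start for start, end in aligned_intervals]
    return sum(lengths) / len(lengths), max(lengths)


def accuracy_and_unannotated(labels):
    """Sequencing accuracy over annotated calls, and percent unannotated."""
    correct = labels.count("correct")
    incorrect = labels.count("incorrect")
    unannotated = labels.count("unannotated")
    accuracy = 100.0 * correct / (correct + incorrect)
    unannotated_pct = 100.0 * unannotated / len(labels)
    return accuracy, unannotated_pct
```

Note that accuracy is computed only over annotated calls, so a data set with many unannotated calls can still report high accuracy; this is why the table lists both quantities.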


sequences (Chapter 2), we removed the first and last sequence calls from every de novo sequence before computing coverage and accuracy. Resulting meta-contigs were binned by κ, the minimum allowable number of combined SPS contigs per meta-contig, and results are reported for κ ≥ 1, κ ≥ 2, and κ ≥ 5. κ ≥ 5 yields the longest and most accurate subset of meta-contig sequences because each of these must be supported by at least 5 SPS contig sequences, whereas κ ≥ 1 retains unmerged SPS contigs along with meta-contigs of all sizes to yield the highest sequencing coverage. At κ ≥ 5, 19 de novo sequences assembling CID/HCD/ETD triplets were returned by Meta-SPS, all of which matched to the reference (with at most two modifications per match) and covered 71% of all six proteins at an average length of 66 AA (Table 3.1a). At κ ≥ 1 and κ ≥ 2, minimal losses in sequencing accuracy were sustained (98%) to achieve sequencing coverage (80% and 84%, respectively) closer to the coverage of database search (88%) at 1% FDR. The longest sequence spanned 194 AA and is shown in Figure 3.4 along with the longest sequences covering each of the six proteins. Although sequences from CID/ETD pairs only (i.e., no HCD) were not as long at the maximum (125 AA), they were still longer than 50 AA on average (at κ ≥ 5) and covered 67–81% of target proteins depending on κ (Table 3.1a). HCD/ETD pairs exhibited roughly the same sequence coverage and length as CID/ETD (65–82% coverage, 131 AA maximum length, and 49 AA average length). The highest sequencing accuracy was observed for CID/ETD pairs and CID/HCD/ETD triplets at 99.5% and 98.9%, respectively, while HCD/ETD pairs gave 96.5% accuracy. ETD provides a significant increase in interpretable MS/MS fragmentation of long, highly charged peptides as well as a gain in PRM scores given to corroborating peaks in CID/ETD and HCD/ETD (Figure 3.2).
Corroborating evidence was a very significant feature of peptide fragmentation, as 91.8% of PRM scores were found in true PRMs after merging stages 1–4. As a result, the combinations of CID/ETD, HCD/ETD,

Figure 3.4. De novo sequencing coverage of six target proteins at κ ≥ 5. Every coloured row corresponds to a de novo sequence as separately mapped to the reference protein sequence (information not used by Meta-SPS); each row in the coverage map spans at most 85 AA. Regions of each sequence that were mapped to the reference with unknown modifications have Xs in place of AA letter codes. Below each protein map is the longest de novo sequence covering that protein (also indicated in bold boxes in the coverage maps) following removal of first/last sequence calls. Blue letters correspond to calls that span 2 or more AA in the reference. Red letters indicate incorrect sequence calls as aligned to the reference. Remaining uncoloured AA represent sequence calls that match reference amino acid masses. Regions where lack of de novo sequencing coverage was expected (due to lack of coverage by database search) are indicated with a dashed red line. As mentioned in the Results section, these lapses in coverage likely occur because of known cleavage of signal peptides and glycosylation sites.
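The κ cutoff used for this figure can be read as a simple filter on assembled sequences. In the sketch below each assembly is represented by a hypothetical record holding the number of SPS contigs it combines; the record type, its field names, and the toy sequences are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class MetaContig:
    sequence: str
    num_sps_contigs: int  # an unmerged SPS contig has num_sps_contigs == 1


def filter_by_kappa(contigs, kappa):
    """Keep assemblies supported by at least `kappa` SPS contigs."""
    return [c for c in contigs if c.num_sps_contigs >= kappa]


# Toy assemblies (sequences are placeholders, not results from the text).
assemblies = [
    MetaContig("GLSDGEWQ", 1),
    MetaContig("VLNVWGKVEADIAGHGQ", 2),
    MetaContig("EVLIRLFTGHPETLEKFDKFKHLK", 6),
]
```

Raising κ trades coverage for confidence: `filter_by_kappa(assemblies, 1)` keeps everything, including unmerged SPS contigs, while `filter_by_kappa(assemblies, 5)` keeps only the most strongly supported meta-contigs.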


and CID/HCD/ETD gave the highest quality PRM spectra from long peptides, which are especially useful for assembly because they enable the extension of de novo sequences into regions that might not contain overlapping coverage of shorter peptides with precursor charge 2/3 due to either over-digestion or incomplete enzyme digestion. The quality of PRM spectra from long peptides was also improved by training PepNovo+ on high-resolution CID, HCD, and ETD MS/MS spectra (Figure 3.2b). Of the 6 proteins analyzed in this work, leptin and GroEL were produced recombinantly in E. coli while kallikrein-related peptidase, aprotinin, myoglobin, and peroxidase were isolated from natural sources. As documented in UniProt, leptin, kallikrein-related peptidase, aprotinin, and peroxidase are each known to contain N-terminal signal peptides that target the proteins for secretion from their cells of origin. Aprotinin and peroxidase further contain propeptide sequences that are cleaved upon activation. While the signal peptides and propeptides would be missing from the proteins we analyzed, in Table 3.1 and Figure 3.4 we have used the full-length gene sequence when calculating coverage by the assembled MS/MS spectra. Leptin contains a signal peptide (amino acids 1–21) that is lacking in the recombinant material obtained from Sigma-Aldrich. Kallikrein-related peptidase contains a signal peptide (amino acids 1–17), a propeptide (amino acids 17–24), and known N-linked glycosylation at amino acid 69. Aprotinin contains a signal peptide (amino acids 1–21) and propeptides (amino acids and ). Peroxidase contains a signal peptide (amino acids 1–30), a propeptide (amino acids ), and known N-linked glycosylation sites at amino acids 43, 87, 188, 216, 228, 244, 285, and 298.
The sugar microheterogeneity at N-linked glycosylation sites will tend to render any individual proteolytically generated peptide containing that amino acid much less concentrated in the digestion mixture and, if subjected to MS/MS, much less likely to yield interpretable fragmentation. These modifications, along with incomplete peptide sampling by the instrument, likely explain why 12% of protein sequences were not covered by database search. Remaining losses of coverage from de novo sequencing can be attributed to a lack of spectra from overlapping peptides with sufficient fragmentation. To determine whether all enzymes were necessary to achieve quality sequencing, seven data sets were generated such that spectra from each of the seven enzymes were separately excluded from the CID/HCD/ETD data. De novo sequencing length, coverage, and accuracy from these runs are shown in Table 3.2a. At κ = 1, each of these data sets exhibited roughly the same sequencing accuracy (97–98%), high maximum sequence length ( AA), yet varying levels of de novo sequencing coverage. All runs yielded 79–83% sequencing coverage except when CNBr spectra were excluded, in which case sequencing coverage dropped to 64%. Table 3.2a shows that CNBr did not yield spectrum coverage that was missed by other enzymes (MS-GFDB coverage only dropped from 88.3 to 87.4%). However, CNBr contributed the most unique peptides from highly charged precursors (Table 3.2b). Although the most abundant precursor ions in our CNBr data are derived from peptides that span the distance between two methionine residues, much of the data instead consists of peptides bounded by a Met-specific cleavage on one end and a nonspecific hydrolysis cleavage on the other end. This yields sets of overlapping peptides that differ only by short AA truncations on either end. Altogether, these features result in CNBr outperforming Lys-C, Arg-C, Asp-N, and Glu-C digests in terms of generating the most precursors from long overlapping peptides, which are valuable to Meta-SPS for assembling long de novo sequences with high sequencing coverage.

3.4 Discussion

Multispectrum acquisition of high-resolution CID, HCD, and ETD coupled with the proposed improvements to Meta-SPS enables near full-length automated de novo sequencing of simple protein mixtures at 99% sequencing accuracy. To the best

Table 3.2. De novo Sequencing and Database Search Results by Enzyme. A) Sequencing results per excluded enzyme (using CID/HCD/ETD fragmentation): Each column indicates which spectra (acquired from a specific enzyme digestion) were removed from the full set of triplet spectra. The same metrics as in Table 3.1a are shown for all contigs (i.e., including meta-contigs and un-merged SPS contigs, κ = 1). Digestion by CNBr was found to contribute the most de novo sequencing coverage to the combined CID/HCD/ETD analysis. B) MS-GFDB results per enzyme (using CID/HCD/ETD fragmentation): Each column indicates which set of spectra was identified against the six proteins at 1% spectrum-level FDR. Spectrum Coverage is defined in Table 3.1a; Unique Peptides is the number of unique identified peptide sequences (considering PTMs); Charge >3 Peptides is the percentage of unique peptides that were identified by at least one spectrum with precursor charge >3. CNBr was found to contribute the largest set of unique peptide IDs while also having the largest proportion of identified peptides from highly charged precursors, which indicates why removing CNBr from the analysis shown in panel A yielded the least de novo sequencing coverage.
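The leave-one-enzyme-out data sets in panel A and the "Charge >3 Peptides" statistic in panel B can be sketched in Python as follows. The record layout (`enzyme`, `peptide`, `charge`), the function names, and the toy identifications are hypothetical and serve only to show the bookkeeping.

```python
def leave_one_enzyme_out(psms):
    """Map each enzyme to the PSMs acquired from all *other* digests."""
    enzymes = sorted({p["enzyme"] for p in psms})
    return {e: [p for p in psms if p["enzyme"] != e] for e in enzymes}


def pct_high_charge_peptides(psms, min_charge=4):
    """Percent of unique peptides identified by at least one spectrum
    with precursor charge > 3 (that is, >= min_charge)."""
    peptides = {p["peptide"] for p in psms}
    high = {p["peptide"] for p in psms if p["charge"] >= min_charge}
    return 100.0 * len(high) / len(peptides)


# Toy identifications (placeholders, not data from the study).
toy_psms = [
    {"enzyme": "Trypsin", "peptide": "PEPTIDEK", "charge": 2},
    {"enzyme": "Trypsin", "peptide": "PEPTIDEK", "charge": 3},
    {"enzyme": "CNBr", "peptide": "LONGPEPTIDEM", "charge": 4},
    {"enzyme": "CNBr", "peptide": "LONGERPEPTIDEM", "charge": 5},
    {"enzyme": "Lys-C", "peptide": "ANOTHERK", "charge": 2},
]
subsets = leave_one_enzyme_out(toy_psms)
```

Each subset would then be passed through the same assembly and scoring steps, so that the drop in sequencing coverage can be attributed to the excluded digest.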

of our knowledge, these are the longest and most accurate de novo sequences ever reported by an automated approach. Although this approach still falls short of fully reconstructing a complete protein, the average sequence length was greater than 60 AA and approached 200 AA at the maximum, which should enable automated sequencing of small proteins such as venom toxins [2, 8] and the variable CDR regions of monoclonal antibodies [13, 3, 11]. Related methods for de novo sequencing with complementary fragmentation methods do not consider spectra from overlapping peptides, which limits sequencing length (~10 AA on average), accuracy (<95%), and coverage (<70%) [22, 89]. Still, results could possibly be improved by devising more robust probabilistic scoring functions for paired CID/ETD and HCD/ETD MS/MS spectra than those described here. Possible ways to do this include the Bayesian networks approach in Spectrum Fusion [22] or extensions of the scoring functions used in popular de novo tools like PepNovo+ and PEAKS. Although our high-resolution MS/MS acquisition enabled ±10 ppm mass tolerance, a fixed 0.04 Da tolerance was used because PepNovo+ and SPS do not yet support ppm tolerance. Allowing for ±0.04 Da mass errors is equivalent to a ppm tolerance that diminishes as m/z increases. Implementing ppm tolerance in the Meta-SPS pipeline might allow for a reduction of alignment thresholds in SPS and Meta-SPS, as the probability of random high-scoring matches between spectra from nonoverlapping peptides diminishes with tighter mass tolerance. It would also enable resolving ambiguous interpretations of near-isobaric masses (K vs. Q, K vs. GA, F vs. oxidized M, VS vs. W, and W vs. DA), which is a common limitation of proteomics mass spectrometry. Other ambiguities, such as I/L interpretations, cannot be resolved by mass alone but may be resolved by examination of amino acid-specific fragmentation patterns [37].
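To make the tolerance argument concrete, the Python sketch below converts the fixed 0.04 Da tolerance into ppm at a few m/z values and computes the near-isobaric mass differences mentioned above. The function names are illustrative, and the residue and oxidation masses are standard monoisotopic values rather than numbers taken from this chapter.

```python
# Standard monoisotopic residue masses (Da).
MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
        "D": 115.02694, "K": 128.09496, "Q": 128.05858,
        "M": 131.04049, "F": 147.06841, "W": 186.07931}
OXIDATION = 15.99491  # mass added by oxidation of methionine


def da_to_ppm(tol_da, mz):
    """Express an absolute tolerance in Da as ppm at a given m/z."""
    return tol_da / mz * 1e6


def residues_mass(residues):
    """Summed residue mass of a short string of amino acids."""
    return sum(MONO[aa] for aa in residues)


# Near-isobaric pairs listed in the text, as absolute mass differences (Da).
NEAR_ISOBARIC = {
    "K vs Q": residues_mass("K") - residues_mass("Q"),
    "K vs GA": residues_mass("K") - residues_mass("GA"),
    "F vs Mox": residues_mass("F") - (residues_mass("M") + OXIDATION),
    "VS vs W": residues_mass("VS") - residues_mass("W"),
    "W vs DA": residues_mass("W") - residues_mass("DA"),
}

for mz in (100.0, 1000.0, 4000.0):
    print(f"0.04 Da at m/z {mz:g} = {da_to_ppm(0.04, mz):.0f} ppm")
for pair, delta in NEAR_ISOBARIC.items():
    print(f"{pair}: {delta:.4f} Da")
```

Every difference in `NEAR_ISOBARIC` is smaller than 0.04 Da, which is why these pairs cannot be distinguished at the fixed tolerance used here but could be with a ppm-scale tolerance at typical fragment masses.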
Here we can report sequencing accuracy because de novo sequencing was done on a set of known proteins. When this method is applied to unknown complex samples, sequencing accuracy may still be approximated with a subset of identified spectra. If the sample is completely unknown, one could anticipate spiking the set of input spectra with a set of spectra acquired under the same experimental conditions from a few known proteins that have no homology to those in the unknown sample. Although this may capture cases where spectra from completely unrelated proteins are assembled into the same meta-contig, it will fail to capture cases where spectra from homologous proteins are combined due to sequence similarity. It remains an open problem to determine whether such sequencing errors and/or false discovery rates can be estimated by de novo assembly of MS/MS spectra. This approach is mainly limited by instrument peptide sampling bias as a result of hydrophobicity, ionizability, and locations of basic amino acids, which leads to incomplete MS/MS coverage. This can significantly affect the performance of assembly-based approaches, where full peptide coverage is not usable without sufficient overlap between peptides. As a result, Meta-SPS is currently optimized for data sets where the experimental protocol is expected to yield a high fraction of spectra from overlapping peptides. While this is currently easiest for simple protein mixtures, we would expect the same methods to apply to more complex samples as long as enough mass spectrometry runs are used to acquire spectra from overlapping peptides. In addition, analysis of more complex mixtures would benefit from faster MS/MS scan rates or analysis of multiple fractions to yield enough coverage with multiple overlapping peptide sequences.
The slower scan rate of ETD (~2/3 the rate of HCD) may further limit coverage, but our results suggest that ETD coupled with CID and/or HCD yields much longer and more accurate de novo sequencing than CID or HCD alone (even when considering that more precursors are subjected to MS/MS when fewer dissociation methods are employed), and thus the gains in sequencing outweigh the losses in peptide sampling. We further anticipate improvements in the quality of ETD spectra collected in the CID/HCD/ETD triplet configuration upon revision of the instrument control software to allow for separate AGC targets for each dissociation mode. Currently, we set the ETD AGC target 4-fold lower than optimal so as not to overly compromise CID and HCD performance.

3.5 Acknowledgements

Chapter 3, in full, is a reprint of the material as it appears in the Journal of Proteome Research: "Sequencing-grade de novo analysis of MS/MS triplets (CID/HCD/ETD) from overlapping peptides." J Proteome Res Jun 7;12(6): [41]

103 Chapter 4 De novo sequencing of polyclonal antibodies from serum Monoclonal antibodies (mabs) currently dominate the therapeutic antibody drug market due to their high specificity, however, should a disease evolve in such a way that alters the epitope targeted by the mab (called an escape variant), the therapy will fail. To address this, mixtures of mabs and polyclonal antibodies (pabs) which target multiple epitopes on the antigen(s) are being investigated. In fact, some of these antibody cocktails outperform their mab counterparts [91, 47]. Furthermore, development costs remain high in existing mab development pipelines, requiring screening thousands of circulating polyclonal antibodies (pabs) from immunized animal(s), isolating B cells that produce the specific mab(s), and engineering the mab(s) for human use. This method involves producing a hybridoma by fusing an antibody-producing B-cell with an immortal cell lacking antibody genes. The resulting hybrid cell produces a single antibody that binds a specific target antigen. If the genomic material is available, as in the case of a hybridoma, next generation sequencing (NGS) can be used to determine the nucleotide sequence of the antibody. But there is a risk of B-cell attrition during fusion with immortal myeloma cells[109] and there is strong evidence that hybridomas are genetically unstable [50]. It is almost guaranteed, 90

104 91 then, that the antibodies produced by hybridomas do not fully represent the repertoire of antibodies present in a patients blood. In response to the 2014 Ebola virus outbreak, a recent letter[7] from prominent scientists, among them Nobel Laureates James Watson, David Baltimore, and Jim Simons, to the Department of Health and Human Services, members of Congress, and biotechnology companies encourages the use of passive immunity, which uses mixtures of antibodies derived from the antibodies present in the blood of survivors. A similar approach was used to successfully develop treatments for HIV [111]. These methods take a rather brute-force approach by magnetically sorting memory B-cells into 30,000 or more cultures, separately cloning each population, and screening the pabs produced by each culture for affinity to the antigen. This weeds out B-cells producing undesired antibodies and targets NGS on the desired population, allowing identification of native pabs years after initial infection. Such methods involving B-cell immortalization (or single-cell sorting and molecular cloning) have become increasingly useful, but they are often labor intensive/time consuming and the antibodies they identify do not necessarily represent the actual circulating pab repertoire [98]. Directly sequencing pabs from the blood of disease-resistant humans by tandem mass spectrometry (MS/MS) promises to reduce drug development costs dramatically and enable development of more effective therapies that harness our natural immune response. Such work has already begun, first with the identification of antigen responsive pabs from rabbits[14], and later from humans[87]. These approaches utilize standard protocols to isolate pabs from blood serum and purify out particular pabs that bind to a desired antigen, which may be displayed by cancerous tissue, a harmful virus (such as Ebola), or some other pathogen. 
Antibody amino acid sequences are then identified by MS/MS, but due to limitations of traditional de novo sequencing approaches (briefly discussed in sections 2.1 and 3.1), the nucleotide sequences of these pAbs must first be uncovered by next-generation sequencing of the particular B-cells producing the antibodies of interest. Even when using state-of-the-art NGS protocols, the full repertoire of B-cell transcript material (encoding millions of antibodies with high sequence similarity) cannot yet be sequenced with reasonable resources. The main limitations are that 1) B-cells cannot be purified by antigen affinity, and 2) the majority of B-cells are typically found in the spleen, lymph nodes, and bone marrow, with roughly 2% found in peripheral blood. As a result, these groups must either harvest spleen tissue (as done with rabbits [14]) or obtain B-cells from blood serum at high enough abundance during peak pAb production, 7-10 days after initial disease exposure [87]. Thus, these methods are mainly limited in humans to studying immune response to known vaccines.

In the preliminary work described here, circulating IgG pAbs were taken from human blood serum and affinity purified against a known antigen on the cytomegalovirus (CMV). The resulting mixture of antigen-responsive pAbs was identified by MS/MS de novo sequencing (MS/MS protocols and data analysis described in section 3.2). The significant aspect of this approach is that it does not rely on DNA/RNA sequencing of B-cells, a process which, to the best of our knowledge, has not yet been proven to be viable in humans more than 10 days after initial infection without lab-intensive screening/cloning of B-cells via hybridoma or cell-sorting. We discuss remaining limitations of applying Meta-SPS-related techniques to de novo sequencing of mixtures of highly related proteins and define some of the computational problems involved.

4.1 Preliminary Results

A sample of circulating pAbs was taken from a CMV-infected individual and affinity purified against the gB antigen. The intact mass spectra obtained from MALDI-TOF analysis indicated dominant heavy and light chains at 8-10x the abundance of other IgGs (Figure 4.1).
Multi-enzyme digests coupled with triplet CID/HCD/ETD fragmentation enabled unprecedented de novo sequencing depth of polyclonal IgGs using our Meta-SPS software. Individual de novo sequences extended up to 50 AA and in dozens of cases spanned CDRs and adjoining framework regions, which allowed entire V-D-J framework regions to be assembled by ordering de novo sequences following BLASTp alignment to known IgGs (Figure 4.2). To sequence highly variable regions (e.g. CDR3s) that Meta-SPS could not sequence due to polyclonal diversity, an interactive tool, StarAlign, was developed. This software extracts tags from MS/MS spectra, aligns them to an input framework amino acid sequence (such as YYCAR adjoining an unsequenced CDR3), and de novo sequences the remaining portion of each spectrum that extends into the unsequenced region. CDRs and polymorphisms from variant IgGs could then be sequenced from unidentified MS/MS spectra in a semi-automated manner, resulting in over 6 CDR1s, 90 CDR2s, and 11 CDR3s from the light chain, and 2 CDR1s, 10 CDR2s, and 3 CDR3s from the heavy chain. Each CDR was supported by at least two spectra from overlapping peptides, and CDRs from more abundant IgGs tended to exhibit greater coverage. Three light chain clonotypes were ultimately sequenced, two of which matched the intact masses of the most abundant light chain IgGs. The complete heavy chain sequence also matched the intact mass of the most abundant heavy chain IgG. These sequences are currently being engineered to verify reactivity against the antigen.

Given that Meta-SPS could not automatically sequence all regions of each IgG, future efforts should focus on fully automating the process. In the next section we define the unique computational problem involved and explain why it was impossible for Meta-SPS to fully succeed in sequencing the complete IgGs from this sample. Since StarAlign proved instrumental in the semi-automated sequencing process, we end by describing how it can be extended into a fully automated tool to address the described problem.
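The anchor-and-extend idea behind StarAlign can be illustrated with a minimal sketch. This is not the StarAlign implementation: the real tool works on MS/MS spectra and peak masses, whereas the sketch below assumes each spectrum has already been reduced to a de novo peptide string. The function names, the `min_overlap` parameter, and the example peptides are hypothetical and chosen only for illustration; the YYCAR anchor is the framework motif mentioned above.

```python
from collections import Counter

def extension_past_anchor(anchor, peptide, min_overlap=3):
    """Return the residues of `peptide` that extend past the C-terminal end
    of `anchor`, or None if the peptide does not overlap the anchor.

    Assumption: peptides are plain strings; a real tool would match masses
    (so that I/L and other isobaric residues are treated as equivalent).
    """
    # Case 1: the peptide contains the whole anchor.
    idx = peptide.find(anchor)
    if idx != -1:
        return peptide[idx + len(anchor):]
    # Case 2: the peptide starts inside the anchor, so a suffix of the
    # anchor must equal a prefix of the peptide (longest overlap first).
    for k in range(min(len(anchor), len(peptide)), min_overlap - 1, -1):
        if peptide.startswith(anchor[-k:]):
            return peptide[k:]
    return None

def tally_extensions(anchor, peptides, min_overlap=3):
    """Count how many peptides support each non-empty extension."""
    support = Counter()
    for pep in peptides:
        ext = extension_past_anchor(anchor, pep, min_overlap)
        if ext:
            support[ext] += 1
    return support

# Hypothetical de novo peptides (made up for this example).
peptides = ["CARDGSW", "AVYYCARDGSWFDY", "YCARDGSW", "GTLVTVSS"]
extensions = tally_extensions("YYCAR", peptides)
```

Counting the support for each extension mirrors the requirement stated above that each CDR be supported by at least two spectra from overlapping peptides; a fuller version would also merge extensions that are prefixes of one another.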

Figure 4.1. Intact mass measurement and relative abundances of antigen-responsive pAbs. Following affinity purification of circulating IgG pAbs, the intact masses of the most reactive antibodies are visualized with MALDI-TOF analysis. This depicts roughly 3 light chains (A) and 2 heavy chains (B) with high affinity to the antigen.

Figure 4.2. Preliminary results from sequencing an unknown pAb sample. A) Sequencing pAbs from unidentified MS/MS spectra. At the top is a consensus de novo sequence automatically generated by Meta-SPS that covers the CDR2 and framework regions of the heavy chain of a novel antibody. The consensus was built from 467 unidentified MS/MS spectra (11 of which are reproduced here). The peptide RIFTDGSVNYNPSLKSR (in red) likely covers the CDR2 of a less abundant antibody variant. Although the most abundant sequence (at the top) is automatically reported by Meta-SPS, variant sequences had to be manually extracted from assembled spectra, since Meta-SPS is currently configured to report only one de novo sequence per group of assembled spectra. B) Fully sequenced pAbs. Variable regions of the heavy and light chains were fully sequenced via homology-assisted mapping of consensus de novo sequences. Below each sequence are variants originating from pAbs in the same sample, where each variant is observed in at least two manually sequenced spectra.
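The rule that a variant is reported only when it is observed in at least two spectra can be sketched as a simple support filter. This is a hedged illustration, not part of Meta-SPS: the function name and `min_support` parameter are hypothetical, the consensus string below is made up (the actual consensus appears only in the figure), and only the variant peptide RIFTDGSVNYNPSLKSR is taken from the caption. The per-spectrum counts are invented for the example.

```python
from collections import Counter

def supported_variants(consensus, sequences, min_support=2):
    """Return variants of `consensus` seen in at least `min_support` spectra.

    Maps each variant to (support, substitutions), where substitutions is a
    list of (1-based position, consensus residue, variant residue).
    Simplification: only variants of the same length as the consensus are
    compared, so insertions and deletions are ignored.
    """
    counts = Counter(
        s for s in sequences if s != consensus and len(s) == len(consensus)
    )
    result = {}
    for seq, n in counts.items():
        if n >= min_support:
            subs = [
                (i + 1, a, b)
                for i, (a, b) in enumerate(zip(consensus, seq))
                if a != b
            ]
            result[seq] = (n, subs)
    return result

# Hypothetical consensus and per-spectrum sequences (illustration only).
consensus = "RIYTSGSTNYNPSLKSR"
per_spectrum = (
    [consensus] * 3
    + ["RIFTDGSVNYNPSLKSR"] * 2   # variant peptide from the caption
    + ["RIYTSGSANYNPSLKSR"]       # seen once, so it is filtered out
)
variants = supported_variants(consensus, per_spectrum)
```

Requiring a minimum support trades sensitivity for confidence: a variant seen in a single spectrum may be a sequencing error, while low-abundance antibodies may fall below the threshold.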

Supplemental Materials

MSGFDB Parameters

For all MSGFDB searches, the following parameters were used: 30 ppm precursor mass tolerance, enable target-decoy search, enzyme specific scoring (for Arg-C, Asp-N,