Accurate Prediction for Atomic-Level Protein Design and its Application in Diversifying the Near-Optimal Sequence Space


Research Article
Proteins: Structure, Function and Bioinformatics
DOI 1.12/prot.2228

Menachem Fromer 1, Chen Yanover 2

1 School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
2 Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America

Running Title: Exploring Near-Optimal Sequence Space

Keywords: Protein Design, Protein Energetics, Structural Sequence Space, Probabilistic Graphical Models, Belief Propagation, Combinatorial Optimization, Approximate Inference, Maximum-a-posteriori Estimation

Institutions at which the Work was Performed: The Hebrew University of Jerusalem, Fred Hutchinson Cancer Research Center

Contact Information for the Corresponding Author: Menachem Fromer, School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem 9194, Israel. E-mail: fromer@cs.huji.ac.il

28 Wiley-Liss, Inc. Received 22-May-28; Revised 28-Aug-28; Accepted 12-Sep-28

Abstract

The task of engineering a protein to assume a target three-dimensional structure is known as protein design. Computational search algorithms are devised to predict a minimal energy amino acid sequence for a particular structure. In practice, however, an ensemble of low energy sequences is often sought. The primary reason is that an individual predicted low energy sequence may not fold to the target structure, owing both to inaccuracies in modeling protein energetics and to the non-optimal nature of the search algorithms employed. Additionally, some low energy sequences may be overly stable and thus lack the dynamic flexibility required for biological functionality. Furthermore, the investigation of low energy sequence ensembles will provide crucial insights into the pseudo-physical energy force fields that have been derived to describe structural energetics for protein design. Significantly, numerous studies have predicted low energy sequences, which were subsequently synthesized and demonstrated to fold to desired structures. However, the sequence space that such energy functions define as compatible with a target structure has not been characterized in full detail. This issue is critical if protein design scientists are to continue using these force fields successfully at an ever-increasing pace and scale. In this paper, we present a conceptually novel algorithm that rapidly predicts the set of lowest energy sequences for a given structure. Based on the theory of probabilistic graphical models, it performs efficient inspection and partitioning of the near-optimal sequence space, without making any assumptions of positional independence. We benchmark its performance on a diverse set of relevant protein design examples and show that it consistently yields sequences of lower energy than those derived from state-of-the-art techniques.
Thus, we find that previously presented search techniques depict the low energy space less precisely. Examination of the predicted ensembles indicates that, for each structure, the amino acid identity at a majority of positions must be chosen extremely selectively so as not to incur significant energetic penalties. We investigate this high degree of similarity and demonstrate how more diverse near-optimal sequences can be predicted in order to systematically overcome this bottleneck for computational design. Finally, we exploit this in-depth analysis of a collection of the lowest energy sequences to suggest an explanation for previously observed experimental design results. The novel methodologies introduced here accurately portray

the sequence space compatible with a protein structure and further supply a scheme to yield heterogeneous low energy sequences, thus providing a powerful instrument for future work on protein design.

Introduction

The objective of constructing a protein to perform a specific biological function is termed functional protein design [1, 2]. Potential applications of design include modifications of existing proteins to affect such characteristics as stability or binding affinity [3]. A more ambitious goal is to design protein sequences that will assume novel structures [4] or acquire new functionalities. Such functionalities may be therapeutic (e.g. [5], where an HIV inhibitor was designed) or industrial (e.g. [6], which discusses the design of new biocatalysts). Additionally, the outcomes of such design experiments will critically assess our comprehension of protein architecture and stability (e.g. [7]).

A commonly used paradigm for computational protein design casts the functional design problem as a structural one and, typically, assumes a fixed protein backbone [8]. In addition, the amino acid side chain conformations are not permitted to move continuously in space; rather, the allowed conformations are discretely clustered around a library of distinct, energetically favorable empirical observations ("rotamers") [9]. Lastly, pairwise atomic energy functions are used to assign pseudo-physical energetic values to pairs of atoms [1].

Protein Design Formulation

The input to the protein design problem consists of a three-dimensional protein backbone structure, the $N$ sequence positions to be designed, a group of amino acids (and their respective rotamers) permitted at each position, and an atomic energy function. Formally, we denote by $\mathrm{Rots}_i$ the set of all possible rotamers at position $i$ (for all amino acid types); let $r = (r_1, \ldots, r_N)$ denote an assignment of rotamers for all $N$ positions.
For a given pairwise atomic energy function, the energy of assignment $r$, $E(r)$, is the sum of the interaction energies between:
1. rotamer $r_i \in \mathrm{Rots}_i$ and the fixed structural template (backbone and stationary residues) [$E_i(r_i)$]
2. rotamers $r_i \in \mathrm{Rots}_i$ and $r_j \in \mathrm{Rots}_j$ for neighboring (interacting) positions $i, j$ [$E_{ij}(r_i, r_j)$]
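The sum just described (stated formally in Eq. 1) can be sketched in a few lines of Python. This is only an illustration under assumed data structures: the function name `assignment_energy`, the list-of-lists layout for singleton energies, the dictionary of pairwise matrices, and all numeric values are hypothetical and are not taken from the paper.

```python
def assignment_energy(r, singleton, pairwise):
    """Energy of a rotamer assignment r as a sum of singleton and pairwise terms.

    r         -- tuple of rotamer indices, one per design position
    singleton -- singleton[i][ri] = E_i(r_i): rotamer vs. fixed template
    pairwise  -- dict {(i, j): matrix} with matrix[ri][rj] = E_ij(r_i, r_j);
                 only interacting pairs are present, each stored once
    """
    total = sum(singleton[i][ri] for i, ri in enumerate(r))
    for (i, j), matrix in pairwise.items():
        total += matrix[r[i]][r[j]]
    return total


# Hypothetical three-position problem; every number is made up for illustration.
singleton = [[1.0, 0.5], [0.0, 2.0], [0.3, 0.1]]
pairwise = {
    (0, 1): [[-1.0, 0.0], [0.5, -2.0]],  # positions 0 and 1 interact
    (1, 2): [[0.0, 1.0], [-0.5, 0.0]],   # positions 1 and 2 interact
}
energy = assignment_energy((1, 1, 0), singleton, pairwise)
```

Positions 0 and 2 have no entry in `pairwise`, which mirrors the fact that only neighboring (interacting) positions contribute a pairwise term.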

$$E(r) = \sum_i E_i(r_i) + \sum_{i,j} E_{ij}(r_i, r_j) \tag{1}$$

Let us denote by $T(k)$ the amino acid type of rotamer $k$ and let $T(r_1, \ldots, r_N) = (T(r_1), \ldots, T(r_N))$. Let $S = (S_1, \ldots, S_N)$ denote an assignment of amino acids for all positions. Also, we define $\mathrm{Rots}_i^a$ as the set of rotamers in $\mathrm{Rots}_i$ of amino acid type $a$. Computational protein design attempts to find the sequence $S^*$ of minimal energy. Mathematically, $S^* = \arg\min_S E(S)$, where:

$$E(S) = \min_{r \,:\, T(r) = S} E(r) \tag{2}$$

is the minimal rotamer assignment energy for sequence $S$. The double minimization problem (over the sequence space and over the per-sequence rotamer space) is combined as:

$$S^* = T\left(\arg\min_r E(r)\right) \tag{3}$$

Goal: Find Multiple Sequences

Within the framework defined thus far, the goal of protein design is to predict an amino acid sequence of minimal energy for the target structure. However, there are various reasons why it is desirable to predict a set of low energy sequences ("top" sequences) in addition to the single lowest sequence [11-13].

Inaccuracies in modeling and searching: Inaccuracies in quantifying protein energetics [1], the assumption of a fixed backbone and discretized side chains, and the absence of accounting for competing target structures [7, 14] imply that the sequence-energy landscape may not be modeled with sufficient rigor. Thus, the sequence with the lowest predicted energy may not in fact fold to the target structure. Also, even if the energies of all possible sequences are modeled exactly, non-optimality of the search algorithm used may prevent the lowest energy sequence from being found. Preliminary work noted in [4] provides an example of both of these possibilities, where using slightly modified energy functions or search protocols yielded sequences that were energetically stable and yet did not assume their target structures (instead possessing molten globular character).
Selection among low energy sequences: Even if a predicted low energy sequence will experimentally obtain low physical energies, it may not satisfy other constraints relevant for ultimately defining a sequence

as desirable. Particular consideration of the foremost such constraint, that of biological functionality, arises from the fact that some low energy sequences may be overly stable at standard temperatures and thus lack the kinetic flexibility required to function [6]. Therefore, the biological feasibility of the predicted sequence-structure models can be used to select, from a generated ensemble, the most promising candidate sequences with which to proceed to more expensive and time-consuming experimental validation techniques, such as structural determination or functional assays. This feasibility can be based on techniques (somewhat) orthogonal to ranking using atomic energy functions; such techniques include manual visual inspection or automatic validation, e.g. MolProbity [15]. Another approach for the re-ranking of top-scoring sequence results may include the use of more physically realistic (but computationally heavier) conformational search algorithms that account for molecular flexibility: either of the amino acid side chains (e.g. [16, 17]) or of the protein backbone (e.g. [18, 19]); see below for a more rigorous review of these flexible-molecule search methods. In a similar vein, molecular dynamics simulations can be used to refine the protein structures for each of the predicted low energy sequences (e.g. [2]).

There also exist other pertinent requirements that cannot be imposed on the sequence search problem from the outset. For instance, a requirement for few mutations from the wild-type sequence cannot necessarily be well-defined without having first observed the trade-off between mutation number and energy within the computational sequence space, since we do not want to arbitrarily limit the minimal energy obtainable. Thus, consideration of a collection of low energy sequences is often invaluable. Nonetheless, we do note that certain sequence conditions can be applied directly during the search stage.
For example, the work in [21] demonstrates, with relative success, the capability of enforcing fixed amino acid composition during the search (useful for keeping reference state energies relatively constant).

Low energy sequence profiles: In effect, a set of low energy sequences for the target structure characterizes sequences well-suited to fold to the structure. This information can be summarized in a sequence profile (position-specific scoring matrix, PSSM) that tabulates the positional amino acid probabilities for sequences predicted to fold to the structure. The profile can then be utilized, for example, to build experimental protein design libraries [1, 22], used to biologically screen large numbers of relevant sequences. Alternatively, such profiles can be systematically compared to evolutionary sequence data [11-13]. Although generating such profiles is not necessarily the ultimate goal of this work (in contrast to our previous work [23]), it is a

beneficial byproduct of the investigation of the near-optimal sequence space.

The issues outlined above clearly justify the need for an efficient algorithm to directly determine an ensemble of low energy sequences. Additionally, we are interested in exploring the near-optimal sequence space induced by a widely-used energy function (in this case, the Rosetta function [24]), in an attempt to comprehend the predictive consequences of using such energy functions for protein design. In this paper, we describe a novel algorithm to quickly predict the set of low energy sequences for a target structure. Using probabilistic graphical modeling, we efficiently determine minimal energy sequences when the underlying search space includes numerous rotamers for each amino acid type, without considering any sequence more than once (see Figure 1). Using a diverse dataset of protein design problems, we benchmark its efficacy in yielding lower energy sequences, as compared to previous methods. We observe that for cases when the complete set of lowest energy sequences can be exhaustively enumerated, the algorithm empirically obtains this set. Moreover, we find the set of near-optimal sequences to have an extremely high degree of sequence and biochemical similarity and provide a practical technique to increase sequence variation.

Previous Work

The methods previously applied to predict a set of $M$ low energy sequences (or for similar tasks) can be categorized as provably exact approaches, statistical methods, and sampling techniques; we now provide a detailed review of these methods. For the reader interested in proceeding directly to our novel approach (subsequent sections), it suffices to say that the main computational challenge is outlined in Figure 1.

Exact Methods

Before detailing the exact methods, we highlight one of their deficiencies.
Even when adapted to obtain successive low energy rotamer assignments, they have not been constructed to provide successive low energy sequences, that is, to skip over low energy rotamer assignments of sequences previously observed. Thus, a naive application of these algorithms must, in some sense, iterate over successive rotamer assignments but output only newly observed sequence assignments. However, this is quite computationally costly since each sequence typically has an exponential number of corresponding rotamer assignments, many of which with

[Panel A: matrix of pairwise rotamer energies between Position #1 (amino acids G1, G2; rotamers g11, g12, g21, g22) and Position #2 (amino acids H1, H2; rotamers h11, h12, h21, h22).]

Panel B:

r             E(r)    T(r)
(g11, h11)    -15     (G1, H1)
(g11, h12)    -14     (G1, H1)
(g12, h22)    -13     (G1, H2)
(g11, h22)    -12     (G1, H2)
(g12, h11)    -11     (G1, H1)
(g12, h12)    -10     (G1, H1)
(g12, h21)     -9     (G1, H2)
(g11, h21)     -8     (G1, H2)
(g21, h12)     -7     (G2, H1)
(g21, h11)     -6     (G2, H1)
(g22, h21)     -5     (G2, H2)
(g21, h22)     -4     (G2, H2)
(g22, h11)     -3     (G2, H1)
(g22, h12)     -2     (G2, H1)
(g22, h22)     -1     (G2, H2)
(g21, h21)      0     (G2, H2)

Figure 1: Toy Protein Design Problem. Toy example to demonstrate the need for an algorithm that yields distinct low energy sequences within the rotamer space. (A) Pairwise rotamer energies. The minimal rotameric energy for each sequence appears in boldface. (B) All rotamer assignments in order of increasing energy. Naively, the 11 lowest energy rotamer assignments have to be examined to yield all 4 sequences. Lower case letters (e.g. $g_{11}$) stand for rotamer configurations, upper case letters (e.g. $G_1$) for amino acid types (that is, $T(g_{11}) = G_1$). For simplicity, only pairwise energies are considered.
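The point of Figure 1 can be reproduced with a short Python sketch that walks the rotamer assignments in order of increasing energy and counts how many must be examined before all four sequences have appeared. The energies below are an assumption: the magnitudes follow panel B as extracted, the signs are taken to be negative so that the listing is in increasing order as the caption states, and the two entries whose digits did not survive extraction are taken to be -10 and 0. The helper names (`aa_type`, `sequence_of`) are hypothetical.

```python
# Rotamer assignments of the toy problem: (rotamer at pos. 1, rotamer at pos. 2) -> E(r).
# Assumed values (see the note above); they are illustrative, not authoritative.
energies = {
    ("g11", "h11"): -15, ("g11", "h12"): -14, ("g12", "h22"): -13,
    ("g11", "h22"): -12, ("g12", "h11"): -11, ("g12", "h12"): -10,
    ("g12", "h21"): -9,  ("g11", "h21"): -8,  ("g21", "h12"): -7,
    ("g21", "h11"): -6,  ("g22", "h21"): -5,  ("g21", "h22"): -4,
    ("g22", "h11"): -3,  ("g22", "h12"): -2,  ("g22", "h22"): -1,
    ("g21", "h21"): 0,
}


def aa_type(rotamer):
    """T(k): amino acid type of a rotamer, e.g. 'g12' -> 'G1', 'h21' -> 'H2'."""
    return rotamer[0].upper() + rotamer[1]


def sequence_of(assignment):
    """T(r): amino acid sequence of a rotamer assignment."""
    return tuple(aa_type(k) for k in assignment)


# Naive strategy: enumerate assignments by increasing energy, report new sequences only.
ranked = sorted(energies, key=energies.get)
seen = []
examined = 0
for assignment in ranked:
    examined += 1
    seq = sequence_of(assignment)
    if seq not in seen:
        seen.append(seq)
    if len(seen) == 4:
        break

# Per-sequence energy E(S): minimum over the rotamer assignments of that sequence.
sequence_energy = {}
for assignment, e in energies.items():
    seq = sequence_of(assignment)
    sequence_energy[seq] = min(e, sequence_energy.get(seq, e))
```

Under these assumed values, `examined` ends at 11 even though only 4 distinct sequences exist, which is the redundancy that an algorithm operating directly on sequences is meant to avoid.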

potentially lower energy than the next lowest energy sequence, thus yielding repeated sequence assignments (Figure 1).

DEE, A*: Dead-end elimination (DEE) is an iterative positional rotamer elimination technique that is guaranteed to remove only rotamers that do not participate in the lowest energy rotamer assignment (and hence lowest energy sequence assignment) [25, 26]. This method was applied in a number of ground-breaking studies (e.g. [3, 22, 27]). However, large protein design problems require increasingly stringent DEE criteria (with large increases in computation time) to yield a small enough search space to be considered exhaustively and uniquely determine the minimal energy rotamer assignment [28-30]. Furthermore, DEE must be adapted in order to obtain successive low energy rotamer assignments. One such adaptation is the generalized DEE/A* method [31]. First, a relaxed DEE criterion is applied such that, for a given energetic threshold $\varepsilon$, only rotamers that do not participate in a rotamer assignment with energy within $\varepsilon$ of the minimal rotamer assignment will be eliminated. Note that, for relevant $\varepsilon > 0$, the resulting rotamer space cannot contain only a single rotamer assignment. In fact, this space is often quite large, so the A* artificial intelligence technique is used to systematically enumerate the rotamer assignments in guaranteed order of increasing energy. A critical drawback of this method is that it often requires extremely long computation times and extensive amounts of computer memory, so that in larger problems it may not be at all feasible [31]. Another problem encountered by the DEE/A* approach is that the initial choice of $\varepsilon$, necessary to maintain the required number of sequences and yet prune as much of the rotamer space as possible, is not trivial.
If the threshold chosen is too low, then some rotamer assignments, required to obtain the $M$ lowest energy sequences, will have been eliminated; on the other hand, larger thresholds will be unable to trim the rotamer space as successfully. A more recently devised method for providing successive low energy assignments is that of X-DEE [32], which successively applies DEE to disjoint sub-spaces until a specified number of lowest energy rotamer assignments are found. However, X-DEE requires very many runs of DEE on large sub-spaces, and for cases where the search space is very large, as for protein design (unlike the case of two-state variables tested in [32]), this could well be an insurmountable computational hurdle for this method.

Flexible-molecule DEE: Recently, there has been a growing trend toward algorithms that allow for flexibility of the amino acid side chains [16] and the protein backbone [4, 13, 19, 33-35] during the structural modeling

involved in the protein design process. The relaxation of the requirements for a fixed backbone and fixed rotamers for protein design has been shown to produce more natural side chain and amino acid variability [13, 16, 19, 35]. Among these algorithms, however, the procedures developed by Donald and colleagues are exceptional, in that they provably provide low energy sequences for protein design, while incorporating flexible side chains [17, 36], global backbone flexibility [18], and local backbone flexibility (e.g. "backrub") [37]. Briefly, these exact DEE-based procedures calculate lower and upper bounds on the energetic terms in Eq. 1 and then use these bounds to run a generalized DEE algorithm and prune the conformational space. The remaining space is subsequently searched using a slightly modified A* algorithm to obtain the lowest energy sequence(s). We note that for any given protein design problem, providing for conformational flexibility (side chain or backbone) in an exact method has the advantage that it will typically find a larger number of lower energy sequences, e.g. below a certain acceptable energetic threshold. Specifically, since non-flexibility is always an option, the structurally flexible minimal energy sequence will have a lower energy than that found without flexibility. On the other hand, the use of such bounded intervals for the energy terms (instead of specific energy values) results in less efficient pruning of the conformational space and thus longer running times for the A* search [18]. Therefore, convergence of these algorithms within a reasonable time frame is not at all guaranteed for larger protein design cases (though it has been calculated that using an exhaustive search to produce comparable results would require two orders of magnitude longer run-times [37]). Additionally, the problems encountered when running standard DEE, e.g.
the choice of $\varepsilon$ for ensuring that the $M$ lowest energy sequences can be found, are also relevant here.

LP: An additional method guaranteed to obtain successive low energy rotamer assignments is the LP/ILP approach of [38], where the search for the minimal energy rotamer assignment is structured as an integer linear programming (ILP) problem. Under certain conditions, the linear programming (LP) relaxation can be efficiently run to yield the solution; otherwise, a computationally more intensive ILP solver is run. However, in practice, even when using a simple energy function (van der Waals interactions and a statistical rotamer self-energy), the ILP solver was required for most design problems tested, resulting in large run-times to predict even the single lowest energy rotamer assignment. Generating $M$ sequences would entail running the

method (at least) $M$ times, making a typical protein design ensemble search essentially infeasible. Finally, we do note that the X-DEE and LP/ILP methods could theoretically be generalized to obtain successive low energy sequences, by partitioning the sub-spaces by amino acid type instead of rotamer at each position (as in the tbmmf algorithm presented below). However, the methods would still suffer from the other drawbacks outlined above. In any case, such generalizations would have to be formulated carefully so as to prevent an exponential number of X-DEE search bases (ILP inequality constraints), deriving from the fact that different rotamers of the same amino acid may be chosen for different sequences.

Statistical Methods

Mean Field and its generalizations: Self-consistent mean field theory [39] provides a framework for the prediction of positional amino acid probabilities for the design of a target protein structure, by making certain simplifying statistical independence assumptions. The SCADS (Statistical Computationally Assisted Design Strategy) method [4] generalizes mean field theory and applies a statistical entropy approach to atomic level protein design in order to predict these probabilities, in a way that is robust to minor backbone changes. The formulation of these methods is intended to produce site-specific amino acid probabilities, which can be used to build a PSSM as described above. Nonetheless, they have been adapted to yield low energy sequences, often by independently choosing the most probable amino acids at each position. However, such strategies have met with limited success in finding even the single lowest energy sequence [41]. Furthermore, standard mean field theory was also found to be sub-optimal when the probabilities it predicted were compared with simulated, evolutionary, and experimental probability data [23].
Finally, since the goal herein is to predict an ensemble of low energy sequences, without having to assume any specific independence between positions (beyond that already defined by the pairwise energy function), we do not investigate these methods.

Sampling Methods

Monte Carlo Simulated Annealing: Simulated annealing (SA) [22, 24] attempts to solve the global minimization problem by starting from an initial sequence assignment (possibly random). At each step, the

sequence is slightly modified using a random mutation rule. This new sequence is retained if its energy is lower than that of the previous sequence; otherwise, it is randomly accepted with a probability that is a function of its energy and the current temperature of the SA system. Typically, the temperature starts at a high value (permitting all transitions) and the system is slowly cooled until convergence. A major disadvantage of SA is that convergence to the minimum is not guaranteed within a finite number of steps or within the framework of any given cooling schedule. In [22], SA was applied to the design of the active site of the β-lactamase structure and the M =1, lowest energy sequences observed during the SA run were output.

Probabilistic Graphical Modeling of Protein Design

The principal computational tool we utilize in this paper is the representation of the protein design energy optimization problem as a probabilistic graphical model (graphical model, for short) and subsequent application of the loopy belief propagation algorithm. We briefly summarize the theory of graphical models and belief propagation in the context of protein design using pairwise energy functions (for in-depth reviews, see [42, 43]). The formulation used here for predicting the lowest energy rotamer assignment for protein design is a generalization of that used in [44] to determine the minimal energy rotameric state for protein side chain prediction. It is also conceptually similar to that used in [45] to find the single lowest energy sequence for protein design and the approach in [46] applied to calculate the free energies of particular sequences. We have also previously used a similar formulation to predict positional amino acid probabilities over the entire sequence space for protein design problems [23].
Graphical modeling: A graphical model is a compact representation of a probability distribution that is well-suited to describe conditional independencies between variables. For protein design, we define a random variable for each design position, whose values represent the rotameric choices (including amino acid) at that position. We then build a graph, wherein each node corresponds to a variable and the node's values correspond to the variable's values (Figure 2A,B). An assignment for all variables is equivalent to rotamer choices for all design positions; we thus use the terms design position, variable, and node interchangeably.


[Figure 2, panels A-C: (A) structure of the SspB dimer interface with design positions A11, A12, A15, A16, C11, C12, C15, C16; (B) the corresponding graphical model, with a rotamer-rotamer energy matrix on each edge; (C) messages $m_{A16 \to C11}(r_{C11})$, $m_{C11 \to A11}(r_{A11})$, $m_{A11 \to C12}(r_{C12})$, $m_{C12 \to A16}(r_{A16})$ passed around the cycle.]

Figure 2: Probabilistic Graphical Modeling of Protein Design. (A) SspB dimer interface, with $C_\alpha$ coordinates of design positions (12, 15, 16, 11 on monomers A, C) marked as spheres. Interactions between positions, as determined by the Rosetta energy function, are marked as edges; for simplicity of exposition, we ignore all intra-monomeric edges. The resulting graphical model is shown (B), where each edge contains the pre-calculated rotamer-rotamer energy matrix, demonstrated for A11 and C11. (C) The messages passed in one direction by loopy belief propagation on the cycle: A16, C11, A11, C12. Each position calculates the outgoing message (solid arrow) by combining (Eq. 9) the interaction energies and the current incoming messages from all other nodes (dashed arrows). Positions that do not interact (e.g. A16 and A11) are nonetheless mutually dependent through common interactions (C12 and C11).

The pre-calculated rotamer energies taken as input to the problem (see Introduction) are utilized to define probabilistic potential functions, or probabilistic factors, in the following manner. The singleton energies specify probabilistic factors describing the self-interactions of the positions in their permitted rotamer states:

$$\psi_i(r_i) = e^{-E_i(r_i)/T} \tag{4}$$

And, the pairwise energies define probabilistic factors describing the direct pairwise interactions for pairs of positions and their possible rotamers:

$$\psi_{ij}(r_i, r_j) = e^{-E_{ij}(r_i, r_j)/T} \tag{5}$$

where $T$ is the system temperature (taken to be the equivalent of 37 °C). For a pair of variables $i, j$, the matrix of pairwise probabilistic factors ($\psi_{ij}$) corresponds to an edge between them in the graph (Figure 2B).
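As a small numerical illustration of Eqs. 4 and 5, the Python sketch below converts energies into factors and checks that multiplying the factors of an assignment gives $e^{-E(r)/T}$, so that the lowest energy assignment is also the one with the largest factor product. The temperature value, the two-position energies, and all names are hypothetical placeholders chosen for the example.

```python
import math

TEMPERATURE = 0.6  # hypothetical value of T, in the same units as the energies


def factor(energy, temperature=TEMPERATURE):
    """Probabilistic factor exp(-E / T), the form used in Eqs. 4 and 5."""
    return math.exp(-energy / temperature)


# Hypothetical two-position problem with two rotamers per position.
E1 = [0.4, -0.2]                   # singleton energies E_1(r_1)
E2 = [0.0, 0.3]                    # singleton energies E_2(r_2)
E12 = [[-1.0, 0.5], [0.2, -0.6]]   # pairwise energies E_12(r_1, r_2)

psi1 = [factor(e) for e in E1]
psi2 = [factor(e) for e in E2]
psi12 = [[factor(e) for e in row] for row in E12]


def total_energy(r1, r2):
    """Sum of singleton and pairwise energies of the assignment (r1, r2)."""
    return E1[r1] + E2[r2] + E12[r1][r2]


def unnormalized_probability(r1, r2):
    """Product of the singleton and pairwise factors (no division by Z)."""
    return psi1[r1] * psi2[r2] * psi12[r1][r2]


states = [(a, b) for a in range(2) for b in range(2)]
lowest_energy = min(states, key=lambda s: total_energy(*s))
most_probable = max(states, key=lambda s: unnormalized_probability(*s))
```

Because the exponential is monotonic, ranking assignments by factor product is the same as ranking them by energy in reverse; the normalization constant does not affect the ranking.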
Since energy functions typically used for design essentially ignore

interactions occurring between atoms more distant than a certain threshold, this implies that the design graph will often have a large number of missing edges (positions too distant to directly interact). Thus, the locality of spatial interactions in the protein structure induces path separation in the graph and conditional independence in the probability distribution of the variables. Mathematically, the probability distribution for the rotamer assignment $(r_1, \ldots, r_N)$ decomposes into a product of the singleton and pair probabilistic factors:

$$\Pr(r_1, \ldots, r_N) = \frac{1}{Z} \prod_i \psi_i(r_i) \prod_{i,j} \psi_{ij}(r_i, r_j) \tag{6}$$

$$= \frac{1}{Z} e^{-E(r)/T} \tag{7}$$

where $Z$ is the probability normalization factor (partition function), and Eq. 7 derives from substitution of Eqs. 4, 5 into Eq. 6 and the pairwise energy decomposition of Eq. 1. Thus, minimization of rotamer energy (Eq. 3) is equivalent to maximization of rotamer probability (a probabilistic inference task):

$$S^* = T\left(\arg\max_r \Pr(r)\right) \tag{8}$$

All the same, the size of the rotamer space is exponential in protein length, making protein design computationally difficult even for small proteins (see [47, 48] for a rigorous handling of the subject). Thus, exhaustive calculation of an exact maximum for Eq. 8 is just as computationally infeasible as minimization of Eq. 3. Nevertheless, having formulated the protein design problem as an inference problem on a graphical model avails us of a wide array of effective approximate inference techniques.

Belief propagation: Max-product belief propagation (BP) [42] is a message passing algorithm that efficiently utilizes the inherent locality in the graphical model representation. Messages are passed between neighboring (interacting) variables, where the message vector describes one variable's belief about its neighbor, that is, the relative likelihood of each allowed state for the neighbor.
A message vector to be passed from one position to its neighbor is calculated using their pairwise interaction probabilistic factor and the current input of other messages regarding the likelihood of the rotamer states for the position (Figure 2C). Formally, at a given iteration, the message passed from variable $i$ to variable $j$ regarding $j$'s rotameric

state ($r_j$) is:

$$m_{i \to j}(r_j) = \max_{r_i} \; e^{-\frac{E_i(r_i) + E_{ij}(r_i, r_j)}{T}} \prod_{k \in N(i) \setminus j} m_{k \to i}(r_i) \tag{9}$$

where $N(i)$ is the set of nodes neighboring variable $i$. Note that $m_{i \to j}$ is, in essence, a message vector of relative probabilities for all possible rotamers $r_j$, as determined at a specific iteration of the algorithm. In detail, messages are typically initialized uniformly. Next, messages are calculated using Eq. 9. Now, for each position for which the input message vectors have changed, its output messages are recalculated (Eq. 9) and passed on to its neighbors. This procedure continues iteratively until all messages converge numerically or a predetermined number of messages has been passed. Finally, max-marginal (MM) belief vectors are calculated as the product of all incoming message vectors and the singleton probabilistic factor:

$$MM_i(r_i) = e^{-E_i(r_i)/T} \prod_{k \in N(i)} m_{k \to i}(r_i) \tag{10}$$

where $MM_i(r_i)$ is the max-marginal belief of a particular rotamer $r_i \in \mathrm{Rots}_i$ at position $i$. In this paper, we apply max-product loopy belief propagation (BP) to find the maximal probability rotamer (and sequence) assignments (Eq. 8). Specifically, we use the max-marginal (MM) beliefs obtained by BP (Eq. 10) as approximations of the exact max-marginal probability values:

$$\Pr_i(r_i) = \max_{r' \,:\, r'_i = r_i} \Pr(r') \tag{11}$$

for which it can be shown (Lemma A.1) that assignment of:

$$r_i^* = \arg\max_{r_i \in \mathrm{Rots}_i} \Pr_i(r_i) \tag{12}$$

yields the most probable rotamer assignment $r^*$. The belief propagation algorithm was originally formulated for the case where the graphical model is a tree graph (i.e. no loops exist). However, since typical protein design problems will have numerous cycles (Figure 2B), we obtain (possibly) inexact MM and the sequence results are not guaranteed to be optimal. Nonetheless, loopy BP has been shown to be empirically successful in converging to optimal

solutions when run on non-tree graphs (e.g. [44]). Furthermore, loopy BP has conceptual advantages over related (statistical) inference techniques, since it does not assume independence between design positions and yet largely prevents self-reinforcing feedback cycles that may lead to illogical or trivial fixed points. By contrast, self-consistent mean field, for example, is forced to make certain positional independence assumptions [39, 41].

tbmmf: Prediction of the M Minimal Energy Sequences

The novel algorithm described herein exploits the formulation for protein design described in the previous section and generalizes the BMMF (Best Max-Marginal First) algorithm of [49]. Conceptually, it partitions the search space while systematically excluding all previously determined minimal energy sequence assignments (Figure 4). In cases where loopy belief propagation (BP) yields exact max-marginal (MM) probabilities, the algorithm is guaranteed to find the top M sequences for the protein design problem (Theorem A.5). We designate this algorithm tbmmf (type-specific Best Max-Marginal First).

We define amino acid (aa) type constraints such that a position $i$ can either be unconstrained ($r_i$ is allowed to assume all rotamers in $\mathrm{Rots}_i$) or constrained to rotamers of specific amino acids. Thus, for a given aa type $a$, a constraint can be positive ($r_i$ must be a rotamer of aa $a$: $r_i \in \mathrm{Rots}_i^a$) or negative ($r_i$ must be a rotamer of an aa other than $a$: $r_i \notin \mathrm{Rots}_i^a$). For a set of constraints $C$, we denote by $MM_p(r_p)|_C$ the max-marginal belief of rotamer $r_p$ at position $p$ obtained when enforcing the constraints in $C$. In practice, to constrain a position to a specific subset of rotamers, we zero out its singleton probabilistic factor for all other rotamers. Pseudocode for the novel tbmmf algorithm is presented in Figure 3 and demonstrated in Figure 4.
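The message update of Eq. 9, the max-marginal beliefs, the decoding rule of Eq. 12 and the constraint mechanism just described (zeroing the singleton factor of excluded rotamers) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `max_product_bp`, its arguments, the synchronous update schedule, the per-message rescaling and the toy energies are all assumptions made for illustration. On a tree-shaped toy graph, as used here, the resulting beliefs are exact max-marginals up to a scale factor; on graphs with cycles they would only be approximations.

```python
import math

def max_product_bp(E_single, E_pair, T=1.0, allowed=None, max_iters=100, tol=1e-12):
    """Max-product BP over rotamer energies; returns rescaled max-marginal beliefs.

    E_single[i][r]         : singleton energy of rotamer r at position i
    E_pair[(i, j)][ri][rj] : pairwise energy for neighbouring positions i, j
    allowed                : optional {position: set of permitted rotamer indices};
                             each position must keep at least one rotamer
    """
    n = len(E_single)
    # Singleton factors exp(-E_i/T); rotamers excluded by a constraint are zeroed.
    phi = [[math.exp(-e / T) for e in E_i] for E_i in E_single]
    for i, keep in (allowed or {}).items():
        phi[i] = [f if r in keep else 0.0 for r, f in enumerate(phi[i])]
    # Pairwise factors exp(-E_ij/T), stored for both directions of each edge.
    psi = {}
    for (i, j), E_ij in E_pair.items():
        psi[(i, j)] = [[math.exp(-e / T) for e in row] for row in E_ij]
        psi[(j, i)] = [list(col) for col in zip(*psi[(i, j)])]
    nbrs = {i: [b for (a, b) in psi if a == i] for i in range(n)}
    msgs = {edge: [1.0] * len(phi[edge[1]]) for edge in psi}  # uniform start

    def belief(i, skip=None):
        # Singleton factor times all incoming messages, optionally excluding `skip`.
        b = list(phi[i])
        for k in nbrs[i]:
            if k != skip:
                b = [x * m for x, m in zip(b, msgs[(k, i)])]
        return b

    for _ in range(max_iters):
        new = {}
        for (i, j) in msgs:
            b = belief(i, skip=j)
            # Eq. 9: maximise over the sender's rotamers for each receiver rotamer.
            m = [max(b[ri] * psi[(i, j)][ri][rj] for ri in range(len(b)))
                 for rj in range(len(phi[j]))]
            top = max(m)
            new[(i, j)] = [x / top for x in m]  # rescale for numeric stability
        delta = max(abs(x - y) for e in msgs for x, y in zip(new[e], msgs[e]))
        msgs = new
        if delta < tol:
            break
    beliefs = [belief(i) for i in range(n)]
    return [[x / max(b) for x in b] for b in beliefs]


# Toy chain 0 - 1 - 2 with made-up energies (illustration only).
E_single = [[0.0, 1.0], [0.5, 0.0, 2.0], [0.0, 0.3]]
E_pair = {(0, 1): [[0.0, 2.0, 1.0], [1.0, -1.0, 0.0]],
          (1, 2): [[0.2, 0.0], [1.5, -0.5], [0.0, 0.0]]}

mm = max_product_bp(E_single, E_pair)
best = [max(range(len(b)), key=b.__getitem__) for b in mm]      # Eq. 12
# Constrain position 1 to rotamers 0 and 2, e.g. to exclude one amino acid type.
mm_c = max_product_bp(E_single, E_pair, allowed={1: {0, 2}})
best_c = [max(range(len(b)), key=b.__getitem__) for b in mm_c]
```

Rescaling each message by its largest entry changes the beliefs only by a constant factor, so the arg max of Eq. 12 is unaffected. It does mean that beliefs from differently constrained runs are not directly comparable unless the scale is tracked separately.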
Intuitively, at iteration m, the next lowest energy sequence must differ from all previous low energy sequences in at least one position. Consequently, we examine the constrained sub-spaces from which these sequences were derived. For each such space, we calculate the highest relative positional MM probability ("best MM", BMM), while excluding amino acids from the corresponding low energy sequence; this is performed to determine the next lowest energy sequence within each space, while excluding previously determined sequences. We then consider the constrained sub-space ($t_m$) for which the maximal BMM ($BMM_{t_m}$) was

for m ← 1 to M do
    if m = 1 then
        Cons_m ← ∅
    else
        /* t_m, p_{t_m}, q_{t_m} are the sub-space, position, rotamer to yield the next lowest energy sequence */
        t_m ← arg max_{m' < m} BMM_{m'}
        a ← T(q_{t_m})                                   // aa type of q_{t_m}
        // Add pos. constraint to Cons_m:
        Cons_m ← Cons_{t_m} ∪ {r_{p_{t_m}} ∈ Rots^a_{p_{t_m}}}
        // Add neg. constraint to Cons_{t_m}:
        Cons_{t_m} ← Cons_{t_m} ∪ {r_{p_{t_m}} ∉ Rots^a_{p_{t_m}}}
        Run BP to obtain: MM_p(q)|Cons_{t_m}
        CalcBMM(t_m)                                     // calculate BMM_{t_m}
    end
    Run BP to obtain: MM_p(q)|Cons_m
    for i ← 1 to N do
        r^m_i ← arg max_{r_i ∈ Rots_i} MM_i(r_i)|Cons_m
        S^m_i ← T(r^m_i)                                 // ith aa of mth seq.
    end
    CalcBMM(m)                                           // calculate BMM_m
end
return {S^m}, m = 1..M

/* Use MM_p(q)|Cons_n to calculate the BMM for constrained sub-space n */
Function CalcBMM(n)
    (p_n, q_n) ← arg max_{p,q: T(q) ≠ S^n_p} MM_p(q)|Cons_n
    BMM_n ← MM_{p_n}(q_n)|Cons_n
end

Figure 3: The tbmmf algorithm. The type-specific Best Max-Marginal First (tbmmf) algorithm for calculating the M lowest energy protein sequences: {S^m}. Cons_m denotes the constraint set that defines the sub-space from which S^m was derived as the minimal energy sequence. BMM = best max-marginal.

computed, as well as the maximizing position ($p_{t_m}$) and rotamer ($q_{t_m}$) associated with this BMM. The definition of max-marginal probabilities (Eq. 11) implies that $BMM_{t_m}$ corresponds to the energy of the next lowest energy sequence. Moreover, it guarantees that this sequence can be found by choosing rotamer $q_{t_m}$ at position $p_{t_m}$, along with the constraints present in sub-space $t_m$. Therefore, this space is partitioned into two mutually exclusive sub-spaces:

1. The maximizing position ($p_{t_m}$) is constrained to be of the maximizing rotamer's ($q_{t_m}$) amino acid type. We determine the next lowest energy sequence ($S^m$) and its next best MM ($BMM_m$) by running BP on this sub-space ($m$).
2. Position $p_{t_m}$ is constrained to not be of the maximizing amino acid type. We run BP on this sub-space ($t_m$) to update its next best MM ($BMM_{t_m}$), to be considered by subsequent iterations.

Runs of BP are as described in Eqs. 9 and 10 and as illustrated in Figure 2C.

In Figure 4, the tbmmf algorithm is simulated for the protein design example from Figure 1:

m = 1: The lowest energy rotamer assignment, $(g_{11}, h_{11})$ [circle number 1], and its corresponding aa sequence, $S^1 = (G_1, H_1)$, are determined by running BP on the full rotamer space; this sequence has an energy of $-15$. The max-marginals calculated from this run of BP indicate that the next lowest energy sequence (marked by a star) can be obtained by constraining position 2 to rotamers of amino acid $H_2$ (marked in red) and will have an energy of $-13$ (since the BMM is proportional to $e^{13}$).

m = 2: The above constraint is added to the rotamer space and the next lowest energy sequence, $S^2 = (G_1, H_2)$, is calculated. Within this positively constrained sub-space, the next lowest energy sequence can be generated by constraining position 1 to amino acid $G_2$ and will have an energy of $-5$. The original space is now negatively constrained to exclude amino acid $H_2$ at position 2. The previously observed sequence $S^1$ is excluded, so that the next lowest energy sequence in this sub-space would be derived from constraining position 1 to amino acid $G_2$ and have an energy of $-7$.

m = 3: From among the two choices of lowest sequence energies ($-5$ and $-7$) available in the constrained sub-spaces, $S^3 = (G_2, H_1)$ is found by choosing the sub-space ($t_3 = 1$) and corresponding constraint (amino acid $G_2$ at position 1) that yield the sequence of lower energy.

Figure 4: tbmmf Run on the Example in Figure 1. At each iteration (horizontal axis), the selected sub-space (pairwise energy matrix) is partitioned into two complementary ones. For each sub-space in the hierarchy, rotamer assignments forbidden by its derived constraints are grayed out. In positively constrained spaces, numbered circles denote the sequence chosen at that iteration. In a given sub-space, the amino acid in red is that required to be positively constrained in order to yield the next lowest energy sequence; a star denotes that sequence. tbmmf parameters are depicted in mint (for simplicity, T = 1 and BMMs are unnormalized).
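The partitioning loop of Figure 3 can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: loopy BP is replaced by exhaustive enumeration (`min_marginals`), which is only viable for toy problems, and the computation is carried out in energy space, where the lowest constrained energy plays the role of the best max-marginal (a max-marginal proportional to $e^{-E/T}$ is largest where $E$ is smallest). The names `tbmmf`, `analyse`, `allowed_rotamers` and the toy energy table are hypothetical, and the sketch assumes that no two rotamer assignments have exactly equal energies.

```python
import itertools
import math

def allowed_rotamers(types, cons):
    """cons[i] = (required aa or None, set of excluded aas) for position i."""
    return [[r for r, a in enumerate(types[i])
             if (a == pos if pos is not None else a not in neg)]
            for i, (pos, neg) in enumerate(cons)]

def min_marginals(energy, allowed):
    """Brute-force stand-in for BP: lowest total energy per (position, rotamer)."""
    mm = [{r: math.inf for r in rots} for rots in allowed]
    for assign in itertools.product(*allowed):
        e = energy(assign)
        for i, r in enumerate(assign):
            mm[i][r] = min(mm[i][r], e)
    return mm

def analyse(types, energy, cons, seq=None):
    """Return (best sequence, its energy, best alternative) of one sub-space."""
    mm = min_marginals(energy, allowed_rotamers(types, cons))
    if seq is None:  # decode each position independently, as in Eq. 12
        seq = tuple(types[i][min(mm_i, key=mm_i.get)] for i, mm_i in enumerate(mm))
    # CalcBMM: best rotamer whose aa type differs from the sub-space's sequence.
    alts = [(e, p, types[p][r]) for p, mm_p in enumerate(mm)
            for r, e in mm_p.items() if types[p][r] != seq[p]]
    return seq, min(mm[0].values()), min(alts, default=(math.inf, None, None))

def tbmmf(types, energy, M):
    """Return up to M (sequence, energy) pairs in order of increasing energy."""
    cons = [[(None, frozenset())] * len(types)]   # first sub-space: unconstrained
    seq, e, bmm = analyse(types, energy, cons[0])
    seqs, energies, bmms = [seq], [e], [bmm]
    while len(seqs) < M:
        # Sub-space whose best alternative has the lowest energy (largest BMM).
        t = min(range(len(bmms)), key=lambda k: bmms[k][0])
        e_next, p, a = bmms[t]
        if e_next == math.inf:                    # no unseen sequences remain
            break
        child = list(cons[t])
        child[p] = (a, frozenset())                         # positive constraint
        cons[t][p] = (cons[t][p][0], cons[t][p][1] | {a})   # negative constraint
        bmms[t] = analyse(types, energy, cons[t], seq=seqs[t])[2]
        seq, e, bmm = analyse(types, energy, child)
        cons.append(child)
        seqs.append(seq)
        energies.append(e)
        bmms.append(bmm)
    return list(zip(seqs, energies))

# Hypothetical two-position toy problem: types[i][r] is the aa type of rotamer r.
types = [["G1", "G1", "G2"], ["H1", "H2", "H2"]]
TABLE = [[-15, -13, -4], [-9, -2, -6], [-7, -5, -1]]   # made-up total energies

def energy(assign):
    return TABLE[assign[0]][assign[1]]

top3 = tbmmf(types, energy, 3)
```

Each sub-space keeps the sequence it already reported, so re-analysing sub-space `t` after adding the negative constraint only refreshes its best alternative. With BP in place of enumeration, the same loop would inherit BP's approximation error on graphs with cycles.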

(A) Column groups: Num. Positions (Chains a): Design, Shell b; Search Space Cardinality (log 10): Sequence, Rotamer, td-dee c; Rotamer Library: Read d, Added e. Row groups: Small, Medium, Large 1, Large 2.

Case | Design | Shell b | Sequence | Rotamer | td-dee c | Read d | Added e
prion | 7 (A) | 7 (B) | | | | Full | χ1, χ2
SspB | 8 (A,C) | | | | | Full | χ1, χ2
hgh-hghr 1 | 6 (A) | 135 (A,B) | | | | Full | χ1, χ2
hgh-hghr 2 | 6 (A) | 135 (A,B) | | | | Full | χ1, χ2
hgh-hghr 3 | 5 (A) | 136 (A,B) | | | | Full | χ1, χ2
hgh-hghr 4 | 6 (A) | 135 (A,B) | | | | Full | χ1, χ2
hgh-hghr 5 | 6 (A) | 135 (A,B) | | | | Full | χ1, χ2
hgh-hghr 6 | 6 (A) | 135 (A,B) | | | | Full | χ1, χ2
CaM-smMLCK | 24 (A) | 19 (B) | | | | Limited | χ1
CaM-skMLCK | 24 (A) | 19 (B) | | | | Limited | χ1
hgh-hghr | 35 (A) | 16 (A,B) | | | | Limited | χ1
Top7 | 92 (A) | | | | | Limited |

a Peptide chains to which the corresponding positions belong, labeled arbitrarily.
b Non-designed, conformationally varying positions.
c Rotamer space cardinality after application of type-dependent Goldstein DEE.
d Full: all rotamers read from library; Limited: highest probability rotamers read.
e Side-chain angles around which additional rotamers were super-sampled from library rotamers.

(B)
Case | PDB | SCOP class
prion | 1I4M | (a+b)
SspB | 1OU9 | all beta
hgh-hghr | 3HHR | all alpha
CaM-smMLCK | 1CDL | all alpha
CaM-skMLCK | 2BBM | all alpha
Top7 | 1QYS | (a+b)

Figure 5: Benchmark Dataset of Protein Design Test Cases of Varying Characteristics. (A) Protein design data used for benchmarking. (B) The designed protein structures: designed residues are colored blue, conformationally varying positions yellow, and all others red. PDB identifiers and SCOP [53] structural classes are as marked.

Results

To assess the tbmmf algorithm in relation to the state-of-the-art techniques previously available to solve this problem, we investigated 12 protein design problems of various sizes and qualities considered in earlier computational and experimental protein design studies [3, 4, 23, 5 52]. First, we show that tbmmf outperforms all other methods analyzed (see below) in yielding low energy sequence ensembles.
Furthermore, we find the space of near-optimal sequences to be highly homogeneous and demonstrate how to circumvent this self-similarity and provide a more diverse sequence ensemble. The test problems are delineated in Figure 5 and detailed in Methods. All energy calculations were performed using the Rosetta design energy function [24]. We also note that the rotamer set employed for each problem was the maximal set (while super-sampling rotamer configurations) under which Rosetta energy calculations remained feasible.

For all problems, we first applied the pre-processing of the rotamer space provided by type-dependent DEE [5]. This reduced rotamer space was used as input to all algorithms, except Ros (which was run directly within the Rosetta package, see below). We then applied each of the following algorithms to the protein design problems to predict the 1 lowest energy sequences:

tbmmf
A*: Generalized DEE/A* [31]
Ros: Rotamer space Monte Carlo simulated annealing (Rosetta, default parameters) [24]
SA: Monte Carlo simulated annealing over the sequence space, with inner loops of per-sequence rotamer space simulated annealing [22]

For A*, the principal steps of the HERO algorithm's protocol [3] were followed. Briefly, Goldstein DEE [26] was followed by 1-split and 2-split DEE, and then Magic Bullet 3-split and 4-split DEE [28]. For Ros and SA, the method was run multiple times from random initial sequences, such that the total run-time would be comparable to that of tbmmf. See Methods for the full details on all runs.

tbmmf Obtains Lower Energy Sequences

In Figure 6, success in recovering minimal energy sequences is demonstrated. Results are typical for their respective size categories; fuller results can be found in Table I. For all cases assessed, tbmmf decidedly outperforms the other algorithms in predicting a set of low energy sequences. The assessment was performed in the following manner. For each protein design problem, each algorithm predicted its top M = 1 sequences. In addition, for the M sequences predicted by Ros and SA, we subsequently ran belief propagation (BP) to determine the corresponding minimal rotameric energies (Eq. 2). We denote the calculation of these per-sequence minimal rotamer energies as Ros+ and SA+, respectively. These BP calculations were performed to avoid over-penalizing the Ros and SA sampling algorithms in cases where they may in fact have found low energy sequences without having found the minimal energy rotamer assignment for the sequence (e.g. see Figure 7). Finally, the top 1 sequence results output by the algorithm runs were pooled, and each such sequence was ranked according to the minimum rotameric energy it obtained

by any of the algorithms, including Ros+ and SA+. Subsequently, only the top 1 sequences from this pool are considered in the success rates, in which we measure what fraction of these 1 sequences were discovered by any given algorithm run (Figure 6).

Figure 6: Assessment Results for Representative Protein Design Test Cases. Panels: Small (prion), Medium (hgh-hghr 1), Large 1 (CaM-smMLCK), Large 2 (hgh-hghr); vertical axis: % Top Sequences; bars: tbmmf, A*, Ros, SA for prion and tbmmf, Ros, SA for the other panels. Results of the tbmmf, A* (where feasible), Rosetta rotamer space simulated annealing (Ros), and sequence space simulated annealing (SA) algorithms for representative protein design problems. Note that A* was only feasible for the prion case. For each algorithm, the bar denotes the percentage of the top 1 sequences (output by any algorithm) obtained. For Ros and SA, the results are combined from multiple runs; see text for details.

Only in the prion protein design problem was A* feasible, provably finding all 1 minimal energy sequences. tbmmf and SA also yielded this optimal set, while Ros fared reasonably, discovering 86% of the minimal energy sequences. Thus, although not mathematically guaranteed, tbmmf obtained the complete set of lowest energy sequences in the case where such a set could be calculated by A*. For all other problems, however, A* was not feasible, so we do not have exact results with which to compare. Nonetheless, tbmmf was by far the most adept at finding the largest number of low energy sequences for these more realistic protein design problems.

Table I depicts the full assessment results for all 12 test cases (columns marked "Top") and the computational run-times of the algorithms (columns marked "Time"); all algorithms were run on dual-cpu Linux machines. For the Small cases, Ros and SA performed comparably to tbmmf. However, for all larger cases, the deterministic tbmmf algorithm vastly outperforms both Ros and SA in obtaining a larger number of lower energy solutions. Moreover, for all except the Small cases, Ros and SA were allowed to run significantly longer than tbmmf. Being random sampling algorithms, they could theoretically be run even longer to possibly achieve better results. But since tbmmf already provides superior results in less time (hours vs. days), it is clearly preferable.
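The pooled assessment described above can be written down compactly. The following Python sketch assumes a hypothetical data layout, a dictionary mapping each algorithm name to the sequences it output together with their best rotameric energies (for Ros and SA these would be the Ros+ and SA+ energies). The function name `top_fraction` and the example values are invented for illustration and are not results from this study.

```python
def top_fraction(results, M):
    """results: {algorithm: {sequence: energy}}; returns {algorithm: fraction found}."""
    # Pool all sequences, keeping the lowest energy any algorithm reported for each.
    pooled = {}
    for found in results.values():
        for seq, e in found.items():
            pooled[seq] = min(e, pooled.get(seq, float("inf")))
    # Keep only the M lowest energy sequences of the pool.
    top = sorted(pooled, key=pooled.get)[:M]
    # Fraction of those sequences that each algorithm discovered.
    return {alg: sum(seq in found for seq in top) / len(top)
            for alg, found in results.items()}

# Made-up outputs of three algorithms (sequence -> energy), illustration only.
results = {
    "tbmmf": {"AAA": -10.0, "AAB": -9.0, "ABB": -8.2},
    "Ros":   {"AAA": -9.5, "BBB": -3.0, "ABA": -8.0},
    "SA":    {"AAB": -8.5, "BBB": -2.5, "BAB": -1.0},
}
fractions = top_fraction(results, M=3)
```

A sequence counts as discovered by an algorithm whenever it appears in that algorithm's output, even if another algorithm reported a lower rotameric energy for it; this mirrors the role of the Ros+ and SA+ recalculations. Ties at the cut-off are broken by pooling order in this sketch.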

Column groups: tbmmf (Top, Time); Ros (Top, Time); SA (Top, Time); A* (Top a, Time); A* Rotamer Space (td-dee b, DEE c). Row groups: Small, Medium, Large 1, Large 2.

Case | tbmmf Top | tbmmf Time | Ros Top | Ros Time | SA Top | SA Time | A* Top a | A* Time | td-dee b | DEE c
prion | 1% | 58.9 m | 86% | 9.3 h | 1% | 12 h | 1% | 3.4 m | |
SspB | 1% | 11 h | 1% | 11.4 h | 97% | 9.6 h | 1% d | 3 d | |
hgh-hghr 1 | 88% | 13.4 h | 3% | 2.1 d | 2% | 7.3 d | failed | 12 d | |
hgh-hghr 2 | 6% | 7.6 h | 5% | 2 d | % | 5.9 d | failed | 12 d | |
hgh-hghr 3 | 1% | 4.1 h | 73% | 1.7 d | % | 5.9 d | failed | 12 d | |
hgh-hghr 4 | 1% | 8.5 h | 22% | 2.1 d | % | 7.4 d | failed | 12 d | |
hgh-hghr 5 | 1% | 2.9 h | 27% | 2 d | % | 5.8 d | failed | 12 d | |
hgh-hghr 6 | 1% | 8.5 h | 42% | 2.2 d | % | 6.1 d | failed | 12 d | |
CaM-smMLCK | 73% | 1.6 h | 18% | 18 h | 23% | 1 d | failed | 12 d | |
CaM-skMLCK | 1% | 2 h | % | 1.7 h | % | 2.7 h | failed | 7.2 d | |
hgh-hghr | 1% | 17.6 h | % | 2 d | % | 2.3 d | failed | 12 d | |
Top7 | 69% | 7.1 h | 31% | 1.5 d | % | 1.7 d | failed | 12 d | |

a failed: DEE calculations were terminated after a time limit of 12 CPU days and/or the A* algorithm was terminated due to a lack of computer memory (4 GB limit).
b Rotamer space cardinality (log 10) after pre-processing by type-dependent Goldstein DEE.
c Rotamer space cardinality after application of generalized DEE (as part of DEE/A*).
d Using a DEE threshold of 0; non-zero thresholds failed.

Table I: Assessment and Analysis of the Algorithms Tested. Fraction of top sequences obtained and CPU run-times for all protein design test cases. For each design scenario, the highest fraction obtained is marked in boldface. For Ros and SA, the run-times are summed over all randomized runs (see Methods). m = minutes; h = hours; d = days. The rightmost columns indicate the reduction in the rotamer space achieved by the application of DEE in the A* algorithm. Note that for all but the prion case, a generalized DEE threshold of 0 was applied; see text for the details of the successive DEE criteria applied.

In the single case where we found A* to be successful in finding the top 1 sequences (the prion problem), a DEE threshold of .33 energy units (energy units approximate kcal/mol) was used. On the other hand, in all other cases even a threshold of 0 (i.e. standard DEE) did not suffice to make DEE/A* empirically feasible. Specifically, in these cases, we found that most often the HERO DEE protocol [3] did not terminate within the 12 day time limit imposed (in which case, termination was forced). In any case, the A* algorithm was not empirically feasible for these cases, i.e. there was insufficient computer memory to obtain even the single lowest energy sequence; see Methods for full implementation details. The only exception was the Small SspB case, where the top sequence was found by A* after DEE pruning with a threshold of 0; larger DEE thresholds were not feasible (for the DEE and A* stages).

Overall, we conclude that for the larger protein design problems, the DEE-based methods presented in [31, 32] are not applicable, since even finding the single lowest energy rotamer assignment for these problems using the sophisticated DEE criteria [28, 29] of the HERO algorithm was not possible within a more than reasonable amount of time. We also emphasize that, since we actually wish to obtain the top 1 sequences, a non-zero DEE threshold would

still be required for these problems, which would leave the rotamer space resulting from the application of DEE even larger than that listed in Table I ("DEE" column), making A* even less likely to be feasible.

Note that, for the prion case, the DEE threshold was chosen based on the tbmmf-output sequence energies, as the minimal threshold sufficient to maintain 1 distinct sequences within the rotamer space. Thus, the apparent run-time advantage of the A* algorithm is artificial in the sense that it is highly dependent on the choice of threshold; for example, without the information from tbmmf, one may need to run the algorithm multiple times, each time performing the DEE with an increasing threshold until 1 sequences can be provably found by A*. Thus, the only relevant conclusion for this case is that tbmmf recovers all solutions found by the exact A* approach. In the case of the SspB design, A* took 3 days to output the single lowest energy sequence, while all other methods predicted virtually all of the 1 lowest energy sequences in less than half a day.

It could be hypothesized that it may be relevant to utilize the power of even more stringent DEE criteria (e.g. the full suite of techniques in the HERO protocol [3]) to obtain the single lowest energy rotamer assignment, which could be used to seed the SA search for 1 sequences (as in [22]). However, there are several major hurdles to this approach. First, the DEE stage would clearly require an inordinate run-time (possibly weeks to months). Furthermore, we have observed that even in cases when the lowest energy sequence was encountered during the SA run, the resulting sequence ensemble is still far from optimal (e.g. the CaM-smMLCK case in Figure 7). Similarly, it has been shown [54] that such an optimal seed for sampling algorithms does not provide significant improvement; intuitively, this occurs because the SA system quickly diverges from the initial sequence, especially due to the high temperatures that exist for the initial SA stages.

As a final benchmark comparison, we applied the provably exact globally-flexible backbone DEE method of BD (with default parameters) [18]. However, even when using a cluster of 2 processors, the protein design calculations did not terminate for the Small or Large 1 cases after a time limit of 12 days. This lack of convergence for BD was not fully surprising, since the conformational spaces for which BD was previously shown to be successful [18] were orders of magnitude smaller (1 18 ) than those here (e.g. 1 2 for even the Small prion case). More importantly, the conformational space remaining after BD (input to the A* algorithm) in [18] was also much smaller than those here (1 1 vs ). Furthermore, the considerable
