Accurate Prediction for Atomic-Level Protein Design and its Application in Diversifying the Near-Optimal Sequence Space


Research Article
Proteins: Structure, Function and Bioinformatics
DOI 1.12/prot.2228

Menachem Fromer (1), Chen Yanover (2)

(1) School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
(2) Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America

Running Title: Exploring Near-Optimal Sequence Space

Keywords: Protein Design, Protein Energetics, Structural Sequence Space, Probabilistic Graphical Models, Belief Propagation, Combinatorial Optimization, Approximate Inference, Maximum-a-posteriori Estimation

Institutions at which the Work was Performed: The Hebrew University of Jerusalem, Fred Hutchinson Cancer Research Center

Contact Information for the Corresponding Author:
Name: Menachem Fromer
Address: School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem 9194, Israel
Email: fromer@cs.huji.ac.il

© 2008 Wiley-Liss, Inc. Received 22-May-2008; Revised 28-Aug-2008; Accepted 12-Sep-2008

Abstract

The task of engineering a protein to assume a target three-dimensional structure is known as protein design. Computational search algorithms are devised to predict a minimal energy amino acid sequence for a particular structure. In practice, however, an ensemble of low energy sequences is often sought. Primarily, this is done because an individual predicted low energy sequence may not necessarily fold to the target structure, due both to inaccuracies in modeling protein energetics and to the non-optimal nature of the search algorithms employed. Additionally, some low energy sequences may be overly stable and thus lack the dynamic flexibility required for biological functionality. Furthermore, the investigation of low energy sequence ensembles will provide crucial insights into the pseudo-physical energy force fields that have been derived to describe structural energetics for protein design. Significantly, numerous studies have predicted low energy sequences, which were subsequently synthesized and demonstrated to fold to desired structures. However, the sequence space defined by such energy functions as compatible with a target structure has not been characterized in full detail. This issue is critical if protein design scientists are to continue using these force fields successfully at an ever-increasing pace and scale. In this paper, we present a conceptually novel algorithm that rapidly predicts the set of lowest energy sequences for a given structure. Based on the theory of probabilistic graphical models, it performs efficient inspection and partitioning of the near-optimal sequence space, without making any assumptions of positional independence. We benchmark its performance on a diverse set of relevant protein design examples and show that it consistently yields sequences of lower energy than those derived from state-of-the-art techniques. Thus, we find that previously presented search techniques do not depict the low energy space as precisely. Examination of the predicted ensembles indicates that, for each structure, the amino acid identity at a majority of positions must be chosen extremely selectively so as to not incur significant energetic penalties. We investigate this high degree of similarity and demonstrate how more diverse near-optimal sequences can be predicted in order to systematically overcome this bottleneck for computational design. Finally, we exploit this in-depth analysis of a collection of the lowest energy sequences to suggest an explanation for previously observed experimental design results. The novel methodologies introduced here accurately portray the sequence space compatible with a protein structure and further supply a scheme to yield heterogeneous low energy sequences, thus providing a powerful instrument for future work on protein design.

Introduction

The objective of constructing a protein to perform a specific biological function is termed functional protein design [1, 2]. Potential applications of design include modifications of existing proteins to affect such characteristics as stability or binding affinity [3]. A more ambitious goal is to design protein sequences that will assume novel structures [4] or acquire new functionalities. Such functionalities may be therapeutic (e.g. [5], where an HIV inhibitor was designed) or industrial (e.g. [6], which discusses the design of new biocatalysts). Additionally, the outcomes of such design experiments will critically assess our comprehension of protein architecture and stability (e.g. [7]).

A commonly used paradigm for computational protein design casts the functional design problem as a structural one and, typically, assumes a fixed protein backbone [8]. In addition, the amino acid side chain conformations are not permitted to move continuously in space; rather, the allowed conformations are discretely clustered around a library of distinct, energetically favorable empirical observations ("rotamers") [9]. Lastly, pairwise atomic energy functions are used to assign pseudo-physical energetic values to pairs of atoms [1].

Protein Design Formulation

The input to the protein design problem consists of a three-dimensional protein backbone structure, the $N$ sequence positions to be designed, a group of amino acids (and their respective rotamers) permitted at each position, and an atomic energy function. Formally, we denote by $\mathrm{Rots}_i$ the set of all possible rotamers at position $i$ (for all amino acid types); let $r = (r_1, \ldots, r_N)$ denote an assignment of rotamers for all $N$ positions.
For a given pairwise atomic energy function, the energy of assignment $r$, $E(r)$, is the sum of the interaction energies between:
1. rotamer $r_i \in \mathrm{Rots}_i$ and the fixed structural template (backbone and stationary residues) [$E_i(r_i)$]
2. rotamers $r_i \in \mathrm{Rots}_i$ and $r_j \in \mathrm{Rots}_j$ for neighboring (interacting) positions $i, j$ [$E_{ij}(r_i, r_j)$]

$$E(r) = \sum_i E_i(r_i) + \sum_{i,j} E_{ij}(r_i, r_j) \tag{1}$$

Let us denote by $T(k)$ the amino acid type of rotamer $k$ and let $T(r_1, \ldots, r_N) = (T(r_1), \ldots, T(r_N))$. Let $S = (S_1, \ldots, S_N)$ denote an assignment of amino acids for all positions. Also, we define $\mathrm{Rots}_i^a$ as the set of rotamers in $\mathrm{Rots}_i$ of amino acid type $a$. Computational protein design attempts to find the sequence $S^*$ of minimal energy. Mathematically, $S^* = \arg\min_S E(S)$, where:

$$E(S) = \min_{r:\, T(r) = S} E(r) \tag{2}$$

is the minimal rotamer assignment energy for sequence $S$. The double minimization problem (over the sequence space and over the per-sequence rotamer space) is combined as:

$$S^* = T\left(\arg\min_r E(r)\right) \tag{3}$$

Goal: Find Multiple Sequences

Within the framework defined thus far, the goal of protein design is to predict an amino acid sequence of minimal energy for the target structure. However, there are various reasons why it is desirable to predict a set of low energy sequences ("top" sequences) in addition to the single lowest energy sequence [11–13].

Inaccuracies in modeling and searching: Inaccuracies in quantifying protein energetics [1], the assumption of a fixed backbone and discretized side chains, and the failure to account for competing target structures [7, 14] imply that the sequence-energy landscape may not be modeled with sufficient rigor. Thus, the sequence with the lowest predicted energy may not in fact fold to the target structure. Also, even if the energies of all possible sequences are modeled exactly, non-optimality of the search algorithm used may prevent the lowest energy sequence from being found. Preliminary work noted in [4] provides an example of both of these possibilities, where using slightly modified energy functions or search protocols yielded sequences that were energetically stable and yet did not assume their target structures (instead possessing molten globular character).
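To make the notation of Eqs. 1-3 concrete, the following minimal sketch evaluates them by brute force on a two-position toy problem. The rotamer names follow the toy example of Figure 1, and the sixteen pair energies are assumed to run from -15 (lowest) to 0; the helper names (`aa_type`, `energy`, `sequence_energy`, `best_sequence`, `top_sequences`) are illustrative and do not come from the paper. Exhaustive enumeration is only feasible at this scale and is not the search method proposed here.

```python
from itertools import product

# Toy instance patterned on Figure 1 (names and values are illustrative):
# two positions, two amino acid types each, two rotamers per type.
ROTS = [("g11", "g12", "g21", "g22"), ("h11", "h12", "h21", "h22")]


def aa_type(rot):
    """T(k): amino acid type of rotamer k, e.g. 'g12' -> 'G1'."""
    return rot[0].upper() + rot[1]


# Rotamer pairs listed from lowest to highest energy; energies are ASSUMED
# to run from -15 to 0 in steps of 1.
ORDERED = [
    ("g11", "h11"), ("g11", "h12"), ("g12", "h22"), ("g11", "h22"),
    ("g12", "h11"), ("g12", "h12"), ("g12", "h21"), ("g11", "h21"),
    ("g21", "h12"), ("g21", "h11"), ("g22", "h21"), ("g21", "h22"),
    ("g22", "h11"), ("g22", "h12"), ("g22", "h22"), ("g21", "h21"),
]
E_PAIR = {(0, 1): {pair: float(rank - 15) for rank, pair in enumerate(ORDERED)}}
E_SINGLE = [{rot: 0.0 for rot in rots} for rots in ROTS]  # unused in the toy


def energy(r):
    """Eq. 1: singleton terms plus pairwise terms over interacting positions."""
    total = sum(E_SINGLE[i][ri] for i, ri in enumerate(r))
    total += sum(table[(r[i], r[j])] for (i, j), table in E_PAIR.items())
    return total


def sequence_energy(seq):
    """Eq. 2: minimum of E(r) over rotamer assignments r with T(r) = seq."""
    choices = [[rot for rot in rots if aa_type(rot) == aa]
               for rots, aa in zip(ROTS, seq)]
    return min(energy(r) for r in product(*choices))


def best_sequence():
    """Eq. 3: amino acid types of the globally minimal rotamer assignment."""
    r_star = min(product(*ROTS), key=energy)
    return tuple(aa_type(rot) for rot in r_star)


def top_sequences(m):
    """The m lowest energy sequences, found by exhaustive enumeration."""
    seqs = {tuple(aa_type(rot) for rot in r) for r in product(*ROTS)}
    return sorted(seqs, key=sequence_energy)[:m]
```

Under these assumed energies, `best_sequence()` agrees with the first entry of `top_sequences(4)`, which illustrates that the double minimization of Eq. 2 and the combined form of Eq. 3 select the same sequence.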
Selection among low energy sequences: Even if a predicted low energy sequence does attain a low physical energy experimentally, it may not satisfy other constraints relevant for ultimately defining a sequence as desirable. Particular consideration of the foremost such constraint, that of biological functionality, arises from the fact that some low energy sequences may be overly stable at standard temperatures and thus lack the kinetic flexibility required to function [6]. Therefore, the biological feasibility of the predicted sequence-structure models can be used to select, from a generated ensemble, the most promising candidate sequences with which to proceed to more expensive and time-consuming experimental validation techniques, such as structural determination or functional assays. This feasibility can be based on techniques (somewhat) orthogonal to ranking using atomic energy functions; such techniques include manual visual inspection or automatic validation, e.g. MolProbity [15]. Another approach for the re-ranking of top-scoring sequence results may include the use of more physically realistic (but computationally heavier) conformational search algorithms that account for molecular flexibility: either of the amino acid side chains (e.g. [16, 17]) or of the protein backbone (e.g. [18, 19]); see below for a more rigorous review of these flexible-molecule search methods. In a similar vein, molecular dynamics simulations can be used to refine the protein structures for each of the predicted low energy sequences (e.g. [2]). There also exist other pertinent requirements that cannot be imposed on the sequence search problem from the outset. For instance, a requirement for few mutations from the wild-type sequence cannot necessarily be well-defined without having first observed the trade-off between mutation number and energy within the computational sequence space, since we do not want to arbitrarily limit the minimal energy obtainable. Thus, consideration of a collection of low energy sequences is often invaluable. Nonetheless, we do note that certain sequence conditions can be applied directly during the search stage.
For example, the work in [21] demonstrates, with relative success, the capability of enforcing fixed amino acid composition during the search (useful for keeping reference state energies relatively constant).

Low energy sequence profiles: In effect, a set of low energy sequences for the target structure characterizes sequences well-suited to fold to the structure. This information can be summarized in a sequence profile (position-specific scoring matrix, PSSM) that tabulates the positional amino acid probabilities for sequences predicted to fold to the structure. The profile can then be utilized, for example, to build experimental protein design libraries [1, 22], which are used to biologically screen large numbers of relevant sequences. Alternatively, such profiles can be systematically compared to evolutionary sequence data [11–13]. Although generating such profiles is not necessarily the ultimate goal of this work (in contrast to our previous work [23]), it is a beneficial byproduct of the investigation of the near-optimal sequence space.

The issues outlined above clearly justify the need for an efficient algorithm to directly determine an ensemble of low energy sequences. Additionally, we are interested in exploring the near-optimal sequence space induced by a widely-used energy function (in this case, the Rosetta function [24]), in an attempt to comprehend the predictive consequences of using such energy functions for protein design. In this paper, we describe a novel algorithm to quickly predict the set of low energy sequences for a target structure. Using probabilistic graphical modeling, we efficiently determine minimal energy sequences when the underlying search space includes numerous rotamers for each amino acid type, without considering any sequence more than once (see Figure 1). Using a diverse dataset of protein design problems, we benchmark its efficacy in yielding lower energy sequences, as compared to previous methods. We observe that for cases when the complete set of lowest energy sequences can be exhaustively enumerated, the algorithm empirically obtains this set. Moreover, we find the set of near-optimal sequences to have an extremely high degree of sequence and biochemical similarity and provide a practical technique to increase sequence variation.

Previous Work

The methods previously applied to predict a set of $M$ low energy sequences (or for similar tasks) can be categorized as provably exact approaches, statistical methods, and sampling techniques; we now provide a detailed review of these methods. For the reader interested in proceeding directly to our novel approach (subsequent sections), it suffices to say that the main computational challenge is outlined in Figure 1.

Exact Methods

Before detailing the exact methods, we highlight one of their deficiencies.
Even when adapted to obtain successive low energy rotamer assignments, they have not been constructed to provide successive low energy sequences, skipping over low energy rotamer assignments of sequences previously observed. Thus, a naive application of these algorithms must, in some sense, iterate over successive rotamer assignments but output only newly observed sequence assignments. However, this is quite computationally costly, since each sequence typically has an exponential number of corresponding rotamer assignments, many of which have potentially lower energy than the next lowest energy sequence, thus yielding repeated sequence assignments (Figure 1).

Figure 1: Toy Protein Design Problem. Toy example to demonstrate the need for an algorithm that yields distinct low energy sequences within the rotamer space. (A) Pairwise rotamer energies between position #1 (rotamers g11, g12 of amino acid G1; g21, g22 of G2) and position #2 (rotamers h11, h12 of amino acid H1; h21, h22 of H2). The minimal rotameric energy for each sequence appears in boldface. (B) All rotamer assignments in order of increasing energy:

    r              E(r)    T(r)
    (g11, h11)     -15     (G1, H1)
    (g11, h12)     -14     (G1, H1)
    (g12, h22)     -13     (G1, H2)
    (g11, h22)     -12     (G1, H2)
    (g12, h11)     -11     (G1, H1)
    (g12, h12)     -10     (G1, H1)
    (g12, h21)      -9     (G1, H2)
    (g11, h21)      -8     (G1, H2)
    (g21, h12)      -7     (G2, H1)
    (g21, h11)      -6     (G2, H1)
    (g22, h21)      -5     (G2, H2)
    (g21, h22)      -4     (G2, H2)
    (g22, h11)      -3     (G2, H1)
    (g22, h12)      -2     (G2, H1)
    (g22, h22)      -1     (G2, H2)
    (g21, h21)       0     (G2, H2)

Naively, the 11 lowest energy rotamer assignments have to be examined to yield all 4 sequences. Lower case letters (e.g. g11) stand for rotamer configurations, upper case letters (e.g. G1) for amino acid types (that is, $T(g_{11}) = G_1$). For simplicity, only pairwise energies are considered.

DEE, A*: Dead-end elimination (DEE) is an iterative positional rotamer elimination technique that is guaranteed to remove only rotamers that do not participate in the lowest energy rotamer assignment (and hence lowest energy sequence assignment) [25, 26]. This method was applied in a number of ground-breaking studies (e.g. [3, 22, 27]). However, large protein design problems require increasingly stringent DEE criteria (with large increases in computation time) to yield a search space small enough to be considered exhaustively and to uniquely determine the minimal energy rotamer assignment [28–30]. Furthermore, DEE must be adapted in order to obtain successive low energy rotamer assignments. One such adaptation is the generalized DEE/A* method [31]. First, a relaxed DEE criterion is applied such that, for a given energetic threshold $\varepsilon$, only rotamers that do not participate in a rotamer assignment with energy within $\varepsilon$ of the minimal rotamer assignment will be eliminated. Note that, for relevant $\varepsilon > 0$, the resulting rotamer space cannot contain only a single rotamer assignment. In fact, this space is often quite large, so the A* artificial intelligence technique is used to systematically enumerate the rotamer assignments in guaranteed order of increasing energy. A critical drawback of this method is that it often requires extremely long computation times and extensive amounts of computer memory, so that in larger problems it may not be at all feasible [31]. Another problem encountered by the DEE/A* approach is that the initial choice of $\varepsilon$, necessary to maintain the required number of sequences and yet prune as much of the rotamer space as possible, is not trivial.
If the threshold chosen is too low, then some rotamer assignments required to obtain the $M$ lowest energy sequences will have been eliminated; on the other hand, larger thresholds will be unable to trim the rotamer space as successfully. A more recently devised method for providing successive low energy assignments is that of X-DEE [32], which successively applies DEE to disjoint sub-spaces until a specified number of lowest energy rotamer assignments are found. However, X-DEE requires very many runs of DEE on large sub-spaces, and for cases where the search space is very large, as for protein design (unlike the case of two-state variables tested in [32]), this could well be an insurmountable computational hurdle for this method.

Flexible-molecule DEE: Recently, a growing number of algorithms allow for flexibility of the amino acid side chains [16] and the protein backbone [4, 13, 19, 33–35] during the structural modeling involved in the protein design process. The relaxation of the requirements for a fixed backbone and fixed rotamers for protein design has been shown to produce more natural side chain and amino acid variability [13, 16, 19, 35]. Among these algorithms, however, the procedures developed by Donald and colleagues are exceptional, in that they provably provide low energy sequences for protein design, while incorporating flexible side chains [17, 36], global backbone flexibility [18], and local backbone flexibility (e.g. "backrub") [37]. Briefly, these exact DEE-based procedures calculate lower and upper bounds on the energetic terms in Eq. 1 and then use these bounds to run a generalized DEE algorithm and prune the conformational space. The remaining space is subsequently searched using a slightly modified A* algorithm to obtain the lowest energy sequence(s). We note that for any given protein design problem, providing for conformational flexibility (side chain or backbone) in an exact method has the advantage that it will typically find a larger number of lower energy sequences, e.g. below a certain acceptable energetic threshold. Specifically, since non-flexibility is always an option, the structurally flexible minimal energy sequence will have an energy no higher than that found without flexibility. On the other hand, the use of such bounded intervals for the energy terms (instead of specific energy values) results in less efficient pruning of the conformational space and thus longer running times for the A* search [18]. Therefore, convergence of these algorithms within a reasonable time frame is not at all guaranteed for larger protein design cases (though it has been calculated that using an exhaustive search to produce comparable results would require two orders of magnitude longer run-times [37]). Additionally, the problems encountered when running standard DEE, e.g. the choice of $\varepsilon$ for ensuring that the $M$ lowest energy sequences can be found, are also relevant here.

LP: An additional method guaranteed to obtain successive low energy rotamer assignments is the LP/ILP approach of [38], where the search for the minimal energy rotamer assignment is structured as an integer linear programming (ILP) problem. Under certain conditions, the linear programming (LP) relaxation can be efficiently run to yield the solution; otherwise, a computationally more intensive ILP solver is run. However, in practice, even when using a simple energy function (van der Waals interactions and a statistical rotamer self-energy), the ILP solver was required for most design problems tested, resulting in large run-times to predict even the single lowest energy rotamer assignment. Generating $M$ sequences would entail running the method (at least) $M$ times, making a typical protein design ensemble search essentially infeasible.

Finally, we do note that the X-DEE and LP/ILP methods could theoretically be generalized to obtain successive low energy sequences, by partitioning the sub-spaces by amino acid type instead of rotamer at each position (as in the tbmmf algorithm presented below). However, the methods would still suffer from the other drawbacks outlined above. In any case, such generalizations would have to be formulated carefully so as to prevent an exponential number of X-DEE search bases (ILP inequality constraints), deriving from the fact that different rotamers of the same amino acid may be chosen for different sequences.

Statistical Methods

Mean Field and its generalizations: Self-consistent mean field theory [39] provides a framework for the prediction of positional amino acid probabilities for the design of a target protein structure, by making certain simplifying statistical independence assumptions. The SCADS (Statistical Computationally Assisted Design Strategy) method [4] generalizes mean field theory and applies a statistical entropy approach to atomic level protein design in order to predict these probabilities, in a way that is robust to minor backbone changes. These methods are formulated to produce site-specific amino acid probabilities, which can be used to build a PSSM as described above. Nonetheless, they have been adapted to yield low energy sequences, often by independently choosing the most probable amino acid at each position. However, such strategies have met with limited success in finding even the single lowest energy sequence [41]. Furthermore, the sub-optimality of standard mean field theory was also found to hold when the probabilities it predicted were compared with simulated, evolutionary, and experimental probability data [23].
Finally, since the goal herein is to predict an ensemble of low energy sequences, without having to assume any specific independence between positions (beyond that already defined by the pairwise energy function), we do not investigate these methods.

Sampling Methods

Monte Carlo Simulated Annealing: Simulated annealing (SA) [22, 24] attempts to solve the global minimization problem by starting from an initial sequence assignment (possibly random). At each step, the sequence is slightly modified using a random mutation rule. This new sequence is retained if its energy is lower than that of the previous sequence; otherwise, it is randomly accepted with a probability that is a function of its energy and the current temperature of the SA system. Typically, the temperature starts at a high value (permitting all transitions) and the system is slowly cooled until convergence. A major disadvantage of SA is that convergence to the minimum is not guaranteed within a finite number of steps or within the framework of any given cooling schedule. In [22], SA was applied to the design of the active site of the β-lactamase structure and the M =1, lowest energy sequences observed during the SA run were output.

Probabilistic Graphical Modeling of Protein Design

The principal computational tool we utilize in this paper is the representation of the protein design energy optimization problem as a probabilistic graphical model (graphical model, for short) and the subsequent application of the loopy belief propagation algorithm. We briefly summarize the theory of graphical models and belief propagation in the context of protein design using pairwise energy functions (for in-depth reviews, see [42, 43]). The formulation used here for predicting the lowest energy rotamer assignment for protein design is a generalization of that used in [44] to determine the minimal energy rotameric state for protein side chain prediction. It is also conceptually similar to that used in [45] to find the single lowest energy sequence for protein design and to the approach in [46] applied to calculate the free energies of particular sequences. We have also previously used a similar formulation to predict positional amino acid probabilities over the entire sequence space for protein design problems [23].
Graphical modeling: A graphical model is a compact representation of a probability distribution that is well-suited to describe conditional independencies between variables. For protein design, we define a random variable for each design position, whose values represent the rotameric choices (including amino acid) at that position. We then build a graph, wherein each node corresponds to a variable and the node's values correspond to the variable's values (Figure 2A,B). An assignment for all variables is equivalent to rotamer choices for all design positions; we thus use the terms design position, variable, and node interchangeably.

Figure 2: Probabilistic Graphical Modeling of Protein Design. (A) SspB dimer interface, with Cα coordinates of design positions (12, 15, 16, 11 on monomers A, C) marked as spheres. Interactions between positions, as determined by the Rosetta energy function, are marked as edges; for simplicity of exposition, we ignore all intra-monomeric edges. The resulting graphical model is shown in (B), where each edge contains the pre-calculated rotamer-rotamer energy matrix, demonstrated for A11 and C11. (C) The messages passed in one direction by loopy belief propagation on the cycle: A16, C11, A11, C12. Each position calculates the outgoing message (solid arrow) by combining (Eq. 9) the interaction energies and the current incoming messages from all other nodes (dashed arrows). Positions that do not interact (e.g. A16 and A11) are nonetheless mutually dependent through common interactions (C12 and C11).

The pre-calculated rotamer energies taken as input to the problem (see Introduction) are utilized to define probabilistic potential functions, or probabilistic factors, in the following manner. The singleton energies specify probabilistic factors describing the self-interactions of the positions in their permitted rotamer states:

$$\psi_i(r_i) = e^{-E_i(r_i)/T} \tag{4}$$

And, the pairwise energies define probabilistic factors describing the direct pairwise interactions for pairs of positions and their possible rotamers:

$$\psi_{ij}(r_i, r_j) = e^{-E_{ij}(r_i, r_j)/T} \tag{5}$$

where $T$ is the system temperature (taken to be the equivalent of 37 °C). For a pair of variables $i, j$, the matrix of pairwise probabilistic factors ($\psi_{ij}$) corresponds to an edge between them in the graph (Figure 2B).
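As a small illustration of Eqs. 4 and 5, the sketch below converts energies into Boltzmann-style factors and checks that multiplying the factors of an assignment gives the exponential of its negated total energy, so the lowest energy assignment carries the largest weight. The two-position energy tables, the rotamer labels, and the choice `T = 1.0` are hypothetical values picked for readability, not data from the paper.

```python
import math

T = 1.0  # hypothetical temperature, in the same units as the energies

# Hypothetical energies for two interacting positions, two rotamers each.
E_1 = {"a": -1.0, "b": 0.5}
E_2 = {"c": 0.0, "d": -0.5}
E_12 = {("a", "c"): 0.3, ("a", "d"): -2.0, ("b", "c"): -1.0, ("b", "d"): 1.5}


def singleton_factor(e_i):
    """Eq. 4: psi_i(r_i) = exp(-E_i(r_i) / T)."""
    return math.exp(-e_i / T)


def pair_factor(e_ij):
    """Eq. 5: psi_ij(r_i, r_j) = exp(-E_ij(r_i, r_j) / T)."""
    return math.exp(-e_ij / T)


def total_energy(r1, r2):
    """Sum of the singleton and pairwise energies of one assignment."""
    return E_1[r1] + E_2[r2] + E_12[(r1, r2)]


def factor_product(r1, r2):
    """Unnormalized weight of an assignment: product of all its factors."""
    return (singleton_factor(E_1[r1]) * singleton_factor(E_2[r2])
            * pair_factor(E_12[(r1, r2)]))


assignments = [(r1, r2) for r1 in E_1 for r2 in E_2]
lowest_energy = min(assignments, key=lambda r: total_energy(*r))
highest_weight = max(assignments, key=lambda r: factor_product(*r))
```

Because the exponential is monotonic, `lowest_energy` and `highest_weight` coincide for any positive `T`; the temperature only changes how sharply the weights separate.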
Since energy functions typically used for design essentially ignore interactions occurring between atoms more distant than a certain threshold, the design graph will often have a large number of missing edges (positions too distant to directly interact). Thus, the locality of spatial interactions in the protein structure induces path separation in the graph and conditional independence in the probability distribution of the variables. Mathematically, the probability distribution for the rotamer assignment $(r_1, \ldots, r_N)$ decomposes into a product of the singleton and pair probabilistic factors:

$$\Pr(r_1, \ldots, r_N) = \frac{1}{Z} \prod_i \psi_i(r_i) \prod_{i,j} \psi_{ij}(r_i, r_j) \tag{6}$$

$$= \frac{1}{Z} e^{-E(r)/T} \tag{7}$$

where $Z$ is the probability normalization factor (partition function), and Eq. 7 derives from substitution of Eqs. 4, 5 into Eq. 6 and the pairwise energy decomposition of Eq. 1. Thus, minimization of rotamer energy (Eq. 3) is equivalent to maximization of rotamer probability (a probabilistic inference task):

$$S^* = T\left(\arg\max_r \Pr(r)\right) \tag{8}$$

All the same, the size of the rotamer space is exponential in protein length, making protein design computationally difficult even for small proteins (see [47, 48] for a rigorous handling of the subject). Thus, exhaustive calculation of an exact maximum for Eq. 8 is no less computationally infeasible than minimization of Eq. 3. Nevertheless, having formulated the protein design problem as an inference problem on a graphical model avails us of a wide array of effective approximate inference techniques.

Belief propagation: Max-product belief propagation (BP) [42] is a message passing algorithm that efficiently utilizes the inherent locality in the graphical model representation. Messages are passed between neighboring (interacting) variables, where the message vector describes one variable's belief about its neighbor, that is, the relative likelihood of each allowed state for the neighbor.
A message vector to be passed from one position to its neighbor is calculated using their pairwise interaction probabilistic factor and the current input of other messages regarding the likelihood of the rotamer states for the position (Figure 2C). Formally, at a given iteration, the message passed from variable $i$ to variable $j$ regarding $j$'s rotameric state ($r_j$) is:

$$m_{i \to j}(r_j) = \max_{r_i} \; e^{\frac{-E_i(r_i) - E_{ij}(r_i, r_j)}{T}} \prod_{k \in N(i) \setminus j} m_{k \to i}(r_i) \tag{9}$$

where $N(i)$ is the set of nodes neighboring variable $i$. Note that $m_{i \to j}$ is, in essence, a message vector of relative probabilities for all possible rotamers $r_j$, as determined at a specific iteration of the algorithm. In detail, messages are typically initialized uniformly. Next, messages are calculated using Eq. 9. Then, for each position whose input message vectors have changed, its output messages are recalculated (Eq. 9) and passed on to its neighbors. This procedure continues in an iterative manner until numeric convergence of all messages, or until a predetermined number of messages has been passed. Finally, max-marginal (MM) belief vectors are calculated as the product of all incoming message vectors and the singleton probabilistic factor:

$$MM_i(r_i) = e^{-E_i(r_i)/T} \prod_{k \in N(i)} m_{k \to i}(r_i) \tag{10}$$

where $MM_i(r_i)$ is the max-marginal belief of a particular rotamer $r_i \in \mathrm{Rots}_i$ at position $i$. In this paper, we apply max-product loopy belief propagation (BP) to find the maximal probability rotamer (and sequence) assignments (Eq. 8). Specifically, we use the max-marginal (MM) beliefs obtained by BP (Eq. 10) as approximations of the exact max-marginal probability values:

$$\Pr_i(r_i) = \max_{r' :\, r'_i = r_i} \Pr(r') \tag{11}$$

for which it can be shown (Lemma A.1) that assignment of:

$$r_i^* = \arg\max_{r_i \in \mathrm{Rots}_i} \Pr_i(r_i) \tag{12}$$

yields the most probable rotamer assignment $r^*$. The belief propagation algorithm was originally formulated for the case where the graphical model is a tree graph (i.e. no loops exist). However, since typical protein design problems will have numerous cycles (Figure 2B), we obtain (possibly) inexact MM and the sequence results are not guaranteed to be optimal. Nonetheless, loopy BP has been shown to be empirically successful in converging to optimal solutions when run on non-tree graphs (e.g. [44]). Furthermore, loopy BP has conceptual advantages over related (statistical) inference techniques, since it does not assume independence between design positions and yet largely prevents self-reinforcing feedback cycles that may lead to illogical or trivial fixed points. Self-consistent mean field, for example, is by contrast forced to make certain positional independence assumptions [39, 41].

tbmmf: Prediction of the M Minimal Energy Sequences

The novel algorithm described herein exploits the formulation for protein design described in the previous section and generalizes the BMMF (Best Max-Marginal First) algorithm of [49]. Conceptually, it partitions the search space while systematically excluding all previously determined minimal energy sequence assignments (Figure 4). In cases where loopy belief propagation (BP) yields exact max-marginal (MM) probabilities, the algorithm is guaranteed to find the top $M$ sequences for the protein design problem (Theorem A.5). We designate this algorithm as tbmmf (type-specific Best Max-Marginal First).

We define amino acid type constraints such that a position $i$ can either be unconstrained ($r_i$ is allowed to assume all rotamers in $\mathrm{Rots}_i$) or constrained to rotamers of specific amino acids (aa). Thus, for a given aa type $a$, a constraint can be positive ($r_i$ must be a rotamer of aa $a$: $r_i \in \mathrm{Rots}_i^a$) or negative ($r_i$ must be a rotamer of an aa other than $a$: $r_i \notin \mathrm{Rots}_i^a$). For a set of constraints $C$, we denote by $MM_p(r_p)|_C$ the max-marginal belief of rotamer $r_p$ at position $p$ obtained when enforcing the constraints in $C$. In practice, to constrain a position to a specific subset of rotamers, we zero out its singleton probabilistic factor for all other rotamers. Pseudocode for the novel tbmmf algorithm is presented in Figure 3 and demonstrated in Figure 4.
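The message update of Eq. 9, the beliefs of Eq. 10, and the constraint mechanism just described can be sketched together as follows. This is a minimal illustration and not the authors' implementation: it works in the energy (negative log) domain, so products become sums, maxima become minima, and a zeroed singleton factor becomes an infinite singleton energy. The function names, the synchronous update schedule, the per-message normalization, and the toy energies (patterned on Figure 1 and assumed to run from -15 to 0) are all assumptions made for the example.

```python
import math


def min_sum_bp(single, pair, max_iter=50):
    """Loopy BP in the energy (negative log) domain.

    single: list of dicts, single[i][r] = E_i(r); math.inf marks a rotamer
            excluded by a constraint (a zeroed singleton factor).
    pair:   dict mapping (i, j) to a dict {(r_i, r_j): E_ij(r_i, r_j)}.
    Returns per-position beliefs; a lower belief means a more probable rotamer.
    """
    nbrs = {i: set() for i in range(len(single))}
    for i, j in pair:
        nbrs[i].add(j)
        nbrs[j].add(i)

    def e_pair(i, ri, j, rj):
        if (i, j) in pair:
            return pair[(i, j)][(ri, rj)]
        return pair[(j, i)][(rj, ri)]

    # msgs[(i, j)][r_j]: message from i to j, initialised uniformly.
    msgs = {(i, j): {rj: 0.0 for rj in single[j]}
            for i in nbrs for j in nbrs[i]}
    for _ in range(max_iter):
        new = {}
        for i, j in msgs:
            # Negative log of Eq. 9: minimise over r_i instead of maximising.
            out = {
                rj: min(
                    single[i][ri] + e_pair(i, ri, j, rj)
                    + sum(msgs[(k, i)][ri] for k in nbrs[i] if k != j)
                    for ri in single[i]
                )
                for rj in single[j]
            }
            shift = min(out.values())  # normalise so the best entry is 0
            new[(i, j)] = {rj: v - shift for rj, v in out.items()}
        converged = new == msgs
        msgs = new
        if converged:
            break
    # Negative log of Eq. 10: singleton energy plus all incoming messages.
    return [
        {ri: single[i][ri] + sum(msgs[(k, i)][ri] for k in nbrs[i])
         for ri in single[i]}
        for i in range(len(single))
    ]


# Toy pair energies, lowest first; values ASSUMED to run from -15 to 0.
ORDERED = [
    ("g11", "h11"), ("g11", "h12"), ("g12", "h22"), ("g11", "h22"),
    ("g12", "h11"), ("g12", "h12"), ("g12", "h21"), ("g11", "h21"),
    ("g21", "h12"), ("g21", "h11"), ("g22", "h21"), ("g21", "h22"),
    ("g22", "h11"), ("g22", "h12"), ("g22", "h22"), ("g21", "h21"),
]
E_PAIR = {(0, 1): {p: float(rank - 15) for rank, p in enumerate(ORDERED)}}


def singles(excluded=()):
    """Zero singleton energies, with math.inf for constrained-out rotamers."""
    rots = [("g11", "g12", "g21", "g22"), ("h11", "h12", "h21", "h22")]
    return [{r: (math.inf if r in excluded else 0.0) for r in pos}
            for pos in rots]


def decode(beliefs):
    """Pick the best rotamer at each position (energy-domain form of Eq. 12)."""
    return tuple(min(b, key=b.get) for b in beliefs)


free = decode(min_sum_bp(singles(), E_PAIR))
constrained = decode(min_sum_bp(singles(excluded={"h11", "h12"}), E_PAIR))
```

On this two-node graph there are no loops, so the beliefs are exact up to an additive constant: the unconstrained run decodes to (g11, h11), and excluding both H1 rotamers at the second position shifts the optimum to (g12, h22). On graphs with cycles the same code runs unchanged, but the beliefs are then only approximations.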
Intuitively, at iteration $m$, the next lowest energy sequence must differ from all previous low energy sequences in at least one position. Consequently, we examine the constrained sub-spaces from which these sequences were derived. For each such space, we calculate the highest relative positional MM probability ("best MM", BMM), while excluding amino acids from the corresponding low energy sequence; this is performed to determine the next lowest energy sequence within each space, while excluding previously determined sequences. We then consider the constrained sub-space ($t_m$) for which the maximal BMM ($\mathrm{BMM}_{t_m}$) was

for m ← 1 to M do
    if m = 1 then
        Cons_m ← ∅
    else
        /* t_m, p_{t_m}, q_{t_m} are the sub-space, position, rotamer to yield the next lowest energy sequence */
        t_m ← argmax_{m' < m} BMM_{m'}
        a ← T(q_{t_m})                                          // aa type of q_{t_m}
        // Add pos. constraint to Cons_m:
        Cons_m ← Cons_{t_m} ∪ { r_{p_{t_m}} ∈ Rots_{p_{t_m}}^a }
        // Add neg. constraint to Cons_{t_m}:
        Cons_{t_m} ← Cons_{t_m} ∪ { r_{p_{t_m}} ∉ Rots_{p_{t_m}}^a }
        Run BP to obtain: MM_p(q)|Cons_{t_m}
        CalcBMM(t_m)                                            // calculate BMM_{t_m}
    end
    Run BP to obtain: MM_p(q)|Cons_m
    for i ← 1 to N do
        r_i^m ← argmax_{r_i ∈ Rots_i} MM_i(r_i)|Cons_m
        S_i^m ← T(r_i^m)                                        // ith aa of mth seq.
    end
    CalcBMM(m)                                                  // calculate BMM_m
end
return {S^m}, m = 1..M

/* Use MM_p(q)|Cons_n to calculate the BMM for constrained sub-space n */
Function CalcBMM(n)
    (p_n, q_n) ← argmax_{p,q: T(q) ≠ S_p^n} MM_p(q)|Cons_n
    BMM_n ← MM_{p_n}(q_n)|Cons_n
end

Figure 3: The tbmmf algorithm. The type-specific Best Max-Marginal First (tbmmf) algorithm for calculating the M lowest energy protein sequences: {S^m}. Cons_m denotes the constraint set that defines the sub-space from which S^m was derived as the minimal energy sequence. BMM = best max-marginal.

computed, as well as the maximizing position ($p_{t_m}$) and rotamer ($q_{t_m}$) associated with this BMM. The definition of max-marginal probabilities (Eq. 11) implies that $\mathrm{BMM}_{t_m}$ corresponds to the energy of the next lowest energy sequence. Moreover, it guarantees that this sequence can be found by choosing rotamer $q_{t_m}$ at position $p_{t_m}$, along with the constraints present in sub-space $t_m$. Therefore, this space is partitioned into two mutually exclusive sub-spaces:

1. The maximizing position ($p_{t_m}$) is constrained to be of the maximizing rotamer's ($q_{t_m}$) amino acid type. We determine the next lowest energy sequence ($S^m$) and its next best MM ($\mathrm{BMM}_m$) by running BP on this sub-space ($m$).
2. Position $p_{t_m}$ is constrained to not be of the maximizing amino acid type. We run BP on this sub-space ($t_m$) to update its next best MM ($\mathrm{BMM}_{t_m}$), to be considered by subsequent iterations.

Runs of BP are as described in Eqs. 9 and 10 and as illustrated in Figure 2C.

In Figure 4, the tbmmf algorithm is simulated for the protein design example from Figure 1:

m = 1: The lowest energy rotamer assignment, (g 11, h 11) [circle number 1], and its corresponding aa sequence, $S^1 = (G_1, H_1)$, are determined by running BP on the full rotamer space; this sequence has an energy of -15. The max-marginals calculated from this run of BP indicate that the next lowest energy sequence (marked by a star) can be obtained by constraining position 2 to rotamers of amino acid $H_2$ (marked in red) and will have an energy of -13 (since the BMM is proportional to $e^{13}$).

m = 2: The above constraint is added to the rotamer space and the next lowest energy sequence, $S^2 = (G_1, H_2)$, is calculated. Within this positively constrained sub-space, the next lowest energy sequence can be generated by constraining position 1 to amino acid $G_2$ and will have an energy of -5. The original space is now negatively constrained to exclude amino acid $H_2$ at position 2.
The previously observed sequence $S^1$ is excluded, so that the next lowest energy sequence in this sub-space would be derived from constraining position 1 to amino acid $G_2$ and have an energy of -7.

m = 3: From among the two choices of lowest sequence energies (-5 and -7) available in the constrained sub-spaces, $S^3 = (G_2, H_1)$ is found by choosing the sub-space ($t_3 = 1$) and corresponding constraint (amino acid $G_2$ at position 1) that yield the sequence of lower energy.
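The partitioning scheme walked through above can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the published implementation: the toy problem (three positions, two hypothetical amino acid types per position, random energies) is invented, and exact min-marginal energies computed by brute force stand in for the BP runs. Working with minimal energies instead of maximal probabilities is equivalent here, since maximizing $e^{-E/T}$ is the same as minimizing $E$.

```python
import itertools
import math
import random

rng = random.Random(0)
# Hypothetical toy problem: 3 positions, 4 rotamers each, two aa types per position.
TYPES = [["A", "A", "G", "G"], ["H", "H", "K", "K"], ["L", "L", "S", "S"]]
N = len(TYPES)
E1 = [[rng.uniform(-2, 2) for _ in t] for t in TYPES]
E2 = {(i, j): [[rng.uniform(-2, 2) for _ in TYPES[j]] for _ in TYPES[i]]
      for i in range(N) for j in range(i + 1, N)}

def energy(r):
    return (sum(E1[i][r[i]] for i in range(N))
            + sum(E2[i, j][r[i]][r[j]] for (i, j) in E2))

def min_marginals(allowed):
    """Exact min-marginal energies under type constraints (stand-in for a BP run)."""
    mm = [[math.inf] * len(t) for t in TYPES]
    ranges = [[r for r, a in enumerate(TYPES[i]) if a in allowed[i]] for i in range(N)]
    for r in itertools.product(*ranges):
        e = energy(r)
        for i in range(N):
            mm[i][r[i]] = min(mm[i][r[i]], e)
    return mm

def calc_bmm(mm, seq):
    """Best (energy, position, rotamer) whose aa type differs from seq at that position."""
    return min((mm[p][q], p, q) for p in range(N) for q in range(len(TYPES[p]))
               if TYPES[p][q] != seq[p])

def tbmmf(M):
    cons, seqs, bmm = [], [], []
    for m in range(M):
        if m == 0:
            new_cons = [set(t) for t in TYPES]              # unconstrained space
        else:
            t = min(range(len(seqs)), key=lambda n: bmm[n][0])  # sub-space with best BMM
            e, p, q = bmm[t]
            if e == math.inf:
                break                                       # no sequences left
            a = TYPES[p][q]
            new_cons = [set(s) for s in cons[t]]
            new_cons[p] = {a}                               # positive constraint
            cons[t][p] = cons[t][p] - {a}                   # negative constraint
            bmm[t] = calc_bmm(min_marginals(cons[t]), seqs[t])
        mm = min_marginals(new_cons)
        seq = tuple(TYPES[i][min(range(len(TYPES[i])), key=mm[i].__getitem__)]
                    for i in range(N))
        cons.append(new_cons)
        seqs.append(seq)
        bmm.append(calc_bmm(mm, seq))
    return seqs

def brute_force(M):
    """Reference: rank every sequence by its minimal rotameric energy."""
    best = {}
    for r in itertools.product(*[range(len(t)) for t in TYPES]):
        s = tuple(TYPES[i][r[i]] for i in range(N))
        best[s] = min(best.get(s, math.inf), energy(r))
    return sorted(best, key=best.get)[:M]
```

With exact marginals, `tbmmf(M)` should return the same ordered list as `brute_force(M)`. Reading each position's best rotamer independently, as done here, assumes the optimum in every sub-space is unique; ties would need explicit handling.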

Figure 4: tbmmf Run on the Example in Figure 1. At each iteration (horizontal axis), the selected sub-space (pairwise energy matrix) is partitioned into two complementary ones. For each sub-space in the hierarchy, rotamer assignments forbidden by its derived constraints are grayed out. In positively constrained spaces, numbered circles denote the sequence chosen at that iteration. In a given sub-space, the amino acid in red is that required to be positively constrained in order to yield the next lowest energy sequence; a star denotes that sequence. tbmmf parameters are depicted in mint (for simplicity, T = 1 and BMMs are unnormalized).

Figure 5: Benchmark Dataset of Protein Design Test Cases of Varying Characteristics

(A) Protein design data used for benchmarking.

| Size | Case | Design positions (chains a) | Shell b positions (chains a) | Sequence space cardinality (log 1) | Rotamer space cardinality (log 1) | td-dee c | Rotamer library read d | Rotamer library added e |
| Small | prion | 7 (A) | 7 (B) | | | | Full | χ1, χ2 |
| Small | SspB | 8 (A,C) | | | | | Full | χ1, χ2 |
| Medium | hgh-hghr 1 | 6 (A) | 135 (A,B) | | | | Full | χ1, χ2 |
| Medium | hgh-hghr 2 | 6 (A) | 135 (A,B) | | | | Full | χ1, χ2 |
| Medium | hgh-hghr 3 | 5 (A) | 136 (A,B) | | | | Full | χ1, χ2 |
| Medium | hgh-hghr 4 | 6 (A) | 135 (A,B) | | | | Full | χ1, χ2 |
| Medium | hgh-hghr 5 | 6 (A) | 135 (A,B) | | | | Full | χ1, χ2 |
| Medium | hgh-hghr 6 | 6 (A) | 135 (A,B) | | | | Full | χ1, χ2 |
| Large 1 | CaM-smMLCK | 24 (A) | 19 (B) | | | | Limited | χ1 |
| Large 1 | CaM-skMLCK | 24 (A) | 19 (B) | | | | Limited | χ1 |
| Large 2 | hgh-hghr | 35 (A) | 16 (A,B) | | | | Limited | χ1 |
| Large 2 | Top7 | 92 (A) | | | | | Limited | |

a Peptide chains to which the corresponding positions belong, labeled arbitrarily.
b Non-designed, conformationally varying positions.
c Rotamer space cardinality after application of type-dependent Goldstein DEE.
d Full: all rotamers read from library; Limited: highest probability rotamers read.
e Side-chain angles around which additional rotamers were super-sampled from library rotamers.

(B) The designed protein structures: designed residues are colored blue, conformationally varying positions yellow, and all others red. PDB identifiers and SCOP [53] structural classes are as marked.

| Protein | PDB identifier | SCOP class |
| prion | 1I4M | (a+b) |
| SspB | 1OU9 | all beta |
| hgh-hghr | 3HHR | all alpha |
| CaM-smMLCK | 1CDL | all alpha |
| CaM-skMLCK | 2BBM | all alpha |
| Top7 | 1QYS | (a+b) |

Results

To assess the tbmmf algorithm in relation to the state-of-the-art techniques previously available to solve this problem, we investigated 12 protein design problems of various sizes and qualities considered in earlier computational and experimental protein design studies [3, 4, 23, 5 52]. Firstly, we determine that tbmmf outperforms all other methods analyzed (see below) in yielding low energy sequence ensembles.
Furthermore, we find the space of near-optimal sequences to be highly homogeneous and demonstrate how to circumvent this self-similarity and provide a more diverse sequence ensemble. The test problems are delineated in Figure 5 and detailed in Methods. All energy calculations were performed using the Rosetta design energy function [24]. We also note that the rotamer set employed for each problem was the maximal set (while super-sampling rotamer configurations) under which Rosetta energy calculations remained feasible.

For all problems, we first applied the pre-processing of the rotamer space provided by type-dependent DEE [5]. This reduced rotamer space was used as input to all algorithms, except Ros (which was run directly within the Rosetta package, see below). We then applied each of the following algorithms to the protein design problems to predict the 1 lowest energy sequences:

tbmmf
A*: Generalized DEE/A* [31]
Ros: Rotamer space Monte Carlo simulated annealing (Rosetta, default parameters) [24]
SA: Monte Carlo simulated annealing over the sequence space, with inner loops of per-sequence rotamer space simulated annealing [22]

For A*, the principal steps of the HERO algorithm's protocol [3] were followed. Briefly, Goldstein DEE [26] was followed by 1-split and 2-split DEE, and then Magic Bullet 3-split and 4-split DEE [28]. Ros and SA were each run multiple times, starting from random initial sequences, such that the total run-time would be comparable to that of tbmmf. See Methods for the full details on all runs.

tbmmf Obtains Lower Energy Sequences

In Figure 6, success in recovering minimal energy sequences is demonstrated. Results are typical for their respective size categories; fuller results can be found in Table I. For all cases assessed, tbmmf decidedly outperforms the other algorithms in predicting a set of low energy sequences. The assessment was performed in the following manner. For each protein design problem, each algorithm predicted its top M = 1 sequences. In addition, for the M sequences predicted by Ros and SA, we subsequently ran belief propagation (BP) to determine the corresponding minimal rotameric energies (Eq. 2). We denote the calculation of these per-sequence minimal rotamer energies as Ros+ and SA+, respectively.
These BP calculations were performed to avoid unduly penalizing the Ros and SA sampling algorithms in cases where they may have in fact found low energy sequences without actually having found the minimal energy rotamer assignment for the sequence (e.g. see Figure 7). Finally, the top 1 sequence results output by the algorithm runs were pooled, and each such sequence was ranked according to the minimum rotameric energy it obtained

by any of the algorithms, including Ros+ and SA+. Subsequently, only the top 1 sequences from this pool are considered in the success rates, in which we measure what fraction of these 1 sequences were discovered by any given algorithm run (Figure 6).

[Figure 6 panels: Small (prion), Medium (hgh-hghr 1), Large 1 (CaM-smMLCK), Large 2 (hgh-hghr). Vertical axis: % Top Sequences. Bars: tbmmf, A* (prion only), Ros, SA.]

Figure 6: Assessment Results for Representative Protein Design Test Cases. Results of the tbmmf, A* (where feasible), Rosetta rotamer space simulated annealing (Ros), and sequence space simulated annealing (SA) algorithms for representative protein design problems. Note that A* was only feasible for the prion case. For each algorithm, the bar denotes the percentage of the top 1 sequences (output by any algorithm) obtained. For Ros and SA, the results are combined from multiple runs; see text for details.

Only in the prion protein design problem was A* feasible, provably finding all 1 minimal energy sequences. tbmmf and SA also yielded this optimal set, while Ros fared reasonably, discovering 86% of the minimal energy sequences. Thus, although not mathematically guaranteed, tbmmf obtained the complete set of lowest energy sequences in the case where such a set could be calculated by A*. For all other problems, however, A* was not feasible, so we do not have exact results with which to compare. Nonetheless, tbmmf was by far the most adept at finding the largest number of low energy sequences for these more realistic protein design problems. Table I depicts the full assessment results for all 12 test cases (columns marked "Top") and the computational run-times of the algorithms (columns marked "Time"); all algorithms were run on dual-CPU Linux machines. For the Small cases, Ros and SA performed comparably to tbmmf.
However, for all larger cases, the deterministic tbmmf algorithm vastly outperforms both Ros and SA in obtaining a larger number of lower energy solutions. Moreover, for all except the Small cases, Ros and SA were allowed to run significantly longer than tbmmf. Being random sampling algorithms, they could theoretically be run even longer to possibly achieve better results. But since tbmmf already provides superior results in less time (hours vs. days), it is clearly preferable.
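The pooling-and-ranking assessment described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function name `success_rates`, the input layout (one dictionary of sequence-to-energy per algorithm), and the example sequences and energies are all hypothetical. Each sequence is ranked by the lowest energy any algorithm obtained for it, and an algorithm is credited with every pooled top-M sequence that appears in its own output.

```python
def success_rates(results, M):
    """Fraction of the pooled top-M sequences that each algorithm discovered.

    results: {algorithm name: {sequence: minimal rotameric energy found}}
    """
    pooled = {}
    for found in results.values():
        for seq, e in found.items():
            # Rank each sequence by the minimum energy obtained by any algorithm.
            pooled[seq] = min(e, pooled.get(seq, float("inf")))
    top = set(sorted(pooled, key=pooled.get)[:M])
    return {name: len(top & set(found)) / len(top) for name, found in results.items()}

# Hypothetical outputs of two algorithms on a two-position design problem.
example = {
    "tbmmf": {"GH": -15.0, "GK": -13.0, "AH": -7.0},
    "SA": {"GH": -14.0, "AK": -5.0},
}
rates = success_rates(example, M=3)
```

In this made-up example the pooled top three sequences are GH, GK, and AH, so `rates["tbmmf"]` is 1.0 and `rates["SA"]` is one third.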

| Size | Case | tbmmf Top | tbmmf Time | Ros Top | Ros Time | SA Top | SA Time | A* Top a | A* Time | A* rotamer space: td-dee b | A* rotamer space: DEE c |
| Small | prion | 1% | 58.9 m | 86% | 9.3 h | 1% | 12 h | 1% | 3.4 m | | |
| Small | SspB | 1% | 11 h | 1% | 11.4 h | 97% | 9.6 h | 1% d | 3 d | | |
| Medium | hgh-hghr 1 | 88% | 13.4 h | 3% | 2.1 d | 2% | 7.3 d | failed | 12 d | | |
| Medium | hgh-hghr 2 | 6% | 7.6 h | 5% | 2 d | 0% | 5.9 d | failed | 12 d | | |
| Medium | hgh-hghr 3 | 1% | 4.1 h | 73% | 1.7 d | 0% | 5.9 d | failed | 12 d | | |
| Medium | hgh-hghr 4 | 1% | 8.5 h | 22% | 2.1 d | 0% | 7.4 d | failed | 12 d | | |
| Medium | hgh-hghr 5 | 1% | 2.9 h | 27% | 2 d | 0% | 5.8 d | failed | 12 d | | |
| Medium | hgh-hghr 6 | 1% | 8.5 h | 42% | 2.2 d | 0% | 6.1 d | failed | 12 d | | |
| Large 1 | CaM-smMLCK | 73% | 1.6 h | 18% | 18 h | 23% | 1 d | failed | 12 d | | |
| Large 1 | CaM-skMLCK | 1% | 2 h | 0% | 1.7 h | 0% | 2.7 h | failed | 7.2 d | | |
| Large 2 | hgh-hghr | 1% | 17.6 h | 0% | 2 d | 0% | 2.3 d | failed | 12 d | | |
| Large 2 | Top7 | 69% | 7.1 h | 31% | 1.5 d | 0% | 1.7 d | failed | 12 d | | |

a failed: DEE calculations were terminated after a time limit of 12 CPU days and/or the A* algorithm was terminated due to a lack of computer memory (4 GB limit).
b Rotamer space cardinality (log 1) after pre-processing by type-dependent Goldstein DEE.
c Rotamer space cardinality after application of generalized DEE (as part of DEE/A*).
d Using a DEE threshold of 0; non-zero thresholds failed.

Table I: Assessment and Analysis of the Algorithms Tested. Fraction of top sequences obtained and CPU run-times for all protein design test cases. For each design scenario, the highest fraction obtained is marked in boldface. For Ros and SA, the run-times are summed over all randomized runs (see Methods). m = minutes; h = hours; d = days. The rightmost columns indicate the reduction in the rotamer space achieved by the application of DEE in the A* algorithm. Note that for all but the prion case, a generalized DEE threshold of 0 was applied; see text for the details of the successive DEE criteria applied.

In the single case where we found A* to be successful in finding the top 1 sequences (the prion problem), a DEE threshold of .33 energy units (energy units approximate kcal/mol) was used. On the other hand, in all other cases even a threshold of 0 (i.e.
standard DEE) did not suffice to make DEE/A* empirically feasible. Specifically, in these cases, we found that most often the HERO DEE protocol [3] did not terminate within the 12 day time limit imposed (in which case, termination was forced). And, in any case, the A* algorithm was not empirically feasible for these cases, i.e. there was insufficient computer memory to obtain even the single lowest energy sequence; see Methods for full implementation details. The only exception was the Small SspB case, where the top sequence was found by A* after DEE pruning with a threshold of 0; larger DEE thresholds were not feasible (for the DEE and A* stages). Overall, we conclude that for the larger protein design problems, the DEE-based methods presented in [31, 32] are not applicable, since even finding the single lowest energy rotamer assignment for these problems using the sophisticated DEE criteria [28, 29] of the HERO algorithm was not possible even within a generous amount of time. We also emphasize that, since we actually wish to obtain the top 1 sequences, a non-zero DEE threshold would

still be required for these problems, which would leave the rotamer space resulting from the application of DEE even larger than that listed in Table I ("DEE" column), i.e. making A* even less likely to be feasible. Note that, for the prion case, the DEE threshold was chosen based on the tbmmf-output sequence energies, as the minimal threshold sufficient to maintain 1 distinct sequences within the rotamer space. Thus, the apparent run-time advantage of the A* algorithm is artificial in the sense that it is highly dependent on the choice of threshold; for example, without the information from tbmmf, one may need to run the algorithm multiple times, each time performing the DEE with an increasing threshold until 1 sequences can be provably found by A*. Thus, the only relevant conclusion for this case is that tbmmf recovers all solutions found by the exact A* approach. In the case of the SspB design, A* took 3 days to output the single lowest energy sequence, while all other methods predicted virtually all of the 1 lowest energy sequences in less than half a day.

It could be hypothesized that it may be relevant to utilize the power of even more stringent DEE criteria (e.g. the full suite of techniques in the HERO protocol [3]) to obtain the single lowest energy rotamer assignment, which could be used to seed the SA search for 1 sequences (as in [22]). However, there are several major hurdles to this approach. Firstly, the DEE stage would clearly require an inordinate run-time (possibly weeks to months). Furthermore, we have observed that even in cases when the lowest energy sequence was encountered during the SA run, the resulting sequence ensemble is still far from optimal (e.g. the CaM-smMLCK case in Figure 7).
Similarly, it has been shown [54] that such an optimal seed for sampling algorithms does not provide significant improvement; intuitively, this occurs since the SA system quickly diverges from the initial sequence, especially due to the high temperatures that exist for the initial SA stages.

As a final benchmark comparison, we applied the provably exact globally-flexible backbone DEE method of BD (with default parameters) [18]. However, even when using a cluster of 2 processors, the protein design calculations did not terminate for the Small or Large 1 cases after a time limit of 12 days. This lack of convergence for BD was not fully surprising, since the conformational spaces for which BD was previously shown to be successful [18] were orders of magnitude smaller (1 18 ) than those here (e.g. 1 2 for even the Small prion case). More importantly, the conformational space remaining after BD (input to the A* algorithm) in [18] was also much smaller than those here (1 1 vs ). Furthermore, the considerable


More information

Preference Elicitation for Group Decisions

Preference Elicitation for Group Decisions Preference Elicitation for Group Decisions Lihi Naamani-Dery 1, Inon Golan 2, Meir Kalech 2, and Lior Rokach 1 1 Telekom Innovation Laboratories at Ben-Gurion University, Israel 2 Ben Gurion University,

More information

Scheduling and Coordination of Distributed Design Projects

Scheduling and Coordination of Distributed Design Projects Scheduling and Coordination of Distributed Design Projects F. Liu, P.B. Luh University of Connecticut, Storrs, CT 06269-2157, USA B. Moser United Technologies Research Center, E. Hartford, CT 06108, USA

More information

LOGISTICAL ASPECTS OF THE SOFTWARE TESTING PROCESS

LOGISTICAL ASPECTS OF THE SOFTWARE TESTING PROCESS LOGISTICAL ASPECTS OF THE SOFTWARE TESTING PROCESS Kazimierz Worwa* * Faculty of Cybernetics, Military University of Technology, Warsaw, 00-908, Poland, Email: kazimierz.worwa@wat.edu.pl Abstract The purpose

More information

University Question Paper Two Marks

University Question Paper Two Marks University Question Paper Two Marks 1. List the application of Operations Research in functional areas of management. Answer: Finance, Budgeting and Investment Marketing Physical distribution Purchasing,

More information

Burstiness-aware service level planning for enterprise application clouds

Burstiness-aware service level planning for enterprise application clouds Youssef and Krishnamurthy Journal of Cloud Computing: Advances, Systems and Applications (2017) 6:17 DOI 10.1186/s13677-017-0087-y Journal of Cloud Computing: Advances, Systems and Applications RESEARCH

More information

Comparison of a Job-Shop Scheduler using Genetic Algorithms with a SLACK based Scheduler

Comparison of a Job-Shop Scheduler using Genetic Algorithms with a SLACK based Scheduler 1 Comparison of a Job-Shop Scheduler using Genetic Algorithms with a SLACK based Scheduler Nishant Deshpande Department of Computer Science Stanford, CA 9305 nishantd@cs.stanford.edu (650) 28 5159 June

More information

Using Decision Tree to predict repeat customers

Using Decision Tree to predict repeat customers Using Decision Tree to predict repeat customers Jia En Nicholette Li Jing Rong Lim Abstract We focus on using feature engineering and decision trees to perform classification and feature selection on the

More information

COORDINATING DEMAND FORECASTING AND OPERATIONAL DECISION-MAKING WITH ASYMMETRIC COSTS: THE TREND CASE

COORDINATING DEMAND FORECASTING AND OPERATIONAL DECISION-MAKING WITH ASYMMETRIC COSTS: THE TREND CASE COORDINATING DEMAND FORECASTING AND OPERATIONAL DECISION-MAKING WITH ASYMMETRIC COSTS: THE TREND CASE ABSTRACT Robert M. Saltzman, San Francisco State University This article presents two methods for coordinating

More information

9. Verification, Validation, Testing

9. Verification, Validation, Testing 9. Verification, Validation, Testing (a) Basic Notions (b) Dynamic testing. (c) Static analysis. (d) Modelling. (e) Environmental Simulation. (f) Test Strategies. (g) Tool support. (h) Independent Verification

More information

Reaction Paper Influence Maximization in Social Networks: A Competitive Perspective

Reaction Paper Influence Maximization in Social Networks: A Competitive Perspective Reaction Paper Influence Maximization in Social Networks: A Competitive Perspective Siddhartha Nambiar October 3 rd, 2013 1 Introduction Social Network Analysis has today fast developed into one of the

More information

Irrigation network design and reconstruction and its analysis by simulation model

Irrigation network design and reconstruction and its analysis by simulation model SSP - JOURNAL OF CIVIL ENGINEERING Vol. 9, Issue 1, 2014 DOI: 10.2478/sspjce-2014-0001 Irrigation network design and reconstruction and its analysis by simulation model Milan Čistý, Zbynek Bajtek, Anna

More information

The Interaction-Interaction Model for Disease Protein Discovery

The Interaction-Interaction Model for Disease Protein Discovery The Interaction-Interaction Model for Disease Protein Discovery Ken Cheng December 10, 2017 Abstract Network medicine, the field of using biological networks to develop insight into disease and medicine,

More information

Generative Models for Networks and Applications to E-Commerce

Generative Models for Networks and Applications to E-Commerce Generative Models for Networks and Applications to E-Commerce Patrick J. Wolfe (with David C. Parkes and R. Kang-Xing Jin) Division of Engineering and Applied Sciences Department of Statistics Harvard

More information

On Optimal Tiered Structures for Network Service Bundles

On Optimal Tiered Structures for Network Service Bundles On Tiered Structures for Network Service Bundles Qian Lv, George N. Rouskas Department of Computer Science, North Carolina State University, Raleigh, NC 7695-86, USA Abstract Network operators offer a

More information

Ant Colony Optimisation

Ant Colony Optimisation Ant Colony Optimisation Alexander Mathews, Angeline Honggowarsito & Perry Brown 1 Image Source: http://baynature.org/articles/the-ants-go-marching-one-by-one/ Contents Introduction to Ant Colony Optimisation

More information

Rank hotels on Expedia.com to maximize purchases

Rank hotels on Expedia.com to maximize purchases Rank hotels on Expedia.com to maximize purchases Nishith Khantal, Valentina Kroshilina, Deepak Maini December 14, 2013 1 Introduction For an online travel agency (OTA), matching users to hotel inventory

More information

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University Machine learning applications in genomics: practical issues & challenges Yuzhen Ye School of Informatics and Computing, Indiana University Reference Machine learning applications in genetics and genomics

More information

MBF1413 Quantitative Methods

MBF1413 Quantitative Methods MBF1413 Quantitative Methods Prepared by Dr Khairul Anuar 1: Introduction to Quantitative Methods www.notes638.wordpress.com Assessment Two assignments Assignment 1 -individual 30% Assignment 2 -individual

More information

CSE : Computational Issues in Molecular Biology. Lecture 19. Spring 2004

CSE : Computational Issues in Molecular Biology. Lecture 19. Spring 2004 CSE 397-497: Computational Issues in Molecular Biology Lecture 19 Spring 2004-1- Protein structure Primary structure of protein is determined by number and order of amino acids within polypeptide chain.

More information

Book Outline. Software Testing and Analysis: Process, Principles, and Techniques

Book Outline. Software Testing and Analysis: Process, Principles, and Techniques Book Outline Software Testing and Analysis: Process, Principles, and Techniques Mauro PezzèandMichalYoung Working Outline as of March 2000 Software test and analysis are essential techniques for producing

More information

College of information technology Department of software

College of information technology Department of software University of Babylon Undergraduate: third class College of information technology Department of software Subj.: Application of AI lecture notes/2011-2012 ***************************************************************************

More information

Clock-Driven Scheduling

Clock-Driven Scheduling NOTATIONS AND ASSUMPTIONS: UNIT-2 Clock-Driven Scheduling The clock-driven approach to scheduling is applicable only when the system is by and large deterministic, except for a few aperiodic and sporadic

More information

Disentangling Prognostic and Predictive Biomarkers Through Mutual Information

Disentangling Prognostic and Predictive Biomarkers Through Mutual Information Informatics for Health: Connected Citizen-Led Wellness and Population Health R. Randell et al. (Eds.) 2017 European Federation for Medical Informatics (EFMI) and IOS Press. This article is published online

More information

Predicting Purchase Behavior of E-commerce Customer, One-stage or Two-stage?

Predicting Purchase Behavior of E-commerce Customer, One-stage or Two-stage? 2016 International Conference on Artificial Intelligence and Computer Science (AICS 2016) ISBN: 978-1-60595-411-0 Predicting Purchase Behavior of E-commerce Customer, One-stage or Two-stage? Chen CHEN

More information

Software Next Release Planning Approach through Exact Optimization

Software Next Release Planning Approach through Exact Optimization Software Next Release Planning Approach through Optimization Fabrício G. Freitas, Daniel P. Coutinho, Jerffeson T. Souza Optimization in Software Engineering Group (GOES) Natural and Intelligent Computation

More information

Computational Methods for Protein Structure Prediction

Computational Methods for Protein Structure Prediction Computational Methods for Protein Structure Prediction Ying Xu 2017/12/6 1 Outline introduction to protein structures the problem of protein structure prediction why it is possible to predict protein structures

More information

Fixed vs. Self-Adaptive Crossover-First Differential Evolution

Fixed vs. Self-Adaptive Crossover-First Differential Evolution Applied Mathematical Sciences, Vol. 10, 2016, no. 32, 1603-1610 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2016.6377 Fixed vs. Self-Adaptive Crossover-First Differential Evolution Jason

More information

PERFORMANCE EVALUATION OF GENETIC ALGORITHMS ON LOADING PATTERN OPTIMIZATION OF PWRS

PERFORMANCE EVALUATION OF GENETIC ALGORITHMS ON LOADING PATTERN OPTIMIZATION OF PWRS International Conference Nuclear Energy in Central Europe 00 Hoteli Bernardin, Portorož, Slovenia, September 0-3, 00 www: http://www.drustvo-js.si/port00/ e-mail: PORT00@ijs.si tel.:+ 386 588 547, + 386

More information

On of the major merits of the Flag Model is its potential for representation. There are three approaches to such a task: a qualitative, a

On of the major merits of the Flag Model is its potential for representation. There are three approaches to such a task: a qualitative, a Regime Analysis Regime Analysis is a discrete multi-assessment method suitable to assess projects as well as policies. The strength of the Regime Analysis is that it is able to cope with binary, ordinal,

More information

Protein Structure Prediction. christian studer , EPFL

Protein Structure Prediction. christian studer , EPFL Protein Structure Prediction christian studer 17.11.2004, EPFL Content Definition of the problem Possible approaches DSSP / PSI-BLAST Generalization Results Definition of the problem Massive amounts of

More information

Generational and steady state genetic algorithms for generator maintenance scheduling problems

Generational and steady state genetic algorithms for generator maintenance scheduling problems Generational and steady state genetic algorithms for generator maintenance scheduling problems Item Type Conference paper Authors Dahal, Keshav P.; McDonald, J.R. Citation Dahal, K. P. and McDonald, J.

More information

Textbook Reading Guidelines

Textbook Reading Guidelines Understanding Bioinformatics by Marketa Zvelebil and Jeremy Baum Last updated: May 1, 2009 Textbook Reading Guidelines Preface: Read the whole preface, and especially: For the students with Life Science

More information

Molecular Structures

Molecular Structures Molecular Structures 1 Molecular structures 2 Why is it important? Answers to scientific questions such as: What does the structure of protein X look like? Can we predict the binding of molecule X to Y?

More information

Suppl. Figure 1: RCC1 sequence and sequence alignments. (a) Amino acid

Suppl. Figure 1: RCC1 sequence and sequence alignments. (a) Amino acid Supplementary Figures Suppl. Figure 1: RCC1 sequence and sequence alignments. (a) Amino acid sequence of Drosophila RCC1. Same colors are for Figure 1 with sequence of β-wedge that interacts with Ran in

More information

GENETIC ALGORITHMS. Narra Priyanka. K.Naga Sowjanya. Vasavi College of Engineering. Ibrahimbahg,Hyderabad.

GENETIC ALGORITHMS. Narra Priyanka. K.Naga Sowjanya. Vasavi College of Engineering. Ibrahimbahg,Hyderabad. GENETIC ALGORITHMS Narra Priyanka K.Naga Sowjanya Vasavi College of Engineering. Ibrahimbahg,Hyderabad mynameissowji@yahoo.com priyankanarra@yahoo.com Abstract Genetic algorithms are a part of evolutionary

More information

A TABU SEARCH METAHEURISTIC FOR ASSIGNMENT OF FLOATING CRANES

A TABU SEARCH METAHEURISTIC FOR ASSIGNMENT OF FLOATING CRANES 1 st Logistics International Conference Belgrade, Serbia 28 - November 13 A TABU SEARCH METAHEURISTIC FOR ASSIGNMENT OF FLOATING CRANES Dragana M. Drenovac * University of Belgrade, Faculty of Transport

More information

Technical Bulletin Comparison of Lossy versus Lossless Shift Factors in the ISO Market Optimizations

Technical Bulletin Comparison of Lossy versus Lossless Shift Factors in the ISO Market Optimizations Technical Bulletin 2009-06-03 Comparison of Lossy versus Lossless Shift Factors in the ISO Market Optimizations June 15, 2009 Comparison of Lossy versus Lossless Shift Factors in the ISO Market Optimizations

More information

Consumer Referral in a Small World Network

Consumer Referral in a Small World Network Consumer Referral in a Small World Network Tackseung Jun 1 Beom Jun Kim 2 Jeong-Yoo Kim 3 August 8, 2004 1 Department of Economics, Kyung Hee University, 1 Hoegidong, Dongdaemunku, Seoul, 130-701, Korea.

More information

ROAD TO STATISTICAL BIOINFORMATICS CHALLENGE 1: MULTIPLE-COMPARISONS ISSUE

ROAD TO STATISTICAL BIOINFORMATICS CHALLENGE 1: MULTIPLE-COMPARISONS ISSUE CHAPTER1 ROAD TO STATISTICAL BIOINFORMATICS Jae K. Lee Department of Public Health Science, University of Virginia, Charlottesville, Virginia, USA There has been a great explosion of biological data and

More information

Learning by Observing

Learning by Observing Working Papers in Economics Learning by Observing Efe Postalcı, zmir University of Economics Working Paper #10/07 December 2010 Izmir University of Economics Department of Economics Sakarya Cad. No:156

More information

TIMETABLING EXPERIMENTS USING GENETIC ALGORITHMS. Liviu Lalescu, Costin Badica

TIMETABLING EXPERIMENTS USING GENETIC ALGORITHMS. Liviu Lalescu, Costin Badica TIMETABLING EXPERIMENTS USING GENETIC ALGORITHMS Liviu Lalescu, Costin Badica University of Craiova, Faculty of Control, Computers and Electronics Software Engineering Department, str.tehnicii, 5, Craiova,

More information

Gene Expression Data Analysis

Gene Expression Data Analysis Gene Expression Data Analysis Bing Zhang Department of Biomedical Informatics Vanderbilt University bing.zhang@vanderbilt.edu BMIF 310, Fall 2009 Gene expression technologies (summary) Hybridization-based

More information

A simulation-based risk analysis technique to determine critical assets in a logistics plan

A simulation-based risk analysis technique to determine critical assets in a logistics plan 19th International Congress on Modelling and Simulation, Perth, Australia, 12 16 December 2011 http://mssanz.org.au/modsim2011 A simulation-based risk analysis technique to determine critical assets in

More information

Inventory Lot Sizing with Supplier Selection

Inventory Lot Sizing with Supplier Selection Inventory Lot Sizing with Supplier Selection Chuda Basnet Department of Management Systems The University of Waikato, Private Bag 315 Hamilton, New Zealand chuda@waikato.ac.nz Janny M.Y. Leung Department

More information

Optimizing multiple spaced seeds for homology search

Optimizing multiple spaced seeds for homology search Optimizing multiple spaced seeds for homology search Jinbo Xu, Daniel G. Brown, Ming Li, and Bin Ma School of Computer Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada j3xu,browndg,mli

More information

Evolutionary Algorithms

Evolutionary Algorithms Evolutionary Algorithms Evolutionary Algorithms What is Evolutionary Algorithms (EAs)? Evolutionary algorithms are iterative and stochastic search methods that mimic the natural biological evolution and/or

More information

Drift versus Draft - Classifying the Dynamics of Neutral Evolution

Drift versus Draft - Classifying the Dynamics of Neutral Evolution Drift versus Draft - Classifying the Dynamics of Neutral Evolution Alison Feder December 3, 203 Introduction Early stages of this project were discussed with Dr. Philipp Messer Evolutionary biologists

More information

Genetic Algorithms For Protein Threading

Genetic Algorithms For Protein Threading From: ISMB-98 Proceedings. Copyright 1998, AAAI (www.aaai.org). All rights reserved. Genetic Algorithms For Protein Threading Jacqueline Yadgari #, Amihood Amir #, Ron Unger* # Department of Mathematics

More information

Molecular Structures

Molecular Structures Molecular Structures 1 Molecular structures 2 Why is it important? Answers to scientific questions such as: What does the structure of protein X look like? Can we predict the binding of molecule X to Y?

More information

Exploring Long DNA Sequences by Information Content

Exploring Long DNA Sequences by Information Content Exploring Long DNA Sequences by Information Content Trevor I. Dix 1,2, David R. Powell 1,2, Lloyd Allison 1, Samira Jaeger 1, Julie Bernal 1, and Linda Stern 3 1 Faculty of I.T., Monash University, 2 Victorian

More information

ReCombinatorics. The Algorithmics and Combinatorics of Phylogenetic Networks with Recombination. Dan Gusfield

ReCombinatorics. The Algorithmics and Combinatorics of Phylogenetic Networks with Recombination. Dan Gusfield ReCombinatorics The Algorithmics and Combinatorics of Phylogenetic Networks with Recombination! Dan Gusfield NCBS CS and BIO Meeting December 19, 2016 !2 SNP Data A SNP is a Single Nucleotide Polymorphism

More information

Revision confidence limits for recent data on trend levels, trend growth rates and seasonally adjusted levels

Revision confidence limits for recent data on trend levels, trend growth rates and seasonally adjusted levels W O R K I N G P A P E R S A N D S T U D I E S ISSN 1725-4825 Revision confidence limits for recent data on trend levels, trend growth rates and seasonally adjusted levels Conference on seasonality, seasonal

More information

A Propagation-based Algorithm for Inferring Gene-Disease Associations

A Propagation-based Algorithm for Inferring Gene-Disease Associations A Propagation-based Algorithm for Inferring Gene-Disease Associations Oron Vanunu Roded Sharan Abstract: A fundamental challenge in human health is the identification of diseasecausing genes. Recently,

More information

Distributed Algorithms for Resource Allocation Problems. Mr. Samuel W. Brett Dr. Jeffrey P. Ridder Dr. David T. Signori Jr 20 June 2012

Distributed Algorithms for Resource Allocation Problems. Mr. Samuel W. Brett Dr. Jeffrey P. Ridder Dr. David T. Signori Jr 20 June 2012 Distributed Algorithms for Resource Allocation Problems Mr. Samuel W. Brett Dr. Jeffrey P. Ridder Dr. David T. Signori Jr 20 June 2012 Outline Survey of Literature Nature of resource allocation problem

More information

Experimental design of RNA-Seq Data

Experimental design of RNA-Seq Data Experimental design of RNA-Seq Data RNA-seq course: The Power of RNA-seq Thursday June 6 th 2013, Marco Bink Biometris Overview Acknowledgements Introduction Experimental designs Randomization, Replication,

More information