[4] SCHEMA-Guided Protein Recombination

By Jonathan J. Silberg, Jeffrey B. Endelman, and Frances H. Arnold

Introduction

SCHEMA is a scoring function that predicts which elements in homologous proteins can be swapped without disturbing the integrity of the structure [1]. Using the structural coordinates of the parent proteins, SCHEMA identifies pairs of residues that are interacting and determines the number of interactions, E, that are broken when a chimeric protein inherits portions of its sequence from different parents.

E appears to be a good metric for anticipating structural conservation when homologous proteins are recombined. Analysis of well-defined libraries of β-lactamase chimeras revealed that chimeras with low E retained function with higher probability than chimeras with the same effective level of mutation but higher E or chosen at random [2]. Another study showed that E is also a useful measure for anticipating disruption in chimeras of a larger, cofactor-containing protein, cytochrome P450 [3].

Using SCHEMA, libraries of chimeras can be compared in silico to determine which one is expected to contain the highest fraction of folded (and potentially interesting) sequences for laboratory evolution studies [2-4]. These libraries can be synthesized in vitro using site-directed recombination methods (see Fig. 1), which allow for the simultaneous recombination of two or more parents at specified locations [2,5]. This approach can be used to make chimeric libraries from any parent sequences. In addition, the sequence diversity of folded and functional chimeras encoded in the library can be controlled, i.e., the number of possible unique sequences and the average level of mutation of chimeras predicted to retain structure can be used to guide the selection of crossover locations and crossover number.

In contrast, annealing-based recombination or DNA shuffling techniques, such as Stemmer shuffling [6,7], StEP [8], and in vivo methods [9], generate crossovers only in regions of sequence identity and therefore cannot generate diverse libraries from more distant parent sequences. The sequence-independent random recombination methods now available (SHIPREC [10], ITCHY [11], or SCRATCHY [12]) do not make multiple crossovers efficiently and therefore create libraries of very limited diversity.

This article outlines the procedure used for calculating E for a chimera and discusses ideas for optimizing the design of combinatorial libraries for directed evolution.

Fig. 1. Library synthesis by site-directed recombination. Sequence elements encoding structurally related polypeptides are swapped at defined locations (dashed lines) in two or more homologous proteins. This yields a library containing $h^y - h$ unique chimeras, where h is the number of parents recombined and y is the number of sequence elements that are exchanged.

[1] C. A. Voigt, C. Martinez, Z. G. Wang, S. L. Mayo, and F. H. Arnold, Nature Struct. Biol. 9, 553 (2002).
[2] M. M. Meyer, J. J. Silberg, C. A. Voigt, J. B. Endelman, S. L. Mayo, Z. G. Wang, and F. H. Arnold, Protein Sci. 12, 1686 (2003).
[3] C. R. Otey, J. J. Silberg, C. A. Voigt, J. B. Endelman, G. Bandara, and F. H. Arnold, Chem. Biol. 11, 309 (2004).
[4] D. A. Drummond, J. J. Silberg, J. B. Endelman, C. A. Wilke, and F. H. Arnold, submitted for publication.
[5] K. Hiraga and F. H. Arnold, J. Mol. Biol. 330, 287 (2003).
[6] W. P. Stemmer, Proc. Natl. Acad. Sci. USA 91, (1994).
[7] W. P. Stemmer, Nature 370, 389 (1994).
[8] H. Zhao, L. Giver, Z. Shao, J. A. Affholter, and F. H. Arnold, Nature Biotechnol. 16, 258 (1998).
[9] A. A. Volkov, Z. Shao, and F. H. Arnold, Nucleic Acids Res. 27, e18 (1999).
[10] V. Sieber, A. Plückthun, and F. X. Schmid, Nature Biotechnol. 16, 955 (1998).
[11] M. Ostermeier, A. E. Nixon, and S. J. Benkovic, Bioorg. Med. Chem. 7, 2139 (1999).
[12] S. Lutz, M. Ostermeier, G. L. Moore, C. D. Maranas, and S. J. Benkovic, Proc. Natl. Acad. Sci. USA 98, (2001).

METHODS IN ENZYMOLOGY. Copyright 2004, Elsevier Inc. All rights reserved.

Methods

Calculating SCHEMA Disruption

Based on the structure of the parent proteins, SCHEMA determines which residues are interacting, defined as those residues within a cutoff distance, and generates a contact matrix [1]. When recombining two parents, the contacts are scaled by the sequence identity of the parents being recombined, i.e., all contacts that cannot be broken by recombination are removed from the matrix. E is determined by counting the number of contacts broken when a chimeric protein inherits portions of its sequence from different parents. The SCHEMA disruption E of a chimeric sequence s, made by recombining sequence elements from h homologous proteins, is given by

$$E = \sum_{i=1}^{N} \sum_{j=i+1}^{N} C_{ij}\, P(i, j, s_i, s_j), \tag{1}$$

where N is the number of residues that have defined coordinates in the parental structure, $C_{ij} = 1$ if residues i and j are within the cutoff distance $d_c$ (otherwise $C_{ij} = 0$), and $s_i$ designates the parent incorporated at position i in the chimera (e.g., $s_i = 1$ if the sequence is derived from parent #1, $s_i = 2$ if derived from parent #2, etc.). $A(s_i, k)$ is the identity of the residue in parent $s_i$ at position k within the parental amino acid sequence, and $P(i, j, s_i, s_j) = 0$ if any parent has residue $A(s_i, i)$ at position i and residue $A(s_j, j)$ at position j [otherwise $P(i, j, s_i, s_j) = 1$]. It is essential that structurally related residues in each parent are numbered identically, e.g., $A(1, k)$ and $A(2, k)$ should represent structurally related residues in each parent, to ensure that sequence identities among the parents are properly accounted for when calculating E.

Typically, we use $d_c = 4.5$ Å, and hydrogen, backbone nitrogen, and backbone oxygen atoms are excluded from the calculation of E. Small deviations from this value of $d_c$ or the use of all atoms to calculate $C_{ij}$ do not significantly affect the relative E of chimeras being compared, although the magnitude of E changes.
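As an illustration of Eq. (1), the following Python sketch builds a contact matrix and counts the contacts broken in one chimera. It is a minimal sketch under stated assumptions, not the authors' software. The function names `contact_matrix` and `schema_disruption` are hypothetical. Two residues are taken to be in contact when any pair of their retained atoms lies within $d_c$, which is one possible reading of the cutoff criterion. Parents are indexed from 0 rather than 1, the parent sequences are assumed to be pre-aligned strings of equal length, and the caller is assumed to have already removed hydrogen, backbone nitrogen, and backbone oxygen atoms.

```python
import math
from itertools import combinations


def contact_matrix(residue_atoms, d_c=4.5):
    """Return C with C[i][j] = 1 when residues i and j are in contact.

    residue_atoms[i] is a list of (x, y, z) tuples for residue i.
    Assumption: a contact means any atom pair within d_c angstroms.
    """
    n = len(residue_atoms)
    C = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if any(math.dist(a, b) <= d_c
               for a in residue_atoms[i] for b in residue_atoms[j]):
            C[i][j] = C[j][i] = 1
    return C


def schema_disruption(C, parents, s):
    """Evaluate Eq. (1) for the chimera described by s.

    parents: aligned parent sequences (equal-length strings).
    s: s[i] is the 0-based index of the parent used at position i.
    """
    E = 0
    for i, j in combinations(range(len(s)), 2):
        if not C[i][j]:
            continue
        a_i = parents[s[i]][i]  # residue the chimera carries at position i
        a_j = parents[s[j]][j]  # residue the chimera carries at position j
        # P = 0 when some parent already has this residue pair; else P = 1.
        if not any(p[i] == a_i and p[j] == a_j for p in parents):
            E += 1
    return E
```

For two toy parents "ACDE" and "AGHE" whose only contacts are between sequence neighbors, the chimera `[0, 0, 1, 1]` breaks a single contact (between positions 1 and 2, 0-based), whereas either parent gives E = 0.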
When cofactor-containing proteins are recombined, contacts between the cofactor and residues in the proteins are also excluded from calculation of E [3]. In this simple model, contacts between the cofactor and the protein cannot be broken by the recombination of related proteins.

Ideally, PDB coordinates for all the parent sequences are available, and a structure-based alignment is performed. For parents whose sequences differ in length, this ensures that structurally related residues in each parent are numbered identically. This can be done using free software packages such as Swiss-PdbViewer or the combinatorial-extension algorithm [13,14]. If structural coordinates are available for different conformational states of the parents being recombined, it is best to assess E using the coordinates for each conformation to ensure that both states of the chimeras are likely to exhibit similarly low disruption. When the structure of only one parent is available, sequence alignments can be performed using the BLAST algorithm [15].

Often alignment of the parents requires the insertion of gaps within the primary amino acid sequence of one or more of the parents (see Fig. 2). When gaps are introduced into the parent whose structural coordinates are being used to generate the contact matrix $C_{ij}$, the residues found in the other parents at those positions are ignored when calculating E because there is no corresponding structural information. In contrast, when gaps occur in any parent other than the one used for structural information, they are treated like real residues that differ in identity from the residues in the other parents.

Fig. 2. Treatment of gaps in sequence alignments. A hypothetical sequence alignment used in SCHEMA calculations is shown. PDB coordinates corresponding to the top sequence are being used by SCHEMA to calculate $C_{ij}$, and the bottom sequence represents a homologous protein for which no structural information is available. At position 1, the atomic coordinates of glycine are defined in the PDB file, so the gap in the second parent is treated as a mutation relative to G when computing E and m. Because there are no coordinates for position 2, it is ignored when computing E and m.

From Disruption to Probabilities

The fraction of chimeras retaining function has been found to decrease exponentially with E [2].
If we posit that any disrupted contact has a probability $f_d$ of yielding a nonfunctional chimera and each contact acts statistically independently of the others, the fraction of chimeras at each E predicted to retain function is given by $P_f = (1 - f_d E/N)^N$. In this case, N equals the total number of contacts that could be broken by recombination. When N becomes large, as with proteins, this equation can be approximated by a simple exponential, $P_f = e^{-f_d E}$. This relationship between E and $P_f$ is likely to hold for the recombination of any homologous proteins. However, functional data from different chimeric libraries may yield a range of $f_d$ values. We have found that the sensitivity of the functional assay alone can significantly affect the $P_f$ at each E. A more sensitive functional screen for the conservation of β-lactamase function, for example, identified more functional chimeras and yielded a higher value of $P_f$ at each E (and lower $f_d$) than a functional selection [2,5]. We also expect $f_d$ to depend on the protein scaffold.

The value of $f_d$ can be calibrated rapidly for any system by analyzing folding or function of a small population of chimeras with known E [3]. From approximately 10 to 20 chimeras that encompass a broad range of disruption, we have observed that those exhibiting the highest levels of E are mostly nonfunctional and those with the lowest E are almost all functional. From such a population of chimeras, the E at which chimeras exhibit $P_f = 0.5$ (designated $E_{1/2}$) can be estimated, and $f_d$ follows from setting $e^{-f_d E_{1/2}} = 0.5$, i.e., $f_d = \ln(2)/E_{1/2}$.

[13] N. Guex and M. C. Peitsch, Electrophoresis 18, 2714 (1997).
[14] I. N. Shindyalov and P. E. Bourne, Protein Eng. 11, 739 (1998).
[15] S. Henikoff and J. G. Henikoff, Proc. Natl. Acad. Sci. USA 89, (1992).

SCHEMA-Guided Library Design

Because screening and selection strategies can evaluate only a limited number of protein variants for altered functions, we would like to make libraries that are enriched in folded chimeras. SCHEMA can help identify such libraries by computing the fraction of chimeras F expected to retain the parental function (and fold) in different libraries arising from recombination of the same parents at different crossover locations. The fraction of folded chimeras in a library is given by

$$F = \frac{1}{n} \sum_{i=1}^{n} e^{-f_d E_i}, \tag{2}$$

where n is the number of unique chimeras present.

Libraries with the highest possible F are not necessarily preferred for directed evolution. An additional issue to consider when choosing a library is the sequence diversity D of the functional chimeras in that library. The D of a library describes the average mutation level m of folded chimeras in that library, where m is the amino acid Hamming distance of each chimera to its closest parent.
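The mutation level m defined above can be computed directly from aligned sequences. The sketch below is illustrative only: `chimera_sequence` and `mutation_level` are hypothetical names, parents are indexed from 0, and the parent sequences are assumed to be aligned strings of equal length. The gap treatment of Fig. 2 is not implemented here.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def chimera_sequence(parents, s):
    """Assemble the chimera: position i is copied from parent s[i]."""
    return "".join(parents[p][i] for i, p in enumerate(s))


def mutation_level(parents, s):
    """m: Hamming distance from the chimera to its closest parent."""
    seq = chimera_sequence(parents, s)
    return min(hamming(seq, p) for p in parents)
```

Because m is measured to the closest parent, a chimera that takes half of its differing positions from each of two parents has the largest m available for that pair.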
D is calculated from

$$D = \frac{1}{F}\,\frac{1}{n} \sum_{i=1}^{n} m_i\, e^{-f_d E_i}. \tag{3}$$

Unfortunately, we know little about the effect of m on the evolution of function. Studies examining the effect of m on the acquisition of novel functional properties in laboratory evolution studies suggest that variants with higher m are more likely to exhibit altered functional properties, whereas those with lower m tend to be more similar to the parents [3,16,17].
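Returning briefly to Eqs. (2) and (3): given the disruption $E_i$ and mutation level $m_i$ of every chimera in a library, F and D are simple weighted averages. The following sketch is one way to evaluate them, together with the calibration $f_d = \ln(2)/E_{1/2}$ described earlier. The function names are hypothetical, and the input lists are assumed to hold one entry per unique chimera in the library.

```python
import math


def f_d_from_half(E_half):
    """Calibrate f_d from the disruption at which P_f = 0.5."""
    return math.log(2) / E_half


def fraction_folded(E_values, f_d):
    """Eq. (2): mean of exp(-f_d * E_i) over the n chimeras."""
    return sum(math.exp(-f_d * E) for E in E_values) / len(E_values)


def diversity(E_values, m_values, f_d):
    """Eq. (3): average m weighted by the folding probability."""
    n = len(E_values)
    F = fraction_folded(E_values, f_d)
    weighted = sum(m * math.exp(-f_d * E)
                   for E, m in zip(E_values, m_values))
    return weighted / (F * n)
```

For a toy two-member library with E = (0, 10), m = (0, 20), and $E_{1/2} = 10$, the folding weights are 1 and 0.5, so F = 0.75 and D = 10/1.5, roughly 6.7 mutations per folded sequence.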

In the case of cytochrome P450 chimeras, chimeras exhibiting altered substrate specificity had an average m of 34, whereas chimeras that displayed a parent-like substrate specificity had a lower average m.

Because E also depends on m, there is a trade-off between F and D. Chimeras with low E on average also have low m; therefore, libraries that simply maximize F have limited diversity. If one can enumerate all possible libraries, then the library design problem is easy: simply choose the one with the desired combination of D and F. Exhaustive enumeration, however, is not feasible with today's computers for average-sized proteins and the library sizes appropriate for directed evolution (thousands or more chimeras). Consider the design of a 10-crossover library between the β-lactamases PSE-4 and TEM-1, which have similar overall structure, similar length (265 residues), and 40% sequence identity. The possible libraries, each with $2^{11} = 2048$ unique chimeric sequences, are too many for exhaustive enumeration. In this case, one can make progress by evaluating D and F for thousands, or even millions, of random libraries. Figure 3 shows the F and D for 50,000 randomly chosen libraries. Here we observe libraries with diversities ranging from 7 to 41 (mutations per folded sequence) and F values from 0.5 to 7%. With SCHEMA we can identify the library with the highest F at a particular level of D. For the random sample in Fig. 3, this increases the number of folded chimeras up to fivefold compared to the average for libraries with the same diversity.

The random enumeration in Fig. 3 sampled a limited region of the (F, D) plane. Alternate enumeration schemes, e.g., enforcing minimum and maximum fragment sizes, can rapidly explore different combinations of F and D. Choosing different parents will also affect which regions of the (F, D) plane are accessible, even if the parents exhibit similar levels of sequence identity.
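To give a sense of scale for the random-sampling strategy, the sketch below counts the libraries implied by the enumeration described in the Fig. 3 caption (10 distinct crossover sites chosen from 153) and draws one random library. It assumes that libraries correspond one-to-one with unordered sets of crossover sites; the names `count_libraries`, `random_library`, and `chimera_assignments` are hypothetical and are not part of the SCHEMA software.

```python
import math
import random
from itertools import product

N_SITES = 153       # candidate crossover sites quoted in the Fig. 3 caption
N_CROSSOVERS = 10   # crossovers per library


def count_libraries(n_sites=N_SITES, n_crossovers=N_CROSSOVERS):
    """Number of ways to choose distinct crossover sites (a binomial)."""
    return math.comb(n_sites, n_crossovers)


def random_library(rng, n_sites=N_SITES, n_crossovers=N_CROSSOVERS):
    """Draw distinct crossover sites uniformly from 1..n_sites, sorted."""
    return tuple(sorted(rng.sample(range(1, n_sites + 1), n_crossovers)))


def chimera_assignments(n_crossovers=N_CROSSOVERS, n_parents=2):
    """Yield every block-to-parent assignment for one library.

    n_crossovers crossovers split the sequence into n_crossovers + 1
    blocks, each inherited from one of n_parents parents.
    """
    return product(range(n_parents), repeat=n_crossovers + 1)
```

Under this assumption the count is on the order of $10^{15}$ libraries, each containing $2^{11} = 2048$ block assignments, which is why sampling tens of thousands of libraries with `random_library` is the practical route.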
Decreasing the sequence identity of the parents will, in general, lead to libraries with higher D but lower F values. The same trend is expected if one increases the number of parents recombined.

Fig. 3. SCHEMA analysis of computed libraries. Randomly chosen libraries made by allowing 10 crossovers between the β-lactamases TEM-1 and PSE-4 were compared. The crossover locations in each library were enumerated by choosing 10 distinct, random integers between 1 and 153 (the length of the structural alignment minus the number of conserved residues minus 1). Each integer k represents a crossover between the kth and (k + 1)th nonidentical residues. This was repeated until 50,000 ten-integer sequences were generated. The structural coordinates of PSE-4 (PDB = 1G68) [18] were used to compute E for every chimera present in each library. F and D (rounded to the nearest integer) were computed for each library according to Eqs. (2) and (3) with a fixed $f_d$. The average F at each value of D is shown as a solid line.

[16] M. Zaccolo and E. Gherardi, J. Mol. Biol. 285, 775 (1999).
[17] P. S. Daugherty, G. Chen, B. L. Iverson, and G. Georgiou, Proc. Natl. Acad. Sci. USA 97, 2029 (2000).
[18] D. Lim, F. Sanschagrin, L. Passmore, L. De Castro, R. C. Levesque, and N. C. Strynadka, Biochemistry 40, 395 (2001).

Practical Considerations in Library Design

The best methods available for creating the libraries identified by SCHEMA as enriched in folded chimeras are sequence-independent site-directed chimeragenesis and chemical synthesis [2,5]. These techniques can recombine parents with any level of sequence identity at multiple sites and easily create libraries encoding thousands of chimeras. However, these methods are limited in where they can recombine distantly related proteins without introducing amino acid sequence changes not found in either parent. This happens because both techniques generate chimeras by ligating double-stranded gene modules together (each encoding a swapped polypeptide), which requires that parents exhibit 2 to 4 bp of identity at the crossover boundaries. When the parents do not exhibit sufficient identity at the desired crossover locations, synonymous mutations can often be introduced to allow recombination at that site. If synonymous mutations are not sufficient, then other crossover positions should be chosen for the library. SCHEMA only calculates the disruption arising from recombination, not from mutation.
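The boundary-identity requirement can be screened computationally before a library is committed to synthesis. The sketch below is a hypothetical helper, not part of the SCHEMA software. It assumes two parent genes supplied as aligned DNA strings of equal length, and it interprets "identity at the crossover boundary" as the existence of a fully identical window of `min_bp` bases that touches or spans the junction. That interpretation, the function name, and the parameters are all assumptions made for illustration; the search for synonymous mutations is not attempted.

```python
def has_boundary_identity(dna_a, dna_b, k, min_bp=4):
    """Check for an identical window of min_bp bases at junction k.

    The junction lies between bases k - 1 and k (0-based). Windows are
    allowed to end at the junction, start at it, or span it.
    """
    if len(dna_a) != len(dna_b):
        raise ValueError("parent genes must be aligned to the same length")
    first = max(0, k - min_bp)
    last = min(k, len(dna_a) - min_bp)
    for start in range(first, last + 1):
        window_a = dna_a[start:start + min_bp]
        window_b = dna_b[start:start + min_bp]
        if window_a == window_b:
            return True
    return False
```

A candidate crossover that fails this check at `min_bp=4` may still pass at `min_bp=2`, which mirrors the 2 to 4 bp range quoted above; sites that fail at every length are the ones for which synonymous mutations or alternative crossover positions would need to be considered.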

Program Availability

Software for running SCHEMA calculations is available at the Arnold group web site at the California Institute of Technology ( caltech.edu/groups/fha/).

[5] Staggered Extension Process In Vitro DNA Recombination

By Huimin Zhao

Introduction

In vitro DNA recombination is an extremely powerful approach for the directed evolution of proteins and nucleic acids. Unlike random mutagenesis methods in which point mutations are introduced randomly into a single parent sequence to produce a library of progeny sequences, DNA recombination methods entail the block-wise exchange of genetic variations among multiple parent sequences created in the laboratory or existing in nature to produce a library of chimeric progeny sequences. The key advantage of DNA recombination is its ability to accumulate beneficial mutations while simultaneously removing deleterious mutations, which may greatly accelerate the evolution of a protein or nucleic acid molecule of interest toward a specific function. It was demonstrated in computational simulation studies that DNA recombination plays a critical role in the evolution of biological systems [1]. In the past decade, in vitro DNA recombination has been used successfully to alter and engineer many types of protein function, such as stability, activity, affinity, selectivity, substrate specificity, and protein folding/solubility [2,3].

The first described in vitro DNA recombination method, or DNA shuffling, was developed by Stemmer in 1994 [4,5], in which DNA fragments generated by the random digestion of parent genes with DNase I are combined and reassembled into full-length chimeric progeny genes in a polymerase chain reaction (PCR)-like process. Since then, a number of in vitro DNA recombination methods have been described [6], such as

[1] S. Forrest, Science 261, 872 (1993).
[2] O. Kuchner and F. H. Arnold, Trends Biotechnol. 15, 523 (1997).
[3] C. Schmidt-Dannert, Biochemistry 40, (2001).
[4] W. P. Stemmer, Proc. Natl. Acad. Sci. USA 91, (1994).
[5] W. P. Stemmer, Nature 370, 389 (1994).

On the conservative nature of intragenic recombination D. Allan Drummond*, Jonathan J. Silberg, Michelle M. Meyer, Claus O. Wilke*, and Frances H. Arnold ** *Program in Computation and Neural Systems,