Modeling antibody hypervariable loops: A combined algorithm

Proc. Natl. Acad. Sci. USA, Vol. 86, December 1989, Biophysics

(loop replacement/antigen combining site/complementarity-determining regions)

ANDREW C. R. MARTIN, JANET C. CHEETHAM, AND ANTHONY R. REES*†

Laboratory of Molecular Biophysics, University of Oxford, The Rex Richards Building, South Parks Road, Oxford OX1 3QU, United Kingdom

Communicated by David Phillips, September 7, 1989

ABSTRACT To be of any value, a predicted model of an antibody combining site should have an accuracy approaching that of antibody structures determined by x-ray crystallography ( Å). A number of modeling protocols have been proposed, which fall into two main categories: those that adopt a knowledge-based approach and those that attempt to construct the hypervariable loop regions of the antibody ab initio. Here we present a combined algorithm requiring no arbitrary decisions on the part of the user, which has been successfully applied to the modeling of the individual loops in two systems: the anti-lysozyme antibody HyHel-5, whose crystal structure was determined as a complex with lysozyme [Sheriff, S., Silverton, E. W., Padlan, E. A., Cohen, G. H., Smith-Gill, S. J., Finzel, B. C. & Davies, D. R. (1987) Proc. Natl. Acad. Sci. USA 84], and the free antigen binding fragment (Fab) of the anti-lysozyme peptide antibody, Gloop2. This protocol may be used with a high degree of confidence to model single-loop replacements, insertions, deletions, and side-chain replacements. In addition, it may be used in conjunction with other modeling protocols as a method by which to model particular loops whose conformations are predicted poorly by these methods.

The wide range of specificities exhibited by antibodies is a function of the sequence and length variability of six hypervariable loops, or complementarity-determining regions (CDRs) (1), which form the antigen combining site.
These six CDRs, supported on a highly conserved framework region, constitute the variable region of the antigen binding fragment (Fab). A knowledge of antibody structure is essential for intelligent design of antibody enzymes (2), tailoring of affinity (3), and CDR replacement strategies (4). However, sequence information vastly exceeds structural information from x-ray crystallography and, until crystallographic structure determination becomes as routine as sequencing, modeling of structures is necessary. Since the framework region is conserved, it has proved relatively easy to model, whereas the CDRs, by their very nature (5), present a more challenging problem, and accuracy in their modeling is of paramount importance.

The approaches taken to modeling the antibody combining site, so far, fall into two groups: knowledge based and ab initio. Knowledge-based approaches have been used to model a number of antibodies, including J539 (6), GLOOP1-5 (5), HyHel-10 (7), and D1.3 (8). Although the methods differ in their detail, the common feature of all the approaches has been to examine only the known antibody crystal structures and select CDRs from these on the basis of length and/or sequence. Although most methods use simple sequence homology to select model loop conformations, Chothia and Lesk (9) have obtained better results by selecting conformations on the basis of the conservation of "key" residues that affect loop packing or conformation. However, the chief problem with any such method is the limited size of the knowledge base: while this has been extended to include all protein loops in the broader protein modeling field (10), a general data base has not previously been used in antibody modeling.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. solely to indicate this fact.
The second approach has been to use ab initio conformational search algorithms to saturate the conformational space available to a loop and select an appropriate structure on the basis of its energy, calculated by using an empirical energy function (11). Whereas this overcomes the limited size of the knowledge base, it fails to make use of the valuable information available in the structural data base and, consequently, is extremely expensive in computer time. Representative are the conformational search methods of Moult and James (12) and Bruccoleri and Karplus (13) and the random conformation dynamics method of Fine et al. (14).

To date, the knowledge-based and ab initio approaches have had no more than limited success; they are not routinely able to construct all six CDRs with a high level of accuracy. The knowledge-based model of J539 (6), when compared with the crystal structure (15), shows rms deviations of Å (backbone) and Å (all atoms). Chothia's model of D1.3 (8) is better, with rms deviations of Å (backbone) for five of the six loops; CDR-H1 (H refers to the heavy chain) has an rms of 2.07 Å. (All atom rms deviations have not been published.) The conformational search model for HyHel-5 (16) using the program CONGEN (13) shows rms deviations of Å (backbone) and Å (all atoms), whereas that for McPC603 (16) shows deviations of Å (backbone) and Å (all atoms).

The recent crystal structure for the anti-lysozyme antibody Gloop2 (Phil Jeffrey, Garry Taylor, Robert Griest, Steven Sheriff, and A.R.R., unpublished data) has enabled us to evaluate the published knowledge-based maximum overlap model for Gloop2 from this laboratory (5). In four of the six loops, the agreement between the two is good: Å (backbone) and Å (all atoms). For the two remaining loops (CDR-H2 and CDR-H3), however, the rms deviations are 1.77 Å and 3.61 Å (backbone) and 2.98 Å and 5.48 Å (all atoms), respectively.
The poorest loop, CDR-H3, is much shorter than that observed in any of the crystal structures used to create the model and thus required a major deletion to be made during the modeling. Given the unreliability of such manual deletions and insertions (17), it is not surprising that the conformation predicted for this loop is wrong. It should also be noted that the loops modeled in Gloop2 follow the sequence definition of Kabat et al. (18) and not the structural definition of Chothia and Lesk (9), which forms the basis of the rms deviations cited for the other models (see Discussion).

Thus, it is clear that neither knowledge-based nor ab initio methods, when used alone, allow the accurate construction of all six hypervariable loops on a routine basis. We have combined the two methods to produce a protocol that overcomes the limited size of the knowledge base by using conformational searching but does not ignore the information available in the structural data base.

Abbreviation: CDR, complementarity-determining region.
*To whom reprint requests should be sent.
†Current address (until April 1990): Igen Inc., 1500 East Jefferson Street, Rockville, MD

METHODS

The combined algorithm we have developed (described in detail below) has, in the first instance, concentrated on the accurate modeling of individual hypervariable loops. The entire Brookhaven Protein Data Bank (19) (as opposed to a data base restricted to antibody structures) was searched for loops of similar conformation to those in the known antibody crystal structures by using distance constraints derived from an analysis of these structures. The loops thus selected were overlapped onto the takeoff points on the framework of the loop to be replaced. The sequences of the loops were corrected and hydrogens were added in standard orientations. The portion of the loop most likely to interact with the antigen, or that surrounding any insertion, deletion, or mutation, was then deleted and reconstructed by using the conformational search program CONGEN (13). The conformations thus generated were then screened by using a modified version of the GROMOS (20) potential and solvent accessibility criteria (21). The overall scheme is shown in Fig. 1.

The Data Base Search. The distance constraints used to search the data base were derived from an analysis of available antibody crystal structures. Cα distances within each of the six CDRs of the known antibody structures were calculated from the N- and C-terminal residues (Fig. 2). The ranges used to search the data base are taken as the mean ± 3.5 standard deviations (σ). (Three standard deviations cover virtually 100% of a normal distribution, but, since the sample size is small, 3.5σ was chosen so as not to exclude any distances falling just outside the current distribution.)
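The range calculation described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the labels such as "N+2", the example distances, and the use of the sample (rather than population) standard deviation are all assumptions made here.

```python
from statistics import mean, stdev


def constraint_ranges(samples, k=3.5):
    """Return {label: (low, high)} where low/high = mean -/+ k standard deviations.

    `samples` maps a hypothetical label (e.g. "N+2", the C-alpha distance from
    the N-terminal residue to the residue two positions along) to the distances,
    in angstroms, measured in the known antibody structures.
    """
    ranges = {}
    for label, distances in samples.items():
        centre = mean(distances)
        # Assumption: sample standard deviation; the text does not say which form was used.
        spread = stdev(distances) if len(distances) > 1 else 0.0
        ranges[label] = (centre - k * spread, centre + k * spread)
    return ranges


# Made-up distances, for illustration only.
example = {"N+2": [5.0, 6.0, 7.0], "C-2": [6.5, 6.5, 6.5]}
ranges = constraint_ranges(example)
```

With a small sample the standard deviation is itself poorly determined, which is consistent with the choice of a generous multiplier (3.5 rather than 3).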
The data base of Cα distance constraints for all proteins in the Brookhaven Protein Data Bank has been created and, in its uncompacted form, occupies about 16 megabytes of disk space. Two searches were performed for each loop, applying distance constraints from the N- and C-terminal Cα atoms of the loop, respectively. The results from these two searches were then merged to produce a list of loops fitting both sets of constraints, and redundancies resulting from updated entries in the data bank and multiple chains were optionally removed.

Processing of the Loops. Each loop identified by the data base search was extracted from the Protein Data Bank and positioned onto the framework takeoff points of the loop being modeled (Fig. 3). The side chains were then replaced with standard conformations. Since these were later repositioned by conformational searching, only the position of the Cβ, which defines the overall orientation of the side chain, is important. Where a Cβ is present in the parent side chain, this position is used; when the parent is a glycine, the position of the Cβ comes from the template conformation. "Explicit" hydrogens (22) involved in hydrogen bonding were added using standard geometries. After any loops that have bad angles at the framework junction or bad clashes with the rest of the structure were removed, the data base loops were processed into the form of a CONGEN conformations file.

Conformational Searching. This uses the conformational search program CONGEN (13), which attempts to saturate the conformational space available to a polypeptide fragment by rotating the φ and ψ torsion angles before constructing side chains by rotating about the side-chain χ torsion angles. Our protocol is to reconstruct a segment of five residues by rotating the φ and ψ torsion angles of the two end residues in steps of 30° and by using the modified (23) Go and Scheraga (24) chain closure algorithm across the three middle residues.
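The two-pass search and merge described under the data base search can be sketched as follows. This is only an illustration under stated assumptions: the function names, the representation of a chain as a list of Cα coordinates, and the constraint format (offset along the chain mapped to an allowed distance range) are hypothetical and are not taken from the original program.

```python
import math


def search(ca, n, constraints, from_n_terminus):
    """Return start indices of n-residue windows satisfying one constraint set.

    `ca` is a list of (x, y, z) C-alpha coordinates for one chain.
    `constraints` maps an offset (in residues, smaller than n) to an allowed
    (low, high) distance range measured from the window's N- or C-terminal
    C-alpha, depending on `from_n_terminus`.
    """
    hits = set()
    for start in range(len(ca) - n + 1):
        anchor = start if from_n_terminus else start + n - 1
        step = 1 if from_n_terminus else -1
        if all(lo <= math.dist(ca[anchor], ca[anchor + step * off]) <= hi
               for off, (lo, hi) in constraints.items()):
            hits.add(start)
    return hits


def find_loops(ca, n, n_constraints, c_constraints):
    """Run the N- and C-terminal searches and keep windows found by both."""
    return sorted(search(ca, n, n_constraints, True)
                  & search(ca, n, c_constraints, False))


# Toy chain: five residues along x, then five along y (3.8 angstrom spacing).
chain = ([(3.8 * i, 0.0, 0.0) for i in range(5)]
         + [(15.2, 3.8 * j, 0.0) for j in range(1, 6)])
loops = find_loops(chain, 5, {4: (15.0, 15.4)}, {2: (7.5, 7.7)})
```

In the toy chain only fully extended five-residue windows satisfy the N-terminal constraint, so the merged list contains the windows starting at residues 0, 4, and 5.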
In addition, all the side chains in the loop are constructed. The five-residue segment is chosen as the middle of the loop, in the case of a whole loop replacement, or the residues neighboring a mutation, insertion, or deletion. In all cases, if fewer than 100 conformations are generated, the grid size is reduced to 15° to provide a better sampling of conformational space; if no conformations are generated, the number of residues constructed is increased until conformations are produced.

A variation in the procedure occurs for short loops (seven amino acids or fewer). For seven-amino acid loops, the data base search appears to saturate the conformational space of the backbone satisfactorily, so only side-chain positions are searched with CONGEN. For loops of five amino acids or fewer, the limited number of distance constraints means that an enormous number of conformations (>10,000) is selected from the data base, and processing this number of loops becomes impractical. Thus CONGEN is used alone in these cases.

FIG. 1. The overall scheme of the modeling protocol is illustrated in the flowchart. The variations on the method for short loops are shown. Energies are evaluated with the modified GROMOS potential.

FIG. 2. Cα distances were calculated from the N terminus and the C terminus as shown and used to search the data base of known protein structures for loops of similar conformation.

FIG. 3. Overlapping the data base loops onto the framework is done in four stages. The N terminus is overlapped with the N-terminal takeoff point; the C terminus is then rotated onto the vector between the N- and C-terminal takeoff points; the loop is moved along this vector such that it is evenly placed between the two takeoff points; and, finally, the loop is rotated such that the plane defined by its two termini and the center of mass of its backbone is coplanar with that defined by the equivalent points on the loop being replaced.

Screening the Conformations. CONGEN evaluates conformations using the CHARMM (11) potential in vacuo. It has been our experience that this potential rarely ranks the low rms conformations generated among those of low energy (unpublished data). An examination of the low-energy conformations by using molecular graphics shows that the poor ranking of conformations results from an optimization of the van der Waals energy of the loops, which produces a densely packed surface. In contrast, the same loops in the crystal structure are less tightly packed (25).
This is analogous to the situation seen in a dynamic simulation of T1 ribonuclease (26), where in vacuo the active site is seen to collapse, whereas in the presence of solvent the native active site conformation is retained. Introduction of solvent, however, is impractical since, unless it is simulated dynamically, it is purely a matter of chance as to whether good or bad contacts are made between solvent and protein (25), and dynamic simulation is currently too time-consuming on a system as large as an antibody when it is necessary to screen many hundreds of conformations for each loop.

As an alternative, we have made a major modification to the GROMOS (20) potential and used this to screen each conformation. In solvent, an approximately equal attractive force is exerted on the loop by protein and water molecules. Thus the attractive part of the Lennard-Jones potential is removed, simulating this condition in vacuo. The high dielectric constant of water would tend to reduce the importance of electrostatics near the protein surface, and thus the electrostatic potential is also removed. Similar "solvent-modified" potentials have been used in molecular dynamics of L-arabinose-binding protein (27), in simulations of carbohydrates (28), and in analysis of misfolded proteins (29).

By using this potential, the following general result was obtained: one of the lowest rms deviation loop conformations falls within the five lowest energies for those structures generated. Selection from this group was performed on the basis of solvent accessibility (21) of hydrophobic atoms (defined for this purpose as all side-chain carbons that are not part of charged or polar groups). The conformation having the lowest hydrophobic exposed area was then selected.

RESULTS AND DISCUSSION

The method has been applied to modeling individually the six CDRs of the anti-lysozyme antibodies HyHel-5 (30) and Gloop2 (31).
For HyHel-5, since the crystal structure is of a complex between antigen binding fragment (Fab) and lysozyme, our modeling has been performed with lysozyme in place. This is the fairest test of the protocol since, until a crystal structure of an antibody, both complexed and uncomplexed, becomes available, we do not know to what extent the presence of antigen affects CDR conformation. Two crystal forms of Gloop2 have been solved, in space groups P1 and P2₁ (Phil Jeffrey, Garry Taylor, Robert Griest, Steven Sheriff, and A.R.R., unpublished data). Although at lower resolution (3.5 Å), the P2₁ structure is at a more complete stage of refinement (residual = 19.7%) and was thus chosen for this modeling exercise (P1 is at 2.8 Å, residual = 21.4%).

The data for the lowest energy and selected conformation in each of the six loops of HyHel-5 and Gloop2, together with the energies calculated for the crystal structure loops, are shown in Tables 1 and 2. The conformations selected are shown in Table 3.

The rms deviations for the model of HyHel-5 are, in all instances, very low. CDR-L3 (L refers to the light chain) has an rms deviation of 1.37 Å for the backbone, although the all atom deviation is 2.50 Å. This results from the misplacement of the side chain of Trp-90, which is rotated by approximately 180° about the χ1 (Cα-Cβ) torsion angle. The poor side-chain rms of CDR-H3 results from the misplacement of the side chains of Tyr-100 and Phe-102.

Table 1. Conformations of HyHel-5
Columns: CDR; Energy,* kcal/mol; rms,† Å (All, BB); SA,‡ Å²; PDB; Construction
Li FBJ SAS[SSVNY]MY FBJ L RHE DTSKLAS imcp L CPP QQ[WGRNP]T HKG H [DYWIE] H i MCP EI[LPGSGS]TN H lpfc GNYDFDG API L3B YHZ QQ[WGRNP]T YHX
PDB, Protein Data Bank code of the data base loop.
*For each CDR the five lowest-energy conformations (as calculated using the modified version of the GROMOS potential) were selected. Note that these energies are relative and not absolute values. The lowest energy conformations as well as selected conformations are shown. For comparison, the crystal structure for each loop has an energy of 17,104 kcal/mol (except L3B, the energy of which is 11,479 kcal/mol).
†rms deviations are quoted for all atoms (All) and backbone atoms (BB), defined as N, Cα, and C.
‡For each conformation, the solvent accessibility (SA) of hydrophobic atoms was calculated, and the conformation with the lowest accessibility was selected. Residues constructed by CONGEN are shown between brackets; side chains were constructed by CONGEN for all residues.
§Selected energy conformations.
¶Conformations for CDR-L3B were calculated in the absence of the antigen, lysozyme.
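The screening and selection rule summarized in the legend to Table 1 (rank by the modified potential, keep the five lowest energies, then choose the lowest hydrophobic accessibility) can be sketched as below. This is a simplified illustration rather than the modified GROMOS implementation: a single hypothetical repulsion coefficient `c12` is applied to every atom pair, bonded-pair exclusions are ignored, and the energies and accessibilities in the example are invented.

```python
import math
from itertools import combinations


def repulsive_energy(coords, c12=1.0):
    """Repulsion-only nonbonded energy: sum of c12 / r**12 over all atom pairs.

    The attractive -C6 / r**6 Lennard-Jones term and all electrostatics are
    left out, mirroring the modification described in the text. Using one c12
    for every pair is a simplification made for this sketch.
    """
    return sum(c12 / math.dist(a, b) ** 12 for a, b in combinations(coords, 2))


def select_conformation(conformations, n_lowest=5):
    """Keep the n_lowest-energy conformations, then return the one with the
    smallest hydrophobic solvent-accessible area.

    Each conformation is a dict with precomputed "energy" and "hydrophobic_sa".
    """
    lowest = sorted(conformations, key=lambda c: c["energy"])[:n_lowest]
    return min(lowest, key=lambda c: c["hydrophobic_sa"])


# Invented values, for illustration only.
candidates = [
    {"id": "a", "energy": 10.0, "hydrophobic_sa": 120.0},
    {"id": "b", "energy": 11.0, "hydrophobic_sa": 95.0},
    {"id": "c", "energy": 12.0, "hydrophobic_sa": 130.0},
    {"id": "d", "energy": 13.0, "hydrophobic_sa": 110.0},
    {"id": "e", "energy": 14.0, "hydrophobic_sa": 140.0},
    {"id": "f", "energy": 15.0, "hydrophobic_sa": 50.0},  # outside the five lowest
]
chosen = select_conformation(candidates)
```

Conformation "f" has the smallest exposed hydrophobic area but is never considered, because the energy filter is applied first; the accessibility criterion only breaks ties among the five lowest-energy candidates.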

Table 2. Conformations for Gloop2
Columns: CDR; Energy, kcal/mol; rms, Å (All, BB); SA, Å²; PDB; Construction
Li * REI RAS[QEISG]YLS L APP AASTLDS * FB4 L iapr LQ[YLSYP]LT * APR H [TFGIT] * H * FB4 EI[FPGNS]KTY H [REIRY] *
The lowest energy and selected conformations (indicated by *) are shown for each CDR of Gloop2. For details, see the legend to Table 1. For comparison, the crystal structure for each loop has an energy of kcal/mol (except for CDR-L1, which was modeled against a later version of the crystal structure with an energy of kcal/mol). PDB, Protein Data Bank code; BB, backbone; All, all atoms.

The rms deviations for the model of Gloop2 are also low. The mean error level in the P2₁ crystal structure is approximately 0.6 Å, and the rms deviations quoted should be compared with this figure. The poor all atom rms for CDR-H2 results from the misplacement of Phe-52 and Tyr-59, which are rotated by approximately 180° about their χ1 torsion angles compared with their positions in the P2₁ structure. However, when compared with the P1 structure, their placement is virtually perfect (all atom rms = Å). In CDR-H3 the side chain of Arg-101 is misplaced. A comparison of each loop with the crystal structure for HyHel-5 and Gloop2 is shown in Fig. 4.

Individually, we have successfully modeled the six CDRs of both an antibody complexed with its antigen and a free antibody in the absence of antigen or solvent. The protocol has the advantage over other methods that it is independent of decisions on the part of the user and thus gives consistently accurate results. Previous knowledge-based modeling of antibody loops has required manual insertions or deletions to be made using molecular graphics (5), while ab initio techniques require arbitrary choices to be made about the way long loops are constructed in order to reduce computer time (16).

Table 3. rms deviations for the modeled CDRs of HyHel-5 and Gloop2
Columns: CDR; HyHel-5: Length, rms deviation, Å (All, BB); Gloop2: Length, rms deviation, Å (All, BB)
Rows: L1, L2, L3, H1, H2, H3, Mean
For Gloop2, the model is compared with the P2₁ crystal structure; when compared with the P1 crystal form, the poor side-chain placement in CDR-H2 is greatly improved. The Kabat and Wu definition of CDR-H2 includes a complete strand of the β-sheet. Our definition of this loop has its C-terminal residue defined as the strand partner of the N-terminal residue of the loop. This contrasts with the structural definition of only three amino acids. BB, backbone atoms (defined as N, Cα, and C); All, all atoms.

Although the rms deviations for the model loops are better than those achieved through other methods (16), a direct comparison of our results with those from other methods is difficult. Unlike other approaches, each loop is modeled in the context of the crystal structure of the remaining loops and (for HyHel-5) in the presence of lysozyme. Our definition of the loops (with the exception of CDR-H2, see the legend to Table 3) follows the Kabat and Wu definitions (18) (i.e., is based on sequence), whereas some other workers use Chothia and Lesk's structural definition (9). In all the loops except CDR-H1, we define longer sequences, which include the residues defined by the structural definition.

Our modeling of the six hypervariable loops of HyHel-5 was carried out in the presence of the antigen, lysozyme. However, CDR-L3 has also been modeled in the absence of the antigen and appears to have a different preferred conformation. Table 1 shows that, when modeled with antigen present, a low rms conformation would be selected in a modeling experiment. However, in the absence of antigen, a conformation of high rms would be chosen. It might be argued that the poor modeling of CDR-L3 in the absence of antigen reveals a deficiency in the method, as it requires the conformational constraints placed upon it by antigen. However, the accurate modeling of the free antibody, Gloop2, suggests that this is not the case. The alternative conclusion is that the movement in CDR-L3 is a real one: a conformational change resulting from antigen binding. It has previously been proposed (32) that antigen binding may exclude certain conformations of the CDRs, and we await with interest a crystal structure of the unbound antibody.

FIG. 4. The predicted conformation (thin lines) for each CDR is shown superimposed on the crystal structure (thick lines) for HyHel-5 (A) and Gloop2 (B). Note that these are global fits of the antibody structure and not local fits of the loops. In the case of Gloop2, the comparison is made against the P2₁ crystal form.

The approach we have presented for the precise modeling of CDRs in the context of the crystal structure of the other five loops includes elements of both knowledge-based and ab initio methods. This represents an accurate, reliable method for the modeling of loop replacements, insertions, deletions, and mutations. It also provides a method by which to model loops that are known to cause problems with the existing alternative methods. For example, the knowledge-based approach of Chothia et al. (8) is very successful when loops that have the key residues that they define are available in the structural data base. However, if the key residues are not present, their results are poor. The method we present may thus be used to model these "difficult" loops (A.C.R.M., J.C.C., A.R.R., and Cyrus Chothia, unpublished data).

A further application of the method is as an aid to crystallography. In the early stages of refinement, one or two loops in the structure may have poor density. A starting conformation for such loops could thus be generated by the modeling protocol presented here and then subjected to further refinement against the x-ray data.

The continued development of such an algorithm will permit the accurate construction of antibody combining sites from sequence data alone and culminate in the de novo design of antibodies of predetermined specificity.
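The rms deviations quoted throughout (all atoms, and backbone atoms N, Cα, and C) are stated to come from global fits of the antibody rather than local fits of each loop, so the loop coordinates are compared as they stand once the frameworks are superimposed. A minimal sketch of that comparison follows; the dictionary layout keyed by (residue number, atom name) and the example coordinates are assumptions made for illustration, and no superposition step is included.

```python
import math

BACKBONE = {"N", "CA", "C"}


def rms_deviation(model, crystal, atom_names=None):
    """Root-mean-square deviation over atoms present in both structures.

    `model` and `crystal` map (residue_number, atom_name) to (x, y, z) in
    angstroms, already in a common frame. No refitting is done here, which
    corresponds to a global rather than a local fit of the loop.
    Pass atom_names=BACKBONE to restrict the comparison to N, CA, and C.
    """
    keys = [k for k in model
            if k in crystal and (atom_names is None or k[1] in atom_names)]
    if not keys:
        raise ValueError("no atoms in common")
    total = sum(math.dist(model[k], crystal[k]) ** 2 for k in keys)
    return math.sqrt(total / len(keys))


# Invented coordinates: the backbone agrees, one side-chain atom is displaced.
model = {(90, "N"): (0.0, 0.0, 0.0), (90, "CA"): (1.0, 0.0, 0.0),
         (90, "CB"): (2.0, 0.0, 0.0)}
crystal = {(90, "N"): (0.0, 0.0, 0.0), (90, "CA"): (1.0, 0.0, 0.0),
           (90, "CB"): (2.0, 3.0, 0.0)}
all_atom = rms_deviation(model, crystal)
backbone = rms_deviation(model, crystal, BACKBONE)
```

The example shows how a single misplaced side-chain atom raises the all atom figure while leaving the backbone figure unchanged, the pattern reported for Trp-90 in CDR-L3 of HyHel-5.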
We thank Robert Bruccoleri for supplying us with CONGEN and its source code, Alexi Finkelstein for useful discussions concerning the potentials, Phil Jeffrey for the coordinates of Gloop2, Steven Sheriff and David Davies for updated HyHel-5 coordinates, the SERC Biotechnology Directorate (J.C.C.), and the Medical Research Council for a studentship to A.C.R.M. We also thank Celltech for supplying a MicroVAX 3000 on which the computations have been performed.

1. Wu, T. T. & Kabat, E. A. (1970) J. Exp. Med. 132.
2. Lerner, R. A. & Benkovic, S. J. (1988) BioEssays 9.
3. Roberts, S., Cheetham, J. C. & Rees, A. R. (1987) Nature (London) 328.
4. Riechmann, L., Clark, M., Waldmann, H. & Winter, G. (1988) Nature (London) 332.
5. de la Paz, P., Sutton, B. J., Darsley, M. J. & Rees, A. R. (1986) EMBO J. 5.
6. Mainhart, C. R., Potter, M. & Feldmann, R. J. (1984) Mol. Immunol. 21.
7. Smith-Gill, S. J., Mainhart, C. R., Lavoie, T. B., Feldmann, R. J., Drohan, W. & Brooks, B. R. (1987) J. Mol. Biol. 194.
8. Chothia, C., Lesk, A. M., Levitt, M., Amit, A. G., Mariuzza, R. A., Phillips, S. E. V. & Poljak, R. J. (1986) Science 233.
9. Chothia, C. & Lesk, A. M. (1987) J. Mol. Biol. 196.
10. Jones, T. A. & Thirup, S. (1986) EMBO J. 5.
11. Brooks, B., Bruccoleri, R. E., Olafson, B. D., States, D. J., Swaminathan, S. & Karplus, M. (1983) J. Comp. Chem. 4.
12. Moult, J. & James, M. N. G. (1986) Proteins Struct. Funct. Genet. 1.
13. Bruccoleri, R. E. & Karplus, M. (1987) Biopolymers 26.
14. Fine, R. M., Wang, H., Shenkin, P. S., Yarmush, D. L. & Levinthal, C. (1986) J. Mol. Biol. 1.
15. Suh, S. W., Bhat, T. N., Navia, M. A., Cohen, G. H., Rao, D. N., Rudikoff, S. & Davies, D. R. (1986) Proteins Struct. Funct. Genet. 1.
16. Bruccoleri, R. E., Haber, E. & Novotny, J. (1988) Nature (London) 335.
17. Cheetham, J. C., Martin, A. C. R., Roberts, S., Webster, D., Griest, R., Field, H., Hilyard, K. L. & Rees, A. R. Int. Rev. Immunol., in press.
18. Kabat, E. A., Wu, T. T., Reid-Miller, M., Perry, H. M. & Gottesman, K. S. (1987) Sequences of Proteins of Immunological Interest (U.S. Dept. Health and Human Services, Washington, DC), 4th Ed.
19. Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Jr., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977) J. Mol. Biol. 112.
20. Åqvist, J., Van Gunsteren, W. F., Leijonmarck, M. & Tapia, O. (1985) J. Mol. Biol. 183.
21. Lee, B. K. & Richards, F. M. (1971) J. Mol. Biol. 55.
22. McCammon, J. A., Wolynes, P. G. & Karplus, M. (1979) Biochemistry 18.
23. Bruccoleri, R. E. & Karplus, M. (1985) Macromolecules 18.
24. Go, N. & Scheraga, H. A. (1970) Macromolecules 3.
25. Rees, A. R., Martin, A. C. R., Roberts, S. & Cheetham, J. C. (1989) in Proceedings of the UCLA Symposia on Molecular and Cellular Biology: Protein and Pharmaceutical Engineering, eds. Craik, C., Fletterick, R., Matthews, C. R. & Wells, J. (Liss, New York), in press.
26. Mackerell, A. D., Jr., Nilsson, L. & Rigler, R. (1988) Biochemistry 27.
27. Mao, B., Pear, M. R., McCammon, J. A. & Quiocho, F. A. (1982) J. Biol. Chem. 257.
28. Bock, K., Meldal, M., Bundle, D. R., Iversen, T., Garegg, P. J., Norberg, T., Lindberg, A. A. & Svenson, S. B. (1984) Carbohydr. Res. 130.
29. Novotny, J., Rashin, A. A. & Bruccoleri, R. E. (1988) Proteins Struct. Funct. Genet. 4.
30. Sheriff, S., Silverton, E. W., Padlan, E. A., Cohen, G. H., Smith-Gill, S. J., Finzel, B. C. & Davies, D. R. (1987) Proc. Natl. Acad. Sci. USA 84.
31. Darsley, M. J. & Rees, A. R. (1985) EMBO J. 4.
32. Hartman, A. B., Mallett, C., Sheriff, S. & Smith-Gill, S. J. (1988) J. Immunol. 141.