
Molecular BioSystems
Cite this: Mol. BioSyst., 2012, 8, 2680–2691
www.rsc.org/molecularbiosystems
PAPER

Functional annotation of the mesophilic-like character of mutants in a cold-adapted enzyme by self-organising map analysis of their molecular dynamics†

Domenico Fraccalvieri, a Matteo Tiberti, b Alessandro Pandini, cd Laura Bonati* a and Elena Papaleo* b

Received 16th May 2012, Accepted 16th June 2012
DOI: 10.1039/c2mb25192b

Multiple comparison of the Molecular Dynamics (MD) trajectories of mutants in a cold-adapted α-amylase (AHA) could be used to elucidate functional features required to restore mesophilic-like activity. Unfortunately it is challenging to identify the different dynamic behaviors and correctly relate them to functional activity by routine analysis. We here employed a previously developed and robust two-stage approach that combines Self-Organising Maps (SOMs) and hierarchical clustering to compare conformational ensembles of proteins. Moreover, we designed a novel strategy to identify the specific mutations that more efficiently convert the dynamic signature of the psychrophilic enzyme (AHA) to that of the mesophilic counterpart (PPA). The SOM trained on AHA and its variants was used to classify a PPA MD ensemble and successfully highlighted the relationships between the flexibilities of the target enzyme and of the different mutants. Moreover the local features of the mutants that mostly influence their global flexibility in a mesophilic-like direction were detected. It turns out that mutations of the cold-adapted enzyme to hydrophobic and aromatic residues are the most effective in restoring the PPA dynamic features and could guide the design of more mesophilic-like mutants. In conclusion, our strategy can efficiently extract specific dynamic signatures related to function from multiple comparisons of MD conformational ensembles. Therefore, it can be a promising tool for protein engineering.
a Department of Environmental Sciences, University of Milano-Bicocca, Milan, Italy. E-mail: laura.bonati@unimib.it
b Department of Biotechnology and Biosciences, University of Milano-Bicocca, Milan, Italy. E-mail: elena.papaleo@unimib.it
c Randall Division of Cell and Molecular Biophysics, King's College London, London, UK
d Division of Mathematical Biology, MRC National Institute for Medical Research, London, UK
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c2mb25192b

Introduction

Molecular Dynamics (MD) simulations provide links between protein structure and dynamics by allowing the exploration of the conformational energy landscape accessible to protein molecules. 1,2 The analysis of the large ensembles of conformations generated by an MD simulation provides information about the average physico-chemical and geometrical properties of a protein and enables the identification of recurring conformations and transitions between them. Moreover, it has more recently been demonstrated that the comparison of MD conformational ensembles of functionally related proteins allows us to highlight the role of flexibility in modulating protein function as well as to analyse the evolutionary conservation and specialization of protein dynamics across distant homologous proteins. 3–15 Identification of functionally relevant conformations within trajectories of the same system is generally achieved by clustering the raw ensemble of structures that are generated. Geometrical clustering algorithms have been introduced and extensively used in this context, 16–18 based on the assumption that structurally similar conformations lie in the same basin of the free energy surface. However, these analyses are often affected by noise and are difficult to interpret, particularly if the goal is the comparison among trajectories of different systems to highlight subtle similarities and differences in the dynamics related to function.
This journal is © The Royal Society of Chemistry 2012

This emphasises the need for more advanced tools to compare conformational ensembles derived from extensive MD simulations of different systems. A suitable alternative for handling multiple comparisons among complex data is the use of Self-Organising Maps (SOMs). A SOM is a specific architecture of artificial neural networks that projects high-dimensional input data on a low-dimensional grid (map) of so-called neurons, where similar elements are associated with neighbouring neurons. 19 Recently, SOMs were reported to perform more accurately and to provide more consistent results than traditional clustering

algorithms in various data-mining problems. This approach was also applied to MD data analysis 17,20,21 and a comparison with traditional clustering algorithms identified SOMs among the best performing methods. 17 We have recently developed a new approach to efficiently compare conformational ensembles of proteins, which combines SOMs and hierarchical clustering. 7 In our two-level approach, the SOM captures the relevant features of the large input space in a smaller set of prototype vectors, and clustering algorithms are then used to divide the prototypes into groups. The major advantage over classical clustering approaches is the possibility of providing a topological mapping of the conformational space embedded in a simple 2D visualisation. This not only simplifies the identification of differences in the conformational dynamics and the definition of multiple relationships among these, but also helps in the functional annotation. To provide users with optimal SOM parameters for this specific type of data, we developed a protocol by which a small number of important parameters was identified and their optimal values were derived. 7 Once the optimized parameters were obtained, the SOM approach was applied to the investigation of the dynamic properties of a single domain, where ligand binding activity is known to be modulated by single mutations 22 that turned out to greatly affect the conformational dynamics of the protein. Our approach not only allowed a very efficient comparison of the mutants' trajectories but also led to a functional interpretation of the observed differences among the mutants. 7 In this paper, we further assess the potential of our approach and extend the previous proposal in the direction of a predictive tool.
Specifically, the SOM approach is employed to rationalize how single and multiple mutations are able to modify the dynamic properties of a protein to the extent of restoring the properties of a homologous protein. It is known that specific dynamic signatures are a relevant hallmark of differently temperature-adapted proteins, isolated from thermophilic, mesophilic and psychrophilic organisms. 23–27 We also recently demonstrated that a correlation exists between the modification of the catalytic and thermodynamic parameters and the modification that the mutations exert on the protein dynamics in a battery of seven mesophilic-like mutations of a cold-adapted α-amylase (namely AHA, Pseudoalteromonas haloplanktis α-amylase). 9 In fact, these mutations are able to partially restore the kinetic and thermodynamic properties of its warm-adapted counterpart, the porcine α-amylase, namely PPA. 28,29 Moreover, our MD simulation approach showed that the inserted mutations were able to re-establish in AHA a map of correlated motions and electrostatic interactions found in the mesophilic enzyme and, at the same time, to lose dynamic properties typical of the wild-type psychrophilic α-amylase. 9 Interestingly, although each of these mutations exerts long-range effects, the specific effects induced by each mutation and their contribution and efficiency in modulating the dynamic properties were not properly elucidated. In particular, it remains unclear why the multiple AHA mutants, in which all the single mutations are combined, do not feature kinetic and thermodynamic parameters that can account for truly additive effects, and why the mutations applied to AHA so far are not able to completely convert the cold-adapted enzyme into a mesophilic-like variant but exert intermediate effects.
In light of the previous observations, the multiple comparison of the MD conformational ensembles of AHA and its mutants turns out to be a very interesting case study to further assess the potential of our SOM approach. In particular, a first aim of this paper is to demonstrate the efficiency and reliability of this approach in elucidating the differences in flexibility related to function in a case where, due to the complex network of local and distal effects induced by mutations, the multiple comparison of the MD trajectories is particularly demanding. Moreover, a novel strategy is proposed to identify specific mutations able to efficiently convert the AHA dynamic properties to those of the homologous mesophilic system. To this aim, we propose using the SOM trained with the trajectories of AHA and its mutants to classify the MD conformational ensemble of PPA. The comparison between conformations sampled by mesophilic-like variants and by the reference PPA enzyme, based on the use of specific local descriptors mapped on the SOM, succeeded in both identifying and designing mutants that are more suitable to restore the PPA characteristics. This result confirmed the ability of our strategy to efficiently extract specific dynamic signatures related to function from multiple comparisons of protein MD ensembles and highlighted the potential of this tool as a support for protein engineering.

Materials and methods

Self-Organising Map (SOM) approach to compare MD trajectories

A SOM is a widely used tool to handle multidimensional input data and visualize them in a grid (map), in which each node is called a neuron. 30 During the learning process each neuron wins the data closest to it and modifies itself iteratively, becoming more similar to the data.
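The learning process just described (each input is won by the closest neuron, which then moves towards it together with its map neighbours) can be sketched as follows. This is a minimal illustration only, not the implementation used in this work: it uses an online update on a rectangular grid for brevity, whereas the analysis reported here used the batch algorithm on a hexagonal lattice, and the function names `best_matching_unit` and `train_som` as well as the small default number of epochs are assumptions made for the example.

```python
import math
import random


def best_matching_unit(prototypes, x):
    """Return the index of the neuron whose prototype is closest to x (the winner)."""
    return min(range(len(prototypes)), key=lambda i: math.dist(prototypes[i], x))


def train_som(data, rows, cols, epochs=50, alpha0=0.09, radius0=3.0, seed=0):
    """Toy online SOM on a rectangular rows x cols grid (illustrative assumption).

    data is a list of equal-length numeric vectors. Returns the grid
    positions and the trained prototype vectors, one per neuron.
    """
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Initialise every prototype as a copy of a randomly chosen input vector.
    prototypes = [list(rng.choice(data)) for _ in grid]
    for epoch in range(epochs):
        frac = epoch / epochs
        alpha = alpha0 * (1.0 - frac)               # linearly decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)   # shrinking Gaussian neighbourhood
        for x in rng.sample(data, len(data)):       # present inputs in random order
            win = best_matching_unit(prototypes, x)
            for i, pos in enumerate(grid):
                d2 = (pos[0] - grid[win][0]) ** 2 + (pos[1] - grid[win][1]) ** 2
                h = math.exp(-d2 / (2.0 * radius ** 2))
                # Move the prototype towards x, weighted by its map distance to the winner.
                prototypes[i] = [p + alpha * h * (xi - p)
                                 for p, xi in zip(prototypes[i], x)]
    return grid, prototypes
```

Because every update is a convex combination of a prototype and an input vector, the prototypes always remain inside the region spanned by the training data, which is why they can be read as representative features of the input space.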
The neurons obtained after the learning process are characterized by vectors with the same dimension as the input data (the so-called "prototype vectors"), each representing a particular feature drawn from the original input space. The number of unique features that can be represented on the map increases with the number of employed neurons, along with the computational cost. A SOM can be interpreted as a semantic map where similar samples are mapped close together and dissimilar ones apart, as well as a discrete approximation of the distribution of training samples in the prototype vectors. The trained SOM can also be employed to assign data not used in the training process to the neurons. In this case, data are only read by and compared with the neurons of the map, and the winning neuron does not modify its prototype vector after the assignment. A two-level approach that combines SOM and hierarchical clustering of the obtained prototype vectors, previously developed to analyse MD data, 7 was employed. In this approach the SOM input data are ensembles of conformations extracted from the MD trajectories of one or more domains, each

conformation being described by the Cartesian coordinates of the Cα atoms. At the end of the SOM training process, the original conformational ensemble is projected on a bi-dimensional feature map, where a limited number of prototype vectors summarizes the entire sampled space, and similar conformations are associated to neighbour neurons. In the second step of the approach, the prototype vectors are further grouped in a small, but representative, number of clusters by using a hierarchical agglomerative clustering algorithm, the complete linkage. 31 The optimal sampling rate of the MD trajectory, previously set to use the minimum number of conformations while maintaining a reliable picture of the protein dynamics, 7 was employed. It consists of sampling one conformation every 100 ps. Both the learning and topological SOM parameters that were optimized for MD data by experimental design 7 were employed. The optimized SOM learning parameters are: learning algorithm = batch, neighbourhood function = Gaussian, alpha type = linear, radius = 3, training length = 5000 epochs, and initial value of alpha = 0.09. The optimal topological parameters obtained are: map size = 100, lattice shape = hexagonal, and shape = sheet. A first inspection on the number of clusters that best summarize the information in the map was obtained by the Mojena's stopping rule. 32 Then, the number of clusters was refined by a deeper inspection of the meaning of the obtained clusters, in order to increase their interpretability. The quality of the obtained clusters was assessed using the Silhouette index (S) 33 that evaluates their compactness and separation. Its value is in the range [−1, 1] and is generally interpreted as evidence in support of the cluster structure: 34 strong (S > 0.7), reasonable (0.5 < S ≤ 0.7), weak (0.25 < S ≤ 0.5) or no significant evidence (S ≤ 0.25).
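To make the cluster-quality criterion concrete, the following is a minimal sketch of how a Silhouette value can be computed for each clustered vector and mapped onto the interpretation ranges quoted above. It assumes Euclidean distances and at least two clusters; the function names `silhouette_values` and `interpret_silhouette` are hypothetical, and the code is an independent illustration rather than the toolbox routine used for the analysis.

```python
import math


def silhouette_values(points, labels):
    """Silhouette value s(i) = (b - a) / max(a, b) for every point.

    a is the mean distance to the other members of the point's own cluster,
    b the smallest mean distance to the members of any other cluster.
    Assumes at least two clusters and not all points identical.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:                 # singleton cluster: s(i) is set to 0 by convention
            values.append(0.0)
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        values.append((b - a) / max(a, b))
    return values


def interpret_silhouette(s):
    """Map a Silhouette value onto the evidence classes listed in the text."""
    if s > 0.7:
        return "strong"
    if s > 0.5:
        return "reasonable"
    if s > 0.25:
        return "weak"
    return "no significant evidence"
```

In the two-level approach the clustered vectors would be the SOM prototype vectors, so one value is obtained per neuron, as in a Silhouette plot with one bar per neuron.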
The analysis was performed in the MATLAB environment using the SOM Toolbox, 35 the Statistics Toolbox 36 and the Mojena's rule implementation made available by courtesy of Prof. Fernandez (http://ima.udg.edu/bjamf/martin_mf.htm).

Results and discussion

Summary of structural and dynamical properties of AHA, AHA mutants and PPA

The overall fold of AHA (PDB entry 1AQH 37 ) and PPA (PDB entry 1PIF 38 ) is conserved. They exhibit a multi-domain organization and the three characteristic domains of chloride-dependent α-amylases: 39 the (β/α)8 catalytic domain A, a small domain B which protrudes from domain A and includes the calcium binding site, as well as a C-terminal domain (domain C) characterized by β-strands organized in a Greek-key motif (Fig. 1A). The catalytic triad is located in domain A and includes the nucleophile Asp174, the proton donor Glu200 and an acidic residue, Asp264, which favours the protonation state of the glutamic acid side chain 39 (AHA numbering). The cold-adapted α-amylase AHA and its mesophilic counterpart PPA share 43% sequence identity (Fig. 1B), with gaps mostly located in the loop regions in the catalytic domain. Several differences have been highlighted between the two enzymes. In fact, AHA shows fewer intramolecular interactions, a reduced compactness of the hydrophobic core and of the inter-domain interfaces, increased hydrophobicity of the solvent accessible surface and a different architecture of the ion binding site with respect to PPA. 37,38,40 Furthermore, the loops surrounding the active site are generally shorter in AHA than in PPA and feature distinct flexibility as judged by previous all-atom explicit solvent MD simulations. 9,41

Fig. 1 Main structural and sequence features of AHA, AHA mutants and PPA. (A) AHA and PPA 3D structures are superimposed (PDB IDs 1AQH and 1PIF, respectively). The mutation sites and the catalytic residues of AHA are shown as orange ball and sticks and yellow dots, respectively.
The loops relevant to the discussion (L3, L5 and L7) are highlighted as light blue ribbons. Residues ignored in the SOM classification scheme (i.e. those of domain C and those corresponding to gaps in the structural alignment) are shown in black. (B) Structural alignment between AHA and PPA crystallographic structures. The alignment was calculated using the DaliLite server 52 and then manually adjusted. The intensity of the red background color for each residue is proportional to the RMSF values calculated from MD simulations. Secondary structure elements are shown as calculated by DSSP 53 on the crystal structures. Catalytic residues are highlighted using star symbols, mutation sites are highlighted by black boxes and the relevant loops are enclosed in brackets.

Recently, we also investigated by comparative MD simulations 9 a battery of seven previously experimentally characterized 29,42,43 mesophilic-like mutants of AHA. Particular attention was devoted in this study to the comparison of flexibility, networks of electrostatic interactions, as well as correlated motions in AHA, AHA mutants and PPA. In particular the following AHA mutants were investigated: AHANR (N12R), AHASS (Q58C, A99C), AHAVF (V196F), AHATV (T232V), AHAQI (Q164I), AHA5 (N150D, V196F, T232V, Q164I, K300R), and AHA5SS (Q58C, A99C, N150D, Q164I, V196F, T232V, K300R) (Fig. 1A). The restored weak interactions are able to modify AHA dynamics in a warm-adapted-like direction, acting long-range on the loops surrounding the catalytic cleft. 9 In particular, these differences regard loops β3-α3 (L3), β5-α5 (L5), β7-α7 (L7) and β8-α8 (L8). L3, L5 and L7 project toward the catalytic site, forming a three-sided cleft on one side of the substrate binding pocket, whereas interactions between L8, which is longer in PPA than in AHA, and L7 regulate the conformational properties of L7 in PPA. 41 Furthermore, L3, L5 and L7 were shown to interact with co-crystallized substrate analogues, and the rearrangement of L7 was suggested to have a role in the binding and release of the substrates in α-amylases. 44–46 The differences between PPA and AHA in these loops cause a different distribution in the 3D structure of the most flexible regions, as judged by the analysis of Cα Root-Mean Square Fluctuation (RMSF) profiles (Fig. 1B). 41 In more detail, the L3 region was demonstrated to feature in AHA a longer C-terminal extension (L3 C ) that carries a short and highly mobile α-helical portion. In PPA, the helix is preceded by a highly flexible coil region (the N-terminal region of L3, L3 N ), which has no counterpart in the structure of the cold-adapted enzyme, where gaps are located according to the structural alignment (Fig. 1A).
The comparison of the intensity and localization of the L3 dynamics shows that in AHA high flexibility is localized nearer to the catalytic triad than in PPA. 41 L5 is immediately downstream of the catalytic E200 AHA (E233 PPA) residue and is conserved in both structure and fluctuation intensity in AHA and PPA. However, significant differences were identified in the map of correlated motions between the L5 and L3 regions, which are greater and more connected in AHA. 41 L7 is downstream of a short α-helix carrying another catalytic residue, D264 AHA (D300 PPA). It is the most flexible region of AHA and features the greatest RMSF differences between the two enzymes (Fig. 1B). Dominant motions of L7 involve oscillations to and from the active-site groove, with amplitudes enhanced in AHA. In PPA, a highly flexible insertion with respect to AHA (L8) is located above L7 and interacts directly with it; this insertion is the most flexible region of PPA (Fig. 1B). L3 C and L7 are the loops of the β-barrel that are closer to the catalytic site and their flexibility is optimized in AHA, whereas in PPA the higher protein fluctuations are shifted further away from the catalytic triad at the L3 N and L8 PPA insertions. The dominant motions of PPA are partially restored in the mesophilic-like AHA mutants, even if the mutation sites are not located in the L3, L5, L7 or L8 regions or their immediate proximity (Fig. 1A). The long-range dynamic effects induced by single and multiple mutations suggest that our analysis should be directed not only toward the comparison of the global dynamics of these systems but also toward the detection of which local dynamic features are able to influence their global dynamics in a mesophilic-like direction.
SOM analysis of the flexibilities of AHA and its mutants

The approach combining SOMs and hierarchical clustering (complete linkage), which we previously proposed for the comparison of MD trajectories, 7 was used to analyse and compare the flexibilities of the wild-type AHA (AHAwt) and its seven mutants (AHA5, AHA5SS, AHAQI, AHASS, AHATV, AHAVF, and AHANR). The protocol and the SOM parameters employed, derived from the previous optimization process, 7 are reported in the Materials and methods section. To perform a multiple comparison, the ensemble of conformations extracted from the trajectories of AHAwt and its seven mutant variants was provided as input data to a single map. As reported in the Materials and methods section, each conformation was described by the Cartesian coordinates of its α-carbon (Cα) atoms. To focus our attention on the flexibility around the catalytic site, only the domains A and B were considered for all the systems. Moreover, with the aim of a subsequent comparison with the PPA flexibility, only the matching residues extracted from the AHA–PPA structural alignment 41 were retained, based on the assumption that they bear the differences in the dynamic features of the two systems (Fig. 1B). 4,5,9,41 This selection process produced a set of 351 Cα atoms considered for each system. After the learning process, the proposed dimension of the map, 7 i.e. 100 neurons, led to a set of 100 prototype vectors describing the most relevant conformational features from the input space and providing an intermediate topological reference ("proto-clusters" 31 ). These vectors were then clustered using the complete linkage approach and all the original conformations won by a neuron were accordingly assigned to the same cluster. Therefore the output SOM allows us to extract information at different levels: neuron level, using only the characteristics of the prototype vectors, i.e.
their variables and their hit population; centroid level, using the input conformation closest to the centroid of each cluster; cluster level, using all the input conformations assigned to a cluster. The SOM trained with the eight trajectories of AHA and its mutants is shown in Fig. 2A by using a pictorial representation where each neuron is represented by a hexagon, and the number of input data won by each neuron by black hexagons with a size proportional to the neuron population. Visual inspection of the map allows the analysis of the final distribution of the original data: neighbouring neurons are associated with similar groups of conformations; map regions with highly populated neurons represent conformations frequently sampled in the trajectories. The seven clusters obtained by complete linkage (see the Materials and methods section) showed a good quality, as demonstrated by the Silhouette plot in Fig. 2B, which reports satisfactory values of S for the neurons assigned to all the clusters. To analyse the conformational space described by each cluster, the cluster level and the centroid level information were used. First, the population of each

cluster was analysed to find groups of similar conformations sampled by different systems, as well as to identify conformations with peculiar characteristics sampled by a single system. Second, the centroid of each cluster was analysed to obtain a synthetic three-dimensional (3D) visualization of the conformations that characterize the cluster.

Fig. 2 SOM analysis of the dynamics of AHA and its mutants. (A) SOM trained with the eight trajectories of AHA and its mutants. Colors indicate the seven clusters obtained by hierarchical clustering of the neurons. The size of the black hexagons in each neuron is proportional to the number of hits in the neuron. (B) Silhouette plot: the profile of each cluster is composed of a bar for each neuron belonging to the cluster, whose height is the value of the S index; clusters are colored with the same color code used for the SOM. (C) Bar plot representing the percentage distribution of conformations of each mutant in the seven clusters.

The bar plot in Fig. 2C indicates that some clusters are specifically representative of the flexibility of one mutant, while others are composed of conformations sampled by different mutants. In particular, clusters 3–7 feature more than 40% of conformations relative to one of the systems; clusters 3 and 5 are mainly composed of conformations of AHAwt, while clusters 4, 6 and 7 are typical of AHATV, AHAQI and AHAVF, respectively. In contrast, clusters 1 and 2 do not present a dominant population from one system, but the total absence of AHAwt conformations in both of them is evident. Thus, these two clusters can be interpreted as the ensemble of conformations belonging to common basins sampled in the simulations of the mesophilic-like AHA mutants. The centroid of a cluster in a SOM is defined as the average vector between all the prototype vectors of the neurons assigned to that cluster.
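The centroid-level representation can be sketched in a few lines: the centroid is the average of the prototype vectors of the neurons in a cluster, and the representative structure is the conformation assigned to that cluster that lies closest to the centroid in Euclidean distance. The function names `cluster_centroid` and `closest_conformation` and the data layout (plain lists of coordinate vectors, one cluster label per neuron) are assumptions made for this illustration.

```python
import math


def cluster_centroid(prototypes, neuron_cluster, cluster_id):
    """Average of the prototype vectors of all neurons assigned to cluster_id."""
    members = [p for p, c in zip(prototypes, neuron_cluster) if c == cluster_id]
    dim = len(members[0])
    return [sum(p[k] for p in members) / len(members) for k in range(dim)]


def closest_conformation(hits, centroid):
    """Index of the hit conformation with the smallest Euclidean distance to the centroid.

    hits should contain only the conformations assigned to the cluster of interest.
    """
    return min(range(len(hits)), key=lambda i: math.dist(hits[i], centroid))
```

With real data each vector would hold the flattened Cα coordinates of one conformation, so the selected index points back to an actual MD frame that can be visualised in 3D.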
Therefore the hit conformation closest to each cluster centroid, in terms of Euclidean distance, is a good representative of the conformational characteristics of that cluster. Fig. 3 shows, at the centroid level, the seven clusters obtained for AHAwt and its mutants, by using a porcupine representation that describes structural differences between each centroid and the AHAwt reference structure. It can be observed that clusters 3 (red) and 5 (magenta) mainly describe the flexibility of AHAwt (Fig. 2C): they include either conformations of AHAwt close to the starting X-ray structure of AHA (cluster 3) or conformations associated with a specific motion involving loops L3, L5 and L7 (cluster 5). Cluster 4 (brown), which mainly describes the

AHATV flexibility, is characterized by changes in two loop regions: the loops surrounding the catalytic site (L3, L5 and L7), and the region including both the N-terminal part of L8 (L8 N, residues 305–312, in proximity of the insertion in L8 loop, L8 I, of PPA) and the C-terminal part of L8 (L8 C, residues 322–329 of AHA). Cluster 6 (orange), mainly populated by AHAQI conformations, describes conformational changes located in the L3 region and in the N-terminal part of L7. Cluster 7 (cyan), with the majority of conformations from the AHAVF dynamics, shows a conformational change involving both the loop L7 and the facing loop, including the L8 N, coupled with conformational changes of the L8 C region. Clusters 1 and 2 describe two common dynamic features of the seven mutants (see Fig. 2C), showing conformational changes very similar to those observed in cluster 7, but with a reduced displacement of the loops (cluster 1, green), and the combined displacement of L7 toward the L8 N residues and of L5 in the opposite direction (cluster 2, blue). Not surprisingly, the most relevant differences in the cluster centroids are localized in the regions characterized by the highest difference in flexibility among the different AHA variants.

Fig. 3 Centroid level representation of the dynamics of AHA and its mutants. The porcupine representation describes direction and amplitude of the changes between a reference structure and the conformation closest to the centroid of each cluster. For each cluster, cones are colored with the same color used for that cluster in the SOM (Fig. 2A). The reference structure is the X-ray structure of AHAwt (PDB code: 1AQH), represented as a cartoon trace of the Cα atoms. The yellow dots represent the catalytic triad, and the yellow spheres the position of the mutations studied. The Cα atoms excluded from the analysis are shown in gray.
Notably, the L7 loop not only is the region that features the highest RMSF differences in AHAwt and its mutants, but also features significant displacements in each centroid except cluster 3, which is overall the most similar to the wild-type AHA X-ray structure. AHATV conformations are mainly ascribed to cluster 4, which is the cluster featuring remarkable structural changes in all the relevant aforementioned loops. Interestingly, AHATV is also the mutant characterized by the most striking RMSF differences when compared to AHA. 9

SOM analysis of PPA flexibility

As mentioned in the Materials and methods section, once a SOM is trained using a given training set, the neurons of the output map can be used to read and assign new data. Each input vector presented to the map is simply compared with all the prototype vectors, and the prototype vector that wins the data is not modified. The conformations extracted from the MD trajectory of PPA were presented to the SOM trained using the MD data of AHAwt and its mutants. Since only the corresponding positions of AHA and PPA in the structural alignment were considered for all systems (Fig. 1B), both the prototype vectors of the map and the vectors with the Cα coordinates of PPA conformations have the same dimension. The results of the assignment of the PPA conformations to the AHA SOM are reported in Fig. 4A. The majority of the conformations of PPA were assigned either to cluster 1 or to cluster 7, and the most populated neurons (the largest black hexagons) are located close to the border between these two clusters. In more detail, 71% of the PPA conformations were assigned to cluster 7 (cyan), 19% to cluster 1 (green) and the remaining 10% were distributed in the rest of the map. In Fig. 4B the PPA conformations that are closest to the centroids of clusters 1 and 7 are described by representing the conformational changes with respect to a reference structure (as shown in Fig. 3).
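The assignment step described above, in which new conformations are only read by a frozen map and then counted per cluster, can be sketched as follows. The function name `assign_to_clusters` and the toy inputs in the usage note are hypothetical; the sketch assumes that the new vectors have the same dimension as the prototype vectors, as is the case for the aligned Cα positions.

```python
import math
from collections import Counter


def assign_to_clusters(prototypes, neuron_cluster, conformations):
    """Assign new conformations to a trained SOM without updating it.

    prototypes: trained prototype vectors, one per neuron (left unmodified).
    neuron_cluster: cluster label of each neuron, in the same order.
    Returns the percentage of conformations falling in each cluster.
    """
    counts = Counter()
    for x in conformations:
        # The winner is the neuron with the closest prototype; no learning step follows.
        win = min(range(len(prototypes)), key=lambda i: math.dist(prototypes[i], x))
        counts[neuron_cluster[win]] += 1
    total = len(conformations)
    return {cluster: 100.0 * n / total for cluster, n in counts.items()}
```

For instance, with three toy neurons labelled `[1, 7, 7]` and four new vectors of which three are closest to the neurons of cluster 7, the function returns 75% for cluster 7 and 25% for cluster 1, which is the kind of percentage breakdown reported for the PPA ensemble.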
Unlike in the analysis of the flexibility of the AHA mutants, where the compared systems shared the same starting structure, in this case the starting structure of PPA shows significant differences with respect to that of AHA, mainly located in the catalytic site and in the region of L8, as shown in Fig. 4C. While in the first case the comparison of the cluster centroids led to a clear definition of the different conformational basins whose sampling only depends on the mutations studied, in this second case the visualization has to take into account the initial structural differences between AHA and PPA plus the conformational sampling of PPA in the MD trajectory. However, by taking these differences into account, the analysis of Fig. 4B

allows us to highlight the similarities of the conformational space sampled by some AHA mutants and that of PPA. Cluster 7 is the cluster which best represents PPA specific features. Interestingly, the PPA conformations in cluster 7 are characterized by displacements in the loops of the catalytic site and in the regions L8 N and L8 C, as was observed for conformations of AHA mutants in cluster 7 (Fig. 3). This analysis highlights the ability of AHAVF to sample the conformations closest to PPA (see cluster 7 in Fig. 2C). This suggests an important role of this mutation in producing mesophilic-like dynamic properties.

Fig. 4 SOM analysis of the dynamics of PPA. (A) Assignment of the conformations of the PPA trajectory to the SOM trained with AHA and its mutants (Fig. 2A). The size of the black hexagons is proportional to the number of PPA conformations assigned to each neuron. (B) Porcupine representation of the conformational changes described by each cluster (see Fig. 3 for details). The reference structure used is the X-ray structure of AHAwt. (C) Structural differences between the starting structures of AHAwt (PDB code: 1AQH) and PPA (PDB code: 1PIF). The length of the needles represents the distance between the two structures.

Local descriptors of the clustered conformations

The above analyses allowed us to identify and cluster differences in the global dynamic properties of AHA mutants, as well as to specifically identify which mutations determine the major shift from the AHA toward the PPA features. A deeper analysis of the obtained final conformational clusters was aimed at verifying whether the proposed approach is also able to highlight some local features, related to the flexibility of AHA mutants, that were proposed to determine the partial conversion of AHA into a mesophilic-like variant. 9 In particular, two groups of local descriptors of the clustered conformations were analysed: geometrical descriptors and chemical descriptors.
The first group describes the arrangement of the three loops surrounding the catalytic site (L3, L5 and L7), measured by the distances between pairs of selected Cα atoms in the loops. As summarized above, AHA and PPA show a different flexibility in this region.41 In fact, L7 is characterized by an enhanced conformational freedom in AHA, whereas in PPA the long insertion in the L8 loop, which directly interacts with L7, constrains the flexibility of the L7 loop. On the opposite side of the catalytic triad, the flexibility of AHA is mainly concentrated in the C-terminal region of L3 (L3C), in the immediate proximity of the catalytic residues, whereas in PPA the highest flexibility peaks are located in a region more distant from the active site, such as L3N; this is likely due to the different effects exerted by L5 on the L3 conformation. These descriptors employ a sub-set of Cα atom coordinates, i.e. the same information used to train the map. Therefore they allow the results to be analysed at all the levels of representation (neurons, centroid and cluster) made available by the SOM. The descriptors belonging to the second group were chosen among those previously defined as related to key intramolecular interactions that differentiate AHA and PPA:9,41 salt bridges typical of AHAwt that are affected by the mutations and absent in PPA, and the value of the χ1 dihedral angle of the catalytic residues D264AHA and D300PPA.41 These descriptors analyse information that was not directly used to train the map. Thus they can be represented only at the cluster level, using the original conformations assigned to each cluster. It is relevant to mention that both groups of descriptors were employed to identify local differences on a map that was trained using, as its only input information, the global conformational changes described by the different MD ensembles.
This is an important characteristic of the SOM approach: it allowed us to highlight the relationships between local behaviours and the global domain flexibility, and to detect the local features that most influence the dynamic properties in a mesophilic-like direction.

Geometrical descriptors: inter-loop distances

The centroids of the previously analysed clusters showed that the main conformational changes of the AHA mutants are located around the catalytic site and in L8. To analyse the different flexibilities in these regions in more depth, three inter-loop distances were monitored in the simulated ensemble by measuring the distances between the Cα atoms of the residue with the highest flexibility in each loop (see the RMSF profiles in Fig. 1B). The distances investigated are: L3–L7 (using residues 126AHA and 271AHA), L3–L5 (residues 126AHA and 206AHA), and L7–L8 (residues 273AHA and 309AHA). A neuron-level representation of these distances is plotted in Fig. 5A, using a SOM colored in a gray scale. Each prototype vector at the end of the SOM learning process can be interpreted as the average conformation among all the conformations assigned to the neuron. It is thus possible to compute the Euclidean distance between the selected Cα atoms using the corresponding variables of the prototype vectors and to plot these distances directly on the SOM.

In the map of the L3–L7 distance, the highest distance values are found in the neurons belonging to clusters 5 (magenta) and 7 (cyan). As discussed before (Fig. 3), these clusters describe conformational changes in both L3 and L7 in opposite directions, corresponding to the opening of the catalytic site. Clusters 3 (red) and 2 (blue) show the lowest values of this distance, defining a more closed catalytic site. In the map reporting the L3–L5 distance, the region of cluster 2 (blue) shows the highest values that, according to the analysis of the centroid conformation (Fig. 3), indicate a displacement of these two loops in opposite directions; clusters 3 and 5, typical of AHAwt, are characterized by values around 10.0 Å. For the L7–L8 distance, the latter clusters (3 and 5) have the highest values in the map, as opposed to cluster 2, which has the lowest ones.

Fig. 5 Geometrical descriptors: inter-loop distances. (A) Neuron-level representation of the distribution of the L3–L7, L3–L5 and L7–L8 inter-loop distances in the SOM trained with AHAwt and its mutants (Fig. 2A). In each neuron, the distance values are calculated between the coordinates of selected Cα atoms belonging to the loops (see text) in the prototype vector. Neurons are colored in gray scale according to the distance value (in the reference scale the distance values are in Å). The colored lines in each map recall the map clustering (Fig. 2A). (B and C) Cluster-level representation of the inter-loop distances of AHAwt and its mutants (full dots) and PPA (empty dots) in the plane L3–L5 vs. L3–L7 (B) and L3–L5 vs. L7–L8 (C), colored according to the SOM clustering (Fig. 2A).

Once the values of each geometrical descriptor have been plotted and analysed on the map, the same descriptor can be analysed at the cluster level, using all the conformations assigned to the clusters. In this case the Euclidean distances are computed using the coordinates of the selected Cα atoms in the original conformations. Both the above analysis of the map (Fig. 5A) and the populations of the clusters (Fig. 2C) suggest that some clusters are particularly suitable to highlight the different flexibility of AHAwt and its mutants with respect to PPA.
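The neuron-level distance calculation described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each prototype vector is stored as a flat list of Cα coordinates laid out as [x0, y0, z0, x1, y1, z1, ...]; this layout, the function names and the toy values are hypothetical and are not taken from the original work. The same function applies unchanged to the original conformations when working at the cluster level.

```python
import math

def ca_distance(vector, index_a, index_b):
    """Euclidean distance between two C-alpha atoms in a flat vector.

    The vector is assumed (hypothetically) to be laid out as
    [x0, y0, z0, x1, y1, z1, ...]; index_a and index_b are zero-based
    atom positions in that vector, not residue numbers.
    """
    a = vector[3 * index_a: 3 * index_a + 3]
    b = vector[3 * index_b: 3 * index_b + 3]
    return math.dist(a, b)

def neuron_distances(prototypes, index_a, index_b):
    """One inter-atom distance per neuron, computed on its prototype vector."""
    return [ca_distance(p, index_a, index_b) for p in prototypes]

# Toy map with two neurons and two atoms per prototype vector.
toy_prototypes = [
    [0.0, 0.0, 0.0, 3.0, 4.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]
toy_result = neuron_distances(toy_prototypes, 0, 1)
print(toy_result)  # [5.0, 1.0]
```

The resulting list holds one value per neuron, which could then be mapped onto a gray scale to color the map.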
For the most informative clusters, the inter-loop distances in the conformations of the AHA mutants and PPA are plotted in the plane L3–L5 vs. L3–L7 (Fig. 5B) and L3–L5 vs. L7–L8 (Fig. 5C). The plot in Fig. 5B, which summarizes the conformational changes of the catalytic site, shows a good separation of the conformations sampled by AHAwt and PPA, which have different flexibility in this area.41 In fact, as mentioned in the analysis of the cluster centroids, cluster 3 (red) contains the conformations close to the X-ray structure of AHA and cluster 5 (magenta) describes the main conformational changes during the AHAwt MD simulation, involving primarily L7; the graph shows low L3–L5 distances for both clusters and increasing values of the L3–L7 distance from cluster 3 to cluster 5. The conformations of PPA, instead, feature the highest variance in the L3–L5 distance and a conserved value of the L3–L7 distance. This different flexibility of AHAwt and PPA results in a scarce overlap of the two regions in the graph. Interestingly, the conformations in cluster 2 (blue) create a connection between the two; since this cluster contains conformations of all the mutants and none of AHAwt, this indicates that the mutations shift the flexibility of the catalytic site of AHA toward mesophilic-like characteristics. In the second plot (Fig. 5C), the L3–L5 displacement in the catalytic site is plotted versus the displacement of L7 toward the corresponding L8 region in PPA. In this last region the structures of AHA and PPA are significantly different (Fig. 4C). This is reflected in the plot by the separation of clusters 5 (which describes AHAwt conformations) and 7 (typical of PPA): while AHAwt conformations have high values of L7–L8 and low values of L3–L5, the opposite behavior is observed for PPA conformations.
The analysis of the cluster centroids showed characteristic conformational changes in these regions, especially in clusters 4 (brown) and 6 (orange), which were mainly populated by conformations of AHATV and AHAQI, respectively. Indeed, the localization of the conformations belonging to these two clusters in the plot indicates that they are intermediate between those of AHAwt and PPA in terms of the two monitored distances: cluster 4 better reproduces the approach of L7 toward L8 typical of PPA, while cluster 6 has values of both distances intermediate between AHAwt and PPA.

Chemical descriptors: salt-bridges

Two previously identified salt-bridges9 were chosen for the analysis, among those with the greatest variability in persistence in the MD simulations of the studied systems. The K177-D203 salt-bridge in AHA (K200-D306 in PPA) connects loops L3 and L5, and it is involved in a network of electrostatic interactions that is partially disrupted both in some AHA mutants and in PPA.9 The presence of this salt-bridge thus describes well the structural relationship between these two loops, and its persistence in the simulations may be correlated with local structural modifications. The E279-K334 salt-bridge connects the L7 and L8 loops at their C-terminal portions, and it is present in AHA only; its persistence is likely to be influenced by the conformations of the L7 and L8 loops. In each cluster, the values of the distance between the charged groups of the two interacting residues were calculated and divided into 120 equally sized discrete bins in the range 0–1.5 nm. The number of structures collected in each bin describes the relative frequency at which the charged groups are found at the given distance, thus giving the distribution of the conformations over the distance between the charged groups. While the distributions obtained for both K177-D203 and E279-K334 are complex and multimodal, the SOM clustering remarkably succeeded in separating the conformations into clusters that also differ in the behavior of these local features. This can be seen by comparing the distributions in the single clusters (Fig. 6). For the K177-D203 salt-bridge (Fig. 6A), clusters 1 (green), 3 (red), 4 (brown) and 7 (cyan) show a distinct peak around 0.2 nm, which means that the salt-bridge is actually present in these clusters. In contrast, in clusters 2 (blue), 5 (magenta) and 6 (orange) and in the PPA conformations (assigned to clusters 1 and 7) the distribution peak is at higher values (around 0.6 nm), implying that the salt-bridge is not present or at least much less persistent. As cluster 6 is mainly populated by conformations of the AHAQI mutant (Fig. 3), it can be inferred that this single mutation is able to reproduce this PPA local feature.
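The binning procedure used for the salt-bridge distances can be sketched as follows. This is a minimal illustration and not the authors' code: the function name, the choice to normalise by the total number of structures, and the toy distances are assumptions, while the default of 120 bins over the range 0–1.5 nm follows the text.

```python
def relative_frequencies(distances_nm, n_bins=120, lower=0.0, upper=1.5):
    """Relative frequency of distances in equally sized bins.

    Values outside [lower, upper) fall in no bin but still count toward
    the total, so the frequencies refer to all structures in the cluster
    (an assumption made for this sketch).
    """
    width = (upper - lower) / n_bins
    counts = [0] * n_bins
    for d in distances_nm:
        if lower <= d < upper:
            # Clamp the index to guard against floating-point rounding
            # for values just below the upper edge.
            index = min(int((d - lower) / width), n_bins - 1)
            counts[index] += 1
    total = len(distances_nm)
    if total == 0:
        return [0.0] * n_bins
    return [c / total for c in counts]

# Toy cluster: two structures with a formed salt-bridge, one without.
toy_distances = [0.205, 0.207, 0.61]
freqs = relative_frequencies(toy_distances)
peak_bin = max(range(len(freqs)), key=freqs.__getitem__)
print(peak_bin)  # 16
```

With 120 bins the bin width is 0.0125 nm, so bin 16 starts at about 0.2 nm, the distance at which the text locates the peak for a formed salt-bridge.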
In contrast, the conformations belonging to cluster 7 (cyan), which best reproduce the global dynamic behavior of PPA, do not feature the breaking of the K177-D203 salt-bridge, which is not present in PPA. As the cyan cluster is largely composed of structures of the AHAVF mutant, it follows that this mutation alone is not sufficient to completely restore the PPA characteristics. From what was discussed above, it is conceivable that introducing this mutation together with one able to disrupt the K177-D203 salt-bridge, such as the QI mutation, could shift the flexibility of AHA even further towards that of PPA.

The E279-K334 salt-bridge features a significant peak around 0.25 nm in clusters 3 (red) and 5 (magenta), in which the structures of AHAwt are highly represented. The salt-bridge is absent in all the other clusters, as it is in PPA, showing that all the AHA mutants behave similarly to PPA in this respect.

Fig. 6 Chemical descriptors: salt bridges. The bar plots represent the relative frequency of different ranges of the donor–acceptor distance for the selected salt bridges in the conformations assigned to the clusters (cluster level): (A) salt bridge K177-D203 in AHA (K200-D306 in PPA); (B) salt bridge E279-K334 in AHA. In both graphs, the width of the bins is set to 0.15 nm. The bins of the AHA mutant conformations are colored according to the SOM clustering (Fig. 2A), while the bins of the PPA conformations assigned to clusters 1 and 7 (Fig. 4A) are colored in gray.

Chemical descriptors: χ1 dihedral of D264AHA and D300PPA

Both D264 in AHA and D300 in PPA belong to the catalytic triad in the binding site, and the orientation of the side-chain of this residue, described by the χ1 dihedral, was investigated in AHAwt and PPA.41 A detail of D264AHA and D300PPA is shown in Fig. 7C, to highlight the difference between the two structures.
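For readers less familiar with this descriptor, the χ1 dihedral of an aspartate is the torsion angle defined by the N, Cα, Cβ and Cγ atoms. The sketch below shows one common way to compute a dihedral from four Cartesian points, reported in the 0–360° range. It is an illustrative implementation under these assumptions, not the procedure used in the paper; the function names and toy coordinates are hypothetical.

```python
import math

def _sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def _dot(a, b):
    return sum(a[i] * b[i] for i in range(3))

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dihedral_deg(p0, p1, p2, p3):
    """Dihedral angle p0-p1-p2-p3 in degrees, mapped to [0, 360)."""
    b0 = _sub(p0, p1)          # points from p1 back to p0
    b1 = _sub(p2, p1)          # central bond
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = [c / norm for c in b1]
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, [_dot(b0, b1) * c for c in b1])
    w = _sub(b2, [_dot(b2, b1) * c for c in b1])
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x)) % 360.0

# Toy coordinates standing in for the N, CA, CB and CG atoms.
n, ca, cb, cg = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)
chi1 = dihedral_deg(n, ca, cb, cg)
print(round(chi1, 6))  # 90.0
```

Reporting the angle in the 0–360° range, rather than from -180° to 180°, keeps values directly comparable with orientations quoted as positive angles.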
As described in previous work,41 the orientation of the side-chain of this residue toward the catalytic site is well correlated with the mobility of L7. The values of the χ1 dihedral of D264AHA in each cluster are shown in Fig. 7A. Values were divided into discrete bins of 15°. The number of structures collected in each bin describes the relative frequency at a given χ1 dihedral, giving the distribution of the conformations over the dihedral angle. The circular plots in each row are arranged according to their neighbourhood in the map, i.e. cluster 3 (red) and cluster 5 (magenta), cluster 2 (blue) and cluster 7 (cyan), cluster 1 (green) and cluster 6 (orange), and cluster 4 (brown) by itself. This separation helps the interpretation, since neighbourhood in the map means neighbourhood in conformational space, and thus that transitions between one state and the other are allowed. In previous work the population of the χ1 dihedral of D264AHA was described by defining three available orientations, centered around 60°, 160° and 300°,41 while for the χ1 dihedral of D300PPA the orientation close to 300° was not sampled. In the first row of Fig. 7A, composed of clusters 3 and 5, typical of AHAwt dynamic properties, all three orientations are populated, but in cluster 3 the population is centered around the orientation in the AHA crystal structure (300°), while the orientation around 160° is preferred in cluster 5. The orientation close to the PPA crystal structure (60°) is not the preferred one in either cluster. The second and third rows of Fig. 7A can be associated in terms of populations and of the shift from one orientation to the other. As described previously, both clusters 2 and 1 contain conformations of all the studied mutants, while clusters 6 and 7 can be considered typical of AHAQI and AHAVF, respectively. In both pairs there is a shift from a population centered at 60° to a population centered at 160°. A peculiar behavior is observed in cluster 4, the typical cluster of AHATV, in which the population is completely centered at 60°, assuming the same conformation as the PPA crystal structure.

The PPA conformations assigned to the clusters are described using the same graphs in Fig. 7B. As mentioned above, the orientation around 300° was not observed for PPA in previous work, and indeed it is not sampled here. Also in this case the assignment to different clusters in the map leads to the same shift observed for the AHA conformations. In fact, while the population of the χ1 dihedral of D300PPA is centered at 60° for cluster 1, as previously observed, in cluster 7 it is centered around 160°. As previously seen for the salt-bridge persistence, there is one cluster, cluster 7, that best reproduces the overall dynamics of PPA and one cluster, cluster 4, that best reproduces this local descriptor compared with PPA. Introducing the mutations characterizing clusters 7 (AHAVF) and 4 (AHATV) into AHAwt could shift the overall behavior even further toward PPA.

Fig. 7 Chemical descriptors: χ1 dihedral of D264AHA and D300PPA. (A) Distribution of the values of the χ1 dihedral of D264AHA in the conformations of AHA mutants (cluster level) assigned to the SOM clusters (Fig. 2A). (B) Distribution of the values of the χ1 dihedral of D300PPA in the PPA conformations assigned to clusters 1 and 7 (Fig. 4A). In all the plots the bin size was set to 15° and the cluster color code is the same as that adopted in Fig. 2A; the red and blue lines are the χ1 values of D264AHA and D300PPA in the corresponding crystal structures, respectively. (C) Cartoon representation of the structural superimposition of the crystal structures of AHAwt (red) and PPA (blue), in the surroundings of residues D264AHA and D300PPA, represented as sticks. Loops L3, L5 and L7 are labeled and residues D174AHA and E200AHA are represented as yellow dots.

Conclusions

The seven AHA mutants studied here are functional intermediates between psychrophilic and mesophilic enzymes, featuring different degrees of mesophilic-like character,9,28,29 and are characterized by a strong link between structural dynamics and functional activity, similarly to other cold-adapted enzymes.23,24,47 Therefore, a multiple comparison of their dynamics is expected to help elucidate relative functional effects, giving a crucial contribution to protein engineering studies. MD conformational ensembles could be used for this purpose but, unfortunately, this poses a two-fold challenge. First, while the comparison of simulation replicas is routinely done for single systems, there is still a need for standard approaches for an effective comparison of trajectories of different mutants. Second, such approaches should also facilitate the direct interpretation of similarities in relation to functional activity. We recently proposed a two-level approach that combines SOMs and hierarchical clustering.
This is a suitable and general tool to detect and functionally annotate similarities and differences in MD ensembles.7 This approach allows the comparison of multiple trajectories of different systems at the same time, as well as the detection of subtle differences otherwise unrecoverable with analyses based on global flexibility