Predicting the functional consequences of amino acid polymorphisms using hierarchical Bayesian models

Size: px
Start display at page:

Download "Predicting the functional consequences of amino acid polymorphisms using hierarchical Bayesian models"

Transcription

1 Predicting the functional consequences of amino acid polymorphisms using hierarchical Bayesian models C. Verzilli 1,, J. Whittaker 1, N. Stallard 2, D. Chasman 3 1 Dept. of Epidemiology and Public Health, Imperial College London, London, UK 2 Medical and Pharmaceutical Statistics Research Unit, The University of Reading, Reading, UK 3 Variagenics/Nuvelo, Cambridge, MA, USA Funded by the Wellcome Trust Cardiff 11/06/04 p.1/26

2 Contents Protein structure and amino acid polymorphisms Cardiff 11/06/04 p.2/26

3 Contents Protein structure and amino acid polymorphisms Data from Lac Repressor and Lysozyme mutagenesis experiments Cardiff 11/06/04 p.2/26

4 Contents Protein structure and amino acid polymorphisms Data from Lac Repressor and Lysozyme mutagenesis experiments Modelling Cardiff 11/06/04 p.2/26

5 Contents Protein structure and amino acid polymorphisms Data from Lac Repressor and Lysozyme mutagenesis experiments Modelling Results Cardiff 11/06/04 p.2/26

6 Contents Protein structure and amino acid polymorphisms Data from Lac Repressor and Lysozyme mutagenesis experiments Modelling Results Discussion Cardiff 11/06/04 p.2/26

7 Protein structure and AA polymorphisms Central dogma: DNA RNA protein Cardiff 11/06/04 p.3/26

8 Protein structure and AA polymorphisms Central dogma: DNA RNA protein Triplets of bases (codons) in coding regions of the DNA... GTG CAC CTG ACT CCT GAG GAG... Cardiff 11/06/04 p.3/26

9 Protein structure and AA polymorphisms Central dogma: DNA RNA protein Triplets of bases (codons) in coding regions of the DNA... GTG CAC CTG ACT CCT GAG GAG... are translated to a sequence of amino acids in a protein Cardiff 11/06/04 p.3/26

10 Protein structure and AA polymorphisms Central dogma: DNA RNA protein Triplets of bases (codons) in coding regions of the DNA... GTG CAC CTG ACT CCT GAG GAG... are translated to a sequence of amino acids in a protein... Val His Leu Thr Pro Glu Glu... Cardiff 11/06/04 p.3/26

11 Protein structure and AA polymorphisms Central dogma: DNA RNA protein Triplets of bases (codons) in coding regions of the DNA... GTG CAC CTG ACT CCT GAG GAG... are translated to a sequence of amino acids in a protein... Val His Leu Thr Pro Glu Glu... which folds spontaneously to a three-dimensional structure. Cardiff 11/06/04 p.3/26

12 Protein structure and AA polymorphisms There are 20 different possible amino acids (AA) that vary in their physico-chemical properties. Cardiff 11/06/04 p.4/26

13 Protein structure and AA polymorphisms There are 20 different possible amino acids (AA) that vary in their physico-chemical properties. DNA variants that change amino acid sequence (nssnp) may change protein structure and hence function Cardiff 11/06/04 p.4/26

14 Protein structure and AA polymorphisms There are 20 different possible amino acids (AA) that vary in their physico-chemical properties. DNA variants that change amino acid sequence (nssnp) may change protein structure and hence function Example: sickle cell anæmia: Cardiff 11/06/04 p.4/26

15 Protein structure and AA polymorphisms There are 20 different possible amino acids (AA) that vary in their physico-chemical properties. DNA variants that change amino acid sequence (nssnp) may change protein structure and hence function Example: sickle cell anæmia: GAG (GLu) GTG (Val) mutation in β-globin Cardiff 11/06/04 p.4/26

16 Protein structure and AA polymorphisms There are 20 different possible amino acids (AA) that vary in their physico-chemical properties. DNA variants that change amino acid sequence (nssnp) may change protein structure and hence function Example: sickle cell anæmia: GAG (GLu) GTG (Val) mutation in β-globin introduces an hydrophobic patch on the surface of the molecule Cardiff 11/06/04 p.4/26

17 Protein structure and AA polymorphisms There are 20 different possible amino acids (AA) that vary in their physico-chemical properties. DNA variants that change amino acid sequence (nssnp) may change protein structure and hence function Example: sickle cell anæmia: GAG (GLu) GTG (Val) mutation in β-globin introduces an hydrophobic patch on the surface of the molecule major changes to its properties. Cardiff 11/06/04 p.4/26

18 Protein structure and AA polymorphisms There are 20 different possible amino acids (AA) that vary in their physico-chemical properties. DNA variants that change amino acid sequence (nssnp) may change protein structure and hence function Example: sickle cell anæmia: GAG (GLu) GTG (Val) mutation in β-globin introduces an hydrophobic patch on the surface of the molecule major changes to its properties. Any nssnp which disrupts structure is a strong candidate in disease/pharmacogenetic studies Cardiff 11/06/04 p.4/26

19 Objective Identify nssnps likely to disrupt function in a novel protein, based on training data where the functionality of nssnps is known. Cardiff 11/06/04 p.5/26

20 Objective Identify nssnps likely to disrupt function in a novel protein, based on training data where the functionality of nssnps is known. Explanatory variables: Cardiff 11/06/04 p.5/26

21 Objective Identify nssnps likely to disrupt function in a novel protein, based on training data where the functionality of nssnps is known. Explanatory variables: structural data: hydrophobicity, relative B factor, surface accessibility of native amino acid. Cardiff 11/06/04 p.5/26

22 Objective Identify nssnps likely to disrupt function in a novel protein, based on training data where the functionality of nssnps is known. Explanatory variables: structural data: hydrophobicity, relative B factor, surface accessibility of native amino acid. sequence-based data: conservation of native amino acid in table of multiple sequence alignment. Cardiff 11/06/04 p.5/26

23 Objective Lots of recent interest in this (Gunther et al, 2003; Stitziel et al, 2003; Wang and Moult, 2001; Terp et al, 2002; del Sol Mesa et al, 2003; Ng and Henikoff, 2002; Chasman and Adams, 2001; Sunyaev et al, 2000,2001; Saunders and Baker, 2002) Cardiff 11/06/04 p.6/26

24 Objective Lots of recent interest in this (Gunther et al, 2003; Stitziel et al, 2003; Wang and Moult, 2001; Terp et al, 2002; del Sol Mesa et al, 2003; Ng and Henikoff, 2002; Chasman and Adams, 2001; Sunyaev et al, 2000,2001; Saunders and Baker, 2002) None of these use statistical models: we will build a probabilistic model for protein function. Cardiff 11/06/04 p.6/26

25 Data Data from site-directed mutagenesis experiments on proteins with known 3D structures, specifically Lac repressor (Markiewicz et al, 1994) and Lysozyme (Rennell et al 1991) proteins. Cardiff 11/06/04 p.7/26

26 Data Data from site-directed mutagenesis experiments on proteins with known 3D structures, specifically Lac repressor (Markiewicz et al, 1994) and Lysozyme (Rennell et al 1991) proteins. the native, wild-type, amino acids are replaced, one at a time, by non-native amino acids. Cardiff 11/06/04 p.7/26

27 Data Data from site-directed mutagenesis experiments on proteins with known 3D structures, specifically Lac repressor (Markiewicz et al, 1994) and Lysozyme (Rennell et al 1991) proteins. the native, wild-type, amino acids are replaced, one at a time, by non-native amino acids. Effect on protein functionality recorded. Cardiff 11/06/04 p.7/26

28 Data Data from site-directed mutagenesis experiments on proteins with known 3D structures, specifically Lac repressor (Markiewicz et al, 1994) and Lysozyme (Rennell et al 1991) proteins. the native, wild-type, amino acids are replaced, one at a time, by non-native amino acids. Effect on protein functionality recorded. For our data: Cardiff 11/06/04 p.7/26

29 Data Data from site-directed mutagenesis experiments on proteins with known 3D structures, specifically Lac repressor (Markiewicz et al, 1994) and Lysozyme (Rennell et al 1991) proteins. the native, wild-type, amino acids are replaced, one at a time, by non-native amino acids. Effect on protein functionality recorded. For our data: Roughly 12 substitutions per site Cardiff 11/06/04 p.7/26

30 Data Data from site-directed mutagenesis experiments on proteins with known 3D structures, specifically Lac repressor (Markiewicz et al, 1994) and Lysozyme (Rennell et al 1991) proteins. the native, wild-type, amino acids are replaced, one at a time, by non-native amino acids. Effect on protein functionality recorded. For our data: Roughly 12 substitutions per site Outcome is binary indicator of functionality Cardiff 11/06/04 p.7/26

31 Data Data from site-directed mutagenesis experiments on proteins with known 3D structures, specifically Lac repressor (Markiewicz et al, 1994) and Lysozyme (Rennell et al 1991) proteins. the native, wild-type, amino acids are replaced, one at a time, by non-native amino acids. Effect on protein functionality recorded. For our data: Roughly 12 substitutions per site Outcome is binary indicator of functionality We will train on the Lac repressor and validate on Lysozyme (and vice-versa). Cardiff 11/06/04 p.7/26

32 Lac Repressor Lac repressor controls the synthesis of various enzymes. Cardiff 11/06/04 p.8/26

33 Lac Repressor Lac repressor controls the synthesis of various enzymes. In the absence of lactose it binds to the DNA double helix upstream of the genes that code for enzymes necessary for E. coli to use lactose as a source of energy, preventing their synthesis. Cardiff 11/06/04 p.8/26

34 Lac Repressor Lac repressor controls the synthesis of various enzymes. In the absence of lactose it binds to the DNA double helix upstream of the genes that code for enzymes necessary for E. coli to use lactose as a source of energy, preventing their synthesis. Amino acids at 264 sites (out of 360) mutated to give total of 3245 observations. Cardiff 11/06/04 p.8/26

35 Lac Repressor Lac repressor controls the synthesis of various enzymes. In the absence of lactose it binds to the DNA double helix upstream of the genes that code for enzymes necessary for E. coli to use lactose as a source of energy, preventing their synthesis. Amino acids at 264 sites (out of 360) mutated to give total of 3245 observations. Lysozyme molecule from T4 phage: Cardiff 11/06/04 p.8/26

36 Lac Repressor Lac repressor controls the synthesis of various enzymes. In the absence of lactose it binds to the DNA double helix upstream of the genes that code for enzymes necessary for E. coli to use lactose as a source of energy, preventing their synthesis. Amino acids at 264 sites (out of 360) mutated to give total of 3245 observations. Lysozyme molecule from T4 phage: synthesised from the phage DNA once it has infected a bacterium and digested the bacteria cell wall, allowing replicated copies of the phage to escape. Cardiff 11/06/04 p.8/26

37 Lac Repressor Lac repressor controls the synthesis of various enzymes. In the absence of lactose it binds to the DNA double helix upstream of the genes that code for enzymes necessary for E. coli to use lactose as a source of energy, preventing their synthesis. Amino acids at 264 sites (out of 360) mutated to give total of 3245 observations. Lysozyme molecule from T4 phage: synthesised from the phage DNA once it has infected a bacterium and digested the bacteria cell wall, allowing replicated copies of the phage to escape. Amino acids at 143 sites (out of 162) mutated to give a total of 1632 observations. Cardiff 11/06/04 p.8/26

38 Predictive features used Feature Accessibility Relative accessibility Relative phylogenetic entropy Neighbourhood rel. phylogenetic entropy Relative B-factor Neighbourhood relative B-factor Unusual AA Buried charge Turn breaking Helix breaking Conserved Description Solvent accessible area of native AA Accessibility relative to maximum accessibility in training set Normalised phylogenetic entropy of native AA Phylogenetic entropy of structural neighbourhood of native AA Normalised B-factor of native AA Normalised B-factor of structural neighbourhood of native AA Mutant AA is not in phylogenetic profile Mutant is charged AA at buried site Mutant AA occurs at glycine or proline in a turn Mutant AA occurs in helical region and involves glycine or proline Native AA is at conserved position in phylogenetic profile Cardiff 11/06/04 p.9/26

39 Methods Prediction problem; lots of standard classification tools could be tried. Cardiff 11/06/04 p.10/26

40 Methods Prediction problem; lots of standard classification tools could be tried. Logistic regression Cardiff 11/06/04 p.10/26

41 Methods Prediction problem; lots of standard classification tools could be tried. Logistic regression Classification trees Cardiff 11/06/04 p.10/26

42 Methods Prediction problem; lots of standard classification tools could be tried. Logistic regression Classification trees Support vector machines Cardiff 11/06/04 p.10/26

43 Multivariate Adaptive Regression Splines Extension of logistic regression, with logit(p) = η where K η = β 0 + β k B k (x) k=1 with basis functions B k (x) defined as B k (x) = J k j=1 [ skj (x wkj t kj) ] + k = 1,..., K Cardiff 11/06/04 p.11/26

44 Multivariate Adaptive Regression Splines Extension of logistic regression, with logit(p) = η where K η = β 0 + β k B k (x) k=1 with basis functions B k (x) defined as B k (x) = J k j=1 [ skj (x wkj t kj) ] + k = 1,..., K where J k is the degree of interaction, Cardiff 11/06/04 p.11/26

45 Multivariate Adaptive Regression Splines Extension of logistic regression, with logit(p) = η where K η = β 0 + β k B k (x) k=1 with basis functions B k (x) defined as B k (x) = J k j=1 [s kj (x wkj t kj )] + k = 1,..., K where J k is the degree of interaction, [ ] + = max[0, ] +, Cardiff 11/06/04 p.11/26

46 Multivariate Adaptive Regression Splines Extension of logistic regression, with logit(p) = η where The linear predictor of a MARS model is η = β 0 + K β k B k (x) k=1 with basis functions B k (x) defined as B k (x) = J k j=1 [s kj (x wkj t kj )] + k = 1,..., K where J k is the degree of interaction, [ ] + = max[0, ] +, s kj {±1}, Cardiff 11/06/04 p.11/26

47 Multivariate Adaptive Regression Splines Extension of logistic regression, with logit(p) = η where η = β 0 + K β k B k (x) k=1 with basis functions B k (x) defined as B k (x) = J k j=1 [s kj (x wkj t kj )] + k = 1,..., K where J k is the degree of interaction, [ ] + = max[0, ] +, s kj {±1}, w kj indexes the predictor included Cardiff 11/06/04 p.11/26

48 Multivariate Adaptive Regression Splines Extension of logistic regression, with logit(p) = η where η = β 0 + K β k B k (x) k=1 with basis functions B k (x) defined as B k (x) = J k j=1 [s kj (x wkj t kj )] + k = 1,..., K where J k is the degree of interaction, [ ] + = max[0, ] +, s kj {±1}, w kj indexes the predictor included and t kj are knot points. Cardiff 11/06/04 p.11/26

49 Multivariate Adaptive Regression Splines Basis function [t x] + [x t] t x Example of MARS basis functions Cardiff 11/06/04 p.12/26

50 Methods However, note we have multiple observations on each site: therefore clustered data. Cardiff 11/06/04 p.13/26

51 Methods However, note we have multiple observations on each site: therefore clustered data. Need a method that allows for this. Cardiff 11/06/04 p.13/26

52 Methods However, note we have multiple observations on each site: therefore clustered data. Need a method that allows for this. We extend the Bayesian Multivariate Adaptive Regression Spline model (Holmes and Denison, 2003) to clustered training data. Cardiff 11/06/04 p.13/26

53 Methods However, note we have multiple observations on each site: therefore clustered data. Need a method that allows for this. We extend the Bayesian Multivariate Adaptive Regression Spline model (Holmes and Denison, 2003) to clustered training data. Accounts for clustering in the data Cardiff 11/06/04 p.13/26

54 Methods However, note we have multiple observations on each site: therefore clustered data. Need a method that allows for this. We extend the Bayesian Multivariate Adaptive Regression Spline model (Holmes and Denison, 2003) to clustered training data. Accounts for clustering in the data Deals with potentially nonlinear effects of predictors Cardiff 11/06/04 p.13/26

55 Methods However, note we have multiple observations on each site: therefore clustered data. Need a method that allows for this. We extend the Bayesian Multivariate Adaptive Regression Spline model (Holmes and Denison, 2003) to clustered training data. Accounts for clustering in the data Deals with potentially nonlinear effects of predictors Uses Bayesian model averaging to make predictions Cardiff 11/06/04 p.13/26

56 Methods However, note we have multiple observations on each site: therefore clustered data. Need a method that allows for this. We extend the Bayesian Multivariate Adaptive Regression Spline model (Holmes and Denison, 2003) to clustered training data. Accounts for clustering in the data Deals with potentially nonlinear effects of predictors Uses Bayesian model averaging to make predictions Fitted via MCMC Cardiff 11/06/04 p.13/26

57 Hierarchical BMARS For amino acid site i and mutation m assume ( ) K p(y im = 1 β, x im, b i ) = Φ β 0 + β k B k (x im ) + b i = Φ(η im +b i ) with b i N(0, σ 2 ). k=1 Cardiff 11/06/04 p.14/26

58 Hierarchical BMARS For amino acid site i and mutation m assume p(y im = 1 β, x im, b i ) = Φ with b i N(0, σ 2 ). ( β 0 + ) K β k B k (x im ) + b i = Φ(η im +b i ) k=1 Probit link for technical reasons allows us to work out full conditionals for regression parameters and hence we can use gibbs sampling to update these Cardiff 11/06/04 p.14/26

59 Hierarchical BMARS For amino acid site i and mutation m assume p(y im = 1 β, x im, b i ) = Φ with b i N(0, σ 2 ). ( β 0 + ) K β k B k (x im ) + b i = Φ(η im +b i ) k=1 Probit link for technical reasons allows us to work out full conditionals for regression parameters and hence we can use gibbs sampling to update these Reversible jump MCMC to add, delete or modify a basis function at each iteration. Cardiff 11/06/04 p.14/26

60 Application to mutagenesis data Fitted values given by [ 1 ŷ new = I N ] N Φ(B (t) (x new )β (t) ) > α. t=1 Cardiff 11/06/04 p.15/26

61 Application to mutagenesis data Fitted values given by [ 1 ŷ new = I N ] N Φ(B (t) (x new )β (t) ) > α. t=1 Predictive performance assessed by calculating area under ROC curves, where ROC curves plot sensitivity versus specificity as α varies in [0, 1]. Cardiff 11/06/04 p.15/26

62 Application to mutagenesis data Fitted values given by [ 1 ŷ new = I N ] N Φ(B (t) (x new )β (t) ) > α. t=1 Predictive performance assessed by calculating area under ROC curves, where ROC curves plot sensitivity versus specificity as α varies in [0, 1]. sensitivity: proportion of mutations affecting function correctly classified Cardiff 11/06/04 p.15/26

63 Application to mutagenesis data Fitted values given by [ 1 ŷ new = I N ] N Φ(B (t) (x new )β (t) ) > α. t=1 Predictive performance assessed by calculating area under ROC curves, where ROC curves plot sensitivity versus specificity as α varies in [0, 1]. sensitivity: proportion of mutations affecting function correctly classified specificity: proportion of mutations not affecting function correctly classified Cardiff 11/06/04 p.15/26

64 Application to mutagenesis data Train on Lac repressor/ Test on Lysozyme Sensitivity Specificity H-BMARS (0.82) - - BMARS (0.79) MARS (0.78) - - SVM (0.76) Tree (0.77) Cardiff 11/06/04 p.16/26

65 Application to mutagenesis data Train on Lysozyme/ Test on Lac repressor Sensitivity Specificity H-BMARS (0.83) - - BMARS (0.80) MARS (0.79) - - SVM (0.73) Tree (0.75) Cardiff 11/06/04 p.17/26

66 Application to mutagenesis data: SVM caveat SVM used the radial kernel. Cardiff 11/06/04 p.18/26

67 Application to mutagenesis data: SVM caveat SVM used the radial kernel. Need to tune this to control the smoothness of the decision boundary: 5 fold CV used. Cardiff 11/06/04 p.18/26

68 Application to mutagenesis data: SVM caveat SVM used the radial kernel. Need to tune this to control the smoothness of the decision boundary: 5 fold CV used. CV misclassification surface is rather flat: taking minimum gives poor performance on test data Cardiff 11/06/04 p.18/26

69 Application to mutagenesis data: SVM caveat SVM used the radial kernel. Need to tune this to control the smoothness of the decision boundary: 5 fold CV used. CV misclassification surface is rather flat: taking minimum gives poor performance on test data Current analysis uses the smoothest decision boundary leading to acceptable CV misclassification rate Cardiff 11/06/04 p.18/26

70 Application to mutagenesis data H-BMARS BMARS K K Posterior distribution of the number of basis functions Cardiff 11/06/04 p.19/26

71 Application to mutagenesis data Nbhd Rel B f Access Unusual AA Rel Access Helix Buried Unus by class Nbhd Rel Entr Interface Rel B f Turn Rel Entropy Conserved Near Cons Relative importance of predictors in the generated sample Cardiff 11/06/04 p.20/26

72 Application to mutagenesis data Access Rel Access Rel Entropy Nbhd Rel Entr Rel B f Nbhd Rel B f Unusual AA Unus by class Buried Turn Helix Conserved Near Cons Interface Interface Near Cons Conserved Helix Turn Buried Unus by class Unusual AA Nbhd Rel B f Rel B f Nbhd Rel Entr Rel Entropy Rel Access Access Interaction terms in the generated sample Cardiff 11/06/04 p.21/26

73 Application to mutagenesis data The posterior main effect of generic predictor p may be quantified as Ê[Φ p (x)] = 1 L L l=1 k:j k =1 w 1k =p β (l) k B(l) k (x) from a posterior sample of size L. Cardiff 11/06/04 p.22/26

74 Application to mutagenesis data E^(η j ) Nbhd Rel BF Entropy Posterior mean main effects (H-BMARS) Cardiff 11/06/04 p.23/26

75 Application to mutagenesis data E(eta) Rel Entropy Nbhd Rel BF Posterior mean interaction between Entropy and Nbhd Rel BF Cardiff 11/06/04 p.24/26

76 Summary Hierarchical BMARS achieves better out-of-sample specificity and sensitivity than competing methods here. Cardiff 11/06/04 p.25/26

77 Summary Hierarchical BMARS achieves better out-of-sample specificity and sensitivity than competing methods here. Heterogeneous misclassification rates of less than 20% compared to 27 35% reported by other authors using the same data. Cardiff 11/06/04 p.25/26

78 Summary Hierarchical BMARS achieves better out-of-sample specificity and sensitivity than competing methods here. Heterogeneous misclassification rates of less than 20% compared to 27 35% reported by other authors using the same data. Acknowledging clustered nature of data leads to enhanced interpretability and improved mixing. Cardiff 11/06/04 p.25/26

79 Summary Hierarchical BMARS achieves better out-of-sample specificity and sensitivity than competing methods here. Heterogeneous misclassification rates of less than 20% compared to 27 35% reported by other authors using the same data. Acknowledging clustered nature of data leads to enhanced interpretability and improved mixing. Allowing higher degree of interaction gave very similar results Cardiff 11/06/04 p.25/26

80 Summary Hierarchical BMARS achieves better out-of-sample specificity and sensitivity than competing methods here. Heterogeneous misclassification rates of less than 20% compared to 27 35% reported by other authors using the same data. Acknowledging clustered nature of data leads to enhanced interpretability and improved mixing. Allowing higher degree of interaction gave very similar results Solvent accessibility and molecular rigidity (B-factor) are good predictors of functionality. Cardiff 11/06/04 p.25/26

81 Summary Hierarchical BMARS achieves better out-of-sample specificity and sensitivity than competing methods here. Heterogeneous misclassification rates of less than 20% compared to 27 35% reported by other authors using the same data. Acknowledging clustered nature of data leads to enhanced interpretability and improved mixing. Allowing higher degree of interaction gave very similar results Solvent accessibility and molecular rigidity (B-factor) are good predictors of functionality. Code available as R package. Cardiff 11/06/04 p.25/26

82 Discussion Improved performance more likely to come from improved model inputs than more sophisticated modelling Cardiff 11/06/04 p.26/26

83 Discussion Improved performance more likely to come from improved model inputs than more sophisticated modelling In particular, need more mutant AA-specific covariates (as opposed to cluster-specific) Cardiff 11/06/04 p.26/26

84 Discussion Improved performance more likely to come from improved model inputs than more sophisticated modelling In particular, need more mutant AA-specific covariates (as opposed to cluster-specific) Lysozyme very different to Lac repressor: consider prediction in specific protein families? Cardiff 11/06/04 p.26/26

85 Discussion Improved performance more likely to come from improved model inputs than more sophisticated modelling In particular, need more mutant AA-specific covariates (as opposed to cluster-specific) Lysozyme very different to Lac repressor: consider prediction in specific protein families? How useful is this? Cardiff 11/06/04 p.26/26

86 Discussion Improved performance more likely to come from improved model inputs than more sophisticated modelling In particular, need more mutant AA-specific covariates (as opposed to cluster-specific) Lysozyme very different to Lac repressor: consider prediction in specific protein families? How useful is this? Predictions will be weak in general Cardiff 11/06/04 p.26/26

87 Discussion Improved performance more likely to come from improved model inputs than more sophisticated modelling In particular, need more mutant AA-specific covariates (as opposed to cluster-specific) Lysozyme very different to Lac repressor: consider prediction in specific protein families? How useful is this? Predictions will be weak in general But functional biology is hard and expensive Cardiff 11/06/04 p.26/26

88 Discussion Improved performance more likely to come from improved model inputs than more sophisticated modelling In particular, need more mutant AA-specific covariates (as opposed to cluster-specific) Lysozyme very different to Lac repressor: consider prediction in specific protein families? How useful is this? Predictions will be weak in general But functional biology is hard and expensive Will nssnps disrupting function be interesting for genetic epidemiology/pharmacogenetics? Cardiff 11/06/04 p.26/26