Detecting amino acid sites under positive. selection and purifying selection

Size: px
Start display at page:

Download "Detecting amino acid sites under positive. selection and purifying selection"

Transcription

1 Genetics: Published Articles Ahead of Print, published on January 16, 2005 as /genetics Detecting amino acid sites under positive selection and purifying selection Tim Massingham 1 2 Nick Goldman 3 European Bioinformatics Institute Wellcome Trust Genome Campus Hinxton, Cambridgeshire, UK. November 18, Author for correspondence 2 tim.massingham@ebi.ac.uk 3 goldman@ebi.ac.uk

2 Abstract An excess of nonsynonymous over synonymous substitution at individual amino acid sites is an important indicator that positive selection has affected the evolution of a protein between the extant sequences under study and their most recent common ancestor. Several methods exist to detect the presence, and sometimes location, of positively selected sites in alignments of protein coding sequences. This paper describes the Sitewise Likelihood Ratio (SLR) method for detecting non-neutral evolution, a statistical test that can identify sites that are unusually conserved as well as those that are unusually variable. We show that the SLR method can be more powerful than currently published methods for detecting the location of positive selection, especially in difficult cases where the strength of selection is low. The increase in power is achieved whilst relaxing assumptions about how the strength of selection varies over sites and without elevated rates of false positive results that have been reported with some other methods. We also show that the SLR method performs well even under circumstances where the results from some previous methods can be misleading. Keywords: conservation, likelihood ratio tests, positive selection, purifying selection, synonymous / nonsynonymous substitution Running head: Detecting selection

3 INTRODUCTION Analyzing the instantaneous rate of nonsynonymous (amino acid changing) and synonymous (silent) nucleotide substitutions in protein-coding molecular sequences can give important clues to understanding how they evolved. In particular, the ratio of the rates of nonsynonymous and synonymous fixation has been used to measure the level of selective pressure on proteins (McDonald and Kreitman 1991, for example). Synonymous mutations do not change the encoded protein and so are often assumed to be selectively neutral. If a nonsynonymous mutation does not affect the fitness of a protein, it would become fixed within the population at the same rate as a synonymous mutation, giving a nonsynonymous / synonymous rate ratio ( ) of 1. If a nonsynonymous change makes the protein more or less fit on average then will be greater or less than 1, respectively. A significant excess of nonsynonymous over synonymous substitution has been used as evidence for adaptive evolution (for a review, see Yang and Bielawski 2000). Several methods have been proposed to detect the presence of positive selection in an aligned set of homologous protein sequences, from counting the observed number of synonymous and nonsynonymous differences between pairs of sequences (e.g., Li et al. 1985; Nei and Gojobori 1986; McDonald and Kreitman 1991) to more sophisticated techniques that allow for the phylogenetic relationship between many sequences and permit a statistical test of the result (Nielsen and Yang 1998; Suzuki and Gojobori 1999). These last two methods extract more information by considering the evolutionary relationships between the sequences 1

4 analyzed but differ markedly in their approaches. If a site has, it is unusually variable and is said to have evolved under positive selection. Similarly, a site with is unusually conserved and is said to have been subject to purifying selection. We will describe methods based on the discovery of such sites as tests for detecting the location of selection. The Suzuki and Gojobori (SG) method is such a test, assessing each site of an alignment separately for deviations from neutrality. We refer the reader to the original paper for a more detailed description, but in essence the SG method reconstructs ancestral sequences, counts the number of implied synonymous and nonsynonymous changes and then tests the result for deviation from neutrality. The SG method largely ignores the uncertainty in the ancestral reconstruction and in the evolutionary path taken between ancestral and extant sequences, and failing to account for this extra variability may adversely affect the performance of the test. In contrast, the Nielsen and Yang (NY) method uses the entire sequence to detect whether any part of it has undergone positive selection, allowing information from all sites to be used to estimate those quantities common to all sites (e.g., evolutionary distances) more accurately. We describe such methods as tests for detecting the presence of positive selection. The NY method describes variation in the level of selection along the sequence as if each site were a random draw from some distribution. It is not clear what distribution appropriately describes how varies along the sequence and Yang et al. (2000a) investigated the relative performance of several different parametric families. By using maximum likelihood (ML) methods to compare the performance of the chosen distribution with a suitable null model which does not allow for positive selection, a likelihood ratio test (LRT) for the presence of positive selection can be performed (Anisimova 2

5 et al. 2001). The NY method can also be used with a post hoc empirical Bayesian analysis to perform a test of the location of positive selection (Nielsen and Yang 1998). This may be done with or without a prior LRT, although we provide evidence below that empirical Bayes analysis of data where there is not a significant indication of the presence of positive selection can result in unacceptably high rates of false positive results. Simulation studies have shown that some variants of the NY method permit unexpectedly high rates of false positive results when analyzing sequences which, in truth, contain large proportions of neutrally, or almost neutrally, evolving sites (Anisimova et al. 2002; Suzuki and Nei 2002). Swanson et al. (2003) argue that, with sufficient data, applying some variants of the NY method to a data consisting of a mixture of unusually conserved and strictly neutral sites may falsely detect the presence of positive selection 50% of the time regardless of how the nominal size of the test is chosen. These elevated high rates of false positives are a consequence of the distributions used to model the variation in and not all choices are affected equally; notably the variant proposed by Swanson et al. (2003) (henceforth denoted SNY) does not have this unfortunate property when testing for the presence of positive selection, although in this paper we show that the method still has some undesirable properties when used in conjunction with empirical Bayes techniques to detect the location of positively selected sites. This paper introduces a test for non-neutral selection that combines the statistical foundation and more realistic models of substitution used by the NY method with the sitewise testing approach used by the SG method. This new Sitewise Likelihood Ratio (SLR) method tests each site individually for neutrality but uses the entire alignment to determine quantities common to all sites, such as evolu- 3

6 tionary distances. The SLR method is thus a test for the location of selection, devised from the best features of existing phylogeny-based tests for both the presence and location of positive selection. It can be used to detect either positive or purifying selection, and can form the basis of a test for the presence of selection; we discuss both these points further below. Treating many parameters as common to all sites allows the SLR method to extract more information from the data without having to make strong assumptions about how selection varies along the sequence. The sitewise nature of the SLR test means that there is no need to specify a model of how varies along the sequence, and so is not susceptible to the elevated rates of false positive results that some variants of the NY method permit. The weaker assumptions underlying the SLR method mean that it is more generally applicable to real data and more robust to potential errors when data violate the models of variation that the NY method assumes. The behavior of the SLR method is studied using data simulated under a variety of conditions similar to those previously used to investigate the strengths and weaknesses of the NY method (Anisimova et al. 2001; Suzuki and Nei 2002) and for two real data sets that provide interesting case studies. In the simulations, data simulating both strictly neutral evolution and evolution with a proportion of sites under positive selection were used. Looking at the performance of the method on neutrally evolving data allows the actual size of the test to be checked, confirming that the rate of false positives is controllable and that the test is neither liberal nor unduly conservative. Simulated data containing sites evolving under positive selection are used to compare the power (ability to detect positive selection) of the SLR and SNY tests. In particular, detecting sites under weak positive selec- 4

7 tion is extremely difficult and so, while such sites may not be of the most interest when analyzing real data, comparing the ability to find such sites is a good way to discriminate between different methods. THEORY AND METHODS Substitution model The SLR method is based on the same probabilistic model of sitewise evolution as the NY method, which assumes that substitutions at a given codon site occur independently of every other site and that the process can be modeled as a continuous-time Markov process. This model of substitution may be thought of as a two-stage process, describing background mutation and subsequent selection within a population. At each codon position, instantaneous nucleotide mutations are assumed to occur in a fashion similar to the HKY85 model of evolution (Hasegawa et al. 1985). These mutations then become fixed in the population with some probability: 0 if the mutation involves the creation of a stop codon, if the mutation is synonymous, or if the mutation is nonsynonymous. Splitting the substitution process into these two stages gives a biological interpretation to the parameterization of the model of codon substitution proposed by Muse and Gaut (1994): their and parameters are proportional to the probabilities of fixation given that a synonymous or nonsynonymous mutation has occurred, respectively. Splitting the process in this manner also highlights that the model assumes that 5

8 ! " all observed mutations are fixed within the population. Consequently, care should be taken when applying the methods in situations where this may not be true: for example, if the data contain SNPs. The rate matrix for the Markov chain therefore has entries ( (# a stop codon or $&%'# not a single nucleotide mutation) ($)%'# a synonymous transversion) ($)%'# a nonsynonymous transversion) ( * ( ($)%'# a synonymous transition) * ( ($)%'# a nonsynonymous transition) where is the instantaneous rate of substitution from codon $ to codon #, * is the transition/transversion rate ratio, + is the equilibrium frequency of codon #, and is a parameter confounded with time that describes the rate at which mutations occur. By setting, -/.0 (, constant over all codons, the model M0 described by Yang et al. (2000a) (see also Goldman and Yang 1994) is recovered. This is equivalent to assuming that every site has been subject to the same strength of selection. Alternatively, the NY method describes variation in the strength of selection along the sequence as if the value of at each site was independently drawn from some distribution, with all of the other parameters common to all sites. By using distributions in this fashion, the variation in the strength of selection can be described using only a few parameters, although its actual value at a particular site remains unknown. In contrast, the SLR method models each site separately with taking the value ) at site $, using more parameters but directly estimating the strength at each site and making no assumptions about the overall distribution. 6

9 For the NY method, the quantity is chosen so that, in unit time, the expected number of nucleotide substitutions per codon site is 1. Models of evolution which allow for some distribution of values of along the sequence have a consequent variation in the rate of evolution along the sequence; these models set so that the average rate across all sites is 1, even though each individual site may not be evolving at this rate (Nielsen and Yang 1998; Yang et al. 2000a). In contrast, for the SLR method the rate of evolution could potentially be different at every site as the level of selection changes and it is appropriate to choose so that one nucleotide change is expected per unit time for neutrally evolving data. This means that sites under purifying selection (with (, or )1 2 ) appear to evolve more slowly than neutral sites ( 3&4 ), which in turn evolve slower than positively selected sites ( &5 6 ), agreeing with biological intuition. The SLR test consists of performing a likelihood-ratio test on a sitewise basis, testing the null model (neutrality, & ) against an alternative model ( &87 ; or see below for alternatives). While revising this paper we were made aware of the maximum likelihood method for detecting positive selection of Suzuki (2004), which shares many similarities with the SLR method. The main differences between the methods are in how the common parameters are estimated, although it is not clear what process Suzuki recommends. Suzuki (2004) does not analyze the size and power of the test presented but the results in this paper will be largely transferable if similar methods to estimate common parameters are used. 7

10 Parameter estimation and likelihood calculation All inferences and tests are performed using likelihood calculations and optimizations standard in phylogenetics (see Felsenstein 2003, for example). In the SLR method, the tree topology, branch lengths, equilibrium codon frequencies and the transition / transversion rate ratio will be considered to be common to all sites in an alignment, and so are estimated using information from every site. The null model for the SLR test at site $ is that &9: while all other parameters, including the strength of selection at other sites ( ; for all # 7 $ ), are free to vary. Similar techniques to those described in Yang (1996) can be used to hold appropriate parameters common to all sites while allowing others to vary on a site-by-site basis, the contribution of each site toward the log-likelihood being calculated using the familiar pruning algorithm (Felsenstein 2003). The alternative model for site $ allows all the parameters to vary freely (still including the strength of selection at other sites). Consequently, the alternative model parameter estimates and maximum log-likelihood are identical at every site $, and one high-dimensional optimization (involving all the common parameters and one parameter 5 for each site $ ) has to be performed to find them. Maximizing the log-likelihood of the null model requires a similar high-dimensional optimization at each site and so may require considerable computing resources. Instead of attempting to perform these optimizations, two approximations are made: (a) the estimates of common parameters by M0 are unbiased and consistent, and (b) the effect of a single site on the ML values of the common parameters is negligible. The first approximation allows the common parameters to be estimated using M0 and, in conjunction with the second approximation, held fixed. 8

11 Given fixed common parameters, the distribution at each site is independent of = ) can be every other site and so the ML estimate for selective pressure at site $ ( < found via a one-dimensional optimization. In all, these approximations mean that the two high-dimensional optimizations required to find the maximum likelihood estimates of all parameters can be reduced to a comparatively low-dimensional optimization to estimate the common parameters and a one-dimensional optimization at each site to estimate the strength of selection. There is some evidence (Yang 2000) that branch lengths are under-estimated when applying M0 to data with rate variation, and so approximation (a) may not be realistic. Reanalysing Yang s small data set, we find the correlation between synonymous branch length estimates under M8 (dnm8) and those under M0 (dsm0) to be dsm8 = * dsm e-07 with an >9? value of The under-estimation is significant and consistent across branches, which may cause an increase in the number of false positives reported by the SLR method. The simulations presented in this paper take the under-estimation into account and it is not found to make an appreciable difference to the false positive rate. If the number of sites is large, the contribution from any one site towards to common parameters will be swamped by that from all the others and so approximation (b) is reasonable. In all cases, the codon frequencies + were estimated from the empirical counts in the observed data rather than using the procedure outlined above. 9

12 SLR test and distribution of test statistic The sitewise LRT statistic for non-neutral evolution at site is twice the difference in log-likelihood between the null and alternative models, individually maximized. Approximation (b), above, means is in fact simply twice the difference between the contributions of site $ to the log-likelihoods under the null and alternative models. Writing A CBD 5FE for the contribution of site $ to the loglikelihood given common parameter values and selective pressure =, then A CBC GE and A HBJI 5KE are the contributions toward the maximum log-likelihood of the null and alternative models, respectively, and Note RQ MLB A HBJI 5DE&N A HBH GEOEP, necessarily. Treating all parameters except 3 as fixed, the usual asymptotic theory (e.g., Garthwaite et al. 2002) states should be compared to a S5?T distribution for a statistical test of neutrality ( 3 ). The alternative hypothesis is that the site has evolved under purifying or positive selection ( 5U7 V ). Often only one of two possible deviations from neutrality may be of interest (for example, detecting positively selected sites), and the power of the test can be improved by considering the likelihood ratio analog of a one-tailed test. By placing a boundary at )8W, and finding the maximum likelihood over 3XQY (positive selection) or &5Z (purifying selection), only one of the possible alternatives is considered by the test. When the null hypothesis occurs at a boundary of the alternative hypothesis like this, the likelihood ratio test statistic is asymptotically distributed as an equal mixture of a point mass on 0 and a S?T distribution (Self and Liang 1987). For this distribution, -values can easily be calculated 10

13 by halving those obtained from using a S?T distribution. Following Goldman and Whelan (2000), who provide a table of relevant critical values, this mixture distribution will be denoted S [?T. As with most parametric tests, the asymptotic distribution of the test statistic is only an approximation to its actual distribution for a finite number of observations and so -values calculated using this asymptote may not be strictly correct. An alternative to approximating the test distribution by its asymptote is to estimate it by a parametric bootstrap, as described by Goldman (1993). Under the null hypothesis )V, with all the other model parameters at their previously estimated values, the distribution of all possible observations is completely determined and pseudo-replicates of the data can be generated by Monte Carlo simulation. The observed values of the test statistic for these replicates form an estimate of its actual distribution, from which -values can be derived. In the SLR test, the null model is the same at every site and therefore the bootstrap estimate of the distribution only needs to be made once. The bootstrap estimate is dependent on the tree topology, and other estimated parameters, so it is not valid to reuse it to analyze different data a new bootstrap estimate must be made. SNY test The variant of the NY method introduced in Swanson et al. (2003) describes the distribution of selective pressure along the sequence as a mixture of a two parameter beta distribution and a point mass. The beta distribution describes the variation in conservation at sites under purifying selection and the point mass describes neutrally evolving or positively selected sites. The point mass is fixed at 11

14 neutral ( V' ) under the null model (model M8A of Swanson et al. 2003; see also Wong et al. 2004) and the presence of positive selection is tested for by performing a LRT of \V against ]Q^ (a restricted variant of model M8 [ M8B ]: Yang et al. 2000a; Swanson et al. 2003; Wong et al. 2004), comparing the LRT statistic to a S5?T [ distribution. This test does not conform to typical statistical procedure since (a) the parameter of interest is on a boundary under the null model, (b) the parameters involved in specifying the shape of the beta distribution can disappear when all the probability of the mixture distribution is placed on the point mass and (c) the parameter specifying the position of the point mass can disappear when all the probability is placed on the beta distribution. The consequences for performing LRTs are considered below. The NY method can infer the location of positively selected sites by using empirical Bayes techniques (Nielsen and Yang 1998). We will write simply the SNY test to denote the SNY test for the location of positive selection using the model M8B with the NY method, estimating parameters by ML and inferring location by using empirical Bayes with some specified cutoff. SNY+LRT_ will explicitly denote the use of the SNY test only if a prior LRT (with significance level `a ) between M8A and M8B indicates significant positive selection. Simulations The probabilistic model which underlies both the SNY and SLR methods means that sample data sets can be simulated under known conditions; for example, with 5% of sites subject to a given level of positive selection. Neutrally evolving data can be generated to check whether there is any significant difference between the 12

15 actual distribution of a test statistic and its asymptote, and so ensure that critical points obtained from the asymptotic approximation are accurate and the size of the test (probability of type I error or false positive results) is properly controlled. If data are generated with sites under purifying or positive selection, the location of these sites in the sequence is known and so the number of the sites a method correctly detects can be determined. Multiple methods can be compared by looking at the number of sites each of them correctly (or incorrectly) detects when analyzing the same data. In this paper, the situation we study in detail is that of trying to detect positive selection since we expect this variant of the SLR test to be of most interest. The power of two methods can only be fairly compared if their respective probabilities of a false positive result are the same. This poses a problem when comparing the SNY and SLR tests since the SNY test produces a Bayesian posterior probability of positive selection rather than a -value, and the SLR produces a -value assuming strict neutrality (the hardest case to distinguish from positive selection) and so is an upper bound on the true probability that the result is a false positive. For this reason, Receiver Operator Characteristic (ROC) curves plots of the number of correctly identified sites against the number of false positives as the cutoff value for the test is lowered are useful. These curves allow the power of the methods to be compared as if the probability of a false positive could be chosen perfectly, a situation which is not possible for either test. 13

16 Testing for the presence of positive selection As multiple tests will generally have been performed (e.g., one at each codon site), the detection by SLR of one (or more) sites under positive selection may not be sufficient evidence for inferring the presence of positive selection in the sequence as a whole. The statistics must be adjusted for the number of tests performed, using standard techniques for multiple comparisons (Hsu 1996). Application of corrections for multiple comparisons to the SG method have been investigated by Wong et al. (2004) and these tests are equally applicable to the SLR method. Standard corrections for multiple comparisons are likely to lead to conservative tests in the SLR method since they assume that all sites have the same probability of falsely indicating positive selection as a neutral site whereas in reality many sites are likely to be under purifying selection and therefore have a lower probability of giving a false positive result. In addition, the question of interest is whether there are sites present under positive selection, rather than identifying particular sites, which is a weaker criterion than assumed by most multiple comparison corrections. An unusually large number of sites with some evidence for positive selection could be just as convincing as one site with overwhelming evidence. Sites which only exist in a single species, perhaps as a result of a recent insertion, contain no information about the substitution process and are defined to have 5b. Such sites cannot be detected as being under positive selection and should not count toward the number of tests performed for the purposes of multiple comparisons corrections. 14

17 RESULTS Distribution of SLR test statistic We present two simulations representative of the range of results we have observed in a greater number of experiments. In the first, 25 alignments of 12 sequences, each 200 codons long, were generated under neutral evolution with *cdl and codon frequencies taken from a set of 25 abalone species sperm lysin sequences (Yang et al. 2000b; data also distributed with Yang 1997) on the tree of figure 1(A) using the program evolver from the PAML package (Yang 1997). The generated alignments were then analyzed on the correct tree using the SLR method, all common parameters estimated using only data from each alignment. The 5000 sitewise test statistics form a Monte Carlo estimate of their test distribution, allowing for variation in the estimates of common parameters, which was compared to the S [?T distribution using a chi-squared goodness-of-fit test (Sokal and Rohlf 1995) and is shown in figure 2. The -value of the fit was 0.24 and we conclude that, for all practical purposes, there is no difference between the actual distribution of the test statistic and its asymptote in this case. In the second experiment, the analysis was again performed using the parameter values derived from the sperm lysin alignment but now the tree shown in figure 1(B) was used. This tree is short and has few sequences, and hence each site contains relatively little information about its evolution. The distribution of the test statistic is also shown in figure 2, and the -value for the fit is e, "gfih, indicating a significant difference between the actual and asymptotic distributions. 15

18 a However, the 95% and 99% critical values of the asymptotic S [?T distribution, 2.71 and 5.41, correspond to the 95.5% and 99.1% points of the parametric bootstrap distribution, respectively, and so there is little difference in the tail of the distribution. This is the region that is important for all the tests we describe here, and so for the purposes of hypothesis testing in this paper the asymptotic critical points will be taken as correct. This similarity between the simulated data and theory agrees with the results of Knudsen and Miyamoto (2001) for a similar test to detect sitewise changes in evolutionary rate. They also observed that the fit appeared especially good in the tail of the distribution. False positives and size of SLR test Table 1 summarizes the performance of the SLR and SNY tests on neutrally evolving sequences ( )jk for all sites), the most difficult case to distinguish from positive selection for either method. One hundred sets of data, each consisting of 300 codons, were simulated under a neutral model of evolution on the tree shown in figure 1C of Suzuki and Nei (2002) using the same branch lengths (all 0.1). Suzuki and Nei (2002) claim this tree to be a difficult case, in which they report the NY method can give high rates of false positives. As shown in table 1, the size for the SLR test is approximately correct for these data. This contrasts markedly with the results for the SNY test used without first performing a LRT for the presence of positive selection, which show a false positive rate ml " even when the empirical Bayes posterior probability cutoff is Once the LRT is carried out, the size of the SNY test is reduced to more 16

19 appropriate values; we note, however, that the size is affected more by the significance level chosen for the LRT than by the posterior probability cutoff chosen for the empirical Bayes analysis. For the analysis of strictly neutral data, the cutoff for empirical Bayes analysis provides no meaningful control over the rate of positive results. As also suggested by Wong et al. (2004), these results strongly support the requirement to confirm that there is significant evidence for the presence of positive selection before using empirical Bayes techniques to determine its location. Even though the overall false positive rate is reasonable for the SNY+LRT_ tests, consideration of the range of the number of false positive results per data set (from 0% to 100% of sites falsely identified) shows that the behavior of the method is still extreme: when one site is wrongly inferred to be under positive selection, many more false positives are also likely to occur and there is an incorrect inference of apparently pervasive positive selection. Power of SLR test The simulations have shown that the size of the SLR method is properly controlled and that it is reasonable to approximate critical points from the asymptotic S [?T distribution of the test statistic. Having established that the method is well-behaved, it is meaningful to investigate its power. The power of the SLR method was compared to that of the SNY test in three situations which differ in the distribution from which the value of at each site is drawn. These distributions were: (A) the true model for the SNY test, M8B with " P "onqpsr taking the value,, tlip "sr, " P nsu L, and tlipv u L ; (B) a mixture distribution " P n with probability 0.75 or else \V sp n ; and (C) a distribu- 17

20 tion derived empirically by fitting a large number of categories to an alignment of D-mannose binding lectin sequences (Pandit accession number PF01453), seven of which had non-zero weight and one of which was slightly positively selected ( w qp u L, " P " lsx ). All other parameters took the same values as in the simulations described above. Cases (A) and (C) represent realistic examples of positive selection; in case (A) the alternative hypothesis model of the SNY test is true, giving that test a particularly good chance of succeeding. Case (B) was selected to be a difficult problem for both methods. For each model of variation, data sets 200 codons long were generated on the tree shown in figure 1(B). The short length of this tree, along with the limited number of sequences, means there is little information per site for a method to detect positive selection and so it is a good example for comparing the power of methods. To reflect realistic analysis of data and reduce the false positive rate for the SNY test, data sets were simulated until there were 100 which passed the LRT test for the presence of positive selection at the 95% level. Only these 100 sets were analyzed; in total 229, 456 and 931 sets of data had to be generated to achieve the required number of passes for cases (A), (B) and (C), respectively. The ROC curves in figure 3 show how the number of true positive and false positive results change for each method as their respective cutoff points are varied. In all three cases, the SLR test dominates the SNY+LRTy{z test and so is the more powerful discriminator: for any given level of false positives, the SLR method correctly identifies more sites evolving under positive selection. In all cases the nominal 95% and 99% cutoffs are conservative, a result which is expected for the SLR test for reasons similar to those discussed above. (The SNY test does not have to be conservative; here, it probably is so because none of the three 18

21 models contain any almost neutral sites). The simulations were repeated using sets of data 1000 codons long, generating data from tree A in figure 1, or both of these. Using longer sequences makes little difference to the ROC of either method but increasing the number of species does, findings consistent with those of Anisimova et al. (2002). The ROC curves for these additional simulations are in the supplementary material. The curves in figure 3 do not tell the entire story, as the SLR method would have correctly detected sites in the many data sets that were discarded because they did not pass the LRT for the presence of positive selection. For the discarded data sets (56%, 78% and 89% of data sets in cases A, B and C, respectively) the SNY+LRTy{z test has effectively no power, whereas the power of the SLR method does not drop substantially compared to the results shown in figure 3 (see supplementary material) and so overall it is by far the more powerful of the two tests. As the evidence for the presence of positive selection increases, the proportion of data sets for which the SNY+LRTy{z test will detect the presence of positive selection will increase and so this difference in performance between it and the SLR test will decrease. Real examples By way of comparison on real data, we reanalyze two data sets from Yang et al. (2000a) using both the SLR and SNY tests. The vertebrate -globin and HIV- 1 pol data sets (D2 and D7 in the original paper) were analyzed for positively selected sites with both tests, using the alignments and topologies available from ftp://abacus.gene.ucl.ac.uk/pub/yngp2000/. The SNY test only 19

22 gives marginal evidence for the presence of positive selection in -globin, with a LRT statistic of 5.45, in comparison to the HIV-1 pol data for which a value of is obtained (cf. S [?T 95% and 99% critical values of 2.71 and 5.41, respectively). Using criteria consistent with our simulations, there is sufficient evidence of positive selection to carry out empirical Bayes detection of location for both data sets. The results are summarized in table 2. The SLR test finds no sites with significant positive selection in the -globin data, compared to five sites found by the SNY test with a 95% posterior probability cutoff. Correction of the SLR test for multiple comparisons with Hochberg s method (Hsu 1996) gives a -value indistinguishable from 1, again indicating that there is no evidence for the presence of positive selection. For the HIV-1 pol data set, the SNY test finds 13 sites with a posterior probability of 0.95 or greater, of which six have posterior probability Q SLR test finds 22 sites with a -value Z " P "on and 13 with -value Z " P xqx. The " P " ; these latter 13 matched exactly the sites found by the SNY test. After Hochberg s correction for multiple comparisons, two sites have adjusted -values Z " P "on more extreme of these has -value L}~ " f ; the other has -value ~ " f. The This indicates that the SLR test finds highly significant evidence for the presence of positive selection in the HIV-1 pol data, at sites 67 and 347. For this data set, the SLR and SNY tests indicate the presence of significant positive selection and suggest similar locations. These two data sets show that the SNY and SLR tests have broadly comparable performance on real data, with SLR detecting more sites in the HIV-1 pol data set at a nominal cutoff although the actual numbers of false positive results is unknowable. The SLR test provides no evidence for positive selection in -globin 20

23 and the SNY test and associated parameter value estimates for M8B suggest that, should any positive selection exist, it is weak. The results for -globin differ from those obtained by Yang et al. (2000a) for two reasons: the comparison of models M8A and M8B in the SNY test permits a direct test for the presence of positive selection, whereas other comparisons used by Yang et al. (2000a) may not (Swanson et al. 2003), and the codon frequencies were estimated differently (Yang et al. 2000a used four nucleotide frequencies for each of the three codon positions, whereas we use 61 codon frequencies). DISCUSSION This paper has presented a new method to detect both positive and purifying selection. It has been shown that the SLR test can achieve higher power than the SNY test while providing numerous other advantages: the method relaxes some important assumptions about how variation in the level of selection is modeled, the probability of a false positive result from the SLR method is controllable, and the method is better-behaved than the SNY test when analyzing strictly neutral data and does not occasionally give drastically wrong results in such cases. Having to model how the level of selective pressure varies along a sequence has caused a proliferation of variants of the NY method and appears to be the source of the observed problems with the LRT (Anisimova et al. 2001; Suzuki and Nei 2002; Massingham 2003). By making fewer assumptions about how the level of selection varies along the sequence, the SLR method is more generally 21

24 applicable to data and robust to possible errors when assumptions about the distribution of are violated. The SLR test appears to have excellent control over the levels of false positive inference of sites evolving under positive selection, with no evidence of the high rates that have been reported for variants of the NY method. In contrast, table 1 shows that the SNY test can occasionally make large mistakes when analyzing strictly neutral data. When a set of data falsely passes the LRT, empirical Bayes analyses can go badly wrong, often implying that many sites are under positive selection with extremely high confidence apparently strong evidence of pervasive positive selection when none was actually present. We believe that this behavior is due to the empirical Bayes analysis taking the estimated parameter values as true and not allowing for their inherent uncertainty. For example, consider the case where the truth consists of most sites evolving under strictly neutral evolution with a small number evolving with rates as if chosen from a -distribution: even for a large amount of data the estimated location of the point mass in model M8B will place M about 50% of the time. This point mass may have a large weight and under these circumstances an empirical Bayes analysis will indicate positive selection with high confidence. In contrast, a full Bayes analysis (Huelsenbeck and Dyer 2004) would only give confidence of 50%. When analyzing real data, false indications of pervasive positive selection may be differentiated from real ones by looking at the standard errors of the parameter values, or by checking whether the sites identified as being under positive selection are distributed randomly along the sequence. An inference of a high proportion of sites with close to 1 should be cause for extreme caution. 22

25 The observed size of the LRT for the SNY test did not agree with what may have been expected from theory (Self and Liang 1987). The discrepancy may be because the distribution used to generate the data lies on the boundary of parameter space for the null model, making the two parameters describing the beta distribution component of the model inestimable. When parameters cannot be estimated under the null model, even nuisance parameters not directly involved in the test, the regularity conditions necessary for the asymptotic approximations do not hold and there is no reason to expect -values from a S&?T [ distribution to be accurate. In this case, the techniques described in Davies (1987) may be useful in constructing critical values whose nominal -values more accurately reflect the actual size of the test. This discrepancy only arises when the truth is strictly neutral evolution, although bad fit to the asymptotic distribution may also be observed for small to medium sample sizes when the truth is close to strictly neutral (for example, when a large proportion of sites have ]ƒ4 ). Whatever the cause, at the moment the SNY test allied with an empirical Bayesian analysis gives no predictable control over the size of the ensuing test for the location of positive selection. The examples presented in this paper suggest that the power of the SLR test to detect the location of positive selection exceeds that of the SNY test, even when considering only those sets of data with significant evidence of the presence of positive selection. In addition, the SLR method does not suffer from uncontrollable rates of false positive results when analyzing strictly neutrally evolving data and relaxes restrictions about how the variation in the strength of selection is modeled. The apparent paradox of an increase in power accompanying a relaxation of restrictions may be due in part to the ill fit between the assumptions and the data 23

26 (in some examples) and to the failure of the empirical Bayes analysis to take the variability of parameter estimates into account. The SLR method assumes that the background pattern of mutation is constant along the sequences being analyzed and that the differences between the observed rates of evolution at each site are solely due to differences in the strength of selection. For data consisting of alignments of single proteins, the genomic environment of each site is similar, as will be the probability of repair and other factors influencing the rate at which synonymous mutations become fixed within a population. However, if there is expected to be significant variation in mutation rates along the sequence, extra parameters could be added at each site to describe the rate of mutation at that site equivalent to allowing to vary on a site-by-site basis. Modifying the method in this manner might cause a noticeable drop in power as many more parameters would have to be estimated from the same amount of data. Alternatively, rate variation could be incorporated in a manner common to all sites using techniques similar to those described by (Yang 1993, 1994). Lastly, we note that the SLR method is a test of whether a given site has undergone selection or not, and the test statistic summarizes the strength of the evidence for selection rather than the strength of the selection itself. The variance of the estimate of may be very large and values at different sites or from different data sets should not be compared without reference to confidence intervals. Program availability The program used to carry out the SLR test is available on request from TM. 24

27 ACKNOWLEDGMENTS Thanks to Ziheng Yang for his generous distribution of the PAML software package, and to Simon Whelan for useful discussions. The authors are grateful to Rasmus Nielsen, Wendy Wong and Ziheng Yang for access to and discussion of unpublished data, and acknowledge their priority in discovering that the actual size of the test for the Swanson modification does not always agree closely with the theory from Self and Liang (1987); this result was be published separately as part of Wong et al. (2004). TM s work was funded by a Post-doctoral Fellowship from the European Bioinformatics Institute of the European Molecular Biology Laboratory. NG is supported by a Wellcome Trust Senior Fellowship in Basic Biomedical Research. 25

28 References ANISIMOVA, M., BIELAWSKI, J. P., AND YANG, Z Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolution. Mol. Biol. Evol. 18: ANISIMOVA, M., BIELAWSKI, J. P., AND YANG, Z Accuracy and power of Bayes prediction of amino acid sites under positive selection. Mol. Biol. Evol. 19: DAVIES, R. B Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika 74: FELSENSTEIN, J Inferring Phylogenies. Sinauer Associates, Sunderland, MA. GARTHWAITE, P., JOLLIFFE, I., AND JONES, B Statistical Inference, 2nd edition. Oxford University Press, Oxford. GOLDMAN, N Statistical tests of models of DNA substitution. J. Mol. Evol. 36: GOLDMAN, N. AND WHELAN, S Statistical tests of gamma-distributed rate heterogeneity in models of sequence evolution in phylogenetics. Mol. Biol. Evol. 17: GOLDMAN, N. AND YANG, Z A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:

29 HASEGAWA, M., KISHINO, H., AND YANO, T Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22: HSU, J. C Multiple Comparisons. Theory and Methods. Chapman and Hall, London. HUELSENBECK, J. P. AND DYER, K. A Bayesian estimation of positively selected sites. J. Mol. Evol. 58: KNUDSEN, B. AND MIYAMOTO, M. M A likelihood ratio test for evolutionary rate shifts and functional divergence among proteins. Proc. Natl Acad. Sci. USA 98: LI, W.-H., WU, C.-I., AND LUO, C.-C A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol. Biol. Evol. 2: MASSINGHAM, T Detecting positive selection in proteins: models of evolution and statistical tests. PhD thesis, University of Cambridge, England. MCDONALD, J. H. AND KREITMAN, M Adaptive protein evolution at the Adh locus in Drosophila. Nature 351: MUSE, S. V. AND GAUT, B. S A likelihood approach for comparing synonymous and non-synonymous nucleotide substitution rates, with application to the chloroplast genome. Mol. Biol. Evol. 11:

30 NEI, M. AND GOJOBORI, T Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3: NIELSEN, R. AND YANG, Z Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148: SELF, S. G. AND LIANG, K.-Y Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Amer. Stat. Assoc. 82: SOKAL, R. R. AND ROHLF, F. J Biometry, 3rd edition. W. H. Freeman and Company, New York. SUZUKI, Y New methods for detecting positive selection at single amino acid sites. J. Mol. Evol. 59: SUZUKI, Y. AND GOJOBORI, T A method for detecting positive selection at single amino acid sites. Mol. Biol. Evol. 16: SUZUKI, Y. AND NEI, M Simulation study of the reliability and robustness of the statistical methods for detecting positive selection at single amino acid sites. Mol. Biol. Evol. 19: SWANSON, W. J., NIELSEN, R., AND YANG, Q Pervasive adaptive evolution in mammalian fertilization proteins. Mol. Biol. Evol. 20:

31 WHELAN, S., DE BAKKER, P. I. W., AND GOLDMAN, N Pandit: a database of protein and associated nucleotide domains with inferred trees. Bioinformatics 19: WONG, W., YANG, Z., GOLDMAN, N., AND NIELSEN, R Accuracy and power of statistical methods for detecting adaptive evolution in protein coding sequences and for identifying positively selected sites. Genetics 168: YANG, Z Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10: YANG, Z Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39: YANG, Z Maximum-likelihood models for combined analyses of multiple sequence data. J. Mol. Evol. 42: YANG, Z PAML: a program package for phylogenetic analysis by maximum likelihood. CABIOS 13: YANG, Z Maximum likelihood estimation on large phylogenies and analysis of adaptive evolution in human influenza virus A. J. Mol. Evol. 51: YANG, Z. AND BIELAWSKI, J. P Statistical methods for detecting molecular adaptation. TREE 15:

32 YANG, Z., NIELSEN, R., GOLDMAN, N., AND PEDERSEN, A.-M. K. 2000a. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155: YANG, Z., SWANSON, W. J., AND VACQUIER, V. D. 2000b. Maximum likelihood analysis of molecular adaptation in abalone sperm lysin reveals variable selective pressures among lineages and sites. Mol. Biol. Evol. 17:

33 Method Cutoff Wrong sites %-age Range (%-age) SLR (1 14) (0 5) SNY (0 100) (0 100) SNY+LRTy{z (0 100) (17) (0 100) SNY+LRTy{y (0 100) (5) (0 100) Table 1: Number of sites falsely identified as having evolved under positive selection. One hundred data sets, each 300 codons long, were simulated under a model of neutral evolution and analyzed. SNY+LRT_ B E refers to detecting sites using empirical Bayes on data sets which pass a statistical test for the presence of positive selection using significance level `a, the number in parentheses being the number of such data sets (out of 100) which passed the test. Cutoff refers to the significance level used in LRTs for the SLR test, and to the empirical Bayes posterior probability cutoff level used in the SNY test note that these numbers are not directly comparable (see text for further details). Range shows the minimum and maximum numbers of falsely identified sites observed within the 100 individual data sets. 31

Simulation Study of the Reliability and Robustness of the Statistical Methods for Detecting Positive Selection at Single Amino Acid Sites

Simulation Study of the Reliability and Robustness of the Statistical Methods for Detecting Positive Selection at Single Amino Acid Sites Simulation Study of the Reliability and Robustness of the Statistical Methods for Detecting Selection at Single Amino Acid Sites Yoshiyuki Suzuki and Masatoshi Nei Institute of Molecular Evolutionary Genetics,

More information

Estimating Synonymous and Nonsynonymous Substitution Rates Under Realistic Evolutionary Models

Estimating Synonymous and Nonsynonymous Substitution Rates Under Realistic Evolutionary Models Estimating Synonymous and Nonsynonymous Substitution Rates Under Realistic Evolutionary Models Ziheng Yang* and Rasmus Nielsen *Department of Biology, University College London, England; and Department

More information

MODELING the heterogeneity of evolutionary regimes

MODELING the heterogeneity of evolutionary regimes INVESTIGATION On the Statistical Interpretation of Site-Specific Variables in Phylogeny-Based Substitution Models Nicolas Rodrigue 1 Eastern Cereal and Oilseed Research Centre, Agriculture and Agri-Food

More information

Codon-Substitution Models for Detecting Molecular Adaptation at Individual Sites Along Specific Lineages

Codon-Substitution Models for Detecting Molecular Adaptation at Individual Sites Along Specific Lineages Codon-Substitution Models for Detecting Molecular Adaptation at Individual Sites Along Specific Lineages Ziheng Yang* and Rasmus Nielsen *Galton Laboratory, Department of Biology, University College London;

More information

Mutation Rates and Sequence Changes

Mutation Rates and Sequence Changes s and Sequence Changes part of Fortgeschrittene Methoden in der Bioinformatik Computational EvoDevo University Leipzig Leipzig, WS 2011/12 From Molecular to Population Genetics molecular level substitution

More information

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University Machine learning applications in genomics: practical issues & challenges Yuzhen Ye School of Informatics and Computing, Indiana University Reference Machine learning applications in genetics and genomics

More information

CODON-BASED DETECTION OF POSITIVE SELECTION CAN BE BIASED BY HETEROGENEOUS DISTRIBUTION OF POLAR AMINO ACIDS ALONG PROTEIN SEQUENCES

CODON-BASED DETECTION OF POSITIVE SELECTION CAN BE BIASED BY HETEROGENEOUS DISTRIBUTION OF POLAR AMINO ACIDS ALONG PROTEIN SEQUENCES 3351 CODON-BASED DETECTION OF POSITIVE SELECTION CAN BE BIASED BY HETEROGENEOUS DISTRIBUTION OF POLAR AMINO ACIDS ALONG PROTEIN SEQUENCES Xuhua Xia Department of Biology, University of Ottawa 3 Marie Curie,

More information

The Trouble with Sliding Windows and the Selective Pressure in BRCA1

The Trouble with Sliding Windows and the Selective Pressure in BRCA1 The Trouble with Sliding Windows and the Selective Pressure in BRCA1 Karl Schmid 1, Ziheng Yang 2 * 1 Department of Plant Biology and Forest Genetics, Swedish University of Agricultural Sciences (SLU),

More information

Creation of a PAM matrix

Creation of a PAM matrix Rationale for substitution matrices Substitution matrices are a way of keeping track of the structural, physical and chemical properties of the amino acids in proteins, in such a fashion that less detrimental

More information

ESTIMATING GENETIC VARIABILITY WITH RESTRICTION ENDONUCLEASES RICHARD R. HUDSON1

ESTIMATING GENETIC VARIABILITY WITH RESTRICTION ENDONUCLEASES RICHARD R. HUDSON1 ESTIMATING GENETIC VARIABILITY WITH RESTRICTION ENDONUCLEASES RICHARD R. HUDSON1 Department of Biology, University of Pennsylvania, Philadelphia, Pennsylvania 19104 Manuscript received September 8, 1981

More information

I See Dead People: Gene Mapping Via Ancestral Inference

I See Dead People: Gene Mapping Via Ancestral Inference I See Dead People: Gene Mapping Via Ancestral Inference Paul Marjoram, 1 Lada Markovtsova 2 and Simon Tavaré 1,2,3 1 Department of Preventive Medicine, University of Southern California, 1540 Alcazar Street,

More information

2. The rate of gene evolution (substitution) is inversely related to the level of functional constraint (purifying selection) acting on the gene.

2. The rate of gene evolution (substitution) is inversely related to the level of functional constraint (purifying selection) acting on the gene. NEUTRAL THEORY TOPIC 3: Rates and patterns of molecular evolution Neutral theory predictions A particularly valuable use of neutral theory is as a rigid null hypothesis. The neutral theory makes a wide

More information

Measurement of Molecular Genetic Variation. Forces Creating Genetic Variation. Mutation: Nucleotide Substitutions

Measurement of Molecular Genetic Variation. Forces Creating Genetic Variation. Mutation: Nucleotide Substitutions Measurement of Molecular Genetic Variation Genetic Variation Is The Necessary Prerequisite For All Evolution And For Studying All The Major Problem Areas In Molecular Evolution. How We Score And Measure

More information

Optimal, Efficient Reconstruction of Phylogenetic Networks with Constrained Recombination

Optimal, Efficient Reconstruction of Phylogenetic Networks with Constrained Recombination UC Davis Computer Science Technical Report CSE-2003-29 1 Optimal, Efficient Reconstruction of Phylogenetic Networks with Constrained Recombination Dan Gusfield, Satish Eddhu, Charles Langley November 24,

More information

Database Searching and BLAST Dannie Durand

Database Searching and BLAST Dannie Durand Computational Genomics and Molecular Biology, Fall 2013 1 Database Searching and BLAST Dannie Durand Tuesday, October 8th Review: Karlin-Altschul Statistics Recall that a Maximal Segment Pair (MSP) is

More information

Lecture #8 2/4/02 Dr. Kopeny

Lecture #8 2/4/02 Dr. Kopeny Lecture #8 2/4/02 Dr. Kopeny Lecture VI: Molecular and Genomic Evolution EVOLUTIONARY GENOMICS: The Ups and Downs of Evolution Dennis Normile ATAMI, JAPAN--Some 200 geneticists came together last month

More information

Evaluation of Amino Acids Change in Structural Protein of Filoviridae Family during Evolution Process

Evaluation of Amino Acids Change in Structural Protein of Filoviridae Family during Evolution Process Evaluation of Amino Acids Change in Structural Protein of Filoviridae Family during Evolution Process Maryam Mohammadlou* Science Faculty, I. Azad University, Ahar branch, Ahar, Iran. * Corresponding author.

More information

The neutral theory of molecular evolution

The neutral theory of molecular evolution The neutral theory of molecular evolution Objectives the neutral theory detecting natural selection exercises 1 - learn about the neutral theory 2 - be able to detect natural selection at the molecular

More information

Assigning Sequences to Taxa CMSC828G

Assigning Sequences to Taxa CMSC828G Assigning Sequences to Taxa CMSC828G Outline Objective (1 slide) MEGAN (17 slides) SAP (33 slides) Conclusion (1 slide) Objective Given an unknown, environmental DNA sequence: Make a taxonomic assignment

More information

By the end of this lecture you should be able to explain: Some of the principles underlying the statistical analysis of QTLs

By the end of this lecture you should be able to explain: Some of the principles underlying the statistical analysis of QTLs (3) QTL and GWAS methods By the end of this lecture you should be able to explain: Some of the principles underlying the statistical analysis of QTLs Under what conditions particular methods are suitable

More information

A CUSTOMER-PREFERENCE UNCERTAINTY MODEL FOR DECISION-ANALYTIC CONCEPT SELECTION

A CUSTOMER-PREFERENCE UNCERTAINTY MODEL FOR DECISION-ANALYTIC CONCEPT SELECTION Proceedings of the 4th Annual ISC Research Symposium ISCRS 2 April 2, 2, Rolla, Missouri A CUSTOMER-PREFERENCE UNCERTAINTY MODEL FOR DECISION-ANALYTIC CONCEPT SELECTION ABSTRACT Analysis of customer preferences

More information

Introns early. Introns late

Introns early. Introns late Introns early Introns late Self splicing RNA are an example for catalytic RNA that could have been present in RNA world. There is little reason to assume that the RNA world was not plagued by self-splicing

More information

Lecture 10 Molecular evolution. Jim Watson, Francis Crick, and DNA

Lecture 10 Molecular evolution. Jim Watson, Francis Crick, and DNA Lecture 10 Molecular evolution Jim Watson, Francis Crick, and DNA Molecular Evolution 4 characteristics 1. c-value paradox 2. Molecular evolution is sometimes decoupled from morphological evolution 3.

More information

Drift versus Draft - Classifying the Dynamics of Neutral Evolution

Drift versus Draft - Classifying the Dynamics of Neutral Evolution Drift versus Draft - Classifying the Dynamics of Neutral Evolution Alison Feder December 3, 203 Introduction Early stages of this project were discussed with Dr. Philipp Messer Evolutionary biologists

More information

Genetic drift. 1. The Nature of Genetic Drift

Genetic drift. 1. The Nature of Genetic Drift Genetic drift. The Nature of Genetic Drift To date, we have assumed that populations are infinite in size. This assumption enabled us to easily calculate the expected frequencies of alleles and genotypes

More information

Near-Balanced Incomplete Block Designs with An Application to Poster Competitions

Near-Balanced Incomplete Block Designs with An Application to Poster Competitions Near-Balanced Incomplete Block Designs with An Application to Poster Competitions arxiv:1806.00034v1 [stat.ap] 31 May 2018 Xiaoyue Niu and James L. Rosenberger Department of Statistics, The Pennsylvania

More information

Gibbs Sampling and Centroids for Gene Regulation

Gibbs Sampling and Centroids for Gene Regulation Gibbs Sampling and Centroids for Gene Regulation NY State Dept. of Health Wadsworth Center @ Albany Chapter American Statistical Association Acknowledgments Team: Sean P. Conlan (National Institutes of

More information

Random Allelic Variation

Random Allelic Variation Random Allelic Variation AKA Genetic Drift Genetic Drift a non-adaptive mechanism of evolution (therefore, a theory of evolution) that sometimes operates simultaneously with others, such as natural selection

More information

Park /12. Yudin /19. Li /26. Song /9

Park /12. Yudin /19. Li /26. Song /9 Each student is responsible for (1) preparing the slides and (2) leading the discussion (from problems) related to his/her assigned sections. For uniformity, we will use a single Powerpoint template throughout.

More information

Human SNP haplotypes. Statistics 246, Spring 2002 Week 15, Lecture 1

Human SNP haplotypes. Statistics 246, Spring 2002 Week 15, Lecture 1 Human SNP haplotypes Statistics 246, Spring 2002 Week 15, Lecture 1 Human single nucleotide polymorphisms The majority of human sequence variation is due to substitutions that have occurred once in the

More information

Neutral theory: The neutral theory does not say that all evolution is neutral and everything is only due to to genetic drift.

Neutral theory: The neutral theory does not say that all evolution is neutral and everything is only due to to genetic drift. Neutral theory: The vast majority of observed sequence differences between members of a population are neutral (or close to neutral). These differences can be fixed in the population through random genetic

More information

Detecting selection on nucleotide polymorphisms

Detecting selection on nucleotide polymorphisms Detecting selection on nucleotide polymorphisms Introduction At this point, we ve refined the neutral theory quite a bit. Our understanding of how molecules evolve now recognizes that some substitutions

More information

Typically, to be biologically related means to share a common ancestor. In biology, we call this homologous

Typically, to be biologically related means to share a common ancestor. In biology, we call this homologous Typically, to be biologically related means to share a common ancestor. In biology, we call this homologous. Two proteins sharing a common ancestor are said to be homologs. Homologyoften implies structural

More information

COORDINATING DEMAND FORECASTING AND OPERATIONAL DECISION-MAKING WITH ASYMMETRIC COSTS: THE TREND CASE

COORDINATING DEMAND FORECASTING AND OPERATIONAL DECISION-MAKING WITH ASYMMETRIC COSTS: THE TREND CASE COORDINATING DEMAND FORECASTING AND OPERATIONAL DECISION-MAKING WITH ASYMMETRIC COSTS: THE TREND CASE ABSTRACT Robert M. Saltzman, San Francisco State University This article presents two methods for coordinating

More information

Sawtooth Software. Sample Size Issues for Conjoint Analysis Studies RESEARCH PAPER SERIES. Bryan Orme, Sawtooth Software, Inc.

Sawtooth Software. Sample Size Issues for Conjoint Analysis Studies RESEARCH PAPER SERIES. Bryan Orme, Sawtooth Software, Inc. Sawtooth Software RESEARCH PAPER SERIES Sample Size Issues for Conjoint Analysis Studies Bryan Orme, Sawtooth Software, Inc. 1998 Copyright 1998-2001, Sawtooth Software, Inc. 530 W. Fir St. Sequim, WA

More information

homology - implies common ancestry. If you go back far enough, get to one single copy from which all current copies descend (premise 1).

homology - implies common ancestry. If you go back far enough, get to one single copy from which all current copies descend (premise 1). Drift in Large Populations -- the Neutral Theory Recall that the impact of drift as an evolutionary force is proportional to 1/(2N) for a diploid system. This has caused many people to think that drift

More information

Relationship between nucleotide sequence and 3D protein structure of six genes in Escherichia coli, by analysis of DNA sequence using a Markov model

Relationship between nucleotide sequence and 3D protein structure of six genes in Escherichia coli, by analysis of DNA sequence using a Markov model Relationship between nucleotide sequence and 3D protein structure of six genes in Escherichia coli, by analysis of DNA sequence using a Markov model Yuko Ohfuku 1,, 3*, Hideo Tanaka and Masami Uebayasi

More information

Simple Methods for Estimating the Numbers of Synonymous and Nonsynonymous Nucleotide Substitutions

Simple Methods for Estimating the Numbers of Synonymous and Nonsynonymous Nucleotide Substitutions Simple Methods for Estimating the Numbers of Synonymous and Nonsynonymous Nucleotide Substitutions Masatoshi Nei* and Takashi Gojoborit *Center for Demographic and Population Genetics, University of Texas

More information

COMPUTER SIMULATIONS AND PROBLEMS

COMPUTER SIMULATIONS AND PROBLEMS Exercise 1: Exploring Evolutionary Mechanisms with Theoretical Computer Simulations, and Calculation of Allele and Genotype Frequencies & Hardy-Weinberg Equilibrium Theory INTRODUCTION Evolution is defined

More information

Glossary of Standardized Testing Terms https://www.ets.org/understanding_testing/glossary/

Glossary of Standardized Testing Terms https://www.ets.org/understanding_testing/glossary/ Glossary of Standardized Testing Terms https://www.ets.org/understanding_testing/glossary/ a parameter In item response theory (IRT), the a parameter is a number that indicates the discrimination of a

More information

Improving the Accuracy of Base Calls and Error Predictions for GS 20 DNA Sequence Data

Improving the Accuracy of Base Calls and Error Predictions for GS 20 DNA Sequence Data Improving the Accuracy of Base Calls and Error Predictions for GS 20 DNA Sequence Data Justin S. Hogg Department of Computational Biology University of Pittsburgh Pittsburgh, PA 15213 jsh32@pitt.edu Abstract

More information

Review. Molecular Evolution and the Neutral Theory. Genetic drift. Evolutionary force that removes genetic variation

Review. Molecular Evolution and the Neutral Theory. Genetic drift. Evolutionary force that removes genetic variation Molecular Evolution and the Neutral Theory Carlo Lapid Sep., 202 Review Genetic drift Evolutionary force that removes genetic variation from a population Strength is inversely proportional to the effective

More information

Evolution by Gene Duplication and Compensatory Advantageous Mutations

Evolution by Gene Duplication and Compensatory Advantageous Mutations Copyright 0 1988 by the Genetics Society of America Evolution by Gene Duplication and Compensatory Advantageous Mutations Tomoko Ohta National Institute of Genetics, Mishima, 41 I Japan Manuscript received

More information

Genetic Variation Reading Assignment Answer the following questions in your JOURNAL while reading the accompanying packet. Genetic Variation 1.

Genetic Variation Reading Assignment Answer the following questions in your JOURNAL while reading the accompanying packet. Genetic Variation 1. Genetic Variation Reading Assignment Answer the following questions in your JOURNAL while reading the accompanying packet. Genetic Variation 1. In the diagram about genetic shuffling, what two phenomena

More information

Stalking the Genetic Basis of a Trait

Stalking the Genetic Basis of a Trait INTRODUCTION The short film Popped Secret: The Mysterious Origin of Corn describes how the evolution of corn was mostly a mystery until George Beadle proposed a bold new hypothesis in 1939: corn evolved

More information

Lecture 10: Introduction to Genetic Drift. September 28, 2012

Lecture 10: Introduction to Genetic Drift. September 28, 2012 Lecture 10: Introduction to Genetic Drift September 28, 2012 Announcements Exam to be returned Monday Mid-term course evaluation Class participation Office hours Last Time Transposable Elements Dominance

More information

Data Mining. CS57300 Purdue University. March 27, 2018

Data Mining. CS57300 Purdue University. March 27, 2018 Data Mining CS57300 Purdue University March 27, 2018 1 Recap last class So far we have seen how to: Test a hypothesis in batches (A/B testing) Test multiple hypotheses (Paul the Octopus-style) 2 The New

More information

Dynamic Programming Algorithms

Dynamic Programming Algorithms Dynamic Programming Algorithms Sequence alignments, scores, and significance Lucy Skrabanek ICB, WMC February 7, 212 Sequence alignment Compare two (or more) sequences to: Find regions of conservation

More information

Sequence variation and molecular evolution of BMP4 genes

Sequence variation and molecular evolution of BMP4 genes Sequence variation and molecular evolution of BMP4 genes D.J. Zhang 1,5 *, J.H. Wu 2,3 *, G. Husile 4, H.L. Sun 2,3 and W.G. Zhang 4 1 College of Life Sciences, Inner Mongolia University, Hohhot, China

More information

Protein Sequence Analysis. BME 110: CompBio Tools Todd Lowe April 19, 2007 (Slide Presentation: Carol Rohl)

Protein Sequence Analysis. BME 110: CompBio Tools Todd Lowe April 19, 2007 (Slide Presentation: Carol Rohl) Protein Sequence Analysis BME 110: CompBio Tools Todd Lowe April 19, 2007 (Slide Presentation: Carol Rohl) Linear Sequence Analysis What can you learn from a (single) protein sequence? Calculate it s physical

More information

THE LEAD PROFILE AND OTHER NON-PARAMETRIC TOOLS TO EVALUATE SURVEY SERIES AS LEADING INDICATORS

THE LEAD PROFILE AND OTHER NON-PARAMETRIC TOOLS TO EVALUATE SURVEY SERIES AS LEADING INDICATORS THE LEAD PROFILE AND OTHER NON-PARAMETRIC TOOLS TO EVALUATE SURVEY SERIES AS LEADING INDICATORS Anirvan Banerji New York 24th CIRET Conference Wellington, New Zealand March 17-20, 1999 Geoffrey H. Moore,

More information

Methods Available for the Analysis of Data from Dominant Molecular Markers

Methods Available for the Analysis of Data from Dominant Molecular Markers Methods Available for the Analysis of Data from Dominant Molecular Markers Lisa Wallace Department of Biology, University of South Dakota, 414 East Clark ST, Vermillion, SD 57069 Email: lwallace@usd.edu

More information

Lecture WS Evolutionary Genetics Part I - Jochen B. W. Wolf 1

Lecture WS Evolutionary Genetics Part I - Jochen B. W. Wolf 1 N µ s m r - - - Mutation Effect on allele frequencies We have seen that both genotype and allele frequencies are not expected to change by Mendelian inheritance in the absence of any other factors. We

More information

Biol Lecture Notes

Biol Lecture Notes Biol 303 1 Evolutionary Forces: Generation X Simulation To launch the GenX software: 1. Right-click My Computer. 2. Click Map Network Drive 3. Don t worry about what drive letter is assigned in the upper

More information

Disclaimer This presentation expresses my personal views on this topic and must not be interpreted as the regulatory views or the policy of the FDA

Disclaimer This presentation expresses my personal views on this topic and must not be interpreted as the regulatory views or the policy of the FDA On multiplicity problems related to multiple endpoints of controlled clinical trials Mohammad F. Huque, Ph.D. Div of Biometrics IV, Office of Biostatistics OTS, CDER/FDA JSM, Vancouver, August 2010 Disclaimer

More information

BIOINFORMATICS AND SYSTEM BIOLOGY (INTERNATIONAL PROGRAM)

BIOINFORMATICS AND SYSTEM BIOLOGY (INTERNATIONAL PROGRAM) BIOINFORMATICS AND SYSTEM BIOLOGY (INTERNATIONAL PROGRAM) PROGRAM TITLE DEGREE TITLE Master of Science Program in Bioinformatics and System Biology (International Program) Master of Science (Bioinformatics

More information

TEST FORM A. 2. Based on current estimates of mutation rate, how many mutations in protein encoding genes are typical for each human?

TEST FORM A. 2. Based on current estimates of mutation rate, how many mutations in protein encoding genes are typical for each human? TEST FORM A Evolution PCB 4673 Exam # 2 Name SSN Multiple Choice: 3 points each 1. The horseshoe crab is a so-called living fossil because there are ancient species that looked very similar to the present-day

More information

MATH 5610, Computational Biology

MATH 5610, Computational Biology MATH 5610, Computational Biology Lecture 2 Intro to Molecular Biology (cont) Stephen Billups University of Colorado at Denver MATH 5610, Computational Biology p.1/24 Announcements Error on syllabus Class

More information

CHAPTER 21 LECTURE SLIDES

CHAPTER 21 LECTURE SLIDES CHAPTER 21 LECTURE SLIDES Prepared by Brenda Leady University of Toledo To run the animations you must be in Slideshow View. Use the buttons on the animation to play, pause, and turn audio/text on or off.

More information

Systematic comparison of CRISPR/Cas9 and RNAi screens for essential genes

Systematic comparison of CRISPR/Cas9 and RNAi screens for essential genes CORRECTION NOTICE Nat. Biotechnol. doi:10.1038/nbt. 3567 Systematic comparison of CRISPR/Cas9 and RNAi screens for essential genes David W Morgens, Richard M Deans, Amy Li & Michael C Bassik In the version

More information

Identification of deleterious mutations within three human genomes

Identification of deleterious mutations within three human genomes Identification of deleterious mutations within three human genomes Sung Chun and Justin C. Fay Genome Res. 2009 19: 1553-1561 originally published online July 14, 2009 Access the most recent version at

More information

Gene Identification in silico

Gene Identification in silico Gene Identification in silico Nita Parekh, IIIT Hyderabad Presented at National Seminar on Bioinformatics and Functional Genomics, at Bioinformatics centre, Pondicherry University, Feb 15 17, 2006. Introduction

More information

Positive and Negative Selection on Noncoding DNA in Drosophila simulans

Positive and Negative Selection on Noncoding DNA in Drosophila simulans Positive and Negative Selection on Noncoding DNA in Drosophila simulans Penelope R. Haddrill,* Doris Bachtrog, and Peter Andolfattoà *Institute of Evolutionary Biology, School of Biological Sciences, University

More information

Introduction to gene expression microarray data analysis

Introduction to gene expression microarray data analysis Introduction to gene expression microarray data analysis Outline Brief introduction: Technology and data. Statistical challenges in data analysis. Preprocessing data normalization and transformation. Useful

More information

GENETIC ALGORITHMS. Narra Priyanka. K.Naga Sowjanya. Vasavi College of Engineering. Ibrahimbahg,Hyderabad.

GENETIC ALGORITHMS. Narra Priyanka. K.Naga Sowjanya. Vasavi College of Engineering. Ibrahimbahg,Hyderabad. GENETIC ALGORITHMS Narra Priyanka K.Naga Sowjanya Vasavi College of Engineering. Ibrahimbahg,Hyderabad mynameissowji@yahoo.com priyankanarra@yahoo.com Abstract Genetic algorithms are a part of evolutionary

More information

Classroom Probability Simulations Using R: Margin of Error in a Public Opinion Poll

Classroom Probability Simulations Using R: Margin of Error in a Public Opinion Poll Classroom Probability Simulations Using R: Margin of Error in a Public Opinion Poll Outstanding Professor Address, Part 2 Fall Faculty Convocation (9/21/04) Bruce E. Trumbo Department of Statistics, CSU

More information

Supplementary Note: Detecting population structure in rare variant data

Supplementary Note: Detecting population structure in rare variant data Supplementary Note: Detecting population structure in rare variant data Inferring ancestry from genetic data is a common problem in both population and medical genetic studies, and many methods exist to

More information

VALUE OF SHARING DATA

VALUE OF SHARING DATA VALUE OF SHARING DATA PATRICK HUMMEL* FEBRUARY 12, 2018 Abstract. This paper analyzes whether advertisers would be better off using data that would enable them to target users more accurately if the only

More information

Phasing of 2-SNP Genotypes based on Non-Random Mating Model

Phasing of 2-SNP Genotypes based on Non-Random Mating Model Phasing of 2-SNP Genotypes based on Non-Random Mating Model Dumitru Brinza and Alexander Zelikovsky Department of Computer Science, Georgia State University, Atlanta, GA 30303 {dima,alexz}@cs.gsu.edu Abstract.

More information

Codon usage and secondary structure of MS2 pfaage RNA. Michael Bulmer. Department of Statistics, 1 South Parks Road, Oxford OX1 3TG, UK

Codon usage and secondary structure of MS2 pfaage RNA. Michael Bulmer. Department of Statistics, 1 South Parks Road, Oxford OX1 3TG, UK volume 17 Number 5 1989 Nucleic Acids Research Codon usage and secondary structure of MS2 pfaage RNA Michael Bulmer Department of Statistics, 1 South Parks Road, Oxford OX1 3TG, UK Received October 20,

More information

Protein design. CS/CME/BioE/Biophys/BMI 279 Oct. 24, 2017 Ron Dror

Protein design. CS/CME/BioE/Biophys/BMI 279 Oct. 24, 2017 Ron Dror Protein design CS/CME/BioE/Biophys/BMI 279 Oct. 24, 2017 Ron Dror 1 Outline Why design proteins? Overall approach: Simplifying the protein design problem Protein design methodology Designing the backbone

More information

What is Bioinformatics? Bioinformatics is the application of computational techniques to the discovery of knowledge from biological databases.

What is Bioinformatics? Bioinformatics is the application of computational techniques to the discovery of knowledge from biological databases. What is Bioinformatics? Bioinformatics is the application of computational techniques to the discovery of knowledge from biological databases. Bioinformatics is the marriage of molecular biology with computer

More information

Stalking the Genetic Basis of a Trait

Stalking the Genetic Basis of a Trait OVERVIEW This activity supplements the short film Popped Secret: The Mysterious Origin of Corn. It focuses on the tb1 gene, whose expression is related to phenotypic changes associated with the evolution

More information

Genome-wide association studies (GWAS) Part 1

Genome-wide association studies (GWAS) Part 1 Genome-wide association studies (GWAS) Part 1 Matti Pirinen FIMM, University of Helsinki 03.12.2013, Kumpula Campus FIMM - Institiute for Molecular Medicine Finland www.fimm.fi Published Genome-Wide Associations

More information

132 Grundlagen der Bioinformatik, SoSe 14, D. Huson, June 22, This exposition is based on the following source, which is recommended reading:

132 Grundlagen der Bioinformatik, SoSe 14, D. Huson, June 22, This exposition is based on the following source, which is recommended reading: 132 Grundlagen der Bioinformatik, SoSe 14, D. Huson, June 22, 214 1 Gene Prediction Using HMMs This exposition is based on the following source, which is recommended reading: 1. Chris Burge and Samuel

More information

2 Maria Carolina Monard and Gustavo E. A. P. A. Batista

2 Maria Carolina Monard and Gustavo E. A. P. A. Batista Graphical Methods for Classifier Performance Evaluation Maria Carolina Monard and Gustavo E. A. P. A. Batista University of São Paulo USP Institute of Mathematics and Computer Science ICMC Department of

More information

Diagnostic Screening Without Cutoffs

Diagnostic Screening Without Cutoffs Diagnostic Screening Without Cutoffs Wesley O. Johnson Young-Ku Choi Department of Statistics and Mark C. Thurmond Department of Medicine and Epidemiology University of California, Davis Introduction Standard

More information

Materials and methods. by University of Washington Yeast Resource Center) from several promoters, including

Materials and methods. by University of Washington Yeast Resource Center) from several promoters, including Supporting online material for Elowitz et al. report Materials and methods Strains and plasmids. Plasmids expressing CFP or YFP (wild-type codons, developed by University of Washington Yeast Resource Center)

More information

Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, This exposition is based on the following source, which is recommended reading:

Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, This exposition is based on the following source, which is recommended reading: Grundlagen der Bioinformatik, SoSe 11, D. Huson, July 4, 211 155 12 Gene Prediction Using HMMs This exposition is based on the following source, which is recommended reading: 1. Chris Burge and Samuel

More information

DOES DATA MINING IMPROVE BUSINESS FORECASTING?

DOES DATA MINING IMPROVE BUSINESS FORECASTING? DOES DATA MINING IMPROVE BUSINESS FORECASTING? June 13, 1998 David Chereb, Ph.D. Prepared for: THE 18 TH INTERNATIONAL SYMPOSIUM ON FORECASTING Edinburgh, Scotland INTRODUCTION The purpose of this article

More information

Preference Elicitation for Group Decisions

Preference Elicitation for Group Decisions Preference Elicitation for Group Decisions Lihi Naamani-Dery 1, Inon Golan 2, Meir Kalech 2, and Lior Rokach 1 1 Telekom Innovation Laboratories at Ben-Gurion University, Israel 2 Ben Gurion University,

More information

Package MADGiC. July 24, 2014

Package MADGiC. July 24, 2014 Package MADGiC July 24, 2014 Type Package Title Fits an empirical Bayesian hierarchical model to obtain posterior probabilities that each gene is a driver. The model accounts for (1) frequency of mutation

More information

Michelle Wang Department of Biology, Queen s University, Kingston, Ontario Biology 206 (2008)

Michelle Wang Department of Biology, Queen s University, Kingston, Ontario Biology 206 (2008) An investigation of the fitness and strength of selection on the white-eye mutation of Drosophila melanogaster in two population sizes under light and dark treatments over three generations Image Source:

More information

Estoril Education Day

Estoril Education Day Estoril Education Day -Experimental design in Proteomics October 23rd, 2010 Peter James Note Taking All the Powerpoint slides from the Talks are available for download from: http://www.immun.lth.se/education/

More information

From DNA to Protein: Genotype to Phenotype

From DNA to Protein: Genotype to Phenotype 12 From DNA to Protein: Genotype to Phenotype 12.1 What Is the Evidence that Genes Code for Proteins? The gene-enzyme relationship is one-gene, one-polypeptide relationship. Example: In hemoglobin, each

More information

Chapter 14 Active Reading Guide From Gene to Protein

Chapter 14 Active Reading Guide From Gene to Protein Name: AP Biology Mr. Croft Chapter 14 Active Reading Guide From Gene to Protein This is going to be a very long journey, but it is crucial to your understanding of biology. Work on this chapter a single

More information

LINKAGE DISEQUILIBRIUM MAPPING USING SINGLE NUCLEOTIDE POLYMORPHISMS -WHICH POPULATION?

LINKAGE DISEQUILIBRIUM MAPPING USING SINGLE NUCLEOTIDE POLYMORPHISMS -WHICH POPULATION? LINKAGE DISEQUILIBRIUM MAPPING USING SINGLE NUCLEOTIDE POLYMORPHISMS -WHICH POPULATION? A. COLLINS Department of Human Genetics University of Southampton Duthie Building (808) Southampton General Hospital

More information

Genomic models in bayz

Genomic models in bayz Genomic models in bayz Luc Janss, Dec 2010 In the new bayz version the genotype data is now restricted to be 2-allelic markers (SNPs), while the modeling option have been made more general. This implements

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Changhui (Charles) Yan Old Main 401 F http://www.cs.usu.edu www.cs.usu.edu/~cyan 1 How Old Is The Discipline? "The term bioinformatics is a relatively recent invention, not

More information

Tutorial Segmentation and Classification

Tutorial Segmentation and Classification MARKETING ENGINEERING FOR EXCEL TUTORIAL VERSION v171025 Tutorial Segmentation and Classification Marketing Engineering for Excel is a Microsoft Excel add-in. The software runs from within Microsoft Excel

More information

Getting Started with Risk in ISO 9001:2015

Getting Started with Risk in ISO 9001:2015 Getting Started with Risk in ISO 9001:2015 Executive Summary The ISO 9001:2015 standard places a great deal of emphasis on using risk to drive processes and make decisions. The old mindset of using corrective

More information

More-Advanced Statistical Sampling Concepts for Tests of Controls and Tests of Balances

More-Advanced Statistical Sampling Concepts for Tests of Controls and Tests of Balances APPENDIX 10B More-Advanced Statistical Sampling Concepts for Tests of Controls and Tests of Balances Appendix 10B contains more mathematical and statistical details related to the test of controls sampling

More information

A Propagation-based Algorithm for Inferring Gene-Disease Associations

A Propagation-based Algorithm for Inferring Gene-Disease Associations A Propagation-based Algorithm for Inferring Gene-Disease Associations Oron Vanunu Roded Sharan Abstract: A fundamental challenge in human health is the identification of diseasecausing genes. Recently,

More information

a-dB. Code assigned:

a-dB. Code assigned: This form should be used for all taxonomic proposals. Please complete all those modules that are applicable (and then delete the unwanted sections). For guidance, see the notes written in blue and the

More information

Introduction to Genome Wide Association Studies 2014 Sydney Brenner Institute for Molecular Bioscience/Wits Bioinformatics Shaun Aron

Introduction to Genome Wide Association Studies 2014 Sydney Brenner Institute for Molecular Bioscience/Wits Bioinformatics Shaun Aron Introduction to Genome Wide Association Studies 2014 Sydney Brenner Institute for Molecular Bioscience/Wits Bioinformatics Shaun Aron Genotype calling Genotyping methods for Affymetrix arrays Genotyping

More information

Computational Haplotype Analysis: An overview of computational methods in genetic variation study

Computational Haplotype Analysis: An overview of computational methods in genetic variation study Computational Haplotype Analysis: An overview of computational methods in genetic variation study Phil Hyoun Lee Advisor: Dr. Hagit Shatkay A depth paper submitted to the School of Computing conforming

More information

Chapter 12. Sample Surveys. Copyright 2010 Pearson Education, Inc.

Chapter 12. Sample Surveys. Copyright 2010 Pearson Education, Inc. Chapter 12 Sample Surveys Copyright 2010 Pearson Education, Inc. Background We have learned ways to display, describe, and summarize data, but have been limited to examining the particular batch of data

More information

Terminology: chromosome; gene; allele; proteins; enzymes

Terminology: chromosome; gene; allele; proteins; enzymes Title Workshop on genetic disease and gene therapy Authors Danielle Higham (BSc Genetics), Dr. Maggy Fostier Contact Maggy.fostier@manchester.ac.uk Target level KS4 science, GCSE (or A-level) Publication

More information

Genome sequence of Brucella abortus vaccine strain S19 compared to virulent strains yields candidate virulence genes

Genome sequence of Brucella abortus vaccine strain S19 compared to virulent strains yields candidate virulence genes Fig. S2. Additional information supporting the use case Application to Comparative Genomics: Erythritol Utilization in Brucella. (A) Genes encoding enzymes involved in erythritol transport and catabolism

More information