MUCH attention has recently been given to positive has been found at sites that code for the surface positions

Size: px
Start display at page:

Download "MUCH attention has recently been given to positive has been found at sites that code for the surface positions"

Transcription

1 Copyright 2004 by the Genetics Society of America DOI: /genetics Detecting Selection in Noncoding Regions of Nucleotide Sequences Wendy S. W. Wong 1 and Rasmus Nielsen Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York Manuscript received October 30, 2002 Accepted for publication December 31, 2003 ABSTRACT We present a maximum-likelihood method for examining the selection pressure and detecting positive selection in noncoding regions using multiple aligned DNA sequences. The rate of substitution in noncoding regions relative to the rate of synonymous substitution in coding regions is modeled by a parameter. When a site in a noncoding region is evolving neutrally 1, while 1 indicates the action of positive selection, and 1 suggests negative selection. Using a combined model for the evolution of noncoding and coding regions, we develop two likelihood-ratio tests for the detection of selection in noncoding regions. Data analysis of both simulated and real viral data is presented. Using the new method we show that positive selection in viruses is acting primarily in protein-coding regions and is rare or absent in noncoding regions. MUCH attention has recently been given to positive has been found at sites that code for the surface positions in the protein (Bonhoeffer et al. 1995; Mindell selection at the molecular level because of its functional importance. Positive selection has been idenbori 1996; Nielsen and Yang 1997; Yamaguchi and Gojo- tified in the coding region in the human immunodeficiency 1998; Yamaguchi-Kabata and Gojobori 2000). virus (HIV)-1 envelope gene (Bonhoeffer et al. In human influenza A (H3N2), positive selection has 1995), the major histocompatibility complex (MHC; been found in the hemagglutinin (HA) gene, which Hughes and Nei 1988), the tumor suppressor gene encodes for a molecule that triggers the humoral im- BRCA1 (Huttley et al. 2000), female reproductive proal. mune response in humans (Fitch et al. 1997). Endo et teins in mammals (Swanson et al. 2001), and many (1996) and Yang and Bielawski (2000) gave more other proteins (Yang and Bielawski 2000). comprehensive lists of genes that are undergoing posi- Several methods can be used to detect selection acting tive selection. on protein-coding regions. One of the common apgenes While positive selection has been found in the viral proaches is to estimate the nonsynonymous rate (d N )to that code for proteins that interact with the host synonymous rate (d S ) ratio (the most frequently used immune system, very little is known regarding selection symbols in this article are given in Table 1). The d N /d S in noncoding regions. A variety of research has shown ratio is sometimes referred to as (e.g., Goldman and that viral noncoding regions play an important role Yang 1994). When a codon site undergoes negative in gene regulation and function (Shiroki et al. 1995; selection, synonymous substitutions occur at a faster rate Walker et al. 1995; Takayoshi et al. 1998). Further- than nonsynonymous substitutions, and therefore more, Carter and Roizman (1996) showed that viral 1. However, when there is no selection (neutrality), the introns are involved in alternative splicing and regularate of synonymous substitutions is equal to the rate of tion of their own gene expression. The functional imnonsynonymous substitutions, i.e., 1. Alternatively, portance of the noncoding regions suggests that selec- if the site undergoes positive selection, new mutations tion may be acting on them. However, since most are beneficial and 1. interaction between the host immune system and viruses Inferences regarding have been used to demonselection is at the level of proteins and peptides, very little positive strate the presence of selection in many viral systems. is expected in noncoding regions compared Viruses may escape an existing immune response due to that in coding regions. Unfortunately, this and other to mutations in the proteins involved in interactions hypotheses regarding selection in the noncoding region with the immune system. As a result several viral proteins have not been subject to scientific evaluation because have been observed to evolve under strong positive sethis no appropriate statistical method has been available for lection. In the HIV-1 envelope gene, positive selection purpose. We here extend the Nielsen and Yang (1998) and Yang et al. (2000) methods for detecting positive selection in coding regions to noncoding regions. We model 1 Corresponding author: 434 Warren Hall, Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY the evolution in coding regions and the evolution in sww8@cornell.edu noncoding regions jointly and assume that the neutral Genetics 167: ( June 2004)

2 950 W. S. W. Wong and R. Nielsen Symbol TABLE 1 Definitions of symbols used Definition d N Number of nonsynonymous substitutions per nonsynonymous site d S Number of synonymous substitutions per synonymous site Nonsynonymous/synonymous rate ratio (d N /d S ) Transition/transversion rate ratio q ij (C) Rate of transition from codon i to codon j q ij (N) Rate of transition from nucleotide i to nucleotide j j Stationary frequency of codon j jk Stationary frequency of the nucleotide at position k in codon j Nucleotide substitution rate multiplier in the noncoding region n Number of nucleotide sequences N N Number of nucleotides in the noncoding region N C Number of codons in the coding region l Log-likelihood (synonymous) nucleotide substitution rate is constant jk if i k j k by a synonymous transversion in both the coding and noncoding regions of the same q ij (C) jk if i k j k by a synonymous transition gene. Under this assumption, we introduce a new parameter to model the evolution in the noncoding jk if i k j k by a nonsynonymous transition. jk if i k j k by a nonsynonymous transversion region. is the nucleotide substitution rate in the non- (1) coding region, normalized by the synonymous nucleotide substitution rate in the coding region. Therefore, Additionally, q ij (C) 0 if codons i and j differ in more when a site is subject to neutral selection, 1. Simiq ii (C) j i q ij (C) to fulfill the mathematical requirement than one position. The diagonal entries are defined as larly, 1 indicates positive selection, while 1 suggests the presence of negative selection. The intertransition/transversion rate ratio., as defined before, that the row sums must equal 0. The parameter is the pretation of is, therefore, similar to the interpretation of in models of evolution in coding regions. Using is the nonsynonymous/synonymous (d N /d S ) rate ratio. such a combined model we are able to develop a test to jk is the stationary distribution of j k, the nucleotide in detect selection that is applicable to noncoding regions. the kth position of the codon. For instance, q CAC,CAG (C) Three different simulated data sets are used to test G, since it is a nonsynonymous transversion (from the validity of the models. As an illustration of the histidine to glutamine and C G is a transversion). method, we also compile 13 viral data sets from publicato be the product of the stationary frequencies of nucle- The stationary frequency of codon i (i 1 i 2 i 3 ) is assumed tions and GenBank to examine the hypothesis that posiotides i 1, i 2, and i 3, divided by the sum of the stationary tive selection mainly targets coding regions of viral gefrequencies of the sense nomes. codons: MODEL OF THE CODING REGION i i 1 i2 i3, where c 1 TAA TAG TGA. c (2) We model the evolution of coding regions using a Here we assume a universal genetic code, but with slight continuous-time Markov chain model of codon evolu- modification the method can be applied to other getion. This model was proposed by Goldman and Yang netic codes. (1994) and Muse and Gaut (1994). Like their models, To reduce the number of free parameters, we restrict the state space is given by the 61 sense codons, since our model to be time reversible, i.e., i q ij (C) j q ji (C) for the 3 stop codons are not allowed in the model. The any i, j. It is easy to prove that this is indeed the case using rate matrix is 61 61, Q (C) {q ij (C) }, where q ij (C) (i j) standard methods. Furthermore, we take advantage of is the rate of transition from codon i to codon j. The Felsenstein s (1981) pruning algorithm to save computransition rate from codon i to codon j is assumed to tational time. be proportional to the stationary distribution of the Note that this model is different from the Goldman substituted nucleotide in codon j. If we represent codon and Yang (1994) model. In our model the codon transii as a triplet i 1 i 2 i 3 and codon j as j 1 j 2 j 3 (i 1, i 2, i 3, j 1, j 2, j 3 tion substitution rates depend on the stationary nucleo- {T, C, A, G}), and if codons i and j differ by exactly one tide frequencies, whereas in the Goldman and Yang nucleotide in position k, then (1994) model, the codon transition rates depend on

3 Detecting Selection in Noncoding Regions 951 the stationary codon frequencies. and the stationary nucleotide substitution rates ( T, C, A, G ) govern the neutral mutation rates while models the effect of selection. MODEL OF THE NONCODING REGION q (N) ij T C A G T C A G C T A G A T C G, (3) G T C A the three-category model (see Table 2). In the neutral model, we assume that there is no positive selection. Therefore, can only be 1. In the two-category model, can be 1, and it can be 1. In the three-category model, can take on values in three categories: 1, 1, and 1; that is, 0 with probability p 0 1 with probability p 1 2 with probability p 2, The Hasegawa-Kishino-Yano (HKY)85 model (Hasegawa et al. 1985) is used in the noncoding region, with a parameterization that allows comparisons of substitu- tion rates between coding and noncoding regions. Here, the unit of evolution is one nucleotide. Substitutions between nucleotides are described by a continu- ous-time Markov process. The instantaneous substitu- tion rate from nucleotide i to j (i j) is given by where 0 0 1, 1 1, 1 2, and p 1 p 2 p 3 1. The neutral model is a special case of the three-category model in which p 2 0 and a special case of the two-category model in which 2 1. Since the neutral model is nested within the two other models, a likelihood-ratio test of the hypothesis of no positive selection can be performed by comparing the maximum-likelihood value obtained under the two- and three-category models to the maximum-likelihood value obtained un- der the neutral model. where COMBINING BOTH THE NONCODING AND i is the equilibrium frequency of nucleotide i, CODING REGIONS where i {A, C, T, G}. is the rate of substitution relative to the neutral selection rate. As in the codon model, We assume that the DNA sequences are related the diagonal entries are q ii (N) j i q ij (N). through a single phylogenetic tree with a known topol- Under this model, the estimate of at a site indicates ogy, and we assume no recombination within each rethe type of selection acting on it. As in the models by gion analyzed. The true topology of the tree is rarely Gaut and Weir (1994), Muse (1995), Nielsen and known; however, the codon-based likelihood methods Yang (1998), and Yang et al. (2000) and in similar for detecting positive selection have been shown to be work by other authors, we assume that the synonymous highly robust to the assumptions regarding the topology substitution rate in the coding region reflects the neua of the phylogenetic tree (Yang et al. 2000). In practice, tral nucleotide substitution rate. Then good estimate of the topology of the tree will suffice. Joint optimization of the continuous parameters and 1 if the site undergoes negative selection the tree topology is currently not computationally feasi- 1 if the site undergoes neutral selection ble for the codon-based models. 1 if the site undergoes positive selection. (4) In our joint model of coding and noncoding regions, When 1, the rate of nucleotide substitution at a the individual sites (codon sites in the case of the coding site is equal to the synonymous codon substitution rate region and nucleotide sites in the case of the noncoding in the coding region, which in turn equals the synonygiven region) are assumed to be independent. Therefore, mous nucleotide substitution rate. To illustrate this, conlog the model and a particular phylogenetic tree, the sider C G changes in both the noncoding and coding likelihood can be calculated as the sum of the log regions, and assume that the C G change in the likelihoods among sites, N coding region is a change of codon TCC (Serine) to N l log{f(x h t 1, t 2,..., 1, 2, 3, p 1, p 2, p 3,, T, C, A, G )} TCG (Serine). In the noncoding region, q C,G (N ) G h (N) 1 since the C G change is a transversion, whereas in the coding region, q TCC,TCG (C) N C G since the change is a log{f(x h t 1, t 2,...,,, T, C, A, G )}, (5) h (C) 1 synonymous transversion. Likewise, 1 implies that the substitution rate is greater than the neutral nucleo- where N N is the number of nucleotides in the noncoding tide substitution rate, whereas 1 implies that the region, N C is the number of codons in the coding region, nucleotide substitution rate is lower than the neutral and t 1, t 2... are the branch lengths of the tree. rate. The log likelihood defined in Equation 5 can be opti- In our study we consider three rate classes of sites mized using standard optimization techniques. The (negative, neutral, and positively evolving) in the noncoding region. We have implemented three models, namely the neutral model, the two-category model, and popular local unconstrained optimization procedure [Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm] adopted from Numerical Recipes in C (Press et al. 1992)

4 952 W. S. W. Wong and R. Nielsen TABLE 2 Models of variable among sites Model P a Estimable parameters Constraints Neutral 2 0, p 0 0 1, 1 1, p 1 1 p 0 Two-category 3 0, 1, p 0 0 1, 1 1, p 1 1 p 0 Three-category 4 0, 2, p 0, p 1 0 1, 1 1, 2 1, p 2 1 p 0 p 1 a P, number of estimable parameters in the distribution. The maximum-likelihood framework allows us to test the null hypothesis of no positive selection using likeli- hood-ratio tests. Two different likelihood-ratio tests can be performed by comparing the neutral model against the three-category model and the neutral model against the two-category model. The three-category model has two more parameters ( 3, the category that allows 1, and p 3, the proportion of 3 ) than the neutral model. Comparing twice the log-likelihood difference between these two models with the 2 distribution with 2 d.f. may be used to approximate P values of this test. However, because one of the parameters is on the boundary of the parameter space and another parameter is not estimable under the null hypothesis, the true asymptotic distribu- tion of the likelihood-ratio test statistic is not known under the null hypothesis. The resulting test will therefore be conservative. A better test, similar to test II in Swanson et al. (2003) for the coding region, is to com- pare the neutral model against the two-category model. In this test twice the difference in log-likelihood is as- ymptotically distributed as a 50:50 mixture of a point mass at 0 and a 2 distribution with 1 d.f. (Chernoff 1954; Self and Liang 1987). The benefits of this test are that the reduction in the degrees of freedom may in some cases increase the power of the test and that the true asymptotic distribution of the test statistic is actually known. is used in our program EvoNC. However, since the parameters in our models are constrained (all parameters are constrained to be 0, the proportions of in each category have to add 1 to 1, and some s have to be 1 and some 1), it is a constrained optimization problem. We replaced the original objective function by a quadratic penalty function. The quadratic penalty function consists of the original objective function and a penalty term for each constraint. The penalty term is a multiple of the square of the constraint violation when the current parameter vector violates the constraint and is 0 otherwise (Nocedal and Wright 1999). A barrier term was also incorporated for each parameter to ensure that the vector of parameters lies in the interior of the parameter space. In Nielsen and Yang (1998) and Yang et al. (2000) was allowed to vary among sites. In this study we do not pursue such models as our primary interest is in the noncoding region. Allowing to vary among sites in the codon region is likely to increase the likelihood of the model. It might affect the parameter estimates of since and may be correlated. However, implementing a model that allows variable will make the optimization procedure even more computationally intensive. LIKELIHOOD-RATIO TESTS AND POSTERIOR DECODING Nucleotide sequence data were simulated using Evolver in the PAML package (Yang 1997) to verify EvoNC. Each simulated data set consisted of 15 se- quences of 1400 nucleotides. Each sequence was composed of 200 noncoding nucleotide sites and 400 codon sites (1200 nucleotides). In the noncoding region, positive, neutral, and negative selected data were simulated by varying the branch lengths. In Evolver, the tree length specifies the expected number of substitutions per site along the branches in the tree. Therefore, if we let the neutral sites have a total tree length equal to 1, then the sites that evolve five times faster would have the In general, if we have a model with k categories of, let the corresponding proportions of nucleotide sites in the three categories be p 1...p k, under the constraint that p 1... p k 1. The probability of observing data (x h ) at site h is then f(x h ) k p i f(x h i ). (6) i 1 The conditional probability can be calculated using an empirical Bayes approach as shown in Nielsen and Yang (1998). The posterior probability that the nucleotide site h belongs to category k is given by prob( i x h ) p if(x h i ) f(x h ) We assign sites to categories by choosing the category k that maximizes prob( k x h ). SIMULATION p if(x h i ) k l 1p l f(x h l ). (7) total tree length equal to 5. Similarly, the sites that evolve five times slower can be simulated by letting the total tree length be 0.2. In the coding region, was set to be 0.4. Both regions shared the same 5. The

5 Detecting Selection in Noncoding Regions 953 TABLE 3 The coding region of each of the data sets was analyzed Percentage of of the three categories in the four simulation by codeml in the PAML package version 3.12 data sets (Yang 1997). Models M7, M8 (Yang et al. 2000), and M8a (Swanson et al. 2003) were used in the study to determine if positive selection exists in the coding region. Neutral set Conserved set For each of the data sets, two tests were performed Positive set for the coding region. Test 1 compares twice the log- Positive set likelihood ratio difference between M7 and M8 to a 2 distribution with 2 d.f.. As noted in Yang et al. (2000), The total number of sites in the noncoding region is 200 in each set. this test is conservative. Test 2 compares twice the log- likelihood difference between M8 and M8a to a distribution, as suggested in Swanson et al. (2003). proportions of the three categories of were chosen The selection in the noncoding region was investiwith an intention to reflect the distribution in real data gated using EvoNC. Both the coding and the noncoding sets. The distribution of in the four data sets is shown regions were used in the program. Similar to the coding in Table 3. region, two tests were performed for each of the data sets. Test 1 compares twice the log-likelihood ratios between DATA ANALYSIS the neutral and the three-category model to a 2 2 distribution, whereas test 2 compares twice the log- The viral data sets were assembled by searching Gen likelihood ratios between the neutral and two-category Bank for previously published data sets. We used Clustal model to a distribution. The critical value of W version 1.8 (Thompson et al. 1994) for multiple align- the likelihood-ratio test statistic, for a test performed at ments of the data sets. To prevent false positive results the 5% significance level, is then because of misalignments, data with poor alignments and/or with detectable recombination were discarded. For the same reason, because interspecific viral sequences are often too diverged to give a reasonably RESULTS good alignment, only intraspecific viral data were used. Simulation data: The simulated data were analyzed We also made sure that the data did not consist of using the three models (the neutral model, the twopublished results of overlapping reading frames. category model, and the three-category model, as men- Table 4 gives the summarized source and the details tioned above), and the log-likelihood of each model of the 13 data sets and the GenBank accession numbers was used to perform likelihood-ratio tests. Estimates of are available from the authors upon request. NEIGHthree-category all parameters of the model were obtained from the BOR in the PHYLIP package version 3.6 (Felsenstein model. Each site was categorized acall 2002) was used for estimating phylogenetic trees. The cording to the posterior probabilities calculated using alignment gaps were removed by our program EvoNC. Equation 8. The estimates of parameters are summa- TABLE 4 Details of the 13 viral data sets Virus a n N N b N C Reference Hog cholera virus polyprotein Vilcek and Belak (1997) Dengue virus type Direct GenBank search Ebola glycoprotein 5 (3 ) Sanchez et al. (1996) Foot-and-mouth disease Direct GenBank search Hepatitis A Fujiwara et al. (2001) Hepatitis C 9 (3 ) Salemi and Vandamme (2002) Japanese encephalitis virus 20 (3 ) Direct GenBank search Mammalian orthoreovirus Breun et al. (2001) Newcastle disease virus nucleocapsid protein Seal et al. (2002) Poliovirus Direct GenBank search Rabies virus glycoprotein Badrane and Tordo (2001) TT virus ORF Luo et al. (2002) West Nile virus Direct GenBank search a Complete genome of the virus unless specified otherwise. b 5 noncoding region unless specified otherwise.

6 954 W. S. W. Wong and R. Nielsen TABLE 5 True and estimated values of parameters of the simulated data under the three-category model True values of the parameters Estimates of parameters Data set d N /d S s and p s d N /d S s and p s Neutral NE a, p , p , , p , p , 2 NE, (p ) , (p ) Conserved , p , p , , p , p , 2 NE, (p ) , (p ) Positive , p , p , , p , p , 2 5, p , (p ) Positive , p , p , , p , p , 2 2, p , (p ) a NE, not estimable. The Bayesian classification of sites performed quite well for the first three simulated data sets. In the neutral data set, EvoNC classified all 200 sites correctly. In the conserved set, 14 of the actual neutral sites were miscate- gorized as conserved sites while 3 of the actual conserved sites were miscategorized as neutral sites. A total of 183 sites were classified correctly. In positive data set 1, the 5 positively selected sites were all included in the esti- mated set of positively selected sites. Three neutral sites were also falsely included in this set. In positive data set 2, 1 out of the 5 positively selected sites was classified as being neutral and 16 out of the 195 negatively selected and neutral sites were classified as being positively selected. The main reason the method performs worse when there are only a few posi- tively selected sites is that the estimates of ^2 are more unreliable. When the maximum-likelihood estimates of parameters have large variance, the empirical Bayes estimates of posterior probabilities may have reduced accu- racy. Nevertheless, a close examination revealed that the falsely classified sites have low posterior probabilities. Therefore, by adjusting the cutoff probability the accuracy of the method can be controlled. These results suggest that our method is capable of picking up strong positive selection, even though as few as 2.5% of the sites are positively selected. On the other hand, when only a small portion of the sites are undergo- ing weak positive selection, the classification of sites does not have great accuracy. The viral data sets: The results of the PAML analysis are summarized in Table 6. Five out of the 13 viral data sets have significant evidence for positive selection, using test 1 for the coding region (M7 vs. M8): glycoprotein of Ebola virus, foot-and-mouth disease polyprotein, hepatitis C polyprotein, New Castle disease virus nucleo- rized in Table 5, and the classification of sites in conserved and positive sets is shown in Figure 1. The estimated values of match quite well with the actual values. In the neutral data set, estimates of ^0 ^2 ^1 1 were obtained for all sites for all models. In this boundary of the parameter space, the proportion of sites in the different categories is not identifiable. In the conserved data set, maximum-likelihood estimates of ^ and ^2 ^ were obtained for the three-category model, which were virtually identical to the true values of 0.2 and 1.0. Again, the proportion of sites in category 1 and 2 is not identifiable. The estimate of the proportion of conserved sites was pˆ0 0.51, which was slightly larger than the true proportion In the positive data set 1, estimates of ^ and ^ were obtained for the three-category model, which differed slightly from the true values of 0.2 and 5.0, respectively. The estimated proportion of neutral sites was less than expected (0.15 vs. 0.25). On the other hand, pˆ and pˆ2 0.06, which were both larger than the true values of and In the positive data set 2, the estimates from the three-category model are ^ and ^2 1.55, and the corresponding proportions are pˆ and pˆ Again, the estimated proportions were slightly larger than the true values. In the neutral and conserved data sets, all three models gave approximately the same maximum-log-likeli- hood value in these two sets. In these two cases, the neutral null hypothesis was true and was not rejected by the likelihood-ratio test. In the positive data set 1, the maximum-log-likelihood values for the three models were l in the neutral model in the two-category model in the three-category model. Both tests reject the false null hypothesis of no positive selection. However, from the maximum-log-likelihood shown below, only test 2 (neutral model vs. two-category model) was significant in positive data set 2. This is probably due to the fact that only 5 of 200 sites were slightly positively selected in the neutral model l in the two-category model in the three-category model.

7 Detecting Selection in Noncoding Regions 955 Figure 1. Predicted vs. actual values of in two of the simulated data sets. The three-category model was used for prediction. Predicted parameter values are the MAP (maximum a priori) estimates. capsid protein, and poliovirus polyprotein. Among these 5 data sets, 2 were not significant using test 2 for the coding region (M8a vs. M8). This may be because the true value of was only slightly 1 in the positively selected sites or, possibly, because the beta distribution assumed in model M7 did not fit the true distribution of well. For instance, in the foot-and-mouth disease data set, was equal to 1.01 in the positively selected sites. Clearly, this cannot be interpreted as good evi- dence for positive selection in this virus. On the other hand, for example, 11% of the sites in Ebola glycoprotein have 5.92, which is probably a good indication that this protein is undergoing positive selection. In the noncoding regions there was little or no evidence of positive selection (Table 6). For most data sets, the log-likelihoods of all three models were almost exactly the same (Table 7). One exception was the Japa- nese encephalitis virus. The test of the two-category

8 956 W. S. W. Wong and R. Nielsen TABLE 6 Log-likelihood values under models M7, M8 (Yang et al. 2000), and M8a (Swanson et al. 2003) and parameter estimates under model M8 by PAML in the coding region Log-likelihood values of the models Estimates of parameters p, q, Data set M7 M8 M8a and in M8 Classical swine fever virus p , p 0.00, q 1.52, (p ), 0.59 Dengue virus type p , p 3.06, q 99.00, (p ), 0.69 Ebola glycoprotein p , p 0.49, q 7.58, (p ), 5.92 Foot-and-mouth disease p , p 0.13, q 2.34, (p ), 1.01 Hepatitis A p , p 1.78, q 99.00, (p ), 1.19 Hepatitis C p , p 0.27, q 2.36, (p ), Japanese encephalitis virus p , p 0.09, q 0.92, (p ), 0.06 Mammalian orthoreovirus p , p 0.02, q 0.42, (p ), 0.01 Newcastle disease virus p , p 0.23, q 1.59, (p ), Poliovirus p , p 0.30, q 2.83, (p ) Rabies virus glycoprotein p , p 0.33, q 3.88, (p ), 0.69 TT virus ORF p , p 0.80, q 0.99, (p ), 1.13 West Nile virus p , p 0.62, q 99.00, (p ), 0.14 model vs. the neutral model in the 3 -untranslated re- should be noted that most of the data sets in the study gion of Japanese encephalitis virus was marginally sig- had a low number of sequences and perhaps had low nificant (P 0.05), but the test of the three-category sequence diversity as well. Anisimova et al. (2001) model vs. the neutral model (P 0.22) was not. Nam showed that such data sets would reduce the power of et al. (2002) described this region as the variable region the tests. Therefore, data sets with a low number of since it showed a high degree of sequence variation and sequences should be reexamined when more data bedeletion. However, they also pointed out that despite come available. the fact that the region was highly variable, the predicted The results of our analysis of simulated data suggest RNA structures all had a similar type loop at the 5 that the new method provides accurate parameter estiterminus. mates. The likelihood-ratio tests also performed well After correcting for multiple testing by the Bonfer- and detected selection from 15 sequences when only roni procedure, we conclude that the likelihood-ratio 2.5% of the sites were undergoing positive selection. tests provide no evidence for positive selection in the When more than one category of was present, the viral data that we examined. This result confirms our program miscategorized 10% of the sites. In small expectation that positive selection occurs primarily in data sets the classification of sites may not have the the coding regions of viruses. highest accuracy. A similar conclusion has been reached regarding classification of sites in coding regions (Anisimova DISCUSSION et al. 2001, 2002). In the analysis of real data, it is advisable to confirm the presence of positive selection Because positive selection is expected to occur in in particular sites by employing additional structural, regions where viruses interact with the host immune functional, or evolutionary information. system, this type of selection is expected to be much Our model is based on the assumption that there is more predominant in coding than in noncoding regions. no selection acting on the synonymous sites and the rate Our study confirmed this belief. However, it of substitution is constant among sites. This assumption

9 Detecting Selection in Noncoding Regions 957 TABLE 7 Log-likelihood values and parameter estimates of the 13 viral data sets under the three models Estimates from the three-category model Data set l 1 l 2 l 3 s and p s Classical swine fever virus , p , p , , (p ) Dengue virus type , p , p , , (p ) Ebola glycoprotein , p , p , , (p ) Foot-and-mouth disease , p , p , , (p ) Hepatitis A , p , p , , (p ) Hepatitis C , p , p , , (p ) Japanese encephalitis virus , p , p , , (p ) Mammalian orthoreovius , p , p , , (p ) Newcastle disease virus , p , p , , (p ) Poliovirus , p , p , , (p ) Rabies virus glycoprotein , p , p , , (p ) TT virus ORF , p , p , , (p ) West Nile virus , p , p , , (p ) would be violated if there is codon usage bias in the We are grateful to Charles F. Aquadro and David B. Shmoys for coding region. A lot of the viruses may have codon usage their valuable comments. This work was supported by National Science bias due to the overlapping reading frames and/or RNA Foundation (NSF) grant DEB , NSF/National Institutes of structural constraints in coding regions. We do not know Health grant DMS/NIGMS , and Human Frontier in Science how much the bias would have affected the parameter Program grant RGY0055/2001-M. estimates but this is a question that could be addressed using simulations. The models presented here do not allow rate variation LITERATURE CITED among the lineages in the phylogeny. This may result Anisimova, M., J. P. Bielawski and Z. Yang, 2001 Accuracy and power of the likelihood ratio test in detecting adaptive molecular in a loss of statistical power of the likelihood-ratio tests. evolution. Mol. Biol. Evol. 18: Our method will not be able to identify positively sepower Anisimova, M., J. P. Bielawski and Z. Yang, 2002 Accuracy and of bayes prediction of amino acid sites under positive lected sites if positive selection occurs in only a few selection. Mol. Biol. Evol. 19: lineages, while the negative selection dominates the rest. Badrane, H., and N. Tordo, 2001 Host switching in Lyssavirus history In the future, we would like to incorporate rate variation from the Chiroptera to the Carnivora orders. J. Virol. 75: among the lineages into the program as well Bonhoeffer, S., E. C. Holmes and M. A. Nowak, 1995 Since the optimization procedure (BFGS algorithm) HIV diversity. Nature 376: 125. Causes of used here is a local optimization procedure, it is possible Breun, L. A., T. J. Broering, A. M. McCutcheon, S. J. Harrison, that the likelihood is trapped at a local minimum. While C. L. Luongo et al., 2001 Mammalian reovirus L2 gene and lambda2 core spike protein sequences and whole-genome comwe do not have a rigorous solution to the problem, it parisons of reoviruses type 1 Lang, type 2 Jones, and type 3 is advisable to run EvoNC with different sets of initial Dearing. Virology 287: values. If they all result in the same log-likelihood it is Carter, K. L., and B. Roizman, 1996 Alternatively spliced mrnas predicted to yield frame-shift proteins and stable intron 1 RNAs very likely that one has found the global minimum. of the herpes simplex virus 1 regulatory gene alpha 0 accumulate EvoNC is currently under development for incorpo- in the cytoplasm of infected cells. Proc. Natl. Acad. Sci. USA 93: rating new models and functionalities. The beta version Chernoff, H., 1954 On the distribution of the likelihood ratio. for the models tested in this article is available from the Ann. Math. Stat. 25: authors upon request. Endo, T., K. Ikeo and T. Gojobori, 1996 Large-scale search for

10 958 W. S. W. Wong and R. Nielsen genes on which positive selection may operate. Mol. Biol. Evol. Salemi, M., and A. M. Vandamme, 2002 Hepatitis C virus evolutionary 13: patterns studied through analysis of full-genome sequences. Felsenstein, J., 1981 Evolutionary trees from DNA sequences: a J. Mol. Evol. 54: maximum likelihood approach. J. Mol. Evol. 17: Sanchez, A., S. G. Trappier, B. W. Mahy, C. J. Peters and S. T. Felsenstein, J., 2002 Phylogenetic Inference Package (PHYLIP), Version Nichol, 1996 The virion glycoproteins of Ebola viruses are 3.6. University of Washington, Seattle. encoded in two reading frames and are expressed through tran- Fitch, W. M., R. M. Bush, C. A. Bender and N. J. Cox, 1997 Long scriptional editing. Proc. Natl. Acad. Sci. USA 93: term trends in the evolution of H(3) HA1 human influenza type Seal, B. S., J. M. Crawford, H. S. Sellers, D. P. Locke and D. J. A. Proc. Natl. Acad. Sci. USA 94: King, 2002 Nucleotide sequence analysis of the Newcastle dis- Fujiwara, K., O. Yokosuka, K. Fukai, F. Imazeki, H. Saisho et al., ease virus nucleocapsid protein gene and phylogenetic relation Analysis of full-length hepatitis A virus genome in sera ships among the Paramyxoviridae. Virus Res. 83: from patients with fulminant and self-limited acute type A hepatilikelihood estimators and likelihood ratio tests under nonstan- Self, S., and K.-Y. Liang, 1987 Asymptotic properties of maximum tis. J. Hepatol. 35: Gaut, B. S., and B. S. Weir, 1994 Detecting substitution-rate heterodard conditions. J. Am. Stat. Assoc. 82: Shiroki, K., T. Ishii, T. Aoki, M. Kobashi, S. Ohka et al., 1995 A new geneity among regions of a nucleotide sequence. Mol. Biol. Evol. cis-acting element for RNA replication within the 5 noncoding 11: region of poliovirus type 1 RNA. J. Virol. 69: Goldman, N., and Z. Yang, 1994 A codon-based model of nucleotide Swanson, W. J., Z. Yang, M. F. Wolfner and C. F. Aquadro, 2001 substitution for protein-coding DNA sequences. Mol. Biol. Evol. Positive Darwinian selection drives the evolution of several female 11: reproductive proteins in mammals. Proc. Natl. Acad. Sci. USA Hasegawa, M., H. Kishino and T. Yano, 1985 Dating of the human- 98: ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Swanson, W. J., R. Nielsen and Q. Yang, 2003 Pervasive adaptive Evol. 22: evolution in mammalian fertilization proteins. Mol. Biol. Evol. Hughes, A. L., and M. Nei, 1988 Pattern of nucleotide substitution 20: at major histocompatibility complex class I loci reveals overdomi- Takayoshi, I., S. M. Tahara and M. M. C. Lai, 1998 The 39-untransnant selection. Nature 335: lated region of hepatitis C virus RNA enhances translation from Huttley, G. A., S. Easteal, M. C. Southey, A. Tesoriero, G. G. an internal ribosomal entry site. J. Virol. 72: Giles et al., 2000 Adaptive evolution of the tumour suppressor Thompson, J. D., D. G. Higgins and T. J. Gibson, 1994 CLUSTAL BRCA1 in humans and chimpanzees. Australian Breast Cancer W: improving the sensitivity of progressive multiple sequence Family Study. Nat. Genet. 25: alignment through sequence weighting, position-specific gap Luo, K., H. He, Z. Liu, D. Liu, H. Xiao et al., 2002 Novel variants penalties and weight matrix choice. Nucleic Acids Res. 22: 4673 related to TT virus distributed widely in China. J. Med. Virol. 67: Vilcek, S., and S. Belak, 1997 Organization and diversity of the Mindell, D. P., 1996 Positive selection and rates of evolution in 3 -noncoding region of classical swine fever virus genome. Virus immunodeficiency viruses from humans and chimpanzees. Proc. Genes 15: Natl. Acad. Sci. USA 93: Walker, P. A., L. E. Leong and A. G. Porter, 1995 Sequence and Muse, S. V., 1995 Evolutionary analyses of DNA sequences subject structural determinants of the interaction between the 5 -nonto constraints of secondary structure. Genetics 139: coding region of picornavirus RNA and rhinovirus protease 3C. Muse, S. V., and B. S. Gaut, 1994 A likelihood approach for compar- J. Biol. Chem. 270: ing synonymous and nonsynonymous nucleotide substitution Yamaguchi, Y., and T. Gojobori, 1997 Evolutionary mechanisms rates, with application to the chloroplast genome. Mol. Biol. Evol. and population dynamics of the third variable envelope region of 11: HIV within single hosts. Proc. Natl. Acad. Sci. USA 94: Nam, J. H., S. L. Chae, S. H. Park, Y. S. Jeong, M. S. Joo et al., 2002 Yamaguchi-Kabata, Y., and T. Gojobori, 2000 Reevaluation of High level of sequence variation in the 3 noncoding region of amino acid variability of the human immunodeficiency virus type Japanese encephalitis viruses isolated in Korea. Virus Genes 24: 1 gp120 envelope glycoprotein and prediction of new discontinuous epitopes. J. Virol. 74: Yang, Z., 1997 PAML: a program package for phylogenetic analysis Nielsen, R., and Z. Yang, 1998 Likelihood models for detecting by maximum likelihood. Comput. Appl. Biosci. 13: positively selected amino acid sites and applications to the HIV-1 Yang, Z., and J. P. Bielawski, 2000 Statistical methods for detecting envelope gene. Genetics 148: molecular adaptation. Trends Ecol. Evol. 15: Nocedal, J., and S. J. Wright, 1999 Numerical Optimization, pp. Yang, Z., R. Nielsen, N. Goldman and A. M. Pedersen, 2000 Co Springer-Verlag, New York. don-substitution models for heterogeneous selection pressure at Press, W. H., S.A. Teukolsky, W. T. Vetterling and B. P. Flannery, amino acid sites. Genetics 155: Numerical Recipes: The Art of Scientific Computing, pp Cambridge University Press, Cambridge/London/New York. Communicating editor: J. J. Hein