Efficiencies of the NJp, Maximum Likelihood, and Bayesian Methods of Phylogenetic Construction for Compositional and Noncompositional Genes


Ruriko Yoshida 1 and Masatoshi Nei*,2,3
1 Department of Statistics, University of Kentucky
2 Department of Biology and Institute of Molecular Evolutionary Genetics, Pennsylvania State University, University Park
3 Department of Biology, Temple University
*Corresponding author: nxm2@psu.edu.
Associate editor: Takashi Gojobori

© The Author. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com. Mol. Biol. Evol. 33(6): doi: /molbev/msw042. Advance Access publication February 28, 2016.

Abstract

At the present time it is often stated that the maximum likelihood or the Bayesian method of phylogenetic construction is more accurate than the neighbor joining (NJ) method. Our computer simulations, however, have shown that the converse is true if we use p distance in the NJ procedure and the criterion of obtaining the true tree (Pc, expressed as a percentage) or the combined quantity (c) of a value of Pc and a value of the Robinson-Foulds average topological error index (d_T). This c is given by $c = Pc\,(1 - d_T/d_{T\max}) = Pc\,(m - 3 - d_T/2)/(m - 3)$, where m is the number of taxa used and $d_{T\max}$ is the maximum possible value of d_T, which is given by 2(m - 3). This neighbor joining method with p distance (NJp method) will be shown generally to give the best data-fit model. This c takes a value between 0 and 1 (or between 0 and 100 when Pc is expressed as a percentage, as in the tables), and a tree-making method giving a high value of c is considered to be good. Our computer simulations have shown that the NJp method generally gives a better performance than the other methods, and therefore this method should be used in general, whether or not the gene is compositional or contains mosaic DNA regions.

Key words: neighbor joining method with p distance (NJp method), maximum likelihood method, Bayesian method, probability of obtaining the correct tree, topological error index (d_T), c value, b value.

Introduction

One of the most successful areas of molecular study of evolution has been the construction of phylogenetic trees (Gojobori and Bernardi 2000; Nei and Kumar 2000; Felsenstein 2004). At present, there are four major methods of tree construction, that is, parsimony (Fitch 1971; Hartigan 1973), neighbor joining (NJ; Saitou and Nei 1987), maximum likelihood (ML; Felsenstein 1981), and Bayesian (Ronquist and Huelsenbeck 2003) methods. The parsimony method is less popular than the others, because the detailed reconstruction of past evolutionary history becomes increasingly difficult as the evolutionary time increases. Several computer simulations have suggested that among the remaining three methods the ML or Bayesian method is more efficient than the NJ method in obtaining the true tree (Kuhner and Felsenstein 1994; Guindon and Gascuel 2003). In contrast, Tateno et al. (1994), Takahashi and Nei (2000), and Rosenberg and Kumar (2001) found cases where NJ performs better than ML. In these studies, NJ trees are commonly constructed by using corrected and continuous distance estimates based on mathematical models. Actually, the standard errors of corrected evolutionary distances are larger than those of the uncorrected p distances, and this has been one of the reasons for the poor performance of the NJ method. Another factor contributing to the low performance of NJ was variation of substitution rates among sites (Guindon and Gascuel 2002), where a gamma variable is often used. We should, therefore, re-examine the performance of NJ relative to that of the other methods. Because the performance of NJ depends on the standard errors of the distance estimates used (Schöniger and von Haeseler 1993; Takahashi and Nei 2000), NJ with p distance (NJp method) is expected to give a good performance. Another advantage of the NJp method is that it is applicable to any mathematical model including compositional genes.

In the ML and Bayesian methods of tree making, it is necessary to use a certain mathematical model, but in reality it is unlikely that any type of mathematical model is applicable to the entire regions of DNA for a long evolutionary time. This is particularly problematic when many different DNA regions of a gene or different genes of a genome are used in constructing phylogenetic trees (Roch and Steel 2015). If we consider these situations, it is possible that the NJ method of tree construction with p distance (NJp method) shows a better performance. For these reasons, it would be interesting to conduct computer simulations considering a range of evolutionary scenarios. If these simulations show that the NJp method is better than or as good as the ML and Bayesian methods, it will simplify the current practice of phylogenetic analysis considerably, because the NJp method is technically much simpler than the latter methods.

In recent years many authors have considered the evolution of compositional genes or mosaic genes, in which a gene (or genome) consists of several DNA regions (e.g., Ota and Nei 1994; Nei et al. 2000). These types of genes have typically been analyzed by considering only one homologous region of DNA for all species used. However, because the other regions also have some phylogenetic information, all regions should be used for tree construction. In this study, we assume that each gene consists of 1-6 DNA regions and that different DNA regions evolve independently after sequence divergence, whether they belong to different populations or to the same population. Therefore, the differences between DNA regions are expected to increase as time goes on. In ML or Bayesian methods, one DNA sequence may be chosen from each population because of the large computational time required, and therefore an incorrect tree may be obtained. If polymorphic sequences exist in a population, one has to have all or most of the polymorphic sequences and examine which model fits the data best. In this article, it is also assumed that the best data-fit model is the model to be adopted, in the spirit of Posada and Crandall (2001). Note that, like the ML and Bayesian methods, the NJp method is a tree-making method, where the branching pattern of genes and branch lengths are determined. It should also be noted that one DNA region can be long and cover one entire gene, or that each DNA region can be short while all the DNA regions together compose one gene. In the latter case the entire gene is mosaic, and the similarity between two genes can be measured by the overall nucleotide or amino acid similarity of the genes.
Of course, a particular DNA region may differ from the other regions, and within the entire set of DNA regions a region may be considered to be newly added or the corresponding region to be newly lost.

In this article, we have two aims. The first is to compare the performances of the NJp, ML, and Bayesian methods under the assumption that a specific substitution model applies to a given DNA region and to examine how well each of the tree-making methods performs in recovering the true tree topology. The second is to study the performances of the three methods for the case where different DNA regions evolve in different ways and a tree topology is constructed by using all DNA regions. We focus on the construction of the correct tree topology, measuring the accuracy of the tree produced by the empirical probability of obtaining the correct tree topology (Pc; the number of replications giving the correct topology divided by the total number of replications, × 100) and by the topological error index (d_T) of Robinson and Foulds (1981) for the tree obtained (see also Penny and Hendy 1985). In general, if Pc is high and d_T is low, or c is high, the tree is considered to be reliable.

Results

In this study, we set up a given model tree and examine how accurately different tree-making methods recover the model (true) tree. Following the previous studies (e.g., Kuhner and Felsenstein 1994; Takahashi and Nei 2000), we first generated the model tree with 30 DNA sequences (fig. 1A) using the branching process. In this case the total number of expected nucleotide substitutions from the root to the tip of the tree was assumed to be 1.0 per site. Once the model tree was determined, we generated a set of 30 DNA sequences of a given length following the nucleotide substitution model of Hasegawa et al. (1985) with the gamma parameter α = 1. This model will be called the Hasegawa-Kishino-Yano (HKY) + G model in the following.
We considered 600, 1,200, or 6,000 nt per sequence and let the sequences evolve following the model tree in figure 1A. The initial nucleotide sequences were randomly generated by using the nucleotide frequencies as specified in figure 1. In this simulation, we constructed 100 replicate trees for each model tree and each substitution model used. We used the computer programs PHYLIP (Felsenstein 2005) and PAML (Yang 2007) for constructing NJ trees and PHYML (Guindon and Gascuel 2003) for constructing ML trees. We used MrBayes (Ronquist and Huelsenbeck 2003) for constructing Bayesian trees. In the case of Bayesian trees, the computational time required was enormous, so that the number of replicate trees obtained was often smaller than 100, as mentioned later. The sample trees were generated by various tree construction methods (Saitou and Nei 1987; Schöniger and von Haeseler 1993; Tateno et al. 1994; Takahashi and Nei 2000). At this point, it should be noted that the construction method of NJ trees is slightly different from that of the minimum evolution trees (Rzhetsky and Nei 1993). In the former, the principle of additive trees is required for each step of finding neighbors, and a least-squares approach (ordinary least squares [OLS] method) is used for constructing a tree (Saitou and Nei 1987; Studier and Keppler 1988; Gascuel and Steel 2006).

The results of this simulation are presented in table 1. In the case of 600 nt, the probability of obtaining the correct topology (Pc) or c was low, but both Pc and c rapidly increased as the number of nucleotides increased. The d_T value was high when the number of nucleotides was small, but it decreased when the number of nucleotides was large. In the case of the NJ method with 600 nt, NJ trees showed the highest Pc and the smallest d_T value despite the fact that the initial DNA sequences were generated by using the HKY + G model.
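The difference between the uncorrected p distance and a model-corrected distance can be illustrated with a minimal Python sketch. It is not the code used in the simulations (those relied on PHYLIP and PAML); the function names are hypothetical, and the Jukes-Cantor correction and its large-sample standard error are the standard textbook formulas, used here only to show why a corrected distance carries a larger standard error than p for the same data.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, equal-length, and non-empty")
    diffs = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return diffs / len(seq_a)


def jc_distance(p):
    """Jukes-Cantor corrected distance, d = -(3/4) ln(1 - 4p/3)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC distance is defined only for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def p_standard_error(p, n_sites):
    """Binomial standard error of the p distance over n_sites sites."""
    return math.sqrt(p * (1.0 - p) / n_sites)


def jc_standard_error(p, n_sites):
    """Large-sample standard error of the JC distance.

    The correction inflates the standard error of p by 1 / (1 - 4p/3).
    """
    return p_standard_error(p, n_sites) / (1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # Toy aligned sequences (illustrative only, not simulation data).
    a = "ACGTACGTACGT"
    b = "ACGTTCGAACGA"
    p = p_distance(a, b)
    print("p distance:", p)
    print("JC distance:", jc_distance(p))
    # Standard errors for a 600-nt alignment with the same p.
    print("SE(p):", p_standard_error(p, 600))
    print("SE(JC):", jc_standard_error(p, 600))
```

For p = 0.25 the inflation factor is 1.5, and it grows rapidly as p approaches 0.75, which is consistent with the argument that corrected distances are noisier when sequences are divergent.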
Actually, we also used the Jukes-Cantor (JC) and Kimura's 2-parameter (K2P) models without the gamma variable, because some authors (e.g., Takahashi and Nei 2000) showed that a simpler method often produces a better performance. Indeed, in table 1 the JC and K2P models show smaller d_T values than the HKY + G model when the NJ procedure is used. However, this was not the case when the ML method was used, and the d_T values were similar for all six substitution models considered here. Furthermore, the ML (HKY + G) model, which is often used in an ML analysis, did not show the largest Pc value despite the fact that the initial sequence data were obtained by using the HKY + G model. This has occurred apparently because ML trees were obtained by maximizing the likelihood by changing various parameter values irrespective of the initial values. Nevertheless, the d_T values were similar to those of the NJp method.

FIG. 1. Model trees used for computer simulations. (A) Thirty-sequence tree was generated by using the branching process. The total number of nucleotide substitutions from the root to the tip of the tree was 1.0 per nucleotide site. The HKY model of nucleotide substitution with a gamma parameter of (α) = 1.0 was used under the assumption of molecular clock. The frequencies of nucleotides A, T, C, and G were assumed to be fA = 0.15, fT = 0.15, fC = 0.35, and fG = 0.35, respectively, with a transition/transversion rate ratio of 4.0. The number of nucleotides per sequence was assumed to be 600, 1,200, or 6,000, and 300 replicate trees were obtained to evaluate the efficiencies of different tree-making methods. (B), (C), and (D) Genic sequences of 6 DNA regions (R1, R2, ..., R6), each with 600 nt, were assumed to evolve following the phylogenetic tree given on the left-hand side of the figure. Here, the expected numbers of nucleotide substitutions for branches a, b, c, d, e, f, g, h, i, j, and k were 0.1, 0.2, 0.2, 0.3, , 0.6, 0.7, 0.8, 0.9, and 1.0, respectively. Note that tree D represents a case of high degree of sequence divergence, and the expected distance between the bottom sequence and an upper one is 2.0 per site. In this simulation the Tamura-Nei model of nucleotide substitution with a gamma variable of α = 1 was used for original sequence generation, and the substitution parameters for the six DNA regions were as follows.
Region R1: k1 = k2 = 2, fA = fT = fC = fG = 0.25.
Region R2: k1 = k2 = 4, fA = fT = 0.15, fC = fG = 0.35.
Region R3: k1 = k2 = 8, fA = fT = 0.1, fC = fG = 0.4.
Region R4: k1 = 2, k2 = 4, fA = 0.1, fT = 0.2, fC = 0.3, fG = 0.4.
Region R5: k1 = 2, k2 = 8, fA = 0.3, fT = 0.2, fC = 0.1, fG = 0.4.
Region R6: k1 = 2, k2 = 4, fA = 0.4, fT = 0.35, fC = 0.05, fG = 0.2.
Genic sequences of six genes each with 600 nt were assumed to evolve following the tree structure given on the left-hand side of the figure. The substitution patterns of the six DNA regions were the same as those of figure 1B.

Table 1. Probability of Obtaining the Correct Topology (Pc), Topological Error Index (d_T), and c Value for Model Tree A in Figure 1.

Methods            600 nt            1,200 nt          6,000 nt
                   Pc(d_T)    c      Pc(d_T)    c      Pc(d_T)    c
NJ p               21(2.48)   20     47(1.36)   46     96(0.08)   96
NJ JC              15(2.96)   14     35(1.70)   34     88(0.26)   88
NJ K2p             17(3.42)   16     33(1.88)   32     85(0.36)   84
NJ HKY+G           3(7.06)    3      12(3.94)   11     52(1.24)   51
ML JC              24(2.66)   23     37(1.64)   36     92(0.16)   92
ML K2p             23(2.54)   22     42(                 (0.14)   93
ML HKY+G           20(2.68)   19     45(1.32)   44     91(0.20)   91
ML HKY+G+I         20(2.64)   19     43(1.34)   42     91(0.20)   91
ML GTR+G           21(2.52)   19     45(1.30)   44     91(0.20)   91
ML GTR+G+I         20(2.66)   19     43(1.36)   42     91(0.20)   91
Bayesian HKY+G+I   14(2.34)   14     39(1.18)   38     92(0.16)   92

NOTE. The HKY + G model with α = 1 was used for generating sequence data. Results from 100 replicate trees are presented.

Table 1 shows the results for 1,200 nt and 6,000 nt as well. Although the Pc value is now higher than that for the case of 600 nt and the d_T value is lower than that of 600 nt, the conclusion about the relative efficiencies of the three tree-making methods is essentially the same. In other words, the accuracies of ML and Bayesian trees are essentially the same as that of the NJp method. These results are interesting because we can now say that the simple NJp method produces phylogenetic trees more efficiently than or as efficiently as the ML or Bayesian method. However, before we reach this conclusion more generally, we have to know whether it is applicable to various types of model trees and different nucleotide substitution models. Actually, although their purpose was different from ours (it was to examine the efficiencies of different tree search algorithms for parsimony, minimum evolution, and ML trees), Takahashi and Nei (2000) had studied our problem by considering various molecular clock and nonmolecular clock cases. Their assumption for making nonmolecular clock trees was similar to that of Kuhner and Felsenstein (1994): a branch length of a tree obtained under the assumption of a molecular clock is changed to vary according to a gamma variable with α = 1 (see Takahashi and Nei 2000 for details). At any rate, their results showed that the NJp method has the highest Pc and the lowest d_T values for both constant and nonconstant rate cases. Furthermore, a similar conclusion had also been obtained by a small-scale study of Saitou and Nei (1987) (see their tables 3 and 6). Therefore, these studies make our conclusion stronger.

Table 2. Probability of Obtaining the Correct Topology (Pc), Topological Error Index (d_T), and c Value for Model Tree B in Figure 1.

Methods            All Six Regions   Region 1          Region 2          Region 3          With 3,600 nt
                   Pc(d_T)    c      Pc(d_T)    c      Pc(d_T)    c      Pc(d_T)    c      Pc(d_T)    c
NJ P               97(0.06)   97     83(0.34)   82     76(0.62)   74     82(0.42)          100(0.00)  100
NJ JC              94(0.12)   94     74(0.54)   73     66(0.84)   64     71(0.72)          100(0.00)  100
NJ K2P             85(0.30)   84     59(1.04)   57     64(0.88)   62     60(0.96)          100(0.00)  100
NJ TN              91(0.20)   90     76(0.52)   75     53(1.36)   51     58(1.02)          100(0.00)  100
NJ K2P+G           74(0.56)   73     45(1.52)   43     50(1.32)   48     47(1.36)          100(0.00)  100
NJ TN+G            76(0.54)   75     43(1.70)   41     20(3.36)   18     7(6.60)    5      99(0.02)   99
ML JC              92(0.16)   92     65(0.84)   63     66(0.78)   64     72(0.68)          100(0.00)  100
ML K2P             89(0.22)   88     70(0.68)   68     58(0.75)   57     73(0.68)          100(0.00)  100
ML TN              88(0.24)   87     69(0.72)   67     75(0.62)   73     76(0.64)          100(0.00)  100
ML TN+G            88(0.24)   87     70(0.70)   68     77(0.54)   76     76(0.64)          100(0.00)  100
ML GTR+G           88(0.24)   87     71(0.68)   69     77(0.54)   76     75(0.66)          100(0.00)  100
ML TN+G+I          88(0.24)   87     67(0.78)   65     74(0.66)   72     75(0.66)          100(0.00)  100
ML GTR+G+I         87(0.26)   86     70(0.70)   68     75(0.62)   73     74(0.68)          100(0.00)  100
Bayesian HKY+G+I   15(4.15)   13     60(0.55)   59     40(0.70)   39     65(0.90)          100(0.00)  100
  (20 replications in each column)

NOTE. The Tamura-Nei (TN) model with α = 1 was used for generating sequence data. Results from 100 replications are presented except for the Bayesian method. In the case of 3,600 nt, the entire sequences were generated by assuming the substitution pattern of region 1.

We now focus on the efficiencies of tree-making methods considering different substitution models. We first considered the efficiency of the model of Hasegawa et al. (1985) with a gamma variable (HKY + G model). In this case, a particular model of nucleotide substitution is used (see Nei and Kumar 2000, chapters 6-8). Table 1 shows the results when the initial DNA sequences were generated by the HKY + G model but the trees were constructed by using various substitution models. In the case of the NJ procedure the probability of obtaining the correct topology (Pc) is low when genes with 600 nt are used, but it rises to about 96% when 6,000 nt are used. In contrast, d_T is 2.48 to 7.06 for NJ and about 2.6 for ML when 600 nt are used. MrBayes gives somewhat different values of Pc and d_T even if 100 replications (15 days of computer computation for this method in table 1 only) were conducted.
However, these differences appear to reflect the expected variation in the presence of sampling errors. Actually, because of the large computational time required, we computed the Pc and c values for the initial 20 replications. They were Pc = 20% and c = 20% for 600 nt, Pc = 35% and c = 34% for 1,200 nt, and Pc = 90% and c = 90% for 6,000 nt. If we consider this amount of chance variation, our Pc and c values in table 1 are not necessarily underestimates. At any rate, they clearly indicate that the Bayesian method gives Pc and d_T values similar to those of the ML method and that a longer computational time does not necessarily help to increase the accuracy of Bayesian trees. For this reason we computed only 20 replications for this method in tables 2-4.

One might suspect that the best data-fit method would be the ML (HKY + G) method because the initial sequences were generated by this method, but it is not, because neither is Pc the highest nor is d_T the lowest. This again shows that the NJp method is better than the ML and Bayesian methods in producing the correct tree.

Table 3. Probability of Obtaining the Correct Topology (Pc), Topological Error Index (d_T), and c Value for Model Tree C in Figure 1.

Methods            All Six Regions   Region 2          Region 3          Region 4
                   Pc(d_T)    c      Pc(d_T)    c      Pc(d_T)    c      Pc(d_T)    c
NJ P               99(0.02)   99     84(0.42)   83     66(0.68)   64     96(0.08)   95
NJ JC              94(0.12)   94     75(0.60)   73     67(0.74)   64     93(0.16)   92
NJ TN              90(0.22)   89     56(1.10)   53     59(0.98)   56     85(0.34)   83
NJ TN+G            53(1.60)   50     20(2.90)   18     4(6.08)    3      51(1.24)   46
ML TN+G            79(0.46)   78     83(0.38)   82     77(0.48)   75     89(0.24)   87
ML TN+G+I          76(0.50)   75     83(0.40)   82     76(0.52)   74     88(0.26)   86
ML GTR+G           80(0.42)   79     83(0.40)   82     76(0.52)   74     88(0.26)   86
ML GTR+G+I         76(0.52)   75     82(0.42)   81     76(0.52)   74     89(0.24)   87
Bayesian HKY+G+I   75(0.40)   74     75(0.40)   74     85(0.25)   84     85(0.15)   84
  (20 replications in each column)

NOTE. The TN + G model with α = 1 was used for generating sequence data. Results from 100 replications are presented except for the Bayesian method. In the cases of regions 2, 3, and 4, the trees were constructed for 15, 12, and 9 DNA sequences out of all 18 sequences, respectively, because the other regions are missing in the remaining sequences.

Table 4. Probability of Obtaining the Correct Tree (Pc), Topological Error Index (d_T), and c Value for Model Tree D in Figure 1.

Methods            All Six Regions   Region 1          Region 2          Region 3          With 3,600 nt
                   Pc(d_T)    c      Pc(d_T)    c      Pc(d_T)    c      Pc(d_T)    c      Pc(d_T)    c
NJ P               44(1.52)   39     38(1.62)   34     31(2.00)   27     24(2.06)   20     95(0.01)   95
NJ JC              43(1.68)   37     34(1.94)   29     31(2.26)   26     23(2.26)   19     94(0.12)   94
NJ K2P             34(2.00)   29     29(2.22)   24     32(2.18)   27     21(2.60)   17     90(0.22)   90
NJ TN              36(1.96)   31     29(2.22)   24     19(3.16)   15     18(2.58)   15     92(0.16)   92
NJ JC+G            37(1.98)   32     27(2.54)   22     27(2.60)   22     17(2.72)   14     89(0.24)   89
NJ K2P+G           32(2.18)   27     23(2.62)   19     29(2.44)   24     15(2.88)   12     86(0.30)   86
NJ TN+G            30(2.54)   25     18(3.42)   14     8(4.98)    5      4(6.18)    2      88(0.26)   88
ML TN+G            43(1.34)   39     44(1.54)   39     42(1.54)   37     43(1.62)   38     83(0.36)   83
ML GTR+G           45(1.30)   41     43(1.52)   38     42(1.60)   37     42(1.70)   37     86(0.28)   86
ML TN+G+I          41(1.48)   37     40(1.60)   35     40(1.64)   35     41(1.80)   36     85(0.32)   85
ML GTR+G+I         41(1.5)    37     40(1.58)   35     36(1.78)   31     38(1.84)   33     87(0.26)   87
Bayesian HKY+G+I   70(0.50)   68     50(0.90)   47     25(1.3)    23     30(1.55)             (0.00)
  (20 replications in each column)

NOTE. Results from 100 replications are presented except for the Bayesian method. In the case of 3,600 nt, the whole sequences were generated by assuming the substitution pattern of region 1. The TN + G model with α = 1 was used for generating the DNA sequence data.

Figure 1B is for studying the effects of compositional genes (or mosaic genes), and table 2 shows the results of the study. In this case, the NJp procedure shows the highest Pc and the lowest d_T value. This finding remains the same whether different DNA regions are used or not. Actually, both Pc and c values are higher when all DNA regions are considered than when one homologous DNA region is considered. Thus, our conclusion about compositional genes is the same as that of noncompositional genes.
This finding indicates that compositional genes could be more useful than noncompositional genes for identifying the best tree-making method. In table 2, the Pc and d_T values for different DNA regions are also presented. Obviously, they are nearly the same on the average, but they also show considerable variation among the DNA regions because of random mutation and genetic drift.

Figure 1C is for studying the effects of the number of DNA regions in a gene, and the results of the study are presented in table 3. In this table the entire set of six DNA regions is first considered, and then regions 2, 3, and 4 are considered separately. It is clear that Pc or c is higher when all regions are considered than when each region is considered separately. This indicates that the whole gene is better than each homologous DNA region in phylogenetic analysis, even if the content of the total gene varies considerably with the DNA region. Of course, this result might be due to the fact that the total length of all DNA regions is longer than that of an individual DNA region. We therefore conducted another simulation, considering one long DNA region with 3,600 nt, which is equal to the sum of the nucleotides for six individual DNA regions. The result of this study is presented in the 3,600 nt column of table 2. It is clear that the accuracy of tree construction is now very high in all methods examined and that the high accuracy of trees for the combination of six regions that we have seen earlier was mainly due to the length of the gene. However, our earlier observation is still interesting, because it indicates that we can use the entire compositional gene without worrying whether it is compositional or not.

Nevertheless, our result depends on the tree structure given on the left side of the figure. We therefore conducted another simulation for the tree structure given on the left side of figure 1D. The results of the simulation are presented in table 4. Here, again the Pc, d_T, and c values for all six DNA regions and those for three individual DNA regions are separately presented. In the case of the NJp method the Pc and c values for all DNA regions are higher than those for individual regions. The results again show that the combination of six DNA regions is better than individual DNA regions in recovering the true tree. In particular, the NJp method for all six regions shows the highest Pc and c values in all NJ procedural cases. In the case of the ML method the Pc and c values are generally quite high and often exceed those of NJ procedures. Particularly, if we consider the individual DNA regions, the Pc and c values are higher in ML than in NJ. This superiority of ML trees over NJ trees may have occurred because the extent of sequence divergence is very high in this case. Note that the expected distance (2k) between the bottom sequence and a higher sequence is 2.0 per site. It is therefore advisable to compute both NJ and ML trees when the sequence divergence is very high and to choose the tree that shows the higher accuracy by a measure such as the bootstrap test. However, if we are interested in the combination of six regions as we are here, we can consider only the NJ tree, because Pc and c are both high. At any rate, we recommend that the NJp method be generally used for computing a tree.

Discussion

In this article, we have seen that the NJp method is generally better than the ML and Bayesian methods in obtaining the correct topology (Pc) or a high value of c. This has occurred partly because we used the criterion Pc or c. However, unlike the ML value or the AIC (Akaike information criterion), which is applicable only for the ML method, our criterion can be used for any method, whether the tree is constructed by the NJ or ML procedure. Therefore, our conclusion should be acceptable as a general procedure.
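Because Pc and d_T depend only on tree topologies, they can be computed for trees produced by any method. The following Python sketch shows one way to do this under stated assumptions: trees are written as nested tuples of taxon names (a hypothetical representation chosen for brevity, not the format used by PHYLIP, PHYML, or MrBayes), each tree is reduced to its set of nontrivial bipartitions, and d_T for one replicate is taken as the number of bipartitions found in one tree but not the other, following the usual definition of the Robinson-Foulds distance.

```python
def splits(tree, taxa):
    """Return the nontrivial bipartitions of a tree given as nested tuples.

    Each bipartition is stored as the side that does not contain a fixed
    reference taxon, so the same split always has the same representation.
    """
    all_taxa = frozenset(taxa)
    ref = min(all_taxa)
    found = set()

    def collect(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        leaves = frozenset().union(*(collect(child) for child in node))
        side = all_taxa - leaves if ref in leaves else leaves
        # Keep only internal branches: both sides need at least two taxa.
        if 1 < len(side) < len(all_taxa) - 1:
            found.add(side)
        return leaves

    collect(tree)
    return found


def rf_distance(tree_a, tree_b, taxa):
    """Robinson-Foulds distance: bipartitions present in exactly one tree."""
    return len(splits(tree_a, taxa) ^ splits(tree_b, taxa))


def summarize(true_tree, estimated_trees, taxa):
    """Return (Pc in percent, average d_T) over a list of replicate trees."""
    distances = [rf_distance(true_tree, t, taxa) for t in estimated_trees]
    n_correct = sum(1 for d in distances if d == 0)
    pc = 100.0 * n_correct / len(distances)
    mean_dt = sum(distances) / len(distances)
    return pc, mean_dt


if __name__ == "__main__":
    # Toy five-taxon example (illustrative only, not simulation data).
    taxa = ["A", "B", "C", "D", "E"]
    true_tree = (("A", "B"), ("C", ("D", "E")))
    replicates = [
        ((("D", "E"), "C"), ("B", "A")),    # same topology, written differently
        (("A", "C"), ("B", ("D", "E"))),    # one internal branch misplaced
    ]
    print(summarize(true_tree, replicates, taxa))
```

With five taxa there are m - 3 = 2 internal branches, so the largest possible distance is 2(m - 3) = 4; the second replicate above differs from the true tree in one branch and therefore contributes a distance of 2.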
Of course, if there is any method that shows a higher value of Pc or c, we should seriously consider the adoption of the method in tree construction. The initial purpose of this study was to find a quick method for constructing a reliable tree for a large data set, because it would be increasingly important in the future to analyze a large data set. Since the NJ procedure is one of the simplest tree-making methods, we believe we have achieved our initial purpose at least partially.

In Saitou and Nei's (1987) article it is stated that they used the principle of minimum evolution in their construction of trees. In practice, however, they used the principle of additive trees and the OLS estimates of evolutionary distances, as indicated by Studier and Keppler (1988) and Gascuel and Steel (2006). Since the divergence of real genes often occurs as a birth-death process (Nei and Rooney 2005) or a niche-filling process (Nei 2013, pp. 9 and 186) in evolution, and these processes are similar to additive evolution of genes in phylogenetic construction, this might be the reason why the NJp method generally performed better than the ML procedure. Although there are many mathematical papers (e.g., Levy et al. 2006) on this point, we are not going to argue about their merits and demerits in this article.

However, we should discuss the discrepancy between our conclusion and that of Guindon and Gascuel (2003). First, using a function of d_T, namely, $b = d_T/(2(m - 3))$, as a criterion of goodness of fit of a model to data, Guindon and Gascuel (2003) contended that the ML method is better than the NJ method. Here b takes a value between 0 and 1, and a smaller value is supposed to be better. However, they never computed NJp, which shows the highest Pc or c value. Instead, apparently considering the K2P model, they used the b value for comparing NJ and ML trees. In this case, because the b value was generally smaller for ML trees than for NJ trees, they concluded that ML is better than NJ.
However, the opposite conclusion is obtained if we use the NJp method (see tables 1-4 of this article, tables 2 and 3 of Takahashi and Nei 2000, and tables 3 and 6 of Saitou and Nei 1987). The p distance is not normally computed in ML methods, but it is easily computable for NJ methods. It is therefore possible to compare the Pc or c values between the NJp and ML methods, but they did not. For this reason, Guindon and Gascuel obtained the opposite conclusion. Also see the results of extra simulations about ML JC, ML K2P, and ML TN in tables 1 and 2, which were conducted at the time of studying ML trees.

Second, their average difference in b between the NJ and ML procedures was very small, and they had to simulate 5,000 trees, each composed of 40 sequences, to reveal the difference in b. Yet the average difference in b between NJ and ML trees was only 0.1 when the p distance between two sequences was about 0.5. This suggests that b has a large standard error. In fact, Bryant and Steel (2009) showed that the distribution of d_T or b is very wide and close to a Poisson distribution. For the unreliability of b or d_T for comparing different tree-making methods, see also the article by Russo et al. (1996). In our case the Pc for the NJp method was almost always higher than that of the ML procedure. Here, the relationship between b and c is given by $c = Pc\,(1 - b)$ or by $b = 1 - c/Pc$. Because the values of Pc and c are always close to each other, b must be very small.

Of course, whether c is better than Pc or not is debatable. Mathematically Pc is not dependent on d_T, but in practice it includes the sampling error of the tree obtained, and therefore Pc could be a sufficient quantity for measuring the reliability of the tree obtained. It should also be noted that if we consider the unreliability of the b value we may not need the c value. However, since Pc and c are highly correlated with each other, we used both of them here.
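The relationship among Pc, d_T, b, and c can be checked numerically. The short Python sketch below uses the definitions given in the text, b = d_T/(2(m - 3)) and c = Pc(1 - b); the function names are hypothetical, and the inputs are the NJp entries of table 1, for which m = 30.

```python
def max_d_t(m):
    """Maximum possible topological error index for m taxa, 2(m - 3)."""
    return 2 * (m - 3)


def b_value(d_t, m):
    """Normalized topological error, b = d_T / (2(m - 3))."""
    return d_t / max_d_t(m)


def c_value(pc, d_t, m):
    """Combined quantity, c = Pc (1 - b) = Pc (m - 3 - d_T/2) / (m - 3)."""
    return pc * (1.0 - b_value(d_t, m))


if __name__ == "__main__":
    m = 30  # number of sequences in model tree A
    # (Pc, d_T) pairs for NJp in table 1 at 600, 1,200, and 6,000 nt.
    for pc, d_t in [(21, 2.48), (47, 1.36), (96, 0.08)]:
        c = c_value(pc, d_t, m)
        b = b_value(d_t, m)
        print(f"Pc={pc} d_T={d_t} b={b:.4f} c={c:.2f}")
```

For these three entries the rounded c values are 20, 46, and 96, in agreement with table 1, and b stays below 0.05, which illustrates why Pc and c remain close to each other.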
It should also be noted here that we are not the first to use the Pc value for measuring the accuracy of a tree. This criterion was used by Tateno et al. (1982), Sourdis and Krimbas (1987), Saitou and Nei (1987), and Takahashi and Nei (2000). We are merely suggesting that a combination of the NJp method and Pc is important in finding the efficient tree-making method and that in this case the NJp method is generally better than the ML and Bayesian methods. Here, we have not considered branch length errors seriously, but we believe these errors are of secondary importance. Saitou and Nei (1987) were ambiguous about their recommendation of p or the corrected distance d, so that many authors have continued to use corrected distances without realizing that the choice has a significant effect on tree making.

Tamura et al. (2004) argued that the accuracy of obtaining the correct trees is higher when the transition/transversion rate ratio is estimated by their simultaneous estimation (SE) method than when it is estimated by the independent estimation method. Their contention is generally true, but in our case the NJp method was better than the others even if we did not consider the SE method. Therefore, it does not create any problem for our conclusion. Of course, one may incorporate the SE method into our procedure to strengthen our conclusion.

If our conclusion that the NJp method is generally better than the ML or Bayesian method is accepted, we can simplify our phylogenetic construction considerably, because the NJp method is technically much simpler than the others. Of course, our conclusion may not hold in some cases. In these cases one can compute Pc or c and then choose the best tree-making method available.

One of the surprising results in our computer simulations is that the accuracy of the Bayesian method does not necessarily increase with the increase of computational time. In our case a simulation of about 20 replications seemed to be sufficient for evaluating the Pc or c value. This suggests that the Bayesian tree has a relatively small variance of the tree obtained. Of course, in the case of the Bayesian method one replication requires a considerable amount of computational time.

Acknowledgments

This work was conceived mainly by M.N., but the computer simulation was done by R.Y. M.N. wrote the article using information from the computer simulation. The results of this work were first presented by M.N. in a symposium honoring Naoyuki Takahata's work on molecular evolution in Hayama, Japan, on March 11, 2015 (Nei 2015). However, the publication of the results was delayed because of a health problem of M.N.
The authors would like to thank Naoko Takezaki, Sachiyo Kajitani, and Yukako Katsura for their help in the preparation of this article.

References
Bryant D, Steel M. Computing the distribution of a tree metric. IEEE/ACM Trans Comput Biol Bioinform. 6:
Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 17:
Felsenstein J. Inferring phylogenies. Sunderland (MA): Sinauer Associates, Inc.
Felsenstein J. PHYLIP (phylogeny inference package). Version 3.6. Distributed by the author. Department of Genome Sciences. Seattle: University of Washington.
Fitch WM. Towards defining the course of evolution: minimum change for a specific tree topology. Syst Zool. 20:
Gascuel O, Steel M. Neighbor-joining revealed. Mol Biol Evol. 23:
Gojobori T, Bernardi G. Preface. Gene. 259:
Guindon S, Gascuel O. Efficient biased estimation of evolutionary distances when substitution rates vary across sites. Mol Biol Evol. 19:
Guindon S, Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 52:
Hartigan JA. Minimum evolution fits to a given tree. Biometrics 29:
Hasegawa M, Kishino H, Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 22:
Kuhner MK, Felsenstein J. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol Biol Evol. 11:
Nei M. Mutation-driven evolution. Oxford: Oxford University Press.
Nei M. Impacts of molecular biology on the study of evolution: personal reflections. Pennsylvania: Pennsylvania State University.
Nei M, Kumar S. Molecular evolution and phylogenetics. Oxford: Oxford University Press.
Nei M, Rogozin IB, Piontkivska H. Purifying selection and birth-and-death evolution in the ubiquitin gene family. Proc Natl Acad Sci U S A. 97:
Nei M, Rooney AP. Concerted and birth-and-death evolution of multigene families. Annu Rev Genet. 39:
Nei M, Takezaki N, Sitnikova T. Assessing molecular phylogenies. Science 267:255.
Ota T, Nei M. Divergent evolution and evolution by the birth-and-death process in the immunoglobulin VH gene family. Mol Biol Evol. 11:
Penny D, Hendy MD. The use of tree comparison metrics. Syst Zool. 34:
Posada D, Crandall KA. Selecting the best-fit model of nucleotide substitution. Syst Biol. 50:
Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci. 53:
Roch S, Steel M. Likelihood-based tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent. Theor Popul Biol. 100:
Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:
Rosenberg MS, Kumar S. Incomplete taxon sampling is not a problem for phylogenetic inference. Proc Natl Acad Sci U S A. 98:
Russo CAM, Takezaki N, Nei M. Efficiencies of different genes and different tree-building methods in recovering a known vertebrate phylogeny. Mol Biol Evol. 13:
Rzhetsky A, Nei M. Theoretical foundation of the minimum-evolution method of phylogenetic inference. Mol Biol Evol. 10:
Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 4:
Schöniger M, von Haeseler A. A simple method to improve the reliability of tree reconstructions. Mol Biol Evol. 10:
Sourdis J, Krimbas C. Accuracy of phylogenetic trees estimated from DNA sequence data. Mol Biol Evol. 4:
Studier J, Keppler K. A note on the neighbor-joining algorithm of Saitou and Nei. Mol Biol Evol. 5:
Takahashi K, Nei M. Efficiencies of fast algorithms of phylogenetic inference under the criteria of maximum parsimony, minimum evolution, and maximum likelihood when a large number of sequences are used. Mol Biol Evol. 17:
Tamura K, Nei M, Kumar S. Prospects for inferring very large phylogenies by using the neighbor-joining method. Proc Natl Acad Sci U S A. 101:
Tateno Y, Nei M, Tajima F. Accuracy of estimated phylogenetic trees from molecular data. I. Distantly related species. J Mol Evol. 18:
Tateno Y, Takezaki N, Nei M. Relative efficiencies of the maximum-likelihood, neighbor-joining, and maximum-parsimony methods when substitution rate varies with site. Mol Biol Evol. 11:
Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 24: