STATISTICAL APPLICATIONS IN PLANT BREEDING AND GENETICS CARL ALAN WALKER


STATISTICAL APPLICATIONS IN PLANT BREEDING AND GENETICS

By CARL ALAN WALKER

A dissertation submitted in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY IN CROP SCIENCE

WASHINGTON STATE UNIVERSITY
Department of Crop and Soil Sciences

MAY 2012

To the Faculty of Washington State University:

The members of the Committee appointed to examine the dissertation of CARL ALAN WALKER find it satisfactory and recommend that it be accepted.

Kimberly Garland-Campbell, Ph.D., Chair
Fabiano Pita, Ph.D.
J. Richard Alldredge, Ph.D.
Richard Gomulkiewicz, Ph.D.
Daniel Skinner, Ph.D.

ACKNOWLEDGEMENT

I would like to thank my committee members for their advice and assistance with this research and with writing this dissertation. I would like to thank all the members of both the Campbell and Steber labs for their advice when I presented my work in lab meetings. I began the project presented in Chapter 3 as part of a paid internship with Dow AgroSciences. I would like to thank the members of the Dow AgroSciences Quantitative Genetics group for their assistance during that internship, especially Kelly Robins, who provided some initial programs and data. I would also like to acknowledge Bruce Walsh, Rebecca Doerge, and Radu Totir for the valuable advice they gave me at conferences where I presented my work. I would not have been able to conduct this research without the funding for these projects by the Washington Grain Commission and USDA project. Finally, I'd like to thank my parents for all their help getting me this far and my wife Elizabeth for her help editing and moral support.

STATISTICAL APPLICATIONS IN PLANT BREEDING AND GENETICS

ABSTRACT

by Carl Alan Walker, Ph.D.
Washington State University
May 2012

Chair: Kimberly Garland-Campbell

Statistical analysis has many applications in ensuring the validity and reproducibility of plant breeding and genetics research. Crop plant germplasm collections are often too large to be of use regularly. A core subset with fewer accessions can increase utility while maintaining most of the genetic diversity of the complete collection. This study evaluated methods for selecting core subsets using sparse data. Cores were selected by forming clusters of accessions based on distances estimated with phenotypic data. Accessions were randomly selected relative to the number of accessions in each cluster. The method using all the available data to calculate distances, average linkage clustering, and sampling in proportion to the natural logarithm of cluster size produced the most diverse cores.

Evaluations of genotypes in varied environmental conditions are referred to as multiple environment trials (MET) and often necessitate estimation of effects of genotypes within environments. Empirical best linear unbiased predictions can provide more accurate estimates of these effects, depending upon the mixed model used. An objective of this work was to simulate and analyze MET data sets to determine which models provide the most accurate estimates in varied MET conditions. Simulated MET were fit with mixed models with or without genetic relationship matrices (GRM) and with structures of varying complexity used to model relationships among environments. The model that included a GRM and a constant variance-constant correlation structure was the most accurate for the largest number of scenarios. More complex models were the most effective for a smaller subset of scenarios, most involving many genotypes and low experimental error.

Statistical analyses were applied in consultation with other researchers for two projects studying Fusarium crown rot of wheat and one on cold tolerance of wheat. Heritability and genetic correlations were calculated for Fusarium resistance assays in field, growth chamber, and terrace bed settings. Factor analysis was used to estimate latent factors from field characteristic variables, which were used as predictor variables in linear mixed models and generalized linear mixed models. Cold tolerance among genotypes was assessed with logistic regression.

TABLE OF CONTENTS

ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LITERATURE REVIEW
    CORE SUBSETS OF GERMPLASM COLLECTIONS
    MIXED MODELS FOR MULTIPLE ENVIRONMENT TRIALS
    HERITABILITY AND GENETIC CORRELATION
    DIMENSION REDUCTION FOR LINEAR MODELING
    EXTREME COLD TOLERANCE IN WHEAT
    REFERENCES
METHODS FOR SELECTING GERMPLASM CORE SUBSETS USING SPARSE PHENOTYPIC DATA
    ABSTRACT
    INTRODUCTION
    MATERIALS AND METHODS
    RESULTS
    DISCUSSION
        Conclusion
    APPENDIX
    REFERENCES
COMPARISON OF LINEAR MIXED MODELS FOR MULTIPLE ENVIRONMENT PLANT BREEDING TRIALS
    ABSTRACT
    INTRODUCTION
    METHODS
        Simulations
        Analyses
    RESULTS AND DISCUSSION
        Justification of Approach
        Choice of a Default Model
        Models for Specific Scenarios
    DISCUSSION
        Conclusions
    APPENDIX: REAL DATA AS A BASIS FOR SIMULATIONS
    REFERENCES
CONSULTING PROJECTS
    HERITABILITY AND GENETIC CORRELATION ANALYSES FOR FUSARIUM CROWN ROT RESISTANCE ASSAYS OF WHEAT MAPPING POPULATION
        Abstract
        Discussion of Statistical Methods
    LINEAR MODELING OF THE RELATIONSHIPS BETWEEN WHEAT FIELD CHARACTERISTICS AND FUSARIUM CROWN ROT OBSERVATIONS
        Abstract
        Discussion of Statistical Methods
    LOGISTIC REGRESSION ANALYSIS OF WHEAT COLD TOLERANCE TESTING
        Summary
        Discussion of Methods
    REFERENCES

LIST OF TABLES

Table 1. Measurement levels and missing value percentages of variables evaluated on the Triticum aestivum L. subsp. aestivum complete collection

Table 2. Removal percentages by variable for simulating data sets with missing values by removing values from the "complete collection"

Table 3. Comparisons of core subset selection methods in terms of diversity of 1000 potential core subsets selected from 200 complete collections simulated with values removed at the rates given by set 1 (see Table 2) from accessions selected randomly from a uniform distribution

Table 4. Comparisons of core subset selection methods in terms of diversity of 1000 potential core subsets selected from 200 complete collections simulated with values removed at the rates given by set 1 (see Table 2) from accessions selected as a contiguous group

Table 5. Comparisons of core subset selection methods in terms of diversity of 1000 potential core subsets selected from 200 complete collections simulated with values removed at the rates given by set 2 (see Table 2) from accessions selected randomly from a uniform distribution

Table 6. Comparisons of core subset selection methods in terms of diversity of 1000 potential core subsets selected from 200 complete collections simulated with values removed at the rates given by set 2 (see Table 2) from accessions selected as a contiguous group

LIST OF FIGURES

Figure 1. Plot of cumulative means, over simulations, of median recovery of interquartile range, over 1000 potential core subsets per simulation. Simulations were generated by removing values from randomly chosen individual accessions with missingness rates given by set 1. The values of the means of all 200 simulations are shown in Table

Figure 2. Plot of cumulative means, over simulations, of median recovery of interquartile range, over 1000 potential core subsets, ranked across methods within each simulation. Simulations were generated by removing values from randomly chosen individual accessions with missingness rates given by set 1. The mean ranks, over all 200 simulations, are shown in Table

Figure 1. Means, over simulations, of model ranks, where models were ranked in terms of RMSEP within each simulation. All scenarios evaluated are included, and index denotes each scenario's position in the order. Scenarios are ordered CS A, CS A H, CS A VH, CS B, CS B H, CS B VH, Toep, ToepH, and then ToepVH, with the indices of the final scenarios of each group equal to 76, 154, 230, 304, 380, 456, 532, 608, and 682, respectively. Within each of these patterns, numbers of environments are ordered 5, 10, 20, and then 40 environments. Within each number of environments, the numbers of genotypes are ordered 25, 50, 100, and then 150 genotypes. Within each number of genotypes, the experimental designs are ordered RCBD, MAD, and then unreplicated designs. Within each design, error variances are ordered 0.5 then 2.0.

Figure 2. A standardized version of Figure 1, where models have been ranked within each scenario in terms of their mean ranks. The order of scenarios is the same as Figure

Figure 3. The same as Figure 2, but only the models GRM_CorV and GRM_CorH. The order of scenarios is the same.

Figure 4. Equivalent to Figure 3, with only scenarios with high (2.0) error variance included. Scenarios are ordered CS A, CS A H, CS A VH, CS B, CS B H, CS B VH, Toep, ToepH, and then ToepVH, with the indices of the final scenarios of each group equal to 39, 78, 116, 154, 192, 230, 268, 306, and 343, respectively. Within each of these patterns, numbers of environments are ordered 5, 10, 20, and then 40 environments. Within each number of environments, the numbers of genotypes are ordered 25, 50, 100, and then 150 genotypes. Within each number of genotypes, the experimental designs are ordered RCBD, MAD, and then unreplicated designs.

Figure 5. Equivalent to Figure 3, with only scenarios with low (0.5) error variance included. Scenarios are ordered CS A, CS A H, CS A VH, CS B, CS B H, CS B VH, Toep, ToepH, and then ToepVH, with the indices of the final scenarios of each group equal to 37, 76, 114, 150, 188, 226, 264, 302, and 339, respectively. Within each of these patterns, numbers of environments are ordered 5, 10, 20, and then 40 environments. Within each number of environments, the numbers of genotypes are ordered 25, 50, 100, and then 150 genotypes. Within each number of genotypes, the experimental designs are ordered RCBD, MAD, and then unreplicated designs.

Figure 6. Equivalent to Figure 3, only including scenarios simulated with a compound symmetric pattern of relationships among environments. Scenarios are ordered CS A, then CS B, with the indices of the final scenarios of each group equal to 76 and 150, respectively. Within each of these patterns, numbers of environments are ordered 5, 10, 20, and then 40 environments. Within each number of environments, the numbers of genotypes are ordered 25, 50, 100, and then 150 genotypes. Within each number of genotypes, the experimental designs are ordered RCBD, MAD, and then unreplicated designs. Within each design, error variances are ordered 0.5 then 2.0.

Figure 7. Equivalent to Figure 3, only including scenarios simulated with a compound symmetric pattern of correlations among environments and heterogeneous variances of genotype effects within environments. Scenarios are ordered CS A H, then CS B H, with the indices of the final scenarios of each group equal to 78 and 154, respectively. Within each of these patterns, numbers of environments are ordered 5, 10, 20, and then 40 environments. Within each number of environments, the numbers of genotypes are ordered 25, 50, 100, and then 150 genotypes. Within each number of genotypes, the experimental designs are ordered RCBD, MAD, and then unreplicated designs. Within each design, error variances are ordered 0.5 then 2.0.

Figure 8. Equivalent to Figure 3, only including scenarios simulated with a compound symmetric pattern of correlations among environments and extremely heterogeneous variances of genotype effects within environments. Scenarios are ordered CS A VH, then CS B VH, with the indices of the final scenarios of each group equal to 76 and 152, respectively. Within each of these patterns, numbers of environments are ordered 5, 10, 20, and then 40 environments. Within each number of environments, the numbers of genotypes are ordered 25, 50, 100, and then 150 genotypes. Within each number of genotypes, the experimental designs are ordered RCBD, MAD, and then unreplicated designs. Within each design, error variances are ordered 0.5 then 2.0.

Figure 9. Equivalent to Figure 3, only including scenarios simulated with a Toeplitz pattern of correlations among environments. Scenarios are ordered Toep, ToepH, and then ToepVH, with the indices of the final scenarios of each group equal to 76, 152, and 226, respectively. Within each of these patterns, numbers of environments are ordered 5, 10, 20, and then 40 environments. Within each number of environments, the numbers of genotypes are ordered 25, 50, 100, and then 150 genotypes. Within each number of genotypes, the experimental designs are ordered RCBD, MAD, and then unreplicated designs. Within each design, error variances are ordered 0.5 then 2.0.

Figure 10. Equivalent to Figure 3, only including scenarios simulated with a Toeplitz pattern of correlations among environments, 100 or 150 genotypes, 5 to 20 environments, and low (0.5) error variance. Scenarios are ordered Toep, ToepH, and then ToepVH, with the indices of the final scenarios of each group equal to 14, 29, and 43, respectively. Within each of these patterns, numbers of environments are ordered 5, 10, and then 20 environments. Within each number of environments, the numbers of genotypes are ordered 100 and then 150 genotypes. Within each number of genotypes, the experimental designs are ordered RCBD, MAD, and then unreplicated designs.

Figure 11. Equivalent to Figure 3, only including scenarios simulated with 25 genotypes. Scenarios are ordered CS A, CS A H, CS A VH, CS B, CS B H, CS B VH, Toep, ToepH, and then ToepVH, with the indices of the final scenarios of each group equal to 24, 48, 72, 96, 120, 144, 168, 192, and 216, respectively. Within each of these patterns, numbers of environments are ordered 5, 10, 20, and then 40 environments. Within each number of environments, the experimental designs are ordered RCBD, MAD, and then unreplicated designs. Within each design, error variances are ordered 0.5 then 2.0.

Figure 12. Equivalent to Figure 3, only including scenarios simulated with MAD or unreplicated designs. Scenarios are ordered CS A, CS A H, CS A VH, CS B, CS B H, CS B VH, Toep, ToepH, and then ToepVH, with the indices of the final scenarios of each group equal to 50, 102, 154, 203, 255, 307, 357, 409, and 461, respectively. Within each of these patterns, numbers of environments are ordered 5, 10, 20, and then 40 environments. Within each number of environments, the numbers of genotypes are ordered 25, 50, 100, and then 150 genotypes. Within each number of genotypes, the experimental designs are ordered MAD, and then unreplicated designs. Within each design, error variances are ordered 0.5 then 2.0.

Figure 13. A standardized version of Figure 1, where only models not including GRM have been ranked within each scenario in terms of their mean ranks. The order of scenarios is the same.

CHAPTER 1

LITERATURE REVIEW

Crop science research relies on statistical methods to assist in making objective decisions based on complex and subtle patterns in nature that are not always obvious from raw observations or observed results from experiments. Among other tasks in crop science research, statistical methods may be used to group genotypes based on phenotypic data, make accurate predictions about the future performance of breeding material, estimate relationships from observational data, and test hypotheses from designed experiments.

Core Subsets of Germplasm Collections

Crop plant germplasm collections are maintained to conserve genetic variation and to provide useful plant material for researchers and plant breeders. An example is the collection of wheat (Triticum aestivum L. subsp. aestivum) accessions maintained as part of the National Small Grains Collection of the USDA-ARS National Plant Germplasm System. For many researchers and plant breeders, germplasm collections are often too large and too lacking in descriptive data to be of practical use. A well-characterized core collection, or core subset (these terms will be used interchangeably), that consists of a reduced number of accessions (usually about 10% of the total) can provide increased accessibility and utility while still maintaining most of the genetic diversity of the complete collection (Brown, 1989).

Users of core subsets generally seek a diverse sample that varies for one or more characteristics. For example, Wang et al. (2010) evaluated a rice core subset for resistance to the blast fungal disease and identified known and novel genetic sources for resistance. These researchers utilized the core subset to access the genetic diversity of the complete collection without needing to evaluate the large numbers of very similar accessions held in the complete collection.
For desired alleles that are evenly distributed throughout the whole collection, at any level of abundance, a simple random sample of the complete collection is the most appropriate. This is because every accession selected for the core subset would have an equal chance of having the specific allele. If desired alleles are instead localized to certain parts of the collection, preferentially selecting a portion from each heterogeneous group present in the complete collection increases the likelihood of selecting these unevenly distributed alleles (Brown, 1989). For this reason, most researchers have constructed core collections by grouping accessions and then selecting accessions within groups.

A number of different methods and types of data have been used to group accessions and select core collections. Passport data, i.e. the location of cultivation or collection, have been used to stratify the complete collection, followed by selection from within each stratum. This technique was used to select a core subset for the complete wheat collection described above (USDA ARS, National Genetic Resources Program, 2009), and this method has also been used to develop other core collections (Skinner et al., 1999; Huamán et al., 2000; Dahlberg et al., 2004; Yan et al., 2007). Other methods for selecting core collections have included stratification based on geographic origin, followed by further grouping based on cluster analysis of phenotypic traits (Basigalup et al., 1995; Rao and Rao, 1995; Igartua et al., 1998; Tai and Miller, 2001; Upadhyaya et al., 2001, 2006; Mahalakshmi et al., 2006; Bhattacharjee et al., 2007; Dwivedi et al., 2008). Stratification of collections has also been conducted using cluster analysis without prior geographic grouping (Diwan et al., 1995; Franco et al., 1997, 1998, 1999, 2005; Grenier et al., 2001a; Li et al., 2004; Anderson, 2005; Holbrook and Dong, 2005; Weihai et al., 2008; Upadhyaya et al., 2008). A study that compared methods for selecting core subsets using relatively complete phenotypic data demonstrated that selection based on clustering using those data was superior to selection based on geographic origin alone (Diwan et al., 1995).
Researchers have also conducted cluster analysis based on genotypic data, either based on actual genotyping (Franco et al., 2006; Wang et al., 2006; Balfourier et al., 2007; Escribano et al., 2008; Hao et al., 2008) or predicted genotypic effects based on modeling of phenotypic data (Hu et al., 2000; Li et al., 2004). Combinations of genotypic and phenotypic data have also been used to group accessions (Franco et al., 2010). Grouping based on genotype data would be expected to better reflect the genetic relationships among accessions. However, researchers are limited in the number of accessions that can be evaluated and the depth of genotyping possible. Such limitations may prevent the selection of core subsets from large collections based on genetic data or may result in cores that are not as diverse as those selected based on non-genetic data.

The clustering method used and the data used in the clustering process have also varied. Choice of clustering method determines the way in which variable data or distance calculations are used to group accessions, and different method choices can result in dramatic differences in final grouping. Ward's minimum variance method is one clustering method used by many researchers to construct cores (Franco et al., 1997; Hu et al., 2000; Upadhyaya et al., 2001, 2006, 2008; Anderson, 2005; Holbrook and Dong, 2005; Reddy et al., 2005; Kang et al., 2006; Mahalakshmi et al., 2006; Bhattacharjee et al., 2007; Dwivedi et al., 2008). Other clustering methods that have been used include the unweighted pair-group method using arithmetic average (UPGMA), also known as the average linkage method (Hu et al., 2000; Huamán et al., 2000; Li et al., 2004; Franco et al., 2006; Weihai et al., 2008); complete linkage (Hu et al., 2000); and the Ward-Modified Location Method (Franco et al., 1998, 1999, 2005).

Authors have constructed clusters based on a variety of phenotypic variables. In many cases these variables have been uniformly quantitative, and authors often used either Euclidean distances or principal components to determine relationships among accessions and construct clusters (Diwan et al., 1995; Igartua et al., 1998; Holbrook and Dong, 2005; Kang et al., 2006; Bhattacharjee et al., 2007; Upadhyaya et al., 2008). However, a smaller number of researchers have used both categorical and quantitative variables in cluster analysis (Franco et al., 1997, 1998, 1999, 2005; Kroonenberg et al., 1997).

Grouping via geographic information and/or cluster analysis serves two purposes.
The first is to aid in selecting a core with reduced redundancy as described above. The second benefit of grouping is that it provides structure to the accessions and connections to the reserve collection, which is the set of accessions from the complete collection that are not included in the core. If breeders find lines in the core collection that are of interest, they can trace connections from these lines to sets of additional accessions in the reserve with similar characteristics. Ideally these accessions will be genetically similar to the accessions in the core, although this will depend on the effectiveness of the grouping. Milkas et al. (1999) reported using core and reserve collections of common bean in such a way to discover sources of white mold resistance beyond a set found in a core subset.

Following the stratification and clustering of the complete collection, a set of accessions is chosen from each group and compiled into a core. Generally accessions are chosen at random from each stratum; however, some researchers have suggested that direct, or only partially random, selection of all or a portion of the accessions in a core can increase diversity (Basigalup et al., 1995; Skinner et al., 1999; Huamán et al., 2000; Rodiño et al., 2003; Yan et al., 2007; Weihai et al., 2008). Several core subsets have been selected using proportional sampling, a random selection method that determines quantities of accessions from each group in proportion to the number of accessions in each group (Basigalup et al., 1995; Upadhyaya et al., 2001, 2006, 2008; Grenier et al., 2001b; Dahlberg et al., 2004; Holbrook and Dong, 2005; Reddy et al., 2005; Bhattacharjee et al., 2007; Dwivedi et al., 2008). This proportional sampling method is the most effective choice if the numbers of accessions in each group in the complete collection perfectly reflect the true genetic diversity of all the genotypes in the world that could fit in that group. In reality, the selection of accessions for germplasm collections may differ markedly from such perfection, largely due to constraints on collection activities. Sampling methods that take relatively fewer samples from larger clusters reduce redundancy and increase variability, as larger clusters tend to have greater redundancy among accessions (Brown, 1989). Common implementations of such sampling strategies include selection in proportion to the square root (Huamán et al., 2000; Wang et al., 2006) and natural logarithm of group size (Grenier et al., 2001b; Yan et al., 2007).
Selecting equal numbers of accessions from each group, regardless of the size of the group, is the most extreme method for attempting to reduce redundancy. Rather than basing sampling strategy on the relationships between group sizes and diversity, some sampling methods attempt to increase diversity by selecting more accessions from groups with greater relative diversity. An example of this is selecting sample numbers relative to the mean distance among accessions in each cluster (Franco et al., 2005).

One aspect of the core collections developed in the studies referenced above is that they were constructed using complete or nearly complete data sets of geographical, phenotypic, or genotypic data.

Unfortunately, many germplasm collections only have complete, or even mostly complete, data for a few variables. Grouping based only on a few variables is unlikely to maintain the allelic diversity of genes that affect other traits. Therefore, it is desirable to utilize all the variables for which we have even limited information. One method for doing so is to use Gower's distance (Gower, 1971), as this metric allows the calculation of distances between accessions based on variables of any measurement level (nominal, ordinal, interval, or ratio) for which both accessions have values, and is not affected by variables for which either accession has a missing value.

The goal of a core subset is to represent the diversity of a complete collection with a reduced number of accessions. Therefore, the best method for selecting core subsets is the one that results in the most diversity for a given number of accessions. A wide variety of tests and calculations have been used to evaluate diversity of core subsets and to compare them to complete collections under the assumption that core subsets and complete collections are independent samples of some larger population. These methods have included chi-square tests of independence of collection type and country of origin, marker alleles, and nominal phenotypic variables (Tai and Miller, 2001; Upadhyaya et al., 2001, 2006, 2008; Grenier et al., 2001b; Reddy et al., 2005; Mahalakshmi et al., 2006; Bhattacharjee et al., 2007; Agrama et al., 2009). Differences between the distribution of quantitative variables for proposed core subsets and complete collections have been tested using the Levene test and the Newman-Keuls test (Upadhyaya et al., 2001, 2006, 2008; Grenier et al., 2001b; Reddy et al., 2005; Kang et al., 2006; Bhattacharjee et al., 2007; Agrama et al., 2009).
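As an aside on the Gower's distance mentioned earlier in this passage, the way the metric skips variables with missing values can be sketched in a few lines. This is a simplified illustration, assuming only two variable kinds (nominal, scored by simple matching, and numeric, scored by range-scaled absolute difference) and missing values coded as `None`. Ordinal variables are not treated separately here, and the function name and example values are hypothetical.

```python
import math


def gower_distance(a, b, kinds, ranges):
    """Gower distance between two accessions over the variables both have observed.

    a, b    : sequences of values, with None marking a missing value
    kinds   : "nominal" or "numeric" for each variable
    ranges  : range (max - min) of each numeric variable in the collection;
              ignored for nominal variables
    """
    total, n_compared = 0.0, 0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if x is None or y is None:
            continue  # a variable missing in either accession contributes nothing
        if kind == "nominal":
            d = 0.0 if x == y else 1.0
        else:
            d = abs(x - y) / rng if rng else 0.0
        total += d
        n_compared += 1
    # With no shared variables the distance is undefined.
    return total / n_compared if n_compared else math.nan


if __name__ == "__main__":
    # Made-up accessions: plant height, glume color, heading date, awn score.
    acc1 = [10.0, "red", None, 3.0]
    acc2 = [20.0, "white", 5.0, None]
    kinds = ["numeric", "nominal", "numeric", "numeric"]
    ranges = [40.0, None, 10.0, 6.0]
    print(gower_distance(acc1, acc2, kinds, ranges))
```

In this example only the first two variables are observed for both accessions, so the distance is the average of 10/40 and 1, and the two variables with a missing value on either side are ignored.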
However, the validity of statistical tests of differences between complete collections and core subsets is questionable, since these are not independent samples in two respects. First, the complete collection is not a random sample of all the germplasm for the species in the collection (due to limitations in collection activities), and so statistics calculated on the complete collection should not be considered estimates of the population of all germplasm for that species. Instead, the complete collection should be considered a population of interest for which we can calculate exact parameter values. Second, even if the complete collection is incorrectly considered a random sample, the core subset is not independently sampled; it is a subset of the observations in the complete collection, violating an assumption of these inference tests.

Aside from proper statistical testing, the other consideration when evaluating core subsets is how distributions of variables in the core subset should reflect the complete collection. Many researchers have sought core subsets that match the mean values observed in the complete collection (Hu et al., 2000; Upadhyaya et al., 2006, 2008; Weihai et al., 2008; Parra-Quijano et al., 2011). However, achieving the same mean values as the original collection does nothing to further the goal of increased diversity; in fact, it can result in selection against more diverse core subsets. Unless the distributions of quantitative variables in the complete collection are symmetrical, a core subset with reduced redundancy will have a mean that is shifted toward the skew. Selecting against such a change will favor methods that either omit extreme values on the skewed end or reduce redundancy less.

Mixed Models for Multiple Environment Trials

Evaluations of genotypes in varied environmental conditions are referred to as multi-environment trials (MET), and are used in advanced stages of plant breeding programs to identify genotypes with superior performance across environments and within specific environments or sets of environments. Yield data from MET often show genotype by environment interactions (G×E). That is, genotypes respond differently to different environments. Genotype by environment interactions can occur for all response variables measured in MET, such as biomass or test weight, and can be analyzed in the same way as yield.
We will use yield as our example response variable.

When G×E occurs, the average yield of a genotype across all environments is no longer sufficient information upon which to base selections. Genotype by environment interactions occur in two forms. The less problematic form is changes of scale or interaction without rank changes. As the name implies, this occurs when the absolute yield differences among genotypes are not consistent from one environment to another, but the rankings of the genotypes remain constant. With this type of G×E, there is still a genotype that is superior to all the others, but this difference may not be significant in all environments. The second form of G×E is cross-over interaction, occurring when genotypes have different rankings in different environments. This necessitates evaluating genotypes in each environment separately. Breeders will then often select those genotypes with consistent relatively high performance across environments.

Observed genotype yields in particular environments can be thought of as a sum of pattern and noise, where pattern is the yield expected whenever that genotype is grown in that environment and noise is defined as the deviation of the particular observation from the true pattern. The goal of statistical modeling is to find a model that explains the true pattern of genotype responses in each environment, and there are many methods that have been devised to do so.
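The distinction between the two forms of interaction described above can be illustrated by comparing genotype rankings across environments. The sketch below is only an illustration under simplifying assumptions: it works on a small table of made-up genotype-by-environment mean yields, treats those means as the true pattern with no noise, and does not test whether any rank change is statistically significant. The function names are hypothetical.

```python
def ranks(values):
    """Rank entries from 1 (highest value) downward; ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for position, i in enumerate(order, start=1):
        result[i] = position
    return result


def has_crossover(means):
    """Return True if genotype rankings differ between any two environments.

    means[g][e] is the mean yield of genotype g in environment e.
    """
    n_env = len(means[0])
    env_ranks = [ranks([row[e] for row in means]) for e in range(n_env)]
    return any(r != env_ranks[0] for r in env_ranks[1:])


if __name__ == "__main__":
    # Differences among genotypes change size between environments, ranks do not.
    scale_only = [[5.0, 8.0], [4.0, 5.0], [3.0, 4.5]]
    # The first two genotypes swap ranks in the second environment.
    crossover = [[5.0, 4.0], [4.0, 5.0], [3.0, 3.5]]
    print(has_crossover(scale_only), has_crossover(crossover))
```

In the first table the best genotype is the same in both environments even though its advantage changes, while in the second table the choice of best genotype depends on the environment.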
A traditional approach to the analysis of G×E is a two-way analysis of variance (ANOVA) model where genotype, environment, and their interaction are treated as fixed effects with the model:

$$y_{ijk} = \mu + g_i + e_j + (ge)_{ij} + \varepsilon_{ijk} \tag{1}$$

where $y_{ijk}$ is the yield (or other response variable) of the $k$th replicate of the $i$th genotype in the $j$th environment, $\mu$ is the overall mean, $g_i$ is the fixed effect of the $i$th genotype, $e_j$ is the fixed effect of the $j$th environment, $(ge)_{ij}$ is the interaction between the $i$th genotype and the $j$th environment, and $\varepsilon_{ijk}$ is the experimental error associated with the $ijk$th observation; $i = 1, \ldots, N_g$; $j = 1, \ldots, N_e$; $k = 1, \ldots, N_r$. In this approach, a significant interaction necessitates the estimation of G×E effects using the simple mean across replicates of each genotype within each environment. These are referred to as the cell means. The major disadvantage of this fixed effects approach is that these estimates are usually based on very little data (usually two to four data points, depending on the number of replicates) and so are less predictively accurate than alternative estimators. This approach cannot be used to estimate G×E effects when genotypes are not replicated within environments, since the effects of G×E and experimental error are confounded. Confounding also occurs with replication if all replicates, or all but one, of any combination of genotype and environment are missing.

Various alternatives have been shown to be superior to this traditional approach, including approaches with a fixed effects framework. One of the earliest approaches was joint regression analysis, or the Finlay-Wilkinson model (Yates and Cochran, 1938; Finlay and Wilkinson, 1963), where the yield of each genotype is regressed on the mean of all genotypes in each environment. More recently, the additive main effects and multiplicative interaction (AMMI; Gauch and Zobel, 1988; Gauch, 1988) and sites regression (SREG; Cornelius and Crossa, 1999) model families demonstrated improved predictive accuracy over the cell means. These two model families use sums of multiplicative terms, derived from singular value decomposition, replacing $(ge)_{ij}$ in the case of AMMI, or $g_i + (ge)_{ij}$ for SREG. The cell means model can be considered a case of the AMMI model where all possible multiplicative terms are included in the model. The AMMI and SREG models have been shown to be relatively equivalent in terms of predictive accuracy (Cornelius and Crossa, 1999). Like the analysis of G×E in a fixed effects ANOVA, these models cannot be used when data from any genotype and environment combination are missing.

Another approach is to use best linear unbiased prediction (BLUP) of random effects from a two-way mixed ANOVA model specified as in (1), but with genotypes or environments and G×E treated as random effects.
This mixed model can be specified in matrix notation as:

$$y = X\beta + Z\gamma + e \qquad (2)$$

where $y$ is the vector of observations, $\beta$ and $\gamma$ are the vectors of fixed and random effects, respectively, $X$ and $Z$ are design matrices, and $e$ is the vector of experimental error. The random effects vector, $\gamma$, consists of a subvector for genotype (and/or environment) main effects and a subvector for G×E effects. Alternatively, $\gamma$ can be limited to only G×E effects. The random effects are assumed to follow a

multivariate normal distribution with mean of 0 and a variance-covariance matrix $G$. Hill and Rosenburg (1985) set $G = \sigma^2 I$, that is, constant variance and no covariance. Hill and Rosenburg determined that the use of BLUP improved predictive accuracy over cell means, which they attributed to its shrinkage property. That is, the predictions from the BLUP method are shrunk towards the mean, but the bias this introduces is offset by a reduction in variance (Piepho et al., 2008). Assuming that $G = \sigma^2 I$ does not allow $G$ to reflect any relationships among environments. Additionally, it does not take into account relationships among genotypes known from pedigree or marker data. This limits the accuracy with which estimates of $G$ can reflect reality, and thus limits the accuracy of predicted breeding values, because information from correlated environments is not included in the BLUP calculations.

Further, the model used by Hill and Rosenburg (1985) assumes that genotypes are independent, but in most MET at least a portion of the genotypes are related and therefore would be expected to show some correlation in their effects. Breeders keep detailed pedigrees of the lines in their breeding programs and so are able to predict the degree of additive genetic relationship among genotypes by calculating a genetic (also numerator, kinship or additive) relationship matrix (given the symbol $A$) using the coefficient of coancestry (Mrode and Thompson, 2005). Henderson (1973) proposed a method for using pedigree information, through the inverse of $A$, to calculate BLUP from mixed models of dairy cattle sires. Following Henderson's (1976) description of a method for quickly calculating $A^{-1}$ without first generating $A$, animal breeders began using pedigree information with BLUP to make selections. Animal breeders now routinely use BLUP with pedigree data, but adoption by plant breeders has been slower. 
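As a minimal sketch of how an additive relationship matrix can be built from a pedigree, the following uses the standard tabular (recursive) method, in which each off-diagonal element is the average of the relationships with the two parents and each diagonal element is one plus half the relationship between the parents. This is an illustration only and is not necessarily the procedure used by the authors cited above. The function name `additive_relationship` and the five-individual pedigree are hypothetical, and parents are assumed to be listed before their offspring.

```python
import numpy as np

def additive_relationship(pedigree):
    """Tabular method for the additive (numerator) relationship matrix A.

    pedigree: list of (parent1, parent2) index pairs, one per individual,
    ordered so that parents appear before offspring; None marks an unknown
    (assumed unrelated, non-inbred) parent. Hypothetical helper.
    """
    n = len(pedigree)
    A = np.zeros((n, n))
    for i, (p1, p2) in enumerate(pedigree):
        # Diagonal: 1 + inbreeding coefficient, F = 0.5 * a(p1, p2).
        both_known = p1 is not None and p2 is not None
        A[i, i] = 1.0 + (0.5 * A[p1, p2] if both_known else 0.0)
        # Off-diagonal: average relationship of j with the parents of i.
        for j in range(i):
            a_j_p1 = A[j, p1] if p1 is not None else 0.0
            a_j_p2 = A[j, p2] if p2 is not None else 0.0
            A[i, j] = A[j, i] = 0.5 * (a_j_p1 + a_j_p2)
    return A

# Made-up pedigree: 0 and 1 are unrelated founders, 2 and 3 are full sibs
# from 0 x 1, and 4 is the progeny of the full-sib cross 2 x 3.
pedigree = [(None, None), (None, None), (0, 1), (0, 1), (2, 3)]
A = additive_relationship(pedigree)
```

In this example the full sibs have a relationship of 0.5, and the progeny of the full-sib cross has a diagonal element of 1.25, reflecting an inbreeding coefficient of 0.25. Selfing can be represented in the same scheme by listing the same individual as both parents.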
Examples of use by plant breeders include selection of soybean parents and crosses (Panter and Allen, 1995a; b), and selection of parents in peanuts (Pattee et al., 2001). Molecular marker data can also be used to generate a genetic relationship matrix (Bernardo, 1994, 1995; Villanueva et al., 2005; Hayes et al., 2009). Such genetic relationship matrices are estimates of realized relationship matrices, which reflect the way the proportion of the genome that is identical by descent between two individuals can differ from the value predicted by the pedigree due to Mendelian sampling, especially if multiple rounds of selfing have occurred after a cross. Bernardo (1996) used coefficients of coancestry calculated

from pedigrees to calculate BLUP and observed high predictive accuracy as measured by cross-validation. Piepho et al. (2008) review and provide examples of BLUP based on pedigree data without using the coefficient of coancestry.

The predictive accuracy of mixed models may also be improved by increasing the complexity of the variance-covariance matrix of the random G×E effect ($G_{ge}$) beyond an identity matrix. Note that $G_{ge}$ is a submatrix of $G$ and is equal to $G$ when random main effects are not included in the model. Smith et al. (2001) suggested that $G_{ge}$ can often be assumed separable such that $G_{ge} = G_e \otimes I_g$, where $I_g$ is an identity matrix and $\otimes$ denotes the Kronecker product. The specific year and location combinations that are used as environments in MET can easily be thought of as random samples from a population of possible environments, but these environments do not behave independently in most MET. Instead, groups of environments have similar conditions and genotype responses. For example, locations in close proximity would be expected to have similar weather, resulting in more favorable yields for similar genotypes. In that case, $G_e$ with non-zero covariances between environments may be beneficial. Additionally, it may be more accurate to model responses in each environment with a different variance (heterogeneous variances). The most general way of doing so is to allow separate parameters for each variance and covariance. This is referred to as an unstructured matrix and it has a total of $j(j + 1)/2$ parameters, where $j$ is the number of environments. Unfortunately, this means that the number of parameters to be estimated increases quadratically with the number of environments, so the use of an unstructured matrix is often impossible for large numbers of environments and may be unstable for fewer environments. In order to reduce the number of parameters that must be estimated, various simpler structures for $G_e$ can be fit. 
For instance, we may assume no covariance among environments but allow for heterogeneous variances among environments, which gives a diagonal structure. Alternatively, one can fit the same variance to all environments and a single covariance to all pairs of environments, referred to as a compound symmetric structure. When used to evaluate faba bean MET datasets, Piepho (1994) determined that the BLUP predictions, using a compound symmetric structure for $G_e$, were more predictively accurate than those of

any AMMI family model, including the cell means model.

Many other more complex structures can be used to model $G_e$. One such structure is the factor analytic model (FA), which is a mixed model version of the multiplicative model family proposed by Gollob (1968) and Mandel (1971). The fixed effects version of this model is usually referred to as the AMMI model family, which was mentioned earlier. The FA structure provides a compromise between the diagonal and unstructured matrices by finding a few common factors that best explain correlations between environments and then fitting the residual variation for each environment after the common factors are fit. Piepho (1997) used this model to analyze MET using the form:

$$y_{ij} = g_i + b_i e_j + \varepsilon_{ij}$$

where $y_{ij}$ is the mean observed yield (or response) for the $ij$th genotype and environment combination, $g_i$ is the fixed main effect of the $i$th genotype, $b_i$ is a score for genotype $i$, $e_j$ is a main effect for environment $j$, and $\varepsilon_{ij}$ is the error for the $ij$th genotype and environment combination, which includes both experimental error and unexplained interaction. He considered environmental effects and genotype scores random, so $b_i$, $e_j$, and $\varepsilon_{ij}$ are independently normally distributed with mean zero and variances of $\sigma_b^2$, $\sigma_e^2$, and $\sigma^2$, respectively. The variance-covariance matrix of genotype means in environment $j$ ($y_j$) is equal to $\sigma_e^2 J + \lambda\lambda' + D$, where $J$ is a square matrix of ones, $\lambda$ is a vector with elements equal to $\alpha_i \sigma_\beta$ ($\alpha_i$ must be estimated along with the variance components), and $D$ is equal to $\sigma^2 I$. This model can be expanded to include multiple factors in the interaction term. When Piepho fit the model to a MET of 10 wheat varieties in 17 environments, it had a similar -2 log-likelihood and fewer parameters compared to a generalized version of the Finlay-Wilkinson model, and so was considered superior. Smith et al. 
(2001) also fit a model that included a factor analytic structure for a variance-covariance matrix using the basic matrix formulation of the mixed model given in (2), and modeling the variance-covariance matrix of the G×E interaction as separable such that $G_{ge} = G_e \otimes I_g$ as described

above. A factor analytic model for $G_e$ was specified for the random effects of genotypes within environments ($\gamma$) as:

$$\gamma = (\Lambda \otimes I) f + \delta$$

where $\Lambda$ is a matrix whose columns are known as loadings, $f$ is a vector that can be partitioned into factors corresponding to the columns of $\Lambda$, and $\delta$ is a vector of residuals (or specific variances). The vectors $f$ and $\delta$ have independent multivariate normal distributions with mean 0, and variance-covariances of $I$ and $\Psi \otimes I$, respectively. The variance-covariance matrix for $\gamma$ is:

$$\mathrm{var}(\gamma) = (\Lambda \otimes I)\,\mathrm{var}(f)\,(\Lambda' \otimes I) + \mathrm{var}(\delta)$$

$$\mathrm{var}(\gamma) = (\Lambda\Lambda' + \Psi) \otimes I$$

Smith et al. (2005) showed that the model used by Piepho (1997) can be specified in a matrix algebraic form similar to that used by Smith et al. (2001). However, Smith et al. (2001) considered genotypes to be random effects with fixed effects for environments, did not include a main effect for genotype in some models, used heterogeneous specific variances ($\Psi$ as opposed to Piepho's $\sigma^2 I$), and included a spatial model for within-field variation. Both of these models assumed that genetic effects were independent. Fitting this model to a MET with 172 genotypes in 7 locations, Smith et al. found that a two factor model fit the data nearly as well as an unstructured $G_e$ as judged by a likelihood ratio test.

As described above, researchers have attempted to improve the predictive accuracy of analyses of MET by either incorporating pedigree data or FA structures into models of G×E variance. Crossa et al. (2006) and Oakey et al. (2007) went one step further and combined pedigree data with a FA structure for environmental covariances. Crossa et al. modeled the variance-covariance matrix of effects of genotypes within environments as:

$$\mathrm{var}(\gamma) = \Sigma_{g1} \otimes A$$

where $A$ is the additive relationship matrix, and $\Sigma_{g1}$ is a structure that models genetic variance and covariance across environments. Crossa et al. used multiple structures ranging from independent and identical variances to FA structures. Oakey et al. used a model similar to Smith et al. 
(2001), except for a

different model for spatial effects and a different model for the variance-covariance matrix of effects of genotypes within environments:

$$\mathrm{var}(\gamma) = G_a \otimes A + G_d \otimes D + G_i \otimes I$$

where $A$ and $D$ are the additive and dominance relationship matrices, and $G_a$, $G_d$, and $G_i$ are structures that model genetic variance and covariance across environments specific to additive, dominance, and residual non-additive effects. Oakey et al. fit models with diagonal, compound symmetric, or FA structures for $G_a$, $G_d$, and $G_i$. Crossa et al. (2006) used a similar model. Kelly et al. (2009) utilized a similar model, including the use of an additive relationship matrix and a FA structure for $G_a$, but did not fit dominance effects. These authors all found that models with FA structures resulted in better AIC scores than simpler or more complex structures when fit to real data sets.

Heritability and Genetic Correlation

Heritability is a useful concept in both breeding and genetics, but the use of the word heritability can cause confusion due to varying definitions and methods of calculation. Heritability is the proportion of phenotypic variance due to genetic effects. When calculating broad-sense heritability (symbolized $H^2$ or $H$), these are total genetic effects; whereas for narrow-sense heritability ($h^2$), we only consider additive genetic effects (Falconer and Mackay, 1996):

$$H^2 = \frac{\sigma_G^2}{\sigma_P^2}$$

where $\sigma_G^2$ is the total genetic variance and $\sigma_P^2$ is the phenotypic variance, and

$$h^2 = \frac{\sigma_A^2}{\sigma_P^2}$$

where $\sigma_A^2$ is the additive genetic variance and $\sigma_P^2$ is the phenotypic variance.

Additive genetic variance is a component of total genetic variance for multi-locus traits, which can be partitioned into additive, dominance, and epistatic variance. Additive genetic variance measures the variance attributable to the individual effects of single alleles. Dominance effects are the deviations from

the additive effect of each allele that are observed to occur in heterozygous individuals, due to the interactions between alleles at a locus. Epistasis refers to the interactions between genes or loci that deviate from simple additive effects. Epistatic interactions can be among additive and/or dominance effects and can be among two or more loci. Although the occurrence of epistatic effects is widely acknowledged, estimation of epistatic effects is often limited by statistical power or experimental design. Partly for this reason, epistasis is often assumed to be zero or negligible. When epistasis is estimated, the number of interacting loci is often limited to two and dominance interactions may not be evaluated (Reif et al., 2009; Duthie et al., 2010).

If parents and offspring can be evaluated in the same environment, covariance between parents and offspring and covariance between full-sib offspring can be used to estimate additive, dominance, and additive × additive variance (Hallauer and Filho, 1981 pp ). Diallel mating designs, with crosses between all pairs of n parents, can be used to estimate additive ($\sigma_A^2$) and dominance ($\sigma_D^2$) variance, assuming no epistasis (Hallauer and Filho, 1981 pp ). Theoretically, crosses between three and four parents can be used to isolate,, and, but their use is limited due to the complexity in obtaining parents and crosses (Hallauer and Filho, 1981 pp ). Epistatic effects and the other variance components can be easily estimated for populations in Hardy-Weinberg equilibrium with other strict assumptions, but in reality, these assumptions are often violated. This makes estimation of variance components more difficult and may necessitate other assumptions such as no epistasis (Lynch and Walsh, 1998 pp ).

Beyond broad and narrow-sense definitions, the definition of heritability must be further specialized for specific uses. 
In animal breeding and evolutionary genetics, the individual is the unit of interest; therefore, phenotypic and genetic variance among individuals is used in calculations (Visscher et al., 2008). In plant breeding, many individuals with the same genotype can be produced, allowing for replicated testing. Selection can then occur based on means of individuals with the same genotype. This situation changes the definition of heritability so that the phenotypic variance is adjusted based on the unit of selection and response (Holland et al., 2003). For example, if genotypes are selected based on means over $e$ environments and $r$ replicates within environments, broad-sense heritability is:

$$H^2 = \frac{\sigma_G^2}{\sigma_G^2 + \dfrac{\sigma_{GE}^2}{e} + \dfrac{\sigma_\varepsilon^2}{er}}$$

where $\sigma_{GE}^2$ is the variance of genotype by environment interactions, and $\sigma_\varepsilon^2$ is the residual error variance (Piepho and Möhring, 2007).

An alternate definition of heritability in the breeding context is in terms of the univariate breeder's equation: $R = h^2 S$, where $R$ is the expected response to selection and $S$ is the selection differential. In this context, narrow-sense heritability is the coefficient of the regression of the response to selection on the selection differential. So by the general definition of regression coefficients, the narrow-sense heritability when selection occurs on both parents is:

$$h^2 = \frac{\sigma_{xy}}{\sigma_x^2}$$

where $\sigma_{xy}$ is the covariance between the selection unit phenotype ($x$) and the response unit phenotype ($y$), and $\sigma_x^2$ is the variance among selection unit phenotypes, i.e. the phenotypic variance (Holland et al., 2003). In most cases, individuals are assumed to be evaluated in independent environments and genotype effects are specified to have mean 0; therefore, $\sigma_{xy} = E(G_x G_y)$, where $E(G_x G_y)$ is the expectation of the cross-product of the selection and response genotypic values. The assumption that environments are independent may not always be appropriate, and in such situations genetic correlations should be estimated.

Phenotypic correlation reflects the relationship between two phenotypes or traits for a set of individuals, and is partitioned into environmental and genetic correlations. Usually, genetic and phenotypic correlations are evaluated for traits measured on the same individual, for example, plant biomass and yield. Such genetic correlations are due either to pleiotropy or to gametic phase disequilibrium between multiple genes affecting multiple traits (Lynch and Walsh, 1998 p. 629). When two traits are measured on individuals or plots of genetically identical plants, the correlation between the traits, across genotypes, is the phenotypic correlation ($r_P$) (Holland, 2006). If the same genotype is grown in multiple environments, the correlation between the traits, across genotypes and averaged over

all environments, is the genetic correlation ($r_G$). The difference between the two is the correlation between the two traits when measured in the same environment ($r_E$), in this case the same location, year and place in the field. Genotypes can be grown in multiple plots in the same field to parse out the correlation due to position in the field (microenvironment) and year and location combination (macroenvironment).

An alternative is to consider the responses of a genotype in different environments to be different traits; the genetic correlation is then defined as the correlation between the responses of a set of genotypes evaluated in two macroenvironments (Falconer, 1952). However, in this situation, the only environmental and phenotypic correlations are on the scale of the microenvironments within the macroenvironments, and these correlations cannot be evaluated, since microenvironments cannot be replicated. When treating responses to environments as traits, the genetic correlation is related to G×E, where genetic correlations of less than one result from genotype by environment interactions (Lynch and Walsh, 1998 pp ):

$$r_A = \frac{\sigma_G^2}{\sigma_G^2 + \sigma_{GE}^2}$$

ignoring nonadditive genetic effects, where $r_A$ is the additive genetic correlation, $\sigma_G^2$ is the variance of genetic effects, and $\sigma_{GE}^2$ is the variance of interaction effects (differences in responses to the two environments varying among genotypes). This relationship means that instead of the traditional fixed effects ANOVA approach of modeling genotype by environment interactions as effects specific to each genotype and environment combination, effects of genotypes within environments can be modeled as correlated random effects that covary between environments and genotypes. This random effects approach is described in detail in the Mixed Models for Multiple Environment Trials section. 
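A small worked example may help connect the variance components discussed in this section to the quantities built from them. The sketch below assumes the commonly used entry-mean form of broad-sense heritability, genetic variance divided by the sum of genetic variance, G×E variance over the number of environments, and error variance over the number of plots per genotype, and the two-environment relationship in which the genetic correlation equals genetic variance divided by the sum of genetic and interaction variance. The function names and the variance component values are hypothetical and chosen only for illustration.

```python
def entry_mean_heritability(var_g, var_ge, var_err, n_env, n_rep):
    """Broad-sense heritability on an entry-mean basis (assumed form)."""
    phenotypic = var_g + var_ge / n_env + var_err / (n_env * n_rep)
    return var_g / phenotypic

def genetic_correlation_between_envs(var_g, var_ge):
    """Genetic correlation between two environments treated as traits,
    ignoring non-additive effects and assuming equal genetic variances."""
    return var_g / (var_g + var_ge)

# Made-up variance components: genetic, G x E, and residual error.
var_g, var_ge, var_err = 4.0, 2.0, 6.0

h2_entry = entry_mean_heritability(var_g, var_ge, var_err, n_env=3, n_rep=2)
r_g = genetic_correlation_between_envs(var_g, var_ge)
```

Under these assumed values, adding environments or replicates shrinks the interaction and error contributions to the phenotypic variance of a genotype mean and so raises the entry-mean heritability, while the genetic correlation depends only on the ratio of genetic to interaction variance.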
Both heritability and genetic correlation are defined in terms of variance and covariance parameters that are unknown in reality, and this necessitates estimation of these parameters, generally through the use of mixed linear models with restricted maximum likelihood estimation. For heritability estimation, the specific mixed model used depends on the relatedness among selection and response

individuals and on the structure of evaluation trials (Holland et al., 2003). This complexity and the relationship above between the response to selection and realized heritability, led Piepho and Möhring (2007) to propose a method to simulate values such as response to selection rather than heritability per se. Estimation of genetic correlation is somewhat more straightforward, as it only depends upon relationship and trial structures for the population evaluated (Holland, 2006).

Variance, heritability, and genetic correlation are often estimated in a single study and then considered representative of a population, and this is valid, to the extent that the population represented remains the one of interest. If any of these parameters are estimated in a study of one set of genetic material, their application to another may be questionable, depending upon the differences between them. Heritability and genetic variance will change over time due to selection, inbreeding, or mutation, which change allelic frequencies and may change additive effects of alleles on a population basis (Visscher et al., 2008). Heritability estimated for a set of breeding material may change in later years as new material is introgressed and/or selection removes or reduces the frequency of inferior alleles. For example, if, through selection, an allele becomes fixed, that locus will no longer contribute to additive effects for that trait in that population. However, the effect may again become evident if a different allele is reintroduced into the population. Genetic correlation between two environments may vary as weather conditions may be more or less similar year-to-year.

Dimension Reduction for Linear Modeling

When fitting multiple linear models, one major assumption is that all explanatory variables are independent of each other. 
A major consequence of violating this assumption is that the parameter estimates for the multicollinear variables will have very large sampling variability and so do not provide reliable information about the true parameter values (Kutner et al., 2004). The goal of many observational studies is to evaluate multiple variables to determine which of them have the greatest effect on a particular response. Statistical analysis of such studies will only be successful if the issue of multicollinearity is resolved.
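The inflation of sampling variability described above can be quantified with variance inflation factors (VIF), a common diagnostic that is not part of the discussion above and is introduced here only as an illustration. The sketch below relies on the identity that the VIF of each standardized explanatory variable equals the corresponding diagonal element of the inverse correlation matrix. The function name and the simulated variables are hypothetical.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X, taken as the diagonal of the inverse of
    the sample correlation matrix (hypothetical helper)."""
    R = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(R))

# Simulated explanatory variables: x2 is nearly a copy of x1, while x3 is
# generated independently of both (made-up data).
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)

vif = variance_inflation_factors(np.column_stack([x1, x2, x3]))
```

In this simulation the two nearly collinear variables receive very large VIF values, meaning the sampling variance of their regression coefficients is inflated many times over what it would be with uncorrelated predictors, while the independent variable has a VIF close to one.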

Multiple remedial measures are available for addressing multicollinearity, one of which is dimension reduction. Dimension reduction techniques can be used to convert multicollinear variables into a smaller number of independent variables, creating new variables that are functions of the original multicollinear variables. Two dimension reduction techniques commonly used to eliminate multicollinearity are principal components analysis and factor analysis.

Principal components analysis (PCA) is used for dimension reduction by replacing the original variables with the first few principal components (Lattin et al., 2002). These principal components are linear combinations of the original variables that are selected one at a time for maximum variance:

$$\underset{u_i}{\arg\max} \left[ \mathrm{var}(z_i) \right] = \underset{u_i}{\arg\max} \left[ u_i' R u_i \right], \quad z_i = X u_i, \quad i = 1, \ldots, q$$

where $\arg\max$ indicates the value of $u_i$ for which the bracketed quantity is maximized (subject to $u_i' u_i = 1$ and to $u_i$ being orthogonal to the preceding eigenvectors), $z_i$ is the vector of scores for the $i$th scaled principal component, $u_i$ is the $i$th eigenvector, $X$ is the standardized original data, $R$ is the sample correlation matrix, and $q$ is the number of original variables. This maximization problem is solved using an eigenvalue or spectral decomposition of $R$. Additionally, PCA is equivalent to the singular value decomposition of $X$. After PCA, the scores of the first few principal components can be used to replace the observations of the original variables. Principal components are mutually uncorrelated, and when calculated on a dataset with high multicollinearity, the first few can capture most of the variation in the original variables. This makes PCA a useful option for dimension reduction prior to linear regression.

Factor analysis and PCA are very similar but differ in terms of the specific model used for dimension reduction. Like PCA, factor analysis can be used to generate a smaller set of new variables, which capture most of the variation in the original variables by decomposing the correlation matrix of the original variables (Lattin et al., 2002). 
Unlike PCA, in factor analysis variation not accounted for by the factors is attributed to specific variance terms for each of the original variables. The new variables identified in factor analysis are often referred to as latent factors, and are considered to be the true unmeasurable factors, measured with error by the original variables, that affect the response (Suhr, 2005).
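The PCA procedure described above, an eigendecomposition of the sample correlation matrix followed by projection of the standardized data onto the eigenvectors, can be sketched in a few lines of NumPy. The function name `pca_correlation` and the simulated data are hypothetical, and this is a minimal illustration rather than a substitute for a dedicated statistics package.

```python
import numpy as np

def pca_correlation(data):
    """PCA by eigendecomposition of the sample correlation matrix.

    Returns eigenvalues (descending), the matching eigenvectors as columns,
    and the component scores of the standardized data (hypothetical helper).
    """
    X = np.asarray(data, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize columns
    R = np.corrcoef(Z, rowvar=False)                  # sample correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)              # eigh: R is symmetric
    order = np.argsort(eigvals)[::-1]                 # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs
    return eigvals, eigvecs, scores

# Simulated data with four correlated variables (made-up values).
rng = np.random.default_rng(2)
base = rng.normal(size=(100, 1))
data = np.hstack([base + rng.normal(scale=s, size=(100, 1))
                  for s in (0.2, 0.3, 0.5, 1.0)])

eigvals, eigvecs, scores = pca_correlation(data)
explained = eigvals / eigvals.sum()  # share of total variance per component
```

Because the variables are standardized, the eigenvalues sum to the number of original variables, and the covariance matrix of the scores is diagonal with the eigenvalues on the diagonal, which is the sense in which the components are uncorrelated.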

The number of new variables generated from either PCA or factor analysis can range from 1 up to the number of original variables and multiple techniques can be used to decide how many to keep (Lattin et al., 2002). A scree plot of eigenvalues versus their associated component/factor can identify the point at which the eigenvalues decrease in a linear fashion. Since the eigenvalues relate to the variance explained by each component/factor, only retaining those that are above this linear trend results in a parsimonious set that accounts for a large share of the original variation. Alternatively, Kaiser's Rule suggests that only those new variables with eigenvalues greater than 1 should be retained, because each of the standardized original variables has variance of 1 and thus the new variables should account for more variation. Horn's procedure uses cutoffs like Kaiser's Rule, but with cutoffs of the eigenvalues from a PCA of random data, generated with numbers of variables and observations equal to the original data. Alternatively, a number of new variables can be retained such that they explain a user-defined proportion of the variation in the original data or each specific original variable.

Following both PCA and factor analysis, rotation of the solution (a transformation of the matrix of principal components/factors) may be used to aid in interpretation (Lattin et al., 2002). In the initial results from PCA and principal factor analysis the first factor is chosen to maximize the variance accounted for and is often partially correlated to many of the original variables. With rotation, new factors can be generated that are highly related to some of the original variables and mostly unrelated to the others. Rotation should be conducted after the final number of components or factors is chosen. The methods of rotation can be divided into two major groups, those that result in orthogonal factors and those that allow non-orthogonal factors. 
When the goal is to generate new variables to use in linear modeling, orthogonality/independence is usually preferred. The advantage of non-orthogonal rotation is that it allows for new variables that are related to more distinct subsets of the original variables. The easier interpretation of non-orthogonal rotation can be advantageous in an observational study, but any resulting lack of independence will make linear model coefficient estimates inaccurate.
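Two of the retention rules described in this section, Kaiser's Rule and Horn's procedure, can be compared with a short simulation. The sketch below is a minimal illustration under stated assumptions: the Horn cutoffs are taken as the mean eigenvalues from PCA of standard normal random data of the same dimensions (percentile cutoffs are another common choice), and all function names and the simulated two-factor data set are hypothetical.

```python
import numpy as np

def correlation_eigenvalues(data):
    """Eigenvalues of the sample correlation matrix, largest first."""
    R = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    return np.sort(np.linalg.eigvalsh(R))[::-1]

def kaiser_count(eigvals):
    """Kaiser's Rule: retain components with eigenvalues greater than 1."""
    return int(np.sum(eigvals > 1.0))

def horn_count(data, n_sim=200, seed=0):
    """Horn's procedure with mean eigenvalues of random data as cutoffs
    (an assumed variant). Counts the leading components that exceed them."""
    n, q = np.asarray(data).shape
    rng = np.random.default_rng(seed)
    observed = correlation_eigenvalues(data)
    simulated = np.array([correlation_eigenvalues(rng.standard_normal((n, q)))
                          for _ in range(n_sim)])
    keep = observed > simulated.mean(axis=0)
    # argmin of a boolean array returns the position of the first False.
    return q if keep.all() else int(np.argmin(keep))

# Simulated data: six observed variables driven by two latent factors.
rng = np.random.default_rng(42)
n = 300
factors = rng.standard_normal((n, 2))
data = np.repeat(factors, 3, axis=1) + 0.5 * rng.standard_normal((n, 6))

eigvals = correlation_eigenvalues(data)
n_kaiser = kaiser_count(eigvals)
n_horn = horn_count(data)
```

For this strongly structured example both rules recover the two underlying factors. With weaker structure or smaller samples the two rules can disagree, because random sampling alone pushes the leading eigenvalues of uncorrelated data above 1.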

Extreme Cold Tolerance in Wheat

Winter wheat is planted in autumn and requires a period of cold vernalization before flowering can occur. However, not all winter wheat cultivars are able to survive the extreme cold that occurs in some environments, resulting in winterkill that often causes economically important yield losses (Patterson et al., 1990). Extreme cold can occur in winter wheat growing areas of Washington State and can result in observed losses of 70% in an extreme year (Allen et al., 1992). Breeding winter wheat to tolerate extreme cold is therefore an important goal for wheat breeders in Washington. Winter wheat genotypes vary in their ability to survive extreme cold and this cold tolerance is controlled by genes on multiple chromosomes (Sutka, 1994). Expression levels of a large number of genes change when wheat is exposed to extreme cold and these changes vary across genotypes (Skinner, 2009). Additional work will be necessary before breeders will be able to select for superior cold tolerance based on genetic information alone.

Due to the complex genetic nature of wheat cold tolerance, differential phenotypic assessments are necessary for the selection of breeding lines with greater cold tolerance. Assessment of cold tolerance in the field is impaired by year to year variation in temperature conditions and variation in conditions across a field, especially due to variable snow cover (Fowler, 1978). For this reason, evaluations under controlled environments are more appropriate. Testing is generally conducted by subjecting vernalized plants at an early growth stage to temperatures that decrease below freezing to a point where survival is differential among genotypes, followed by slow warming to regular greenhouse temperatures. After a number of weeks of regrowth plants are scored for survival. 
Beyond these generalities, numerous variations have been implemented, including variations in temperature and time at each temperature (Sutka, 1994; Fowler and Limin, 2004; Reddy et al., 2006; Skinner and Mackey, 2009; Skinner and Bellinger, 2010).

When saturated soil is exposed to temperatures that decrease rapidly to a point well below freezing, substantial differences in soil temperatures can occur (Skinner and Mackey, 2009). These

differences can be explained by variation in the amount of soil and water in each container that can occur due to accidental differences in soil packing and heterogeneity in the soil or other planting media. Containers with more soil and water will take longer to cool or warm, especially during the phase change as water freezes. Holding the temperature just below freezing for an extended period of time allows the water in all containers of soil to freeze, reducing variation in temperatures beyond that point. When exposed to temperatures slightly below freezing (-3 °C) for extended periods of time, wheat acquires increased tolerance to extreme temperatures as compared to shorter periods just below freezing in a process referred to as sub-zero acclimation (Herman et al., 2006). In the field, such a period of moderate cold does not always precede an extreme cold event. Therefore, testing for cold tolerance including such a sub-zero acclimation period may not precisely reflect tolerance in the field. However, the improved consistency resulting from a sub-zero acclimation period may outweigh this concern. Even with a sub-zero acclimation period, small differences in temperature may be present among samples, so it may be beneficial to include temperature measurements in analyses.

The method for analyzing cold tolerance data depends upon the tolerance rating method and any explanatory variables to be included. Cold tolerance is generally evaluated in terms of numbers of plants surviving a cold event, but it can also be judged on an ordinal scale in terms of quality of regrowth (Sutka, 1994; Vagujfalvi et al., 2003). While both of these methods involve some subjective judgment, binary survival is easier to judge and thus likely to be more consistent among researchers. Binary survival data can be analyzed by treating each plant as an experimental unit or by using the proportion of plants that survived as the response for each group of plants. 
If each plant is considered an experimental unit, the data may be analyzed using logistic regression. Using the proportion approach, researchers have analyzed survival data using analysis of variance on transformed proportions (Skinner and Mackey, 2009) and have compared genotypes using the temperature at which 50% of the plants are killed (Limin and Fowler, 1993, 2006) or area under the death progress curve over a range of temperatures (Reddy et al., 2006).

If phenotypic evaluation of extreme cold tolerance is to be included in a breeding program, evaluations must rapidly segregate large numbers of genotypes into groups with sufficient and insufficient

cold tolerance. Exact estimation of absolute tolerance levels is not necessary, but it is necessary to ensure that placement of each genotype into each group is due to true genetic differences and not random chance. Therefore, statistical testing that determines if each genotype has significantly different odds of survival as compared to a control is an effective analysis method. Rapid evaluation of large numbers of genotypes necessitates minimizing the number of times each genotype must be grown and placed in a freeze chamber. Using a single cooling profile allows for more efficient testing as compared to testing each genotype at multiple temperatures. However, it is important to identify beforehand a temperature profile that is both differential among genotypes and provides cold tolerance assessments that accurately predict performance in the field.

References

Agrama, H.A., W. Yan, F. Lee, R. Fjellstrom, M.-H. Chen, M. Jia, and A. McClung. Genetic assessment of a mini-core subset developed from the USDA rice genebank. Crop Science. 49(4):
Allen, R.E., J.A. Pritchett, and L.M. Little. Cold injury observations. Annual Wheat Newsletter. 38.
Anderson, W.F. Development of a forage bermudagrass (Cynodon sp.) core collection. Grassland Science. 51:
Balfourier, F., V. Roussel, P. Strelchenko, F. Exbrayat-Vinson, P. Sourdille, G. Boutet, J. Koenig, C. Ravel, O. Mitrofanova, M. Beckert, and G. Charmet. A worldwide bread wheat core collection arrayed in a 384-well plate. Theoretical and Applied Genetics. 114:
Basigalup, D.H., D.K. Barnes, and R.E. Stucker. Development of a core collection for perennial Medicago plant introductions. Crop Science. 35:
Bernardo, R. Prediction of maize single-cross performance using RFLPs and information from related hybrids. Crop Science. 34(1):
Bernardo, R. Genetic models for predicting maize single-cross performance in unbalanced yield trial data. Crop Science. 35(1):
Bernardo, R. Best linear unbiased prediction of maize single-cross performance. 
Crop Science. 36(1):

Bhattacharjee, R., I.S. Khairwal, P.J. Bramel, and K.N. Reddy Establishment of a pearl millet [Pennisetum glaucum (L.) R. Br.] core collection based on geographical distribution and quantitative traits. Euphytica. 155:
Brown, A.H.D Core collections: a practical approach to genetic resources management. Genome. 31:
Cornelius, P.L., and J. Crossa Prediction assessment of shrinkage estimators of multiplicative models for multi-environment cultivar trials. Crop Science. 39(4):
Crossa, J., J. Burgueño, P.L. Cornelius, G. McLaren, R. Trethowan, and A. Krishnamachari Modeling genotype × environment interaction using additive genetic covariances of relatives for predicting breeding values of wheat genotypes. Crop Science. 46(4):
Dahlberg, J.A., J.J. Burke, and D.T. Rosenow Development of a sorghum core collection: refinement and evaluation of a subset from Sudan. Economic Botany. 58(4):
Diwan, N., G.R. Bauchan, and M.S. McIntosh Methods of developing a core collection of annual Medicago species. Theoretical and Applied Genetics. 90:
Duthie, C., G. Simm, A. Doeschl-Wilson, E. Kalm, P.W. Knap, and R. Roehe Epistatic analysis of carcass characteristics in pigs reveals genomic interactions between quantitative trait loci attributable to additive and dominance genetic effects. Journal of Animal Science. 88(7): Available at (verified 20 January 2012).
Dwivedi, S.L., N. Puppala, H.D. Upadhyaya, N. Manivannan, and S. Singh Developing a core collection of peanut specific to Valencia market type. Crop Science. 48:
Escribano, P., M.A. Viruel, and J.I. Hormaza Comparison of different methods to construct a core germplasm collection in woody perennial species with simple sequence repeat markers. A case study in cherimoya (Annona cherimola, Annonaceae), an underutilised subtropical fruit tree species. Annals of Applied Biology. 153:
Falconer, D.S The Problem of Environment and Selection. The American Naturalist. 86(830):
Falconer, D.S., and T.F.C. Mackay Introduction to Quantitative Genetics. 4th ed. Benjamin Cummings.
Finlay, K., and G. Wilkinson The analysis of adaptation in a plant-breeding programme. Aust. J. Agric. Res. 14(6):
Fowler, D.B Selection for winterhardiness in wheat. II. variation within field trials. Crop Science. 19(6):

Fowler, D.B., and A.E. Limin Interactions among factors regulating phenological development and acclimation rate determine low-temperature tolerance in wheat. Annals of Botany. 94(5):
Franco, J., J. Crossa, and S. Desphande Hierarchical multiple-factor analysis for classifying genotypes based on phenotypic and genetic data. Crop Sci. 50(1):
Franco, J., J. Crossa, S. Taba, and H. Shands A sampling strategy for conserving genetic diversity when forming core subsets. Crop Science. 45:
Franco, J., J. Crossa, J. Villasenor, A. Castillo, S. Taba, and S.A. Eberhart A two-stage, three-way method for classifying genetic resources in multiple environments. Crop Science. 39:
Franco, J., J. Crossa, J. Villasenor, S. Taba, and S.A. Eberhart Classifying Mexican maize accessions using hierarchical and density search methods. Crop Science. 37:
Franco, J., J. Crossa, J. Villasenor, S. Taba, and S.A. Eberhart Classifying genetic resources by categorical and continuous variables. Crop Science. 38:
Franco, J., J. Crossa, M.L. Warburton, and S. Taba Sampling strategies for conserving maize diversity when forming core subsets using genetic markers. Crop Science. 46:
Gauch, H.G Model selection and validation for yield trials with interaction. Biometrics. 44(3):
Gauch, H.G., and R.W. Zobel Predictive and postdictive success of statistical analyses of yield trials. Theoret. Appl. Genetics. 76(1):
Gollob, H.F A statistical model which combines features of factor analytic and analysis of variance techniques. Psychometrika. 33(1):
Gower, J.C A general coefficient of similarity and some of its properties. Biometrics. 27:
Grenier, C., P.J. Bramel-Cox, and P. Hamon. 2001a. Core collection of sorghum: I. stratification based on eco-geographical data. Crop Science. 41:
Grenier, C., P. Hamon, and P.J. Bramel-Cox. 2001b. Core collection of sorghum: II. comparison of three random sampling strategies. Crop Science. 41:
Hallauer, A., and J.B.M. Filho Quantitative Genetics in Maize Breeding. Iowa State University Press, Ames.
Hao, C., Y. Dong, L. Wang, G. You, H. Zhang, H. Ge, J. Jia, and X. Zhang Genetic diversity and construction of core collection in Chinese wheat genetic resources. Chinese Science Bulletin. 53(10):

Hayes, B.J., P.M. Visscher, and M.E. Goddard Increased accuracy of artificial selection by using the realized relationship matrix. Genetics Research. 91(01): 47.
Henderson, C.R Sire evaluation and genetic trends. J. Anim Sci. 1973(Symposium):
Henderson, C.R A simple method for computing the inverse of a numerator relationship matrix used in prediction of breeding values. Biometrics. 32(1):
Herman, E.M., K. Rotter, R. Premakumar, G. Elwinger, R. Bae, L. Ehler-King, S. Chen, and D.P. Livingston III Additional freeze hardiness in wheat acquired by exposure to -3 °C is associated with extensive physiological, morphological, and molecular changes. Journal of Experimental Botany. 57(14):
Hill, R.R., and J.L. Rosenberger Methods for combining data from germplasm evaluation trials. Crop Science. 25(3):
Holbrook, C.C., and W. Dong Development and evaluation of a mini core collection for the U.S. peanut germplasm collection. Crop Science. 45:
Holland, J.B Estimating Genotypic Correlations and Their Standard Errors Using Multivariate Restricted Maximum Likelihood Estimation with SAS Proc MIXED. Crop Science. 46(2):
Holland, J.B., W.E. Nyquist, and C.T. Cervantes-Martínez Estimating and Interpreting Heritability for Plant Breeding: An Update. :
Hu, J., J. Zhu, and H.M. Xu Methods of constructing core collections by stepwise clustering with three sampling strategies based on the genotypic values of crops. Theoretical and Applied Genetics. 101:
Huamán, Z., R. Ortiz, and R. Gómez Selecting a Solanum tuberosum subsp. andigena core collection using morphological, geographical, disease and pest descriptors. American Journal of Potato Research. 77:
Igartua, E., M.P. Gracia, J.M. Lasa, B. Medina, J.L. Molina-Cano, J.L. Montoya, and I. Romagosa The Spanish barley core collection. Genetic Resources and Crop Evolution. 45:
Kang, C.W., S.Y. Kim, S.W. Lee, P.N. Mathur, T. Hodgkin, M.D. Zhou, and J.R. Lee Selection of a core collection of Korean sesame germplasm by a stepwise clustering method. Breeding Science. 56:
Kelly, A., B.R. Cullis, A.R. Gilmour, J.A. Eccleston, and R. Thompson Estimation in a multiplicative mixed model involving a genetic relationship matrix. Genetics Selection Evolution. 41:

Kroonenberg, P.M., B.D. Harch, K.E. Basford, and A. Cruickshank Combined analysis of categorical and numerical descriptors of Australian groundnut accessions using nonlinear principal component analysis. Journal of Agricultural, Biological, and Environmental Statistics. 2(3):
Kutner, M., C. Nachtsheim, J. Neter, and W. Li Applied Linear Statistical Models. 5th ed. McGraw-Hill/Irwin, San Francisco.
Lattin, J., D. Carroll, and P. Green Analyzing Multivariate Data. 1st ed. Duxbury Press.
Li, C.T., C.H. Shi, J.G. Wu, H.M. Xu, H.Z. Zhang, and Y.L. Ren Methods of developing core collections based on the predicted genotypic value of rice (Oryza sativa L.). Theoretical and Applied Genetics. 108:
Limin, A.E., and D.B. Fowler Inheritance of cold hardiness in Triticum aestivum synthetic hexaploid wheat crosses. Plant Breeding. 110(2):
Limin, A.E., and D.B. Fowler Low-temperature tolerance and genetic potential in wheat (Triticum aestivum L.): response to photoperiod, vernalization, and plant development. Planta. 224(2):
Lynch, M., and B. Walsh Genetics and Analysis of Quantitative Traits. 1st ed. Sinauer Associates.
Mahalakshmi, V., Q. Ng, M. Lawson, and R. Ortiz Cowpea [Vigna unguiculata (L.) Walp.] core collection defined by geographical, agronomical and botanical descriptors. Plant Genetic Resources: Characterization and Utilization. 5(3):
Mandel, J A new analysis of variance model for non-additive data. Technometrics. 13(1):
Miklas, P.N., R. Delorme, R. Hannan, and M.H. Dickson Using a subsample of the core collection to identify new sources of resistance to white mold in common bean. Crop Science. 39:
Mrode, R.A., and R. Thompson Linear models for the prediction of animal breeding values. 2nd ed. CABI, Cambridge, MA.
Oakey, H., A.P. Verbyla, B.R. Cullis, X. Wei, and W.S. Pitchford Joint modeling of additive and non-additive (genetic line) effects in multi-environment trials. Theoretical and Applied Genetics. 114:
Panter, D.M., and F.L. Allen. 1995a. Using best linear unbiased predictions to enhance breeding for yield in soybean: I. choosing parents. Crop Science. 35(2):
Panter, D.M., and F.L. Allen. 1995b. Using best linear unbiased predictions to enhance breeding for yield in soybean: II. selection of superior crosses from a limited number of yield trials. Crop Science. 35(2):

Parra-Quijano, M., J.M. Iriondo, E. Torres, and L.D. la Rosa Evaluation and Validation of Ecogeographical Core Collections using Phenotypic Data. Crop Science. 51(2): 694.
Pattee, H.E., T.G. Isleib, D.W. Gorbet, F.G. Giesbrecht, and Z. Cui Parent selection in breeding for roasted peanut flavor quality. Peanut Science. 28(2):
Patterson, F.L., G.E. Shaner, H.W. Ohm, and J.E. Foster A historical perspective for the establishment of research goals for wheat improvement. Journal of Production Agriculture. 3(1):
Piepho, H.-P Best Linear Unbiased Prediction (BLUP) for regional yield trials: a comparison to additive main effects and multiplicative interaction (AMMI) analysis. Theoret. Appl. Genetics. 89(5).
Piepho, H.-P Analyzing genotype-environment data by mixed models with multiplicative terms. Biometrics. 53(2):
Piepho, H.-P., and J. Möhring Computing Heritability and Selection Response From Unbalanced Plant Breeding Trials. Genetics. 177(3):
Piepho, H.-P., J. Mohring, A.E. Melchinger, and A. Buchse BLUP for phenotypic selection in plant breeding and variety testing. Euphytica. 161:
Rao, K.E.P., and V.R. Rao The use of characterisation data in developing a core collection of sorghum. p In Core Collections of Plant Genetic Resources. John Wiley & Sons, Chichester.
Reddy, L., R.E. Allan, and K.A. Garland Campbell Evaluation of cold hardiness in two sets of near-isogenic lines of wheat (Triticum aestivum) with polymorphic vernalization alleles. Plant Breeding. 125(5):
Reddy, L.J., H.D. Upadhyaya, C.L.L. Gowda, and S. Singh Development of core collection in pigeonpea [Cajanus cajan (L.) Millspaugh] using geographic and qualitative morphological descriptors. Genetic Resources and Crop Evolution. 52:
Reif, J.C., B. Kusterer, H.-P. Piepho, R.C. Meyer, T. Altmann, C.C. Schön, and A.E. Melchinger Unraveling Epistasis With Triple Testcross Progenies of Near-Isogenic Lines. Genetics. 181(1): Available at (verified 20 January 2012).
Rodiño, A.P., M. Santalla, A.M. De Ron, and S.P. Singh A core collection of common bean from the Iberian peninsula. Euphytica. 131:
Skinner, D.Z Post-acclimation transcriptome adjustment is a major factor in freezing tolerance of winter wheat. Functional & Integrative Genomics. 9(4):
Skinner, D.Z., G.R. Bauchan, G. Auricht, and S. Hughes A method for the efficient management and utilization of large germplasm collections. Crop Science. 39:

Skinner, D.Z., and B.S. Bellinger Exposure to subfreezing temperature and a freeze-thaw cycle affect freezing tolerance of winter wheat in saturated soil. Plant and Soil. 332:
Skinner, D.Z., and B. Mackey Freezing tolerance of winter wheat plants frozen in saturated soil. Field Crops Research. 113(3):
Smith, A., B. Cullis, and R. Thompson Analyzing variety by environment data using multiplicative mixed models and adjustments for spatial field trend. Biometrics. 57(4):
Smith, A.B., B.R. Cullis, and R. Thompson The analysis of crop cultivar breeding and evaluation trials: an overview of current mixed model approaches. The Journal of Agricultural Science. 143(06):
Suhr, D Principal Component Analysis vs. Exploratory Factor Analysis. In Proceedings of the Thirtieth Annual SAS Users Group International Conference. SAS Institute Inc., Cary, NC.
Sutka, J Genetic control of frost tolerance in wheat (Triticum aestivum L.). Euphytica. 77:
Tai, P.Y.P., and J.D. Miller A core collection for Saccharum spontaneum L. from the world collection of sugarcane. Crop Science. 41:
Upadhyaya, H.D., P.J. Bramel, and S. Singh Development of a chickpea core subset using geographic distribution and quantitative traits. Crop Science. 41:
Upadhyaya, H.D., C.L.L. Gowda, R.P.S. Pundir, V.G. Reddy, and S. Singh Development of core subset of finger millet germplasm using geographical origin and data on 14 quantitative traits. Genetic Resources and Crop Evolution. 53:
Upadhyaya, H.D., R.P.S. Pundir, C.L.L. Gowda, V.G. Reddy, and S. Singh Establishing a core collection of foxtail millet to enhance the utilization of germplasm of an underutilized crop. Plant Genetic Resources: Characterization and Utilization. 6: 1-8.
USDA ARS, National Genetic Resources Program Germplasm Resources Information Network - (GRIN). [Online Database] National Germplasm Resources Laboratory, Beltsville, Maryland. Available at (verified 17 December 2009).
Vagujfalvi, A., G. Galiba, L. Cattivelli, and J. Dubcovsky The cold-regulated transcriptional activator Cbf3 is linked to the frost-tolerance locus Fr-A2 on wheat chromosome 5A. Molecular Genetics and Genomics. 269(1):
Villanueva, B., R. Pong-Wong, J. Fernández, and M.A. Toro Benefits from Marker-Assisted Selection Under an Additive Polygenic Genetic Model. J ANIM SCI. 83(8):

Visscher, P.M., W.G. Hill, and N.R. Wray Heritability in the genomics era - concepts and misconceptions. Nat Rev Genet. 9(4):
Wang, X., R. Fjellstrom, Y. Jia, W.G. Yan, M.H. Jia, B.E. Scheffler, D. Wu, Q. Shu, and A. McClung Characterization of Pi-ta blast resistance gene in an international rice core collection. Plant Breeding. 129(5):
Wang, L., Y. Guan, R. Guan, Y. Li, Y. Ma, Z. Dong, X. Liu, H. Zhang, Y. Zhang, Z. Liu, R. Chang, H. Xu, L. Li, F. Lin, W. Luan, Z. Yan, X. Ning, L. Zhu, Y. Cui, R. Piao, Y. Liu, P. Chen, and L. Qiu Establishment of Chinese soybean (Glycine max) core collections with agronomic traits and SSR markers. Euphytica. 151:
Weihai, M., Y. Jinxin, and D. Sihachakr Development of core subset for the collection of Chinese cultivated eggplants using morphological-based passport data. Plant Genetic Resources: Characterization and Utilization. 6(1):
Yan, W., N. Rutger, R.J. Bryant, H.E. Bockelman, R.G. Fjellstrom, M.-H. Chen, T.H. Tai, and A.M. McClung Development and evaluation of a core subset of the USDA rice germplasm collection. Crop Science. 47:
Yates, F., and W.G. Cochran The analysis of groups of experiments. The Journal of Agricultural Science. 28(04):

CHAPTER 2
METHODS FOR SELECTING GERMPLASM CORE SUBSETS USING SPARSE PHENOTYPIC DATA
Carl A. Walker, Harold E. Bockelman, J. Richard Alldredge, Kimberly Garland Campbell*
C.A. Walker, Dep. of Crop and Soil Sciences, Washington State Univ., Pullman, WA; K.G. Campbell, USDA-ARS, Wheat Genetics, Quality, Physiology, and Disease Research Unit, 209 Johnson Hall, Pullman, WA; H.E. Bockelman, USDA-ARS, National Small Grains Collection, 1691 S 2700 W, Aberdeen, ID 83210; J.R. Alldredge, Department of Statistics, Washington State University, Pullman, WA. *Corresponding author (kgcamp@wsu.edu).
Abbreviations: GRIN, Germplasm Research Information Network; HTAP, High-temperature Adult Plant; RI, recovery of interquartile range; RM, recovery of median; RR, recovery of range; RS, recovery of Shannon index; UPGMA, unweighted pair-group method using arithmetic averages.
Abstract
Crop plant germplasm collections are often too large and too lacking in descriptive data to be of use regularly. A well-characterized core subset that consists of a reduced number of accessions (usually about 10% of the total) can provide increased utility while still maintaining most of the genetic diversity of the complete collection. Most core subsets have been constructed using complete or nearly complete data sets of geographical, phenotypic, or genotypic data, but most large germplasm collections only have complete, or even mostly complete, data for a few

variables. The main objective of this study was to evaluate methods for selecting core subsets of germplasm collections using sparse geographic and phenotypic data. A subset of variables and accessions with complete data was isolated from the USDA Triticum aestivum collection and was used to simulate multiple collections with sparse data. Core subsets were selected from the simulated data sets using 12 methods, defined by the choice of variables to use in Gower's distance estimations, clustering algorithm, and sampling intensity. Diversity metrics were calculated for each method and simulation. The methods were ranked within each simulation and then compared in terms of these average rankings. We conclude that core subsets can be selected based on sparse phenotypic data, and we recommend that a) Gower's distances should be estimated using all variables available, including those with more than 5% missing data; b) clustering should be conducted using the UPGMA algorithm; and c) clusters should be sampled in proportion to the logarithm of the cluster sizes.

Introduction
Crop plant germplasm collections are maintained to conserve genetic variation and to provide useful plant material for researchers and plant breeders. An example of such a collection, which will be used in this study, is the collection of wheat (Triticum aestivum L. subsp. aestivum) accessions maintained as part of the National Small Grains Collection of the USDA-ARS National Plant Germplasm System. For many researchers and plant breeders, germplasm collections are often too large and too lacking in descriptive data to be of regular use. A well-characterized core collection, or core subset, that consists of a reduced number of accessions (usually about 10% of the total) can provide increased utility while still maintaining most of the genetic diversity of the complete collection (Brown, 1989). Therefore, the best method for selecting core subsets is the one that results in the most diversity for a given number of accessions. Since desired alleles are unevenly distributed throughout germplasm collections, preferentially selecting a portion from each heterogeneous group present in a complete collection increases the likelihood of selecting these unevenly distributed alleles (Brown, 1989). For this reason, most researchers have constructed core collections by grouping accessions and then selecting accessions within groups.
A number of different methods and types of data have been used to group accessions and select core collections. Passport data, i.e., the location of cultivation or collection, have been used to stratify complete collections, followed by selection from within each stratum. This technique was used to select a core subset for the complete wheat collection described above (USDA ARS, National Genetic Resources Program, 2009), and to develop other core collections (Skinner et al., 1999; Huamán et al., 2000; Dahlberg et al., 2004; Yan et al., 2007).
Other methods for selecting core subsets have included stratification based on geographic origin, followed by

further grouping based on cluster analysis of phenotypic traits (Basigalup et al., 1995; Rao and Rao, 1995; Igartua et al., 1998; Tai and Miller, 2001; Upadhyaya et al., 2001, 2006; Mahalakshmi et al., 2006; Bhattacharjee et al., 2007; Dwivedi et al., 2008). Stratification of collections has also been conducted using cluster analysis without prior geographic grouping (Diwan et al., 1995; Franco et al., 1997, 1998, 1999, 2005; Grenier et al., 2001a; Li et al., 2004; Anderson, 2005; Holbrook and Dong, 2005; Weihai et al., 2008; Upadhyaya et al., 2008). A comparison of core subset selection methods using relatively complete phenotypic data demonstrated that selection based on clustering using those data was superior to selection based on geographic origin alone (Diwan et al., 1995).
The clustering method and the data used in the clustering process have also varied, resulting in dramatic differences in final grouping. Clustering methods used include Ward's minimum variance (Franco et al., 1997; Hu et al., 2000; Upadhyaya et al., 2001, 2006, 2008; Anderson, 2005; Holbrook and Dong, 2005; Reddy et al., 2005; Kang et al., 2006; Mahalakshmi et al., 2006; Bhattacharjee et al., 2007; Dwivedi et al., 2008), unweighted pair-group method using arithmetic averages (UPGMA), also known as the average linkage method (Hu et al., 2000; Huamán et al., 2000; Li et al., 2004; Franco et al., 2006; Weihai et al., 2008), complete linkage (Hu et al., 2000), and the Ward-Modified Location method (Franco et al., 1998, 1999, 2005).
Core collections have been constructed using cluster analysis based on phenotypic variables followed by random sampling. In many cases these variables have been uniformly quantitative, and either Euclidean distances or principal components were used to determine relationships among accessions and construct clusters (Diwan et al., 1995; Igartua et al., 1998; Holbrook and Dong, 2005; Kang et al., 2006; Bhattacharjee et al., 2007; Upadhyaya et al., 2008).

Both categorical and quantitative variables have been used to determine clusters in a few cases (Franco et al., 1997, 1998, 1999, 2005; Kroonenberg et al., 1997). More recently, relationships have been determined based on genotypic data (Franco et al., 2006; Wang et al., 2006; Balfourier et al., 2007; Escribano et al., 2008; Hao et al., 2008) or combinations of genotypic and phenotypic data (Franco et al., 2010). Clusters constructed using genotypic data likely result in core subsets that better capture the genetic diversity in the complete collection; however, the germplasm collections for most major crop plants are too large to genotype all accessions.
Following the stratification of the complete collection by cluster analysis, a set of accessions was chosen from each group, with the number sampled based on the size or the diversity of each group. Direct, or only partially random, selection of all or a portion of the accessions in a core has been used to increase diversity (Basigalup et al., 1995; Skinner et al., 1999; Huamán et al., 2000; Rodiño et al., 2003; Yan et al., 2007; Weihai et al., 2008). However, most researchers have selected accessions from each group randomly and with equal chance of selection among accessions in a group. Proportional sampling, where the number of accessions was chosen according to the group size, has often been used to select core subsets (Basigalup et al., 1995; Upadhyaya et al., 2001, 2006, 2008; Grenier et al., 2001b; Dahlberg et al., 2004; Holbrook and Dong, 2005; Reddy et al., 2005; Bhattacharjee et al., 2007; Dwivedi et al., 2008). Other sampling methods, such as selection in proportion to the square root of group size (square root sampling) (Huamán et al., 2000; Wang et al., 2006) and selection based on the natural logarithm of group size (logarithmic sampling) (Grenier et al., 2001b; Yan et al., 2007), took relatively fewer samples from larger groups.
These proportional sampling methods reduced redundancy and increased variability, as larger clusters tend to have greater redundancy among

accessions (Brown, 1989). Diversity was also increased by directly selecting more accessions from groups with greater relative diversity. For example, sampling numbers were determined relative to the mean distance among accessions in each cluster (Franco et al., 2005).
One aspect of most core subsets, including those referenced above, is that they were constructed using complete or nearly complete data sets of geographical, phenotypic, or genotypic data. One exception was an approach by Basigalup et al. (1995), who used partial data to manually select accessions with extreme values. Such a manual approach is time consuming and necessarily requires subjective judgments or core subsets that increase in size with the number of variables assessed. Most large germplasm collections have complete, or even mostly complete, data for only a few geographic and phenotypic variables. The USDA soybean collection comprises more than 17 thousand accessions and 106 descriptors, but 75 of those descriptors have data for fewer than 10 thousand accessions (Randall Nelson, personal communication). The USDA wheat collection is even larger and more sparse (details below). Therefore, methods that have previously been shown to produce diverse core subsets may not be applicable to, or the most effective use of, the sparse data describing the largest germplasm collections of major crop plants. While one option is to construct the core subset using only the complete, or nearly complete, geographic and phenotypic data, the sparse data may well delineate differences among heterogeneous accessions that would otherwise be grouped together. How then should sparse data, such as the wheat GRIN database, be used to select core subsets?
Gower's distance (Gower, 1971) has been used to calculate distances between accessions based on variables, of any measurement level (nominal, ordinal, interval, or ratio), for which both accessions have values.
When either accession has a missing value for a variable, that variable would not be included in the calculation. Thus, sparse data can be used to calculate

distances, but it may not be appropriate to do so. If data are not missing at random, then parameter estimates calculated from such data will be biased (Graham, 2009). That is, estimates of distances between accessions may be influenced by which variables have missing values. In some cases, however, biased estimates using all data available may be preferable to unbiased estimates using a very small proportion of the total data available, because they result in reduced variance among distance calculations. Intuitively, if two accessions are known to be different for a trait, that information should be included, rather than ignored, when making decisions about relatedness. Decisions about whether or not to use sparse data in distance calculations should be based on the diversity of the resulting core subsets.
The main objective of this study was to evaluate methods for selecting core subsets of germplasm collections using sparse geographic and phenotypic data. We compared methods by repeatedly selecting core subsets from a complete collection with data sets simulated to have differing patterns of missing values. We then evaluated the core subsets, selected using each method, in terms of their capture of the diversity present in the complete collection. A second objective was to recommend a strategy for using all data available to construct core germplasm collections.
Materials and Methods
The Germplasm Research Information Network (GRIN) maintains a database of all species held by the National Genetic Resources Program, including wheat accessions held by the USDA-ARS National Small Grains Collection. The wheat information in the database includes species, identification information (accession number, name and improvement status), passport data (country and state where the accession was collected), growth habit, and 77 other variables reflecting agronomic, descriptive, disease susceptibility, and insect susceptibility data. However,

as is the case for many species curated in the National Genetic Resources Program, this data set is far from complete, as many agronomic and susceptibility tests have only been conducted on a portion of the accessions. For accessions of Triticum aestivum subsp. aestivum, the variables range from greater than 99.9% complete for country of origin, to more than 99% missing values (Table 1). Ongoing evaluations are being conducted for several traits. At the time this project was initiated, 41,312 accessions of Triticum aestivum subsp. aestivum were included in the database.
A total of 3160 accessions from the USDA wheat collection have complete data for 23 variables (Table 2, first column). This subset of variables and accessions will be referred to as the complete collection and was used to simulate multiple collections with sparse data. To reduce the possibility that our conclusions would rely on a particular set or pattern of missing data, four patterns of missing data were evaluated. Independently for each variable, values were removed by randomly choosing accessions from a uniform distribution, i.e., missing completely at random (MCAR), or, for each variable, values were removed from a randomly chosen contiguous set of accessions (MC). For each of these two methods of removal, two sets of percentages of missing values were assigned to the 23 variables (Table 2). The first set of values was chosen to resemble the general pattern of missing value amounts observed among the variables in the complete collection (Table 1), with the same variables best represented. The second pattern was the reverse of the first and of the data structure of the complete collection. No values were removed from country of origin, since this information is available for almost all accessions.
For each of these four patterns of missing data (MCAR-1, MC-1, MCAR-2, and MC-2), 200 independent data sets were simulated by removing values from the complete collection (referred to as the simulations).
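The two removal schemes can be illustrated with a minimal Python sketch. This is not the code used in the study; the function names, the toy variable, the 25% missing fraction, and the seed are all hypothetical, and the contiguous block is assumed not to wrap around the end of the accession list.

```python
import random

def remove_mcar(values, fraction, rng):
    """Set a fraction of entries to None, chosen uniformly at random (MCAR)."""
    n = len(values)
    n_missing = round(fraction * n)
    drop = set(rng.sample(range(n), n_missing))
    return [None if i in drop else v for i, v in enumerate(values)]

def remove_contiguous(values, fraction, rng):
    """Set one randomly placed contiguous run of entries to None (MC).

    Assumption: the run does not wrap around the end of the list.
    """
    n = len(values)
    n_missing = round(fraction * n)
    start = rng.randint(0, n - n_missing)  # randint bounds are inclusive
    return [None if start <= i < start + n_missing else v
            for i, v in enumerate(values)]

rng = random.Random(2012)  # arbitrary seed, for reproducibility only
plant_height = [float(h) for h in range(80, 100)]  # 20 toy accessions
mcar = remove_mcar(plant_height, 0.25, rng)
mc = remove_contiguous(plant_height, 0.25, rng)
```

Both functions remove the same number of values from a variable; they differ only in whether the affected accessions are scattered or adjacent in collection order.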

There are important functional differences between MCAR and MC patterns, because accessions in both the full USDA collection and the complete collection are ordered by when they were added to the collection and were frequently characterized in that order. Since accessions were added to the collection in groups that often had a degree of genetic similarity (e.g., cultivars from a single breeding program), missing values from a contiguous set could influence the best choice of method for selecting a core subset using sparse phenotypic data.
In order to evaluate the utility of selecting core subsets using the sparse passport and phenotypic data from the GRIN database, multiple core selection methods were applied to the simulated data sets and compared. All methods began with the calculation of a matrix of distances based on Gower's similarity coefficient (Gower, 1971), which was calculated using the DISTANCE procedure in SAS/STAT software (SAS Institute Inc., 2003). The distance between two accessions, $i$ and $j$, was calculated as $1 - S_{ij}$, where $S_{ij}$ is the similarity coefficient as defined by Gower (1971). This distance metric was calculated using the variables, both quantitative and qualitative, for which both accessions had an observation. In order to determine if very sparse data can be used when selecting core subsets, distance matrices were calculated either using all 23 variables, including those with many missing values, or only those that were densely populated, with 5% or less missing values.
Following distance matrix calculation, the CLUSTER procedure in SAS/STAT software (SAS Institute Inc., 2003) was used to construct a hierarchical tree of all the accessions within the complete collection. For comparison purposes, the clustering was conducted using two methods: Ward's minimum variance method (Ward, 1963) and the unweighted pair-group method using arithmetic averages (UPGMA) (Sokal and Michener, 1958), which is also known as the average linkage method.
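The pairwise-deletion behavior of the distance $1 - S_{ij}$ can be sketched in Python as follows. This is an illustrative simplification, not the SAS DISTANCE procedure: it assumes equal weights for all variables, handles only nominal and quantitative variables, and expects the range of each quantitative variable to be supplied and positive. The function name and toy accessions are hypothetical.

```python
def gower_distance(a, b, kinds, ranges):
    """Return 1 - S_ij using only variables observed in both accessions."""
    total, used = 0.0, 0
    for x, y, kind, var_range in zip(a, b, kinds, ranges):
        if x is None or y is None:
            continue  # a variable missing in either accession is skipped
        if kind == "nominal":
            s = 1.0 if x == y else 0.0
        else:
            # quantitative: difference scaled by the variable's range
            s = 1.0 - abs(x - y) / var_range
        total += s
        used += 1
    if used == 0:
        return None  # no shared variables, so the distance is undefined
    return 1.0 - total / used

kinds = ["nominal", "quantitative", "quantitative"]
ranges = [None, 50.0, 10.0]  # assumed ranges over the whole collection
acc_i = ["Turkey", 100.0, 3.0]
acc_j = ["Turkey", 120.0, None]  # third variable is missing
d_ij = gower_distance(acc_i, acc_j, kinds, ranges)  # about 0.2
```

In this toy case only the first two variables contribute, so the distance is the average of a perfect nominal match and a scaled height difference of 20/50.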
These clustering methods were selected as part of a core subset

selection approach, since they were the most commonly recommended in other studies that selected core subsets. In both methods, clustering was based on the Gower's distance matrix of each stratum. An $R^2$ value of 0.6 was set as the minimum value at which the next smallest number of clusters was selected for each stratum. As the focus of this study was the selection of core subsets for use in breeding programs, this $R^2$ value was chosen so that the clusters in each stratum formed practical groups, rather than to determine the genetic relationships among accessions. This $R^2$ cutoff resulted in clusters that retained about 60% of the total variation in Gower's distances among accessions in the simulation. We viewed this percentage as a reasonable compromise between excessively small and homogeneous clusters and large and heterogeneous clusters.
Random samples of accessions were selected from each cluster according to three methods for assigning sampling intensities: 1) sampling numbers of accessions in proportion to the number of accessions in each cluster (proportional sampling); 2) sampling in proportion to the square root of the cluster total (square root sampling); 3) sampling in proportion to the natural logarithm of the cluster total (natural log sampling). If the number of selections calculated for a cluster was less than one, then one accession was selected from that cluster; otherwise the number of selections was rounded to the nearest accession. As a result of this rounding, the number of accessions selected for potential core subsets varied among simulations and subsetting methods.
The combinations of choice of variables in the distance calculation, clustering method, and sampling intensity define 12 different methods for selection of core subsets. Since accessions were randomly selected from clusters according to the selection intensities, an

extremely large number of potential core subsets could be selected for a given data set and method. Since the specific accessions selected could influence the diversity of the resulting core subset, 1000 potential core subsets were selected using each of the twelve methods for all 200 simulations.

In order to compare methods for selecting core subsets, five metrics were calculated for each potential core subset based on statistics calculated on each variable. These metrics allowed concurrent comparisons of multiple variables of the same type by averaging measures of the percent by which a core subset differed from the complete collection. The metrics were calculated after replacing the values removed in the simulation process for the accessions selected in a given core subset.

The first metric was the averaged standardized percent deviation in medians, hereafter referred to as the recovery of median (RecMed), calculated as:

$$\mathrm{RecMed} = \frac{100}{v} \sum_{k=1}^{v} \frac{\left| \tilde{x}_k^{\mathrm{core}} - \tilde{x}_k^{\mathrm{coll}} \right|}{\tilde{x}_k^{\mathrm{coll}}},$$

where $\tilde{x}_k^{\mathrm{core}}$ is the median of the values of the $k$th ratio variable of the subset, $\tilde{x}_k^{\mathrm{coll}}$ is the median of the values of the $k$th ratio variable for the accessions in the complete collection, and $v$ is the total number of ratio variables.

The second metric, averaged percent recovery of interquartile range, hereafter referred to as recovery of interquartile range (RecIQR), was calculated as:

$$\mathrm{RecIQR} = \frac{100}{v} \sum_{k=1}^{v} \frac{\mathrm{IQR}_k^{\mathrm{core}}}{\mathrm{IQR}_k^{\mathrm{coll}}},$$

where $\mathrm{IQR}_k^{\mathrm{core}}$ is the interquartile range of the values of the $k$th ratio variable of the subset, and $\mathrm{IQR}_k^{\mathrm{coll}}$ is the interquartile range of the values of the $k$th ratio variable for the complete collection.

The third metric was the recovery of range (RecRange), also known as the coincidence rate (Hu et al., 2000), and was calculated using the equation of Franco et al. (2005):

$$\mathrm{RecRange} = \frac{100}{v} \sum_{k=1}^{v} \frac{R_k^{\mathrm{core}}}{R_k^{\mathrm{coll}}},$$

where $R_k^{\mathrm{core}}$ is the range of the $k$th ratio or ordinal variable of the accessions in the subset, $R_k^{\mathrm{coll}}$ is the range of the $k$th variable for the complete
collection, and $v$ is the total number of ratio and ordinal variables.

The fourth metric was the recovery of number of categories (RecNCat), calculated as:

$$\mathrm{RecNCat} = \frac{100}{v} \sum_{k=1}^{v} \frac{C_k^{\mathrm{core}}}{C_k^{\mathrm{coll}}},$$

where $C_k^{\mathrm{core}}$ and $C_k^{\mathrm{coll}}$ are the numbers of unique values the $k$th nominal variable takes in the subset and complete collection, respectively.

The fifth metric was the recovery of Shannon index (RecH), calculated as:

$$\mathrm{RecH} = \frac{100}{v} \sum_{k=1}^{v} \frac{H_k^{\mathrm{core}}}{H_k^{\mathrm{coll}}},$$

where $H_k^{\mathrm{core}}$ and $H_k^{\mathrm{coll}}$ are the Shannon diversity (or entropy) indices for the $k$th ordinal or nominal variable of the subset and complete collection, respectively. In both cases listed above, the Shannon indices were calculated using the equation:

$$H = -\sum_{i=1}^{S} p_i \ln p_i,$$

where $S$ is the total number of unique values that occur for a nominal or ordinal variable, and $p_i$ is the frequency of the $i$th value of the variable. Hereafter these five metrics will be collectively referred to as recovery metrics.

The recovery metrics described above were calculated for each potential core subset within each of the core subset selection methods and simulations. The median value of each recovery metric was calculated for each set of 1000 potential core subsets, and these medians will be referred to as the Median Recovery Metrics Over Potential core Subsets (MRMOPS). Medians were calculated because the distributions of recovery metrics over potential core subsets were often highly skewed. The methods were ranked, in terms of the MRMOPS, within each simulation. To summarize and provide concise criteria for choosing the method that would be expected to provide the most diverse core subset, the MRMOPS and ranks were averaged, over the simulations, for each method.
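The chain from a single recovery metric to the MRMOPS and the within-simulation ranks can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the study: the function names are hypothetical, only the Shannon-index recovery is shown, and ties in the ranking are broken by input order, which is an assumption the text does not address.

```python
import math
from collections import Counter
from statistics import median


def shannon(values):
    """Shannon index H = -sum(p_i * ln p_i) over the observed categories."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def rec_h(subset_vars, collection_vars):
    """Mean percent recovery of the Shannon index over categorical variables.

    Each argument is a list with one list of observed values per variable.
    Assumes every variable has at least two categories in the collection.
    """
    ratios = [shannon(s) / shannon(c) for s, c in zip(subset_vars, collection_vars)]
    return 100.0 * sum(ratios) / len(ratios)


def mrmops(metric_values):
    """Median of one recovery metric over the potential core subsets of a method."""
    return median(metric_values)


def rank_methods(mrmops_by_method):
    """Rank methods within one simulation; the highest MRMOPS receives rank 1."""
    ordered = sorted(mrmops_by_method, key=mrmops_by_method.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


# A subset that is more even than the collection recovers more than 100%.
collection = ["red"] * 8 + ["white"] * 2
subset = ["red", "red", "white", "white"]
print(round(rec_h([subset], [collection]), 1))

# Medians over potential cores, then ranks across three hypothetical methods.
print(rank_methods({"A": mrmops([118.0, 120.0, 125.0]), "B": 95.0, "C": 130.0}))
```

Averaging the resulting ranks over simulations gives the mean ranks used to compare methods.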

Results

Our goal was to select a core subset with most of the diversity of the complete collection without redundancy, so we evaluated how well various methods for selecting cores achieved this goal. There are two major aspects of diversity: a wide range of possible values and an even representation of all values. How these aspects are evaluated depends on the variable of interest, with range calculated on ratio and ordinal variables, interquartile range calculated on ratio variables, number of categories evaluated for nominal variables, and Shannon's index calculated on ordinal and nominal variables. We used the recovery metric calculations to compare the diversity of the complete collection with the diversity recovered by the core subsets. Values in excess of 100% were desired for RecIQR and RecH, since the evenness of core subsets should exceed that of the complete collections, whereas 100% was the maximum possible recovery of range or numbers of categories. The best methods were expected to produce cores with the greatest diversity, as estimated by the MRMOPS values.

Multiple patterns of missing data were simulated to ensure our conclusions were not specific to a single pattern of missing data. For each of these simulations, the MRMOPS were calculated and the methods were ranked in terms of the MRMOPS. The method with the consistently highest ranks (lowest numbers) would be expected to produce the most diverse core subsets for other germplasm collections with sparse data. The averages over the simulations of the MRMOPS and the ranks of MRMOPS allow us to choose a best method. Results from simulations with each pattern of missing data indicated similar average rankings of the methods.

For the MCAR-1 simulations, the method that used the entire sparse data set, UPGMA clustering, and logarithmic sampling had the best average rank for RecRange, RecNCat, and
RecH, and the second best rank for RecIQR, for which square root sampling resulted in greater diversity (Table 3). Average ranks near 1.0 indicate that a method was consistently the most diverse, as measured by each recovery metric, over all simulations. When the best average rank for a metric is nearer to 2.0, e.g. RecIQR for MCAR-1, a single method was not consistently the best, but the best method placed consistently in the top few ranks.

Results from the MC-1 simulations also show that the method that uses all variables, UPGMA clustering, and logarithmic sampling results in the greatest diversity as measured by RecIQR, RecRange, and RecH (Table 4). The near equal average rankings, in terms of RecNCat, for logarithmic and square root sampling indicate that the top ranking mostly switched back and forth between these two methods over the simulations.

Results from analyses of the MCAR-2 and MC-2 simulations differ only slightly from the results of MCAR-1 and MC-1 (Tables 5 and 6). For these simulations, core subsets selected using all variables from sparse data sets, UPGMA clustering, and logarithmic sampling were the most diverse in terms of RecIQR, RecNCat, and RecH, but this method did not produce cores with ranges as wide as those of some other methods. However, the results for RecRange are probably not meaningful, since the mean MRMOPS for each method did not vary by more than one percent.

Due to time constraints, we were unable to evaluate additional patterns of missing values or complete collections from other databases, but the range of conditions evaluated suggests that the best method will be generally consistent in other scenarios. Independent simulations and analyses are necessary to confirm these conclusions. When applying this methodology to real world germplasm collections, multiple potential core subsets could be selected.
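The three sampling rules compared in these results differ only in how cluster sizes are converted to numbers of selections. The following Python sketch illustrates the rule as described in the methods (at least one accession per cluster, otherwise rounding to the nearest accession). The function name `allocate`, the `fraction` parameter, and the way the proportionality constant is set (a target core size of 10% of the collection) are assumptions for illustration; the text does not state how the constant was chosen.

```python
import math


def allocate(cluster_sizes, rule, fraction=0.10):
    """Number of accessions to draw from each cluster.

    rule     : "proportional", "sqrt", or "log" (hypothetical labels)
    fraction : assumed target core size as a share of the collection
    """
    weight = {"proportional": float, "sqrt": math.sqrt, "log": math.log}[rule]
    weights = [weight(n) for n in cluster_sizes]
    total_w = sum(weights)
    target = fraction * sum(cluster_sizes)
    counts = []
    for n, w in zip(cluster_sizes, weights):
        raw = target * w / total_w if total_w else 0.0
        # fewer than one selection: take one; otherwise round to nearest
        k = 1 if raw < 1 else math.floor(raw + 0.5)
        counts.append(min(k, n))  # never request more than the cluster holds
    return counts


sizes = [100, 10, 1]
print(allocate(sizes, "proportional"))  # [10, 1, 1]
print(allocate(sizes, "sqrt"))          # [8, 2, 1]
print(allocate(sizes, "log"))           # [7, 4, 1]
```

In this toy case the logarithmic rule shifts selections from the largest cluster toward smaller ones, and the totals differ among rules (12, 11, and 12), which mirrors the variation in core subset size noted in the methods.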
Here we compared medians over the potential subsets, but the potential cores for each method and simulation varied greatly in terms of diversity. We
recommend the method with the highest median, since it would be more likely to produce more diverse cores. For real world collections, it may be beneficial to select many potential core subsets and then choose to use one with relatively high values for all the recovery metrics.

Since our methodology prevents the selection of core subsets all of the same size, it is possible that the MRMOPS and their rankings may have been influenced by the fluctuations in sizes. In general, including more accessions in a core might be expected to result in greater retention of diversity. When Pearson correlations were calculated over all simulations and methods, MRMOPS calculated on each recovery metric, except RecMed, were positively but weakly associated with core subset sizes (r = 0.355, 0.095, 0.159, 0.232; for RecIQR, RecRange, RecNCat, and RecH, respectively). However, these associations do not appear to be sufficient to explain the rankings of the methods. The methods with the best average rankings were not the methods with the largest mean sizes of core subset (Tables 3-6).

We were concerned that 200 simulations per missing value pattern might not have been enough to accurately compare the methods. As illustrated in Figures 1 and 2, 200 simulations were sufficient to produce stable means of the MRMOPS and ranks. If additional simulations were analyzed the means would not be expected to change to a meaningful degree. That is, the only rank changes would be between methods that produce very similar results.

Discussion

Grouping germplasm accessions via cluster analysis serves two purposes. The first is to aid in selecting a core with reduced redundancy as described above. The second benefit of grouping is that it provides structure and connections to the reserve collection, the set of accessions from the complete collection that are not included in the core.
The connection between each accession in the core and a specific group in the reserve collection can be of use to breeders. If breeders find
lines in the core collection that are of interest, they can trace connections from these lines to sets of additional accessions in the reserve with similar characteristics. Ideally these accessions will be genetically similar to the accessions in the core, although this will depend on the effectiveness of the grouping. Miklas et al. (1999) reported using core and reserve collections of common bean in such a way to discover sources of white mold resistance beyond a set found in a core subset. It is this second benefit that is the greatest argument for using a clustering approach over other approaches that yield diverse core subsets.

The goal of a core subset is to provide easier access to the resources of a complete collection by representing the complete collection with a reduced number of accessions. Some researchers have selected core subsets that match the distributions of variables measured on their complete collections. A wide variety of statistical inference tests have been used to evaluate diversity of core subsets and to compare them to complete collections under the assumption that core subsets and complete collections are independent samples of some larger population. These methods have included chi-square tests of independence of collection type and country of origin, marker alleles, and nominal phenotypic variables (Tai and Miller, 2001; Upadhyaya et al., 2001, 2006, 2008; Grenier et al., 2001b; Reddy et al., 2005; Mahalakshmi et al., 2006; Bhattacharjee et al., 2007; Agrama et al., 2009). Differences between the distribution of quantitative variables for proposed core subsets and complete collections have been tested using the Levene test and the Newman-Keuls test (Upadhyaya et al., 2001, 2006, 2008; Grenier et al., 2001b; Reddy et al., 2005; Kang et al., 2006; Bhattacharjee et al., 2007; Agrama et al., 2009).
The validity of these statistical tests of differences between complete collections and core subsets is questionable, since these are not independent samples in two respects. First, complete collections are not random samples of all wheat germplasm (due to limitations in collection
60 activities), and so statistics calculated on the complete collection should not be considered estimates of the population of all wheat germplasm. Instead, the complete collection is the population of interest for which we can calculate exact parameter values. Second, accessions in the core subset are not independent of the complete collection; the core is a stratified sample of the complete collection. Therefore, comparisons that avoid statistical tests and acknowledge that the core subset is, in fact, a subset of the complete collection are preferable. Aside from considerations of proper statistical testing, are researchers correct to select cores that match the means of complete collections, a common goal (Hu et al., 2000; Upadhyaya et al., 2006, 2008; Weihai et al., 2008; Parra-Quijano et al., 2011)? One reason to match the mean would be if the core subset was to be evaluated as a sample to make predictions about the complete collection and, by extension, the whole population represented by the complete collection. Although core subsets can do a very good job of matching the distributions of complete collections, the risk associated with this approach is that the complete collection does not effectively represent the actual population of germplasm it was sampled from. Germplasm collections are limited by the manner in which they were collected. For example, commercial breeding programs have rarely contributed material. Additionally, as a result of specific collectors or collection activities, certain countries may be under- or over-represented. Repetition of genotypes may also be a problem, but is often difficult to discern (van de Wouw et al., 2011). Rather than attempting to perfectly match the distributions of germplasm populations, a more achievable and beneficial goal is to select core subsets that capture the diversity maintained in complete collections while excluding redundant accessions. 
The core subsets that result would be useful for breeders and researchers who wish to evaluate a small set of accessions for a
new or unevaluated trait, maximizing their likelihood of finding the trait without evaluating excessive numbers of accessions, e.g. Wang et al. (2010). This approach necessarily results in deviation from the distributions of variables in the complete collection, since eliminating redundancy in anything other than a symmetric distribution will shift the center of a distribution. We have included the RecMed metric in our evaluations for readers who feel that the distributional centers of the complete collection should be maintained in the core subsets (Tables 3-6). This metric results in higher values when the medians of the core subset deviate substantially from the medians of the complete collection. However, we believe that the other four recovery metrics are preferable and effectively identify diverse core subsets with reduced redundancy.

Conclusion

We conclude that core subsets can be selected based on incomplete phenotypic data sets, and when doing so, we recommend that a) Gower's distances should be estimated using all variables available, including those with more than 5% missing data; b) clustering should be conducted using the UPGMA algorithm; and c) clusters should be sampled in proportion to the logarithm of the cluster sizes. This method, which uses all available data, is expected to produce core subsets that retain much of the diversity of the complete collection while excluding redundant accessions.

Appendix

The Germplasm Resources Information Network (GRIN) maintains a database of wheat accessions held by the National Small Grains Collection. This database includes information on species, identification information (accession number, name and improvement status), passport
data (country and state where the accession was collected), growth habit, and 77 other variables reflecting agronomic, descriptive, disease susceptibility, and insect susceptibility data. However, this data set is far from complete, as many agronomic and susceptibility tests have not been conducted for the majority of the accessions. For accessions of Triticum aestivum subsp. aestivum, the variables range from greater than 99.9% complete for country of origin, to more than 99% missing values (Table 1). Ongoing evaluations are being conducted for several traits.

At the time this project was initiated, 41,312 accessions of Triticum aestivum subsp. aestivum were included in the database. A core subset of this complete collection was previously chosen by curator H. Bockelman in 1995, and additional accessions were added in 2006. In 1995, accessions were selected randomly from groups with the same value for the variable country (referring to country of origin). The number selected from each country-group was in proportion to the natural logarithm of the size of each country-group, resulting in the selection of about 10% of the complete collection. In 2006, to reflect the additions to the complete collection, 10% (858) of the accessions added between 1995 and 2006 were selected randomly, without grouping, and added to the core subset. This existing core subset consists of a total of 3992 accessions.

In order to select a new core subset, accessions were stratified based on their growth habit (spring, winter, or facultative), and were additionally stratified by components within world macro regions, as defined in the United Nations demographic yearbook publications (United Nations, 2008). The region to which each value of the country variable is assigned is shown in Appendix Table 1. This initial stratification ensures that two accessions from different regions or with differing growth habits cannot be put together in the same group later in the core
selection process. This is desirable, since it is unlikely that two such accessions would be related.

Based on our comparisons of core selection methods, we concluded that the most diverse core subset would be selected using all variables in Gower's distance calculations, UPGMA clustering, and logarithm sampling. Using this method, 2000 potential core subsets were selected from the complete collection. Recovery metrics were calculated on all potential cores and the potential core subsets were ranked for each metric. The potential cores were then compared using the sums of the ranks multiplied by the number of variables used in the calculation of each metric, that is: 11*RI + 44*RR + 12*RC + 45*RS. The core with the lowest value of this comparison metric was selected as the best potential core subset.

Instead of directly using this best core subset, it was decided that any new core should use the maximum number possible of accessions from the original core. All accessions selected for both the original and best core were included in the new core. Additional accessions were then preferentially selected from the original core and then the best core to equal the number of accessions from each cluster determined by the logarithm sampling strategy. This resulted in a new core subset with over half of its accessions selected from the existing core and with superior diversity as measured by the recovery metrics RecRange, RecNCat, and RecH (Appendix Table 2), but not RecIQR. This indicates that the original core has greater evenness in its distribution of ratio variables, but lesser diversity for ordinal and nominal variables as compared to the reselected core subset.
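The weighted rank sum used above to choose the best potential core can be sketched as follows. The weights 11, 44, 12, and 45 come from the text; reading RI, RR, RC, and RS as the ranks of a potential core for RecIQR, RecRange, RecNCat, and the Shannon-index recovery is an interpretation, as is the convention that the highest metric value receives rank 1. The function names and the toy values are hypothetical.

```python
# Weights from the text: number of variables behind each recovery metric.
WEIGHTS = {"RecIQR": 11, "RecRange": 44, "RecNCat": 12, "RecH": 45}


def rank_desc(values):
    """Rank values so the largest gets rank 1 (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks


def best_core(metrics):
    """Return (index of best potential core, list of weighted rank sums).

    metrics maps each metric name to a list with one value per potential core.
    The lowest weighted rank sum identifies the best potential core.
    """
    n = len(next(iter(metrics.values())))
    ranks = {m: rank_desc(v) for m, v in metrics.items()}
    scores = [sum(WEIGHTS[m] * ranks[m][i] for m in WEIGHTS) for i in range(n)]
    return min(range(n), key=scores.__getitem__), scores


# Three hypothetical potential cores.
toy = {
    "RecIQR": [105.0, 110.0, 100.0],
    "RecRange": [90.0, 85.0, 95.0],
    "RecNCat": [80.0, 90.0, 85.0],
    "RecH": [115.0, 110.0, 120.0],
}
print(best_core(toy))  # (2, [236, 290, 146])
```

Because RecRange and the Shannon-index recovery carry the largest weights, a core that ranks well on those two metrics tends to win even if it ranks poorly on RecIQR, as the third toy core does here.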

References

Agrama, H.A., W. Yan, F. Lee, R. Fjellstrom, M.-H. Chen, M. Jia, and A. McClung. Genetic assessment of a mini-core subset developed from the USDA rice genebank. Crop Science. 49(4):
Anderson, W.F. Development of a forage bermudagrass (Cynodon sp.) core collection. Grassland Science. 51:
Balfourier, F., V. Roussel, P. Strelchenko, F. Exbrayat-Vinson, P. Sourdille, G. Boutet, J. Koenig, C. Ravel, O. Mitrofanova, M. Beckert, and G. Charmet. A worldwide bread wheat core collection arrayed in a 384-well plate. Theoretical and Applied Genetics. 114:
Basigalup, D.H., D.K. Barnes, and R.E. Stucker. Development of a core collection for perennial Medicago plant introductions. Crop Science. 35:
Bhattacharjee, R., I.S. Khairwal, P.J. Bramel, and K.N. Reddy. Establishment of a pearl millet [Pennisetum glaucum (L.) R. Br.] core collection based on geographical distribution and quantitative traits. Euphytica. 155:
Brown, A.H.D. Core collections: a practical approach to genetic resources management. Genome. 31:
Dahlberg, J.A., J.J. Burke, and D.T. Rosenow. Development of a sorghum core collection: refinement and evaluation of a subset from Sudan. Economic Botany. 58(4):
Diwan, N., G.R. Bauchan, and M.S. McIntosh. Methods of developing a core collection of annual Medicago species. Theoretical and Applied Genetics. 90:
Dwivedi, S.L., N. Puppala, H.D. Upadhyaya, N. Manivannan, and S. Singh. Developing a core collection of peanut specific to Valencia market type. Crop Science. 48:
Escribano, P., M.A. Viruel, and J.I. Hormaza. Comparison of different methods to construct a core germplasm collection in woody perennial species with simple sequence repeat markers. A case study in cherimoya (Annona cherimola, Annonaceae), an underutilised subtropical fruit tree species. Annals of Applied Biology. 153:
Franco, J., J. Crossa, and S. Desphande. Hierarchical multiple-factor analysis for classifying genotypes based on phenotypic and genetic data. Crop Sci. 50(1):
Franco, J., J. Crossa, S. Taba, and H. Shands. A sampling strategy for conserving genetic diversity when forming core subsets. Crop Science. 45:
Franco, J., J. Crossa, J. Villasenor, A. Castillo, S. Taba, and S.A. Eberhart. A two-stage, three-way method for classifying genetic resources in multiple environments. Crop Science. 39:
Franco, J., J. Crossa, J. Villasenor, S. Taba, and S.A. Eberhart. Classifying Mexican maize accessions using hierarchical and density search methods. Crop Science. 37:
Franco, J., J. Crossa, J. Villasenor, S. Taba, and S.A. Eberhart. Classifying genetic resources by categorical and continuous variables. Crop Science. 38:
Franco, J., J. Crossa, M.L. Warburton, and S. Taba. Sampling strategies for conserving maize diversity when forming core subsets using genetic markers. Crop Science. 46:
Gower, J.C. A general coefficient of similarity and some of its properties. Biometrics. 27:
Graham, J.W. Missing Data Analysis: Making It Work in the Real World. Annual Review of Psychology. 60(1):
Grenier, C., P.J. Bramel-Cox, and P. Hamon. 2001a. Core collection of sorghum: I. stratification based on eco-geographical data. Crop Science. 41:
Grenier, C., P. Hamon, and P.J. Bramel-Cox. 2001b. Core collection of sorghum: II. comparison of three random sampling strategies. Crop Science. 41:
Hao, C., Y. Dong, L. Wang, G. You, H. Zhang, H. Ge, J. Jia, and X. Zhang. Genetic diversity and construction of core collection in Chinese wheat genetic resources. Chinese Science Bulletin. 53(10):
Holbrook, C.C., and W. Dong. Development and evaluation of a mini core collection for the U.S. peanut germplasm collection. Crop Science. 45:
Hu, J., J. Zhu, and H.M. Xu. Methods of constructing core collections by stepwise clustering with three sampling strategies based on the genotypic values of crops. Theoretical and Applied Genetics. 101:
Huamán, Z., R. Ortiz, and R. Gómez. Selecting a Solanum tuberosum subsp. andigena core collection using morphological, geographical, disease and pest descriptors. American Journal of Potato Research. 77:
Igartua, E., M.P. Gracia, J.M. Lasa, B. Medina, J.L. Molina-Cano, J.L. Montoya, and I. Romagosa. The Spanish barley core collection. Genetic Resources and Crop Evolution. 45:
Kang, C.W., S.Y. Kim, S.W. Lee, P.N. Mathur, T. Hodgkin, M.D. Zhou, and J.R. Lee. Selection of a core collection of Korean sesame germplasm by a stepwise clustering method. Breeding Science. 56:
Kroonenberg, P.M., B.D. Harch, K.E. Basford, and A. Cruickshan. Combined analysis of categorical and numerical descriptors of australian groundnut accessions using nonlinear
principal component analysis. Journal of Agricultural, Biological, and Environmental Statistics. 2(3):
Li, C.T., C.H. Shi, J.G. Wu, H.M. Xu, H.Z. Zhang, and Y.L. Ren. Methods of developing core collections based on the predicted genotypic value of rice (Oryza sativa L.). Theoretical and Applied Genetics. 108:
Mahalakshmi, V., Q. Ng, M. Lawson, and R. Ortiz. Cowpea [Vigna unguiculata (L.) Walp.] core collection defined by geographical, agronomical and botanical descriptors. Plant Genetic Resources: Characterization and Utilization. 5(3):
Miklas, P.N., R. Delorme, R. Hannan, and M.H. Dickson. Using a subsample of the core collection to identify new sources of resistance to white mold in common bean. Crop Science. 39:
Parra-Quijano, M., J.M. Iriondo, E. Torres, and L.D. la Rosa. Evaluation and Validation of Ecogeographical Core Collections using Phenotypic Data. Crop Science. 51(2): 694.
Rao, K.E.P., and V.R. Rao. The use of characterisation data in developing a core collection of sorghum. p In Core Collections of Plant Genetic Resources. John Wiley & Sons, Chichester.
Reddy, L.J., H.D. Upadhyaya, C.L.L. Gowda, and S. Singh. Development of core collection in pigeonpea [Cajanus cajan (L.) Millspaugh] using geographic and qualitative morphological descriptors. Genetic Resources and Crop Evolution. 52:
Rodiño, A.P., M. Santalla, A.M. De Ron, and S.P. Singh. A core collection of common bean from the Iberian peninsula. Euphytica. 131:
SAS Institute Inc. SAS/STAT User's Guide, Version 9. SAS Institute Inc., Cary, NC.
Skinner, D.Z., G.R. Bauchan, G. Auricht, and S. Hughes. A method for the efficient management and utilization of large germplasm collections. Crop Science. 39:
Sokal, R.R., and C.D. Michener. A statistical method for evaluating systematic relationships. Kansas University Science Bulletin. 38:
Tai, P.Y.P., and J.D. Miller. A core collection for Saccharum spontaneum L. from the world collection of sugarcane. Crop Science. 41:
United Nations. Demographic Yearbook. New York.
Upadhyaya, H.D., P.J. Bramel, and S. Singh. Development of a chickpea core subset using geographic distribution and quantitative traits. Crop Science. 41:
Upadhyaya, H.D., C.L.L. Gowda, R.P.S. Pundir, V.G. Reddy, and S. Singh. Development of core subset of finger millet germplasm using geographical origin and data on 14 quantitative traits. Genetic Resources and Crop Evolution. 53:
Upadhyaya, H.D., R.P.S. Pundir, C.L.L. Gowda, V.G. Reddy, and S. Singh. Establishing a core collection of foxtail millet to enhance the utilization of germplasm of an underutilized crop. Plant Genetic Resources: Characterization and Utilization. 6: 1-8.
USDA ARS, National Genetic Resources Program. Germplasm Resources Information Network - (GRIN). [Online Database] National Germplasm Resources Laboratory, Beltsville, Maryland. Available at (verified 17 December 2009).
Wang, X., R. Fjellstrom, Y. Jia, W.G. Yan, M.H. Jia, B.E. Scheffler, D. Wu, Q. Shu, and A. McClung. Characterization of Pi-ta blast resistance gene in an international rice core collection. Plant Breeding. 129(5):
Wang, L., Y. Guan, R. Guan, Y. Li, Y. Ma, Z. Dong, X. Liu, H. Zhang, Y. Zhang, Z. Liu, R. Chang, H. Xu, L. Li, F. Lin, W. Luan, Z. Yan, X. Ning, L. Zhu, Y. Cui, R. Piao, Y. Liu, P. Chen, and L. Qiu. Establishment of Chinese soybean (Glycine max) core collections with agronomic traits and SSR markers. Euphytica. 151:
Ward, J.H. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association. 58:
Weihai, M., Y. Jinxin, and D. Sihachakr. Development of core subset for the collection of Chinese cultivated eggplants using morphological-based passport data. Plant Genetic Resources: Characterization and Utilization. 6(1):
van de Wouw, M., R. van Treuren, and T. van Hintum. Authenticity of Old Cultivars in Genebank Collections: A Case Study on Lettuce. Crop Science. 51(2): 736.
Yan, W., N. Rutger, R.J. Bryant, H.E. Bockelman, R.G. Fjellstrom, M.-H. Chen, T.H. Tai, and A.M. McClung. Development and evaluation of a core subset of the USDA rice germplasm collection. Crop Science. 47:

Table 1. Measurement levels and missing value percentages of variables evaluated on the Triticum aestivum L. subsp. aestivum complete collection.

| Variable | Level of measurement | % missing |
|---|---|---|
| awn color | nominal | 63.3 |
| awn type | ordinal | 55.4 |
| BYDV Davis reaction | ordinal | 95.3 |
| BYDV Urb reaction | ordinal | 69.3 |
| cereal leafbeetle reaction | ordinal | 69.6 |
| commonbunt M1 reaction | ratio | 61.3 |
| commonbunt M2 reaction | ratio | 93.3 |
| commonbunt M3 reaction | ratio | 86.7 |
| commonbunt R36 reaction | ratio | 99.9 |
| commonbunt R39 reaction | ratio | 97.0 |
| commonbunt R43 reaction | ratio | 99.4 |
| commonbunt T1 reaction | ratio | 85.2 |
| country | nominal | 0.02 |
| days to flowering | ratio | 4.7 |
| dwarf bunt reaction | ratio | 56.2 |
| glume color | nominal | 62.8 |
| glume pubescence | ordinal | 59.6 |
| growth habit | nominal | 0.8 |
| height | ratio | 4.8 |
| Hessian B reaction | ordinal | 98.9 |
| Hessian C reaction | ordinal | 58.3 |
| Hessian E reaction | ordinal | 58.3 |
| Hessian GP reaction | ordinal | 68.5 |
| Hessian L reaction | ordinal | 89.6 |
| kernel color | nominal | 44.6 |
| kernel weight | ratio | 65.6 |
| kernels per spike | ratio | 92.9 |
| leaf pubescence | ordinal | 61.8 |
| leaf rust adult reaction | ordinal | 90.0 |
| leaf rust reaction | ordinal | 26.1 |
| lodging | ordinal | 57.6 |
| lysine content | ratio | 75.1 |
| powdery mildew reaction | ordinal | 66.4 |
| rachis length | ordinal | 95.6 |
| RWA 1 chlorosis | ordinal | 27.7 |
| RWA 2 chlorosis | ordinal | 74.0 |
| RWA leaf roll 1 | nominal | 27.7 |
| RWA leaf roll 2 | nominal | 74.0 |
| SBMV reaction | ordinal | 79.9 |
| scab reaction | ratio | 90.0 |
| shattering | ordinal | 78.0 |
| spike density | ordinal | 74.5 |
| spike type | nominal | 74.8 |
| spikelets per spike | ratio | 95.7 |
| stagnospora reaction | ordinal | 84.7 |
| state | nominal | 20.6 |
| stem rust adult Rosemount | ordinal | 87.7 |
| stem rust adult St.Paul | ordinal | 65.2 |
| stem rust HJCS reaction | nominal | 90.4 |
| stem rust HNLQ reaction | nominal | 91.1 |
| stem rust QFBS reaction | nominal | 82.5 |
| stem rust QSHS reaction | nominal | 90.2 |
| stem rust RHRS reaction | nominal | 90.5 |
| stem rust RKQS reaction | nominal | 91.1 |
| stem rust RTQQ reaction | nominal | 81.7 |
| stem rust TNMH reaction | nominal | 90.4 |
| stem rust TNMK reaction | nominal | 81.8 |
| straw breakage | ordinal | 70.5 |
| straw color | nominal | 55.6 |
| stripe rust adult Mt.Vernon | ordinal | 7.3 |
| stripe rust adult Pullman | ordinal | 26.5 |
| stripe rust PST 100 reaction | ordinal | 88.3 |
| stripe rust PST 17 reaction | ordinal | 50.2 |
| stripe rust PST 20 reaction | ordinal | 73.0 |
| stripe rust PST 25 reaction | ordinal | 97.9 |
| stripe rust PST 27 reaction | ordinal | 70.5 |
| stripe rust PST 29 reaction | ordinal | 71.2 |
| stripe rust PST 37 reaction | ordinal | 62.8 |
| stripe rust PST 43 reaction | ordinal | 65.1 |
| stripe rust PST 45 reaction | ordinal | 62.8 |
| stripe rust PST 78 reaction | ordinal | 89.8 |
| stripe rust PST 80 reaction | ordinal | 92.8 |
| stripe rust severity Mt. Vernon | ratio | 7.3 |
| stripe rust severity Pullman | ratio | 26.3 |

Detailed information on variables available at

Table 2. Removal percentages by variable for simulating data sets with missing values by removing values from the "complete collection".

| Variable | Level of measurement | % removed, Set 1 | % removed, Set 2 |
|---|---|---|---|
| country | nominal | 0 | 0 |
| days to flowering | ratio | 5 | 95 |
| height | ratio | 5 | 95 |
| stripe rust adult Mt.Vernon | ordinal | 5 | 90 |
| stripe rust severity Mt. Vernon | ratio | 5 | 90 |
| state | nominal | | |
| leaf rust reaction | ordinal | | |
| RWA 1 chlorosis | ordinal | | |
| RWA leaf roll 1 | nominal | | |
| stripe rust adult Pullman | ordinal | | |
| stripe rust severity Pullman | ratio | | |
| kernel color | nominal | | |
| Hessian C reaction | ordinal | | |
| Hessian E reaction | ordinal | | |
| kernel weight | ratio | | |
| straw color | nominal | | |
| awn type | ordinal | | |
| lodging | ordinal | | |
| leaf pubescence | ordinal | | |
| glume color | nominal | 90 | 5 |
| awn color | nominal | 90 | 5 |
| straw breakage | ordinal | 95 | 5 |
| commonbunt M1 reaction | ratio | | |

Table 3. Comparisons of core subset selection methods in terms of diversity of 1000 potential core subsets selected from 200 complete collections simulated with values removed at the rates given by set 1 (see Table 2) from accessions selected randomly from a uniform distribution.

Columns: core subset selection method (distance variables, clustering, sampling); mean # of accessions; medians# and ranks for each of RecMed, RecIQR, RecRange, RecNCat, and RecH.

| Distance variables | Clustering | Sampling |
|---|---|---|
| Sparse | UPGMA | Logarithm |
| Sparse | UPGMA | Proportional |
| Sparse | UPGMA | Square Root |
| Sparse | Ward | Logarithm |
| Sparse | Ward | Proportional |
| Sparse | Ward | Square Root |
| Dense | UPGMA | Logarithm |
| Dense | UPGMA | Proportional |
| Dense | UPGMA | Square Root |
| Dense | Ward | Logarithm |
| Dense | Ward | Proportional |
| Dense | Ward | Square Root |

Recovery metrics: RecMed, recovery of median; RecIQR, recovery of interquartile range; RecRange, recovery of range; RecNCat, recovery of number of unique categories; RecH, recovery of Shannon's Index.
Distance calculations were conducted using either all of the variables (Sparse) or only those few with 5% or less missing values (Dense).
Clustering of accessions methods: UPGMA, unweighted pair-group method using arithmetic averages; Ward, Ward's minimum variance.
Sampling accessions from each cluster in proportion to cluster size (proportional), natural logarithm of the size (logarithm), or square root of the size (square root).
# Means, over 200 simulations, of medians of recovery metrics of 1000 potential core subsets.
Means, over 200 simulations, of ranks of selection methods. Methods were ranked, with highest values receiving the lowest ranks, within each simulation based on medians, over 1000 potential core subsets, of recovery metrics.

Table 4. Comparisons of core subset selection methods in terms of diversity of 1000 potential core subsets selected from 200 complete collections simulated with values removed at the rates given by set 1 (see Table 2) from accessions selected as a contiguous group.

Column headings: Core subset selection method (Distance Variables, Clustering, Sampling); Mean # of accessions; Medians# and Ranks for each of RecMed, RecIQR, RecRange, RecNCat, and RecH.

Distance Variables   Clustering   Sampling
Sparse               UPGMA        Logarithm
Sparse               UPGMA        Proportional
Sparse               UPGMA        Square Root
Sparse               Ward         Logarithm
Sparse               Ward         Proportional
Sparse               Ward         Square Root
Dense                UPGMA        Logarithm
Dense                UPGMA        Proportional
Dense                UPGMA        Square Root
Dense                Ward         Logarithm
Dense                Ward         Proportional
Dense                Ward         Square Root

Recovery metrics: RecMed, recovery of median; RecIQR, recovery of interquartile range; RecRange, recovery of range; RecNCat, recovery of number of unique categories; RecH, recovery of Shannon's Index.
Distance calculations were conducted using either all of the variables (Sparse) or only those few with 5% or less missing values (Dense).
Clustering of accessions methods: UPGMA, unweighted pair-group method using arithmetic averages; Ward, Ward's minimum variance.
Sampling accessions from each cluster in proportion to cluster size (proportional), natural logarithm of the size (logarithm), or square root of the size (square root).
# Means, over 200 simulations, of medians of recovery metrics of 1000 potential core subsets.
Ranks: means, over 200 simulations, of ranks of selection methods. Methods were ranked, with highest values receiving the lowest ranks, within each simulation based on medians, over 1000 potential core subsets, of recovery metrics.

Table 5. Comparisons of core subset selection methods in terms of diversity of 1000 potential core subsets selected from 200 complete collections simulated with values removed at the rates given by set 2 (see Table 2) from accessions selected randomly from a uniform distribution.

Column headings: Core subset selection method (Distance Variables, Clustering, Sampling); Mean # of accessions; Medians# and Ranks for each of RecMed, RecIQR, RecRange, RecNCat, and RecH.

Distance Variables   Clustering   Sampling
Sparse               UPGMA        Logarithm
Sparse               UPGMA        Proportional
Sparse               UPGMA        Square Root
Sparse               Ward         Logarithm
Sparse               Ward         Proportional
Sparse               Ward         Square Root
Dense                UPGMA        Logarithm
Dense                UPGMA        Proportional
Dense                UPGMA        Square Root
Dense                Ward         Logarithm
Dense                Ward         Proportional
Dense                Ward         Square Root

Recovery metrics: RecMed, recovery of median; RecIQR, recovery of interquartile range; RecRange, recovery of range; RecNCat, recovery of number of unique categories; RecH, recovery of Shannon's Index.
Distance calculations were conducted using either all of the variables (Sparse) or only those few with 5% or less missing values (Dense).
Clustering of accessions methods: UPGMA, unweighted pair-group method using arithmetic averages; Ward, Ward's minimum variance.
Sampling accessions from each cluster in proportion to cluster size (proportional), natural logarithm of the size (logarithm), or square root of the size (square root).
# Means, over 200 simulations, of medians of recovery metrics of 1000 potential core subsets.
Ranks: means, over 200 simulations, of ranks of selection methods. Methods were ranked, with highest values receiving the lowest ranks, within each simulation based on medians, over 1000 potential core subsets, of recovery metrics.

Table 6. Comparisons of core subset selection methods in terms of diversity of 1000 potential core subsets selected from 200 complete collections simulated with values removed at the rates given by set 2 (see Table 2) from accessions selected as a contiguous group.

Column headings: Core subset selection method (Distance Variables, Clustering, Sampling); Mean # of accessions; Medians# and Ranks for each of RecMed, RecIQR, RecRange, RecNCat, and RecH.

Distance Variables   Clustering   Sampling
Sparse               UPGMA        Logarithm
Sparse               UPGMA        Proportional
Sparse               UPGMA        Square Root
Sparse               Ward         Logarithm
Sparse               Ward         Proportional
Sparse               Ward         Square Root
Dense                UPGMA        Logarithm
Dense                UPGMA        Proportional
Dense                UPGMA        Square Root
Dense                Ward         Logarithm
Dense                Ward         Proportional
Dense                Ward         Square Root

Recovery metrics: RecMed, recovery of median; RecIQR, recovery of interquartile range; RecRange, recovery of range; RecNCat, recovery of number of unique categories; RecH, recovery of Shannon's Index.
Distance calculations were conducted using either all of the variables (Sparse) or only those few with 5% or less missing values (Dense).
Clustering of accessions methods: UPGMA, unweighted pair-group method using arithmetic averages; Ward, Ward's minimum variance.
Sampling accessions from each cluster in proportion to cluster size (proportional), natural logarithm of the size (logarithm), or square root of the size (square root).
# Means, over 200 simulations, of medians of recovery metrics of 1000 potential core subsets.
Ranks: means, over 200 simulations, of ranks of selection methods. Methods were ranked, with highest values receiving the lowest ranks, within each simulation based on medians, over 1000 potential core subsets, of recovery metrics.

Figure 3. Plot of cumulative means, over simulations, of median recovery of interquartile range, over 1000 potential core subsets per simulation. Simulations were generated by removing values from randomly chosen individual accessions with missingness rates given by set 1. The values of the means of all 200 simulations are shown in Table 3. Plot labels: Mean over simulations of medians over cores; MedRIQR; Index.

Figure 4. Plot of cumulative means, over simulations, of median recovery of interquartile range, over 1000 potential core subsets, ranked across methods within each simulation. Simulations were generated by removing values from randomly chosen individual accessions with missingness rates given by set 1. The mean ranks, over all 200 simulations, are shown in Table 3. Plot labels: Mean of Ranks; RRIQR; Index.

Appendix Table 1. Assignments of countries to world macro region components.

Caribbean: Cuba
Central America: Guatemala, Honduras, Mexico, Nicaragua, Panama
South America: Argentina, Bolivia, Brazil, Chile, Colombia, Ecuador, Paraguay, Peru, Uruguay, Venezuela
Northern America: Canada, United States
Eastern Asia: China, Japan, Korea, North; Korea, South; Mongolia, Taiwan
South-Central Asia: Afghanistan, Bangladesh, Bhutan, India, Iran, Kazakhstan, Kyrgyzstan, Nepal, Pakistan, Tajikistan, Turkistan, Turkmenistan, Uzbekistan
South-eastern Asia: Indonesia, Philippines, Thailand
Western Asia: Ancient Palestine, Armenia, Asia Minor, Azerbaijan, Cyprus, Georgia, Iraq, Israel, Jordan, Lebanon, Oman, Saudi Arabia, Syria, Turkey, Yemen
Eastern Europe: Belarus, Bulgaria, Czech Republic, Czechoslovakia, Former Soviet Union, Hungary, Moldova, Poland, Romania, Russian Federation, Slovakia, Ukraine
Northern Europe: Denmark, Estonia, Finland, Iceland, Ireland, Latvia, Lithuania, Norway, Sweden, United Kingdom
Southern Europe: Albania, Andorra, Bosnia and Herzegovina, Croatia, Former Yugoslavia, Greece, Italy, Macedonia, Malta, Portugal, Slovenia, Spain, Yugoslavia
Western Europe: Austria, Belgium, France, Germany, Netherlands, Switzerland
Oceania: Australia, New Zealand
Unknown: Asia, Europe, Uncertain, Unknown

Appendix Table 2. Comparison of original core subset against the reselected core subset in terms of recovery metrics.

Core Subset   RecMed   RecIQR   RecRange   RecNCat   RecH
Original
New core

Recovery metrics: RecMed, recovery of median; RecIQR, recovery of interquartile range; RecRange, recovery of range; RecNCat, recovery of number of unique categories; RecH, recovery of Shannon's Index.

CHAPTER 3

COMPARISON OF LINEAR MIXED MODELS FOR MULTIPLE ENVIRONMENT PLANT BREEDING TRIALS

Carl A. Walker¹, Fabiano Pita², Kimberly Garland Campbell¹,³

¹ Dept. of Crop and Soil Sciences, Washington State University; ² Quantitative Genetics Group, Dow AgroSciences; ³ USDA-ARS, Wheat Genetics, Quality, Physiology, and Disease Research Unit

Abstract

Evaluations of genotypes in varied environmental conditions are referred to as multiple environment trials (MET) and often show genotype by environment interactions, necessitating estimation of effects of genotypes within environments. Empirical best linear unbiased predictions can provide more accurate estimates of these effects, depending upon the mixed model used. The objective of this work was to simulate and analyze MET data sets to determine which linear models provide the most accurate estimates and how the choice of ideal model changes as a result of different MET conditions. Simulations varied in terms of numbers of genotypes and environments, variances and covariances of genotype responses within and between environments, experimental design, and experimental error variance. Simulated MET were fit with mixed models with or without genetic relationship matrices (GRM) and with structures of varying complexity used to model relationships among environments. Estimates from these analyses for effects of genotypes within environments were compared to the simulated values. The model that included a GRM and constant variance-constant correlation structure was the most accurate for the largest number of scenarios. Models including GRM that allowed heterogeneous environmental variances with constant correlations often resulted in greater accuracy when MET were simulated with heterogeneous variances among environments. Factor analytic models with GRM were the most accurate only in a subset of scenarios simulated with complex relationships among environments, 100 or more genotypes, fewer than 40 environments, and low error variance.

Introduction

Evaluations of genotypes in varied environmental conditions are referred to as multiple environment trials (MET), and are used in advanced stages of plant breeding programs to identify genotypes with superior performance across environments and within specific environments or sets of environments. Yield data from MET often show genotype by environment interactions (G × E) and, in practice, are most often analyzed using a two-way analysis of variance (ANOVA) model where genotype, environment, and their interaction are treated as fixed effects:

\[ y_{ijk} = \mu + g_i + e_j + (ge)_{ij} + \varepsilon_{ijk} \]

where $y_{ijk}$ is the yield (or other response variable) of the $k$th replicate of the $i$th genotype in the $j$th environment, $\mu$ is the overall mean, $g_i$ is the fixed effect of the $i$th genotype, $e_j$ is the fixed effect of the $j$th environment, $(ge)_{ij}$ is the interaction between the $i$th genotype and the $j$th environment, and $\varepsilon_{ijk}$ is the experimental error associated with the $ijk$th observation; $i = 1, \ldots, N_g$, $j = 1, \ldots, N_e$, $k = 1, \ldots, N_r$. The estimates of genotype within environment effects are the means across replicates of each genotype in each environment (i.e., cell means). The major disadvantage of this approach is that these cell means estimates are usually based on very little data (dependent on the number of replicates) and so are less predictively accurate than some alternative estimators. Additionally, this approach cannot be used to estimate G × E effects when genotypes are not replicated within environments, since the effects of G × E and experimental error would be confounded. That is, experimental error cannot be separated from the specific effect of each genotype and environment combination.

Various estimators have been shown to be more accurate for MET than cell means.
These include the Additive Main effects Multiplicative Interaction (AMMI) models (Gauch and Zobel, 1988; Gauch, 1988) and sites regression (SREG; Cornelius and Crossa, 1999) model families, which are sometimes referred to as linear-bilinear models. These two fixed-effect model families include sums of multiplicative terms, resulting from singular value decomposition, replacing $(ge)_{ij}$, in the case of AMMI, or $g_i + (ge)_{ij}$ for SREG. The AMMI and SREG models have been shown to be approximately equivalent in terms of predictive accuracy (Cornelius and Crossa, 1999). Like the analysis of G × E in a fixed-effects ANOVA, the standard implementation of these models cannot be used when data from any genotype and environment combination are missing. However, the expectation-maximization algorithm has been used to impute missing data with the AMMI model (Gauch and Zobel, 1990).

As an alternative to the above models that consider genotype effects within environments as fixed, these effects can be considered random values and modeled using linear mixed models, which have important inherent benefits over fixed-effects models. Mixed models can easily incorporate non-constant error variance structures, including within-field spatial correlation, in the same model as genotype and environment effects. Additionally, mixed models easily handle missing data and, with some specific models, even unreplicated data. Some models can predict genotype effects in environments in which the genotypes were not tested. Mixed model analyses have a long history in animal breeding (Henderson, 1973), and recent research has demonstrated new approaches to make them very effective in plant breeding. If a mixed linear model is used, genotype effects are estimated as empirical best linear unbiased predictors (BLUPs) calculated using the estimated variance parameters. A very basic mixed model would assume a random effect of genotypes within environments that has a variance-covariance matrix of $\sigma^2 I$, where $\sigma^2$ is a constant variance parameter and $I$ is an identity matrix.
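As a rough illustration of why BLUPs can be more accurate than cell means, the following minimal Python sketch applies the basic model above to a single balanced environment. It assumes that the variance components are known (in practice they are estimated, which gives empirical BLUPs) and that every genotype has the same number of replicates r; under those assumptions the predictor is the centered cell mean multiplied by var_g / (var_g + var_e / r). The function name, the data, and the variance values are hypothetical and are not taken from this study.

```python
from statistics import mean


def blup_balanced(plots, var_g, var_e):
    """Shrink centered cell means toward zero (balanced data, known variances).

    plots: dict mapping genotype name -> list of replicate yields
           observed in ONE environment (hypothetical input layout).
    var_g: assumed variance of genotype-within-environment effects.
    var_e: assumed experimental error variance.
    """
    rep_counts = {len(values) for values in plots.values()}
    if len(rep_counts) != 1:
        raise ValueError("this sketch assumes equal replication")
    r = rep_counts.pop()

    cell_means = {g: mean(values) for g, values in plots.items()}
    env_mean = mean(list(cell_means.values()))  # estimate of the fixed environment mean

    # Shrinkage factor: close to 1 with many replicates or small error variance.
    shrink = var_g / (var_g + var_e / r)
    return {g: shrink * (m - env_mean) for g, m in cell_means.items()}


# Illustrative data: three genotypes, three replicates each.
plots = {"G1": [5.0, 6.0, 7.0], "G2": [3.0, 4.0, 5.0], "G3": [4.0, 5.0, 6.0]}
blups = blup_balanced(plots, var_g=1.0, var_e=2.0)
```

With three replicates, var_g = 1, and var_e = 2, the shrinkage factor is 0.6, so a cell mean one unit above the environment mean is predicted as an effect of 0.6 rather than 1.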
In most breeding programs, plant or animal, at least a portion of the genotypes assessed in a trial are related and therefore would be expected to show some correlation in their effects. Pedigree information can be incorporated into the model through a Genetic Relationship Matrix (GRM) to take advantage of these relationships and improve predictive accuracy (Henderson, 1973). The GRM is also known as the additive relationship matrix or numerator relationship matrix, and is usually symbolized as $A$, with $A = 2[f_{ii'}]$, where $f_{ii'}$ is the coefficient of parentage or coancestry between genotypes $i$ and $i'$ (Mrode and Thompson, 2005). When a GRM is used in a linear mixed model, the performance of genotypes can be predicted for environments in which they were not grown. The GRM allows the model to use information from related genotypes to predict the unreplicated genotype, because known covariances are modeled between pairs of related genotypes.

Another modification that may improve the predictive accuracy of mixed models is to increase the complexity of the variance-covariance matrix of the random G × E effect beyond $\sigma^2 I_g$ (Piepho, 1994). The matrix can be described as the Kronecker product of two other matrices, such that $G_{ge} = G_e \otimes I_g$, where $I_g$ is an identity matrix with dimensions equal to the number of genotypes, and structures of varying complexity can be used to model $G_e$ (Smith et al., 2001). The structures for $G_e$ reflect patterns of relationships among environments in terms of similar genotype performance. One option for $G_e$ is the factor analytic (FA) structure, which increases in complexity with the number of factors used. When using an FA structure, researchers must choose how many factors to include. More factors allow for greater flexibility, but may reduce model parsimony. These structures are more parsimonious than an unstructured $G_e$ when the number of environments is sufficiently high compared to the number of factors. Factor analytic structures can be combined with pedigree information to improve model fit, as measured by information criteria (Crossa et al., 2006; Oakey et al., 2007; Kelly et al., 2009; Beeck et al., 2010).

Previous studies have analyzed a small number of real MET data sets, which cover only a limited range of scenarios (number of genotypes, number of environments, and relationships among genotypes or environments). A simulation study could determine if the FA model with a GRM is the most effective model for a much wider range of MET.
Additionally, relationships between MET conditions and the best choice of model can be thoroughly investigated, since simulations allow conditions to be changed individually. In this work we simulated MET across the ranges of conditions expected in typical MET, and analyzed these simulations using multiple mixed linear models with different variance-covariance structures for the random effects of genotypes within environments. The objective of this work was to determine which of these models would be most effective in breeding programs by consistently providing the most accurate estimates, and how the ideal model changes as a result of different MET conditions.
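The parsimony argument made above for factor analytic structures can be checked by counting variance parameters. The sketch below is only illustrative: it assumes the usual identifiability convention in which k(k - 1)/2 loadings are constrained when k factors are fitted, a detail this chapter does not state, and the function names are hypothetical.

```python
def n_params_unstructured(n_env):
    """Free parameters in an unstructured n_env x n_env covariance matrix."""
    return n_env * (n_env + 1) // 2


def n_params_fa(n_env, k, heterogeneous=True):
    """Free parameters in a k-factor FA structure.

    Assumes k * (k - 1) / 2 loadings are fixed for identifiability.
    Specific variances: one per environment if heterogeneous, else one in total.
    """
    loadings = n_env * k - k * (k - 1) // 2
    specific = n_env if heterogeneous else 1
    return loadings + specific


# Compare parameter counts for the numbers of environments simulated in this study.
for n_env in (5, 10, 20, 40):
    fa_counts = [n_params_fa(n_env, k) for k in (1, 2, 3)]
    print(n_env, n_params_unstructured(n_env), fa_counts)
```

Under these assumptions, a three-factor structure with heterogeneous specific variances needs more parameters than the unstructured matrix when there are only five environments, but far fewer when there are 40, which is the sense in which FA structures become parsimonious as the number of environments grows.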

Methods

Simulations

Simulations were conducted to generate data sets that resemble MET across a range of conditions. The simulations included randomly generated effects of genotypes within environments and phenotypes of each observation, resulting from the addition of a random experimental error to the genotype within environment effect. These simulated data sets covered a range of scenarios with varying numbers of environments and genotypes, environmental relationship patterns, field trial designs, and magnitudes of experimental error.

Genotype effects within each environment were simulated as random samples from multivariate normal distributions with means of 0 and covariance matrices ($\Sigma_{GE}$) that differed among scenarios. The $\Sigma_{GE}$ are equivalent to correlation matrices multiplied by a constant scalar variance component of unity. The $\Sigma_{GE}$ were the Kronecker (or direct) product of a matrix of correlations between environments ($\Sigma_E$) and a matrix of correlations between genotypes ($\Sigma_G$). The $\Sigma_E$ were generated in four sizes: 5 × 5, 10 × 10, 20 × 20, and 40 × 40, corresponding to scenarios with 5, 10, 20, or 40 environments, respectively. The $\Sigma_E$ were themselves generated as Kronecker products of two matrices: $E_Y \otimes E_L$. The $\Sigma_E$ matrices for five and ten environments were simply products of $E_L$ matrices of size 5 × 5 or 10 × 10 with a 1 × 1 $E_Y$ matrix of unity, whereas the $\Sigma_E$ of 20 and 40 environments were Kronecker products of 4 × 4 and 8 × 8 $E_Y$ matrices with a 5 × 5 $E_L$ matrix. This allows the $\Sigma_E$ to better reflect possible complex patterns of relationships among large numbers of environments. For example, the five and ten environment scenarios reflect MET with five or ten locations in a single year, whereas the 20 and 40 environment scenarios reflect MET with five locations evaluated over four or eight years. The $\Sigma_E$ are grouped into three classes of patterns of correlations: compound symmetry A ($\mathrm{CS}_A$), compound symmetry B ($\mathrm{CS}_B$), and Toeplitz (Toep).
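The Kronecker construction of the environment correlation matrix can be sketched in a few lines of plain Python. This is a minimal illustration rather than the code used in the study: the helper names are hypothetical, and the off-diagonal value of 0.4 in both matrices is a placeholder chosen only to make the resulting pattern easy to read.

```python
def compound_symmetric(n, rho):
    """n x n correlation matrix with 1 on the diagonal and rho elsewhere."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]


def kronecker(a, b):
    """Kronecker (direct) product of two matrices stored as lists of rows.

    Block (i, j) of the result equals a[i][j] * b.
    """
    return [
        [a_ij * b_kl for a_ij in a_row for b_kl in b_row]
        for a_row in a
        for b_row in b
    ]


e_y = compound_symmetric(4, 0.4)  # years (placeholder correlation)
e_l = compound_symmetric(5, 0.4)  # locations (placeholder correlation)
sigma_e = kronecker(e_y, e_l)     # 20 x 20: five locations in each of four years
```

In this example, two environments in the same year but at different locations have correlation 0.4, the same location in different years also has 0.4, and different locations in different years have 0.4 × 0.4 = 0.16, so a multi-year matrix built this way is no longer exactly compound symmetric.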
For the compound symmetric classes both the $E_Y$ and $E_L$ matrices had constant off-diagonal elements of 0.3 and 0.7 for $\mathrm{CS}_A$ and 0.4 and 0.4 for $\mathrm{CS}_B$. Both the $E_Y$ and $E_L$ matrices for the Toeplitz class of patterns had bands of constant correlation parallel to the diagonal with the greatest correlations next to the diagonal. The exact correlations differed by the sizes of $\Sigma_E$, but in all cases included negative values for the element farthest from the diagonal. These specific correlation values are by no means the only correlation values that could occur in a MET, but were chosen to be similar to values observed in the Washington State University soft white wheat variety trials (details provided in the appendix).

The $E_L$ were generated with three levels of variance heterogeneity: homogeneous variances ($\mathrm{CS}_A$, etc.), heterogeneous variances ($\mathrm{CS}_A\mathrm{H}$, etc.), and very heterogeneous variances ($\mathrm{CS}_A\mathrm{VH}$, etc.). To do so, the $E_L$ were pre- and post-multiplied by a diagonal matrix of standard deviations. For the heterogeneous variances, the variances ranged from 0.5 to 1.5 for the least and greatest variances, respectively. The very heterogeneous variances ranged from 0.2 to 2. The three to one ratio is often used as a rule-of-thumb cutoff for considering variances to be heterogeneous, but greater levels of heterogeneity can occur among highly variable environments.

Four options were considered for $\Sigma_G$, corresponding to scenarios of 25, 50, 100, and 150 genotypes. In order to choose $\Sigma_G$, first a GRM was estimated from the pedigree in a Dow AgroSciences early generation study of North American Stiff Stalk maize inbred lines. The four options were overlapping submatrices of this GRM: the first 25, 50, 100, and 150 rows and columns in the GRM.

With four options for sizes of $\Sigma_E$, three classes of patterns for $\Sigma_E$, three levels of variance heterogeneity, and four options for $\Sigma_G$, taking all combinations results in a total of 144 different simulated scenarios for $\Sigma_{GE}$. For each $\Sigma_{GE}$, 1000 sets of genotype effects were generated by randomly sampling from a multivariate normal distribution with variance-covariance matrix equal to $\Sigma_{GE}$.
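The remaining steps of the scenario construction (scaling a correlation matrix by standard deviations, enumerating the 144 scenarios, and drawing correlated effects) can be sketched as follows. This is a simplified illustration under stated assumptions: the intermediate variances between 0.5 and 1.5, the correlation of 0.4, the seed, and all function names are placeholders, and sampling through a Cholesky factor is one standard way to draw from a multivariate normal distribution, not necessarily the method used in the study.

```python
import math
import random
from itertools import product


def compound_symmetric(n, rho):
    """n x n correlation matrix with 1 on the diagonal and rho elsewhere."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]


def scale_to_covariance(corr, sds):
    """Pre- and post-multiply a correlation matrix by diag(sds)."""
    n = len(corr)
    return [[sds[i] * corr[i][j] * sds[j] for j in range(n)] for i in range(n)]


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a positive definite matrix a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_mvn(cov, rng):
    """One draw from MVN(0, cov) via the Cholesky factor."""
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in cov]
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(len(cov))]


# Heterogeneous variances from 0.5 to 1.5 (intermediate values are assumed).
variances = [0.5, 0.75, 1.0, 1.25, 1.5]
sds = [math.sqrt(v) for v in variances]
cov = scale_to_covariance(compound_symmetric(5, 0.4), sds)

# All combinations of the four simulation factors described above.
scenarios = list(product(
    (5, 10, 20, 40),                                       # environments
    ("CS_A", "CS_B", "Toep"),                              # correlation pattern
    ("homogeneous", "heterogeneous", "very heterogeneous"),
    (25, 50, 100, 150),                                    # genotypes
))

rng = random.Random(1)         # arbitrary seed for reproducibility
effects = sample_mvn(cov, rng)  # one genotype's effects in five environments
```

Pre- and post-multiplying by the standard deviations leaves the correlations unchanged while placing the chosen variances on the diagonal, and the product of the four factor levels reproduces the count of 144 scenarios.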
For each set of genotype within environment effects, simulations were generated for three trial designs (randomized complete block designs, RCBD; modified augmented designs, MAD; and unreplicated designs) and two experimental error variances. Since spatial field effects were not considered, the only effect of the experimental design was to determine the number of replicates of each genotype in each environment. Therefore, other designs commonly used in MET will also have either equal or unequal replication, regardless of blocking structure, and so would not add much beyond the designs tested here. For the RCBD scenarios, every genotype was replicated three times. In the MAD scenarios, genotypes were not replicated except for primary and secondary checks that were replicated five and two times, respectively, for every 23 non-check genotypes. In the unreplicated design, each genotype appeared once in each environment.

A fixed effect for each environment was simulated by sampling one value from a gamma (3, 2) distribution and multiplying it by 3. This distribution and constant multiplier were chosen to provide environment effects that are of similar magnitude to the simulated genotype effects and error. Every observation had a unique phenotype equal to the effect of the genotype in an environment plus a simulated environment effect and a random experimental error selected from a normal distribution with mean of 0 and one of two possible error variances ($\sigma^2_e = 0.5$ or $2.0$). These error variances corresponded to ratios of about 1.7 and 0.4, respectively, of the variance of the genotype within environment effects divided by the variance of the experimental error for a given simulation. The ratio values varied slightly around these averages among scenarios, with greater values for scenarios with more environments.

Analyses

A total of 20 related linear mixed models were compared for their ability to predict the simulated genotype effects within environments based on the simulated phenotypic data. Models were fit using the program ASReml-R, release 3.0 (Butler et al., 2009), which is a package for the R programming language (R Development Core Team, 2010).
The models were all of the form:

\[ y = X\beta + Z\gamma + \varepsilon, \]

where $y$ is the vector of observed phenotypes, $\beta$ is a vector of fixed environment effects, $X$ is the associated design matrix, $\gamma$ is the vector of genotype within environment effects, $Z$ is the associated design matrix, and $\varepsilon$ is the vector of experimental error terms. The joint distribution of $\gamma$ and $\varepsilon$ is given by:

\[ \begin{bmatrix} \gamma \\ \varepsilon \end{bmatrix} \sim \mathrm{MVN}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} G & 0 \\ 0 & \sigma^2_R I \end{bmatrix} \right), \]

where $\sigma^2_R$ is a constant error variance and $G$ is a covariance structure that varies for each of the 20 models and is separable: $G = G_E \otimes B$, where both $G_E$ and $B$ were varied, resulting in 20 models. $B = I$ or $A$, where $I$ is an identity matrix and $A$ is a GRM. This study evaluated the ideal situation, when the GRM used perfectly reflects the actual relationships among the genotypes; therefore, $A$ was set equal to the $\Sigma_G$ used to simulate the data set being analyzed.

Ten structures were used to model $G_E$, and these are shown below for a five environment example. The simplest structure was independence (no covariance) and identical variances (ID):

\[ G_E = \sigma^2 I_5. \]

A generalization of this is the diagonal structure, where environments are still independent, but each can have a different variance (Diag):

\[ G_E = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \sigma_3^2, \sigma_4^2, \sigma_5^2). \]

A constant correlation $\rho$ can be added, yielding compound symmetric structures with uniform (CorV) or heterogeneous variances (CorH):

\[ [G_E]_{jj} = \sigma^2, \; [G_E]_{jj'} = \rho\sigma^2 \quad \text{or} \quad [G_E]_{jj} = \sigma_j^2, \; [G_E]_{jj'} = \rho\sigma_j\sigma_{j'}, \quad j \neq j'. \]
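The four non-FA structures described above can be written out for a five environment example as follows. This is a minimal sketch in plain Python with hypothetical function names and parameter values; the element definitions follow the usual meaning of identity, diagonal, and uniform-correlation structures and are an assumption wherever the original equations are not legible.

```python
import math


def g_id(n_env, var):
    """ID: independent environments, one common variance."""
    return [[var if i == j else 0.0 for j in range(n_env)] for i in range(n_env)]


def g_diag(variances):
    """Diag: independent environments, one variance per environment."""
    n = len(variances)
    return [[variances[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def g_corv(n_env, var, rho):
    """CorV: common variance and a single correlation rho."""
    return [[var if i == j else rho * var for j in range(n_env)] for i in range(n_env)]


def g_corh(variances, rho):
    """CorH: one variance per environment and a single correlation rho."""
    sd = [math.sqrt(v) for v in variances]
    n = len(sd)
    return [
        [variances[i] if i == j else rho * sd[i] * sd[j] for j in range(n)]
        for i in range(n)
    ]


# Number of variance parameters in each structure for n environments.
N_PARAMS = {
    "ID": lambda n: 1,
    "Diag": lambda n: n,
    "CorV": lambda n: 2,
    "CorH": lambda n: n + 1,
}

example = g_corh([1.0, 4.0, 1.0, 1.0, 1.0], rho=0.5)  # illustrative values only
```

Setting rho to zero in CorV recovers ID, and setting it to zero in CorH recovers Diag, which is the sense in which a simpler structure is a special case of a more complex one.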

Six FA structures were compared. Structures were fit with one to three factors and uniform or heterogeneous specific variances (FA1U, FA2U, FA3U, FA1H, FA2H, FA3H):

\[ G_E = \Lambda\Lambda^\top + \Psi, \]

where $\Lambda$ is a matrix whose columns are the factors and $\Psi$ is a diagonal matrix of specific variances, either uniform ($\Psi = \psi I$) or heterogeneous ($\Psi = \mathrm{diag}(\psi_1, \ldots, \psi_5)$).

In an effort to improve convergence rates, models were fit sequentially in the order of the $G_E$ structures described above. Parameter estimates from simpler models were used as the starting values of the next more complex structure for which the simple structure was a specific case. If a model did not converge, the next more complex structure was not attempted. The percentage of simulations of a scenario for which a model converged was defined as the convergence rate.

In addition to the mixed linear models, estimates of genotype effects within environments were derived from cell means (the mean of the replicates of a genotype in each environment) and Additive Main effects Multiplicative Interaction (AMMI) models. The AMMI models were fixed effects linear models with main effects for genotype and environment. The effects of genotype by environment interaction were replaced with an approximation of the matrix using a reduced set of the principal components (Gauch, 1988). Only RCBD scenarios were analyzed using the AMMI models, since main effects of genotype and environment cannot be separated from the interaction term if genotypes are not replicated in an environment. The AMMI models were fit with all possible numbers of principal components. The most accurate of the AMMI models for each simulation, as judged by the root mean square error of prediction (described below), were compared to the mixed linear models. However, this was often not the most accurate AMMI model as measured with the correlations between estimates from the model and the simulated data.

To evaluate each model's predictive accuracy, Pearson and Spearman correlations between the estimated effects of genotypes within environments and the simulated effects were calculated for each simulation. Additionally, root mean squared errors of prediction (RMSEP) were calculated as:

\[ \mathrm{RMSEP} = \sqrt{\frac{1}{p} \sum_{i=1}^{p} \left( \tilde{\gamma}_i - \gamma_i \right)^2 }, \]

where $p$ is the number of genotype by environment combinations, $\gamma_i$ is the $i$th effect of a genotype within an environment, and $\tilde{\gamma}_i$ is the $i$th predictor. Both the Akaike and Bayesian information criteria (AIC and BIC, respectively) were also calculated for each of the mixed models. The models were ranked based on each of these accuracy measures and the information criteria within each simulation. Models that were not attempted or did not converge were all given the same rank value, just inferior to that of the least accurate model. The accuracy measurements for each model varied greatly among simulations of the same scenario, whereas the rankings varied to a lesser degree.

In order to summarize the results for each scenario, means were taken, over simulations, of the accuracy measurements and the rankings of the models. For each scenario, a different number of simulations were analyzed with every model to ensure that sufficient simulations were analyzed to produce reliable means.
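The accuracy measures and the within-simulation ranking described above can be sketched in plain Python. This is an illustrative implementation with hypothetical function names and made-up RMSEP values, not the code used in the study; models that failed are represented by None and share the rank just below the least accurate fitted model.

```python
import math


def rmsep(pred, true):
    """Root mean squared error of prediction over p genotype-environment effects."""
    p = len(true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pred, true)) / p)


def pearson(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rank_models(rmsep_by_model):
    """Rank models by RMSEP within one simulation (1 = most accurate).

    A value of None marks a model that was not attempted or did not converge;
    all such models receive the rank just after the least accurate fitted model.
    """
    fitted = sorted((v, m) for m, v in rmsep_by_model.items() if v is not None)
    ranks = {m: r for r, (_, m) in enumerate(fitted, start=1)}
    worst = len(fitted) + 1
    for m, v in rmsep_by_model.items():
        if v is None:
            ranks[m] = worst
    return ranks


# Hypothetical RMSEP values for one simulation.
ranks = rank_models({"GRM_CorV": 0.30, "GRM_CorH": 0.50, "FA2H": None, "FA3H": None})
```

Because correlations are better when larger while RMSEP is better when smaller, ranking by a correlation would sort in the opposite direction; only the RMSEP case is shown here.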
Simulations were analyzed sequentially, and following the analysis of each simulation, rankings of RMSEP, Pearson correlations, and Spearman correlations were averaged over the analyzed simulations. When these three cumulative mean rankings did not change by more than 0.05 from one simulation to the next or over 10 simulations, no additional simulations were analyzed for that scenario. The number of simulations necessary to achieve this stability varied greatly among scenarios. Due to time constraints, the scenarios with the largest data sets were not analyzed and do not appear in the results.

Results and Discussion

Justification of Approach

Since the simulations within each scenario were all random realizations of MET described by the scenario, no particular simulation was more valid than the others, and a mean over all the simulations is a valid summary of model performance for a given scenario. A model can be considered to be the best for a given scenario if it has the greatest accuracy averaged over all simulations. However, using mean accuracy excessively weights performance in simulations that result in extremely low or high accuracy for a given model. Alternatively, the best model might be defined as one that has the greatest accuracy in the most simulations. To evaluate this, model accuracy can be ranked within each simulation, and the mean of the ranks for each model calculated. High accuracy values for a given simulation have no additional influence if the model is already top ranked, but an occasional low ranking can still pull the mean rank down. This approach rewards models that perform well consistently rather than those with inconsistent extraordinary performance. This was the approach that we took to summarize our results. Models were ranked with '1' being the best rank and a greater number indicating worse performance.

The mean values and ranks of all accuracy measurements for each model and scenario are given in Supplemental Table 1. Results from RMSEP, Pearson correlations, and Spearman correlations differ, but conclusions as to the best model were generally consistent after averaging over simulations. Results from the information criteria were highly variable among simulations.
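The contrast between mean accuracy and mean rank, and the stopping rule for the number of simulations, can be illustrated with a short Python sketch. The stopping rule below is one possible reading of the criterion described above (the cumulative mean rank must differ by at most 0.05 both from the previous simulation and from 10 simulations earlier); the RMSEP values, model labels, and function names are hypothetical. In the study the rule was applied jointly to the rankings from RMSEP, Pearson, and Spearman, whereas the sketch follows a single series.

```python
def cumulative_means(values):
    """Running mean after each additional simulation."""
    means, total = [], 0.0
    for n, value in enumerate(values, start=1):
        total += value
        means.append(total / n)
    return means


def simulations_needed(ranks, tol=0.05, window=10):
    """First simulation count at which the cumulative mean rank is stable.

    Assumed reading of the rule: the change from the previous simulation and
    the change over the last `window` simulations must both be at most `tol`.
    Returns None if stability is never reached.
    """
    cum = cumulative_means(ranks)
    for t in range(window, len(cum)):
        step_change = abs(cum[t] - cum[t - 1])
        window_change = abs(cum[t] - cum[t - window])
        if step_change <= tol and window_change <= tol:
            return t + 1
    return None


# Hypothetical RMSEP values for two models over four simulations.
rmsep_a = [0.5, 0.5, 0.5, 3.0]  # usually best, one very poor simulation
rmsep_b = [0.6, 0.6, 0.6, 0.6]  # consistently second best

mean_a = sum(rmsep_a) / len(rmsep_a)
mean_b = sum(rmsep_b) / len(rmsep_b)

# Rank within each simulation (1 = lower RMSEP), then average the ranks.
ranks_a = [1 if a < b else 2 for a, b in zip(rmsep_a, rmsep_b)]
ranks_b = [3 - r for r in ranks_a]
mean_rank_a = sum(ranks_a) / len(ranks_a)
mean_rank_b = sum(ranks_b) / len(ranks_b)
```

In this made-up example the first model has the worse mean RMSEP because of a single extreme simulation, yet the better mean rank because it wins three of the four simulations.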
Additionally, with our simulation approach, the true simulated effects of genotypes within environments were available, allowing for direct comparison of the simulated and estimated values. Therefore, the information criteria were of limited value. The Spearman correlations may be more applicable than the Pearson correlations from the perspective of breeding, since genotypes are often selected based on their rankings, rather than observed values. However, ranking of genotypes is not the only use of estimates of genotype responses in different environments. Since these estimates are also used to evaluate traits like stability, the deviations of estimates from the true values may be more important than how well the ranks agree. The accuracy measured by RMSEP reflects how much estimates deviated from the simulated values, penalizing more extreme values to a greater extent. This was desirable, as extreme errors in estimation are more likely to cause rank changes among genotypes, leading to changes in selection decisions. In the interest of brevity, we limit our discussion to RMSEP accuracy. The important conclusions from these data are summarized below along with illustrative graphs of the data.

Choice of a Default Model

The plot of the mean ranks, in terms of RMSEP, for each model in each scenario showed that the model with the genetic relationship matrix and a constant variance-constant correlation structure (GRM_CorV) was the best model in many, but not all, situations (Figure 1). When the mean rankings of the models were graphed, there was a pattern of nine troughs that correspond to the scenarios with the fewest environments (Figure 1). In these scenarios the mean ranking of the best model was a greater number. This indicates that there was less consistency in the top model rankings among simulations. That is, the best model was not ranked first in all simulations of a scenario, or even necessarily always in the top three. However, no other model did as well overall. Because of this pattern, in order to better visualize the best model for each scenario, we graphed the results by ranking the means of the ranks for each scenario. This standardized the graph in Figure 1, so that the best model was in the top row in each scenario (Figure 2). The top row was predominantly GRM_CorV or GRM_CorH.
To simplify even more, we graphed results from just the GRM_CorV and GRM_CorH models (Figure 3). The points in the top row indicate which of GRM_CorV or GRM_CorH was the best model for a scenario, whereas blanks in the top row indicate that a model other than these two was the best for a scenario. Generally, the GRM_CorV model appears to be a good default choice of model, i.e., the model to use when no additional specific information about a MET is available, since it was often the best ranked model and was always in the top 10 models.

Models for Specific Scenarios

When all scenarios were examined together, it was difficult to determine if there were any patterns as to when GRM_CorV, GRM_CorH, or another model was the best, but subsets of scenarios allowed such evaluations. The GRM_CorV model was the most accurate model in most scenarios with high error variance, whereas the GRM_CorH model was superior in few scenarios and the other models were best only rarely (Figure 4). The GRM_CorV was preferred in fewer of the scenarios with low error variance (Figure 5). In these scenarios the GRM_CorH and other models were the best in more cases. These results indicate that as plot-to-plot error in a MET increases, the effectiveness of complex models for G×E decreases. The increased noise resulted in inaccurate estimation of the many parameters in the complex models. In contrast, the CorV and CorH structures more accurately estimated fewer parameters, compensating for their oversimplification of the pattern of relationships among environments.

One would expect that the G×E structure in the most accurate model would be of similar complexity to the pattern used to simulate relationships among environments. While such relationships occurred, they were not entirely predictive. The GRM_CorV model was the best choice in almost all scenarios simulated with a compound symmetric pattern for G×E (Figure 6). This was to be expected, since the CorV structure exactly matched the simulated pattern for five or 10 environments, and with 20 or 40 environments the simulated pattern only had two possible values that did not dramatically differ from each other. In the scenarios simulated with compound symmetric correlation patterns and heteroskedasticity, one might expect the CorH structure to be the best choice. However, it was only superior to the CorV structure in some scenarios (Figure 7).
An explanation for this is that the CorH structure traded parsimony for flexibility, and in doing so, increased the risk that it modeled noise rather than capturing actual differences in variances. Since the variance heterogeneity was not especially large in these scenarios, there were many occasions where the ability to model this heterogeneity was not valuable. With compound symmetric scenarios that include very heterogeneous environmental variances, the GRM_CorH was generally the best, losing out to other models only in a small number of cases (Figure 8). With these scenarios it was more important to model the greater degree of heterogeneity.

When scenarios were simulated with Toeplitz patterns the GRM_CorV and GRM_CorH models were still generally the best, with GRM_CorH more often superior as heteroskedasticity increased (Figure 9). Models other than GRM_CorV or GRM_CorH were superior in scenarios with Toeplitz patterns, 100 or more genotypes, fewer than 40 environments, and low error variance (Figure 10). In these scenarios, one of the models with a GRM and FA structures was almost always superior to the GRM_CorV or GRM_CorH. This was also true with either 50 genotypes or a high error variance (not both) if a RCBD was used (not shown). Unfortunately, prediction of error variance and patterns of relationships among environments is difficult, especially due to dramatic year-to-year variability.

The number of genotypes, the number of environments in which they were tested, and the replication as determined by the experimental design had limited effect on the choice of the best model. When only 25 genotypes were included, the GRM_CorV model was preferred over the GRM_CorH and other models in most scenarios (Figure 11). Complex relationships in genotype performance among environments were not influential with so few genotypes. Therefore, it was better to just assume constant correlations of genotype performance among environments. The GRM_CorH and more complex models were more frequently ranked better than GRM_CorV as the number of genotypes increased. The reverse pattern was true for numbers of environments. As the number of environments was increased, the effectiveness of the more complex models decreased.
In the MET simulations, the overall range of correlations between pairs of environments was about the same (0.7 to -0.1 for 5 and 10 environments and 0.75 to 0.2 for 20 and 40 environments) for all numbers of environments. As the number of environments was increased, the differences among the correlation parameters decreased, resulting in many correlation parameters with similar values and reducing the benefits of having large numbers of parameters in the model. Such a constraint on the correlations is also likely in reality, unless the large number of environments covers a wider geographic or temporal range that would extend the range of possible correlations beyond those tested here.
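To make the parsimony argument concrete, the sketch below builds the two simple structures named in the text (CorV and CorH) and counts covariance parameters for several structures as the number of environments grows. It is an illustration only: the function names and the labels "FA1" (one-factor model with heterogeneous specific variances) and "US" (unstructured) are assumptions introduced here, and the counts are the standard ones for these structures rather than values taken from the study.

```python
import numpy as np

def cor_v(n_env, variance, correlation):
    """Constant variance, constant correlation (compound symmetric) structure."""
    corr = np.full((n_env, n_env), correlation, dtype=float)
    np.fill_diagonal(corr, 1.0)
    return variance * corr

def cor_h(variances, correlation):
    """One variance per environment with a single common correlation."""
    sd = np.sqrt(np.asarray(variances, dtype=float))
    corr = np.full((sd.size, sd.size), correlation, dtype=float)
    np.fill_diagonal(corr, 1.0)
    return corr * np.outer(sd, sd)

def n_parameters(structure, n_env):
    """Number of covariance parameters for n_env environments (hypothetical labels)."""
    counts = {
        "CorV": 2,                       # one variance, one correlation
        "CorH": n_env + 1,               # n_env variances, one correlation
        "FA1": 2 * n_env,                # n_env loadings, n_env specific variances
        "US": n_env * (n_env + 1) // 2,  # every variance and covariance
    }
    return counts[structure]

# Parameter counts for the numbers of environments used in the simulations.
for n_env in (5, 10, 20, 40):
    row = {name: n_parameters(name, n_env) for name in ("CorV", "CorH", "FA1", "US")}
    print(n_env, row)
```

The count for the simplest structure stays fixed while the unstructured count grows quadratically, which is consistent with the observation that additional parameters bring little benefit when many of the underlying correlations are similar.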

Experimental design, and by extension the level of replication, had a limited effect on choice of model. FA structures were more accurate with the increased replication of the RCBD designs than in scenarios with less replication (Figure 12). This is again a case where more data were available to improve the accuracy with which greater numbers of parameters are estimated.

It is also informative to look at only the models that did not incorporate a GRM, since in some cases researchers may not have the ability to estimate a GRM. In this study we assumed the ideal case, where the GRMs used in the analyses exactly matched those used in the simulations, but in reality, GRMs are estimated with error. The CorV and CorH models were often the best options for non-GRM models, with the CorH model often preferable in scenarios with variance heterogeneity (Figure 13). The FA models were often more accurate for the Toeplitz pattern scenarios, especially for scenarios with low error variance. Therefore, it appeared that the preferred G×E structures were similar for the non-GRM and GRM models for the same scenarios.

Discussion

Although we endeavored to simulate a range of scenarios that cover most MET, the scenarios may not all be equally likely. Most MET do not test large numbers of genotypes in many environments, except over multiple years. Even when data from multiple years and locations are analyzed, rarely are the same genotypes tested over all environments. Therefore, our simulations with both 150 genotypes and 40 environments are relatively unimportant. On the other hand, small numbers of genotypes are often tested over many locations and years, and large numbers of genotypes are often tested in few environments. The actual incidence of the various patterns for Σ_E is difficult to determine, since a thorough analysis, preferably with cross validation, is necessary to even estimate the manner in which test environments are related in a given MET.
The simulated patterns for Σ_E are similar to what we would expect for most MET. Some MET include environments that are all fairly similar in terms of genotype performance. For example, elite cultivars are usually ranked similarly in multiple locations in the same region in a single year. Such MET would have an underlying pattern of environmental relationships similar to a compound symmetric pattern. MET that include environments that are very dissimilar to the others tested may be more common. This might occur when a MET includes one year when the weather was different than normal. The Toeplitz patterns for Σ_E provide a wide range of correlations, including negative values. One might argue that our simulations should have included scenarios with independence among environments; however, when modeling genotype effects within environments (without fixed genotype effects), independence would only occur if genotype performance in one environment had no relationship to performance in another. This independence would only occur in reality if the test environments had completely different climates and/or the genotypes tested only differed in terms of specific response to those environments.

Other researchers have previously compared the same mixed models as in this study, usually fit to real data sets. Their results generally match ours for the most closely matching simulations. Piepho (1998) observed that, for five MET, empirical BLUPs based on factor analytic structures were more accurate than least squares estimates of cell means and usually better than predictions from unstructured covariance matrices. We also saw greater accuracy from factor analytic models than the cell means. Crossa et al. (2006) evaluated a wide range of mixed linear models on a single wheat MET with 29 genotypes, 16 locations throughout the world, and a RCBD with three replicates. These researchers concluded that a nine factor, heterogeneous specific variance, factor analytic model with a GRM derived from a pedigree best fit the data as measured by various information criteria. The factor analytic model was superior to a range of models including those with constant environmental correlation.
The trial they analyzed is most comparable to our scenarios with 25 genotypes, 10 or 20 environments, and RCBD designs. For these scenarios, we did not find that FA structure models with GRM were superior to GRM_CorV or GRM_CorH models as measured by average RMSEP ranking. However, rankings of the models based on information criteria scores were highly variable among simulations, suggesting that our differences in conclusions may be due to the chance conditions specific to the single trial investigated by Crossa et al.

Other researchers have also found that FA models resulted in better AIC scores than simpler or more complex structures when fit to real data sets (Kelly et al., 2007; Oakey et al., 2007; Beeck et al., 2010). Kelly et al. (2007) also used RMSEP values from the analysis of simulated MET to show that FA or unstructured environmental covariance models were superior to constant correlation models, but did not evaluate GRM. These authors all evaluated real or simulated data sets which included relatively large numbers of genotypes, and their results agreed with ours for simulations with 100 or more genotypes with Toeplitz patterns for Σ_E. It is likely that the sets of environments these researchers evaluated had complex relationship patterns similar to our Toeplitz patterns, and the simulations of Kelly et al. had heterogeneous environmental correlations, making them similar to our Toeplitz patterns. So and Edwards (2011) evaluated 51 maize MET, each with 6 environments (two years) and 187 to 386 genotypes only partially overlapping between years. They fit these 51 MET with mixed models that did not include GRM and included various environmental covariance matrices for five of the six environments, and performance was predicted in the sixth environment for purposes of cross validation. These authors found that models that allowed for heterogeneous genetic covariances were often inferior to compound symmetric models, in agreement with our results from simulations.

Our study and those of other researchers have generally focused on model fit or predictive accuracy of empirical BLUPs, but MET are also analyzed to estimate relationships among environments. Identification of highly correlated environments allows researchers to save resources by avoiding redundant locations. Alternatively, information about highly unrelated environments can aid interpretation of results from these esoteric locations or years.
Although our analyses indicated that simpler models often provided more accurate estimates, a simple structure for G×E does not allow the estimation of different parameters (variances or covariances) for different environments. In situations where it is important to estimate these differences, factor analytic structures are preferable even if some accuracy is sacrificed.

Conclusions

The most accurate models for analyzing MET always included a GRM, but differed among simulated scenarios in terms of the ideal G×E structure for estimating relationships in genotype performance among environments. Complex structures that allow for heterogeneity of environmental variances or correlations were only successful when the pattern used to simulate the data also had heterogeneous variances or correlations. Even when this was true, if error variance was high or the MET had few genotypes, simpler models were often more accurate. These results suggest that while GRM should be used when available, complex structures for environmental relationships, such as FA, should only be used when evaluating 100 or more genotypes and a complex underlying structure is expected along with low error variance.

Appendix: Real Data as a Basis for Simulations

A simulation study is only effective if the simulations reflect the actual conditions that are being simulated. Obviously, the actual biology of MET is more complex than our simulations, but a properly tuned simulation should be able to approximate actual MET data. In order to create simulations that effectively reflect actual MET in breeding programs, the parameters used in our simulations were chosen based on the analysis of real MET. Yield data from Washington State University soft white wheat variety improvement trials were used as an example MET. The variety trials were grown in RCBD with 4 blocks at each location between 2002 and We limited our analyses to these years to allow the use of a single error variance structure for every year. Although 116 genotypes were evaluated over this time period, the genotypes tested varied among years as only 48 to 54 genotypes were tested each year. Trials were conducted at 21 different locations throughout Washington, but trials were only grown in 18 to 20 locations each year.
Rainfall patterns vary greatly across the state, and the testing locations cover the range of precipitation zones where wheat is grown in Washington (Appendix Figure 1). Such a large data set prevented the use of a complex model to fit the entire data set. Instead, various overlapping subsets of environments were evaluated, for example: all locations in a single year; four years of data from five locations covering the range of precipitation zones. This allowed us to fit complex models and additionally allowed this MET to approximate a range of smaller MET.

We fit a range of linear mixed models to yield data from these trials, and the models were all of the same basic form as used to analyze the simulated data:

$$y = X\beta + Z\gamma + \varepsilon,$$

where $y$ is the vector of observed phenotypes, $\beta$ is a vector of fixed environment effects, $X$ is the associated design matrix, $\gamma$ is the vector of genotype within environment effects, $Z$ is the associated design matrix, and $\varepsilon$ is the vector of experimental error terms. The joint distribution of $\gamma$ and $\varepsilon$ is given by:

$$\begin{bmatrix} \gamma \\ \varepsilon \end{bmatrix} \sim \mathrm{MVN}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} G & 0 \\ 0 & R \end{bmatrix} \right),$$

where $R$ is either a matrix with no covariance and constant variance or variances that are heterogeneous across environments and $G$ is a covariance structure that varies and is separable:

$$G = G_E \otimes I,$$

where $I$ is of size equal to the number of genotypes, which are assumed independent, and $G_E$ is one of seven structures (examples of the first six are provided in the Methods section): independence and identical variance, independence and heteroskedasticity, constant correlation and constant or heterogeneous variances, a one factor FA model with constant or heterogeneous specific variances, and an unstructured model with heterogeneous variances and covariances.

Parameters of $G_E$ and $R$ were estimated for each structure and data set, and parameter estimates from all data sets were used as a baseline for determining ranges of parameters used in simulations. Values of $\sigma^2_g$ and $\sigma^2_e$ were estimated for constant genetic variance structures and the ratios $\sigma^2_g / \sigma^2_e$ ranged from 0.4 to.98 depending on which subset of the data was analyzed. These estimated values were used to choose error variances of 0.5 and 2.0 to use in simulations to achieve similar values of $\sigma^2_g / \sigma^2_e$. Multiple genetic variances were estimated when structures allowed heteroskedasticity. Ratios of the greatest to the least genetic variance ranged from 3:1 to 56:1. Genetic correlations were estimated from models with constant correlations and ranged from 0.21 to When estimated from models with unstructured covariance ratios, the maximum correlation estimated between a pair of environments ranged from 0.71 to 0.89 depending on data subset, whereas the minimums ranged from to It is important to note that the most extreme values for these parameters were estimated for the data subsets that included multiple years and a small number of locations that covered a range of precipitation levels. Generally, cultivars are rarely bred for such a range of climates and so most MET do not cover such wide ranges of conditions. Therefore, our simulations were conducted with less extreme values for correlations and variance heterogeneity ratios.

References

Beeck, C.P., W.A. Cowling, A.B. Smith, and B.R. Cullis Analysis of yield and oil from a series of canola breeding trials. Part I. Fitting factor analytic mixed models with pedigree information. Genome. 53:

Butler, D.G., B.R. Cullis, A.R. Gilmour, and B.J. Gogel Mixed models for S language environments ASReml-R reference manual.

Cornelius, P.L., and J. Crossa Prediction assessment of shrinkage estimators of multiplicative models for multi-environment cultivar trials. Crop Science. 39(4):

Crossa, J., J. Burgueño, P.L. Cornelius, G. McLaren, R. Trethowan, and A. Krishnamachari Modeling genotype × environment interaction using additive genetic covariances of relatives for predicting breeding values of wheat genotypes. Crop Science. 46(4):

Gauch, H.G Model selection and validation for yield trials with interaction. Biometrics. 44(3):

Gauch, H.G., and R.W. Zobel Predictive and postdictive success of statistical analyses of yield trials. Theoret. Appl. Genetics. 76(1):

Gauch, H.G., and R.W. Zobel Imputing missing yield trial data. Theoret. Appl. Genetics. 79(6):

Henderson, C.R Sire evaluation and genetic trends. J. Anim Sci. 1973(Symposium):

Kelly, A., B.R. Cullis, A.R. Gilmour, J.A. Eccleston, and R. Thompson Estimation in a multiplicative mixed model involving a genetic relationship matrix. Genetics Selection Evolution. 41:

Kelly, A.M., A.B. Smith, J.A. Eccleston, and B.R. Cullis The Accuracy of Varietal Selection Using Factor Analytic Models for Multi-Environment Plant Breeding Trials. Crop Science. 47(3):

Mrode, R.A., and R. Thompson Linear models for the prediction of animal breeding values. 2nd ed. CABI, Cambridge, MA.

Oakey, H., A.P. Verbyla, B.R. Cullis, X. Wei, and W.S. Pitchford Joint modeling of additive and non-additive (genetic line) effects in multi-environment trials. Theoretical and Applied Genetics. 114:

Piepho, H.-P Best Linear Unbiased Prediction (BLUP) for regional yield trials: a comparison to additive main effects and multiplicative interaction (AMMI) analysis. Theoret. Appl. Genetics. 89(5).

Piepho, H.-P Empirical best linear unbiased prediction in cultivar trials using factor-analytic variance-covariance structures. TAG Theoretical and Applied Genetics. 97(1-2):

R Development Core Team R: A language and environment for statistical computing. Available at

Smith, A., B. Cullis, and R. Thompson Analyzing variety by environment data using multiplicative mixed models and adjustments for spatial field trend. Biometrics. 57(4):

So, Y.-S., and J. Edwards Predictive Ability Assessment of Linear Mixed Models in Multienvironment Trials in Corn. Crop Science. 51(2):

Figure 1. Means, over simulations, of model ranks, where models were ranked in terms of RMSEP within each simulation. All scenarios evaluated are included, and index denotes each scenario's position in the order. Scenarios are ordered CS A, CS A H, CS A VH, CS B, CS B H, CS B VH, Toep, ToepH, and then ToepVH, with the indices of the final scenarios of each group equal to 76, 154, 230, 304, 380, 456, 532, 608, and 682, respectively. Within each of these patterns, numbers of environments are ordered 5, 10, 20, and then 40 environments. Within each number of environments, the numbers of genotypes are ordered 25, 50, 100, and then 150 genotypes. Within each number of genotypes, the experimental designs are ordered RCBD, MAD, and then unreplicated designs. Within each design, error variances are ordered 0.5 then 2.0.

Figure 2. A standardized version of Figure 1, where models have been ranked within each scenario in terms of their mean ranks. The order of scenarios is the same as Figure 1.

Figure 3. The same as Figure 2, but only the models GRM_CorV and GRM_CorH. The order of scenarios is the same.

Figure 4. Equivalent to Figure 3, with only scenarios with high (2.0) error variance included. Scenarios are ordered CS A, CS A H, CS A VH, CS B, CS B H, CS B VH, Toep, ToepH, and then ToepVH, with the indices of the final scenarios of each group equal to 39, 78, 116, 154, 192, 230, 268, 306, and 343, respectively. Within each of these patterns, numbers of environments are ordered 5, 10, 20, and then 40 environments. Within each number of environments, the numbers of genotypes are ordered 25, 50, 100, and then 150 genotypes. Within each number of genotypes, the experimental designs are ordered RCBD, MAD, and then unreplicated designs.

Figure 5. Equivalent to Figure 3, with only scenarios with low (0.5) error variance included. Scenarios are ordered CS A, CS A H, CS A VH, CS B, CS B H, CS B VH, Toep, ToepH, and then ToepVH, with the indices of the final scenarios of each group equal to 37, 76, 114, 150, 188, 226, 264, 302, and 339, respectively. Within each of these patterns, numbers of environments are ordered 5, 10, 20, and then 40 environments. Within each number of environments, the numbers of genotypes are ordered 25, 50, 100, and then 150 genotypes. Within each number of genotypes, the experimental designs are ordered RCBD, MAD, and then unreplicated designs.

Figure 6. Equivalent to Figure 3, only including scenarios simulated with a compound symmetric pattern of relationships among environments. Scenarios are ordered CS A, then CS B, with the indices of the final scenarios of each group equal to 76 and 150, respectively. Within each of these patterns, numbers of environments are ordered 5, 10, 20, and then 40 environments. Within each number of environments, the numbers of genotypes are ordered 25, 50, 100, and then 150 genotypes. Within each number of genotypes, the experimental designs are ordered RCBD, MAD, and then unreplicated designs. Within each design, error variances are ordered 0.5 then 2.0.

Figure 7. Equivalent to Figure 3, only including scenarios simulated with a compound symmetric pattern of correlations among environments and heterogeneous variances of genotype effects within environments. Scenarios are ordered CS A H, then CS B H, with the indices of the final scenarios of each group equal to 78 and 154, respectively. Within each of these patterns, numbers of environments are ordered 5, 10, 20, and then 40 environments. Within each number of environments, the numbers of genotypes are ordered 25, 50, 100, and then 150 genotypes. Within each number of genotypes, the experimental designs are ordered RCBD, MAD, and then unreplicated designs. Within each design, error variances are ordered 0.5 then 2.0.

Figure 8. Equivalent to Figure 3, only including scenarios simulated with a compound symmetric pattern of correlations among environments and extremely heterogeneous variances of genotype effects within environments. Scenarios are ordered CS A VH, then CS B VH, with the indices of the final scenarios of each group equal to 76 and 152, respectively. Within each of these patterns, numbers of environments are ordered 5, 10, 20, and then 40 environments. Within each number of environments, the numbers of genotypes are ordered 25, 50, 100, and then 150 genotypes. Within each number of genotypes, the experimental designs are ordered RCBD, MAD, and then unreplicated designs. Within each design, error variances are ordered 0.5 then 2.0.

Figure 9. Equivalent to Figure 3, only including scenarios simulated with a Toeplitz pattern of correlations among environments. Scenarios are ordered Toep, ToepH, and then ToepVH, with the indices of the final scenarios of each group equal to 76, 152, and 226, respectively. Within each of these patterns, numbers of environments are ordered 5, 10, 20, and then 40 environments. Within each number of environments, the numbers of genotypes are ordered 25, 50, 100, and then 150 genotypes. Within each number of genotypes, the experimental designs are ordered RCBD, MAD, and then unreplicated designs. Within each design, error variances are ordered 0.5 then 2.0.

Figure 10. Equivalent to Figure 3, only including scenarios simulated with a Toeplitz pattern of correlations among environments, 100 or 150 genotypes, 5 to 20 environments, and low (0.5) error variance. Scenarios are ordered Toep, ToepH, and then ToepVH, with the indices of the final scenarios of each group equal to 14, 29, and 43, respectively. Within each of these patterns, numbers of environments are ordered 5, 10, and then 20 environments. Within each number of environments, the numbers of genotypes are ordered 100 and then 150 genotypes. Within each number of genotypes, the experimental designs are ordered RCBD, MAD, and then unreplicated designs.

Figure 11. Equivalent to Figure 3, only including scenarios simulated with 25 genotypes. Scenarios are ordered CS A, CS A H, CS A VH, CS B, CS B H, CS B VH, Toep, ToepH, and then ToepVH, with the indices of the final scenarios of each group equal to 24, 48, 72, 96, 120, 144, 168, 192, and 216, respectively. Within each of these patterns, numbers of environments are ordered 5, 10, 20, and then 40 environments. Within each number of environments, the experimental designs are ordered RCBD, MAD, and then unreplicated designs. Within each design, error variances are ordered 0.5 then 2.0.

Figure 12. Equivalent to Figure 3, only including scenarios simulated with MAD or unreplicated designs. Scenarios are ordered CS A, CS A H, CS A VH, CS B, CS B H, CS B VH, Toep, ToepH, and then ToepVH, with the indices of the final scenarios of each group equal to 50, 102, 154, 203, 255, 307, 357, 409, and 461, respectively. Within each of these patterns, numbers of environments are ordered 5, 10, 20, and then 40 environments. Within each number of environments, the numbers of genotypes are ordered 25, 50, 100, and then 150 genotypes. Within each number of genotypes, the experimental designs are ordered MAD, and then unreplicated designs. Within each design, error variances are ordered 0.5 then 2.0.

Figure 13. A standardized version of Figure 1, where only models not including GRM have been ranked within each scenario in terms of their mean ranks. The order of scenarios is the same.
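The nested scenario ordering used in the figure captions (pattern, then number of environments, then number of genotypes, then design, then error variance, with the last factor changing fastest) can be sketched as below. This is only an illustration of the ordering: the variable and function names are hypothetical, and a full crossing of the factor levels gives 864 combinations, whereas the captions report a final index of 682 in Figure 1, so the study evidently did not include every combination and the positions produced here are not the published indices.

```python
from itertools import product

# Factor levels as listed in the figure captions, in caption order.
PATTERNS = ["CS A", "CS A H", "CS A VH", "CS B", "CS B H", "CS B VH",
            "Toep", "ToepH", "ToepVH"]
ENVIRONMENTS = [5, 10, 20, 40]
GENOTYPES = [25, 50, 100, 150]
DESIGNS = ["RCBD", "MAD", "unreplicated"]
ERROR_VARIANCES = [0.5, 2.0]

def full_factorial_order():
    """All factor combinations, with the rightmost factor varying fastest."""
    return list(product(PATTERNS, ENVIRONMENTS, GENOTYPES, DESIGNS, ERROR_VARIANCES))

scenarios = full_factorial_order()
print(len(scenarios))   # size of the full crossing
print(scenarios[:3])    # first few scenarios in caption order
```

Subsetting this list (for example, keeping only tuples whose error variance is 2.0) preserves the relative order, which mirrors how the later figures are described as subsets of Figure 3 with the same ordering.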

Conifer Translational Genomics Network Coordinated Agricultural Project

Conifer Translational Genomics Network Coordinated Agricultural Project Conifer Translational Genomics Network Coordinated Agricultural Project Genomics in Tree Breeding and Forest Ecosystem Management ----- Module 4 Quantitative Genetics Nicholas Wheeler & David Harry Oregon

More information

Why do we need statistics to study genetics and evolution?

Why do we need statistics to study genetics and evolution? Why do we need statistics to study genetics and evolution? 1. Mapping traits to the genome [Linkage maps (incl. QTLs), LOD] 2. Quantifying genetic basis of complex traits [Concordance, heritability] 3.

More information

POPULATION GENETICS Winter 2005 Lecture 18 Quantitative genetics and QTL mapping

POPULATION GENETICS Winter 2005 Lecture 18 Quantitative genetics and QTL mapping POPULATION GENETICS Winter 2005 Lecture 18 Quantitative genetics and QTL mapping - from Darwin's time onward, it has been widely recognized that natural populations harbor a considerably degree of genetic

More information

QTL Mapping Using Multiple Markers Simultaneously

QTL Mapping Using Multiple Markers Simultaneously SCI-PUBLICATIONS Author Manuscript American Journal of Agricultural and Biological Science (3): 195-01, 007 ISSN 1557-4989 007 Science Publications QTL Mapping Using Multiple Markers Simultaneously D.

More information

Introduction to Quantitative Genetics

Introduction to Quantitative Genetics Introduction to Quantitative Genetics Fourth Edition D. S. Falconer Trudy F. C. Mackay PREFACE TO THE THIRD EDITION PREFACE TO THE FOURTH EDITION ACKNOWLEDGEMENTS INTRODUCTION ix x xi xiii f GENETIC CONSTITUTION

More information

Genetics of dairy production

Genetics of dairy production Genetics of dairy production E-learning course from ESA Charlotte DEZETTER ZBO101R11550 Table of contents I - Genetics of dairy production 3 1. Learning objectives... 3 2. Review of Mendelian genetics...

More information

A Primer of Ecological Genetics

A Primer of Ecological Genetics A Primer of Ecological Genetics Jeffrey K. Conner Michigan State University Daniel L. Hartl Harvard University Sinauer Associates, Inc. Publishers Sunderland, Massachusetts U.S.A. Contents Preface xi Acronyms,

More information

Strategy for Applying Genome-Wide Selection in Dairy Cattle

Strategy for Applying Genome-Wide Selection in Dairy Cattle Strategy for Applying Genome-Wide Selection in Dairy Cattle L. R. Schaeffer Centre for Genetic Improvement of Livestock Department of Animal & Poultry Science University of Guelph, Guelph, ON, Canada N1G

More information

Introduction to Quantitative Genomics / Genetics

Introduction to Quantitative Genomics / Genetics Introduction to Quantitative Genomics / Genetics BTRY 7210: Topics in Quantitative Genomics and Genetics September 10, 2008 Jason G. Mezey Outline History and Intuition. Statistical Framework. Current

More information

Quantitative Genetics
