Accuracy of genomic prediction in synthetic populations depending on the number of parents, relatedness and ancestral linkage disequilibrium


Genetics: Early Online, published on November 9, 2016 as /genetics

Accuracy of genomic prediction in synthetic populations depending on the number of parents, relatedness and ancestral linkage disequilibrium

Pascal Schopp*,1, Dominik Müller*,1, Frank Technow*, Albrecht E. Melchinger*

September 29,

*Institute of Plant Breeding, Seed Science and Population Genetics, University of Hohenheim, Stuttgart, Germany
1 These authors contributed equally to this work

Copyright 2016.

Running Head: Genomic prediction in synthetics

Key Words: genomic prediction, synthetic populations, GBLUP, genetic relationships, linkage disequilibrium

Corresponding Author: A.E. Melchinger, Institute of Plant Breeding, Seed Science and Population Genetics, University of Hohenheim, Fruwirthstr. 21, Stuttgart 70599, GERMANY. Tel.: Fax.:

ABSTRACT

Synthetics play an important role in quantitative genetic research and plant breeding, but few studies have investigated the application of genomic prediction (GP) to these populations. Synthetics are generated by intermating a small number of parents (N_P) and thereby possess unique genetic properties, which make them especially suited for systematic investigations of factors contributing to the accuracy of GP. We generated synthetics in silico from N_P = 2 to 32 maize (Zea mays L.) lines taken from an ancestral population with either short- or long-range linkage disequilibrium (LD). In eight scenarios differing in relatedness of the training and prediction sets and in the types of data used to calculate the relationship matrix (QTL, SNPs, tag markers, pedigree), we investigated the prediction accuracy of GBLUP and analyzed contributions from pedigree relationships captured by SNP markers as well as from co-segregation and ancestral LD between QTL and SNPs. The effects of training set size N_TS and marker density were also studied. Sampling few parents (2 ≤ N_P < 8) generates substantial sample LD that carries over into synthetics through co-segregation of alleles at linked loci. For fixed N_TS, N_P influences prediction accuracy most strongly. If the training and prediction sets are related, using N_P < 8 parents yields high prediction accuracy regardless of ancestral LD, because SNPs capture pedigree relationships and Mendelian sampling through co-segregation. As N_P increases, ancestral LD contributes more information, while other factors contribute less due to lower frequencies of closely related individuals. For unrelated prediction sets, only ancestral LD contributes information, and accuracies were poor and highly variable for N_P ≤ 4 due to large sample LD. For large N_P, achieving moderate accuracy requires large N_TS, long-range ancestral LD and high marker density. Our approach for analyzing prediction accuracy in synthetics provides new insights into the prospects of GP for many types of source populations encountered in plant breeding.

INTRODUCTION

Synthetic populations, known as synthetics, have played an important role in quantitative-genetic research on gene action in complex heterotic traits and comparison of selection methods (cf. Hallauer et al. 2010). In many crops, synthetics also serve as cultivars in agricultural production or as source populations for recurrent selection programs (cf. Bradshaw 2016). Synthetics are usually created by crossing a small number of parents (N_P) and subsequently cross-pollinating the F1 individuals for one or several generations (Falconer and Mackay 1996). A prominent example is the Iowa Stiff Stalk Synthetic (BSSS) generated from 16 parents of maize, from which numerous successful elite inbred lines such as B73 have been derived (Hagdorn et al. 2003). Further examples of synthetics include composite crosses (Suneson 1956) and multi-parental advanced inter-cross (MAGIC, see Table S1 for list of abbreviations) populations (Cavanagh et al. 2008) advocated for breeding purposes in crops (Bandillo et al. 2013). Importantly, two-way and four-way crosses, widely employed as source material in recycling breeding (Mikel and Dudley 2006), can be viewed as special cases of synthetics with N_P = 2 and 4, respectively.

Genomic prediction (GP), proposed by Meuwissen et al. (2001), led to a paradigm shift in animal breeding during the past decade (Hayes et al. 2009a; de Koning 2016) and has also been widely adopted in plant breeding (Lin et al. 2014). In cattle breeding, GP is predominantly applied within closed breeds, and training sets (TS) commonly encompass thousands of individuals. By comparison, in plant breeding the TS sizes are much smaller (e.g., hundreds of individuals or fewer) and populations are usually structured into multiple segregating families or subpopulations. Numerous studies addressed the implementation of GP in structured plant breeding populations (cf. Lorenzana and Bernardo 2009; Albrecht et al. 2011; Lehermeier et al. 2014; Technow and Totir 2015), but systematic investigations on the prospects of GP in synthetics are lacking so far, although synthetics were proposed as particularly suitable source material for recurrent genomic selection (Windhausen et al. 2012; Gorjanc et al. 2016).

Genomic best linear unbiased prediction (GBLUP), a modification of the traditional pedigree BLUP devised by Henderson (1984), is a widely used method to implement GP in animal and plant breeding (Mackay et al. 2015). Here, the pedigree relationship matrix is replaced by a marker-derived genomic relationship matrix to estimate actual relationships at QTL (Hayes et al. 2009c). The success of this approach depends on three sources of information, namely (i) pedigree relationships captured by markers, (ii) co-segregation of QTL and markers and (iii) population-wide linkage disequilibrium between QTL and markers (Habier et al. 2007, 2013; Wientjes et al. 2013). In classical quantitative genetics, pedigree relationships between individuals are calculated as twice the probability of identity-by-descent (IBD) of alleles at a locus, conditional on their pedigree (Wright 1922; Falconer and Mackay 1996). However, because of Mendelian sampling, actual IBD relationships at QTL deviate from pedigree relationships, which correspond to expected IBD relationships (Hill and Weir 2011). In GP, pedigree relationships are captured best with a large number of stochastically independent markers (Habier et al. 2007), whereas capturing the Mendelian sampling term requires co-segregation of QTL and markers (Hayes et al. 2009c; Habier et al. 2013). In pedigree analysis, the founders of the pedigree are by definition assumed to be unrelated (i.e., IBD equal to zero), but in reality there usually exist latent similarities at QTL that contribute to variation in identity-by-state (IBS) relationships among these individuals. Markers can capture these IBS relationships if they are in population-wide LD with the QTL in an ancestral population of founders. Thus, ancestral LD between QTL and markers also provides information for individuals that are unrelated by pedigree to the TS (Wientjes et al. 2013; Habier et al. 2013).
Ancestral LD generally results from various population-historic processes like mutation, drift and selection (Flint-Garcia et al. 2003) and varies within species primarily due to different bottlenecks imposed by artificial selection or population admixture (Hill 1981; Hartl and Clark 2007). The influence of different levels of ancestral LD on prediction accuracy (PA) in synthetics and related types of populations has so far received little attention.

The contributions of the three sources of information to PA were demonstrated in theory and simulations by Habier et al. (2013) using half-sib families in cattle breeding and multiple biparental (full-sib) families in maize breeding, where both of these examples consisted of numerous families derived from a large number of parents. However, it is unclear whether these results generalize to other breeding situations, in particular those involving only few parents. In such situations, diverse relationship patterns are generated, new statistical associations between loci arise due to sampling, and ancestral LD might be only partially present in the progeny. These factors are expected to profoundly affect the contributions of the three sources of information to PA and thus affect the application of GP to related and unrelated genotypes.

Synthetics represent an ideal framework for examining the influence of these factors on PA, because the number of parents used for producing them can be varied over a wide range. Here, we simulated two ancestral populations differing substantially in their LD and analyzed synthetics generated from different numbers of parents under eight scenarios that enabled dissecting the factors contributing to PA. The objectives of this study were to (i) examine how PA in synthetics depends on the number of parents and LD in the ancestral population, (ii) assess the importance of the three sources of information for PA and how they are influenced by training set size and marker density, and (iii) analyze the relationship of LD between QTL and markers among the ancestral population, parents, and the synthetics generated from them. Finally, we discuss how our approach provides a general framework for analyzing the factors influencing PA and we draw inferences on the prospects of GP in other scenarios encountered in breeding.

METHODS

Genome properties and genetic map: We used maize (Zea mays L.) as a model species in our study. Physical map positions of the 56K Illumina maize SNP BeadChip were used to account for the markedly reduced recombination rate and lower marker density in the centromere regions (McMullen et al. 2009). These positions were converted into genetic map positions required for simulating meiosis events (File S1). In total we obtained 37,286 SNPs distributed over the 10 chromosomes of length 276, 200, 193, 188, 221, 171, 203, 173, 151 and 137 cM (1913 cM in total), corresponding to an average marker density of 24.4 SNPs cM⁻¹. All subsequent meiosis events were simulated using the count-location model without crossover interference, where the number of chiasmata was drawn from a Poisson distribution with parameter λ equal to the chromosome length in Morgan, and where crossover positions were sampled from a uniform distribution across the chromosome.

Simulation of ancestral populations: Two ancestral populations that differed substantially in their level and decay of LD (LD_A, Figure 1) were simulated with the software QMSim (Sargolzaei and Schenkel 2009). Ancestral population LR displayed extensive long-range LD_A, whereas SR displayed only short-range LD_A. The simulation of LR was carried out by closely following Habier et al. (2013) and involved the following steps (Figure S1): First, we generated an initial population of 1,500 diploid individuals by sampling alleles at each (biallelic) locus independently from a Bernoulli distribution with probability 0.5. Second, 5,000 loci were randomly sampled from all SNPs and henceforth interpreted as QTL; all remaining loci were considered as SNP markers.
Third, these individuals were randomly mated for 3,000 generations using a constant population size of 1,500 and a mutation rate of [value missing]. Fourth, a severe bottleneck was introduced by reducing population size to 30 randomly chosen individuals, followed by 15 more generations of random mating to generate extensive long-range LD_A. Fifth, we conducted three more generations of random mating with a population size of 10,000 individuals to eliminate close pedigree relationships in the ancestral population LR. To produce SR, we randomly mated LR for 100 more generations at a population size of 10,000 individuals to remove long-range LD_A. Thus, LR and SR strongly differed in their LD_A structure, but only marginally in their allele frequencies (Table S2). In the last generation of both SR and LR, a single gamete was randomly sampled per individual and treated as a completely homozygous doubled haploid line. These 10,000 lines represented the final ancestral population used for production of the synthetics. All lines were considered unrelated when calculating pedigree relationships among their progeny.

Simulation of synthetics: We generated synthetics differing in N_P by sampling N_P ∈ {2, 3, 4, 6, 8, 12, 16, 24, 32} parent lines from the same ancestral population. From these parents, we produced all N_P(N_P − 1)/2 possible combinations of single crosses (Syn-1 generation, Figure S1), where the number of Syn-1 progenies per cross was chosen to obtain at least 1,000 individuals in total. For production of the Syn-2 generation, the Syn-1 individuals were intermated at random, also allowing for selfing. Finally, a single doubled haploid line was derived from each of the 1,000 individuals of the Syn-2 generation to obtain the genotypes of the final synthetic. This approach was chosen to avoid additional full-sib relationships among doubled haploid lines that arise when deriving them from the same Syn-2 individual.

Genetic model: For simulating the polygenic target trait, we sampled a subset of 1,000 of the 5,000 QTL in each simulation replicate. Following Meuwissen et al. (2001), the corresponding QTL effects were drawn from a gamma distribution with scale and shape parameter 0.4 and 1.66, respectively. Signs of QTL effects were sampled from a Bernoulli distribution with probability parameter 0.5.
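The sampling of QTL effects described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the authors' code (their computations were carried out in R). The function name `sample_qtl_effects` and the seed are hypothetical; the scale (0.4) and shape (1.66) values are taken as stated in the text.

```python
import random


def sample_qtl_effects(n_qtl, scale=0.4, shape=1.66, seed=None):
    """Draw gamma-distributed effect magnitudes and attach random signs.

    Hypothetical helper: 'scale' and 'shape' follow the values stated in the text.
    """
    rng = random.Random(seed)
    effects = []
    for _ in range(n_qtl):
        # random.gammavariate(alpha, beta): alpha is the shape, beta the scale
        magnitude = rng.gammavariate(shape, scale)
        # Bernoulli(0.5) draw decides the sign of the effect
        sign = 1.0 if rng.random() < 0.5 else -1.0
        effects.append(sign * magnitude)
    return effects


# One replicate: effects for the 1,000 QTL sampled from the 5,000 candidates
effects = sample_qtl_effects(1000, seed=1)
```

Under these parameter values the expected absolute effect is shape × scale ≈ 0.66, and roughly half of the effects are negative.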
The vector u of true breeding values for all individuals in the synthetic was calculated as u = Wa, where W is the matrix of genotypic scores at QTL, coded {2, 0} depending on whether an individual was homozygous for the 1 or 0 allele, respectively, and adjusted for twice the frequency of the 1 allele in the ancestral population (cf. Figure S1), and a is the vector of QTL effects. The corresponding vector y of phenotypes was obtained as y = u + e (Goddard et al. 2011; de los Campos et al. 2013; Habier et al. 2013), i.e., assuming a zero mean and adding a vector e of independent, normally distributed environmental noise variables. The variance σ_e² was chosen to be identical for the two ancestral populations and all choices of N_P, assuming that environmental effects affect phenotypes independently of the additive-genetic variance σ_u² in the synthetic. The value of σ_e² was therefore set equal to the additive-genetic variance σ_AP² in ancestral population LR, averaged across 1,000 simulation replicates. The heritability h² of the target trait was then on average equal to 0.5 for LR and SR due to nearly identical allele frequencies at QTL, but lower in the synthetics, because σ_u² < σ_AP² (Table S2). We restricted our simulations to a single level of heritability, because preliminary analyses showed that changing h² resulted in a fairly constant shift of PA.

Analysis of the sources of information exploited in genomic prediction: We conceived eight scenarios to evaluate to what extent the three sources of information contribute to PA in synthetics (Figure 2), when actual relationships at QTL are estimated by marker-derived genomic relationships. The scenarios can be differentiated by three factors. First, individuals in the TS and prediction set (PS) were either related (Re-scenarios) or unrelated (Un-scenarios), depending on whether the parents of the TS (P_TS) and of the PS (P_PS) were identical (i.e., P_TS = P_PS) or disjoint (i.e., P_TS ∩ P_PS = ∅). For the Re-scenarios, we sampled individuals for the TS and PS from the same synthetic, whereas for the Un-scenarios, individuals were sampled from two different synthetics produced from disjoint sets P_TS and P_PS, each of size N_P. Both sets of parents always originated from the same ancestral population. Second, pairs of QTL and SNPs were either in LD as found in the ancestral population (LD_A-scenarios), or in linkage equilibrium (LE_A-scenarios).
To achieve the latter, we permuted complete QTL haplotypes among the N_P parents (for Un-scenarios separately in each set P_TS and P_PS), while keeping their SNP haplotypes unchanged (i.e., conserving their LD_A). This procedure eliminates any systematic association between QTL and SNP alleles originating from the ancestral population, but maintains allele frequencies and polymorphic states at QTL, as well as LD_A among the QTL. In contrast to previous approaches (cf. Habier et al. 2013), this approach avoids influencing PA by altering actual relationships at QTL. Importantly, after removal of LD_A, there is still LD between QTL and SNPs in the parents, but this LD is purely due to the limited sample size and is thus subsequently referred to as sample LD. Third, four different types of data were used to calculate the relationship matrix K used in BLUP: (i) For the SNP-scenarios, we used SNP genotypes to calculate the marker-derived genomic relationship matrix K_G = (g_ij) as

$$g_{ij} = \frac{\sum_{m} (x_{im} - 2p_m)(x_{jm} - 2p_m)}{2 \sum_{m} p_m (1 - p_m)}$$

(Habier et al. 2007; VanRaden 2008), where x_im is the genotype of the i-th individual at the m-th locus, coded {2, 0} depending on whether this individual was homozygous for the 1 or 0 allele, respectively, and p_m is the frequency of the 1 allele at the m-th SNP marker in the ancestral population. (ii) For the QTL-scenarios, the QTL genotypes were used to calculate the actual relationship matrix K_Q = (q_ij) using the same formula. (iii) For the Ped-scenario, pedigree records were used to calculate the pedigree relationship matrix K_A = (f_ij), with elements f_ij being equal to expected IBD relationships (i.e., twice the coefficient of co-ancestry). (iv) For the Tag-scenario, tag markers labeling the origin of QTL alleles at each locus from the N_P parents were used to calculate the actual IBD relationship matrix K_T = (τ_ij), with elements τ_ij being equal to twice the proportion of identical tag marker alleles between each pair of individuals. Tag markers label each QTL allele in the parents, regardless of its state, uniquely with a number ∈ {1, ..., N_P}; they thus allow tracking the segregation process during intermating and identifying the parental origin of each QTL allele in the synthetic.

Scenario Re-LD_A-SNP reflects the situation mostly encountered in practical applications of GP. It used information from pedigree relationships among individuals in the TS and PS captured by SNPs, as well as deviations from pedigree relationships due to (i) Mendelian sampling at QTL captured by co-segregation of QTL and SNPs and (ii) ancestral LD between QTL and SNPs.
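The relationship formula used for K_G and K_Q can be sketched in a few lines. This is an illustrative Python version under stated assumptions, not the authors' implementation: the function name `genomic_relationship_matrix` and the toy genotypes are hypothetical, genotypes are coded {2, 0} as in the text, and the allele frequencies are assumed to be the ancestral frequencies p_m.

```python
def genomic_relationship_matrix(genotypes, freqs):
    """Relationship matrix from {0, 2}-coded genotypes (hypothetical helper).

    genotypes: list of individuals, each a list of scores in {0, 2}
    freqs:     ancestral frequency of the 1 allele at each locus
    """
    # Denominator 2 * sum_m p_m (1 - p_m)
    denom = 2.0 * sum(p * (1.0 - p) for p in freqs)
    # Center each score by twice the ancestral allele frequency
    centered = [[x - 2.0 * p for x, p in zip(ind, freqs)] for ind in genotypes]
    n = len(centered)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            value = sum(a * b for a, b in zip(centered[i], centered[j])) / denom
            K[i][j] = K[j][i] = value
    return K


# Toy data: three doubled haploid lines, four loci, all frequencies 0.5
toy_genotypes = [[2, 0, 2, 0], [2, 0, 2, 0], [0, 2, 0, 2]]
toy_freqs = [0.5, 0.5, 0.5, 0.5]
K_toy = genomic_relationship_matrix(toy_genotypes, toy_freqs)
```

In this toy case the two identical lines have a relationship of 2 with each other and with themselves, as expected for fully inbred lines, while the line carrying the opposite allele at every locus has a relationship of −2 with both.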
Scenario Re-LD_A-Ped used only pedigree relationships, but ignored deviations due to Mendelian sampling, whereas Re-LD_A-Tag accounted for both pedigree relationships and Mendelian sampling. Both scenarios ignored actual relationships among parents by assuming unrelated founders and thus did not account for alleles that are IBS but not IBD in the synthetic. Scenario Re-LE_A-SNP was artificial, with the goal of determining the influence of ancestral LD on PA in scenario Re-LD_A-SNP. Scenario Re-LD_A-QTL was employed to determine, for the Re-scenarios, the maximum PA achievable with GBLUP (cf. de los Campos et al. 2013) when assuming that each QTL explains an equal proportion of the additive-genetic variance. The purpose was thus to quantify the reduction in PA for all other Re-scenarios when using a different data type to estimate actual relationships.

Scenarios Un-LD_A-SNP and Un-LD_A-QTL (Un-scenarios) represent the conceptual counterparts to Re-LD_A-SNP and Re-LD_A-QTL (Figure 2). Un-LD_A-SNP reflects the practical situation of predicting the genetic merit of individuals unrelated to the TS, whereas Un-LD_A-QTL provides the corresponding upper bound of PA. For both scenarios, alleles in the TS and PS had IBD probability equal to zero and, thus, the only remaining source of information contributing to PA in Un-LD_A-SNP was ancestral LD between QTL and SNPs to track actual relationships among parents. Scenario Un-LE_A-SNP was employed as a negative-control scenario to validate the simulation designs. As expected, PA for Un-LE_A-SNP fluctuated around zero for all investigated settings (results not shown), confirming that there are only three sources of information contributing to PA when using K_G.

Analysis of linkage disequilibrium and linkage phase similarity: We calculated LD as the squared correlation coefficient (r², Hill and Robertson 1968) between all pairs of QTL and SNPs in (i) each ancestral population (LD_A), (ii) the set of N_P parents sampled from the ancestral population, and (iii) the synthetic generated from the parents. Furthermore, we computed the linkage phase similarity of QTL-SNP pairs in the TS and PS. Here, we adopted a similar approach as de Roos et al. (2008), but replaced the correlation by the cosine similarity

$$\text{Linkage phase similarity} = \frac{\sum_{i=1}^{n} r_i^{TS}\, r_i^{PS}}{\sqrt{\sum_{i=1}^{n} \left(r_i^{TS}\right)^2}\,\sqrt{\sum_{i=1}^{n} \left(r_i^{PS}\right)^2}}, \qquad (1)$$

where i refers to the index of the QTL-SNP pair and n is the number of pairs for which linkage phase similarity is calculated. The reason was to account not only for the ranking but also for the absolute size of the r statistics in the two data sets (see File S2 for details). Linkage phase similarity was calculated for all QTL-SNP pairs falling into consecutive bins of 0.5 cM width. LD was first averaged within each bin and subsequently, both LD and linkage phase similarity statistics were averaged across chromosomes and simulation replicates.

Genomic prediction: The statistical model used for predicting breeding values can be written as

$$y = 1\mu + Zu + \varepsilon, \qquad (2)$$

where Z is the incidence matrix linking phenotypes with breeding values, and u is the vector of random breeding values with mean zero and variance-covariance matrix var(u) = Kσ_u², where K is a relationship matrix calculated from different data types as described above, and σ_u² is the additive-genetic variance in the synthetic. Residuals ε are random with mean zero and var(ε) = Iσ_ε², where I is an identity matrix and σ_ε² is the residual variance. Estimates of the variance components σ_u² and σ_ε² were obtained by restricted maximum likelihood, and estimated breeding values û were predicted using the mixed.solve function from the R package rrBLUP (Endelman 2011). PA was always calculated as the correlation between u and û for the PS in each simulation replicate.

Following previous studies (Goddard et al. 2011; de los Campos et al. 2013), we also investigated how well estimated relationships k_ij (i.e., g_ij, f_ij, τ_ij) between individuals i and j in the TS and PS reflect the corresponding actual relationships q_ij at QTL. We therefore calculated the coefficient of determination R²_k,q of the regression of k_ij on q_ij in each simulation replicate and all scenarios (except for Re-LD_A-QTL and Un-LD_A-QTL, where R²_k,q = 1.0).

In order to assess the effect of TS size on PA, we sampled N_TS = 125, 250, 500 or 750 individuals from the 1,000 lines of the synthetic, where 250 was used as default when another factor (e.g., marker density) was varied.
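To illustrate the linkage phase similarity statistic of Eq. (1), the following Python sketch contrasts the cosine similarity with the Pearson correlation. It is only an illustration under assumptions: the function names and the r values are hypothetical and do not come from the study.

```python
import math


def linkage_phase_similarity(r_ts, r_ps):
    """Cosine similarity of the r statistics of the same QTL-SNP pairs in TS and PS."""
    numerator = sum(a * b for a, b in zip(r_ts, r_ps))
    denominator = math.sqrt(sum(a * a for a in r_ts)) * math.sqrt(sum(b * b for b in r_ps))
    return numerator / denominator if denominator > 0 else float("nan")


def pearson_correlation(a, b):
    """Ordinary correlation, shown only for comparison."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical r values for three QTL-SNP pairs
r_ts = [0.9, 0.8, 0.85]   # strong LD in the training set
r_ps = [0.1, 0.0, 0.05]   # same ranking, but much weaker LD in the prediction set

cosine = linkage_phase_similarity(r_ts, r_ps)
correlation = pearson_correlation(r_ts, r_ps)
```

In this constructed case the correlation equals 1 because the ranking of the pairs is identical, whereas the cosine similarity is clearly below 1 because the r values differ in absolute size between the two sets.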
For the PS, we always sampled N_PS = 100 individuals from (i) the remaining individuals that were not part of the TS in the Re-scenarios or (ii) the second synthetic in the Un-scenarios. For all SNP-scenarios, the effect of marker density on PA was evaluated for two values of 5 and 0.25 SNPs cM⁻¹, the former being used as default. The number of randomly sampled SNPs per chromosome in each simulation replicate was proportional to the respective chromosome length. The two marker densities of 5 and 0.25 SNPs cM⁻¹ resulted in an average genetic map distance between each QTL and its closest SNP of 0.18 cM and 2.02 cM, respectively (Figure 1).

All reported results are arithmetic means over 1,000 simulation replicates, which were stochastically independent conditional on the ancestral populations. A simulation replicate comprises (i) random sampling of 1,000 QTL from the 5,000 initial QTL and sampling of QTL effects, (ii) sampling of the parents from the ancestral population and, in the case of the LE_A-scenarios, additionally permuting QTL haplotypes, (iii) creation of synthetics from each set of parents, (iv) sampling of the individuals for the TS and PS, (v) sampling of the noise variable e and calculation of the breeding and phenotypic values, and (vi) training of the prediction equation and calculation of estimated breeding values as well as PA in the PS (Figure S1). All computations were carried out in the R statistical environment (R Core Team 2012).
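The count-location model used for all meiosis events (see "Genome properties and genetic map") can be sketched as follows. This is a minimal Python illustration with hypothetical function names and toy haplotypes, not the simulation code of the study: the number of crossovers is Poisson distributed with mean equal to the chromosome length in Morgan, crossover positions are uniform, and there is no interference.

```python
import math
import random


def poisson(rng, lam):
    """Poisson draw by Knuth's multiplication method (adequate for small lam)."""
    limit = math.exp(-lam)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def crossover_positions(rng, length_morgan):
    """Count-location model: Poisson number of crossovers, uniform positions."""
    n_crossovers = poisson(rng, length_morgan)
    return sorted(rng.uniform(0.0, length_morgan) for _ in range(n_crossovers))


def make_gamete(rng, hap_a, hap_b, locus_positions, length_morgan):
    """Recombine two parental haplotypes; loci must be sorted by map position (Morgan)."""
    breaks = crossover_positions(rng, length_morgan)
    use_a = rng.random() < 0.5      # random choice of the starting strand
    gamete, next_break = [], 0
    for position, allele_a, allele_b in zip(locus_positions, hap_a, hap_b):
        # Switch strands for every crossover located before this locus
        while next_break < len(breaks) and breaks[next_break] < position:
            use_a = not use_a
            next_break += 1
        gamete.append(allele_a if use_a else allele_b)
    return gamete


# Toy example: chromosome of 2.76 Morgan with five loci (hypothetical positions)
rng = random.Random(42)
loci = [0.1, 0.5, 1.0, 2.0, 2.7]
gamete = make_gamete(rng, [0] * 5, [1] * 5, loci, 2.76)
```

A doubled haploid line, as used throughout the study, would correspond to duplicating such a gamete to obtain a completely homozygous genotype.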

RESULTS

Linkage disequilibrium in the ancestral populations: For ancestral population LR, LD_A showed a steep decline extending to a genetic map distance Δ = 0.5 cM and approached an asymptote of about 0.08 for Δ > 1 cM (Figure 1), reflecting the presence of long-range LD_A. By comparison, LD_A in ancestral population SR started at slightly smaller values for closely linked loci and showed a similar decline for Δ < 1 cM. It levelled off at about Δ = 2 cM, where it almost reached its asymptotic value of zero due to the absence of long-range LD_A resulting from the 100 additional generations of random mating.

Linkage disequilibrium in the parents and the synthetics: Figure 3A shows the distribution of LD between QTL-SNP pairs in the parents, measured as r², as a function of Δ. LD in the parents takes on only a limited number of values in the interval [0, 1], because only a finite number of genotype configurations is possible for two biallelic loci, and this number depends exclusively on N_P. For N_P = 2, all LD values are equal to 1. For N_P = 3 and 4, possible LD values are {1/4, 1} and {0, 1/9, 1/3, 1}, respectively, whereas for N_P = 16, more than 100 values are possible, resulting in a nearly continuous distribution of LD values in the parents. Under LE_A (i.e., ancestral linkage equilibrium due to permutation of QTL haplotypes), the frequency of LD values in the parents was thus almost independent of Δ, except for some small residual deviations due to similarity of ancestral allele frequencies at closely linked loci (see File S4 for details). Under LE_A, the high frequencies of pairs of loci in high LD for N_P = 3 and 4 demonstrate the magnitude of sample LD (Figure 3A, left column). If ancestral LD was additionally present, large parental LD values occurred more frequently for tightly linked loci (Δ < 1 cM) for both ancestral populations.
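The finite set of r² values attainable in a sample of N_P homozygous parents can be enumerated directly. The Python sketch below is an illustration under assumptions (the function name `possible_r2_values` is hypothetical and both loci are required to be polymorphic among the parents); it uses exact fractions so that distinct values are not merged by rounding.

```python
from fractions import Fraction


def possible_r2_values(n_parents):
    """Set of r^2 values attainable between two biallelic loci, both
    polymorphic among n_parents fully homozygous parent lines."""
    values = set()
    for n_a in range(1, n_parents):          # parents carrying the 1 allele at locus A
        for n_b in range(1, n_parents):      # parents carrying the 1 allele at locus B
            lowest = max(0, n_a + n_b - n_parents)
            for n_ab in range(lowest, min(n_a, n_b) + 1):   # parents carrying both
                # r^2 = D^2 / (pA (1 - pA) pB (1 - pB)), written in terms of counts
                numerator = (n_parents * n_ab - n_a * n_b) ** 2
                denominator = n_a * (n_parents - n_a) * n_b * (n_parents - n_b)
                values.add(Fraction(numerator, denominator))
    return values


r2_two = possible_r2_values(2)      # {1}
r2_three = possible_r2_values(3)    # {1/4, 1}
r2_four = possible_r2_values(4)     # {0, 1/9, 1/3, 1}
```

Calling `len(possible_r2_values(16))` gives the number of distinct values for 16 parents, which can be compared with the statement in the text.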
Under short-range LD_A in SR, the frequencies of high parental LD values were almost identical to those found under LE_A for Δ > 1 cM, regardless of N_P. Conversely, under long-range LD_A in LR, the frequency of high parental LD values was considerably elevated also for Δ > 1 cM. Altogether, the distribution of LD values in the parents was much more strongly influenced by N_P than by ancestral LD. The proportion of QTL-SNP pairs in high LD diminished as N_P increased, but grew when shifting from short- to long-range LD_A (Figure 3A, SR vs. LR).

Figure 3B shows the average LD between QTL-SNP pairs in synthetics as a function of Δ and N_P. The level of the LD curve dropped rapidly as N_P increased from 2 to 8 and approached the curve of ancestral LD. Under LE_A, LD in synthetics was still substantial for N_P = 4 due to sample LD, yet successively approached zero as N_P was increased further. For N_P > 2, the presence of ancestral LD resulted in elevated LD in the synthetics, where the increment was large between tightly linked QTL-SNP pairs (Δ < 1 cM) for both ancestral populations and moderate between loosely linked loci (Δ > 1 cM) for LR.

Linkage phase similarity between training and prediction set: For scenario Re-LD_A-SNP (P_TS = P_PS), linkage phase similarity between TS and PS exceeded 0.8 up to Δ = 20 cM, regardless of the ancestral population (Figure 4). By comparison, values were much lower for Un-LD_A-SNP (P_TS ∩ P_PS = ∅). Increasing N_P reduced linkage phase similarity only marginally for Re-LD_A-SNP even for Δ = 20 cM, but resulted in a substantial increase for Un-LD_A-SNP. The higher ancestral LD in LR resulted only in a minor increase in linkage phase similarity in Re-LD_A-SNP, but in a large increase for Un-LD_A-SNP, irrespective of N_P. Since permuting QTL haplotypes eliminated ancestral LD in scenario Re-LE_A-SNP, linkage phase similarity was identical for SR and LR and showed similar results as Re-LD_A-SNP for SR (results not shown).

Influence of ancestral linkage disequilibrium and number of parents on prediction accuracy: With an increasing number of parents N_P, PA declined for all Re-scenarios (except Re-LD_A-Ped), but increased for all Un-scenarios (Figure 5), where the strongest changes occurred between N_P = 2 and 8 for all scenarios.
The highest PA was always achieved by scenario Re-LD_A-QTL, closely followed by Re-LD_A-SNP for small N_P, with an increasing difference for larger N_P. PA increased when shifting from low (SR) to high (LR) ancestral LD for scenario Re-LD_A-SNP, but decreased for Re-LE_A-SNP. For scenario Re-LD_A-Tag, PA was always intermediate between Re-LD_A-SNP and Re-LE_A-SNP. For Re-LD_A-Ped, PA concavely increased from N_P = 2 up to its maximum value of 0.4 for N_P = 8, followed by a minor decrease. Re-LD_A-Ped and Re-LE_A-SNP approached identical PA for large N_P under long-range LD_A in LR, whereas Re-LE_A-SNP retained superior PA under short-range LD_A in SR. For Un-LD_A-QTL, PA strongly increased for both ancestral populations, especially from N_P = 2 to 8, followed by a moderate increase. For Un-LD_A-SNP, the overall level of PA was much lower, but showed a similarly increasing curvature as Un-LD_A-QTL for long-range LD_A, whereas under short-range LD_A, PA was almost consistently < 0.2 without sizeable increase for all values of N_P.

Influence of training set size and marker density on prediction accuracy: Increasing TS size (N_TS) from 125 to 750 individuals was overall most beneficial for all Re-scenarios, except for Re-LD_A-Ped (Figure S3). Conversely, Re-LD_A-Ped, as well as Un-LD_A-SNP under short-range LD_A, showed only a minor increase in PA for larger N_TS. However, for Un-LD_A-SNP under long-range LD_A and for Un-LD_A-QTL under both short- and long-range LD_A, the increase in PA along with N_TS was notable, especially for N_P > 8. Reducing the marker density from 5 SNPs cM⁻¹ to 0.25 SNPs cM⁻¹ resulted in a substantial reduction of PA for all SNP-scenarios (Figure S4). This reduction was reinforced for scenarios utilizing ancestral LD (Re-LD_A-SNP and Un-LD_A-SNP), especially in the presence of long-range LD_A and for large values of N_P.

DISCUSSION

In plant breeding, GP has been applied to various types of populations such as single or multiple biparental families or diversity panels of inbred lines. These materials differ fundamentally in their pedigree structure, the number of founder individuals involved in their development, as well as the LD in the ancestral population from which they were taken. Synthetics are especially suited for systematically assessing the influence of these factors on prediction accuracy, because the variable number of parents used for generating synthetics leads to (i) different pedigree relationships among individuals and (ii) a trade-off between ancestral LD and sample LD arising in the parents. Thus, our approach provides new insights into how these factors influence the ability of molecular markers to capture actual relationships at causal loci, which determines the accuracy in various applications of GP.

Influence of the number of parents and ancestral LD on actual relationships at causal loci and prediction accuracy: The accuracy of GP relies on (i) the distribution of actual relationships q_ij at causal loci (QTL) between individuals in the TS and PS and (ii) the quality of the approximation of q_ij by marker-derived genomic relationships g_ij (Goddard et al. 2011; Habier et al. 2013). We first investigated PA using the actual relationship matrix Q, which provides an upper bound of PA given fixed values for N_P, N_TS and h² (de los Campos et al. 2013). Subsequently, we estimated Q by the marker-derived genomic relationship matrix G and inferred how the three sources of information contributed to PA. Actual relationships q_ij between two individuals i and j can be factorized into

$$q_{ij} = f_{ij} + m_{ij} + \xi_{ij}, \qquad (3)$$

where f_ij is their expected IBD relationship at QTL, m_ij = τ_ij − f_ij is the deviation of the actual from the expected IBD relationship due to Mendelian sampling, and ξ_ij is the deviation of the actual (IBS) relationship from the actual IBD relationship.
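The factorization in Eq. (3) can be made concrete with a small numerical sketch. The Python code below is purely illustrative: the function names, the tag labels, and the values assumed for f_ij and q_ij are hypothetical. It computes τ_ij as twice the proportion of identical tag marker alleles, as defined in the Methods, and then derives m_ij and ξ_ij.

```python
def tag_relationship(tags_i, tags_j):
    """Actual IBD relationship: twice the proportion of QTL at which two
    doubled haploid lines carry alleles from the same parent."""
    identical = sum(1 for a, b in zip(tags_i, tags_j) if a == b)
    return 2.0 * identical / len(tags_i)


def decompose_relationship(q_ij, f_ij, tags_i, tags_j):
    """Split q_ij into f_ij, m_ij = tau_ij - f_ij and xi_ij = q_ij - tau_ij."""
    tau_ij = tag_relationship(tags_i, tags_j)
    m_ij = tau_ij - f_ij        # Mendelian sampling deviation
    xi_ij = q_ij - tau_ij       # IBS-but-not-IBD deviation
    return f_ij, m_ij, xi_ij


# Hypothetical pair of lines: parental origin of the allele at eight QTL
tags_i = [1, 1, 2, 3, 3, 1, 2, 2]
tags_j = [1, 2, 2, 3, 1, 1, 4, 2]

# Assumed values for illustration: pedigree relationship 1.0, actual relationship 1.4
f_ij, m_ij, xi_ij = decompose_relationship(q_ij=1.4, f_ij=1.0, tags_i=tags_i, tags_j=tags_j)
```

Here five of the eight QTL share the same parental origin, so τ_ij = 1.25, giving m_ij = 0.25 and ξ_ij = 0.15; the three components sum to the assumed q_ij of 1.4.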
Whereas $f_{ij}$ and $m_{ij}$ provide information solely with respect to the parents (i.e., the founders of the pedigree), $\xi_{ij}$ accounts also for actual relationships among the parents (Powell et al. 2010; Vela-Avitúa et al. 2015). If TS and PS are related (Re-scenarios), the distribution of $f_{ij}$ depends on $N_P$ (Figure 6A) and on the mating scheme employed for production of the synthetic (Figure S1). For small $N_P$, this distribution is dominated by full-sib and half-sib relationships, whereas distantly related and unrelated individuals dominate for larger $N_P$. The closer the pedigree relationships between individuals, the longer are the chromosome segments they inherit from common ancestors and the larger is the conditional variance in actual IBD relationships, i.e., $\mathrm{var}(m_{ij} \mid f_{ij})$ (Figure 6B, Hill and Weir 2011; Goddard et al. 2011). In other words, $\mathrm{var}(m_{ij} \mid f_{ij})$ is inversely proportional to the number of independently segregating chromosome segments and, hence, the length and number of chromosomes must be taken into account when transferring our results to other species. For example, in bread wheat (2n = 42), $\mathrm{var}(m_{ij} \mid f_{ij})$ and consequently PA attributable to the Mendelian sampling term are expected to be smaller than in maize (2n = 20). The contribution of $\xi_{ij}$ to $q_{ij}$ depends on the level of ancestral LD. Elevated LD$_A$ increases $\mathrm{var}(q_{ij})$ in the ancestral population (Figure S2, LR vs. SR) and in turn increases the variation in similarity of haplotypes among parents sampled therefrom (Habier et al. 2013). Consequently, $\mathrm{var}(q_{ij} \mid f_{ij})$ in synthetics increases with ancestral LD (Figure 6B and S2), on top of the variance $\mathrm{var}(m_{ij} \mid f_{ij})$ caused by Mendelian sampling. Assuming known actual relationships and fixed TS size, PA therefore decreases if (i) $N_P$ increases and (ii) ancestral LD decreases (Figure 5). This is because both factors reduce the absolute frequency of close actual relationships among TS and PS (Figure S5).
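The inverse proportionality between the Mendelian sampling variance and the number of independently segregating segments can be illustrated with a deliberately crude toy model. The sketch below assumes equally sized, fully independent segments that are each shared IBD with probability 0.5; the function name, the sharing probability and the segment counts are illustrative choices and do not correspond to the simulations of this study.

```python
import numpy as np

def realized_sharing_variance(n_segments, n_pairs=20000, seed=1):
    """Variance across pairs of the realized fraction of segments shared IBD.

    Toy assumption: every segment is shared independently with probability
    0.5, so the expected variance is 0.25 / n_segments.
    """
    rng = np.random.default_rng(seed)
    shared = rng.random((n_pairs, n_segments)) < 0.5   # True where a segment is shared
    return shared.mean(axis=1).var()                   # variance of realized sharing

v10 = realized_sharing_variance(10)    # few independent segments
v40 = realized_sharing_variance(40)    # four times as many segments
ratio = v10 / v40                      # expected to be close to 4
```

Under these assumptions, quadrupling the number of independent segments reduces the variance of realized sharing to roughly a quarter, which is the qualitative reason given above for expecting a smaller Mendelian sampling contribution in species with more or longer chromosomes.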
If actual relationships among the parents were not accounted for, the decline in PA was reinforced as $N_P$ increased (Figure 5, scenario Re-LD$_A$-Tag). The reason for this follows from the factorization (Eq. 3): the larger $N_P$, the more frequent are pairs of individuals with small or zero pedigree relationship (Figure 6A) and the more important it is to account for actual relationships among parents. Conversely, PA was consistently higher for small $N_P$ due to strong pedigree relationships and Mendelian sampling, despite the accompanying negative effects of reduced heritability in the TS and the reduced additive-genetic variance in the PS on PA (Table S2). Restricting predictive information to pedigree relationships (scenario Re-LD$_A$-Ped) resulted in only moderate PA, except for $N_P = 2$ (Figure 5). For $N_P = 2$, all individuals in the TS and PS were full-sibs (Figure 6A), which resulted in identical estimated breeding values by pedigree BLUP, so that PA could not be calculated (indicated as PA = 0 in Figure 5). For $N_P > 2$, there was variation in pedigree relationships in synthetics (Figure 6A) and thus, PA > 0. Further research is warranted on the importance of variation in pedigree relationships for GP in the presence of Mendelian sampling and ancestral LD, e.g., by considering mating schemes such as MAGIC, which reduce or even entirely avoid variation in pedigree relationships. If the TS and PS are unrelated (Un-scenarios), only $\xi_{ij}$ contributes to variation in $q_{ij}$, because $f_{ij}$ and $m_{ij}$ are equal to zero. Moreover, if $N_P$ is small, QTL in the TS and PS can (i) be fixed for different alleles (Table S2) or (ii) differ in their LD structure due to sample LD. This limits the occurrence of close actual relationships between TS and PS (Figure S5, Un-LD$_A$-QTL) and reduces the upper bounds of PA compared with the corresponding Re-scenarios (Figure 5, Un-LD$_A$-QTL vs. Re-LD$_A$-QTL). As $N_P$ increases, allele frequencies and LD between loci converge towards those in the ancestral population in both Re- and Un-scenarios (Table S2). In turn, the closest actual relationships between TS and PS converge as well (Figure S5), resulting ultimately in similar PA for Re-LD$_A$-QTL and Un-LD$_A$-QTL (Figure 5).
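Why pedigree BLUP yields no usable accuracy for $N_P = 2$ can be seen from a small numerical sketch. It assumes a pedigree relationship matrix in which every pair of individuals has the same coefficient; the values 0.5 and 1.0, the set sizes and the random phenotypes are placeholders chosen for illustration and are not derived from the mating scheme of this study.

```python
import numpy as np

n_ts, n_ps, h2 = 30, 10, 0.5
n = n_ts + n_ps

# Pedigree relationship matrix with one common coefficient for all pairs
# (illustrative values: 0.5 between individuals, 1.0 on the diagonal).
a = np.full((n, n), 0.5)
np.fill_diagonal(a, 1.0)

rng = np.random.default_rng(0)
y_ts = rng.normal(size=n_ts)                     # hypothetical TS phenotypes

lam = (1.0 - h2) / h2                            # residual-to-genetic variance ratio
a_tt = a[:n_ts, :n_ts]                           # relationships within the TS
a_pt = a[n_ts:, :n_ts]                           # relationships between PS and TS
ebv_ps = a_pt @ np.linalg.solve(a_tt + lam * np.eye(n_ts), y_ts - y_ts.mean())

spread = ebv_ps.max() - ebv_ps.min()             # zero up to rounding error
```

Because every row of `a_pt` is identical, all individuals in the PS receive the same estimated breeding value. The predictions then have zero variance and their correlation with the true values is undefined, consistent with the statement above that PA could not be calculated.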
In conclusion, the difference in predicting related and unrelated genotypes vanishes as $N_P$ increases for a given TS size, because it is then primarily ancestral information that drives the accuracy of GP.

Sample LD and co-segregation, crucial factors for prediction accuracy in synthetics: LD in the parents represents a combination of LD carrying over from the ancestral population and LD generated anew due to limited $N_P$. The latter LD, herein referred to as sample LD, results from a bottleneck in population size similar to that used in our simulations for generating long-range LD in the ancestral population (cf. Figure S1), but can be much stronger if $N_P$ is small (e.g., 4). Co-segregation is defined as the co-inheritance of alleles at linked loci on the same gamete and thus describes the process that prevents parental LD between them from being rapidly eroded by recombination (Figure S6). Together, sample LD and co-segregation result in high LD in synthetics, which for small $N_P$ exceeds by far the level of ancestral LD (Figure 3B, see File S3 for details). The crucial property of sample LD, however, is that it is specific to a set of parents and thus provides predictive information only for their descendants. Hence, using co-segregation as source of information in GP relies on the presence of pedigree relationships (Habier et al. 2013). Conversely, the fraction of parental LD that stems from ancestral LD is a commonality among all descendants of the ancestral population, irrespective of pedigree relationships. The particularly small number of parents used in synthetics makes sample LD and co-segregation crucial factors contributing to PA, a situation that differs greatly from previously investigated scenarios (e.g., Habier et al. 2007, 2013; Wientjes et al. 2013). Hence, knowledge of how ancestral LD and sample LD contribute to parental LD, depending on $N_P$, is essential for evaluating the applicability of training data to prediction of both related and unrelated genotypes. The influence of sample LD on parental LD and PA in the Re-scenarios is illustrated best by considering different values of $N_P$. For $N_P = 2$, sample LD in the parents is maximized, because all pairs of polymorphic loci are in complete LD ($r^2 = 1.0$), irrespective of ancestral LD, linkage or genetic map distance. Co-segregation of linked QTL and SNPs during intermating largely conserves LD, even for loosely linked loci (Figure S6), so that LD in synthetics remained at high levels (Figure 3B). Therefore, replacing Q with G resulted in merely a marginal reduction of PA (Figure 5, Re-LD$_A$-QTL vs. Re-LD$_A$-SNP).
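The statement that all pairs of polymorphic loci are in complete LD for $N_P = 2$ can be checked by brute-force enumeration. The sketch below assumes fully inbred parents that each contribute one haplotype with equal weight and biallelic loci coded 0/1; the function names are hypothetical and the enumeration ignores ancestral LD, which only changes how often each value occurs.

```python
from itertools import product
import numpy as np

def r_squared(hap_a, hap_b):
    """Squared allelic correlation between two loci across parental haplotypes."""
    p_a, p_b = hap_a.mean(), hap_b.mean()
    d = np.mean(hap_a * hap_b) - p_a * p_b            # LD coefficient D
    return d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))

def distinct_r2(n_parents):
    """Sorted r^2 values attainable between two polymorphic loci among n inbred parents."""
    configs = [np.array(c) for c in product((0, 1), repeat=n_parents)
               if 0 < sum(c) < n_parents]             # keep polymorphic loci only
    return sorted({round(float(r_squared(a, b)), 6)
                   for a in configs for b in configs})

print(distinct_r2(2))   # [1.0]
print(distinct_r2(3))   # [0.25, 1.0]
print(distinct_r2(4))   # [0.0, 0.111111, 0.333333, 1.0]
```

With two parents, any two polymorphic loci necessarily have identical or exactly reversed allele patterns, which gives $r^2 = 1$ regardless of map distance. With three and four parents the enumeration returns two and four distinct values, respectively.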
Previous studies claimed that PA in biparental populations is the maximum obtainable for given TS size (Riedelsheimer et al. 2013; Lehermeier et al. 2014), despite absence of variation in pedigree relationships. Our results demonstrate that this is exclusively attributable to the efficient utilization of sample LD via co-segregation. For $N_P = 3$ and 4, LD can take two and four discrete values, respectively (see Results). Thus, sample LD still takes up a large share of parental LD (Figure 3A). However, the occurrence of different LD values (in contrast to $N_P = 2$) introduces a dependency on ancestral LD: the frequency of loci with high parental LD increases in the presence of ancestral LD compared with ancestral linkage equilibrium. This difference carries over during intermating and resulted in increased LD in the synthetics, especially under long-range ancestral LD (Figure 3B, LR). However, the increment in PA was only marginal (Figure 5, Re-LD$_A$-SNP vs. Re-LE$_A$-SNP) owing to the overriding contribution of sample LD to parental LD for $N_P = 3$ and 4. Nevertheless, the reduction in sample LD for $N_P = 3$ or 4, compared with $N_P = 2$, impaired co-segregation information and reinforced the decline in PA when relying on markers (Figure 5, Re-LD$_A$-QTL vs. Re-LD$_A$-SNP). For $N_P \geq 16$, sample LD becomes negligible (Figure 3A) so that parental LD hardly differed from ancestral LD. This led to (i) reinforced reduction in PA when using markers rather than known QTL genotypes (Figure 5, Re-LD$_A$-QTL vs. Re-LD$_A$-SNP), especially for short-range ancestral LD, and (ii) convergence of PA of GBLUP and pedigree BLUP in the absence of ancestral LD (Figure 5, Re-LE$_A$-SNP vs. Re-LD$_A$-Ped). The reason for the latter is that under marginal contribution of co-segregation, PA stems primarily from capturing pedigree relationships by SNPs. For the Un-scenarios, sample LD is manifested independently in $P_{TS}$ and $P_{PS}$, which results in different co-segregation patterns in TS and PS that cannot reliably be exploited in GP. Therefore, the ancestral LD that is common to both sets of parents, measured by linkage phase similarity in the synthetics (Figure 4), provides the only source of information connecting the TS and PS. This constraint resulted in a much larger drop in PA when replacing Q with G in the Un-scenarios (Figure 5, Un-LD$_A$-QTL vs. Un-LD$_A$-SNP) compared with the corresponding Re-scenarios (Figure 5, Re-LD$_A$-QTL vs. Re-LD$_A$-SNP), especially under short-range ancestral LD. This decline in PA when predicting the genetic merit of unrelated instead of related genotypes corroborates previous findings on GP across populations in both animal and plant breeding (Hayes et al. 2009b; de Roos et al. 2009; Technow et al. 2013; Riedelsheimer et al. 2013; Albrecht et al.
2014; Heslot and Jannink 2015). Variation in linkage phase similarity between TS and PS caused by sample LD affects GP of unrelated genotypes in an unforeseeable manner: while identical and reversed QTL-SNP linkage phases manifested by sample LD cancel out on average, individual TS-PS combinations can show above or below average linkage phase similarity and thus, co-segregation patterns. This translates into large variation of PA among different TS-PS combinations. Additional simulations using unequal $N_P$ to derive the TS and PS showed that variation in PA was even higher when using small $N_P$ to generate the PS than for the TS (Figure S7). A possible explanation might be that regardless of the TS composition, small $N_P$ for the PS drastically reduces the frequency of polymorphic loci (Table S2) and thereby increases the variation in linkage phase similarity with the TS for the remaining loci, which in turn increases the variability of prediction. Considering the practical relevance of such prediction scenarios, further research is needed to investigate this finding in greater detail.

Influence of LD on capturing pedigree relationships: The ability to capture pedigree relationships by SNPs increases with the effective number of independently segregating SNPs in the model (Habier et al. 2007). Higher LD between SNPs reduces this number and thus, reduces the contribution of pedigree relationships captured by SNPs to PA. Scenario Re-LE$_A$-SNP demonstrates this fact for large values of $N_P$, where LD between QTL and SNPs in synthetics was small (Figure 3B) and hence, PA mainly relied on capturing pedigree relationships. In line with this reasoning, PA decreased from SR to LR (Figure 5, Re-LE$_A$-SNP) as well as when marker density was reduced from 5 to only 0.25 SNPs cM$^{-1}$, because similar to increasing LD, using low marker density reduced the number of independently segregating SNPs (Figure S4, Re-LE$_A$-SNP). In GBLUP, the consequences of an imprecise estimation of pedigree relationships by SNPs due to strong LD are limited, because the loss in PA compared with pedigree BLUP is mostly overcompensated for by capturing either co-segregation (Figure 5; small $N_P$, Re-LE$_A$-SNP vs. Re-LD$_A$-Ped) or long-range ancestral LD (Figure 5; large $N_P$, Re-LE$_A$-SNP vs. Re-LD$_A$-Ped). An exception is the combination of large $N_P$ and short-range ancestral LD, where the comparatively small contribution of ancestral LD to PA does not necessarily compensate for that loss, so that GBLUP might not provide the desired advantage over pedigree BLUP.
Alternative models employing variable selection (e.g., BayesB), which capitalize more on LD rather than pedigree relationships (Habier et al. 2007; Zhong et al. 2009; Jannink et al. 2010), might help to improve the prospects of GP in such cases.

Influence of training set size on prediction accuracy: In this study, we varied training set size $N_{TS}$ for given values of $N_P$, because resources devoted to the TS differ between breeding programs and do not necessarily depend on $N_P$. Under fixed $N_P$, the absolute frequency of individuals with close actual relationship among TS and PS increases with $N_{TS}$ (Figure S5), which led to similar benefits in PA for all Re-scenarios (Figure S3, except Re-LD$_A$-Ped). However, the general decline of PA in these scenarios with increasing $N_P$ was only slightly attenuated even when using 750 instead of 125 individuals in the TS. This is because the need for larger TS size increases rapidly as pedigree relationships with the PS decrease (Habier et al. 2010), which in turn shifts the distribution of actual relationships toward lower values (Figure 6B and S2). Thus, $N_{TS}$ must generally be increased along with $N_P$ to counteract as much as possible the expected decline in PA. According to Habier et al. (2013), altering TS size affects the contributions of the three sources of information to PA, but this inference is based on the assumption that TS size was increased by adding new families to the TS (unrelated to the initially included families), which is comparable to increasing $N_P$ in our study. De los Campos et al. (2013) showed that the estimation of actual relationships by SNPs is sufficiently characterized by $R^2_{k,q}$ (Figure S8) and thus, largely independent of $N_{TS}$, apart from estimation error. In synthetics, the distribution of actual relationships $q_{ij}$ is defined by $N_P$ and LD$_A$ (Figure 6 and S2). Thus, increasing $N_{TS}$ increases the chances for each individual in the PS to have several individuals with close actual relationships $q_{ij}$ in the TS, which was previously found to be crucial for achieving high PA (Jannink et al. 2010; Clark et al. 2012). Therefore, the contributions to PA from co-segregation and ancestral LD increase with $N_{TS}$, because they are required to capture deviations from pedigree relationships. Conversely, using small $N_{TS}$ will tend to hamper the occurrence of high $q_{ij}$ values and hence, increase the reliance on pedigree relationships.
If TS and PS are unrelated, the absolute frequency of close actual relationships is low, even if $N_{TS}$ is large (Figure S5). Additionally, actual relationships are rather poorly estimated by SNPs when relying solely on ancestral LD (Figure S8, Un-LD$_A$-SNP). Consequently, huge $N_{TS}$ ($\gg 750$) and high marker density would be required to substantially elevate PA, especially if there is only short-range ancestral LD (cf. de los Campos et al. 2013).

Influence of marker density on prediction accuracy: High marker density is especially important if LD between QTL and SNPs extends only to short map distances (Solberg et al. 2008; Zhong et al. 2009; Hickey et al. 2014). This applies in our study if either sample LD was negligible (Figure S4; large $N_P$, Re-LD$_A$-SNP vs. Re-LE$_A$-SNP) or if TS and PS were unrelated (Figure S4, Un-LD$_A$-SNP), so that PA relied heavily on ancestral LD. Our results also show that in the latter case, using high marker density strongly improved PA for both ancestral populations, implying that capturing LD between tightly linked loci (< 1 cM) is beneficial even if long-range ancestral LD prevails. With low marker density, capturing only the long-range part of ancestral LD (Figure 1, LR) still provided moderate PA (Figure S4, LR), but PA dropped below 0.1 for short-range ancestral LD (Figure S4, SR). This was likely because most SNPs were no longer in LD with QTL and thus contributed mostly noise to the prediction equation. These results are in agreement with former studies (de los Campos et al. 2013; Habier et al. 2013; Hickey et al. 2014; Lorenz and Smith 2015) reporting that under insufficient marker density, adding individuals unrelated to the PS to the TS can even decrease PA. In summary, the required marker density for $N_P \geq 16$ should be chosen in compliance with the extent of ancestral LD. While in this case high density is mandatory if TS and PS are unrelated, moderate PA can still be achieved under low marker density if TS and PS are related, because pedigree relationships then contribute to PA. For small $N_P$, extensive LD in synthetics (due to sample LD and co-segregation) lowers the requirements on marker density.
Although co-segregation is captured optimally if SNPs and QTL are as tightly linked as possible, medium marker density (about 1 SNP cM$^{-1}$, depending on $N_P$) is likely sufficient to reach PA near the optimum.

Expected impact of ancestral LD on GP in synthetics: In GP of genetic predisposition in humans or breeding values of bulls, the availability of several thousand training individuals, in conjunction with high marker densities, allows for efficient use of rather low levels of ancestral LD, as usually observed in these species (de Roos et al. 2008; Goddard and Hayes 2009; de los Campos et al. 2013). We showed that short-range ancestral LD is generally less valuable in plant breeding, where TS usually comprise only hundreds or fewer individuals. Ancestral LD can differ substantially among crops and different germplasm within crops (Flint-Garcia et al. 2003). Usually, low levels of ancestral LD are found in diversity panels that encompass lines from different breeding programs and/or geographic origin as well as in materials largely unselected by breeders, such as landraces or gene bank accessions (Hyten et al. 2007; Delourme et al. 2013; Romay et al. 2013). Recently, Gorjanc et al. (2016) proposed GP for recurrent selection of synthetics generated from doubled haploid lines derived from landraces. In the light of our findings, such an approach generally requires large TS size and high marker density to outperform pedigree BLUP, unless one chooses small $N_P$ to ensure satisfactory PA due to co-segregation. In contrast, extensive long-range ancestral LD is usually found in elite breeding germplasm of major crops such as maize (Windhausen et al. 2012; Unterseer et al. 2014), wheat (Maccaferri et al. 2005), barley (Zhong et al. 2009), soybean (Hyten et al. 2007) or sugar beet (Würschum et al. 2013). If synthetics were derived from such germplasm, ancestral LD is expected to contribute substantially to PA, as shown by our results. However, LD determined from biallelic SNPs might overestimate ancestral LD between QTL-SNP pairs, because their allele frequencies can deviate due to ascertainment bias in discarding SNPs with low minor allele frequencies for the construction of SNP arrays (Ganal et al. 2011; Goddard et al. 2011). Such an overestimation would impair the advantage of GP approaches over pedigree BLUP.

Implications for other scenarios relevant in plant breeding: Research on GP in plant breeding has so far focused primarily on the use of single (e.g., Lorenzana and Bernardo 2009; Riedelsheimer et al. 2013) and multiple segregating biparental families (BF) (e.g., Heffner et al. 2011; Albrecht et al. 2011; Schulz-Streeck et al. 2012; Habier et al. 2013; Lehermeier et al. 2014).
For $N_P = 2$, our scenarios Re-LD$_A$-SNP and Un-LD$_A$-SNP correspond exactly to GP within and between BF derived from unrelated parents. In practice, breeders mostly derive lines directly from F$_1$ crosses (Mikel and Dudley 2006), whereas we applied a further generation of intermating (Figure S1). This additional meiosis slightly reduces LD in synthetics (see File S3 for details) and in turn, PA (results not shown). While GP within BF generally works well, predicting an unrelated BF can be risky and unreliable (Riedelsheimer et al. 2013), as underlined by our results for scenario Un-LD$_A$-SNP (Figure S7, $N_P = 2$). Similar uncertainties might be encountered if new lines from an untested BF are predicted based on pre-existing data from multiple BF (Heffner et al. 2011), diversity panels (Würschum et al. 2013) or populations of experimental hybrids (Massman et al. 2013) to obtain predicted breeding values prior to partially phenotyping the new cross (Figure S7, $N_P > 2$ in TS and $N_P = 2$ in PS). The risk of such approaches is likely attenuated in advanced breeding cycles, where putatively unrelated BF usually share more recent common ancestors than a TS comprising truly unrelated material, as would be the case in an ideal diversity panel. However, Hickey et al. (2014) showed that if two BF share only a grandparent as their most recent common ancestor, PA was not substantially higher than for unrelated BF. This underpins the need for close relatives in the TS (e.g., full-sibs or half-sibs) to warrant high and robust PA across different prediction targets. Accordingly, previous studies on GP in diversity panels concluded that the observed medium to high PAs were partially attributable to latent groups of related germplasm (e.g., Rincent et al. 2012; Schopp et al. 2015). If a BF is too small for training the prediction equation, multiple BF can alternatively be pooled together (Heffner et al. 2011; Technow and Totir 2015). Such a combined TS can be constructed by sampling lines from each BF to predict the remainder in each BF ("within") or by using some BF to predict other BF ("across") (cf. Albrecht et al. 2011). Our scenarios Re-LD$_A$-SNP and Un-LD$_A$-SNP are similar to these within and across situations for $N_P > 2$ but, besides the additional meiosis discussed above, show another important difference to F$_1$-derived multiple BF: generating synthetics by random mating of the Syn-1 generation breaks up the clear pedigree structure in full-sib, half-sib and unrelated families (Figure S9).
This reduces both the mean and variance of pedigree relationships, which in turn reduces PA (results not shown). As discussed above, capturing pedigree relationships plays a major role in GP of both synthetics and multiple BF if TS and PS are related, especially if $N_P$ is large. This is because in both situations, co-segregation is barely used to obtain accuracy within families (cf. Habier et al. 2013). In practical breeding programs using multiple BF, the situation might be slightly different if some parents are overrepresented compared with others and introduce predominant linkage phase patterns that can be exploited in GP. Moreover, one has the opportunity to improve information from co-segregation by (i) clustering related BF into the TS to reflect the co-segregation pattern of the PS or (ii) explicit modelling of co-segregation (cf. Habier et al. 2013) or family-specific effects using hierarchical models (Technow and Totir 2015). However, neither of these strategies is easily accessible in synthetics, unless one replaces random by controlled mating in order to keep track of pedigree relationships. Since ancestral LD persists well over generations (Habier et al. 2007), its contribution to PA is expected to be only marginally affected by additional intermating generations. Thus, ancestral LD can generally be considered of great importance for GP of material related or unrelated to the TS, particularly if $N_P$ is large. In the present study, we considered the two most extreme situations of relatedness or unrelatedness of the TS and PS, because their parents were either identical or entirely different. Further research is warranted for situations of partial overlap of parents among families, which occurs frequently in practice, e.g., when proven inbred lines contribute to multiple crosses in subsequent breeding cycles. Moreover, we focused here exclusively on PA, but the genetic gain from genomic selection, which is of ultimate interest to breeders, depends additionally on the genetic variance in the population. Since both parameters are influenced by the choice of $N_P$, the potential of recurrent genomic selection in synthetics needs to be examined for different values of $N_P$ and different levels of ancestral LD, ideally across multiple selection cycles.
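The joint dependence of genetic gain on PA and genetic variance mentioned above can be made explicit with the standard textbook form of the expected response to selection per cycle. The expression is not given in this article and is shown only as a reminder; the symbols $\Delta G$, $i$ (selection intensity) and $\sigma_A$ (additive-genetic standard deviation in the selection candidates) are introduced here for illustration.

```latex
\Delta G = i \cdot \mathrm{PA} \cdot \sigma_A
```

Under this expression, a small $N_P$ that raises PA but lowers $\sigma_A$ need not increase $\Delta G$, which is why both quantities have to be evaluated together.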

ACKNOWLEDGMENTS

We thank Chris-Carolin Schön, Tobias Würschum, José Marulanda, Willem Molenaar and three anonymous reviewers for valuable suggestions to improve the content of the manuscript. PS acknowledges Syngenta for partially funding this research by a Ph.D. fellowship and AEM the financial contribution of CIMMYT/GIZ through the CRMA Project

DATA AVAILABILITY STATEMENT

The authors state that all simulated data and results necessary for confirming the conclusions presented in the article are represented fully within the article and data supplements. Figure S1 provides a detailed overview over the entire simulation scheme and assumptions underlying all results presented herein.

LITERATURE CITED

Albrecht, T., H.-J. Auinger, V. Wimmer, J. O. Ogutu, C. Knaak et al., 2014 Genome-based prediction of maize hybrid performance across genetic groups, testers, locations, and years. Theor. Appl. Genet. 127:
Albrecht, T., V. Wimmer, H. Auinger, M. Erbe, C. Knaak et al., 2011 Genome-based prediction of testcross values in maize. Theor. Appl. Genet. 123:
Bandillo, N., C. Raghavan, and P. Muyco, 2013 Multi-parent advanced generation inter-cross (MAGIC) populations in rice: progress and potential for genetics research and breeding. Rice 6:
Bradshaw, J. E., 2016 Plant Breeding: Past, Present and Future. Springer International Publishing.
Cavanagh, C., M. Morell, I. Mackay, and W. Powell, 2008 From mutations to MAGIC: resources for gene discovery, validation and delivery in crop plants. Curr. Opin. Plant Biol. 11:
Clark, S. A., J. M. Hickey, H. D. Daetwyler, and J. H. J. van der Werf, 2012 The importance of information on relatives for the prediction of genomic breeding values and the implications for the makeup of reference data sets in livestock breeding schemes. Genet. Sel. Evol. 44: 4.
Delourme, R., C. Falentin, B. F. Fomeju, M. Boillot, G. Lassalle et al., 2013 High-density SNP-based genetic map development and linkage disequilibrium assessment in Brassica napus L. BMC Genomics 14: 120.
Endelman, J. B., 2011 Ridge Regression and Other Kernels for Genomic Selection with R Package rrBLUP. Plant Genome 4:
Falconer, D. S., and T. F. C. Mackay, 1996 Introduction to Quantitative Genetics (1996 Longman, Ed.). Pearson, Essex.
Flint-Garcia, S. A., J. M. Thornsberry, and E. S. Buckler, 2003 Structure of linkage disequilibrium in plants. Annu. Rev. Plant Biol. 54:
Ganal, M. W., G. Durstewitz, A. Polley, A. Bérard, E. S. Buckler et al., 2011 A large maize (Zea mays L.) SNP genotyping array: development and germplasm genotyping, and genetic mapping to compare with the B73 reference genome. PLoS One 6: e
Goddard, M. E., and B. J. Hayes, 2009 Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nat. Rev. Genet. 10:
Goddard, M. E., B. J. Hayes, and T. H. E. Meuwissen, 2011 Using the genomic relationship matrix to predict the accuracy of genomic selection. J. Anim. Breed. Genet. 128:
Gorjanc, G., J. Jenko, S. J. Hearne, and J. M. Hickey, 2016 Initiating maize pre-breeding programs using genomic selection to harness polygenic variation from landrace populations. BMC Genomics 17: 30.
Habier, D., R. L. Fernando, and J. C. M. Dekkers, 2007 The impact of genetic relationship information on genome-assisted breeding values. Genetics 177:
Habier, D., R. L. Fernando, and D. J. Garrick, 2013 Genomic BLUP Decoded: A Look into the Black Box of Genomic Prediction. Genetics 194:
Habier, D., J. Tetens, F. Seefried, P. Lichtner, and G. Thaller, 2010 The impact of genetic relationship information on genomic breeding values in German Holstein cattle. Genet. Sel. Evol. 42: 5.
Hagdorn, S., K. Lamkey, M. Frisch, P. E. O. Guimarães, and A. E. Melchinger, 2003 Molecular genetic diversity among progenitors and derived elite lines of BSSS and BSCB1 maize populations. Crop Sci. 43:

Hallauer, A. R., M. J. Carena, and J. de M. Filho, 2010 Quantitative genetics in maize breeding. Springer.
Hartl, D. L., and A. G. Clark, 2007 Principles of Population Genetics. Sinauer Associates, Inc.
Hayes, B. J., P. J. Bowman, A. J. Chamberlain, and M. E. Goddard, 2009a Genomic selection in dairy cattle: progress and challenges. J. Dairy Sci. 92:
Hayes, B. J., P. J. Bowman, A. C. Chamberlain, K. Verbyla, and M. E. Goddard, 2009b Accuracy of genomic breeding values in multi-breed dairy cattle populations. Genet. Sel. Evol. 41: 51.
Hayes, B. J., P. M. Visscher, and M. E. Goddard, 2009c Increased accuracy of artificial selection by using the realized relationship matrix. Genet. Res. Cambridge 91:
Heffner, E. L., J. Jannink, and M. E. Sorrells, 2011 Genomic Selection Accuracy using Multifamily Prediction Models in a Wheat Breeding Program. Plant Genome 4:
Henderson, C., 1984 Applications of linear models in animal breeding. University of Guelph, ON.
Heslot, N., and J.-L. Jannink, 2015 An alternative covariance estimator to investigate genetic heterogeneity in populations. Genet. Sel. Evol. 47: 93.
Hickey, J. M., S. Dreisigacker, J. Crossa, S. Hearne, R. Babu et al., 2014 Evaluation of genomic selection training population designs and genotyping strategies in plant breeding programs using simulation. Crop Sci. 54:
Hill, W. G., 1981 Estimation of effective population size from data on linkage disequilibrium. Genet. Res. 38:
Hill, W. G., and A. Robertson, 1968 Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38:
Hill, W. G., and B. S. Weir, 2011 Variation in actual relationship as a consequence of Mendelian sampling and linkage. Genet. Res. Cambridge 93:
Hyten, D. L., I. Y. Choi, Q. Song, R. C. Shoemaker, R. L. Nelson et al., 2007 Highly variable patterns of linkage disequilibrium in multiple soybean populations. Genetics 175:
Jannink, J.-L., A. J. Lorenz, and H. Iwata, 2010 Genomic selection in plant breeding: from theory to practice. Briefings Funct. Genomics Proteomics 9:
de Koning, D.-J., 2016 Meuwissen et al. on Genomic Selection. Genetics 203: 5–7.
Lehermeier, C., N. Krämer, E. Bauer, C. Bauland, C. Camisan et al., 2014 Usefulness of multi-parental populations of maize (Zea mays L.) for genome-based prediction. Genetics 198:
Lin, Z., B. J. Hayes, and H. D. Daetwyler, 2014 Genomic selection in crops, trees and forages: A review. Crop Pasture Sci. 65:
Lorenzana, R. E., and R. Bernardo, 2009 Accuracy of genotypic value predictions for marker-based selection in biparental plant populations. Theor. Appl. Genet. 120:
Lorenz, A. J., and K. P. Smith, 2015 Adding genetically distant individuals to training populations reduces genomic prediction accuracy in Barley. Crop Sci. 55:
de los Campos, G., A. I. Vazquez, R. Fernando, Y. C. Klimentidis, and D. Sorensen, 2013 Prediction of Complex Human Traits Using the Genomic Best Linear Unbiased Predictor. PLoS Genet. 9: 7.
Maccaferri, M., M. C. Sanguineti, E. Noli, and R. Tuberosa, 2005 Population structure and long-range linkage disequilibrium in a durum wheat elite collection. Mol. Breed. 15:
Mackay, I., E. Ober, and J. Hickey, 2015 GplusE: beyond genomic selection. Food Energy Secur. 4:

Massman, J. M., A. Gordillo, R. E. Lorenzana, and R. Bernardo, 2013 Genomewide predictions from maize single-cross data. Theor. Appl. Genet. 126:
McMullen, M. D., S. Kresovich, H. S. Villeda, P. Bradbury, H. Li et al., 2009 Genetic Properties of the Maize Nested Association Mapping Population. Science 325:
Meuwissen, T. H. E., B. J. Hayes, and M. E. Goddard, 2001 Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:
Mikel, M. A., and J. W. Dudley, 2006 Evolution of North American dent corn from public to proprietary germplasm. Crop Sci. 46:
Powell, J. E., P. M. Visscher, and M. E. Goddard, 2010 Reconciling the analysis of IBD and IBS in complex trait studies. Nat. Rev. Genet. 11:
R Core Team, 2012 R: A language and environment for statistical computing. ISBN
Riedelsheimer, C., J. B. Endelman, M. Stange, M. E. Sorrells, J. L. Jannink et al., 2013 Genomic predictability of interconnected biparental maize populations. Genetics 194:
Rincent, R., D. Laloë, S. Nicolas, T. Altmann, D. Brunel et al., 2012 Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: comparison of methods in two diverse groups of maize inbreds (Zea mays L.). Genetics 192:
Romay, M. C., M. J. Millard, J. C. Glaubitz, J. A. Peiffer, K. L. Swarts et al., 2013 Comprehensive genotyping of the USA national maize inbred seed bank. Genome Biol. 14: R55.
de Roos, A. P. W., B. J. Hayes, and M. E. Goddard, 2009 Reliability of genomic predictions across multiple populations. Genetics 183:
de Roos, A. P. W., B. J. Hayes, R. J. Spelman, and M. E. Goddard, 2008 Linkage disequilibrium and persistence of phase in Holstein-Friesian, Jersey and Angus cattle. Genetics 179:
Sargolzaei, M., and F. S. Schenkel, 2009 QMSim: a large-scale genome simulator for livestock. Bioinformatics 25:
Schopp, P., C. Riedelsheimer, H. F. Utz, C.-C. Schön, and A. E. Melchinger, 2015 Forecasting the accuracy of genomic prediction with different selection targets in the training and prediction set as well as truncation selection. Theor. Appl. Genet. 128:
Schulz-Streeck, T., J. O. Ogutu, Z. Karaman, C. Knaak, and H. P. Piepho, 2012 Genomic Selection using Multiple Populations. Crop Sci. 52:
Solberg, T. R., A. K. Sonesson, J. A. Woolliams, and T. H. E. Meuwissen, 2008 Genomic selection using different marker types and densities. J. Anim. Sci. 86:
Suneson, C. A., 1956 An Evolutionary Plant Breeding Method. Agron. J. 6: 1 4.
Technow, F., A. Bürger, and A. E. Melchinger, 2013 Genomic prediction of northern corn leaf blight resistance in maize with combined or separated training sets for heterotic groups. G3 3:
Technow, F., and L. R. Totir, 2015 Using Bayesian Multilevel Whole Genome Regression Models for Partial Pooling of Training Sets in Genomic Prediction. G3 5:
Unterseer, S., E. Bauer, G. Haberer, M. Seidel, C. Knaak et al., 2014 A powerful tool for genome analysis in maize: development and evaluation of the high density 600 k SNP genotyping array. BMC Genomics 15: 823.
VanRaden, P. M., 2008 Efficient methods to compute genomic predictions. J. Dairy Sci. 91:

Vela-Avitúa, S., T. H. Meuwissen, T. Luan, and J. Ødegård, 2015 Accuracy of genomic selection for a sib-evaluated trait using identity-by-state and identity-by-descent relationships. Genet. Sel. Evol. 47: 9.
Wientjes, Y. C. J., R. F. Veerkamp, and M. P. L. Calus, 2013 The Effect of Linkage Disequilibrium and Family Relationships on the Reliability of Genomic Prediction. Genetics 193:
Windhausen, V. S., G. N. Atlin, J. M. Hickey, J. Crossa, J.-L. Jannink et al., 2012 Effectiveness of genomic prediction of maize hybrid performance in different breeding populations and environments. G3 2:
Wright, S., 1922 Coefficients of Inbreeding and Relationship. Am. Nat. 56:
Würschum, T., J. C. Reif, T. Kraft, G. Janssen, and Y. Zhao, 2013 Genomic selection in sugar beet breeding populations. BMC Genet. 14: 85.
Zhong, S., J. C. M. Dekkers, R. L. Fernando, and J.-L. Jannink, 2009 Factors Affecting Accuracy From Genomic Selection in Populations Derived From Multiple Inbred Lines: A Barley Case Study. Genetics 182:

FIGURES

Figure 1 Linkage disequilibrium (LD_A) between pairs of loci plotted against their genetic map distance in centimorgans (cM), for the two ancestral populations SR (short-range LD) and LR (long-range LD). The two vertical lines represent the average distance between a QTL and its closest SNP for the two marker densities investigated in our study.

Figure 2 Flowchart of the eight scenarios analyzed in this study. Training set and prediction set were either related ("Re"-scenarios) or unrelated ("Un"-scenarios). The arrows represent the changes made between scenarios, e.g., removal of ancestral LD between QTL and SNPs (LD_A → LE_A) or replacement of the relationship matrix (G → Q). The background texture indicates whether identity-by-state or identity-by-descent information was used. For the SNP-based scenarios, the green circles show the sources of information that contributed to prediction accuracy (cf. Habier et al. 2013): in addition to LD_A, RS refers to pedigree relationships at QTL captured by SNPs and CS refers to co-segregation of QTL and SNPs.

Figure 3 (A) Frequency of QTL-SNP pairs falling into 8 disjoint intervals of linkage disequilibrium (LD) in the parents of synthetics, plotted against their genetic map distance, for three different numbers of parents N_P. (B) Average LD between QTL-SNP pairs, plotted against their genetic map distance, for synthetics generated from different N_P. The mean LD in the respective ancestral population (LD_A) is shown for comparison (red graphs). The left column in A and B refers to scenarios Re-LE_A-SNP and Un-LE_A-SNP (independent of the ancestral population), where ancestral LD between QTL and SNPs was eliminated, whereas the other two columns correspond to all other scenarios, for the ancestral populations SR (short-range LD) and LR (long-range LD), respectively.

Figure 4 Linkage phase similarity of QTL-SNP pairs in the training set (TS) and prediction set (PS) for scenarios Re-LD_A-SNP and Un-LD_A-SNP, plotted against the number of parents N_P used to generate synthetics, for the two ancestral populations SR (short-range LD) and LR (long-range LD), and for different genetic map distances (0.5, 5 and 20 cM ± 0.5 cM) between QTL and SNPs.

Figure 5 Prediction accuracy for seven scenarios (scenario Un-LE_A-SNP not shown), plotted against the number of parents N_P used to generate synthetics, for the two ancestral populations SR (short-range LD) and LR (long-range LD). Results refer to a training set size of N_TS = 250 doubled haploid lines and a marker density of 5 SNPs cM⁻¹.

Figure 6 (A) Frequency of the seven possible values f_ij of pedigree relationships for different numbers of unrelated inbred parents N_P used to generate synthetics. (B) Conditional distributions q_ij | f_ij of actual relationships q_ij conditional on their pedigree relationship f_ij between individuals i and j in the training set and prediction set, respectively, for the two ancestral populations SR (short-range LD) and LR (long-range LD).