Genomic Selection in R

Giovanny Covarrubias-Pazaran
Department of Horticulture, University of Wisconsin, Madison, Wisconsin, United States of America
E-mail: covarrubiasp@wisc.edu

Most traits of agronomic importance are quantitative in nature, and genetic markers have been used for decades to dissect such traits. Recently, genomic selection has gained attention as next-generation sequencing technologies have become feasible for major and minor crops. Mixed models have become a key tool for fitting genomic selection models, but most current genomic selection software can include only a single variance component other than the error, which makes hybrid prediction using additive, dominance and epistatic effects unfeasible for species displaying heterotic effects. Likelihood-based software for fitting mixed models with multiple random effects, in which the user can specify the variance-covariance structure of the random effects, has not been fully exploited.

The R package sommer facilitates the use of mixed models for genomic selection and hybrid prediction using more than one variance component and allowing the specification of covariance structures. The program contains four algorithms for estimating variance components: Average Information (AI), Newton-Raphson (NR), Expectation-Maximization (EM) and Efficient Mixed Model Association (EMMA; ridge regression). Kernels for calculating the additive, dominance and epistatic relationship matrices are included, along with other useful functions for genomic analysis. sommer can handle more complex problems than regular genomic selection software, is faster than its Bayesian counterparts by hours to days, and can deal with missing data, all within a user-friendly environment such as R.

Other software available for genomic selection includes:
a) rrBLUP (only functional for a single random effect)
b) regress (only the NR algorithm is available; can return negative variance components)
c) ASReml (not free)
d) BGLR (Bayesian; can take a long time)
e) MCMCglmm (a better Bayesian option, but it takes much longer still)

Four scenarios of genomic selection are highlighted in this document:

PREDICTION OF GENERAL PERFORMANCE OF CROSSES

1) Genotypic and phenotypic data for the parents are available, and we want to predict the performance of the possible crosses assuming a purely additive model (species with no heterosis).
2) Genotypic data for the parents are available, phenotypic data are available for some of the possible crosses (~10%), and we want to predict the performance of the remaining crosses (~90%) assuming an additive-dominance model (species with heterosis).

PREDICTION OF SPECIFIC PERFORMANCE OF INDIVIDUALS WITHIN POPULATIONS

3) Genotypic data for a population of individuals are available, phenotypic data are available only for some (e.g. because phenotyping is very expensive), and you aim to predict the rest of the population using a purely additive model.
4) Genotypic data for a population of individuals are available, phenotypic data are available only for some (e.g. because phenotyping is very expensive), and you aim to predict the rest of the population using an additive-dominance-epistatic model.

Situation 1) occurs when you work with a species whose reproductive mechanism is mainly self-pollination, so heterosis is rarely encountered. The performance of a cross can then be estimated as the average of the parental breeding values (BV). To obtain the breeding values of the parents, the materials are tested in different locations and years, and a mixed model is fitted to obtain the genotypic BLUPs. Henderson realized that when some data are missing, the pedigree among individuals can be used to predict the performance of the individuals lacking data. Keeping track of pedigrees is difficult in all breeding programs, which has led to the estimation of relationships using markers. This marker-based matrix of relationships is called the genomic relationship matrix, and it parallels the additive relationship matrix based on pedigrees.

Assume you work at CIMMYT and have genomic information for 599 lines with 1279 SNP markers each. Given that they are lines, you expect only the additive variance to be significant. Now you want to predict the performance of all possible crosses among those 599 lines. Using sommer you would do it this way:

    #### call the phenotypic and genotypic information
    library(sommer)
    data(wheatLines)
    X <- wheatLines$wheatGeno; X[1:5,1:5]; dim(X)
    Y <- wheatLines$wheatPheno
    rownames(X) <- rownames(Y)
    #### select environment 1 and create incidence and additive
    #### relationship matrices
    y <- Y[,1]             # response: grain yield
    Z1 <- diag(length(y))  # incidence matrix
    K <- A.mat(X)          # additive relationship matrix
    #### perform the GBLUP (kinship-based) approach by
    #### specifying your random effects (ETA) in a 2-level list
    #### structure and run it using the mmer function
    ETA <- list(add=list(Z=Z1, K=K))
    ans <- mmer(y=y, Z=ETA, method="EMMA")  # kinship based
    summary(ans)
    #### predict the progeny by extracting the BV for the lines
    #### and get the average BV for all possible combinations
    GEBV.pb <- ans$u.hat  # these are the BV
    rownames(GEBV.pb) <- rownames(Y)
    crosses <- do.call(expand.grid, list(rownames(Y), rownames(Y))); dim(crosses)
    # keep each unordered pair of distinct parents exactly once
    cross2 <- duplicated(t(apply(crosses, 1, sort)))
    crosses2 <- crosses[cross2,]; head(crosses2); dim(crosses2)
    # get GCA1 and GCA2 of each hybrid
    GCA1 <- GEBV.pb[match(crosses2[,1], rownames(GEBV.pb))]
    GCA2 <- GEBV.pb[match(crosses2[,2], rownames(GEBV.pb))]
    #### join everything and get the mean BV for each combination
    BV <- data.frame(crosses2, GCA1, GCA2); head(BV)
    BV$BVcross <- apply(BV[,c(3:4)], 1, mean); head(BV)
    plot(BV$BVcross)

Finally, you get the GEBV for the 179,101 possible crosses among these 599 wheat lines. You can sort them by predicted performance, and in the real world you would make the best predicted crosses.

Situation 2) occurs when you work with an outcrossing species, in which heterosis is usually encountered. In this example we show how to perform genomic prediction for single crosses that have not been made yet, using information from the single crosses that are available.

Assume you work in a corn breeding program and have 40 plants from 2 heterotic groups, 20 in each (Dent and Flint). You have genotypic data for the 40 parents and phenotypic information for 100 of the 400 possible crosses, evaluated in four environments. You can use this information to predict the other 300 crosses.

    data(cornHybrid)
    hybrid2 <- cornHybrid$hybrid  # extract cross data
    A <- cornHybrid$K             # additive relationship matrix for all parents
    y <- hybrid2$Yield            # response
    ### incidence matrices
    X1 <- model.matrix(~ Location, data = hybrid2); dim(X1)
    Z1 <- model.matrix(~ GCA1 - 1, data = hybrid2); dim(Z1)
    Z2 <- model.matrix(~ GCA2 - 1, data = hybrid2); dim(Z2)
    Z3 <- model.matrix(~ SCA - 1, data = hybrid2); dim(Z3)
    #### realized IBS relationships for each effect
    K1 <- A[levels(hybrid2$GCA1), levels(hybrid2$GCA1)]; dim(K1)
    K2 <- A[levels(hybrid2$GCA2), levels(hybrid2$GCA2)]; dim(K2)
    S <- kronecker(K1, K2); dim(S)
    rownames(S) <- colnames(S) <- levels(hybrid2$SCA)
    ### specify the random components
    ETA <- list(list(Z=Z1, K=K1), list(Z=Z2, K=K2), list(Z=Z3, K=S))
    ans <- mmer(y=y, X=X1, Z=ETA)
    summary(ans)

Now you have fitted values for all 400 possible single-cross hybrids, including those with missing data, along with BLUPs for the GCAs and SCAs.
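The SCA relationship matrix in the hybrid example is built as a Kronecker product of the two parental relationship matrices. The following is a minimal base-R sketch of that step on toy data, not the cornHybrid data: the matrices `K1` and `K2`, the parent labels `D1`, `D2`, `F1`, `F2`, `F3`, and the `:`-separated cross names are all made up for illustration. It shows that each entry of the product is the relationship between the two first-group parents multiplied by the relationship between the two second-group parents.

```r
# Toy additive relationship matrices for two heterotic groups
# (hypothetical values, chosen only for illustration)
K1 <- matrix(c(1.0, 0.2,
               0.2, 1.0), nrow = 2, byrow = TRUE,
             dimnames = list(c("D1", "D2"), c("D1", "D2")))
K2 <- matrix(c(1.0, 0.5, 0.1,
               0.5, 1.0, 0.3,
               0.1, 0.3, 1.0), nrow = 3, byrow = TRUE,
             dimnames = list(c("F1", "F2", "F3"), c("F1", "F2", "F3")))

# Kronecker product: one row and one column per possible cross (2 x 3 = 6)
S <- kronecker(K1, K2)

# Label the crosses; in kronecker(K1, K2) the index of K2 varies fastest
cross_names <- paste(rep(rownames(K1), each = nrow(K2)),
                     rep(rownames(K2), times = nrow(K1)), sep = ":")
dimnames(S) <- list(cross_names, cross_names)

dim(S)
# Relationship between crosses D1 x F2 and D2 x F3,
# equal to K1["D1", "D2"] * K2["F2", "F3"]
S["D1:F2", "D2:F3"]
```

Because the labels are assigned by position, whichever labels are attached to the product (for example the levels of an SCA factor) need to follow this same ordering, with the second group's parent varying fastest; otherwise the relationships end up paired with the wrong crosses.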
Situation 3) occurs when you want to predict the performance of specific individuals that have been genotyped but not phenotyped. This usually happens when phenotyping is expensive and you can only afford to phenotype some individuals, while genotyping is not a limitation. You can then use all the available information to predict the performance of the individuals that are genotyped but not phenotyped. We will predict color for individuals of a full-sib family using a purely additive model.

    data(CPdata)
    CPpheno <- CPdata$pheno
    CPgeno <- CPdata$geno
    ### look at the data
    head(CPpheno)
    CPgeno[1:5,1:5]
    ## set up the response and fit the purely additive model
    y <- CPpheno$color
    Za <- diag(length(y))
    A <- A.mat(CPgeno)  # additive relationship matrix
    y.trn <- y          # copy the response to test prediction accuracy
    ### delete data for 1/5 of the population
    ww <- sample(c(1:dim(Za)[1]), 72)
    y.trn[ww] <- NA
    ETA.A <- list(add=list(Z=Za, K=A))
    ans.A <- mmer(y=y.trn, Z=ETA.A)
    cor(ans.A$fitted.y[ww], y[ww], use="pairwise.complete.obs")

Situation 4) is the same as the previous example, but with the dominance and epistatic relationships added to the model. Given that this is a full-sib family, and that ¼ of the dominance variance (σ²D) is shared among individuals of this type of family, we include these effects expecting to gain prediction accuracy.

    Zd <- diag(length(y))
    Ze <- diag(length(y))
    D <- D.mat(CPgeno)  # dominance relationship matrix
    E <- E.mat(CPgeno)  # epistatic relationship matrix
    ETA.ADE <- list(add=list(Z=Za, K=A), dom=list(Z=Zd, K=D), epi=list(Z=Ze, K=E))
    ans.ADE <- mmer(y=y.trn, Z=ETA.ADE)
    cor(ans.ADE$fitted.y[ww], y[ww], use="pairwise.complete.obs")
    summary(ans.ADE)

As you can see, the epistatic variance is usually zero or too small to make a difference in the prediction, but adding the dominance relationships increased the prediction accuracy in this full-sib family, as theory predicts. You may want to check this from time to time and use it for such families or for polyploid organisms.
Figure. Comparison of the purely additive model and the additive + dominance model, showing an increase in prediction accuracy in a full-sib family where dominance relationships are important.
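The masking scheme used in situations 3) and 4), where a fifth of the phenotypes are hidden, predicted from the relationship matrix, and then correlated with the observed values, can be sketched in base R without any package. This is only an illustration under stated assumptions. The marker matrix `M`, the marker effects and the phenotypes are simulated, not CPdata. The relationship matrix is a VanRaden-style cross-product, which is not necessarily what `A.mat` computes internally. The variance ratio `lambda` is fixed by hand, whereas `mmer` would estimate the variance components from the data.

```r
set.seed(1)
n <- 200  # individuals (hypothetical)
m <- 50   # markers (hypothetical)

# Simulated genotypes coded -1/0/1 with F2-like frequencies
M <- matrix(sample(c(-1, 0, 1), n * m, replace = TRUE,
                   prob = c(0.25, 0.5, 0.25)), nrow = n, ncol = m)

# VanRaden-style genomic relationship matrix
p <- (colMeans(M) + 1) / 2   # allele frequency of each marker
W <- sweep(M, 2, 2 * p - 1)  # centre each marker column
K <- tcrossprod(W) / (2 * sum(p * (1 - p)))

# Simulated additive trait with roughly equal genetic and residual variance
g <- as.vector(W %*% rnorm(m))
y <- 10 + g + rnorm(n, sd = sd(g))

# Mask 1/5 of the individuals and train on the rest
ww  <- sample(n, n / 5)
trn <- setdiff(seq_len(n), ww)

# BLUP with an ASSUMED variance ratio lambda = sigma2_e / sigma2_g
lambda <- 1
Vinv  <- solve(K[trn, trn] + lambda * diag(length(trn)))
mu    <- sum(Vinv %*% y[trn]) / sum(Vinv)     # GLS estimate of the mean
g.hat <- K[, trn] %*% Vinv %*% (y[trn] - mu)  # predictions for everyone

# Prediction accuracy on the masked individuals
acc <- cor(g.hat[ww], y[ww])
acc
```

The correlation obtained this way depends on the random split, so repeating the masking over several samples of `ww` and averaging gives a more stable comparison between models.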