Genomic Prediction and Selection for Multi-Environments

J. Crossa (1), j.crossa@cgiar.org
P. Pérez (2), perpdgo@gmail.com
G. de los Campos (3), gcampos@gmail.com

(1) CIMMYT-México; (2) ColPos-México; (3) Michigan-USA

June, 2015. CIMMYT, México-SAGPDB
Contents

1 The problem
2 Models
3 Model fitting
4 Cross validation
5 Application examples (Part 1)
6 Model extensions with environmental covariates
The problem

- In most agronomic traits, the effects of genes are modulated by environmental conditions, generating genotype × environment interaction (G×E).
- Researchers working in plant breeding have developed multiple methods for accounting for, and exploiting, G×E in multi-environment trials.
- Genomic selection is gaining ground in plant breeding. Most applications so far are based on single-environment, single-trait models.
- Preliminary evidence (e.g., Burgueño et al., 2012) suggests that there is great scope for improving prediction accuracy using multi-environment models.
- The ideas can be taken one step further by incorporating information on environmental covariates.
Models

Here i indexes environments and j indexes lines.

Model 1 (EL, Environment + Line, no pedigree):
$y_{ij} = \mu + E_i + L_j + e_{ij}$

Model 2 (EA, Environment + Line, with markers):
$y_{ij} = \mu + E_i + g_j + e_{ij}$

Model 3 (Environment, Line, and marker × environment interaction):
$y_{ij} = \mu + E_i + g_j + Eg_{ij} + e_{ij}$
Models: Assumptions

It is assumed that $E_i \sim N(0, \sigma^2_E)$ and $g \sim N(0, \sigma^2_g G)$, with $G$ being the genomic relationship matrix. $Eg_{ij}$ is the interaction term between genotypes and environments, with

$Eg \sim N\left(0, \left[(Z_g G Z_g^T) \circ (Z_E Z_E^T)\right] \sigma^2_{Eg}\right)$,

where $Z_g$ connects genotypes with phenotypes, $Z_E$ connects phenotypes with environments, $\sigma^2_{Eg}$ is the interaction variance, and $\circ$ stands for the Hadamard (element-wise) product between two matrices.
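The interaction covariance above can be illustrated on a toy layout. The sketch below is not part of the original slides: the marker matrix `M`, the 3 lines × 2 environments design, and all object names are made up for illustration. It only shows that the Hadamard product keeps the genomic covariance within an environment and sets it to zero across environments.

```r
# Toy layout (hypothetical): 3 lines observed in 2 environments, 6 records.
set.seed(1)
lines <- factor(rep(c("L1", "L2", "L3"), times = 2))
envs  <- factor(rep(c("E1", "E2"), each = 3))

# Made-up marker matrix (3 lines x 10 markers) and a genomic relationship matrix.
M  <- matrix(rbinom(30, size = 1, prob = 0.5), nrow = 3,
             dimnames = list(levels(lines), NULL))
Mc <- scale(M, center = TRUE, scale = FALSE)   # center only
G  <- tcrossprod(Mc) / ncol(Mc)

# Incidence matrices linking records to lines (Zg) and to environments (ZE).
Zg <- model.matrix(~ lines - 1)
ZE <- model.matrix(~ envs - 1)

# Record-level covariance structures and their Hadamard product.
ZGZ  <- Zg %*% G %*% t(Zg)
ZEZE <- tcrossprod(ZE)
K    <- ZGZ * ZEZE            # "*" is element-wise in R

# TRUE where two records come from the same environment.
same_env <- outer(as.character(envs), as.character(envs), "==")

# Across environments the interaction covariance is zero;
# within an environment the block equals G.
all(K[!same_env] == 0)
all.equal(unname(K[1:3, 1:3]), unname(G))
```

In other words, under this structure the interaction effects of the same line in two different environments are uncorrelated, while lines within an environment share information through `G`.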
Model fitting: Description of data objects

- Y, a data frame containing the elements described below;
- Y$yield (n x 1), a numeric vector with centered and standardized yield;
- Y$VAR (n x 1), a factor giving the IDs for the varieties;
- Y$ENV (n x 1), a factor giving the IDs for the environments;
- A, a symmetric positive semi-definite matrix containing the pedigree or marker-based relationships (dimensions equal to number of lines by number of lines). We assume that rownames(A)=colnames(A) gives the IDs of the lines.
Model fitting: Model 1 (EL, Environment + Line, no pedigree)

    library(BGLR)
    # incidence matrix for main eff. of environments.
    ZE<-model.matrix(~factor(Y$ENV)-1)
    # incidence matrix for main eff. of lines.
    Y$VAR<-factor(x=Y$VAR,levels=rownames(A),ordered=TRUE)
    ZVAR<-model.matrix(~Y$VAR-1)
    # Model Fitting
    ETA<-list( ENV=list(X=ZE,model="BRR"),
               VAR=list(X=ZVAR,model="BRR"))
    fm1<-BGLR(y=Y$yield,ETA=ETA,saveAt="m1_",nIter=6000,burnIn=1000)
Model fitting: Model 2 (EA, Environment + Line, with markers)

    # genomic relationship matrix from centered and scaled markers
    X<-scale(X,center=TRUE,scale=TRUE)
    G<-tcrossprod(X)/ncol(X)
    G<-G/mean(diag(G))
    # chol() requires G to be positive definite; if it fails,
    # a small constant can be added to diag(G) first
    L<-t(chol(G))
    ZL<-ZVAR%*%L
    ETA<-list( ENV=list(X=ZE,model="BRR"),
               Grm=list(X=ZL,model="BRR") )
    fm2<-BGLR(y=Y$yield,ETA=ETA,saveAt="m2_",nIter=6000,burnIn=1000)
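Why the Cholesky factor is used in Model 2 can be checked numerically. If G = LL' and b ~ N(0, σ² I), then ZVAR L b has covariance σ² ZVAR G ZVAR', so a ridge regression (BRR) on ZL implies the marker-based covariance among records. The sketch below uses made-up markers and a toy design with hypothetical object names; the constant added to the diagonal is an assumption made here so that `chol()` works on a singular matrix.

```r
set.seed(2)
n_lines <- 4

# Made-up marker matrix: 4 lines x 50 markers coded 0/1.
M  <- matrix(rbinom(n_lines * 50, size = 1, prob = 0.5), nrow = n_lines)
Mc <- scale(M, center = TRUE, scale = FALSE)
G  <- tcrossprod(Mc) / ncol(Mc)

# Centering makes G singular; add a small constant to the diagonal (assumption)
# so that the Cholesky factorization exists.
diag(G) <- diag(G) + 0.01

# Lower-triangular factor with G = L %*% t(L).
L <- t(chol(G))

# Toy design: each line observed twice (e.g., in two environments).
line_id <- factor(rep(1:n_lines, times = 2))
ZVAR <- model.matrix(~ line_id - 1)
ZL   <- ZVAR %*% L

# Covariance implied by BRR on ZL versus the target ZVAR G ZVAR'.
implied <- tcrossprod(ZL)
target  <- ZVAR %*% G %*% t(ZVAR)
all.equal(unname(implied), unname(target))
```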
Model fitting: Model 3 (Environment, Line, and marker × environment interaction)

    ZGZ<-tcrossprod(ZL)
    ZEZE<-tcrossprod(ZE)
    # Hadamard product gives the interaction kernel
    K<-ZGZ*ZEZE
    # add a small constant to the diagonal
    diag(K)<-diag(K)+1/200
    K<-K/mean(diag(K))
    ETA<-list( ENV=list(X=ZE,model="BRR"),
               Grm=list(X=ZL,model="BRR"),
               EGrm=list(K=K,model="RKHS") )
    fm3<-BGLR(y=Y$yield,ETA=ETA,saveAt="M3_",nIter=6000,burnIn=1000)
Cross validation

1 CV1: Prediction of performance of newly developed lines (i.e., lines that have not been evaluated in any field trials).
2 CV2: Prediction in incomplete field trials; here the aim is to predict the performance of lines that have been evaluated in some environments but not in others.

See Figure 1.
Figure 1: Two hypothetical cross-validation schemes (CV1 and CV2) for five lines (Lines 1-5) and five environments (E1-E5). Source: Jarquín et al. (2014).
Application examples (Part 1): Wheat dataset (CIMMYT)

- Data for n = 599 wheat lines evaluated in 4 environments, wheat improvement program, CIMMYT.
- The dataset includes p = 1279 molecular markers ($x_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, p$), coded as 0/1.
- The pedigree information is also available.

Figure 2: Grain yield by environment. (Panels: histogram of Y$yield, with yield against frequency; Y$yield by environment for environments 1, 2, 4 and 5.)
Application examples (Part 1): Data preparation

    #Load genotypic data
    load("pedigree_markers.rdata")
    #Load phenotypic data
    pheno=read.table(file="599_yield_raw-1.prn",header=TRUE)
    pheno=pheno[,c(2,5,6)]
    #Average yield for each environment-line combination
    index=paste(pheno$env,pheno$gen1,sep="@")
    yavg=tapply(pheno$gy,index,"mean")
    #Recover environment and line IDs from the names
    tmp=names(yavg)
    tmp2=strsplit(tmp,"@")
    gen=character()
    env=character()
    for(i in 1:length(tmp2))
    {
      env[i]=tmp2[[i]][1]
      gen[i]=tmp2[[i]][2]
    }
    Y=data.frame(yield=yavg,VAR=gen,ENV=env)
    #Sort records by environment, then by line
    index=order(as.character(Y$ENV),as.character(Y$VAR))
    Y=Y[index,]
    #Sort the relationship matrix and the markers by line ID, then save
    index=order(colnames(A))
    A=A[index,index]
    X=X[index,]
    save(Y,A,X,file="standarized_data.rdata")
Application examples (Part 1): Code for cross-validation schemes

    #CV=1: assigns lines to folds
    #CV=2: assigns entries of a line to folds
    CV<-2
    nfolds<-5
    sets<-rep(NA,nrow(Y))
    set.seed(123)
    IDs<-as.character(unique(Y$VAR))
    if(CV==1)
    {
      folds<-sample(1:nfolds,size=length(IDs),replace=TRUE)
      for(i in 1:nrow(Y)){
        sets[i]<-folds[which(IDs==Y$VAR[i])]
      }
    }
    if(CV==2)
    {
      IDy<-as.character(Y$VAR)
      for(i in IDs){
        tmp=which(IDy==i)
        ni=length(tmp)
        tmpFold<-sample(1:nfolds,size=ni,replace=ni>nfolds)
        sets[tmp]<-tmpFold
      }
    }
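The difference between the two schemes can be checked on a small toy design. The sketch below wraps the same assignment logic in a helper, `make_folds()`, which is a hypothetical name introduced here and not part of the original script; the toy data frame (5 lines × 4 environments) is also made up. Under CV1 every record of a line should fall in a single fold, whereas under CV2 the records of a line are spread over different folds.

```r
# Hypothetical helper: returns a fold label for each record.
make_folds <- function(var_id, cv, nfolds = 5) {
  var_id <- as.character(var_id)
  ids    <- unique(var_id)
  sets   <- rep(NA_integer_, length(var_id))
  if (cv == 1) {
    # CV1: one fold per line, shared by all of its records.
    folds <- sample(seq_len(nfolds), size = length(ids), replace = TRUE)
    sets  <- folds[match(var_id, ids)]
  } else {
    # CV2: the records of each line are distributed over folds.
    for (id in ids) {
      rows <- which(var_id == id)
      ni   <- length(rows)
      sets[rows] <- sample(seq_len(nfolds), size = ni, replace = ni > nfolds)
    }
  }
  sets
}

# Toy design: 5 lines evaluated in 4 environments (20 records).
set.seed(123)
toy <- expand.grid(VAR = paste0("L", 1:5), ENV = paste0("E", 1:4),
                   stringsAsFactors = FALSE)
cv1 <- make_folds(toy$VAR, cv = 1)
cv2 <- make_folds(toy$VAR, cv = 2)

# Number of distinct folds used by each line under each scheme.
folds_per_line_cv1 <- tapply(cv1, toy$VAR, function(f) length(unique(f)))
folds_per_line_cv2 <- tapply(cv2, toy$VAR, function(f) length(unique(f)))
folds_per_line_cv1
folds_per_line_cv2
```

With 4 records per line and 5 folds, CV2 samples folds without replacement, so each line is never entirely masked in a single fold.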
Application examples (Part 1): Fitting the model and extracting results

    ###################################################
    #Model 1
    ###################################################
    library(BGLR)
    # incidence matrix for main eff. of environments.
    ZE<-model.matrix(~factor(Y$ENV)-1)
    # incidence matrix for main eff. of lines.
    Y$VAR<-as.factor(Y$VAR)
    ZVAR<-model.matrix(~Y$VAR-1)
    # Model Fitting
    ETA<-list( ENV=list(X=ZE,model="BRR"),
               VAR=list(X=ZVAR,model="BRR"))
    # mask the records in fold 1 (testing set)
    y=Y$yield
    testing=(sets==1)
    y[testing]=NA
    fm1<-BGLR(y=y,ETA=ETA,saveAt="m1_",nIter=6000,burnIn=1000)
    unlink("*.dat")
    #Extract the predictions
    predictions=data.frame(env=Y$ENV[testing],
                           Individual=Y$VAR[testing],
                           y=Y$yield[testing],
                           yhat=fm1$yHat[testing])
    #write.table(predictions,file=paste("predictions.csv",sep=""),
    #            row.names=FALSE,sep=",")
    #doBy version
    library(doBy)
    predictions=orderBy(~env,data=predictions)
    lapplyBy(~env,data=predictions,function(x){cor(x$yhat,x$y)})

Output (correlation between observed and predicted yield within each environment):

    > lapplyBy(~env,data=predictions,function(x){cor(x$yhat,x$y)})
    $`1`
    [1] 0.01630911
    $`2`
    [1] 0.6108203
    $`4`
    [1] 0.564435
    $`5`
    [1] 0.289207
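The code above evaluates a single fold. A minimal sketch of how the same evaluation might be repeated over all folds is given below. Everything in it is an assumption made for illustration: the helper `cv_accuracy()`, the stand-in predictor `additive_means()` (simple environment and line means, used only so that the example runs without BGLR), and the simulated toy data. In practice the stand-in would be replaced by a function that fits one of the BGLR models to the masked response and returns its fitted values.

```r
# Hypothetical helper: masks one fold at a time, predicts it, and returns
# the within-environment correlation for every fold.
cv_accuracy <- function(Y, sets, fit_predict) {
  out <- NULL
  for (fold in sort(unique(sets))) {
    testing <- sets == fold
    y_masked <- Y$yield
    y_masked[testing] <- NA
    yhat <- fit_predict(y_masked, Y)
    out <- rbind(out, data.frame(fold = fold, env = Y$ENV[testing],
                                 y = Y$yield[testing], yhat = yhat[testing],
                                 stringsAsFactors = FALSE))
  }
  cells <- split(out, list(out$fold, out$env), drop = TRUE)
  res <- do.call(rbind, lapply(cells, function(d) {
    data.frame(fold = d$fold[1], env = d$env[1], cor = cor(d$y, d$yhat),
               stringsAsFactors = FALSE)
  }))
  rownames(res) <- NULL
  res
}

# Stand-in predictor (NOT a BGLR model): overall mean + environment mean
# + line mean, all estimated from the non-masked records.
additive_means <- function(y_masked, Y) {
  obs <- !is.na(y_masked)
  mu  <- mean(y_masked[obs])
  env_eff  <- tapply(y_masked[obs] - mu, Y$ENV[obs], mean)
  env_part <- as.numeric(env_eff[as.character(Y$ENV)])
  resid    <- y_masked - mu - env_part
  line_eff <- tapply(resid[obs], Y$VAR[obs], mean)
  mu + env_part + as.numeric(line_eff[as.character(Y$VAR)])
}

# Simulated toy data: 100 lines x 4 environments, additive effects plus noise.
set.seed(42)
Y <- expand.grid(VAR = paste0("L", 1:100), ENV = paste0("E", 1:4),
                 stringsAsFactors = FALSE)
line_true <- setNames(rnorm(100), paste0("L", 1:100))
env_true  <- setNames(rnorm(4), paste0("E", 1:4))
Y$yield <- line_true[Y$VAR] + env_true[Y$ENV] + rnorm(nrow(Y), sd = 0.5)

# CV2-style folds: the 4 records of each line go to 4 different folds.
sets <- rep(NA_integer_, nrow(Y))
for (id in unique(Y$VAR)) {
  rows <- which(Y$VAR == id)
  sets[rows] <- sample(1:5, size = length(rows))
}

res <- cv_accuracy(Y, sets, additive_means)
aggregate(cor ~ env, data = res, FUN = mean)  # average over folds, by environment
```

Averaging the correlations over folds, as in the last line, gives one accuracy value per environment rather than one per fold.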
Application examples (Part 1): Results for one fold

Figure 3: Results from CV1 (correlation, on an axis from 0.0 to 0.4, for models M1, M2 and M3).
Figure 4: Results from CV2 (correlation, on an axis from 0.0 to 0.5, for models M1, M2 and M3).
Model extensions with environmental covariates

This model is obtained by extending model EA to incorporate the environmental covariates (ECs).

Model 4 (EAW):
$y_{ij} = \mu + E_i + a_j + t_{ij} + e_{ij}$,

where $a_j$ is the marker-based effect of line $j$ ($g_j$ in model EA), $t_{ij} = \sum_{q=1}^{Q} W_{ijq} \gamma_q$ represents a regression on the ECs, $W_{ijq}$ is the value of the $q$-th EC for the $ij$-th environment-line combination, and $\gamma_q$ is the effect of the $q$-th EC.

Assumptions: $\gamma_q \sim N(0, \sigma^2_\gamma)$, $t = W\gamma \sim N(0, \sigma^2_t W W^T)$.
Model 5 (EAW-A×W):
$y_{ij} = \mu + E_i + a_j + t_{ij} + at_{ij} + e_{ij}$,

where $at_{ij}$ is the interaction between the marker-based line effect and the ECs.

Assumptions: $at \sim N\left(0, \left[(Z_p G Z_p^T) \circ (W W^T)\right] \sigma^2_{at}\right)$, where $\circ$ is the Hadamard product.
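The covariance structures of Models 4 and 5 can be sketched in the same way as the marker × environment kernel of Model 3. The example below is only an illustration on made-up data: the EC values, the toy design, the object names, and the division of WW' by the number of ECs are assumptions introduced here, not part of the slides.

```r
set.seed(3)
n_lines <- 3
n_env   <- 2
Q       <- 4   # number of environmental covariates (ECs)

# Toy design: every line observed in every environment (6 records).
VAR <- factor(rep(paste0("L", 1:n_lines), times = n_env))
ENV <- factor(rep(paste0("E", 1:n_env), each = n_lines))

# Made-up ECs, one row per environment, expanded to one row per record.
EC_env <- matrix(rnorm(n_env * Q), nrow = n_env)
W <- scale(EC_env[as.integer(ENV), ])

# EC-based covariance among records (Model 4); scaling by Q is an assumption.
Omega <- tcrossprod(W) / ncol(W)

# Made-up markers and genomic relationship matrix.
M  <- matrix(rbinom(n_lines * 20, size = 1, prob = 0.5), nrow = n_lines)
Mc <- scale(M, center = TRUE, scale = FALSE)
G  <- tcrossprod(Mc) / ncol(Mc)

# Record-level genomic covariance and the A x W interaction kernel (Model 5).
Zp   <- model.matrix(~ VAR - 1)
ZGZ  <- Zp %*% G %*% t(Zp)
K_at <- ZGZ * Omega

# The Hadamard product of two positive semi-definite matrices is again
# positive semi-definite, so K_at is a valid covariance structure.
min(eigen(K_at, symmetric = TRUE, only.values = TRUE)$values)
```

Following the pattern used for Model 3, matrices such as `Omega` and `K_at` could be supplied to BGLR as additional `list(K=..., model="RKHS")` terms in `ETA`.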
References

Burgueño, J., G. de los Campos, K. Weigel, and J. Crossa. (2012). Genomic prediction of breeding values when modeling genotype × environment interaction using pedigree and dense molecular markers. Crop Science, 52: 707-719.

Jarquín, D., J. Crossa, X. Lacaze, P. Cheyron, J. Daucourt, J. Lorgeou, F. Piraux, et al. (2014). A reaction norm model for genomic selection using high-dimensional genomic and environmental data. Theoretical and Applied Genetics, 127 (3): 595-607.