Technical Challenges to Implementation of Genomic Selection


1 Technical Challenges to Implementation of Genomic Selection
C. Maltecca, Animal Science Department, NCSU
NSIF Meeting

2 Outline
- Introduction
- A first choice: Single stage vs. Two stages
- A syncretic overview of priors in two stages approaches
- Alternatives to MCMC
- Lighting the black box: toward the informed use of prior information
- A look forward

3 Introduction
A few sparse remarks:
- Livestock traits are in large part polygenic; so far there is little evidence of truly oligogenic traits.
- No knowledge of the specific genes controlling these traits has been needed to make improvement through selection.
- Genomic selection promises to increase selection accuracy and accelerate genetic improvement by emphasizing the SNP most strongly correlated with phenotype.
- The genes and sequence variants affecting phenotype remain largely unknown (Snelling et al. 01).
- Genomic predictions theoretically rely on linkage disequilibrium (LD) between genotyped SNP and unknown functional variants, but familial linkage de facto increases effectiveness when predicting individuals related to those in the training data.

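4 A first choice: Single stage vs. Two stages
ssBLUP:
- Simple (at least conceptually)
- Applicable with minimal changes to a wide variety of models (threshold, RRM, survival)

\[
\begin{bmatrix} X'X & X'W \\ W'X & W'W + \alpha H^{-1} \end{bmatrix}
\begin{bmatrix} b \\ u \end{bmatrix}
=
\begin{bmatrix} X'y \\ W'y \end{bmatrix},
\qquad
H^{-1} =
\begin{bmatrix} A^{11} & A^{12} \\ A^{21} & A^{22} + G^{-1} - A_{22}^{-1} \end{bmatrix}
\]
where A^{ij} are the blocks of A^{-1} and the subscript 2 denotes the genotyped individuals.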
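To make the structure of H^{-1} concrete, the following is a minimal NumPy sketch. It is an illustration under stated assumptions, not how production ssBLUP software is written: it uses small dense matrices and explicit inverses, and the function name `h_inverse` and its arguments are hypothetical.

```python
import numpy as np

def h_inverse(A, G, genotyped):
    """Assemble H^{-1} = A^{-1} + [[0, 0], [0, G^{-1} - A22^{-1}]].

    A         : dense pedigree relationship matrix (n x n), assumed positive definite
    G         : genomic relationship matrix for the genotyped animals
    genotyped : indices (rows of A) of the genotyped animals, in the order used by G
    """
    idx = np.ix_(genotyped, genotyped)
    H_inv = np.linalg.inv(A)                 # start from A^{-1}
    A22_inv = np.linalg.inv(A[idx])          # inverse of the genotyped block of A
    H_inv[idx] += np.linalg.inv(G) - A22_inv # correct only the genotyped block
    return H_inv

# Toy pedigree (assumption): animals 0 and 1 are unrelated founders,
# animals 2 and 3 are their full-sib offspring and are genotyped.
A = np.array([[1.0, 0.0, 0.5, 0.5],
              [0.0, 1.0, 0.5, 0.5],
              [0.5, 0.5, 1.0, 0.5],
              [0.5, 0.5, 0.5, 1.0]])
G = np.array([[1.02, 0.45],
              [0.45, 0.98]])
H_inv = h_inverse(A, G, [2, 3])
```

Only the block for the genotyped animals differs from A^{-1}, which is why the approach needs minimal changes to existing models. Inverting `H_inv` returns a matrix whose genotyped block equals `G`.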

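5 A first choice: Single stage vs. Two stages
ssBLUP is computationally demanding:
- In pedigree-based BLUP, the cost of solving the MME is roughly proportional to the number of individuals in the pedigree.
- In most genomic ssBLUP implementations, the cost is cubic in the number of genotyped individuals.
- Some of the software used for national evaluations needs to be rewritten efficiently.

Legarra and Ducrocq (01) solve the ssMME by considering two systems:
- one as close as possible to the regular MME;
- one isolating the block corresponding to the genotyped individuals;
- iterating and updating between the solutions of the two systems.

\[
\begin{bmatrix}
X'X & X_1'W_1 & X_2'W_2 & 0 & 0 \\
W_1'X_1 & W_1'W_1 + \alpha A^{11} & \alpha A^{12} & 0 & 0 \\
W_2'X_2 & \alpha A^{21} & W_2'W_2 + \alpha A^{22} & \alpha I & -\alpha I \\
0 & 0 & \alpha I & -\alpha G & 0 \\
0 & 0 & -\alpha I & 0 & \alpha A_{22}
\end{bmatrix}
\begin{bmatrix} b \\ u_1 \\ u_2 \\ \phi \\ \gamma \end{bmatrix}
=
\begin{bmatrix} X'y \\ W_1'y_1 \\ W_2'y_2 \\ 0 \\ 0 \end{bmatrix}
\]
where the subscripts 1 and 2 denote non-genotyped and genotyped individuals.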

6 A first choice: Single stage vs. Two stages
ssBLUP is computationally demanding:
- Creation and inversion of G for 30,000 individuals (0,000 SNP) takes ~3 h (Aguilar 2011).
- Faux et al. (01): a method to approximate the inverse of G based on a Cholesky factorization.

\[
G^{-1} = T' D^{-1} T,
\qquad
G = \frac{ZZ'}{2 \sum_{i=1}^{m} p_i (1 - p_i)}
\]

For each subject:
i) select the closely related animals based on a threshold;
ii) regress the subject on these individuals;
iii) fill the subject's row of T with the inverse of the solutions, and 1 on the diagonal.
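The three steps above can be sketched in NumPy as follows. This is a hedged illustration, not the implementation of Faux et al.: it assumes that "inverse of the solutions" means their negative, that each subject is regressed only on animals earlier in the ordering (so T stays lower triangular), and that a small ridge is added to G to make it invertible. The function names and the threshold value are hypothetical.

```python
import numpy as np

def vanraden_g(M):
    """G = ZZ' / (2 * sum p_i (1 - p_i)) from an n x m matrix of 0/1/2 genotypes."""
    p = M.mean(axis=0) / 2.0          # allele frequencies estimated from the data
    Z = M - 2.0 * p                   # centred genotypes
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

def approx_g_inverse(G, threshold):
    """Approximate G^{-1} as T' D^{-1} T, one subject (row of T) at a time."""
    n = G.shape[0]
    T = np.eye(n)                     # 1 on the diagonal
    d = np.empty(n)
    for i in range(n):
        prev = np.arange(i)
        # i) earlier animals whose relationship with subject i reaches the threshold
        close = prev[G[i, prev] >= threshold]
        if close.size == 0:
            d[i] = G[i, i]
            continue
        # ii) regression of subject i on the selected animals
        coef = np.linalg.solve(G[np.ix_(close, close)], G[close, i])
        # iii) negated solutions go into row i of T
        T[i, close] = -coef
        d[i] = G[i, i] - G[i, close] @ coef   # residual variance of the regression
    return T.T @ np.diag(1.0 / d) @ T

rng = np.random.default_rng(0)
M = rng.integers(0, 3, size=(30, 200)).astype(float)   # simulated genotypes
G = vanraden_g(M) + 0.05 * np.eye(30)                   # ridge: assumption for invertibility
G_inv_approx = approx_g_inverse(G, threshold=0.1)
```

When every earlier animal is retained (for example `threshold=-np.inf`), the decomposition is exact and the result matches `np.linalg.inv(G)`. A stricter threshold makes T sparser, trading accuracy for cost.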

7 A syncretic overview of priors in two stages approaches
Most two-stage approaches rely on a Bayesian implementation.
[Figure: hierarchical model diagram. Nodes: phenotypes, population mean, residual variance, genotypes, pedigree, effect size, polygenic effect, effect variance, polygenic variance, π, indicator. Priors: prior for the effect variance, prior for the polygenic variance, prior for π, prior for the scale, non-informative priors. Modified from Karkkainen and Sillanpaa (01).]

8 A syncretic view of priors in two stages approaches
All two-stage approaches share some commonality: most prior specifications can be seen as either generalizations or particular cases of others (De los Campos et al. 01).

9 A syncretic overview of priors in two stages approaches
The performance of different prior specifications is relatively similar, in both livestock and plants, for a wide range of traits.
[Table: reliability of breeding values for sires in the validation set. Traits: TMT, AT, TP, DT, MMF, AVGF. Methods: PA, dEBV, GBLUP, BayesA, Bayesian Lasso, HBLUP. Values not preserved. Gray et al. (01).]

10 A syncretic overview of priors in two stages approaches
- Attempts have been made to generalize the most popular Bayesian methods to allow differential shrinkage for an arbitrary number (>2) of marker groups (Gianola et al., 2010; Maltecca et al. 01).
- In these implementations, marker grouping is data driven.
- In most cases with real data, cluster behavior is erratic, and standard models perform comparably well without the extra burden of inferring the number of clusters.

11 Alternatives to MCMC
- Although popular, the Markov chain Monte Carlo (MCMC) method is not computationally optimal for high-dimensional problems (HD? Sequence?): the computational burden is proportional to n*m.
- A few alternatives have been proposed, mostly based on maximum a posteriori (MAP) estimation by expectation maximization (EM): Hayashi and Iwata (2010); Karkkainen and Sillanpaa (01), a generalized EM algorithm; Sun et al. (01).
- Unlike most MAP methods, variational Bayes (VB) methods (Li and Sillanpaa 01) produce a measure of uncertainty of the estimates and can be considered a generalization of EM methods.
- In a full Bayesian analysis many posterior distributions are intractable; VB estimation finds a tractable distribution that approximates the target posterior.
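The VB idea can be illustrated with a deliberately simple case. The sketch below is not the VB LASSO of the talk: it assumes a Gaussian prior on marker effects with known variances `s2b` and `s2e` (both hypothetical inputs) and a fully factorized approximation q(b) = prod_j N(mu_j, v_j), updated by coordinate ascent until the largest change in the means falls below a tolerance.

```python
import numpy as np

def vb_ridge(X, y, s2b, s2e, tol=1e-8, max_iter=10_000):
    """Mean-field VB for y = Xb + e, b_j ~ N(0, s2b), e ~ N(0, s2e * I).

    Returns the variational means, the variational variances and the number
    of sweeps used. Each sweep costs on the order of n*m operations.
    """
    n, m = X.shape
    xtx = np.einsum("ij,ij->j", X, X)        # x_j' x_j for every marker
    v = 1.0 / (xtx / s2e + 1.0 / s2b)        # variational variances (fixed here)
    mu = np.zeros(m)
    resid = np.array(y, dtype=float)         # running residual y - X mu
    for sweep in range(1, max_iter + 1):
        max_change = 0.0
        for j in range(m):
            xj = X[:, j]
            # x_j'(y - X_{-j} mu_{-j}) written with the running residual
            new = v[j] * (xj @ resid + xtx[j] * mu[j]) / s2e
            resid -= xj * (new - mu[j])
            max_change = max(max_change, abs(new - mu[j]))
            mu[j] = new
        if max_change < tol:                 # convergence criterion
            break
    return mu, v, sweep

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 20))
y = X @ (0.5 * rng.standard_normal(20)) + rng.standard_normal(50)
mu, v, n_sweeps = vb_ridge(X, y, s2b=0.25, s2e=1.0)
```

In this Gaussian case the variational means converge to the exact posterior means, while the variational variances `v` are never larger than the exact posterior variances. This is the usual price of the factorized approximation, but unlike a plain MAP estimate it still provides a measure of uncertainty.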

12 Alternatives to MCMC
Variational Bayes LASSO application to genomic predictions (Maltecca, unpublished data).
[Figure: rate of convergence against iteration for two hyperparameter settings. With a = d = e = g = 1, convergence time was 15 min; with a = d = e = g = 0.001, convergence time was 65 min. In both runs the logged convergence criterion decreased to the order of 1e-06.]

13 Lighting the black box: towards the informed use of prior information
- It was originally postulated that genomic predictions would rely on linkage disequilibrium (LD) between genotyped SNP and unknown functional variants.
- Evidence suggests that within-family disequilibrium between alleles at markers and those at causal loci plays a central role when predicting individuals related to those in the training data.
- Because of this, prediction accuracy is highly dependent on familial relationships, which has limited the ability to use data from nominally unrelated individuals to enhance prediction accuracy.
- It is possible that genomic selection making appropriate use of functionally informed SNP genotypes will be less reliant on within-family disequilibrium, allowing robust prediction across unrelated populations and for low-heritability complex traits.
- Annotation tools are already available to identify the features most likely to affect phenotypes; this information can be used to structure the prior density assigned to marker effects.

14 Lighting the black box: towards the informed use of prior information
A first attempt to utilize biological guidance in DGV predictions.
Data from on-farm records of 9 diseases (Parker et al. 01; ~1 million events on 600,000 cows): cystic ovaries (CYST), digestive problems (DIGE), displaced abomasum (DSAB), ketosis (KETO), lameness (LAME), mastitis (MAST), metritis (METR), reproductive problems (REPR), and retained placenta (RETP).
[Table: heritability and SE per disease (columns CYST, DIGE, DSAB, KETO, LAME, MAST, METR, REPR), from a sire threshold liability model (ST) on ~00,000 cows/health. Values not preserved.]

15 Lighting the black box: towards the informed use of prior information
Liabilities for sires obtained from the models in section 1 were used to perform a principal component analysis. The first principal components combined explained roughly 78% of the overall variability in liability.
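A principal component analysis of this kind can be sketched as below. The input here is simulated, not the sire liabilities of the talk, and standardizing each disease before the decomposition is an assumption; the function name `pca_explained` is hypothetical.

```python
import numpy as np

def pca_explained(L):
    """PCA of a sires x diseases matrix of liabilities via the SVD.

    Returns the PC scores, the loadings (one row per component) and the
    proportion of variance explained by each component.
    """
    Z = (L - L.mean(axis=0)) / L.std(axis=0, ddof=1)   # standardize each disease
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)                    # sorted, largest first
    scores = U * s                                     # same as Z @ Vt.T
    return scores, Vt, explained

rng = np.random.default_rng(2)
L = rng.standard_normal((100, 9))        # 100 simulated sires, 9 diseases
scores, loadings, explained = pca_explained(L)
cumulative = np.cumsum(explained)        # share explained by the first k PCs
```

The cumulative sum is the quantity reported on the slide: the share of overall variability captured by the first few components. The first column of `scores` is the first PC used as the phenotype in the association analysis.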

16 Lighting the black box: towards the informed use of prior information
Association
- Association analysis was carried out with a Bayes-C model averaging approach on the first PC of liabilities of sires (83) with reliability >.0 averaged across single traits.
- The posterior means of SNP effects were used collectively to predict the genomic merit of non-overlapping windows of 1 Mb.
- Windows that contributed the highest variance (>0.1) relative to the total genomic variance were considered as indicating a potential association.
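The window summary described above can be sketched as follows, under stated assumptions: all markers lie on a single chromosome, `effects` holds posterior means of SNP effects from some earlier fit, and the function name and the toy data are hypothetical.

```python
import numpy as np

def window_variance_share(Z, effects, positions_bp, window_bp=1_000_000):
    """Variance of the genomic merit of each non-overlapping window,
    expressed as a proportion of the total genomic variance.

    Z            : n x m genotype matrix
    effects      : m posterior means of SNP effects
    positions_bp : m base-pair positions (single chromosome assumed)
    """
    window_id = np.asarray(positions_bp) // window_bp
    total = np.var(Z @ effects)                       # total genomic variance
    shares = {}
    for w in np.unique(window_id):
        cols = window_id == w
        merit = Z[:, cols] @ effects[cols]            # genomic merit of the window
        shares[int(w)] = np.var(merit) / total
    return shares

Z = np.array([[0, 1, 2, 1],
              [1, 0, 1, 2],
              [2, 2, 0, 0],
              [1, 1, 1, 1]], dtype=float)
effects = np.array([0.5, -0.2, 0.0, 0.0])
positions = np.array([100_000, 900_000, 1_200_000, 1_800_000])
shares = window_variance_share(Z, effects, positions)
```

Because windows can be correlated through LD, the shares need not sum to one. Windows whose share exceeds the chosen cut-off would be flagged as potential associations.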

17 Lighting the black box: towards the informed use of prior information
- We are not looking at a specific trait but at a combination of (a few, several) biological processes underlying disease resistance (avoiding being sick at all), or specific to the etiology of each particular disease or group of diseases.
- In this case we speculate that we should observe enrichment in pathways or ontologies directly related to metabolic pathways for metabolic diseases, and perhaps reproductive ones, while (or in addition?) general immune functions would relate to mastitis and lameness (reproductive?).
- The SNAT on-line annotation tool was used, with two criteria: GO enrichment and pathway enrichment.

18 Lighting the black box: towards the informed use of prior information
Pathway enrichment

Entry | Pathway | PCA | Class
bta01 | Endocytosis | | Cellular Processes; Cell Growth and Death
bta015 | Phagosome | | Cellular Processes; Transport and Catabolism
bta010 | Apoptosis | | Cellular Processes; Transport and Catabolism
bta0060 | Glycine, serine and threonine metabolism | | Metabolism; Amino Acid Metabolism
bta00500 | Starch and sucrose metabolism | | Metabolism; Carbohydrate Metabolism
bta00190 | Oxidative phosphorylation | 1 | Metabolism; Energy Metabolism
bta0051 | O-Glycan biosynthesis | - 1 | Metabolism; Glycan Biosynthesis and Metabolism
bta00531 | Glycosaminoglycan degradation | - | Metabolism; Glycan Biosynthesis and Metabolism
bta01100 | Metabolic pathways | - | NA
bta01100 | Metabolic pathways | - | NA
bta0910 | Insulin signaling pathway | - | Organismal Systems; Endocrine System
bta006 | Chemokine signaling pathway | - 3 | Organismal Systems; Immune System
bta006 | Chemokine signaling pathway | - | Organismal Systems; Immune System
bta006 | Chemokine signaling pathway | - 1 | Organismal Systems; Immune System
bta06 | RIG-I-like receptor signaling pathway | - | Organismal Systems; Immune System
bta063 | Cytosolic DNA-sensing pathway | - | Organismal Systems; Immune System
bta0660 | T cell receptor signaling pathway | - | Organismal Systems; Immune System

19 Lighting the black box: towards the informed use of prior information
GO enrichment
[Figure: GO enrichment results.]

20 Lighting the black box: towards the informed use of prior information
Markers were refitted using the results from the PCA, assigning them to enriched categories: 1) classes identified, 2) enriched pathways, 3) GO terms.
A simple modification of the spike-and-slab model:
- all markers pertaining to an "enriched category" (based on proximity) are assigned to the slab;
- all other markers are assigned to the spike (no distinction between categories);
- markers are not allowed to jump during iterations.
[Table: Bayes-C vs. guided model, by fold, for retained placenta, mastitis and ketosis. Values not preserved.]
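The guided assignment described above can be sketched as follows. This is an illustration, not the model fitted in the talk. The interval list, the flank size and both function names are hypothetical. The spike is assumed to be a point mass at zero, and a ridge-type solve stands in for the MCMC sampling of slab effects.

```python
import numpy as np

def slab_indicator(snp_pos, enriched_intervals, flank_bp=0):
    """True for markers within flank_bp of any enriched interval (start, end)."""
    snp_pos = np.asarray(snp_pos)
    in_slab = np.zeros(snp_pos.shape, dtype=bool)
    for start, end in enriched_intervals:
        in_slab |= (snp_pos >= start - flank_bp) & (snp_pos <= end + flank_bp)
    return in_slab

def guided_fit(Z, y, in_slab, lam):
    """Estimate effects with a fixed indicator: spike markers stay at zero,
    slab markers get a ridge-type estimate (swapped in for MCMC here)."""
    Zs = Z[:, in_slab]
    effects = np.zeros(Z.shape[1])
    lhs = Zs.T @ Zs + lam * np.eye(Zs.shape[1])
    effects[in_slab] = np.linalg.solve(lhs, Zs.T @ y)
    return effects

positions = [10, 50, 120, 300]
in_slab = slab_indicator(positions, [(40, 100)], flank_bp=20)

rng = np.random.default_rng(3)
Z = rng.standard_normal((20, 4))
y = Z @ np.array([0.0, 1.0, -0.5, 0.0]) + 0.1 * rng.standard_normal(20)
effects = guided_fit(Z, y, in_slab, lam=1.0)
```

Because the indicator is fixed before fitting, no marker can move between spike and slab. This is the "not allowed to jump" restriction, and it is what distinguishes the guided model from a standard Bayes-C fit.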

21 A look forward
Challenges in efficiently including, summarizing and visualizing genomic information in selection programs:
- reduce the computational burden;
- efficiently store, retrieve and process data.
Challenges in efficiently integrating genomic information:
- build models that capitalize on functional annotation;
- build infrastructure to allow data connection, integration and interpretation.

22 Acknowledgments
K. Gray, K. Parker, Y. Huang, J. Cole, J. Cassady, NC Agricultural Foundation, W. Snelling, G. de los Campos, M. Cleveland