Efficient inversion of genomic relationship matrix and its implications. Ignacy Misztal University of Georgia. Work at UGA


Transcription

1 Efficient inversion of genomic relationship matrix and its implications
Ignacy Misztal, University of Georgia
Work at UGA
  Evaluation methodologies for nearly all major animal genetic companies across species
  Side effects of many sponsors
    Troubleshooting
    Ideas for new research
    Access to perhaps the most extensive data sets anywhere
This talk
  Discovery of an efficient inverse of the genomic relationship matrix
  Implications for GWA and deliverables from sequence data

2 Genetic evaluation over time
Index: sequential fitting of selected effects
BLUP: joint fitting of all effects
  Crucial: efficient inversion of the numerator relationship matrix
Multistep genomic: sequential fitting
Single-step GBLUP: joint fitting
  Cost of direct inversion of G is high for over 100k animals
  Inverse of the matrix that combines pedigree and genomic relationships (Aguilar et al., 2010; Christensen and Lund, 2010; Boemcke et al., 2010):
  $$H^{-1} = A^{-1} + \begin{bmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{bmatrix}$$
$\mathrm{GEBV}_{young} = w_1 \mathrm{PA} + w_2 \mathrm{DGV} - w_3 \mathrm{PI}$

3 Size problem
Number of genotyped animals
  About 1 million Holsteins
    ~25,000 proven bulls
  Close to 200,000 Angus
Easy inversion of G for 100,000 animals
Dimensionality of genomic relationship matrix
  Eigenvalue decomposition: $G = U D U'$
  G not full rank (VanRaden, 2008; Macciotta et al., 2010; Aguilar et al., 2010)
  Problems with inversion in Holsteins at > 5-10k genotypes
  Dimensionality around 10k (Macciotta et al., 2010)
  $G = U_t D_t U_t'$
  If PCG iteration, need $Gq$: $Gq = U_t D_t (U_t' q)$
  Eigenvalue decomposition expensive
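
The product $Gq = U_t D_t (U_t' q)$ can be sketched in a few lines of NumPy. This is only an illustration under assumed toy data: the genotype matrix, its size, the scaling of G, and the helper name `G_times` are all hypothetical and not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 50 animals, 20 SNP, so rank(G) <= 20.
n_animals, n_snp = 50, 20
Z = rng.integers(0, 3, size=(n_animals, n_snp)).astype(float)
Z -= Z.mean(axis=0)          # center each SNP column
G = Z @ Z.T / n_snp          # simple unweighted G (the scaling is an assumption)

# Full symmetric eigendecomposition, G = U D U'.
d, U = np.linalg.eigh(G)     # eigenvalues come back in ascending order
order = np.argsort(d)[::-1]  # re-sort to descending
d, U = d[order], U[:, order]

# Keep the t largest eigenvalues.
t = n_snp
U_t, d_t = U[:, :t], d[:t]

def G_times(q):
    """Return G q using only the truncated factors: U_t D_t (U_t' q)."""
    return U_t @ (d_t * (U_t.T @ q))

q = rng.standard_normal(n_animals)
max_abs_error = np.max(np.abs(G_times(q) - G @ q))
```

Each product costs two thin matrix-vector multiplications once the factors exist; the decomposition itself remains the expensive step, as the slide notes.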

4 Proposed computing options
Unsymmetric Single-Step (Misztal et al., 2009; Legarra and Ducrocq, 2011)
  Does not converge
SS SNP model with imputation for ungenotyped animals (Fernando et al., 2014)
  Very expensive
SS with SNP effects for genotyped animals only (Legarra and Ducrocq, 2011; Liu et al., 2014)
  Does not converge
All require substantial/untested changes to mixed models
"Anybody can make things bigger, more complex.. It takes imagination -- and a lot of courage -- to move in the opposite direction."
"Imagination is more important than knowledge." Einstein

5 Inversion by recursion
$u_i \mid u_1, u_2, \ldots, u_{i-1} = p_i' u + \phi_i$, that is, $u = Pu + \Phi$
  Generic recursion, any order of animals
$$\mathrm{var}(u)^{-1} = (I - P)'\, \mathrm{var}(\Phi)^{-1}\, (I - P)$$
  Cost low only if P sparse
For pedigree relationships, if animals are ordered from oldest to youngest (Henderson, 1976; Quaas, 1988):
  $u_i = 0.5\, u_{s_i} + 0.5\, u_{d_i} + \phi_i$
  P very sparse
Is limited recursion applicable to genomic relationships?
  Effort by Faux et al. (2012)
Algorithm for proven and young animals (APY)
  For young animals:
  $$u_i \mid u_1, u_2, \ldots, u_{i-1} = \sum_{j = \text{"proven"}} p_{ij} u_j + \sum_{j = \text{"young"}} p_{ij} u_j + e_i$$
  with the sum over young animals = 0 in GBLUP
  Misztal et al. (2014):
  $$G^{-1} = \begin{bmatrix} G_{pp}^{-1} & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} -G_{pp}^{-1} G_{py} \\ I \end{bmatrix} M^{-1} \begin{bmatrix} -G_{yp} G_{pp}^{-1} & I \end{bmatrix}$$
  $m_i = g_{ii} - z_i' Z_p' G_{pp}^{-1} Z_p z_i$
  $Z_p$: genotypes for proven animals; $Z_y$: genotypes for young animals
  Linear cost for young animals
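
The pedigree case of the recursion can be checked numerically on a small example. The sketch below assumes a made-up five-animal pedigree and builds A by the tabular method only to verify the result; the variable names and the pedigree itself are hypothetical.

```python
import numpy as np

# Hypothetical pedigree, ordered oldest to youngest; -1 marks an unknown parent.
# Each entry is (sire, dam) as 0-based indices of earlier animals.
pedigree = [(-1, -1), (-1, -1), (0, 1), (0, 1), (2, 3)]
n = len(pedigree)

# Numerator relationship matrix A by the tabular method (used for checking only).
A = np.zeros((n, n))
for i, (s, d) in enumerate(pedigree):
    for j in range(i):
        a_js = A[j, s] if s >= 0 else 0.0
        a_jd = A[j, d] if d >= 0 else 0.0
        A[i, j] = A[j, i] = 0.5 * (a_js + a_jd)
    A[i, i] = 1.0 + (0.5 * A[s, d] if s >= 0 and d >= 0 else 0.0)

# Recursion u = P u + phi: each animal regresses on its known parents with 0.5.
P = np.zeros((n, n))
var_phi = np.zeros(n)
for i, (s, d) in enumerate(pedigree):
    parents = [p for p in (s, d) if p >= 0]
    for p in parents:
        P[i, p] += 0.5
    # Mendelian sampling variance: 1 - 0.25 * sum of parental diagonals of A.
    var_phi[i] = 1.0 - 0.25 * sum(A[p, p] for p in parents)

# var(u)^-1 = (I - P)' var(phi)^-1 (I - P)
I = np.eye(n)
A_inv_recursive = (I - P).T @ np.diag(1.0 / var_phi) @ (I - P)
max_abs_error = np.max(np.abs(A_inv_recursive - np.linalg.inv(A)))
```

Each row of P has at most two nonzero entries, which is why the cost of this inverse stays low for pedigrees.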

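6 Tests with Holsteins (Fragomeni et al., 2015)
G needed; $G^{-1}$: regular inverse, APY inverse
Correlations of GEBV with regular inverse
  23k bulls as proven: >
  k cows as proven: >
  k random animals as proven: > 0.99
Choose core (c) and noncore (n) animals
$u = \begin{bmatrix} u_c \\ u_n \end{bmatrix}$, $u_n = P_{nc} u_c + \varepsilon_n$
$$\begin{bmatrix} u_c \\ u_n \end{bmatrix} = \begin{bmatrix} I & 0 \\ P_{nc} & I \end{bmatrix} \begin{bmatrix} u_c \\ \varepsilon_n \end{bmatrix}$$
$$G = \begin{bmatrix} I & 0 \\ P_{nc} & I \end{bmatrix} \begin{bmatrix} G_{cc} & 0 \\ 0 & M_{nn} \end{bmatrix} \begin{bmatrix} I & P_{cn} \\ 0 & I \end{bmatrix}$$
The inverse:
$$G^{-1} = \begin{bmatrix} I & -P_{cn} \\ 0 & I \end{bmatrix} \begin{bmatrix} G_{cc}^{-1} & 0 \\ 0 & M_{nn}^{-1} \end{bmatrix} \begin{bmatrix} I & 0 \\ -P_{nc} & I \end{bmatrix}$$
$P_{nc} = G_{nc} G_{cc}^{-1}$, $M_{nn} = \mathrm{diag}\{ g_{i,i} - p_{i,1:i-1}\, g'_{i,1:i-1} \}$
Do we need info of SNP clusters / chromosome blocks? No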
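
The core/noncore inverse can be assembled block by block. The following is a minimal sketch under assumptions that are not from the talk: toy random genotypes, the first 10 animals arbitrarily taken as core, and a small blending constant added so that $G_{cc}$ is invertible. It checks that the assembled matrix is the exact inverse of the matrix implied by the factorization.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: 30 animals, 8 SNP; the first 10 animals are taken as core.
n_animals, n_snp, n_core = 30, 8, 10
Z = rng.standard_normal((n_animals, n_snp))
G = Z @ Z.T / n_snp + 0.05 * np.eye(n_animals)   # blended so that G_cc is invertible

c = np.arange(n_core)                 # core indices (assumed choice)
n = np.arange(n_core, n_animals)      # noncore indices

G_cc = G[np.ix_(c, c)]
G_nc = G[np.ix_(n, c)]

G_cc_inv = np.linalg.inv(G_cc)        # the only dense inverse, core-sized
P_nc = G_nc @ G_cc_inv                # regression of noncore on core
# Diagonal of M_nn: m_i = g_ii - p_i g_i', using relationships to core only.
m = np.diag(G)[n] - np.einsum("ij,ij->i", P_nc, G_nc)

# G^-1 = [I -P_cn; 0 I] [G_cc^-1 0; 0 M_nn^-1] [I 0; -P_nc I], expanded by blocks.
PtMinv = P_nc.T / m                   # P_cn M_nn^-1
G_apy_inv = np.zeros_like(G)
G_apy_inv[np.ix_(c, c)] = G_cc_inv + PtMinv @ P_nc
G_apy_inv[np.ix_(c, n)] = -PtMinv
G_apy_inv[np.ix_(n, c)] = -PtMinv.T
G_apy_inv[np.ix_(n, n)] = np.diag(1.0 / m)

# The matrix that this inverse actually inverts: G with a modified noncore block.
G_apy = G.copy()
G_apy[np.ix_(n, n)] = P_nc @ G_cc @ P_nc.T + np.diag(m)
identity_error = np.max(np.abs(G_apy_inv @ G_apy - np.eye(n_animals)))
```

Only the core block is inverted densely; each noncore animal adds one row of `P_nc` and one element of `m`, which matches the linear cost stated for young animals on the previous slide.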

7 Costs with 720k genotyped animals (Masuda et al., 2016)
22 M US Holsteins for production
Computing time for APY inverse: 4 h
  Would be one month with direct inverse on a supercomputer
Genomic single-step GBLUP evaluation ~2 times more expensive than BLUP
Accuracy worse with all genotypes
  Quality issues
  Indirect prediction (Lourenco et al., 2015)
"Everything should be made as simple as possible, but not simpler." Einstein

8 Why does APY work?
Limited dimensionality of genomic information
  Limited number of independent SNP clusters
  Limited number of independent chromosomal segments (Me)
Independent chromosome segments
  Heterogenic and homogenic tracts in genome (Stam, 1980)
  Rediscovered by VanRaden (2008)
  Independent chromosome segments Me (Goddard, 2009; Daetwyler, 2010)
  $E(M_e) = 4 N_e L$ (Stam, 1980)
    $N_e$: effective population size
    $L$: length of genome in Morgans
  Need 12 Me SNPs to detect 90% of junctions (MacLeod et al., 2005)
[Figure: % genome coverage against number of biggest segments; 100% marked at 4NeL]
6/3/2016
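
As a worked instance of $E(M_e) = 4 N_e L$, the snippet below plugs in illustrative numbers. The values Ne = 100 and L = 30 Morgans are assumptions chosen for the arithmetic, not figures from the talk, and the function name is hypothetical.

```python
def expected_segments(ne: float, genome_morgans: float) -> float:
    """E(Me) = 4 * Ne * L (Stam, 1980)."""
    return 4.0 * ne * genome_morgans

# Illustrative inputs only: Ne = 100 and L = 30 Morgans are assumed values.
me = expected_segments(100, 30)      # 12000.0
snp_needed = 12 * me                 # "12 Me SNPs" rule (MacLeod et al., 2005)
```

Under these assumed inputs the expected number of segments is 12,000, so the 12 Me rule would call for 144,000 SNP.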

9 Independent SNP clusters
BV: $u = Za$ (a: SNP effects)
Singular value decomposition: $Z = U \Delta V$, with $U'U = I$, $VV' = I$, $\Delta$ diagonal
Genomic relationship matrix: $G = U \Delta\Delta U' = U D U'$
  $\mathrm{rank}(G) \le \min(\#\text{SNP}, \#\text{animals})$
SNP BLUP design matrix: $Z'Z = V' \Delta\Delta V$
  $\mathrm{rank}(Z'Z) \le \min(\#\text{SNP}, \#\text{animals})$
$u = U \Delta V a \approx U \Delta_s V_s a$ (original SNP, then SNP clusters)
Rank (genomic relationship matrix, SNP BLUP design matrix) ≤ min(number of SNP, number of animals, number of independent chromosome segments)
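
The shared spectrum of $G$ and $Z'Z$ can be illustrated with NumPy's SVD, which returns the factors in the same $Z = U \Delta V$ layout used on the slide. The genotype matrix below is random toy data and its dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical genotype matrix: 12 animals x 40 SNP (more SNP than animals).
Z = rng.integers(0, 3, size=(12, 40)).astype(float)
Z -= Z.mean(axis=0)

# Thin SVD: Z = U diag(delta) V, with U'U = I and V V' = I.
U, delta, V = np.linalg.svd(Z, full_matrices=False)

G = Z @ Z.T          # genomic relationship matrix (unscaled)
ZtZ = Z.T @ Z        # SNP BLUP design matrix

# Same nonzero spectrum: G = U D U' and Z'Z = V' D V with D = delta**2.
G_from_svd = (U * delta**2) @ U.T
ZtZ_from_svd = (V.T * delta**2) @ V

rank_G = np.linalg.matrix_rank(G)
rank_ZtZ = np.linalg.matrix_rank(ZtZ)
```

Although `ZtZ` is 40 x 40, its rank cannot exceed the number of animals, which is the point made by the rank bounds on the slide.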

10 Recursions with limited dimensionality of genomic information
Breeding value: $u = Ta + e$
  e: very small error
  a: effects of independent SNP clusters or independent chromosome segments
Assume 10,000 independent segments and 10,000 randomly chosen core animals ($u_c$):
  $a \approx T_c^{-1} u_c$
  All additive information in any 10,000 animals
Questions
  Chromosome segments function of Ne
  SNP clusters determined by distribution of eigenvalues
  Is number of SNP clusters function of Ne?
  How to determine dimensionality of G?
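
The claim that any set of core animals as large as the number of segments carries all additive information can be illustrated with a small simulation. The sketch assumes e = 0, a random matrix T, and 10 segments in place of 10,000; all of these are simplifications made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

n_animals, n_segments = 200, 10                     # hypothetical sizes
T = rng.standard_normal((n_animals, n_segments))    # animals x segments
a = rng.standard_normal(n_segments)                 # segment effects
u = T @ a                                           # breeding values, e = 0 here

# Randomly chosen core animals, as many as there are segments.
core = rng.choice(n_animals, size=n_segments, replace=False)
T_c, u_c = T[core], u[core]

a_hat = np.linalg.solve(T_c, u_c)    # a = T_c^{-1} u_c
u_hat = T @ a_hat                    # breeding values of all animals from core only
max_abs_error = np.max(np.abs(u_hat - u))
```

With a nonzero error term the recovery would be approximate rather than exact, which corresponds to the "very small error" on the slide.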

11 Fraction of G variance explained
$G = U D U'$
$$\frac{\sum_{i=1}^{n} d_i}{\sum_{i} d_i} \times 100\%$$
  n: number of largest eigenvalues
What fraction of variance in G is information and what is noise?
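
The ratio above translates directly into a count of eigenvalues needed for a target fraction. The helper below is a plain-Python sketch; the function name and the eigenvalues are made up for illustration.

```python
def n_eigenvalues_for(eigenvalues, fraction):
    """Smallest n such that sum(d[1:n]) / sum(d) >= fraction."""
    d = sorted(eigenvalues, reverse=True)
    total = sum(d)
    running = 0.0
    for n, value in enumerate(d, start=1):
        running += value
        if running >= fraction * total:
            return n
    return len(d)

# Hypothetical eigenvalues of a small G, for illustration only.
d = [50.0, 25.0, 12.0, 6.0, 4.0, 2.0, 1.0]
counts = {f: n_eigenvalues_for(d, f) for f in (0.90, 0.95, 0.98)}
```

For these made-up eigenvalues the cumulative sums are 50, 75, 87, 93, 97, 99, 100, so 4, 5 and 6 eigenvalues are needed for 90%, 95% and 98%.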

12 [Figure labels: 4NeL, 99%, 98%, 2NeL, 95%, NeL, 90%]

14 True accuracies as a function of the number of eigenvalues corresponding to a given explained variance in G
[Figure: panels for Ne=20 and Ne=200; axis: approximate number of animals / segments, marked at NeL, 2NeL, 4NeL]
Accuracies maximized by 98% information in G, 95% almost as good
Last 2% of information in G is noise
Real populations study (Pocrnic et al., 2016)
  Dairy, beef, pigs, broilers

15 Number of eigenvalues in G to explain a given fraction of variability
[Figure 1: number of eigenvalues against fraction of explained variance in G; series: Holstein (77k animals), Angus (81k animals), Pigs (13k animals), Broilers (18k animals)]
[Figure 2: number of eigenvalues against fraction of explained variance; series: Ne=160, Holstein, Ne=80, Angus, Ne=40, Pig, Chicken]

16 Realized reliability ($R^2$) for 3 traits in Jerseys
[Figure: panels Jersey Milk 75K, Jersey Fat 75K, Jersey Protein 75K]
Questions
  Distribution of chromosome segments
  Dimensionality of G with SNP selection
  Crossbreeding and APY

17 Unlimited GREML with APY?
Conclusions
  Discovery of rules to invert the genomic relationship matrix
  Based on limited dimensionality of genomic information
  With single-step and APY, genomic selection is perhaps a mature methodology

18 ssGBLUP for Genome Wide Association Studies
Large research interest in GWAS
Limitations of Bayesian methods
$G = ZZ'$: unweighted genomic relationships
$G = ZDZ'$: weighted G
$R^2$ in dairy, 1400 genotyped animals (Lourenco et al., 2012)
[Table: rows BLUP, ssGBLUP 1, ssGBLUP 2, ssGBLUP 3; columns Milk, Fat%, Prot%]
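
The two forms of G differ only in the diagonal weight matrix D. A minimal sketch follows, assuming toy genotypes and arbitrary positive weights; how the weights are derived in ssGBLUP is not shown here, and rescaling them to average 1 is a normalization assumed for this example.

```python
import numpy as np

rng = np.random.default_rng(4)

Z = rng.integers(0, 3, size=(6, 15)).astype(float)   # hypothetical genotypes
Z -= Z.mean(axis=0)

def genomic_relationships(Z, weights=None):
    """Return Z D Z' with D = diag(weights); weights=None gives Z Z'."""
    if weights is None:
        weights = np.ones(Z.shape[1])
    return (Z * weights) @ Z.T

G_unweighted = genomic_relationships(Z)

# Hypothetical SNP weights, rescaled to average 1 (an assumed normalization).
w = rng.uniform(0.5, 2.0, size=Z.shape[1])
w *= len(w) / w.sum()
G_weighted = genomic_relationships(Z, w)
```

Multiplying the columns of Z by the weights avoids forming the SNP-by-SNP matrix D explicitly.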

19 Correlations between QTLs and clusters of SNP effects - simulation
[Figure: series ssGWAS/1, Single SNP, BayesB; axis: SNP cluster size]
Wang et al., 2012
Hassani et al.,

20 GWAS resolution and Ne
Manhattan plots from different methods differ unless averaged over ~1 Mb segments (IA State studies, +++)
  10 Mbase in chicken (Hawken et al., 2015)
Best accuracy with weights on blocks of 30 SNP or 3 Mb (Su et al., 2014)
620k independent SNP in human population with Ne= k independent SNP in animal population with Ne=100?
Hypothesis: GWAS resolution is a function of Ne
Response to QTL in GWAS - hypothesis
[Figure labels: QTL; Noise; ~1/Ne M; Expected response in SNP BLUP; Response by BayesB; Response by BayesB 2]

21 Comparison of Three Methods (broilers)
  ssGBLUP, iterations on SNP (it3): 2.5%
  Classical GWAS: 0.8%
  BayesB: 23%
  Wang et al., 2012
Plots and accuracies in Zhang et al. (2010) simulation
  BayesB: Acc 0.83
  Weighted RRGBLUP: Acc

22 GWAS findings / BayesB
With SNP weighting, small or no improvement in accuracy of GEBV with a large number of genotypes
  What is wrong: BayesB or UGA?
  Why large SNP effects with a small but not a big number of genotyped animals?
SNP BLUP: $y = \mu + Za + e$
Equivalent model: $y = \mu + \mathrm{DGV} + e$
$\mathrm{GEBV} = w_1 \mathrm{PA} + w_2 \mathrm{DGV} - w_3 \mathrm{PI}$ (VanRaden, 2008)
  PI: parental index (fraction of PA in DGV)
$\mathrm{DGV} = \mathrm{PI} + (\mathrm{DGV} - \mathrm{PI}) = \mathrm{PI} + \mathrm{QBV}$
  QBV based on QTLs alone
With large data sets: GEBV = PI + QBV
  Few or no large QTL for polygenic traits
With small data sets: GEBV = PI + QBV
  Peaks in GWAS mainly due to relationships
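
The equivalence between the SNP BLUP model and the model on DGV can be verified numerically. The sketch below makes several simplifying assumptions that are not from the talk: simulated genotypes and phenotypes, the mean handled by centering y, $G = ZZ'$ without scaling, and an arbitrary variance ratio `lam`.

```python
import numpy as np

rng = np.random.default_rng(5)

n_animals, n_snp = 40, 100           # hypothetical sizes
Z = rng.integers(0, 3, size=(n_animals, n_snp)).astype(float)
Z -= Z.mean(axis=0)
a_true = rng.normal(0.0, 0.1, n_snp)
y = Z @ a_true + rng.normal(0.0, 1.0, n_animals)
y -= y.mean()                        # mu handled by centering (a simplification)

lam = 10.0                           # assumed variance ratio sigma_e^2 / sigma_a^2

# SNP BLUP: solve for SNP effects, then sum them into DGV.
a_hat = np.linalg.solve(Z.T @ Z + lam * np.eye(n_snp), Z.T @ y)
dgv_snp = Z @ a_hat

# Equivalent model with G = Z Z': solve for DGV directly.
G = Z @ Z.T
dgv_gblup = G @ np.linalg.solve(G + lam * np.eye(n_animals), y)

max_abs_diff = np.max(np.abs(dgv_snp - dgv_gblup))
```

The SNP model solves a system of size 100 here and the DGV model one of size 40; which side is cheaper depends on whether SNP or genotyped animals are more numerous.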

23 Segments and QTLs in different genotyped populations
[Figure labels: Sire 1, Sire 2; QTL 1, QTL 2, QTL 3; population sizes: Medium size, Small size, Large size]
Hypothetical responses in GWAS
[Figure labels: Small size; Response 1; Response 2]

24 Sequence analysis and genetic evaluation
Resolution of GWAS insufficient to identify causative SNP (?)
Can use causative SNP if they are identified by different methods (Brøndum et al., 2014)
Do large causative SNP exist?
  Can large QTL exist despite selection?
Genetics and genomics of mortality in US Holsteins (Tokuhisa et al., 2014; Tsuruta et al., 2014)
  6M records, SNP50k genotypes of 35k bulls

25 [Figure: panels "Milk first parity" and "Mortality first parity"]

26 Conclusions
  GWAS resolution is possibly a function of effective population size
  GWAS with small genotyped populations possibly captures relationships, not QTL
Acknowledgements
  Grants from Holsteins Assoc., Angus Assoc., Cobb-Vantress, Zoetis, Smithfield, PIC
  AFRI grant from USDA NIFA
Collaborators
  Shogo Tsuruta, Ignacio Aguilar, Breno Fragomeni, Ivan Pocrnic, Daniela Lourenco, Yutaka Masuda, Andres Legarra