Symposium on Frontiers in Big Data September 23, 2016 University of Illinois, Urbana-Champaign. Robert J.Tempelman Michigan State University

Similar documents
Practical integration of genomic selection in dairy cattle breeding schemes

Strategy for applying genome-wide selection in dairy cattle

Accuracy of whole genome prediction using a genetic architecture enhanced variance-covariance matrix

Fundamentals of Genomic Selection

BLUP and Genomic Selection

Genomic models in bayz

Genomic selection and its potential to change cattle breeding

DESIGNS FOR QTL DETECTION IN LIVESTOCK AND THEIR IMPLICATIONS FOR MAS

General aspects of genome-wide association studies

A strategy for multiple linkage disequilibrium mapping methods to validate additive QTL. Abstracts

What could the pig sector learn from the cattle sector. Nordic breeding evaluation in dairy cattle

Comparison of genomic predictions using genomic relationship matrices built with different weighting factors to account for locus-specific variances

Genomic prediction for Nordic Red Cattle using one-step and selection index blending

Impact of Genotype Imputation on the Performance of GBLUP and Bayesian Methods for Genomic Prediction

Genomic selection +Sexed semen +Precision farming = New deal for dairy farmers?

TSB Collaborative Research: Utilising i sequence data and genomics to improve novel carcass traits in beef cattle

Question. In the last 100 years. What is Feed Efficiency? Genetics of Feed Efficiency and Applications for the Dairy Industry

Genome-wide association mapping using single-step GBLUP!!!!

TAKE HOME MESSAGES Illinois Parameter < 18,000 18,000 22,000 > 22,000

POPULATION GENETICS Winter 2005 Lecture 18 Quantitative genetics and QTL mapping

Genomic Selection in R

Daniela Lourenco, Un. of Georgia 6/1/17. Pedigree. Individual 1. Individual 2

Genetic evaluation using single-step genomic best linear unbiased predictor in American Angus 1

11/30/11. Lb output/beef cow. Corn Yields. Corn and Wheat

The development of breeding strategies for the large scale commercial dairy sector in Zimbabwe

Genomic Selection Using Low-Density Marker Panels

Implementation of Genomic Selection in Pig Breeding Programs

Data Mining and Applications in Genomics

Operation of a genomic selection service in Ireland

Reliabilities of genomic prediction using combined reference data of the Nordic Red dairy cattle populations

Establishment of a Single National Selection Index for Canada

Genomic Selection using low-density marker panels

Diagnostic Screening Without Cutoffs

Improving the Accuracy of Base Calls and Error Predictions for GS 20 DNA Sequence Data

Understand biotechnology in livestock animals. Objective 5.04

Revisiting the a posteriori granddaughter design

Genetic Analysis of Cow Survival in the Israeli Dairy Cattle Population

Genome-Wide Association Studies (GWAS): Computational Them

Statistical analyses of milk traits in the New Zealand dairy cattle population

Conifer Translational Genomics Network Coordinated Agricultural Project

Introduction to Indexes

Big Data, Science and Cow Improvement: The Power of Information!

Optimization of dairy cattle breeding programs using genomic selection

Edinburgh Research Explorer

Molecular markers in plant breeding

Herd Summary Definitions

Prediction of clinical mastitis outcomes within and between environments using whole-genome markers

Graphical Modelling in Genetics and Systems Biology

The 150+ Tomato Genome (re-)sequence Project; Lessons Learned and Potential

MSc specialization Animal Breeding and Genetics at Wageningen University

Accuracy and Training Population Design for Genomic Selection on Quantitative Traits in Elite North American Oats

Melding genomics and quantitative genetics in sheep breeding programs: opportunities and limits

Chapter 20 Biotechnology and Animal Breeding

Introduction. Veterinary World, EISSN: Available at RESEARCH ARTICLE Open Access

Improving fertility through management and genetics.

French report - Interbeef WG 20th and 21st June

Genomics and resilience

Senegal Dairy Genetics / Sénégal Génétique Laitière

Philippine Dairy Buffalo Breeding Program

Speeding up discovery in plant genetics and breeding

Improving Genetics in the Suckler Herd by Noirin McHugh & Mark McGee

RECOLAD. Introduction to the «atelier 1» Genetic approaches to improve adaptation to climate change in livestock

MAS refers to the use of DNA markers that are tightly-linked to target loci as a substitute for or to assist phenotypic screening.

By the end of this lecture you should be able to explain: Some of the principles underlying the statistical analysis of QTLs

Use of genomic tests and sexed semen increase genetic level within herd

Efficiency of Different Selection Indices for Desired Gain in Reproduction and Production Traits in Hariana Cattle

Estimation of Environmental Sensitivity of Genetic Merit for Milk Production Traits Using a Random Regression Model

Genomic prediction. Kevin Byskov, Ulrik Sander Nielsen and Gert Pedersen Aamand. Nordisk Avlsværdi Vurdering. Nordic Cattle Genetic Evaluation

Management Basics for Beef Markets. Bethany Funnell, DVM Purdue College of Veterinary Medicine

TECHNICAL BULLETIN December 2017

Experiences in the use of genomics in ruminants in Uruguay

Association Mapping in Plants PLSC 731 Plant Molecular Genetics Phil McClean April, 2010

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University

An introductory overview of the current state of statistical genetics

Invited Review. Value of genomics in breeding objectives for beef cattle

IT-Solutions for Animal Production. 26 May 2014 Page 1

Estimating Effects and Making Predictions from Genome-Wide Marker Data

Characterization of Allele-Specific Copy Number in Tumor Genomes

Imputation-Based Analysis of Association Studies: Candidate Regions and Quantitative Traits

Revolutionizing genetics in the cattle industry. Extensive toolset. Proven technology. Collaborative approach.

Addressing the Concerns of the Lacks Family: Quantification of Kin Genomic Privacy

Concepts of Genetics Ninth Edition Klug, Cummings, Spencer, Palladino

Linking Genetic Variation to Important Phenotypes

Genomic analyses with emphasis on single-step

1999 American Agricultural Economics Association Annual Meeting

A journey: opportunities & challenges of melding genomics into U.S. sheep breeding programs

france, genetic selection & technology-driven innovation

Proceedings, The Range Beef Cow Symposium XXIV November 17, 18 and 19, 2015, Loveland, Colorado HOW TO USE COMMERCIALLY AVAILABLE GENOMIC PREDICTIONS

Evaluation of genomic selection for replacement strategies using selection index theory

This project has received funding from the European Union s Horizon 2020 research and innovation program under Grant Agreement No

Slide 1. Slide 2. Slide 3. Current Status of the Canadian Animal Genetic Resources Program (AAFC-CAGR) Yves Plante Research Scientist

Understanding and Using Expected Progeny Differences (EPDs)

Compression distance can discriminate animals by genetic profile, build relationship matrices and estimate breeding values

Sequencing Millions of Animals for Genomic Selection 2.0

Beef Cattle Genomics: Promises from the Past, Looking to the Future

User s Guide Version 1.10 (Last update: July 12, 2013)

Association mapping of Sclerotinia stalk rot resistance in domesticated sunflower plant introductions

Transcription:

Symposium on Frontiers in Big Data September 3, 016 University of Illinois, Urbana-Champaign Robert J.Tempelman Michigan State University *See also December 015 special issue of Journal of Agricultural, Biological, and Environmental Statistics 1

Who am I? I am proud to be known as a animal breeder or animal breeding scientist That may conjure up some romantic images But really one of the first big data agricultural scientists. Also seminal users of mixed model analyses.

Mixed models Charles R. Henderson (1911-1989) Helped developed mixed model analyses for animal breeders before they became more widely popular later. With O. Kempthorne, S. R. Searle, and C. M. von Krosigk. The estimation of environmental and genetic trends from records subject to culling. 1959, Biometric 15:19-18. y nx1 = Xβ + Wu qx1 + e data fixed effects genetic (polygenic) residual (e.g. age herd) effects Mixed Model Equations (MME) u ~ N(0,Aσ u) XX ' XZ ' βˆ ' e ~ N(0,Iσ 1 = Xy e) ZX ' ZZ ' + A λ uˆ Zy ' A: genetic relationships σ e λ = 3 (pedigree/kinship) σ u

Resource constrained computing (in the good old days) My Masters Thesis (University of Guelph) 1986-1988. Maximum memory request on Guelph IBM mainframe computer was (wait for it!): 8MB q > *13,7 (additive and dominance genetic effects for each of 13,7 Holstein cattle including ancestors, n = 8,39 cows with phenotypes >MME: 180 MB (Full-store) without fixed effects! Solution?.. sparse matrix storage techniques MME in animal breeding are > 99% zeroes (whew!)..so only save non-zeroes Thank you Karin Meyer! (DFREML) 4

Mixed models and animal breeding Mixed Models on Steroids. A subtle q > n problem: (q = number of animal effects, n = number of records) Let s go back in time (1988): already both q and n > 10M cattle back then for USDA Holstein national genetic evaluation (Wiggans, 1988) Hmm even sparse matrix software can t save you there! Solution? Save MME on hard disk; use Gauss-Seidel (now precondition conjugate gradient) iteration on data. Now (015) q > 70M cattle, n > 130M or > 650M depending on definition of phenotypes. 5

From Brown and Kass (009) What is statistics? The American Statistician (009) 63: 105-110 Physicists and engineers very often become immersed in the subject matter. In particular, they work hand in hand with neuroscientists and often become experimentalists themselves. Furthermore, engineers (and likewise computer scientists) are ambitious; when faced with problems, they tend to attack, sweeping aside impediments stemming from limited knowledge about the procedures that they apply This has been the culture of animal breeding (for better or for worse) Field of study is important.passion even more so. 6

It s not always about obtaining the biggest servers! Algorithm development! Example 1: MME requires A -1 not A. XX ' XZ ' βˆ ' 1 = Xy ZX ' ZZ ' + A λ uˆ Zy ' Inverting A is not possible. Henderson (again!) developed rules for computing A -1 computations linear in q! (numerically stable/sparse) Example : Estimating variance components σ λ = σ REML. Standard algorithms in canned statistical packages unworkable AI(Average Information)-REML is based on hybrid Newton- Raphson/Fisher scoring algorithm that exploits sparsity of MME. (Gilmour et al., 1985; Biometrics 51: 1440) e u 7

Genetic improvement of livestock Has led to dramatic changes Since 1963, milk production / cow has doubled >50% of that due to genetic trend (Garcia et al., 016; PNAS 113:E3995) Historically: Artificial insemination with frozen semen, embryo transfer/ estrus synchronization Widespread global exchange of germplasm Whole genome prediction (WGP) based on the use of 10s/100s of thousands high density single nucleotide polymorphism (SNP) marker genotypes on each cow has further increased selection intensity Impact on higher accuracy of genetic merit and lower generation interval Selection response should accelerate.

Largest genomic databases (courtesy, Paul Van Raden, USDA-AGIL; August, 016) Ancestry.com 3andMe CDCB/USDA Genotypes > million >1 million 1.4 million Species Human Human Cattle Countries? >55 53 Genotyping cost $99 $199 $37 135 Delivery (weeks) 6 8 6 8 1 DNA generations Few Few >10 EBV reliability NA Low High Reference: Web sites: http://genomemag.com/davies-3andme/#.vdy7zosy1 https://www.3andme.com/ http://dna.ancestry.com/ https://www.cdcb.us/ http://aipl.arsusda.gov/main/site_main.htm

Genetic trend for Dairy Net Merit $ NM$ 800 600 400 00 All Bulls AI bulls Cows 0-00 -400-600 -800 1980 1984 1988 199 1996 000 004 008 01 Birth year Courtesy: USDA-AGIL (Animal Genomics Improvement Laboratory) Introduction of WGS

Evolution of SNP marker panels in cattle breeding Courtesy: USDA-AGIL (Animal Genomics Improvement Laboratory) 11

Whole Genome Prediction (WGP) Polygenic Effects ( optional ) n { ui} N( σ u ) u= ~ 0, A 1 i= Model: y nx1 = Xβ + Zg + Wu qx1 + e Animal SNP allelic substitution effects: ' z 1 ' z Z = ' z q Genotypes Genotypes Z Matrix of genotypes LD (linkage disequilibrium) Phenotype g [ g g g g ] = 1 3 m ' [ z z z z ] ' z i = i 1 i i 3 im values = 0, 1 or (# of copies of reference allele) y nx1 Typical distributional assumption n { g j} N( σ g) g= ~ 0, I j 1 = LD generates signal dependence (Chen and Storey, 006) multicollinearity

Big Data keeps getting bigger! m > 50,000 SNP markers (some imputation from lower density panels) q > 70M animals (>1.4 M of which are genotyped so how do you deal with that?...see later) n > 130M records. Most research studies involve far smaller q and n. Greater recognition that if most SNP markers are NOT in tight LD with genes (QTL), then normality assumption is tenuous. With m and/or q for same n, effective sample size actually. big data can be a misnomer. 13

Some alternative candidate priors for g (Meuwissen et al., 001; de los Campos et al., 013) p(g j ) rrblup ( ) g j ~ N 0, σ g g j BayesB p(g j ) More flexible priors generally require a Bayesian approach BayesA g j SSVS g Heavy tails j ~ t( νg, σg ) ( σ gφ j ) ( νg νg ) g j ~ N 0, φ j ~ χ, p(g j ) g Variable selection g j j t( νg, σg ) with prob π ~ 0 with prob ( 1-π ) Bayesian alphabet models (Gianola) φ j : augmented variables (computational convenience) p(g j ) Variable selection g j σ g j ~ 1 N 0, N 0, c φ j ~ Bernoulli ( π ) g ( φ j) + φ j ( σ g) c >>1 hyperparameters should be estimated..they define genetic architecture 14

Bayesian analyses More ambitious priors typically require use of MCMC (Markov Chain Monte Carlo) methods. Sample from the posterior distribution of all unknowns. Don t be at the mercy of the prior! (Dan Gianola) estimate hyperparameters Research reproducibility issue: Many researchers (animal breeders and others) may not draw enough MCMC samples from the posterior density Autocorrelated draws Issue potentially more dramatic for computing Bayesian credible bounds than, say, posterior means. A much greater issue also for denser SNP chips (or m >> n). Ongoing research to improve mixing ( autocorrelation) 15

Exact inference and computational efficiency MCMC: Real-time posterior densities (updates) very difficult to get with big data no memory from previous analyses need to rerun Analytical approximations Expectation-maximization (EM) often referred to as big data Bayes (Allenby et al., 014) However, extreme sensitive to starting values (Chen and Tempelman, 015) especially with large m relative to n (multimodal joint posterior densities more likely) Normal prior on g: Pragmatic alternative for large national genetic evaluations. Yet limited effectiveness for genome wide association studies relative to more flexible priors. 16

Reducing dimensionality with equivalent models (if q<<m) Assume q = n (one record per animal) SNP effects model (rrblup) y = Xβ + Zg + e Genomic animal effects model (GBLUP) y = Xβ + u + e m { } ( g ) j N σ g g= ~ 0, I = 1 j uqx 1 = Zg u ~ N 0, ZZ' σ g ( ) G = ZZ is the genomic relationship matrix (need its inverse for MME for GBLUP). Can easily backsolve for g from u in GBLUP (Stranden 17 and Garrick, 008).

The latest and greatest Using biological information having a disciplinary passion is useful for a data scientist! Assign higher prior probability to SNP in coding or regulatory genomic regions (BayesRC; McLeod et al., 016 BMC Genomics 17:144) Addressing industry concerns: Combining data on genotyped and non-genotyped animals. single step procedures (Aguilar et al., 010) based on blending G (on genotyped animals) with A (on nongenotyped animals) APY (Mizstal, 016 Genetics 0:401) based on sparsifying G -1 numerically stable! 18

Worries keep coming 015 # of SNP markers increasing (10s of millions in sequencing). Because of high LD, even greater multicollinearity creating instability, especially in MCMC inference. 19

Big data and sustainability management systems & environments are changing more rapidly than animal populations can adapt to such changes through natural selection (Hohenboken et al., 005)..e.g. Energy policy (corn distiller s grain) More intensive management (larger farms) Climate change What are the implications for genetic improvement of livestock? How should we prepare as statistical geneticists? What kind of direction should we provide?

Scope of inference and livestock breeding Scope of inference What is this in context of animal breeding? Broad scope: Inferring upon average additive genetic merit for animals across all environments. Often is the primary focus Narrow scope: Inferring upon (eventually adapting) genetic merit for animals within a specific environment. To accommodate both objectives need to create SNP*E terms Curse of dimensionality intensifies!

Big Data in dairying.

Funding by Agriculture and Food Research Initiative Competitive Grants # 011-67015-30338 and 011-68004- 30340 is gratefully acknowledged. 3