General aspects of genome-wide association studies

Similar documents
Lecture 6: GWAS in Samples with Structure. Summer Institute in Statistical Genetics 2015

Understanding genetic association studies. Peter Kamerman

QTL Mapping, MAS, and Genomic Selection

H3A - Genome-Wide Association testing SOP

Introduction to Quantitative Genomics / Genetics

QTL Mapping Using Multiple Markers Simultaneously

S G. Design and Analysis of Genetic Association Studies. ection. tatistical. enetics

BST227 Introduction to Statistical Genetics. Lecture 3: Introduction to population genetics

Why do we need statistics to study genetics and evolution?

BST227 Introduction to Statistical Genetics. Lecture 3: Introduction to population genetics

EFFICIENT DESIGNS FOR FINE-MAPPING OF QUANTITATIVE TRAIT LOCI USING LINKAGE DISEQUILIBRIUM AND LINKAGE

DNA Collection. Data Quality Control. Whole Genome Amplification. Whole Genome Amplification. Measure DNA concentrations. Pros

Genome-wide association studies (GWAS) Part 1

GENOME WIDE ASSOCIATION STUDY OF INSECT BITE HYPERSENSITIVITY IN TWO POPULATION OF ICELANDIC HORSES

CS273B: Deep Learning in Genomics and Biomedicine. Recitation 1 30/9/2016

Genomic Selection in Dairy Cattle

Linkage Disequilibrium

Should the Markers on X Chromosome be Used for Genomic Prediction?

Genomic Selection with Linear Models and Rank Aggregation

A Primer of Ecological Genetics

Application of Whole-Genome Prediction Methods for Genome-Wide Association Studies: A Bayesian Approach

QTL Mapping, MAS, and Genomic Selection

Population stratification. Background & PLINK practical

Multi-SNP Models for Fine-Mapping Studies: Application to an. Kallikrein Region and Prostate Cancer

THE HEALTH AND RETIREMENT STUDY: GENETIC DATA UPDATE

BMC Proceedings. Charalampos Papachristou 1,2*, Carole Ober 3 and Mark Abney 3. From Genetic Analysis Workshop 19 Vienna, Austria.

GenSap Meeting, June 13-14, Aarhus. Genomic Selection with QTL information Didier Boichard

Strategy for Applying Genome-Wide Selection in Dairy Cattle

. Open access under CC BY-NC-ND license.

Papers for 11 September

Lecture 3: Introduction to the PLINK Software. Summer Institute in Statistical Genetics 2015

Lecture 3: Introduction to the PLINK Software. Summer Institute in Statistical Genetics 2017

Genetics Effective Use of New and Existing Methods

Introduction to Add Health GWAS Data Part I. Christy Avery Department of Epidemiology University of North Carolina at Chapel Hill

Genomic Estimated Breeding Values Using Genomic Relationship Matrices in a Cloned. Population of Loblolly Pine. Fikret Isik*

Genomic prediction for numerically small breeds, using models with pre selected and differentially weighted markers

Summary for BIOSTAT/STAT551 Statistical Genetics II: Quantitative Traits

Human Genetics and Gene Mapping of Complex Traits

Genome-wide association and pathway inference for Residual Feed Intake (RFI) in Duroc boars

High-density SNP Genotyping Analysis of Broiler Breeding Lines

Hardy-Weinberg Principle

5/18/2017. Genotypic, phenotypic or allelic frequencies each sum to 1. Changes in allele frequencies determine gene pool composition over generations

Genome-Wide Association Studies (GWAS): Computational Them

SYLLABUS AND SAMPLE QUESTIONS FOR JRF IN BIOLOGICAL ANTHROPOLGY 2011

Single step genomic evaluations for the Nordic Red Dairy cattle test day data

The effect of genomic information on optimal contribution selection in livestock breeding programs

B I O I N F O R M A T I C S

DESIGNS FOR QTL DETECTION IN LIVESTOCK AND THEIR IMPLICATIONS FOR MAS

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Fundamentals of Genomic Selection

Across-breed genomic prediction in dairy cattle

Comparison of genomic predictions using genomic relationship matrices built with different weighting factors to account for locus-specific variances

Allele frequency changes due to hitch-hiking in genomic selection programs

1b. How do people differ genetically?

Pedigree transmission disequilibrium test for quantitative traits in farm animals

Managing genetic groups in single-step genomic evaluations applied on female fertility traits in Nordic Red Dairy cattle

Human Genetics and Gene Mapping of Complex Traits

heritability problem Krishna Kumar et al. (1) claim that GCTA applied to current SNP

MAS using a dense SNP markers map: Application to the Normande and Montbéliarde breeds

b. (3 points) The expected frequencies of each blood type in the deme if mating is random with respect to variation at this locus.

Association Mapping in Plants PLSC 731 Plant Molecular Genetics Phil McClean April, 2010

Variation Chapter 9 10/6/2014. Some terms. Variation in phenotype can be due to genes AND environment: Is variation genetic, environmental, or both?

Improving the power of GWAS and avoiding confounding from. population stratification with PC-Select

Topics in Statistical Genetics

Genetic Variation and Genome- Wide Association Studies. Keyan Salari, MD/PhD Candidate Department of Genetics

The promise of genomics for animal improvement. Daniela Lourenco

Introduction to Animal Breeding & Genomics

Introduction to Quantitative Genetics

POPULATION GENETICS Winter 2005 Lecture 18 Quantitative genetics and QTL mapping

Genetics and Bioinformatics

Lecture 6: Association Mapping. Michael Gore lecture notes Tucson Winter Institute version 18 Jan 2013

Including α s1 casein gene information in genomic evaluations of French dairy goats

Familial Breast Cancer

Appendix 5: Details of statistical methods in the CRP CHD Genetics Collaboration (CCGC) [posted as supplied by

S SG. Metabolomics meets Genomics. Hemant K. Tiwari, Ph.D. Professor and Head. Metabolomics: Bench to Bedside. ection ON tatistical.

16.2 Evolution as Genetic Change

Genomic models in bayz

What could the pig sector learn from the cattle sector. Nordic breeding evaluation in dairy cattle

Practical integration of genomic selection in dairy cattle breeding schemes

Impact of QTL minor allele frequency on genomic evaluation using real genotype data and simulated phenotypes in Japanese Black cattle

A genome wide association study of metabolic traits in human urine

Monday, November 8 Shantz 242 E (the usual place) 5:00-7:00 PM

Modern Genetic Evaluation Procedures Why BLUP?

SUPPLEMENTARY METHODS AND RESULTS

Analysis of genome-wide genotype data

Breeding on polled genetics in Holsteins - chances and limitations

Genomic Selection Using Low-Density Marker Panels

Genomic Selection using low-density marker panels

Modeling & Simulation in pharmacogenetics/personalised medicine

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Prostate Cancer Genetics: Today and tomorrow

Supplementary Note: Detecting population structure in rare variant data

Traditional Genetic Improvement. Genetic variation is due to differences in DNA sequence. Adding DNA sequence data to traditional breeding.

An Introduction to Population Genetics

The Accuracy of Genomic Selection in Norwegian Red. Cattle Assessed by Cross Validation

Polygenic Influences on Boys & Girls Pubertal Timing & Tempo. Gregor Horvath, Valerie Knopik, Kristine Marceau Purdue University

Genetic data concepts and tests

Genome wide association studies. How do we know there is genetics involved in the disease susceptibility?

arxiv: v1 [stat.ap] 31 Jul 2014

Transcription:

General aspects of genome-wide association studies Abstract number 20201 Session 04 Correctly reporting statistical genetics results in the genomic era Pekka Uimari University of Helsinki Dept. of Agricultural Sciences 29.11.2013 1

Quality check of the SNPs Call rate Low call rate indicates genotyping problems All SNPs with CR < 90% are usually excluded Minor allele frequency Eliminates non-polymorphic SNPs MAF limit of 1% or 5% are commonly used Number of animals in each genotyping class Hardy-Weinberg equilibrium May indicate genotyping problems Other causes are selection, recent mutation, random drift, small population size etc. Some variation in practice exist e.g. P- value limit from 0.05 to 10-6 CR, MAF and HWE of the best SNPs (the smallest P-value in the GWA) are checked with more detail Also Illumina quality control parameter can be used Illumina Beadstudio allele calling scatter plots 2

Quality check of the samples Call rate Commonly used limit for call rate is 90% Duplication IBS/IBD is appr. 100% Know relatedness, parentage test IBS/IBD estimation Parent-offspring: proportion of SNPs with IBD=1 should be close to 100% depending on the level of inbreeding Full-sibs expectation is IBD=0 25% IBD=1 50% IBD=2 25% Population test 3

Population structure Sample structure can be studies using e.g. PLINK multidimensional scaling, Eigenstrat by Patterson, PLOS 2006 or other methods and software Population clustering Finnish Landrace 0.2 0.15 0.1 Crossbred 0.05 Finnish Yorkshire 0-0.1-0.05 0 0.05 0.1 0.15-0.05-0.1 Dept. of Agricultural Sciences 4

Imputation of genotypes Estimates missing genotypes, the most probably is usually used as an imputed genotype Increases power of association analysis Is done simultaneously together with haplotyping In animal genetics the most commonly used software seems to be Beagle (Browning and Browning, 2009) Faster than for example older fastphase Imputation error is usually very low appr. 2-3% 5

Choice of Animals Purebred animals commonly used Genetic homogeneity Usually high LD For populations with high LD less markers are needed compared to populations with low LD Synthetic breeds / breed crosses Allele and locus heterogeneity 1 0.9 0.8 0.7 0.6 D' / r2 0.5 0.4 0.3 0.2 0.1 0 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 kbp 6

Number of Animals In human genetic studies the number of studied / genotyped individuals runs from 100 1000 10000 or more Effects the power of the study Power of the study design can be estimated before hand but usually several assumptions about the mode of inheritance must be made, thus these estimates are seldom for any use For simple Mendelian traits 10 cases and 10 controls can be enough More realistic picture about the power can be achieved comparing the results of GWAS of similar or bigger sample size Remember that even with small sample sizes you can achieve P-values < 10-7, some of those are false positives 7

Choice of Phenotypes The most common phenotype is based on estimated breeding values of males used in AI (AI-bulls, AI-boars) Possible only for traits that are measured in a national recording scheme (or similar type of data recording from performance testing stations etc.) For other traits observations of the genotyped animals are used Usually EBV s are deregressed prior to GWAS (Garrick et al. 2009) Removes the effect of parents on EBVs Deregression Calculation of weight of observations for GWAS 8

Statistical methods Single SNP model The most commonly used model Test for association one SNP at the time Multiple SNP model Several or all SNPs analysed simultaneously Oversaturation (number of markers >> number of individuals) needs to be handled Selecting a subset of variables Shrinking the estimates towards zero 9

Genomic control (GC) is based on the idea a majority of the markers are not associated with the trait and the test statistics should follow the null hypothesis distribution Q-Q plot Population stratification / relationship between the samples 10

Q-Q plot a) No association no population stratification or relatedness b) No association but indication of population stratification or relatedness c) Evidence for association and population stratification or relatedness d) Evidence for association but no population stratification or relatedness McCarthy et al, Nature Reviews Genetic, 2008, 9:356-369 11

Genomic control If there are indications (based on Q-Q plot) of population stratification or relatedness of samples test statistics can be adjusted using genomic control λ (Devlin & Roeder, 1999, Biometrics 55:997-1004) A simple estimate of λ is the mean of the obtained tests statistics or the median divided by 0.456 (0.456 is the expected median for chi-square distribution with df=1) Pearson and Manolio, 2008, JAMA 299:1335-2150 Dept. of Agricultural Sciences sciences Kuvat: 29.11.2013 17.5.2010 12

Population stratification / relationship between the samples The most commonly used way to control relatedness is to include the pedigree structure into a single marker mixed model Mixed linear model: y i = μ + b*x i + a i + e i, y i is the deregressed EBV x i is the number of minor alleles (0, 1, or 2) of the tested SNP b is the corresponding regression coefficient a i is a random polygenic effect with a i ~ N(0, Aσ 2 a), where A is the additive relationship matrix and σ 2 a is the polygenic variance e i is a random residual effect with e i ~ N(0, Iσ 2 e/w i ), where I is an identity matrix, σ 2 e is the residual variance, and w i is the weight 13

Relationship matrix Based either on pedigree (A) or genotypes (G) Genomic relationship matrix (G) Most commonly used the method presented by VanRaden (2008, J. Dairy Sci. 91, 4414-4423) G = ZZ /k, where k=2 p i (1-p i ) Or weight markers by reciprocals of their expected variance G = ZDZ, where D ii = 1/(mk i ) where k i = 2p i (1-p i ) and m=number of markers 14

P-values Test statistic of the SNP-effect from the mixed linear model: t-test F-test Wald test Squared t-test statistic has an exact F(1,n 1) distribution The Wald statistic can be used to test a simple hypothesis H 0 :θ = θ 0 on the entire parameter vector, θ θ T I θ θ θ has x 2 -distribution with 1 df When n Wald-test with 1 df the square of the t -test statistic 15

Manhattan plot 16

Commonly used programs Variance-covariance estimation packages DMU, ASREML etc Fits mixed model equation and estimates variance components for each SNP takes long time to analyze all markers Use either A or G-matrix Methods and programs that take into account population and family structure (approximations) GenABEL (GRAMMAR), Aulchenko et al, 2007, Genetics EMMAX, Kang et al, 2010, Nature Genetics Tassel, Zhang et al, 2010 Nature Genetics GEMMA, Zhou and Stephens 2012 Nature Genetics Should be faster than DMU and ASREML 17

Multiple SNP model number of markers >> number of individuals Bayesian LASSO y = μ + Xb + u + e X is the matrix on m most correlated marker genotypes b is a vector of marker effects u polygenic effect with G genomic relationship matrix computed from the rest of the markers b j σ j 2 ~N(0, σ j 2 ) σ j 2 λ~exp(λ 2 /2) λ~gamma(κ, ξ) Kärkkäinen & Sillanpää (2012, Genetics) 18

Multiple SNP model number of markers >> number of individuals Heteroscedastic Ridge Regression y = μ + Xb + e X is the matrix on marker genotypes b is a vector of random marker effects First round: Shrinkage factor λ~σ e 2 /σ b 2 Second round: Shrinkage factor λ~σ 2 e /σ 2 bj where σ 2 bj is calculated based on an estimate of a marker effect b j from the first round bigrr R-package (Shen et al. 2013, Genetics) 19

Comparison of the methods Single SNP methods 20

Comparison of the methods Multiple SNP methods 21

Post GWAS (see review by Wojcik et al. 2015 BMC Genetics) Aggregate markers into biologically relevant units, gene or pathway Increase power: combine multiple weak or moderate signals Allow for allelic or locus heterogeneity Gene-level analyses Combines independent signals within a gene Should take LD into account E.g. VEGAS (Liu et al. AJHG 2010) Pathway-level (gene-set) analyses Related collection of genes with similar biological function Assess if strong associations cluster within a gene set compared to genes outside of the pathway (or gene set) 22