Lecture 6: Association Mapping. Michael Gore lecture notes Tucson Winter Institute version 18 Jan 2013

Similar documents
Association Mapping in Plants PLSC 731 Plant Molecular Genetics Phil McClean April, 2010

Genome-Wide Association Studies (GWAS): Computational Them

General aspects of genome-wide association studies

By the end of this lecture you should be able to explain: Some of the principles underlying the statistical analysis of QTLs

Genome-wide association studies (GWAS) Part 1

Supplementary Note: Detecting population structure in rare variant data

ARTICLE ROADTRIPS: Case-Control Association Testing with Partially or Completely Unknown Population and Pedigree Structure

Mapping and Mapping Populations


DNA Collection. Data Quality Control. Whole Genome Amplification. Whole Genome Amplification. Measure DNA concentrations. Pros

Association mapping of Sclerotinia stalk rot resistance in domesticated sunflower plant introductions

Supplementary material

Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies

POPULATION GENETICS Winter 2005 Lecture 18 Quantitative genetics and QTL mapping

A unified mixed-model method for association mapping that accounts for multiple levels of relatedness

Quantitative Genetics

Computational Workflows for Genome-Wide Association Study: I

H3A - Genome-Wide Association testing SOP

Applying Genotyping by Sequencing (GBS) to Corn Genetics and Breeding. Peter Bradbury USDA/Cornell University

BTRY 7210: Topics in Quantitative Genomics and Genetics

Conifer Translational Genomics Network Coordinated Agricultural Project

Statistical Methods in Bioinformatics

Personal Genomics Platform White Paper Last Updated November 15, Executive Summary

"Genetics in geographically structured populations: defining, estimating and interpreting FST."

S SG. Metabolomics meets Genomics. Hemant K. Tiwari, Ph.D. Professor and Head. Metabolomics: Bench to Bedside. ection ON tatistical.

S G. Design and Analysis of Genetic Association Studies. ection. tatistical. enetics

SNPs - GWAS - eqtls. Sebastian Schmeier

Ning Yang 1., Yanli Lu 1,2., Xiaohong Yang 3, Juan Huang 1, Yang Zhou 1, Farhan Ali 1, Weiwei Wen 1, Jie Liu 1, Jiansheng Li 3, Jianbing Yan 1 *

Exact Multipoint Quantitative-Trait Linkage Analysis in Pedigrees by Variance Components

Additional file 8 (Text S2): Detailed Methods

Accuracy of whole genome prediction using a genetic architecture enhanced variance-covariance matrix

Association Mapping in Wheat: Issues and Trends

Linkage Disequilibrium. Adele Crane & Angela Taravella

University of Groningen. The value of haplotypes Vries, Anne René de

Axiom mydesign Custom Array design guide for human genotyping applications

A SUPER Powerful Method for Genome Wide Association Study

Data Mining and Applications in Genomics

Initiating maize pre-breeding programs using genomic selection to harness polygenic variation from landrace populations

Methods for linkage disequilibrium mapping in crops

A strategy for multiple linkage disequilibrium mapping methods to validate additive QTL. Abstracts

Lecture 10: Introduction to Genetic Drift. September 28, 2012

An introductory overview of the current state of statistical genetics

Gene Expression Data Analysis (I)

Title: Powerful SNP Set Analysis for Case-Control Genome Wide Association Studies. Running Title: Powerful SNP Set Analysis. Hill, NC. MD.

Author's response to reviews

Managing genetic groups in single-step genomic evaluations applied on female fertility traits in Nordic Red Dairy cattle

Creation of a PAM matrix

Genome-wide association mapping of agronomic and morphologic traits in highly structured populations of barley cultivars

ABSTRACT : 162 IQUIRA E & BELZILE F*

Testing Gene-Environment Interaction in Large Scale Case-Control Association Studies: Possible Choices and Comparisons

Gene-centric Genomewide Association Study via Entropy

Segmentation and Targeting

Conifer Translational Genomics Network Coordinated Agricultural Project

Introduction to gene expression microarray data analysis

Systematic comparison of CRISPR/Cas9 and RNAi screens for essential genes

CS273B: Deep Learning in Genomics and Biomedicine. Recitation 1 30/9/2016

Quality Control Assessment in Genotyping Console

Human SNP haplotypes. Statistics 246, Spring 2002 Week 15, Lecture 1

A Statistical Framework for Joint eqtl Analysis in Multiple Tissues

A GENOTYPE CALLING ALGORITHM FOR AFFYMETRIX SNP ARRAYS

Introduction to Genome Wide Association Studies 2014 Sydney Brenner Institute for Molecular Bioscience/Wits Bioinformatics Shaun Aron

Near-Balanced Incomplete Block Designs with An Application to Poster Competitions

Illumina s GWAS Roadmap: next-generation genotyping studies in the post-1kgp era

Runs of Homozygosity Analysis Tutorial

Applying Regression Techniques For Predictive Analytics Paviya George Chemparathy

Computational Haplotype Analysis: An overview of computational methods in genetic variation study

Nima Hejazi. Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi. nimahejazi.org github/nhejazi

Hierarchical Linear Modeling: A Primer 1 (Measures Within People) R. C. Gardner Department of Psychology

MMAP Genomic Matrix Calculations

PLINK gplink Haploview

Basic statistical analysis in genetic case-control studies

Genetic Variation and Genome- Wide Association Studies. Keyan Salari, MD/PhD Candidate Department of Genetics

Introduction to Genome Wide Association Studies 2015 Sydney Brenner Institute for Molecular Bioscience Shaun Aron

CS 5984: Application of Basic Clustering Algorithms to Find Expression Modules in Cancer

Why can GBS be complicated? Tools for filtering, error correction and imputation.

Principal Component Analysis in Genomic Data

Exploring the Genetic Basis of Congenital Heart Defects

A Propagation-based Algorithm for Inferring Gene-Disease Associations

1 why study multiple traits together?

Efficient Control of Population Structure in Model Organism Association Mapping

Lees J.A., Vehkala M. et al., 2016 In Review

Improving the Accuracy of Base Calls and Error Predictions for GS 20 DNA Sequence Data

Designing Genome-Wide Association Studies: Sample Size, Power, Imputation, and the Choice of Genotyping Chip

Gene Discovery of Nitrogen Utilization in Maize Yuhe Liu, Devin Nichols, Stephen Moose Department of Crop Sciences, University of Illinois

ARTICLE Leveraging Multi-ethnic Evidence for Mapping Complex Traits in Minority Populations: An Empirical Bayes Approach

Genomic Selection in R

Improved Power by Use of a Weighted Score Test for Linkage Disequilibrium Mapping

Segmentation and Targeting

Mapping and selection of bacterial spot resistance in complex populations. David Francis, Sung-Chur Sim, Hui Wang, Matt Robbins, Wencai Yang.

Genetic methods and landscape connectivity

Quality Measures for CytoChip Microarrays

Nature Genetics: doi: /ng Supplementary Figure 1. QQ plots of P values from the SMR tests under a range of simulation scenarios.

Park /12. Yudin /19. Li /26. Song /9

Genetics and Bioinformatics

MSc specialization Animal Breeding and Genetics at Wageningen University

Tutorial Segmentation and Classification

Accuracy and Training Population Design for Genomic Selection on Quantitative Traits in Elite North American Oats

EECS730: Introduction to Bioinformatics

SECTION 11 ACUTE TOXICITY DATA ANALYSIS

Transcription:

Lecture 6: Association Mapping Michael Gore lecture notes Tucson Winter Institute version 18 Jan 2013

Association Pipeline

Phenotypic Outliers Outliers are unusual data points that substantially deviate from the mean and strongly influence parameter estimates Should ALWAYS check for outliers in your data sets Do NOT ignore outliers if detected

Phenotypic Outliers Not all outliers are illegitimate contaminants, and not all illegitimate scores show up as outliers. (Barnett & Lewis, 1994) Outliers can 1) increase error variance 2) reduce the power of statistical tests 3) distort estimates 4) decrease normality if non-randomly distributed

Potential Causes of Outliers Human errors in data collection, recording, or entry Technical errors from faulty or non-calibrated phenotyping equipment Intentional or motivated mis-reporting such as speed phenotyping in a hot field environment Osborne, Jason W. & Amy Overbay (2004)

Phenotyping Tools Barcoded tools, barcoded tags on plants, barcode scanners, radio-frequency identification (RFID) tags, smartphones, hand-held mobile computer

More Potential Causes of Outliers Sampling error such as underrepresentation of a subpopulation Incorrect assumption about the distribution (e.g., temporal trend not accounted for in experimental design) Real biological outlier 1% chance of getting an outlier that is 3 standard deviations from the mean Osborne, Jason W. & Amy Overbay (2004)

Evaluate Data for Outliers Get to know your data! Histogram Box-plot (Box and Whisker plot) Quantile-Quantile plot graphical method for comparing two probability distributions to assess goodness-of-fit

Histogram

Box-plot

Box-plot 25% data 50% data 25% data Central plus sign - mean Center line - median

Q-Q-plot (normal) delta-tocotrienol content in maize grain single rep

Statistical Identification of Outliers Two of several possible methods Cook s distance measures influence of a data point. Data points that substantially change effect estimates. Deleted studentized residuals measures leverage of a data point. Data points that affect least squares fit.

Removal of Outliers Removing anomalous data points from data sets is controversial to some folks. My two cents If outliers are not removed, inferences made from the fitted model may not be representative of the population under study. If you remove outliers, then be sure to report it in the manuscript.

Q-Q-plot (normal) delta-tocotrienol content in maize grain single rep Outliers removed based on deleted studentized residuals

Non-Normal Trait Data When fitting a mixed model, two very important assumptions are that the error terms follow a normal distribution and that there is a constant variance. When data are non-normal, these two assumptions in particular could be violated.

Analysis of Non-Normal Trait Data Generalized linear mixed models can be used to analyze non-normal data The Box-Cox procedure can be used to find the most appropriate transformation that corrects for non-normality of the error terms and unequal variances.

Box-Cox Transformation Common Box-Cox Transformations l Y -2 Y -2 = 1/Y 2-1 Y -1 = 1/Y 1-0.5 Y -0.5 = 1/(Sqrt(Y)) 0 log(y) 0.5 Y 0.5 = Sqrt(Y) 1 Y 1 = Y 2 Y 2 http://www.isixsigma.com/tools-templates/normality/making-data-normal-using-boxcox-power-transformation/

Q-Q-plot (normal) delta-tocotrienol content in maize grain two reps Outliers removed based on deleted studentized residuals; Box-Cox Transformed BLUPs

Association Pipeline

Population structure allele frequency differences among individuals due to local adaptation or diversifying selection Familial relatedness allele frequency differences among individuals due to recent co-ancestry If not properly controlled both can cause spurious associations in GWAS

Maize Population Structure Fitch-Margoliash tree for 260 maize inbred lines using the log-transformed proportion of shared alleles distance from 94 SSR markers Liu et al. 2003 Genetics 165:2117-2128

Structured Association (Q) A set of random markers is used to estimate population structure Estimates are incorporated into a statistical analysis to control for genetic structure

STRUCTURE (Q) Model-based clustering method to detect the presence of distinct populations Assign individuals to populations Study hybrid zones Identify migrants and admixed individuals Estimate population allele frequencies http://pritch.bsd.uchicago.edu/software.html

STRUCTURE (Q) Identify different subpopulations within a sample of individuals collected from a population of unknown structure With a fixed number of subpopulations (k), STRUCTURE is used to assign individuals to clusters (i.e., subpopulation) with a probability of membership

STRUCTURE (Q) Estimated ln(probability of the data; blue) and Var[ln(probability of the data; pink)] for k from 2 to 5. Values are from STRUCTURE run three times at each value of k using 89 SSRs Hamblin et al. 2007 PLoS ONE 2:e1367

Maize Population Assignment Inbred Non Stiff Stalk Stiff Stalk Tropical Subpopulation 796 78.5% 18.9% 2.6% mixed 4226 91.7% 7.1% 1.2% nss 33-16 97.2% 1.4% 1.4% nss 38-11 99.3% 0.3% 0.4% nss A188 98.2% 1.3% 0.6% nss A214N 1.7% 76.2% 22.1% mixed A239 96.3% 3.5% 0.2% nss A272 12.2% 1.9% 85.9% ts A441-5 53.1% 0.5% 46.4% mixed A554 97.9% 1.9% 0.2% nss A556 99.4% 0.4% 0.2% nss A6 3.0% 0.3% 96.7% ts A619 99.0% 0.9% 0.1% nss Ab28A 77.6% 0.2% 22.2% mixed B10 57.0% 42.9% 0.2% mixed B164 75.6% 23.3% 1.1% mixed B2 98.8% 0.7% 0.5% nss B37 0.2% 99.7% 0.1% ss B46 78.4% 21.4% 0.2% mixed B52 98.5% 1.2% 0.3% nss Flint-Garcia et al. 2005 The Plant Journal 44:1054-1064

STRUCTURE (Q) Van Inghelandt et al. 2010 Theoretical Applied Genetics 120:1289-1299 Determination of the number of subpopulations k based on STRUCTURE and 359 SSR markers using the ad hoc criterion Δ K of Evanno et al. 2005 Mol. Ecol. 14:2611-2620

Principle Component Analysis Fast and effective approach to diagnose population structure PCA summarizes variation observed across all markers into a smaller number of underlying component variables

Principle Component Analysis PCs relate to separate, unobserved subpopulations from which genotyped individuals originated Loading of each individual on each PC describe population membership

Principle Component Analysis EINGENSTRAT software package is used to estimate PCs of the marker data

Principle Component Analysis Scree plot shows the fraction of total variance in the data explained by each PC

Kinship Coefficient (K) A kinship coefficient (F) is the probability that two homologous genes are identical by descent Kinship from genetic markers is an estimate of relative kinship that is based on probabilities of identical by state Even with pedigrees, marker-based kinship has higher accuracy

Marker-based Relative Kinship (K) F ij = (Q ij -Q m )/(1-Q m ) Θ ij, where Q ij is the probability of identity by state for random genes from i and j, and Q m is the average probability of identity by state for genes coming from random individuals in the population from which i and j where drawn.

Marker-based Relative Kinship (K) SPAGeDi software package with Loiselle (Loiselle et al., 1995) kinship coefficient can be used to generate the relative kinship matrix Negative values between individuals should be set to zero. These individuals are less related than random individuals (i.e., degree of genetic covariance caused by polygenic effects is defined to be 0)

Marker-based Relative Kinship (K) The original kinship matrix for mixed models is an additive numerator relationship matrix calculated from pedigree Diagonals have values between 1 and 2 (1 plus inbreeding coefficient) and off diagonals have values between 0 and 1 (kinship)

Marker-based Relative Kinship (K) Therefore, A-matrix is twice the coancestry matrix used in population genetics Multiplication of K by 2 for inbred lines does not change BLUEs and BLUPs but affects the ratios of VC estimates

Number of Background Markers Needed for Estimating Relationships Likelihood-based model fitting can be used to quantify robustness of genetic relatedness derived from markers Kinship is more sensitive than population structure to marker number Yu et al. 2009 The Plant Genome 2:63-77

Number of Background Markers Needed for Estimating Relationships Robustness of relationship estimate can be tested by fitting multiple phenotypic traits with different subsets of markers Number required for biallelic SNPs is higher than multiallelic SSRs (e.g., maize 100 SSRs; 1000 SNPs) Yu et al. 2009 The Plant Genome 2:63-77

Model Fit with the Mixed Model Bayesian information criterion (BIC) can be used to determine the optimal number of principal components to include in the GWAS model The goodness-of-fit among GWAS models can be assessed with a likelihoodratio-based R 2 statistic, denoted R 2 LR Sun et al. 2010 Heredity 105:333-340

Model Fit with the Mixed Model R 2 considers the change of likelihood LR between models with different fixed and random effects R 2 is easily computed and has a LR monotonic nondecreasing property Sun et al. 2010 Heredity 105:333-340

Computation with Mixed Models Large numbers of individuals is computationally intensive for mixed models Computing time for solving MLM increases with the cube of the number of individuals fit as a random effect Computing time is further increased because iteration is needed to estimate parameters such as variance components

Efficient Computation Compression reduce the size of the random genetic effect in the absence of pedigree information by clustering individuals into groups Population parameters previously determined (P3D) eliminate iterations to re-estimate variance components for each marker Zhang et al. 2010 Nature Genetics 42:355-360

Compression and P3D The joint use of compression MLM and P3D greatly reduces computing time and maintains or increases statistical power no compression, no P3D GWAS with 1,315 individuals and 1 million markers takes 26 years compression with P3D takes 2.7 days for GWAS with the same data set Zhang et al. 2010 Nature Genetics 42:355-360

Correcting for Multiple Testing Bonferroni correction procedure to control the family-wise error rate (i.e., probability of making one or more type I errors) Simplest and most conservative method to control FWER Calculated as α/n, when n is number of hypotheses (i.e., SNPs tested)

Correcting for Multiple Testing False Discovery Rate procedure to control the expected proportion of false discoveries Less stringent than Bonferroni q-value is the FDR analogue of p-value e.g., q=0.10 is 10 false discoveries/100 tests Benjamini Hochberg procedure is often used in GWAS

Candidate Genes When controlling for the multiple testing problem in GWAS, typically only the strongest signals are identified To help detect SNPs with moderate effects that are not significant at the genome-wide level, could consider only SNPs within or at a fixed distance from candidate genes candidate gene FDR

Test for Candidate Gene Enrichment with GWAS hits PICARA an analytical approach designed to search for and validate a priori candidates that are puportedly involved in phenotypic variation and to integrate these candidates with information content taken from GWAS Chen et al. (2012) PLoS ONE 7(11):e46596

Rare Variants in GWAS GWAS typically identify common variants but rare variants are important Most NGS-GWAS are severely underpowered for testing rare variants SNPs with a very low MAF (<0.05) may not have Type I error rates at nominal levels Need very large sample sizes

Rare Variant GWAS Methods Power of single marker tests is poor, thus collapsing rare variants across a region creates a more common aggregate (Li and Leal, 2008). Weighting of alleles on criteria such as MAF or biological function (Madsen and Browning, 2009) Ladouceur et al. (2012) The Empirical Power of Rare Variant Association Methods: Results from Sanger Sequencing in 1,998 Individuals. PLoS Genet 8: e1002496

Not Controlling for Population Structure and Relatedness

Controlling for Population Structure and Relatedness

Stepwise Regression with the Multi-Locus Mixed Model (MLMM) To clarify complex association signals that involve a major effect locus The MLMM employs stepwise mixedmodel regression with forward inclusion and backward elimination, thus allowing for a more exhaustive search of a large model space The MLMM re-estimates the variance components of the model at each step Segura et al. (2012) Nature Genetics 44: 825 830

Controlling for Population Structure and Relatedness Three SNPs Identified by MLMM Included as Covariates

Software for Genome-Wide Association and Genomic Selection GAPIT - Genome Association and Prediction Integrated Tool http://www.maizegenetics.net/gapit TASSEL - Trait Analysis by association, Evolution and Linkage http://www.maizegenetics.net/bioinformatics