Genome-wide association studies (GWAS) Part 1 Matti Pirinen FIMM, University of Helsinki 03.12.2013, Kumpula Campus FIMM - Institute for Molecular Medicine Finland www.fimm.fi
Published Genome-Wide Associations through 12/2012 Published GWA at p ≤ 5e-8 for 17 trait categories Contents: 1. Concepts, Rationale 2. Technology and Data 3. Examples and criticism 4. Applications NHGRI GWA Catalog www.genome.gov/gwastudies www.ebi.ac.uk/fgpt/gwas/
What is an association? IDEAL CASE TYPICAL CASE
Concepts Individuals (humans, mice, flies, plants) Phenotypes, traits Quantitative (height, cholesterol levels) Discrete (case-control, Parkinson's disease) EXPLAIN Understanding, mechanisms, therapeutics PREDICT Intervention, agriculture, no need to understand
Y = μ + G + E + (G×E)
Y: phenotype; μ: population mean
Environment (E): chemicals, temperature, food, physical activity...
Genetics (G): DNA; A, C, G, T; 3e+9 bases; 22 chrs + X + Y; diploid; meiosis
Genes → proteins: 2e+4 genes, 1-2 % of DNA
Single nucleotide polymorphisms (SNPs)
G×E: not in this course
Transmission of genomes From parents to offspring with recombination and mutation Variation at two physically close genetic loci tends to be statistically correlated
Linkage disequilibrium
Non-independence of alleles at (two) SNPs in a population
LD SNPs tag each other in a population sample
Knowing the allele at locus 1 gives some information about the allele at locus 2
Basis for gene mapping with a SNP panel
Two SNPs, A/a and B/b, with allele frequencies A: 50%, a: 50%, B: 50%, b: 50%
Haplotype   Case 1   Case 2   Case 3
A-B         0.25     0.3      0.5
A-b         0.25     0.2      0
a-B         0.25     0.2      0
a-b         0.25     0.3      0.5
r^2         0        0.13     1
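The r^2 values in the haplotype table can be checked with a few lines of Python. This is a minimal sketch using the standard definitions D = p(AB) - p(A)p(B) and r^2 = D^2 / (p(A)p(a)p(B)p(b)); the function name `ld_r2` is made up for illustration. With haplotype frequencies of exactly 0.3 and 0.2 the formula gives r^2 = 0.04, whereas the tabulated 0.13 would follow from frequencies such as 0.34 and 0.16, so the displayed frequencies may have been rounded (a guess, not something the slides state).

```python
def ld_r2(p_AB, p_Ab, p_aB, p_ab):
    """Return (D, r^2) for two biallelic loci from the four haplotype frequencies."""
    total = p_AB + p_Ab + p_aB + p_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_A = p_AB + p_Ab            # frequency of allele A at locus 1
    p_B = p_AB + p_aB            # frequency of allele B at locus 2
    D = p_AB - p_A * p_B         # deviation from independence
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return D, D * D / denom


# The three columns of the table, plus the hypothetical unrounded middle case.
examples = [
    (0.25, 0.25, 0.25, 0.25),   # independence
    (0.30, 0.20, 0.20, 0.30),   # frequencies as displayed
    (0.34, 0.16, 0.16, 0.34),   # assumed unrounded frequencies
    (0.50, 0.00, 0.00, 0.50),   # perfect LD
]
for freqs in examples:
    D, r2 = ld_r2(*freqs)
    print(freqs, "D = %.3f" % D, "r^2 = %.2f" % r2)
```

When r^2 = 1 the two SNPs carry the same information, which is why a genotyped SNP can stand in for an ungenotyped neighbour.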
1000 genomes project
Chr 2 pos 136608646 From 1000 Genomes database
Does genetics affect the trait? Do closer relatives have more similar phenotypes (on average)? Twin studies compare monozygotic twins with dizygotic twins Environment can be a confounder in humans Recently, the same question has been asked with unrelated individuals (Yang 2010, Nat Gen) Environment is not an (obvious) confounder
Variance component analysis Decompose trait variance into genetic and environmental components (Y=G+E) For height in 14,500 Finnish samples, the genetic component explains 52% of the variation Heritability: the proportion of variance explained by genetics
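To make the definition of heritability concrete, here is a toy simulation of Y = μ + G + E with independent G and E. It is only a sketch: the variances 0.52 and 0.48 are chosen to mimic the 52% figure, the normal distributions and the mean of 170 are assumptions, and in real data G is not observed directly, so the ratio has to be estimated with a variance component model rather than computed as below.

```python
import random
import statistics


def simulate_heritability(n, var_g, var_e, mu=0.0, seed=0):
    """Simulate Y = mu + G + E and return Var(G) / Var(Y) in the sample."""
    rng = random.Random(seed)
    g = [rng.gauss(0.0, var_g ** 0.5) for _ in range(n)]   # genetic values
    e = [rng.gauss(0.0, var_e ** 0.5) for _ in range(n)]   # environmental values
    y = [mu + gi + ei for gi, ei in zip(g, e)]              # phenotypes
    return statistics.pvariance(g) / statistics.pvariance(y)


# Hypothetical height-like trait: mean 170, 52% of variance genetic.
h2 = simulate_heritability(n=14500, var_g=0.52, var_e=0.48, mu=170.0)
print("sample heritability: %.3f" % h2)
```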
Gene mapping Once we know there is a genetic component to the trait, we want to find the relevant positions (loci; singular locus) on the genome affecting the trait Linkage analysis in pedigrees Association analysis in populations
Linkage analysis: Trace whether some segments of genome are co-transmitted with the disease Palles et al. Nat Gen 2013
Linkage analysis Families Does trait correlate with genetic sharing at some parts of the genome? Parametric Linkage analysis in pedigrees Affected sib-pairs (non-parametric) Need for families restricts sample size and thus size of genetic effects that can be found Localisation is coarse since large blocks of genome are linked in close relatives Sequence data and annotations of variants help
Collect a population sample and test whether trait is associated with any genetic variants
Association analysis If a large panel of SNPs can be genotyped, association between each SNP and the trait can be tested Role of LD (HapMap project, 1000 Genomes) Only samples from (homogeneous) population (not families) needed Genome coverage and localisation depend on #SNPs and LD in the population
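A single-SNP association test for a quantitative trait can be sketched as a simple linear regression of the phenotype on the allele count (0, 1 or 2). The code below is an illustration under stated assumptions, not the method of any particular GWAS tool: the additive coding, the large-sample normal approximation for the p-value, the function name `snp_association`, and the simulated data are all choices made here.

```python
import math
import random


def snp_association(genotypes, phenotypes):
    """Regress phenotype on allele count; return (beta, standard error, p-value)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes))
    beta = sxy / sxx                                   # effect per allele copy
    alpha = mean_y - beta * mean_g                     # intercept
    rss = sum((y - alpha - beta * g) ** 2 for g, y in zip(genotypes, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)
    if se == 0:
        return beta, se, 0.0                           # perfect fit
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))               # two-sided, normal approximation
    return beta, se, p


# Simulated example: one SNP with a true effect of 0.5, one SNP with none.
rng = random.Random(1)
n = 2000
causal = [(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(n)]
null = [(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(n)]
trait = [0.5 * g + rng.gauss(0, 1) for g in causal]

beta_c, se_c, p_c = snp_association(causal, trait)
beta_n, se_n, p_n = snp_association(null, trait)
print("causal SNP: beta = %.3f, p = %.2e" % (beta_c, p_c))
print("null SNP:   beta = %.3f, p = %.2e" % (beta_n, p_n))
```

In a GWAS the same test is repeated for every SNP on the panel, which is why a stringent threshold such as p ≤ 5e-8 is used.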
Risch: Searching for genetic determinants in the new millennium. Nature, June 2000
Genotyping Current Human SNP Chips Contain probes for several million SNPs Price ~100 euros/sample Figure from Steven M. Carr http://www.mun.ca/biology/scarr/dna_chips.html
Genotype calling OptiCall Shah et al.
Output from genotyping GOOD! ERROR, clustering algorithm ERROR, monomorphic ERROR, structural variant From D.Phil thesis of Damjan Vukcevic, Oxford, 2009
Quality control Remove individuals and SNPs that show low quality Individuals: sex discrepancies, missingness, heterozygosity, relatedness, ancestry SNPs: missingness, deviation from Hardy-Weinberg equilibrium, low minor allele frequency
X intensity Sex
Relatedness [Figure: sharing along the genome for pairs of individuals; colour gives the number of chromosomes identical by descent: 0 gray, 1 blue, 2 red]
Population structure
QC genotype calls The optimal approach is either to model the calling errors or to look at all cluster plots SNP filtering is then a shortcut The level of SNP filtering is therefore a trade-off
Minor allele frequency Genotype calling algorithms like to model the extra clusters
Hardy-Weinberg Clustering errors or putative structural variation often distort HW equilibrium
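Deviation from Hardy-Weinberg equilibrium can be quantified by comparing observed genotype counts with those expected under random mating (p^2, 2pq, q^2). The following is a minimal sketch using a 1-d.f. chi-square test; the function name `hwe_chisq` is hypothetical, and in practice an exact test is often preferred when some genotype counts are small.

```python
import math


def hwe_chisq(n_AA, n_Aa, n_aa):
    """Chi-square test (1 d.f.) for Hardy-Weinberg equilibrium; returns (chi2, p)."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)        # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


print(hwe_chisq(250, 500, 250))   # genotype counts exactly at HW proportions
print(hwe_chisq(500, 0, 500))     # no heterozygotes at all: extreme deviation
```

A SNP whose clusters were mis-called (for example heterozygotes merged into a homozygote cluster) tends to show up with a very small p-value here.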
Missingness High missingness is also indicative of clustering failure or poor cluster resolution
Testing for association 100% of SNPs
Testing for association 80.69% of SNPs MAF > 1%
Testing for association 78.36% of SNPs MAF > 1% & info > 0.975
Testing for association 78.36% of SNPs MAF > 1% & info > 0.975 & HW < 1e-20
Testing for association 77.92% of SNPs MAF > 1% & info > 0.975 & HW < 1e-20 & miss < 2%
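Two of the SNP filters listed above, minor allele frequency and missingness, can be computed directly from the genotype calls. The sketch below assumes genotypes are coded as allele counts 0/1/2 with `None` for a missing call; the function names are invented for illustration, and the thresholds default to the MAF > 1% and miss < 2% values from the slides. The info and HW filters are not reproduced here because they need quantities that this toy representation does not carry.

```python
def snp_qc_stats(genotypes):
    """Return (missingness, minor allele frequency) for one SNP.

    genotypes: list of allele counts 0/1/2, with None for a missing call.
    """
    called = [g for g in genotypes if g is not None]
    missingness = 1 - len(called) / len(genotypes)
    if not called:
        return missingness, 0.0
    freq = sum(called) / (2 * len(called))    # frequency of the counted allele
    return missingness, min(freq, 1 - freq)


def passes_qc(genotypes, min_maf=0.01, max_missing=0.02):
    """Keep a SNP only if MAF > min_maf and missingness < max_missing."""
    missingness, maf = snp_qc_stats(genotypes)
    return maf > min_maf and missingness < max_missing


common = [0] * 90 + [1] * 10          # MAF 5%, no missing calls
monomorphic = [0] * 100               # MAF 0
poorly_called = [0, 1, 2, None] * 25  # 25% missing calls
for name, snp in [("common", common), ("monomorphic", monomorphic),
                  ("poorly_called", poorly_called)]:
    print(name, snp_qc_stats(snp), passes_qc(snp))
```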
Assumption behind the association test: given an individual's disease status, their genotype at a SNP is independent of, and identically distributed to, that of other samples, conditional on the underlying genotype frequencies in cases and controls Checks supporting it: fingerprint markers, gender checks; genotype QC, intensity outliers; duplicate checks, relatedness; population structure analysis
A good Manhattan plot Wellcome Trust Case Control Consortium, Crohn's disease, Nature 2007 Shows signals supported by many neighboring SNPs
A retracted paper Sebastiani et al. Genetic signatures of exceptional longevity in humans Science July 2010 Was retracted in July 2011 because QC had not been done properly! Sebastiani et al. 2010 Science
Principal component analysis (PCA) Dimension reduction technique in statistics Represent high dimensional data points in a few (usually one or two) dimensions Optimization criterion is maximal empirical variance (See handout)
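The "maximal empirical variance" criterion can be illustrated by finding the first principal component with power iteration on the centred data. This is a teaching sketch only: power iteration is used here because it needs nothing beyond the standard library, whereas real analyses would rely on an eigendecomposition or SVD routine from a numerical library. The function name `first_pc` and the toy data are assumptions.

```python
import math


def first_pc(rows, n_iter=200):
    """Return (direction, scores, variance) of the first principal component."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    X = [[r[j] - means[j] for j in range(m)] for r in rows]   # centre each column
    v = [1.0 / math.sqrt(m)] * m                              # starting direction
    for _ in range(n_iter):
        scores = [sum(x[j] * v[j] for j in range(m)) for x in X]            # X v
        w = [sum(X[i][j] * scores[i] for i in range(n)) for j in range(m)]  # X^T X v
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0:        # no variance along the current direction
            break
        v = [c / norm for c in w]
    scores = [sum(x[j] * v[j] for j in range(m)) for x in X]
    variance = sum(s * s for s in scores) / (n - 1)
    return v, scores, variance


# Toy data lying exactly on the line y = 2x: one dimension captures everything.
points = [(float(t), 2.0 * t) for t in range(-5, 6)]
direction, scores, variance = first_pc(points)
print("direction:", [round(c, 4) for c in direction])
print("variance along PC1:", round(variance, 4))
```

In a GWAS the rows would be individuals and the columns (standardised) SNP genotypes, so the scores place each individual along the main axis of genetic variation.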
PC plot of Europeans Novembre et al. 2008 Nat Gen
PCA in Northern Europe Salmela E, et al. (2008) Genome-Wide Analysis of Single Nucleotide Polymorphisms Uncovers Population Structure in Northern Europe. PLoS ONE 3(10): e3519
Uses of PCA in GWAS PCs tell about ancestry Exclude individuals of unexpected ancestry Use PCs as covariates in the analysis to correct for possible biases induced by sample collection or non-genetic geographical effects on phenotype
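Why a PC covariate helps can be seen in a small simulation. Two subpopulations differ both in allele frequency and in mean phenotype, while the SNP itself has no effect. The sketch below uses the subpopulation label as a stand-in for the first PC (an assumption; real PCs are continuous and estimated from the genotypes), and adjusts for it by regressing it out of both genotype and phenotype before fitting the slope. All names and parameter values are hypothetical.

```python
import random


def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx


def residuals(y, x):
    """Residuals of y after regressing out x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = slope(x, y)
    return [yi - my - b * (xi - mx) for xi, yi in zip(x, y)]


def adjusted_slope(g, y, covariate):
    """Slope of y on g after adjusting both for the covariate."""
    return slope(residuals(g, covariate), residuals(y, covariate))


rng = random.Random(7)
n = 4000
pop = [i % 2 for i in range(n)]                          # stand-in for PC1
freq = {0: 0.2, 1: 0.6}                                  # allele frequency per population
geno = [(rng.random() < freq[k]) + (rng.random() < freq[k]) for k in pop]
pheno = [1.0 * k + rng.gauss(0, 1) for k in pop]         # no SNP effect at all

naive = slope(geno, pheno)
adjusted = adjusted_slope(geno, pheno, pop)
print("naive effect estimate:    %.3f" % naive)
print("adjusted effect estimate: %.3f" % adjusted)
```

The naive estimate is clearly non-zero purely because of structure, while the adjusted estimate is close to zero.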
Population structure in case-control association studies Are there differences in genotype frequencies between cases and controls? If yes, then locus is possibly interesting But could also reflect ascertainment scheme if cases and controls are not well matched w.r.t. genetic background! Example: Let's look at analysis where 5,000 UK controls are compared against 1,800 UK + 500 Irish Psoriasis cases
UK+Irish Psoriasis study Not well matched w.r.t. genetic background!
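The question "are there differences in genotype frequencies between cases and controls?" is commonly answered with a chi-square test on allele counts. Below is a minimal sketch of such an allelic test; the function name and the example counts are made up, and the test treats the two alleles of an individual as independent, which is itself an assumption. Note that the test cannot tell a true disease association from a frequency difference caused by poorly matched ancestry.

```python
import math


def allelic_test(case_counts, control_counts):
    """Allelic chi-square test (1 d.f.).

    Each argument is (n_AA, n_Aa, n_aa). Returns (odds_ratio, chi2, p).
    """
    a = 2 * case_counts[0] + case_counts[1]         # A alleles in cases
    b = 2 * case_counts[2] + case_counts[1]         # a alleles in cases
    c = 2 * control_counts[0] + control_counts[1]   # A alleles in controls
    d = 2 * control_counts[2] + control_counts[1]   # a alleles in controls
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("an allele or a group has zero count")
    chi2 = n * (a * d - b * c) ** 2 / margins
    p = math.erfc(math.sqrt(chi2 / 2))              # chi-square upper tail, 1 d.f.
    odds_ratio = (a * d) / (b * c) if b * c > 0 else float("inf")
    return odds_ratio, chi2, p


# Hypothetical genotype counts (AA, Aa, aa) in 100 cases and 100 controls.
print(allelic_test((30, 50, 20), (20, 50, 30)))
```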
Basic analysis vs. first PC as covariate [Figure: association results under the two analyses, LACTASE REGION marked] [Diagram: Population, LACTASE REGION SNP, Phenotype: LACTASE REGION?]