Genome-wide association studies (GWAS) Part 1

Genome-wide association studies (GWAS) Part 1 Matti Pirinen FIMM, University of Helsinki 03.12.2013, Kumpula Campus FIMM - Institiute for Molecular Medicine Finland www.fimm.fi

Published Genome-Wide Associations through 12/2012 Published GWA at p 5e-8 for 17 trait categories Contents: 1.Concepts, Rationale 2.Technology and Data 3.Examples and criticism 4.Applications NHGRI GWA Catalog www.genome.gov/gwastudies www.ebi.ac.uk/fgpt/gwas/

What is an association? IDEAL CASE TYPICAL CASE

Concepts Individuals (humans, mice, flies, plants) Phenotypes, traits Quantitative (height, cholesterol levels) Discrete (case-control, Parkinson's disease) EXPLAIN Understanding, mechanisms, therapeutics PREDICT Intervention, agriculture, no need to understand

Y=μ+G+E+(GxE) Y phenotype, μ population mean E nvironment Chemicals, temperature, food, physical activity... G enetics DNA, A,C,G,T, 3e+9 bases, 22 chrs + X + Y, diploid, meiosis Genes proteins, 2e+4, 1-2 % of DNA Single nucleotide polymorphisms (SNPs) GxE not in this course

Transmission of genomes From parents to offspring with recombination and mutation Variation at two physically close genetic loci tend to be statistically correlated

Linkage disequilibrium Non-independence of alleles at (two) SNPs in population LD SNPs tag each other in a population sample Knowing allele at locus 1 gives some information about the allele at locus 2 Basis for gene mapping with a SNP panel A: 50% B: 50% a: 50% b: 50% A/a B/b A- B 0.25 0.3 0.5 A- b 0.25 0.2 0 a-b 0.25 0.2 0 a-b 0.25 0.3 0.5 r^2 0 0.13 1

1000 genomes project

Chr 2 pos 136608646 From 1000 Genomes database

Does genetics affect trait? Do more close relatives have more similar phenotypes (on average)? Twin studies compare monozygotic twins with dizygotic twins Environment can be a confounder in humans Recently, same question has been asked with unrelated individuals (Yang 2010, Nat Gen) Environment is not an (obvious) confounder

Variance component analysis Decompose trait variance into genetic and environmental (Y=G+E) For height in 14,500 Finnish samples, we get that genetic component explains 52% of variation Heritability, the proportion of variance explained by genetics

Gene mapping Once we know there is a genetic component to the trait we want to find the relevant positions (loci, a locus) on the genome affecting the trait Linkage analysis in pedigrees Association analysis in populations

Linkage analysis: Trace whether some segments of genome are co-transmitted with the disease Palles et al. Nat Gen 2013

Linkage analysis Families Does trait correlate with genetic sharing at some parts of the genome? Parametric Linkage analysis in pedigrees Affected sib-pairs (non-parametric) Need for families restricts sample size and thus size of genetic effects that can be found Localisation is coarse since large blocks of genome are linked in close relatives Sequence data and annotations of variants help

Collect a population sample and test whether trait is associated with any genetic variants

Association analysis If a large panel of SNPs can be genotyped, association between each SNP and the trait can be tested Role of LD (HapMap project, 1000 Genomes) Only samples from (homogeneous) population (not families) needed Genome coverage and localisation depend on #SNPs and LD in the population

Risch: Searching for genetic determinants in the new millenium Nature, June 2000

Genotyping Current Human SNP Chips Contain probes for several million SNPs Price ~100 euros/sample Figure from Steven M. Carr http://www.mun.ca/biology/ scarr/dna_chips.html

Genotype calling OptiCall Shah et al.

Output from genotyping GOOD! ERROR, clustering algorithm ERROR, monomorphic ERROR, structural variant From D.Phil thesis of Damjan Vukcevic, Oxford, 2009

Quality control Remove individuals and SNPs that show low quality Individuals: Sex discrepancies, missingness, heterozygosity, relatedness, ancestry SNPs: missingness, deviation from HardyWeinberg equilibrium, low minor allele frequency

X intensity Sex

Relatedness Genome Pairs of individuals Number of chromosomes Identical by Descent 0: Gray 1: Blue 2: Red

Population structure

QC genotype calls The optimal approach is either to model the calling errors or to look at all cluster plots SNP filtering is then a short cut The level of SNP filtering is therefore a trade-off

Minor allele frequency Genotype calling algorithms like to model the extra clusters

Hardy-Weinberg Clustering errors, or putative structural variation often distorts HW equilibrium

Missingness High missingness is also indicative of clustering failure or poor cluster resolution

Testing for association 100% of SNPS

Testing for association 80.69% of SNPS 1% MAF

Testing for association 78.36% of SNPS MAF > 1% & info > 0.975

Testing for association 78.36% of SNPS MAF > 1% & info > 0.975 & HW < 1e-20

Testing for association 77.92% of SNPS MAF > 1% & info > 0.975 & HW < 1e-20 miss < 2%

Fingerprint markers, gender checks Genotype QC, intensity outlier Given an individual s disease status, their genotype at a SNP, is independent of, and identically distributed to, other samples conditional on the underlying genotype frequencies in cases and controls Duplicate checks, relatedness Population structure analysis

A good Manhattan plot Wellcome Trust Case Control Consortium, Crohn's disease, Nature 2007 Shows signals supported by many neighboring SNPs

A retracted paper Sebastiani et al. Genetic signatures of exceptional longevity in humans Science July 2010 Was retracted in July 2011 because QC had not been done properly! Sebastiani et al. 2010 Science

Principal compnent analysis (PCA) Dimension reduction technique in statistics Represent high dimensional data points in a few (usually one or two) dimensions Optimization criterion is maximal empirical variance (See handout)

PC plot of Europeans Novembre et al. 2008 Nat Gen

PCA in Northern Europe Salmela E, et al. (2008) Genome-Wide Analysis of Single Nucleotide Polymorphisms Uncovers Population Structure in Northern Europe. PLoS ONE 3(10): e3519

Uses of PCA in GWAS PCs tell about ancestry Exclude individuals of unexpected ancestry Use PCs as covariates in the analysis to correct for possible biases induced by sample collection or non-genetic geographical effects on phenotype

Population structure in case-control association studies Are there differences in genotype frequencies between cases and controls? If yes, then locus is possibly interesting But could also reflect ascertainment scheme if cases and controls are not well matched w.r.t. genetic background! Example: Let's look at analysis where 5,000 UK controls are compared against 1,800 UK + 500 Irish Psoriasis cases

UK+Irish Psoriasis study Not well matched w.r.t genetic background!

Basic analysis First PC as covariate LACTASE REGION SNP LACTASE REGION? Population Phenotype