Introduction to statistics for Genome-Wide Association Studies (GWAS)
Day 2, Section 8

Outline
- Background on GWAS
- Presentation of GenABEL
- Data checking with GenABEL
- Data analysis with GenABEL
- Display of results

R Packages for GWAS
- PLINK: developed at MGH and the Broad Institute of Harvard and MIT; its R interface is available on Linux platforms only (http://pngu.mgh.harvard.edu/~purcell/plink/).
- SNPassoc (González J.R., et al. Bioinformatics. 2007, 23(5):654-655)
- GenABEL (Aulchenko Y.S., Ripke S., Isaacs A., van Duijn C.M. Bioinformatics. 2007, 23(10):1294-6)

What is a GWAS?
A genome-wide association study is an approach that involves rapidly scanning markers across the genome (~0.5M or 1M) of many people (~2K) to find genetic variations associated with a particular disease.
A large number of subjects is needed because:
(1) associations between SNPs and causal variants are expected to show low odds ratios, typically below 1.5;
(2) given the very large number of tests required, associations must show a high level of significance to survive the multiple-testing correction and give a reliable signal.
Such studies are particularly useful in finding genetic variations that contribute to common, complex diseases.
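
The multiple-testing burden in point (2) can be made concrete with a Bonferroni correction. This is only an illustrative sketch: the family-wise error rate of 0.05 and the choice of Bonferroni over other corrections are assumptions, not something the slides prescribe.

```r
# Bonferroni-corrected per-SNP significance threshold (illustrative sketch).
alpha   <- 0.05              # family-wise error rate (assumed value)
n_tests <- c(5e5, 1e6)       # marker counts quoted on the slide
threshold <- alpha / n_tests # per-test p-value cut-off
names(threshold) <- c("0.5M SNPs", "1M SNPs")
print(threshold)
```

With thresholds this small, modest odds ratios can only be detected with large samples, which is the motivation given above for recruiting thousands of subjects.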

Why are such studies possible now?
With the completion of the Human Genome Project in 2003 and the International HapMap Project in 2005, researchers now have a set of research tools that make it possible to find the genetic contributions to common diseases.

GWAS for complex diseases

Overview of the general design and workflow of a genome-wide association (GWA) study

What have GWAS found?
- In 2005, a GWAS showed that age-related macular degeneration is associated with variation in the gene for complement factor H, which produces a protein that regulates inflammation (Klein et al. (2005) Science, 308, 385-389).
- In 2007, the Wellcome Trust Case Control Consortium (WTCCC) carried out GWAS for seven diseases: coronary heart disease, type 1 diabetes, type 2 diabetes, rheumatoid arthritis, Crohn's disease, bipolar disorder and hypertension. This study uncovered many new genes underlying these diseases.
See the next slide for more GWAS publications.

Examples of GWAS
- Association scan of 14,500 nonsynonymous SNPs in four diseases identifies autoimmunity variants. Nat Genet. 2007
- Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Wellcome Trust Case Control Consortium. Nature. 2007;447;661-78
- Genomewide association analysis of coronary artery disease. Samani et al. N Engl J Med. 2007;357;443-53
- Sequence variants in the autophagy gene IRGM and multiple other replicating loci contribute to Crohn's disease susceptibility. Parkes et al. Nat Genet. 2007;39;830-2
- Robust associations of four new chromosome regions from genome-wide analyses of type 1 diabetes. Todd et al. Nat Genet. 2007;39;857-64
- A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Frayling et al. Science. 2007;316;889-94
- Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Zeggini et al. Science. 2007;316;1336-41
- A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Scott et al. Science. 2007;316;1341-5

Example: Data & Results

Problem(s)
- How to make inference about SNP-disease associations?
- Which computational tools to use?

Features of GenABEL
- Specifically designed for GWAS
- Provides specific facilities for storage and manipulation of large data sets
- Very fast tests for GWAS
- Specific functions to analyze and display the results
- More efficient than the genetics library

GenABEL: gwaa.data class

Exploring gwaa.data class objects
    library(GenABEL)   # package names are case-sensitive in R
    data(ge03d2ex)
    # phenotype data
    summary(ge03d2ex@phdata)
R output:
          id                 sex              age             dm2
     Length:136         Min.   :0.0000   Min.   :23.84   Min.   :0.0000
     Class :character   1st Qu.:0.0000   1st Qu.:38.33   1st Qu.:0.0000
     Mode  :character   Median :1.0000   Median :48.71   Median :1.0000
                        Mean   :0.5294   Mean   :49.07   Mean   :0.6324
                        3rd Qu.:1.0000   3rd Qu.:58.57   3rd Qu.:1.0000
                        Max.   :1.0000   Max.   :81.57   Max.   :1.0000
         height          weight            diet              bmi
     Min.   :150.2   Min.   : 46.63   Min.   :0.00000   Min.   :17.30
     1st Qu.:161.5   1st Qu.: 69.02   1st Qu.:0.00000   1st Qu.:24.56
     Median :169.4   Median : 81.15   Median :0.00000   Median :28.35
     Mean   :169.4   Mean   : 87.40   Mean   :0.05882   Mean   :30.30
     3rd Qu.:175.9   3rd Qu.:102.79   3rd Qu.:0.00000   3rd Qu.:35.69
     Max.   :191.8   Max.   :161.24   Max.   :1.00000   Max.   :59.83
     NA's   :  1.0   NA's   :  1.00                     NA's   : 1.00

Exploring gwaa.data class objects
    library(GenABEL)   # package names are case-sensitive in R
    data(ge03d2ex)
    # phenotype data
    summary(ge03d2ex@phdata)
    # number of people in study
    ge03d2ex@gtdata@nids
    # number of SNPs
    ge03d2ex@gtdata@nsnps
    # SNP names
    ge03d2ex@gtdata@snpnames[1:10]
    # chromosome labels
    ge03d2ex@gtdata@chromosome[1:10]
    # SNP map positions
    ge03d2ex@gtdata@map[1:10]

Descriptive statistics: phenotypes
    descriptives.trait(ge03d2ex)
R output:
            No    Mean     SD
    id     136      NA     NA
    sex    136   0.529  0.501
    age    136  49.069 12.926
    dm2    136   0.632  0.484
    height 135 169.440  9.814
    weight 135  87.397 25.510
    diet   136   0.059  0.236
    bmi    135  30.301  8.082
Here dm2 is the type 2 diabetes status. To obtain the descriptives by case-control status:
    descriptives.trait(ge03d2ex, by=ge03d2ex@phdata$dm2)

Descriptive statistics: markers
    descriptives.marker(ge03d2ex)
R output:
    $`Minor allele frequency distribution`
          X<=0.01 0.01<X<=0.05 0.05<X<=0.1 0.1<X<=0.2    X>0.2
    No    146.000      684.000     711.000    904.000 1555.000
    Prop    0.036        0.171       0.178      0.226    0.389

    $`Distribution of number of SNPs out of HWE, at different alpha`
         X<=1e-04 X<=0.001 X<=0.01 X<=0.05 X>0.05
    No     46.000   71.000 125.000 275.000   4000
    Prop    0.011    0.018   0.031   0.069      1

    $`Distribution of porportion of successful genotypes (per SNP)`
         X<=0.9 0.9<X<=0.95 0.95<X<=0.98 0.98<X<=0.99 X>0.99
    No    1.000           0            0      135.000      0
    Prop  0.007           0            0        0.993      0

    $`Distribution of porportion of successful genotypes (per person)`
         X<=0.9 0.9<X<=0.95 0.95<X<=0.98 0.98<X<=0.99   X>0.99
    No   37.000       6.000      996.000     1177.000 1784.000
    Prop  0.009       0.002        0.249        0.294    0.446

    $`Mean heterozygosity for a SNP`
    [1] 0.2582298

    $`Standard deviation of the mean heterozygosity for a SNP`
    [1] 0.1592255

    $`Mean heterozygosity for a person`
    [1] 0.2476507

    $`Standard deviation of mean heterozygosity for a person`
    [1] 0.04291038
Note: the counts labelled "(per SNP)" sum to 136, the number of people, and those labelled "(per person)" sum to 4000, the number of SNPs, so the two labels appear to be swapped in this output.

Test of Hardy-Weinberg equilibrium
    # Test of Hardy-Weinberg equilibrium in control group
    s <- summary(ge03d2ex@gtdata[(ge03d2ex@phdata$dm2 == 0), ])
    pexcon <- s[, "Pexact"]   # exact HWE p-values (column is named "Pexact")
    estlambda(pexcon)
    # Test of Hardy-Weinberg equilibrium in case group
    s <- summary(ge03d2ex@gtdata[(ge03d2ex@phdata$dm2 == 1), ])
    pexcas <- s[, "Pexact"]
    estlambda(pexcas)
R output: [Figure: estlambda() plots of the HWE test statistics, panels "Controls" and "Cases"]
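
To make explicit what a Hardy-Weinberg test checks for a single SNP, here is a minimal base-R sketch. It uses the asymptotic chi-square test on genotype counts, not the exact test that GenABEL reports in its summary, and `hwe_chisq` is a hypothetical helper written for this illustration.

```r
# Chi-square test of Hardy-Weinberg equilibrium for one biallelic SNP.
# hwe_chisq() is a hypothetical helper, not a GenABEL function.
hwe_chisq <- function(n_AA, n_AB, n_BB) {
  n <- n_AA + n_AB + n_BB
  p <- (2 * n_AA + n_AB) / (2 * n)        # frequency of allele A
  q <- 1 - p                              # frequency of allele B
  observed <- c(n_AA, n_AB, n_BB)
  expected <- n * c(p^2, 2 * p * q, q^2)  # counts expected under HWE
  stat <- sum((observed - expected)^2 / expected)
  list(statistic = stat,
       p.value   = pchisq(stat, df = 1, lower.tail = FALSE),
       expected  = expected)
}

in_hwe  <- hwe_chisq(25, 50, 25)   # counts match HWE expectations exactly
out_hwe <- hwe_chisq(40, 20, 40)   # strong deficit of heterozygotes
```

Markers that depart strongly from equilibrium in controls are usually treated as possible genotyping artefacts, which is why the check is run separately in controls and cases.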

Data checking: procedure
    qc1 <- check.marker(ge03d2ex, p.level=0)
R output:
    RUN 1
    3993 markers and 134 people in total
    304 (7.613323%) markers excluded as having low (<1.865672%) minor allele frequency
    36 (0.9015778%) markers excluded because of low (<95%) call rate
    0 (0%) markers excluded because they are out of HWE (P <0)
    1 (0.7462687%) people excluded because of low (<95%) call rate
    3 (2.238806%) people excluded because too high autosomal heterozygosity (FDR <1%)
    Mean autosomal HET was 0.2747262 (s.e. 0.03721277), people excluded had HET >= 0.5041617
    1 (0.7462687%) people excluded because of too high IBS (>=0.95)
    Mean IBS was 0.785972 (s.e. 0.02000698), as based on 2000 autosomal markers
    In total, 3653 (91.4851%) markers passed all criteria
    In total, 129 (96.26866%) people passed all criteria
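
The call-rate and minor-allele-frequency filters reported above can be illustrated on a toy genotype matrix. This is a base-R sketch rather than GenABEL's implementation; the matrix, the variable names and the two thresholds are assumptions chosen for the example.

```r
# Toy genotype matrix: rows = people, columns = SNPs; entries count the
# B allele (0, 1 or 2), NA = missing call. All values are made up.
geno <- matrix(c(0,  1, 2, NA,
                 0,  0, 1,  1,
                 0,  1, 2,  2,
                 0, NA, 2,  1),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("id", 1:4), paste0("snp", 1:4)))

snp_call_rate <- colMeans(!is.na(geno))       # proportion of people typed per SNP
id_call_rate  <- rowMeans(!is.na(geno))       # proportion of SNPs typed per person
freq_B <- colMeans(geno, na.rm = TRUE) / 2    # B-allele frequency per SNP
maf    <- pmin(freq_B, 1 - freq_B)            # minor allele frequency

# Keep SNPs passing both filters (thresholds are assumed, for illustration)
keep_snps <- snp_call_rate >= 0.75 & maf >= 0.05
```

Here `snp1` is monomorphic and is dropped by the frequency filter, while the other three SNPs pass both criteria.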

Data checking: summary table
    summary(qc1)
R output:
    $`Per-SNP fails statistics`
              NoCall NoMAF NoHWE Redundant Xsnpfail
    NoCall        42     0     0         0        0
    NoMAF         NA   376     0         0        0
    NoHWE         NA    NA     0         0        0
    Redundant     NA    NA    NA         0        0
    Xsnpfail      NA    NA    NA        NA        1

    $`Per-person fails statistics`
             IDnoCall HetFail IBSFail isfemale ismale
    IDnoCall        1       0       0        0      0
    HetFail        NA       3       0        0      0
    IBSFail        NA      NA       1        0      0
    isfemale       NA      NA      NA        2      0
    ismale         NA      NA      NA       NA      0

Data checking: output
The procedure returns the list of individuals (idok) and the list of SNPs (snpok) that passed all QC criteria. A clean dataset can then be obtained with:
    data1 <- ge03d2ex[qc1$idok, qc1$snpok]

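Data checking: HW plots after cleaning
    s1 <- summary(data1@gtdata[(data1@phdata$dm2 == 1), ])
    pexcas1 <- s1[, "Pexact"]   # exact HWE p-values (column is named "Pexact")
    estlambda(pexcas1)
R output: [Figure: estlambda() plots for cases, panels "After" and "Before" cleaning]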

Finding genetic sub-structure
    # matrix of genomic kinship between all pairs of individuals
    data1.gkin <- ibs(data1[, data1@gtdata@chromosome != "X"], weight="freq")
    # distance matrix
    data1.dist <- as.dist(0.5 - data1.gkin)
    # use classical multidimensional scaling
    data1.mds <- cmdscale(data1.dist)
    # plot the first two components
    plot(data1.mds)
R output: [Figure: plot of the first two MDS components, annotated "Exclude these individuals"]
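
The idea behind this slide (pairwise genetic similarity, converted to a distance, then projected with classical MDS) can be tried without GenABEL on simulated data. The sketch below is a simplification under stated assumptions: it uses unweighted identity-by-state sharing instead of the frequency-weighted kinship computed by `ibs(..., weight="freq")`, and the two simulated groups and all variable names are invented for the example.

```r
set.seed(1)
n_snp <- 500

# Simulate genotypes (0/1/2) for n_id people given per-SNP allele frequencies
sim_group <- function(n_id, freq) {
  matrix(rbinom(n_id * length(freq), size = 2, prob = freq),
         nrow = n_id, byrow = TRUE)
}

freq_a <- runif(n_snp, 0.1, 0.5)   # main group
freq_b <- runif(n_snp, 0.5, 0.9)   # small, genetically distinct group
geno <- rbind(sim_group(20, freq_a), sim_group(5, freq_b))
rownames(geno) <- paste0("id", seq_len(nrow(geno)))

# Manhattan distance sums |g_i - g_j| over SNPs; dividing by 2 * n_snp gives
# 1 - (average proportion of alleles shared identical by state)
ibs_dist <- dist(geno, method = "manhattan") / (2 * n_snp)

# Classical multidimensional scaling, first two components
mds <- cmdscale(ibs_dist, k = 2)
```

Calling `plot(mds)` on this toy result should show the five individuals from the second group lying apart from the other twenty along the first component, which is the kind of pattern used on the slide to flag outliers.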

Remove outliers
    km <- kmeans(data1.mds, centers=2, nstart=1000)
    cl1 <- names(which(km$cluster == 1))
    cl2 <- names(which(km$cluster == 2))
    # k-means cluster labels are arbitrary: check that cl1 is the main cluster
    data2 <- data1[cl1, ]
Then repeat the QC analysis, this time allowing for HWE checks (computed in controls, excluding markers at FDR 0.2):
    qc2 <- check.marker(data2, hweids=(data2@phdata$dm2 == 0), fdr=0.2)
    summary(qc2)
R output:
              NoCall NoMAF NoHWE Redundant Xsnpfail
    NoCall         0     0     0         0        0
    NoMAF         NA    40     0         0        0
    NoHWE         NA    NA     0         0        0
    Redundant     NA    NA    NA         0        0
    Xsnpfail      NA    NA    NA        NA        0

             IDnoCall HetFail IBSFail isfemale ismale
    IDnoCall        0       0       0        0      0
    HetFail        NA       0       0        0      0
    IBSFail        NA      NA       0        0      0
    isfemale       NA      NA      NA        0      0
    ismale         NA      NA      NA       NA      0
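
Because k-means numbers its clusters arbitrarily, the main group is not guaranteed to be cluster 1 on every run. One possible way to make the selection robust is to keep whichever cluster is larger. The sketch below uses simulated coordinates standing in for `data1.mds`; the data and the variable names `main_cluster` and `keep_ids` are assumptions for illustration.

```r
set.seed(42)
# Toy 2-D coordinates: 30 points in a main group, 4 clear outliers
coords <- rbind(matrix(rnorm(60, mean = 0, sd = 0.1), ncol = 2),
                matrix(rnorm(8,  mean = 3, sd = 0.1), ncol = 2))
rownames(coords) <- paste0("id", seq_len(nrow(coords)))

km <- kmeans(coords, centers = 2, nstart = 25)

# Pick the larger cluster, whatever label k-means assigned to it
main_cluster <- which.max(tabulate(km$cluster, nbins = 2))
keep_ids <- names(km$cluster)[km$cluster == main_cluster]
```

With the slide's objects, the same two lines applied to `km` would give the identifiers to pass to `data1[keep_ids, ]`.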

GWA scan: raw data
Scan of the raw data (before quality control) using a score test, as implemented in the qtscore() function.
    an0 <- qtscore(dm2, ge03d2ex, trait="binomial")
    plot(an0)
    # add corrected p-values in green
    add.plot(an0, df="pc1df", col="green")
R output: [Figure: plot of the scan results with corrected p-values in green, annotated "interesting results?"]
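
The corrected p-values (`pc1df`) plotted above are adjusted for inflation of the test statistics. A common way to do this is genomic control, sketched below in base R under the assumption that this is the kind of correction involved; `gc_correct` is a hypothetical helper and the inflated p-values are simulated.

```r
# Genomic control sketch: estimate the inflation factor lambda from the
# median 1-df chi-square statistic, then rescale every statistic by it.
# gc_correct() is a hypothetical helper, not a GenABEL function.
gc_correct <- function(p_values) {
  chi2   <- qchisq(p_values, df = 1, lower.tail = FALSE)
  lambda <- median(chi2, na.rm = TRUE) / qchisq(0.5, df = 1)
  list(lambda      = lambda,
       p_corrected = pchisq(chi2 / lambda, df = 1, lower.tail = FALSE))
}

# Simulated scan whose statistics are inflated by a factor of 1.5
chi2_inflated <- 1.5 * qchisq(ppoints(1000), df = 1)
p_raw <- pchisq(chi2_inflated, df = 1, lower.tail = FALSE)
gc <- gc_correct(p_raw)
```

In this toy example `gc$lambda` is close to 1.5 and every corrected p-value is at least as large as its raw counterpart, which mirrors the relationship between the P1df and Pc1df columns in the scan tables.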

GWA scan: raw data (top 10 results)
    # descriptive table
    descriptives.scan(an0)
R output: Top 10 results
              Chromosome Position      effb     P1df    Pc1df     effab     effbb     P2df
    rs1719133          1  4495479 -0.189730 0.000280 0.000386 -0.102941 -0.632353 0.000633
    rs2975760          3 10518480  0.182573 0.000298 0.000411  0.141182  0.274763 0.001143
    rs7418878          1  2808520  0.170464 0.000974 0.001274  0.154881  0.200980 0.002264
    rs5308595          3 10543128  0.223766 0.001054 0.001375  0.170057  0.375940 0.004593
    rs4804634          1  2807417 -0.079119 0.001197 0.001552  0.061353 -0.203788 0.003696
    rs3224311          2  6009769  0.142522 0.001329 0.001716  0.133082  0.170370 0.002941
    rs26325            3 10617781 -0.447811 0.001331 0.001719 -0.447811 -0.895623 0.001331
    rs8835506          2  6010852  0.142857 0.001532 0.001966  0.135566  0.163636 0.003162
    rs3925525          2  6008501  0.139601 0.001940 0.002464  0.128991  0.170370 0.004555
    rs2521089          3 10487652  0.108577 0.002052 0.002601  0.056511  0.170655 0.006966

GWA scan: cleaned data
    data2 <- data2[qc2$idok, qc2$snpok]
    # plot
    an1 <- qtscore(dm2, data2, trait="binomial")
    plot(an1)
    # add corrected p-values
    add.plot(an1, df="pc1df", col="green")
R output: [Figure: plot of the scan results on cleaned data with corrected p-values in green, annotated "interesting results"]

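Comparison of the two scans
    # scan of the cleaned data in green
    plot(an1, col="green")
    # overlay the scan of the raw data in red
    add.plot(an0, col="red")
R output: [Figure: overlaid scan results, legend "Clean data" and "Raw data", one raw-data peak annotated "false signal?"]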

GWA scan: cleaned data
    # descriptive table
    descriptives.scan(an1)
Clean data
              Chromosome Position      effb     P1df    Pc1df     effab     effbb     P2df
    rs1719133          1  4495479 -0.194947 0.000360 0.000505 -0.105362 -0.616000 0.000929
    rs8835506          2  6010852  0.154827 0.000847 0.001142  0.154827  0.154827 0.001297
    rs4804634          1  2807417 -0.082839 0.001095 0.001459  0.077554 -0.220017 0.002649
    rs3925525          2  6008501  0.151123 0.001108 0.001476  0.147636  0.161778 0.002010
    rs3224311          2  6009769  0.151123 0.001108 0.001476  0.147636  0.161778 0.002010
    rs2975760          3 10518480  0.177419 0.001255 0.001661  0.137097  0.275986 0.004795
    rs4534929          1  4474374 -0.152613 0.002000 0.002591 -0.039572 -0.287634 0.007430
    rs6079246          2  7048058 -0.431085 0.002106 0.002723 -0.431085 -0.862170 0.002106
    rs5308595          3 10543128  0.223577 0.002367 0.003044  0.202744  0.390244 0.009551
    rs1013473          1  4487262  0.089426 0.002566 0.003287  0.034794  0.141576 0.006782
Raw data
              Chromosome Position      effb     P1df    Pc1df     effab     effbb     P2df
    rs1719133          1  4495479 -0.189730 0.000280 0.000386 -0.102941 -0.632353 0.000633
    rs2975760          3 10518480  0.182573 0.000298 0.000411  0.141182  0.274763 0.001143
    rs7418878          1  2808520  0.170464 0.000974 0.001274  0.154881  0.200980 0.002264
    rs5308595          3 10543128  0.223766 0.001054 0.001375  0.170057  0.375940 0.004593
    rs4804634          1  2807417 -0.079119 0.001197 0.001552  0.061353 -0.203788 0.003696
    rs3224311          2  6009769  0.142522 0.001329 0.001716  0.133082  0.170370 0.002941
    rs26325            3 10617781 -0.447811 0.001331 0.001719 -0.447811 -0.895623 0.001331
    rs8835506          2  6010852  0.142857 0.001532 0.001966  0.135566  0.163636 0.003162
    rs3925525          2  6008501  0.139601 0.001940 0.002464  0.128991  0.170370 0.004555
    rs2521089          3 10487652  0.108577 0.002052 0.002601  0.056511  0.170655 0.006966

GWA in presence of genetic stratification
- Assess population structure
- Account for population structure in the analysis
    # assess population structure: 1 = member of cluster cl1, 0 = otherwise
    pop <- as.numeric(data1@phdata$id %in% cl1)
    pop
    # stratified association
    data1.sa <- qtscore(dm2, data=data1, strata=pop)
    # plot results and compare with the analysis that removed the outliers
    plot(an1, cex=0.5, pch=19, ylim=c(1, 5))
    add.plot(data1.sa, col="green", cex=1.2)

GWA in presence of genetic stratification
Adjust both phenotypes and genotypes for possible stratification using principal component analysis (Price's method):
    data1.eg <- egscore(dm2, data=data1, kin=data1.gkin)
    plot(an1, cex=0.5, pch=19, ylim=c(1, 5))
    add.plot(data1.sa, col="green", cex=1.2)
    add.plot(data1.eg, col="red", cex=1.3)
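
The principle of adjusting both phenotype and genotype for ancestry before testing can be shown on simulated data. This is a simplified base-R sketch, not the egscore() implementation: the single column `pcs` stands in for the leading principal components of the kinship matrix, the statistic (N - K - 1) times the squared residual correlation follows the description in Price et al., and `assoc_test` and the simulated data are assumptions for illustration.

```r
set.seed(7)
n   <- 200
pop <- rep(c(0, 1), each = n / 2)   # two subpopulations
# Allele frequency and disease risk both differ by subpopulation, but the
# SNP has no effect on disease within either one (pure confounding)
g <- rbinom(n, size = 2, prob = ifelse(pop == 1, 0.8, 0.2))
y <- rbinom(n, size = 1, prob = ifelse(pop == 1, 0.8, 0.2))
pcs <- cbind(pc1 = pop - mean(pop)) # stand-in for the top principal component

# Correlation-based 1-df test, optionally after regressing out the PCs
assoc_test <- function(y, g, pcs = NULL) {
  k <- 0
  if (!is.null(pcs)) {
    k <- ncol(pcs)
    y <- resid(lm(y ~ pcs))         # phenotype adjusted for ancestry
    g <- resid(lm(g ~ pcs))         # genotype adjusted for ancestry
  }
  stat <- (length(y) - k - 1) * cor(y, g)^2
  list(statistic = stat,
       p.value   = pchisq(stat, df = 1, lower.tail = FALSE))
}

raw <- assoc_test(y, g)        # ignores stratification
adj <- assoc_test(y, g, pcs)   # adjusts for it
```

Comparing `raw$p.value` with `adj$p.value` shows how a signal driven purely by population structure shrinks once both variables are adjusted, which is the behaviour the red egscore() points are meant to display on the slide's plot.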

Other interesting features
- Genetic data imputation
- Meta-analysis of GWA scans
- Analysis of selected regions
- Conversion of PLINK files

Conclusion
- GWAS is becoming a major area of research
- New computational tools and statistical methods are needed
- GenABEL is an interesting program, especially for easy data cleaning and display of results
- PLINK has more features for statistical analysis but is not yet available in R for Windows!

Thank you!