Genetic Association Analysis with R Dr. Jing Hua Zhao

Size: px
Start display at page:

Download "Genetic Association Analysis with R Dr. Jing Hua Zhao"

Transcription

1 Genetic Association Analysis with R MRC Epidemiology Unit & Institute of Metabolic Science, Addenbrooke s Hospital, Cambridge CB2 0QQ, UK jinghua.zhao@mrc-epid.cam.ac.uk 1 Outline of the talk Aim & scope R for statistical analysis and genetic data Some examples of elementary analyses Genome-wide association study (GWAS) Issues and challenges Summary Further readings 2 Aim & scope We aim To provide a broad picture of R relative to other software environments To show how genetic data analysis can be done easily with R and To provide further information which will facilitate development with R R can be used for a broad range of problems; however, we will limit our attention to analysis of complex traits which is primarily based on linkage and association analyses, and GWAS in particular 3 1

2 R environment for statistical analysis R is a free software environment for statistical computing and graphics ( A summary of its advantages is as follows, R offers more and up-to-date analytical methods R is flexible with data (without the apparent need to merge them before use as in many other software systems) R has its own and more powerful language, its procedures are open to modify, and it also has full matrix capabilities; it has provided many sample datasets which help with learning to use R has a flexible and high-quality graphical facility R runs on almost any computer R is free and evolving that can be relied on for years to come Muenchen RA. R for SAS and SPSS Users, Springer Rationale of R for genetic data The reliance and complacency among geneticists on standalone applications, e.g., a survey of Salem et al., Hum Genomics 2005; 2:39-66 revealing several dozens of haplotype analysis programs, Excoffier & Heckel Nat Rev Genet 2006; 7: providing a lengthy survival guide for population genetics data analysis Issues associated with their development and use include incompatible data format, inappropriate documentation, lack of validation and poor maintenance, see Zhao & Tan Hum Genomics 2006, 2: R is a viable as it mirrors the development and application of the Linux system. It offers comprehensive facilities and statistical methods and enables easy use of codes for genetic data in C/Fortran/Pascal, etc., and facilitates collaborations 5 R packages for genetic data An R package is a collection of functions, documentations and data to be accessible under R A summary of implementation for genetic data is available from the CRAN task view Some elementary packages of linkage and association at CRAN are genetics, haplo.stats (haplo.ccs, hapassoc), gap, identity, pedigree, hwde, multic, LDheatmap, fbati, multtest, qvalue, meta, rmeta, ROCR Packages for GWAS: SNPassoc, snpmatrix, GenABEL, pbatr There are many non-cran packages such as those at BioConductor and individually distributed packages Many general packages can be used for genetic data 6 2

3 Installation and use of package(s) # CRAN > install.packages(c( genetics, gap, haplo.stats, kinship ), repos= ) # BioConductor > source(" > bioclite("snpmatrix") # or > setrepositories() > install.packages( snpmatrix ) > library(genetics) > library(snpmatrix) 7 Summary of an imaginary data > library(genetics) > d <- c("d/d","d/i","d/d","i/i", "D/D","D/D","D/D","D/D","I/I") > g <- genotype(d) > summary(g) Allele Frequency: (2 alleles) Count Proportion D I Genotype Frequency: Count Proportion D/D D/I I/I Heterozygosity (Hu) = Poly. Inf. Content = Test of Hardy-Weinberg equilibrium > HWE.test (g) Observed vs. Expected genotype Frequencies Obs Exp Obs-Exp D/D I/D D/I I/I Exact Test for Hardy-Weinberg Equilibrium data: g N11 = 6, N12 = 1, N22 = 2, N1 = 13, N2 = 5, p-value =

4 Haplotype analysis > library(haplo.stats) > geno <- as.matrix(hla.demo[,c(17,18,21:24)]) > keep <-!apply(is.na(geno) geno==0, 1, any) > hla.demo <- hla.demo[keep,] > geno <- geno[keep,] > attach(hla.demo) > label <- c("dqb","drb","b") > haplo.score(resp, geno, locus.label=label, label label trait.type = "gaussian") global-stat = , df = 17, p-val = Haplotype-specific Scores: DQB DRB B Hap-Freq Hap-Score p-val [1,] [2,] Pedigree-drawing > library(kinship) >?plot.pedigree > attach(p2) > p <- pedigree (id,fid,mid,sex,aff) > par(xpd=true) > plot(p) Regional association > library(gap) > asplot("rs ", CDKN2A/CDKN2B region "CDKN2A/CDKN2B region", "9", CDKNlocus, CDKNmap, CDKNgenes, 5.4e-8, c(3,6)) This diagram shows the recombination, top hits and gene annotation rved p) -log10(obser LD (r^2) P=5.4e imp. rs CDKN2B CDKN2A Chromosome 9 position (kb) Recombination ra ate (cm/mb) 12 4

5 Example: the Framingham data The data was distributed through genetic analysis workshop 16, containing the original, the first, and the third generation sample totaling ~7,000 individuals with Affymetrix 500K GeneChip single nucleotide polymorphisms (SNPs) The focus was on body mass index as a continuous trait We obtained quality control and basic association statistics via genomewide rapid association using mixed model and regression (GRAMMAR) procedure from GenABEL which accounts for familial relationship (Aulchenko et al., Genetics 2007,177:677-85) We obtained graphical presentation such as Manhattan and Q-Q plots; not shown here was that the distribution of the statistics deviates from the null, we refined our significance level by genomic control according to estimate from snpmatrix 13 GenABEL: identity-by-state (IBS) statistics We had our data in files containing genotypes, pedigree structures and phenotypes We loaded these data and calculated IBS between relatives > library(genabel) # GAW16 Framingham data in three files > convert.snp.tped(tped = "chrall.tped", tfam = "pheno.tfam", out = "chrall.raw", strand = "+") > df <- load.gwaa.data(phe = "pheno.dat", gen = "chrall.raw", force = TRUE) > gkin <- ibs(df@gtdata,w="freq") > save(gkin, file="kin.rdata") 14 GenABEL: polygenic model We performed polygenic modelling whose residual can be used by other software such as SAS We also performed association testing using the aforementioned GRAMMAR procedure > pg <- polygenic(bmi1 ~ sex + age, kin=gkin, df, quiet=true) # To use function qtscore() instead of grammar() > qts <- qtscore(pg$res,data=df) # for other software > write(pg$res, file="genabel.dat", 1) > save(pg, file="bmi.rdata") 15 5

6 Q-Q and Manhattan plots > library(gap) > par(mfrow=c(2,1), mai=c(1,1,0.2,0.8)) # adjusted p values > qqunif(p, bg= blue, bty= n, xlim=c(0,6), cex=0.02) > par(las=2) > mhtplot(test, usepos=t, pch=21, colors=rep(c( blue, green ),11), cutoffs=4:6, cex=0.02) -log10 (observed value) -log10 (observed p) -log10 (expected value) Chromosome 16 To recap: GWAS Input: genotype, phenotype, map For available SNP(s) do Extract appropriate per-snp information (position, labels, etc.) and per-individual genotypic and phenotypic information Code the genotype according to specific disease model (e.g., g additive, dominant, recessive) e) at run-time Obtain quality control statistics Perform association testing under framework such as generalised linear model (GLM) End do Typically, we can define the second allele (in alphabetical order) at a SNP to be effective 17 Issues and challenges While candidate gene analysis merits further research, GWAS poses particular problems: Large data including external data is ubiquitous; a GWAS with ~0.5 million SNPs can be extended to include ~2.5 million SNPs, most of which are imputed based on publicly available data such as HapMap Multiple testing adjustment is difficult since the number of true signals is expected to be low which is different from the kind of situations gene expression studies; a major remedy has been meta-analysis and replication in large consortia A need of deeper understanding beyond the basic association testing, e.g., biology and the nature-nurture interplay 18 6

7 Meta-analysis Consider two studies with sample sizes and 8000 both with p values 1e-8, we have a combined two-sided p value of 1.49e-14 but genome-wide association (p=4.89e-8) can also be achieved from two moderate p values of p 1 =1e-4 and p 2 =1e-5 This is according to weighted z-score method and more results are available from metap in package gap 19 Use of publicly available data Use of HapMap is now routine and an ongoing effort is the 1000 genomes project The promotion of wider collaborations; one example is the database of Genotypes and Phenotypes (dbgap, sites/entrez?db=gap) to archive and distribute the results of studies that have investigated the interaction of genotype and phenotype. Such studies include genome-wide association studies, medical sequencing, molecular diagnostic assays, as well as association between genotype and non-clinical traits. The advent of highthroughput, cost-effective methods for genotyping and sequencing has provided powerful tools that allow for the generation of the massive amount of genotypic data required to make these analyses possible Genetic Analysis Workshops (GAWs) are organised every two years, see 20 A technical summary R implements a lot of elementary procedures for genetic data R is also helpful for all aspects of GWAS, e.g., Gene discovery Gene-gene, gene-environment interaction Risk prediction Elucidation of biological pathways It goes beyond GWAS in areas such as gene expression (BioConductor), phylogenetics and evolution 21 7

8 A general perspective The general appeals in use of R for genetic data analysis lie in it s a variety of aspects: easy implementation, graphical representation, validated and improved algorithm and availability of advanced statistical techniques and other facilities such as use of Internet and database The R community expands steadily with its development becoming more intensive and rigorous; it is exciting to get on board and be involved with its use and the development 22 Further readings The R project for statistical computing: Chambers JM. Software for Data Analysis-Programming with R. Springer 2008 Spector P. Data Manipulation with R. Springer 2008 Jones O, Maillardet R. Introduction to Scientific Programming and Simulation Using R. Taylor & Francis Group 2009 Foulkes AS. Applied Statistical Genetics with R for Population-based Association Studies. Springer 2009 Gentleman R, V Carey, W Huber, R Irizarry, S Dudoit. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer 2005 Paradis E. Analysis of Phylogenetics and Evolution with R. Springer 2006 Siegmund D, B Yakir. The Statistics of Gene Mapping. Springer 2007 Wu R, C-X Ma, G Casella. Statistical Genetics of Quantitative Traits-Linkage, Maps and QTL. Springer 2007 Zhao JH. Genetic analysis of complex traits & genomewide association studies, tutorials given at user!2008 and user!