Introduction to statistics for Genome-Wide Association Studies (GWAS) Day 2 Section 8

Outline
- Background on GWAS
- Presentation of GenABEL
- Data checking with GenABEL
- Data analysis with GenABEL
- Display of results

R Packages for GWAS
- PLINK: developed at Massachusetts General Hospital and the Broad Institute of Harvard and MIT, but usable from R on Linux only (http://pngu.mgh.harvard.edu/~purcell/plink/)
- SNPassoc (González J.R., et al. Bioinformatics. 2007, 23(5):654-655)
- GenABEL (Aulchenko Y.S., Ripke S., Isaacs A., van Duijn C.M. Bioinformatics. 2007, 23(10):1294-6)

What is a GWAS?
A genome-wide association study is an approach that involves rapidly scanning markers across the genome (0.5M or 1M) of many people (2K) to find genetic variations associated with a particular disease.
A large number of subjects is needed because:
(1) associations between SNPs and causal variants are expected to show low odds ratios, typically below 1.5;
(2) given the very large number of tests required, an association must reach a high level of significance to survive the multiple-testing correction and give a reliable signal.
Such studies are particularly useful for finding genetic variations that contribute to common, complex diseases.
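The multiple-testing point can be made concrete with a Bonferroni correction. The following is a minimal sketch in base R, assuming a family-wise error rate of 0.05 (an assumption, not stated on the slide) and the two marker counts mentioned above; the variable names are illustrative.

```r
# Bonferroni-corrected per-SNP significance threshold.
# Assumption: family-wise error rate of 0.05; marker counts taken from the slide.
alpha <- 0.05
n_markers <- c(half_million = 5e5, one_million = 1e6)

# Each single test must reach alpha divided by the number of tests.
threshold <- alpha / n_markers
print(threshold)
```

With this many tests, a single SNP needs a p-value several orders of magnitude below 0.05, which is why small effects require large samples.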

What is a GWAS?

Why are such studies possible now?
With the completion of the Human Genome Project in 2003 and the International HapMap Project in 2005, researchers now have a set of research tools that make it possible to find the genetic contributions to common diseases.

GWAS for complex diseases

[Figure: Overview of the general design and workflow of a genome-wide association (GWA) study]

What have GWAS found?
- In 2005, a GWAS showed that age-related macular degeneration is associated with variation in the gene for complement factor H, which produces a protein that regulates inflammation (Klein et al. (2005) Science, 308, 385-389).
- In 2007, the Wellcome Trust Case-Control Consortium (WTCCC) carried out GWAS for coronary heart disease, type 1 diabetes, type 2 diabetes, rheumatoid arthritis, Crohn's disease, bipolar disorder and hypertension. This study was successful in uncovering many new genes underlying these diseases.
See the next slide for more GWAS publications.

Examples of GWAS
- Association scan of 14,500 nonsynonymous SNPs in four diseases identifies autoimmunity variants. Nat Genet. 2007
- Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Wellcome Trust Case Control Consortium. Nature. 2007;447:661-78
- Genomewide association analysis of coronary artery disease. Samani et al. N Engl J Med. 2007;357:443-53
- Sequence variants in the autophagy gene IRGM and multiple other replicating loci contribute to Crohn's disease susceptibility. Parkes et al. Nat Genet. 2007;39:830-2
- Robust associations of four new chromosome regions from genome-wide analyses of type 1 diabetes. Todd et al. Nat Genet. 2007;39:857-64
- A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Frayling et al. Science. 2007;316:889-94
- Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Zeggini et al. Science. 2007;316:1336-41
- A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Scott et al. Science. 2007;316:1341-45

Example: Data & Results

Problem(s)
- How to make inference about SNP-disease associations?
- Which computational tools to use?

Features of GenABEL
- Specifically designed for GWAS
- Provides specific facilities for storage and manipulation of large data
- Very fast tests for GWAS
- Specific functions to analyze and display the results
- More efficient than the R library genetics

GenABEL: gwaa.data class

Exploring gwaa.data class objects
    library(GenABEL)   # package names are case-sensitive
    data(ge03d2ex)
    # phenotype data
    summary(ge03d2ex@phdata)
R output
          id                 sex               age             dm2
     Length:136         Min.   :0.0000   Min.   :23.84   Min.   :0.0000
     Class :character   1st Qu.:0.0000   1st Qu.:38.33   1st Qu.:0.0000
     Mode  :character   Median :1.0000   Median :48.71   Median :1.0000
                        Mean   :0.5294   Mean   :49.07   Mean   :0.6324
                        3rd Qu.:1.0000   3rd Qu.:58.57   3rd Qu.:1.0000
                        Max.   :1.0000   Max.   :81.57   Max.   :1.0000
         height          weight            diet              bmi
     Min.   :150.2   Min.   : 46.63   Min.   :0.00000   Min.   :17.30
     1st Qu.:161.5   1st Qu.: 69.02   1st Qu.:0.00000   1st Qu.:24.56
     Median :169.4   Median : 81.15   Median :0.00000   Median :28.35
     Mean   :169.4   Mean   : 87.40   Mean   :0.05882   Mean   :30.30
     3rd Qu.:175.9   3rd Qu.:102.79   3rd Qu.:0.00000   3rd Qu.:35.69
     Max.   :191.8   Max.   :161.24   Max.   :1.00000   Max.   :59.83
     NA's   :  1.0   NA's   :  1.00                     NA's   : 1.00

Exploring gwaa.data class objects (continued)
    # number of people in study
    ge03d2ex@gtdata@nids
    # number of SNPs
    ge03d2ex@gtdata@nsnps
    # SNP names
    ge03d2ex@gtdata@snpnames[1:10]
    # chromosome labels
    ge03d2ex@gtdata@chromosome[1:10]
    # SNP map positions
    ge03d2ex@gtdata@map[1:10]

Descriptive statistics: phenotypes
    descriptives.trait(ge03d2ex)
R output
            No     Mean      SD
    id     136       NA      NA
    sex    136    0.529   0.501
    age    136   49.069  12.926
    dm2    136    0.632   0.484
    height 135  169.440   9.814
    weight 135   87.397  25.510
    diet   136    0.059   0.236
    bmi    135   30.301   8.082
The same table can be split by case-control status (dm2 is the type 2 diabetes status):
    descriptives.trait(ge03d2ex, by=ge03d2ex@phdata$dm2)

Descriptive statistics: markers
    descriptives.marker(ge03d2ex)
R output
    $`Minor allele frequency distribution`
          X<=0.01 0.01<X<=0.05 0.05<X<=0.1 0.1<X<=0.2    X>0.2
    No    146.000      684.000     711.000    904.000 1555.000
    Prop    0.036        0.171       0.178      0.226    0.389

    $`Distribution of number of SNPs out of HWE, at different alpha`
         X<=1e-04 X<=0.001 X<=0.01 X<=0.05 X>0.05
    No     46.000   71.000 125.000 275.000   4000
    Prop    0.011    0.018   0.031   0.069      1

    $`Distribution of porportion of successful genotypes (per SNP)`
         X<=0.9 0.9<X<=0.95 0.95<X<=0.98 0.98<X<=0.99 X>0.99
    No    1.000           0            0      135.000      0
    Prop  0.007           0            0        0.993      0

    $`Distribution of porportion of successful genotypes (per person)`
         X<=0.9 0.9<X<=0.95 0.95<X<=0.98 0.98<X<=0.99   X>0.99
    No   37.000       6.000      996.000     1177.000 1784.000
    Prop  0.009       0.002        0.249        0.294    0.446

    $`Mean heterozygosity for a SNP`
    [1] 0.2582298

    $`Standard deviation of the mean heterozygosity for a SNP`
    [1] 0.1592255

    $`Mean heterozygosity for a person`
    [1] 0.2476507

    $`Standard deviation of mean heterozygosity for a person`
    [1] 0.04291038
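The marker summaries above rest on two simple quantities: minor allele frequency and call rate. The following is a minimal base R sketch on a made-up genotype matrix (not the ge03d2ex data and not GenABEL's internal code); the 0/1/2 coding and all object names are assumptions for illustration.

```r
# Toy genotype matrix: rows = people, columns = SNPs.
# Assumed coding: number of copies of allele B (0, 1, 2); NA = failed call.
geno <- matrix(c(0, 1, 2, 1,     # snp1
                 0, 0, 1, NA,    # snp2
                 2, 2, 2, 1),    # snp3
               nrow = 4,
               dimnames = list(paste0("id", 1:4), c("snp1", "snp2", "snp3")))

# Call rate: proportion of non-missing genotypes, per SNP and per person.
call_rate_snp <- colMeans(!is.na(geno))
call_rate_id  <- rowMeans(!is.na(geno))

# Allele B frequency uses only called genotypes; the minor allele
# frequency (MAF) is the smaller of the two allele frequencies.
freq_b <- colMeans(geno, na.rm = TRUE) / 2
maf <- pmin(freq_b, 1 - freq_b)

print(round(rbind(call_rate_snp, maf), 3))
```

Thresholds on these quantities (for example a minimum call rate or MAF) are what a marker QC step then applies.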

Test of Hardy-Weinberg equilibrium
    # Test of Hardy-Weinberg equilibrium in the control group
    s <- summary(ge03d2ex@gtdata[(ge03d2ex@phdata$dm2 == 0), ])
    pexcon <- s[, "pexact"]
    estlambda(pexcon)
    # Test of Hardy-Weinberg equilibrium in the case group
    s <- summary(ge03d2ex@gtdata[(ge03d2ex@phdata$dm2 == 1), ])
    pexcas <- s[, "pexact"]
    estlambda(pexcas)
[Figure: R output, two plots labelled "Controls" and "Cases"]
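To show what a Hardy-Weinberg test measures for a single SNP, here is a minimal base R sketch. It uses the one-degree-of-freedom chi-square approximation, whereas the slide uses GenABEL's exact test ("pexact"), so the two will not agree exactly; the function name hwe_chisq and the genotype counts are hypothetical.

```r
# Chi-square test of Hardy-Weinberg equilibrium for one biallelic SNP.
# Assumption: chi-square approximation with 1 df (not the exact test).
hwe_chisq <- function(n_aa, n_ab, n_bb) {
  n <- n_aa + n_ab + n_bb
  p <- (2 * n_aa + n_ab) / (2 * n)               # frequency of allele A
  expected <- n * c(p^2, 2 * p * (1 - p), (1 - p)^2)
  observed <- c(n_aa, n_ab, n_bb)
  stat <- sum((observed - expected)^2 / expected)
  c(chisq = stat, p.value = pchisq(stat, df = 1, lower.tail = FALSE))
}

in_hwe  <- hwe_chisq(25, 50, 25)   # counts match p^2, 2pq, q^2 exactly
out_hwe <- hwe_chisq(50, 0, 50)    # no heterozygotes at all
print(in_hwe)
print(out_hwe)
```

A complete lack of heterozygotes, as in the second call, is the kind of pattern that often points to a genotyping problem rather than biology, which is why the test is usually run in controls as a QC step.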

Data checking: procedure
    qc1 <- check.marker(ge03d2ex, p.level=0)
R output
    RUN 1
    3993 markers and 134 people in total
    304 (7.613323%) markers excluded as having low (<1.865672%) minor allele frequency
    36 (0.9015778%) markers excluded because of low (<95%) call rate
    0 (0%) markers excluded because they are out of HWE (P <0)
    1 (0.7462687%) people excluded because of low (<95%) call rate
    3 (2.238806%) people excluded because too high autosomal heterozygosity (FDR <1%)
    Mean autosomal HET was 0.2747262 (s.e. 0.03721277), people excluded had HET >= 0.5041617
    1 (0.7462687%) people excluded because of too high IBS (>=0.95)
    Mean IBS was 0.785972 (s.e. 0.02000698), as based on 2000 autosomal markers
    In total, 3653 (91.4851%) markers passed all criteria
    In total, 129 (96.26866%) people passed all criteria

Data checking: summary table
    summary(qc1)
R output
    $`Per-SNP fails statistics`
              NoCall NoMAF NoHWE Redundant Xsnpfail
    NoCall        42     0     0         0        0
    NoMAF         NA   376     0         0        0
    NoHWE         NA    NA     0         0        0
    Redundant     NA    NA    NA         0        0
    Xsnpfail      NA    NA    NA        NA        1

    $`Per-person fails statistics`
             IDnoCall HetFail IBSFail isfemale ismale
    IDnoCall        1       0       0        0      0
    HetFail        NA       3       0        0      0
    IBSFail        NA      NA       1        0      0
    isfemale       NA      NA      NA        2      0
    ismale         NA      NA      NA       NA      0

Data checking: output
The procedure returns the list of individuals (idok) and the list of SNPs (snpok) that passed all QC criteria. A clean dataset can then be obtained by subsetting:
    data1 <- ge03d2ex[qc1$idok, qc1$snpok]

Data checking: HW plots after cleaning
    s1 <- summary(data1@gtdata[(data1@phdata$dm2 == 1), ])
    pexcas1 <- s1[, "pexact"]
    estlambda(pexcas1)
[Figure: R output, two plots labelled "After" and "Before"]

Finding genetic sub-structure
    # matrix of genomic kinship between all pairs of individuals
    data1.gkin <- ibs(data1[, data1@gtdata@chromosome != "X"], weight="freq")
    # distance matrix
    data1.dist <- as.dist(0.5 - data1.gkin)
    # use classical multidimensional scaling
    data1.mds <- cmdscale(data1.dist)
    # plot the first two components
    plot(data1.mds)
[Figure: MDS plot; a separate group of points is annotated "Exclude these individuals"]
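The same idea can be tried without GenABEL on simulated genotypes. In this hedged base R sketch, Euclidean distance between genotype vectors stands in for the 0.5 minus kinship distance used on the slide, and the two simulated groups (sizes, allele frequencies, object names) are all assumptions made for illustration.

```r
set.seed(1)

# Simulate two groups with clearly different allele frequencies.
n_main <- 20; n_out <- 5; n_snp <- 500
freq_main <- runif(n_snp, 0.1, 0.9)
freq_out  <- 1 - freq_main

# Genotypes as 0/1/2 allele counts; one row per person.
sim_geno <- function(n, freq) {
  matrix(rbinom(n * length(freq), size = 2, prob = rep(freq, each = n)),
         nrow = n)
}
geno <- rbind(sim_geno(n_main, freq_main), sim_geno(n_out, freq_out))
rownames(geno) <- paste0("id", seq_len(nrow(geno)))

# Stand-in distance (assumption): Euclidean distance between genotype rows.
d <- dist(geno)

# Classical multidimensional scaling, then k-means with two clusters.
mds <- cmdscale(d, k = 2)
km <- kmeans(mds, centers = 2, nstart = 100)

# k-means labels are arbitrary, so pick the larger cluster by size.
main_label <- as.integer(names(which.max(table(km$cluster))))
keep <- names(km$cluster)[km$cluster == main_label]
print(table(km$cluster))
```

Selecting the main cluster by its size rather than by its label avoids depending on which number k-means happens to assign to it.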

Remove outliers
    km <- kmeans(data1.mds, centers=2, nstart=1000)
    cl1 <- names(which(km$cluster == 1))
    cl2 <- names(which(km$cluster == 2))
    # keep the main cluster; k-means labels are arbitrary, so check which label it received
    data2 <- data1[cl1, ]
Then repeat the QC analysis, this time allowing for HWE checks (using controls only, and excluding markers at FDR 0.2):
    qc2 <- check.marker(data2, hweids=(data2@phdata$dm2 == 0), fdr=0.2)
    summary(qc2)
R output
              NoCall NoMAF NoHWE Redundant Xsnpfail
    NoCall         0     0     0         0        0
    NoMAF         NA    40     0         0        0
    NoHWE         NA    NA     0         0        0
    Redundant     NA    NA    NA         0        0
    Xsnpfail      NA    NA    NA        NA        0

             IDnoCall HetFail IBSFail isfemale ismale
    IDnoCall        0       0       0        0      0
    HetFail        NA       0       0        0      0
    IBSFail        NA      NA       0        0      0
    isfemale       NA      NA      NA        0      0
    ismale         NA      NA      NA       NA      0

GWA scan: raw data
Scan of the raw data (before quality control) using a score test, as implemented in the qtscore() function.
    an0 <- qtscore(dm2, ge03d2ex, trait="binomial")
    plot(an0)
    # add corrected p-values in green
    add.plot(an0, df="pc1df", col="green")
[Figure: R output, annotated "interesting results?"]
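For a binary trait and a single SNP, a closely related classical test is the Cochran-Armitage trend test, available in base R as prop.trend.test(). The sketch below is only an illustration of a one-degree-of-freedom additive test on made-up counts; it is not GenABEL's qtscore() implementation, and the counts and object names are assumptions.

```r
# Made-up counts for one SNP, by genotype AA, AB, BB (assumption).
n_cases <- c(AA = 10, AB = 20, BB = 30)   # cases per genotype
n_total <- c(AA = 40, AB = 40, BB = 40)   # cases + controls per genotype

# Trend test with additive scores 0, 1, 2 = number of B alleles.
trend <- prop.trend.test(x = n_cases, n = n_total, score = c(0, 1, 2))

print(trend$statistic)   # chi-square statistic on 1 df
print(trend$p.value)
```

In a genome scan this test is repeated for every SNP, and the resulting p-values are what a Manhattan-style plot displays.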

GWA scan: raw data (continued)
    # descriptive table
    descriptives.scan(an0)
R output: top 10 results
              Chromosome Position      effb     P1df    Pc1df     effab     effbb     P2df
    rs1719133          1  4495479 -0.189730 0.000280 0.000386 -0.102941 -0.632353 0.000633
    rs2975760          3 10518480  0.182573 0.000298 0.000411  0.141182  0.274763 0.001143
    rs7418878          1  2808520  0.170464 0.000974 0.001274  0.154881  0.200980 0.002264
    rs5308595          3 10543128  0.223766 0.001054 0.001375  0.170057  0.375940 0.004593
    rs4804634          1  2807417 -0.079119 0.001197 0.001552  0.061353 -0.203788 0.003696
    rs3224311          2  6009769  0.142522 0.001329 0.001716  0.133082  0.170370 0.002941
    rs26325            3 10617781 -0.447811 0.001331 0.001719 -0.447811 -0.895623 0.001331
    rs8835506          2  6010852  0.142857 0.001532 0.001966  0.135566  0.163636 0.003162
    rs3925525          2  6008501  0.139601 0.001940 0.002464  0.128991  0.170370 0.004555
    rs2521089          3 10487652  0.108577 0.002052 0.002601  0.056511  0.170655 0.006966

GWA scan: cleaned data
    data2 <- data2[qc2$idok, qc2$snpok]
    # plot
    an1 <- qtscore(dm2, data2, trait="binomial")
    plot(an1)
    # add corrected p-values
    add.plot(an1, df="pc1df", col="green")
[Figure: R output, annotated "interesting results"]
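The corrected p-values can be understood through genomic control: test statistics are divided by an inflation factor lambda estimated from the whole scan. The base R sketch below uses the median-based estimator on simulated statistics; this is an assumption for illustration, since GenABEL's estlambda() may use a different estimator, and the inflation value of 1.2 is made up.

```r
set.seed(42)

# Simulated 1-df chi-square statistics for 10,000 null SNPs,
# inflated by a made-up factor of 1.2 to mimic stratification.
chisq_obs <- 1.2 * rchisq(10000, df = 1)

# Median-based inflation factor: observed median over the expected
# median of a chi-square distribution with 1 df.
lambda <- median(chisq_obs) / qchisq(0.5, df = 1)

# Genomic-control correction: rescale statistics, recompute p-values.
chisq_gc <- chisq_obs / lambda
p_raw <- pchisq(chisq_obs, df = 1, lower.tail = FALSE)
p_gc  <- pchisq(chisq_gc,  df = 1, lower.tail = FALSE)

print(lambda)
```

A lambda close to 1 suggests little inflation; values clearly above 1 suggest stratification or data-quality problems that cleaning should reduce.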

Comparison of the two scans
    # compare with previous results: clean-data scan in green
    plot(an1, col="green")
    # add the raw-data scan in red
    add.plot(an0, col="red")
[Figure: clean data (green) overlaid with raw data (red); one signal is annotated "false signal?"]

GWA scan: cleaned data (continued)
    # descriptive table
    descriptives.scan(an1)
Clean data
              Chromosome Position      effb     P1df    Pc1df     effab     effbb     P2df
    rs1719133          1  4495479 -0.194947 0.000360 0.000505 -0.105362 -0.616000 0.000929
    rs8835506          2  6010852  0.154827 0.000847 0.001142  0.154827  0.154827 0.001297
    rs4804634          1  2807417 -0.082839 0.001095 0.001459  0.077554 -0.220017 0.002649
    rs3925525          2  6008501  0.151123 0.001108 0.001476  0.147636  0.161778 0.002010
    rs3224311          2  6009769  0.151123 0.001108 0.001476  0.147636  0.161778 0.002010
    rs2975760          3 10518480  0.177419 0.001255 0.001661  0.137097  0.275986 0.004795
    rs4534929          1  4474374 -0.152613 0.002000 0.002591 -0.039572 -0.287634 0.007430
    rs6079246          2  7048058 -0.431085 0.002106 0.002723 -0.431085 -0.862170 0.002106
    rs5308595          3 10543128  0.223577 0.002367 0.003044  0.202744  0.390244 0.009551
    rs1013473          1  4487262  0.089426 0.002566 0.003287  0.034794  0.141576 0.006782
Raw data
              Chromosome Position      effb     P1df    Pc1df     effab     effbb     P2df
    rs1719133          1  4495479 -0.189730 0.000280 0.000386 -0.102941 -0.632353 0.000633
    rs2975760          3 10518480  0.182573 0.000298 0.000411  0.141182  0.274763 0.001143
    rs7418878          1  2808520  0.170464 0.000974 0.001274  0.154881  0.200980 0.002264
    rs5308595          3 10543128  0.223766 0.001054 0.001375  0.170057  0.375940 0.004593
    rs4804634          1  2807417 -0.079119 0.001197 0.001552  0.061353 -0.203788 0.003696
    rs3224311          2  6009769  0.142522 0.001329 0.001716  0.133082  0.170370 0.002941
    rs26325            3 10617781 -0.447811 0.001331 0.001719 -0.447811 -0.895623 0.001331
    rs8835506          2  6010852  0.142857 0.001532 0.001966  0.135566  0.163636 0.003162
    rs3925525          2  6008501  0.139601 0.001940 0.002464  0.128991  0.170370 0.004555
    rs2521089          3 10487652  0.108577 0.002052 0.002601  0.056511  0.170655 0.006966

GWA in presence of genetic stratification
- Assess population structure
- Account for population structure in the analysis
    # Assess population structure: 1 = member of cluster cl1, 0 = otherwise
    pop <- as.numeric(data1@phdata$id %in% cl1)
    pop
    # Stratified association
    data1.sa <- qtscore(dm2, data=data1, strata=pop)
    # plot the results and compare with the analysis that removed the outliers
    plot(an1, cex=0.5, pch=19, ylim=c(1, 5))
    add.plot(data1.sa, col="green", cex=1.2)

GWA in presence of genetic stratification (continued)
Adjust both phenotypes and genotypes for possible stratification using principal component analysis (Price's method):
    data1.eg <- egscore(dm2, data=data1, kin=data1.gkin)
    plot(an1, cex=0.5, pch=19, ylim=c(1, 5))
    add.plot(data1.sa, col="green", cex=1.2)
    add.plot(data1.eg, col="red", cex=1.3)

Other interesting features
- Genetic data imputations
- Meta-analysis of GWA scans
- Analysis of selected regions
- Conversion of PLINK files

Conclusion
- GWAS is becoming a major area of research
- New computational tools and statistical methods are needed
- GenABEL is an interesting program, especially for easy data cleaning and display of results
- PLINK has more features for statistical analysis but is not yet available in R for Windows!

Thank you!