Statistical challenges to genome-wide association study

Similar documents
PLINK gplink Haploview

Introduction to statistics for Genome- Wide Association Studies (GWAS) Day 2 Section 8

Exploring the Genetic Basis of Congenital Heart Defects

By the end of this lecture you should be able to explain: Some of the principles underlying the statistical analysis of QTLs

Genome-Wide Association Studies (GWAS): Computational Them

CS273B: Deep Learning in Genomics and Biomedicine. Recitation 1 30/9/2016

Association Mapping in Plants PLSC 731 Plant Molecular Genetics Phil McClean April, 2010

Supplementary Note: Detecting population structure in rare variant data

General aspects of genome-wide association studies

Author's response to reviews

Linkage Disequilibrium. Adele Crane & Angela Taravella

Evaluation of Genome wide SNP Haplotype Blocks for Human Identification Applications

ACCEPTED. Victoria J. Wright Corresponding author.

Age-Adjusted Death Rates for Coronary Heart Disease, U.S.,

ARTICLE Sherlock: Detecting Gene-Disease Associations by Matching Patterns of Expression QTL and GWAS

MAPPING GENES TO TRAITS IN DOGS USING SNPs

Beyond single genes or proteins

emerge-ii site report Vanderbilt

Lees J.A., Vehkala M. et al., 2016 In Review

University of Groningen. The value of haplotypes Vries, Anne René de

Concepts and relevance of genome-wide association studies

An introductory overview of the current state of statistical genetics

Illumina s GWAS Roadmap: next-generation genotyping studies in the post-1kgp era

Gene-centric Genomewide Association Study via Entropy

Measures of human population structure show heterogeneity among genomic regions

Gap Filling for a Human MHC Haplotype Sequence

An introduction to genetics and molecular biology

Runs of Homozygosity Analysis Tutorial

Basic Concepts of Human Genetics


KNN-MDR: a learning approach for improving interactions mapping performances in genome wide association studies

POPULATION GENETICS Winter 2005 Lecture 18 Quantitative genetics and QTL mapping

Mapping and Mapping Populations

Gene-Environment Interactions In Complex Human Diseases

Improved Power by Use of a Weighted Score Test for Linkage Disequilibrium Mapping

S SG. Metabolomics meets Genomics. Hemant K. Tiwari, Ph.D. Professor and Head. Metabolomics: Bench to Bedside. ection ON tatistical.

UK Biobank Axiom Array

Data Mining and Applications in Genomics

Population Genetics & Drug Discovery

A strategy for multiple linkage disequilibrium mapping methods to validate additive QTL. Abstracts

SNP calling and VCF format

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

Genetics of extreme body size evolution in mice from Gough Island

Linking Genetic Variation to Important Phenotypes

Aristos Aristodimou, Athos Antoniades, Constantinos Pattichis University of Cyprus David Tian, Ann Gledson, John Keane University of Manchester

Personal Genomics Platform White Paper Last Updated November 15, Executive Summary

Biomedical Big Data and Precision Medicine

Disclaimer This presentation expresses my personal views on this topic and must not be interpreted as the regulatory views or the policy of the FDA

Human linkage analysis. fundamental concepts

INTRODUCTION TO GENETIC EPIDEMIOLOGY (Antwerp Series) Prof. Dr. Dr. K. Van Steen

Correlations in Genetic Risk Scores Produced by Direct-to-Consumer Genetic Testing Companies

Data Sources and Biobanks in the Asia-Pacific Region. Wei Zhou, MD, Ph.D. Department of Epidemiology, Merck Research Laboratories October 23, 2014

Detecting ancient admixture using DNA sequence data

Exploring genomic databases: Practical session "

Targeting 160 Candidate Genes for Blood Pressure Regulation with a Genome-Wide Genotyping Array

Genetics and Inflammatory Bowel Disease

Haplotype phasing in large cohorts: Modeling, search, or both?

Package powerpkg. February 20, 2015

MMAP Genomic Matrix Calculations

Strategy for applying genome-wide selection in dairy cattle

Polarization of the Effects of Autoimmune and Neurodegenerative Risk Alleles in Leukocytes

ARTICLE Leveraging Multi-ethnic Evidence for Mapping Complex Traits in Minority Populations: An Empirical Bayes Approach

Linkage Disequilibrium Mapping via Cladistic Analysis of Single-Nucleotide Polymorphism Haplotypes

Basic Concepts of Human Genetics

Introduction to Pharmacogenetics Competency

Quantitative Genetics

Biased Tests of Association: Comparisons of Allele Frequencies when Departing from Hardy-Weinberg Proportions

Detecting selection on nucleotide polymorphisms

Identifying Selected Regions from Heterozygosity and Divergence Using a Light-Coverage Genomic Dataset from Two Human Populations

Axiom mydesign Custom Array design guide for human genotyping applications

Introduction to population genetics. CRITFC Genetics Training December 13-14, 2016

Variant calling in NGS experiments

SNP GENOTYPING WITH iplex REAGENTS AND THE MASSARRAY SYSTEM

"Genetics in geographically structured populations: defining, estimating and interpreting FST."

Structural variation. Marta Puig Institut de Biotecnologia i Biomedicina Universitat Autònoma de Barcelona

Application of Deep Learning to Drug Discovery

A Statistical Framework for Joint eqtl Analysis in Multiple Tissues

PV92 PCR Bio Informatics

Statistical Tests for Admixture Mapping with Case-Control and Cases-Only Data

Joint Study of Genetic Regulators for Expression

Genomic Research: Issues to Consider. IRB Brown Bag August 28, 2014 Sharon Aufox, MS, LGC

Genome-wide genetic heterogeneity discovery with categorical covariates

Gap hunting to characterize clustered probe signals in Illumina methylation array data

SNPassoc: an R package to perform whole genome association studies

Lecture 8: Sequencing and SNP. Sept 15, 2006

Pharmacogenomics and Health Policy

Amapofhumangenomevariationfrom population-scale sequencing

Whole genome sequencing in drug discovery research: a one fits all solution?

Association genetics in barley

Gene mutation and DNA polymorphism

Application of Deep Learning to Drug Discovery

A HapMap harvest of insights into the genetics of common disease

Bioinformatics Advice on Experimental Design

Mapping and selection of bacterial spot resistance in complex populations. David Francis, Sung-Chur Sim, Hui Wang, Matt Robbins, Wencai Yang.

GENOME-WIDE ASSOCIATION STUDIES: THEORETICAL AND PRACTICAL CONCERNS

less sensitive than RNA-seq but more robust analysis pipelines expensive but quantitiatve standard but typically not high throughput

From Variants to Pathways: Agilent GeneSpring GX s Variant Analysis Workflow

Transcription:

1 Statistical challenges to genome-wide association study Naoyuki Kamatani, M.D., Ph.D. 1. Director and Professor, Institute of Rheumatology, Tokyo Women s Medical University 2. Director, Medical Informatics Group, SNP Research Center, RIKEN

2 Scientific breakthrough of the year 2007 Genome-wide association study (GWAS) boomed in 2007 Human genetic variation Proof of the Poincaré Conjecture for 2006

3 Can we identify disease-associated genes on the genome-wide basis? Yes, by the Linkage Analysis 300 500 markers can cover the whole genome Causes since there of are the 3 majority x 10 9 base of pairs, Mendelian and diseases have been elucidated. 10 7 base pair sequence is transmitted together to the next generation However, the effect size should be large and family data are necessary. The phenotype-associated locus is here!

4 Can we identify disease-associated genes on the genome-wide basis, even if the effect size is small or family data are unavailable? Yes, by the GWAS (genome-wide association study) Cases 100,000 1,000,000 markers can cover the whole genome Causes of complex diseases may be Controls elucidated by GWAS. since 10 4-10 5 base pair sequence is associated with each other range of linkage disequilibrium

5 Era of GWAS has come! GWAS: Genome-wide association study 1. A list of SNPs covering the whole genome was made (HapMap) 2. Chips and Beads used for the genotyping for 100,000 1,000,000 individual SNPs are now commercially available. 3. Methods for analyzing the large size genotyping data are available.

6 Reports of GWAS in 2007 Samani NJ et al. Genomewide association analysis of coronary artery disease. N Engl J Med 357, 443, 2007 Stefansson H. et al. A genetic risk factor for periodic limb movements in sleep. N Engl J Med 357, 639, 2007 Dunckley et al. Whole-genome analysis of sporadic amyotrophic lateral sclerosis. N Engl J Med 357, 775, 2007 The international multiple sclerosis genetics consorsium. Risk alleles for multiple sclerosis identified by a genomewide study. N Engl J Med 357, 851, 2007 Plenge RM et al. TRAF1-C5 as a risk locus for rheumatoid arthritis--a genomewide study. N Engl J Med. 357, 1199, 2007 Majority of genetic causes of major diseases will be elucidated within a few years!!

7

8

9

GWAS(genome-wide association) reports from RIKEN 10 1. Ozaki K et al. Functional SNPs in the lymphotoxin-alpha gene that are associated with susceptibility to myocardial infarction. Nat Genet. 2002 Dec;32(4):650-4. 2. Suzuki A et al. Functional haplotypes of PADI4, encoding citrullinating enzyme peptidylarginine deiminase 4, are associated with rheumatoid arthritis. Nat Genet. 2003 Aug;34(4):395-402. 3. Tokuhiro S et al.. An intronic SNP in a RUNX1 binding site of SLC22A4, encoding an organic cation transporter, is associated with rheumatoid arthritis. Nat Genet. 2003 Dec;35(4):341-8. 4. Ozaki K et al. Functional variation in LGALS2 confers risk of myocardial infarction and regulates lymphotoxin-alpha secretion in vitro. Nature. 2004 May 6;429(6987):72-5. 5: Kizawa H et al. An aspartic acid repeat polymorphism in asporin inhibits chondrogenesis and increases susceptibility to osteoarthritis. Nat Genet. 2005 Feb;37(2):138-44. 6. Kochi Y et al. A functional variant in FCRL3, encoding Fc receptor-like 3, is associated with rheumatoid arthritis and several autoimmunities. Nat Genet. 2005 May;37(5):478-85. 7. Seki S et al. A functional SNP in CILP, encoding cartilage intermediate layer protein, is associated with susceptibility to lumbar disc disease. Nat Genet. 2005 Jun;37(6):607-12. 8. Ozaki K et al. A functional SNP in PSMA6 confers risk of myocardial infarction in the Japanese population. Nat Genet. 2006 Aug;38(8):921-5.

ORs (proportional to sample size) with 95% CIs from each study testing the association of RA with the risk allele of PADI4 gene. The pooled ORs with 95% CI for overall analysis and subgroup analysis in populations of European descent were calculated with the Mantel Haenszel method (diamonds). The first study by Suzuki et al. [6] is shown for reference only and was not included in the meta-analysis. Meta-analysis of the association between PADI4 polymorphism and RA 11 OR (95% CI) Suzuki A Japanese, the first study 1.40 (1.20-1.63) Ikari K Japanese 1.23 (1.09-1.40) Barton A UK 1.16 (0.98-1.38) Harney SM UK 1.20 (0.78-1.84) Martinez A Spanish 1.01 (0.80-1.27) Plenge RM Pooled Swedish 1.01 (0.89-1.14) North American 1.24 (1.07-1.43) Meta-analysis has supported the association between a SNP and a disease All replication Iwamoto studies et al. Rheumatology 2006 1.14 (1.07-1.21) European descent studies 1.10 (1.02-1.19) 0.5 1.0 2.0 Odds ratio

12 QC of large quantity of data (500,000 SNP genotypes from > 10,000 subjects) 1. QC (quality control) is extremely laborious. 2. Mistypes lead to false significance. 3. We can use both genetics and statistics based methods for QC. 4. Reliable conclusion from GWAS is dependent on sophisticated QC filter. More than 10 10 data points.

Observed value 13 Report of the results of an association study QQ plot Expected value -log 10 P profile Chr 1 Chr 22

14 Multiple-comparison problem 1. If a test of independence is performed for 500,000 SNPs with a significance level of 0.05, about 2,500 SNPs will become false positive. 2. Since many SNPs are associated with each other, Bonferroni s correction is too conservative. 3. Several correction methods have been proposed (a) Use of the concept FDR (false-discovery rate) (b) Permutation test (c) Exact calculation of type 1 error rate (d) Bayesian method (FPRP)

(A) (B) Permutation test requires long time calculation 15 Observed data A permutation outcome (ω) Ω(Sample space)size n! Number of cases n 1 Shuffle the phenotypes and obtain empirical distribution of a statistic under the null hypothesis. Number of controls n 2 For 500 cases and controls There are 10 2568 outcomes ω(a permutation outcome) For 500 cases and controls, 10 299 events

16 Problem of population structuring 1. A large sample size is necessary to identity a SNP with a small effect size. 2. If the sample size is large, however, the problem of population structuring emerges.

17 Inflation of type I error rate by mixing two different subpopulations case Subpopulation 2 Subpopulation 1 Subpopulation 1 Subpopulation 2 q 1 q 2 Subpopulation 2 Subpopulation 1 p 1 p 2 If allele or genotype frequencies are different between subpopulations, control and the prevalence of a disease is different between the subpopulations, OR is 1 when p 1 =p 2, or q 1 = q 2 then, false-positive associations will occur.

To avoid false positive associations, we may use a clustering technique. 18

Principle component analysis 19 Draw a line in the space with 140,000 dimension so that the variance of the projections of the points to the line becomes the largest. A point corresponds to a subject 7,000 points in a 140,000 dimension space Use of the projections of the points to separate subjects This is impossible because the calculation of covariance matrix for 140,000 x 140,000 matrix is impossible.

20 Principle component analysis (implemented in EIGENSTRAT) Draw a line in the space with 7,000 dimension so that the variance of the projections of the points to the line becomes the largest. A point corresponds to a SNP A/G 140,000 points in a 7,000 dimension space Use of factors of Eigenvectors to separate subjects

21 Genotype data from 7,000 subjects with 140,000 SNPs SNP 140,000 rows Subjects 7,000 lines Normalize for each SNP (mean p, variance 2 p (1-p))

22 Normalized genotype data SNP 140,000 rows Subjects 7,000 lines 24,500,000 pairs

Subject B 23 24,500,000 pairs each of which has 140,000 SNPs data. Subject A Covariance is calculated for each of 24,500,000 pairs

24 Covariance matrix Subjects Subjects 7,000 Calculate Eigenvectors for this table

Second component PCA analysis for African, European and Asian subjects 25 Asian African European First component

Forth component PCA analysis for African, European and Asian subjects 26 Third component

Second component PCA analysis for Asian subjects 27 Hondo Okinawa Han-Chinese First component

Second component All Japanese samples + HapMap Han-Chinese and Japanese samples 28 Yayoi? (~3,000 1,700 years ago) Hondo cluster Okinawa cluster Jomon? Han-Chinese cluster First component

29 Hokkaido Tohoku Kinki Kanto-Koshinetsu Kyushu Okai-Hokuriku Okinawa

Second component Samples from Okinawa 30 Hondo cluster Okinawa cluster Han-Chinese cluster First component

Second component 31 Samples from Kyushu Hondo cluster Okinawa cluster Han-Chinese cluster First component

Second component 32 Samples from Kinki Hondo cluster Okinawa cluster Han-Chinese cluster First component

Second component Samples from Tokai-Hokuriku 33 Hondo cluster Okinawa cluster Han-Chinese cluster First component

Second component Samples from Kanto-Koshinetsu 34 Hondo cluster Okinawa cluster Han-Chinese cluster First component

Second component Samples from Tohoku 35 Hondo cluster Okinawa cluster Han-Chinese cluster First component

Second component Samples from Hokkaido 36 Hondo cluster Okinawa cluster Han-Chinese cluster First component

Second component 37 Comparison of samples from Kinki and Tohoku areas Tohoku subpopulation Comparison of Tohoku and Kinki subpopulations as cases and controls are problematic when the sample size is over 400 Kinki subpopulation Okinawa cluster Han-Chinese cluster First component

38 Nonsynonymous SNPs ranked according to P values of Armitage test based on genotypes for Hondo and Okinawa clusters rs_number chr chr_pos P value gene Known to be associated with the thickness of the hair rs3827760 2 108880033 1.61E-20 EDAR rs17822931 16 46815699 8.48E-20 ABCC11 rs4285045 4 144355168 1.23E-19 USP38 Known to be associated with dry or wet ear wax rs1799986 12 55821533 3.44E-17 LRP1 rs2274067 1 229443429 1.49E-15 C1orf131 rs2230611 19 5163482 3.49E-15 PTPRS rs1872056 15 69827828 1.57E-14 FLJ13710 rs2298645 18 75829123 1.97E-14 LOC440498 rs3744921 18 28121686 3.03E-14 FAM59A This method may be useful to identify genes that have been the targets of natural selection rs631248 1 43843808 8.79E-14 PTPRF rs9932051 16 10482297 7.99E-13 ATF7IP2

Lambda value Inflation of type 1 error due to population structuring (expressed by lambda value for genomic control, mean and sd). 39 1.4 1.35 1.3 Control 100% Hondo cluster Case x% Okinawa cluster 1000 vs 1000 1.25 1.2 1.15 1.1 1.05 1 controls cases 10.6% 14.6% 0% 5% 10% 15% 20% 500 vs 500 Subjects in Hondo and Okinawa clusters were mixed to construct a 500 or 1,000 size case group. Control group consisted of only the subjects from Hondo cluster.

40 Method for avoiding the Inflation of type I error rate by mixing two different subpopulations case Subpopulation 2 Subpopulation 1 Subpopulation 1 Subpopulation 2 q 1 q 2 Subpopulation 2 Subpopulation 1 p 1 p 2 control Adjust the proportion of subpopulation 1 so that q 1 = q 2 followed by a simple chi square test or by Mantel-Haenzel test OR is 1 when p 1 = p 2, or q 1 = q 2

41 Method for avoiding the Inflation of type I error rate by mixing two different subpopulations case Subpopulation 2 Subpopulation 1 Subpopulation 1 Subpopulation 2 Subpopulation 2 Subpopulation 1 p 1 p 2 control Simply exlcude subpopulation 1 case Mantel-Haenzel test Subpopulation 2 Subpopulation 2 Subpopulation 1 Subpopulation 1 control

Method for avoiding the Inflation of type I error rate by mixing two different subpopulations 42 Han-Chinese cluster Okinawa cluster Hondo cluster Selection of matched controls at random from a large-size control sample

43 Conclusion The data management and statistical analysis for millions or billions of individual genotypes in GWAS are extremely laborious; however, they are a challenging world for statisticians.