Multi-SNP Models for Fine-Mapping Studies: Application to an association study of the Kallikrein Region and Prostate Cancer


1 Multi-SNP Models for Fine-Mapping Studies: Application to an association study of the Kallikrein Region and Prostate Cancer November 11, 2014

2 Contents 1 Background

3 Study Motivation Prostate cancer (PCa) is the second leading cause of cancer mortality in men, after lung cancer. The serum prostate-specific antigen (PSA) marker, measured with a simple blood test, is the only current biochemical test for prostate cancer. However, PSA testing has resulted in over-diagnosis of cancers. According to previous findings, many kallikrein (KLK) genes are differentially regulated in PCa, and their expression levels or serum concentrations can be used for patient prognosis or diagnosis. For example, PSA is the serum level of the KLK3 protein. Apart from KLK3 genetic variation, however, the genetic contribution of the KLK region to PCa remains largely unexplored.

4 Genetic Association Genetic association studies aim to detect association between one or more genetic polymorphisms and a trait, such as disease status or disease progression. Single Nucleotide Polymorphism (SNP): a DNA variant at a single nucleotide position, where the base can be one of four types: A, C, G, T.

5 Genetic Association Cont'd An allele is one of a number of alternative forms of the same gene or the same genetic locus. Linkage disequilibrium (LD): allelic association due to proximity of loci on the genome. Haplotype (block): a series of alleles at linked loci along a single chromosome, inherited as a unit.
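As an illustration of how LD between two SNPs can be quantified, here is a minimal sketch that computes the squared Pearson correlation between two genotype vectors coded as 0/1/2 minor-allele counts. This genotype-level r² is a common proxy for haplotype-based r² when phase is unknown; the function name and the toy data are hypothetical and are not taken from the study.

```python
def genotype_r2(g1, g2):
    """Squared Pearson correlation between two vectors of 0/1/2 allele counts."""
    n = len(g1)
    if n == 0 or n != len(g2):
        raise ValueError("genotype vectors must be non-empty and of equal length")
    m1 = sum(g1) / n
    m2 = sum(g2) / n
    # Unnormalised covariance and variances (the 1/n factors cancel in the ratio).
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    if v1 == 0 or v2 == 0:
        raise ValueError("monomorphic SNP: correlation is undefined")
    return cov * cov / (v1 * v2)


# Toy genotypes for six individuals (hypothetical data).
snp_a = [0, 1, 2, 1, 0, 2]
snp_b = [0, 1, 2, 1, 0, 2]   # identical to snp_a: complete LD
snp_c = [2, 0, 1, 0, 2, 1]   # only weakly correlated with snp_a

r2_ab = genotype_r2(snp_a, snp_b)
r2_ac = genotype_r2(snp_a, snp_c)
```

Pairs with r² close to 1 carry nearly redundant information, which is what makes tagging and imputation possible.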

6 Genetic Association Cont'd Rationale for the existence of association: (1) Direct association: the polymorphism has a causal role. (2) Indirect association: LD between the alleles at a number of loci and a nearby causal variant. Because this relies on LD, the functional variant does not need to be genotyped, as long as one measures a variant that is in LD with it. Examining multiple markers simultaneously, or using haplotypes, can improve the detection of true causal association, which is why fine-mapping is valuable to our study. (3) Confounded association: due to underlying stratification or admixture of the population.

7 Fine Mapping and Genotype Imputation Two commonly used fine-mapping strategies: (1) Direct sequencing or genotyping of a dense set of markers in a given region, e.g. a region found to be associated with an outcome in a GWAS. (2) Imputation of new markers within the candidate region. When LD is high, the redundancy between markers implies that most of the information can be captured without genotyping all the markers. Genotype imputation: using known haplotypes in a population, for instance from the HapMap and 1000 Genomes projects, to statistically impute additional unobserved genotypes.

8 Study Objective Single-SNP association. Definition: measure the dependency between each variant and the response individually. Limitations: assessing the significance of a large number of null hypothesis tests is complex, and the approach exploits only a fraction of the information available. Objective: select an appropriate multi-SNP statistical approach to assess the association between the KLK region and PCa. The approach should capture the joint effect of multiple SNPs, explore the model space efficiently, and be applicable to a genetic region with a complex LD pattern.

9 Data Description Samples. Toronto data: 540 PCa cases and 308 healthy controls. Swiss data: part of the Swiss arm of the European Randomized Study of Screening for PCa (ERSPC), including 380 PCa cases and 596 controls. Genotyping. An initial panel of 123 tagSNPs from the kallikrein region was genotyped using the Illumina platform. Additional SNPs were imputed based on the HapMap and 1000 Genomes reference panels. Only SNPs with MAF > 0.02 are considered. The three possible genotypes at a diallelic locus are coded as 0, 1, 2 in the model, representing the number of copies of the minor allele.
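The genotype coding and the MAF filter described above can be sketched as follows. This is a minimal illustration, assuming genotypes are already stored as 0/1/2 allele counts with `None` for missing calls; the SNP names and data are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """MAF from 0/1/2 allele counts; missing calls are passed as None."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    # Each individual carries two alleles, hence the factor of 2.
    freq = sum(called) / (2 * len(called))
    # Report the frequency of whichever allele is rarer.
    return min(freq, 1 - freq)


def filter_by_maf(snps, threshold=0.02):
    """Keep SNPs whose MAF is strictly above the threshold."""
    return {name: g for name, g in snps.items()
            if minor_allele_frequency(g) > threshold}


# Hypothetical panel: one common SNP and one rare SNP.
panel = {
    "snp_common": [0, 1, 2, 1, 0, 0, 1, 0, 0, 1],   # 6 minor alleles / 20
    "snp_rare": [0] * 49 + [1],                     # 1 minor allele / 100
}
kept = filter_by_maf(panel)
```

Here only `snp_common` passes the MAF > 0.02 filter.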

10 Kallikrein Region The human tissue kallikrein family consists of 15 genes (KLKs) that are all clustered on chromosome 19q. Haplotype blocks are characterised by high LD within blocks and low LD between blocks. There are 50 blocks in the KLK region.

11 Group Lasso Group Lasso (glasso) is designed to select pre-defined groups of predictors (Yuan and Lin, 2006). Suppose p predictors are divided into J groups of sizes $k_1, \ldots, k_J$. The group Lasso estimator is obtained by minimizing $\left\| y - \sum_{j=1}^{J} X_j \beta_j \right\|_2^2 + \lambda \sum_{j=1}^{J} \|\beta_j\|_{K_j}$. It is an extension of the Lasso that aims to select important groups. The coefficients of individual variants within a haplotype block are either all zero or all nonzero.
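To make the group Lasso criterion concrete, the sketch below evaluates it for a given coefficient vector. It assumes the common choice $K_j = k_j I$, so that $\|\beta_j\|_{K_j} = \sqrt{k_j}\,\|\beta_j\|_2$; the slides do not state which $K_j$ was used, and the function name and toy data are hypothetical. It only evaluates the objective and does not fit the model.

```python
from math import sqrt


def group_lasso_objective(y, X, beta, groups, lam):
    """Residual sum of squares plus lam * sum_j sqrt(k_j) * ||beta_j||_2.

    X is a list of rows; groups is a list of lists of column indices.
    Assumes K_j = k_j * I (an assumption, not confirmed by the source).
    """
    rss = 0.0
    for row, yi in zip(X, y):
        fitted = sum(x * b for x, b in zip(row, beta))
        rss += (yi - fitted) ** 2
    penalty = 0.0
    for idx in groups:
        l2_norm = sqrt(sum(beta[i] ** 2 for i in idx))
        penalty += sqrt(len(idx)) * l2_norm
    return rss + lam * penalty


# Toy example: three predictors in two groups (hypothetical data).
y = [1.0, 2.0]
X = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0]]
beta = [3.0, 4.0, 0.0]
groups = [[0, 1], [2]]        # second group is entirely zero
value = group_lasso_objective(y, X, beta, groups, lam=1.0)
```

A group whose coefficients are all zero contributes nothing to the penalty, which is how the criterion favours dropping whole haplotype blocks.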

12 Group Bridge Group Bridge (gbridge) involves bi-level selection: it carries out variable selection both at the group level and at the variant level, i.e. it selects important blocks as well as important SNPs within those blocks (Huang et al., 2009). The group bridge estimator is obtained by minimizing $\left\| y - \sum_{j=1}^{J} X_j \beta_j \right\|_2^2 + \lambda \sum_{j=1}^{J} C_j \|\beta_{A_j}\|_1^{\gamma}$. The group bridge penalty combines two penalties, namely the bridge penalty for group selection and the lasso penalty for within-group selection.
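The group bridge penalty term can be evaluated as in the following sketch. Two choices here are assumptions not stated on the slide: $\gamma = 0.5$ as the default, and group weights $C_j = |A_j|^{1-\gamma}$, which adjust for group size. The function name and toy coefficients are hypothetical.

```python
def group_bridge_penalty(beta, groups, lam, gamma=0.5):
    """lam * sum_j C_j * (||beta_{A_j}||_1) ** gamma.

    groups is a list of lists of coefficient indices (the sets A_j).
    C_j = |A_j| ** (1 - gamma) is an assumed weight, not taken from the source.
    """
    total = 0.0
    for idx in groups:
        l1_norm = sum(abs(beta[i]) for i in idx)   # lasso part, within group
        c_j = len(idx) ** (1 - gamma)
        total += c_j * l1_norm ** gamma            # bridge part, across groups
    return lam * total


# Toy example: two blocks of two SNPs each (hypothetical coefficients).
beta = [3.0, -1.0, 0.0, 0.0]
groups = [[0, 1], [2, 3]]
penalty = group_bridge_penalty(beta, groups, lam=2.0)
```

With $0 < \gamma < 1$ the outer power is concave, which pushes whole groups to zero, while the inner L1 norm allows individual coefficients inside a retained group to be zero.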

13 Mode Oriented Stochastic Search Mode Oriented Stochastic Search (MOSS) is a two-stage Bayesian variable selection procedure that aims to identify combinations of the best predictive SNPs associated with a response (Dobra and Massam, 2010). Rationale of MOSS: Step one, identify the most relevant saturated models, i.e. those models with high marginal likelihood P(Y | X). Step two, within each model selected in step one, select the more relevant terms: main effects and interaction effects. Neighbourhoods of the current most promising models are explored in an effort to move aggressively towards regions of high posterior probability. Once a set of promising log-linear graphical models has been found, model averaging can be used to build a classifier for prediction.

14 Sequence Kernel Association Test Sequence Kernel Association Test (SKAT) is a regression method to test for association between genetic variants (common and rare) in a region and a continuous or dichotomous trait (Wu et al., 2011). SKAT is applicable only to regions, not to single SNPs. SKAT allows different variants to have different directions and magnitudes of effect, including no effect. SKAT avoids a prior selection of the variants within each haplotype block.
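For intuition, the sketch below computes a SKAT-style variance-component score statistic with a weighted linear kernel, $Q = \sum_j w_j \left( \sum_i (y_i - \hat{\mu}) G_{ij} \right)^2$. It assumes an intercept-only null model (no covariates) and the Beta(1, 25) weighting on MAF that is commonly used with SKAT; the p-value calculation is deliberately omitted. Function names and data are hypothetical, and this is not the implementation used in the study.

```python
def beta_1_25_weight(maf):
    """Variant weight w_j = (Beta(1, 25) density at the MAF) squared.

    The Beta(1, b) density is b * (1 - x) ** (b - 1).
    """
    return (25.0 * (1.0 - maf) ** 24) ** 2


def skat_q(y, G, weights):
    """Score statistic Q for a weighted linear kernel.

    y: list of outcomes; G: list of genotype rows (one per individual);
    weights: one non-negative weight per variant.
    """
    mu = sum(y) / len(y)                 # fitted mean under the null model
    resid = [yi - mu for yi in y]
    q = 0.0
    for j, w in enumerate(weights):
        # Per-variant score: covariance-like sum of residual times genotype.
        score = sum(r * row[j] for r, row in zip(resid, G))
        q += w * score ** 2              # squaring removes the effect direction
    return q


# Toy data: four individuals, two variants (hypothetical).
y = [1, 0, 1, 0]
G = [[1, 0],
     [0, 1],
     [2, 0],
     [0, 1]]
q_flat = skat_q(y, G, weights=[1.0, 1.0])
q_beta = skat_q(y, G, weights=[beta_1_25_weight(0.375), beta_1_25_weight(0.25)])
```

Because each per-variant score is squared before summing, variants with opposite effect directions do not cancel, which matches the property noted on the slide.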

15 Criteria for Model Comparison False Discovery Rate. The main-effect FDR, FDR_m: the proportion of non-causal SNPs among all the SNPs discovered by a given approach. The cluster FDR, FDR_c: the proportion of blocks containing no causal SNPs among all the blocks discovered by that approach. The lenient cluster FDR, FDR_l: the proportion of non-lenient blocks (neither a block including causal SNPs nor one of its two closest adjacent blocks) among all the blocks discovered by that approach.

16 Criteria for Model Selection True Discovery Rate. The main-effect TDR, TDR_m: the number of causal SNPs discovered by a given approach divided by the total number of true causal SNPs in the dataset. The cluster TDR, TDR_c: the number of blocks including causal SNPs discovered by that approach divided by the total number of blocks that truly contain causal SNPs. The lenient cluster TDR, TDR_l: the number of lenient blocks (either a block including causal SNPs or one of its two closest adjacent blocks) discovered by that approach divided by the total number of lenient blocks.
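The six criteria defined above can be computed as in this sketch. It assumes blocks are indexed 0, 1, 2, ... in genomic order, so that the "two closest adjacent blocks" of block b are b - 1 and b + 1, and it returns an FDR of 0 when nothing is selected (a convention chosen here). All names and the toy data are hypothetical.

```python
def discovery_rates(selected, truth):
    """Return (FDR, TDR) for a set of discoveries against a set of true items."""
    selected, truth = set(selected), set(truth)
    fdr = len(selected - truth) / len(selected) if selected else 0.0
    tdr = len(selected & truth) / len(truth) if truth else 0.0
    return fdr, tdr


def lenient_set(causal_blocks, n_blocks):
    """Causal blocks plus their immediate neighbours (assumed adjacency by index)."""
    lenient = set()
    for b in causal_blocks:
        for nb in (b - 1, b, b + 1):
            if 0 <= nb < n_blocks:
                lenient.add(nb)
    return lenient


# Hypothetical SNP-to-block map with five blocks.
block_of = {"s1": 0, "s2": 0, "s3": 1, "s4": 2, "s5": 3, "s6": 4}
causal_snps = {"s1", "s5"}
selected_snps = {"s1", "s3", "s6"}

causal_blocks = {block_of[s] for s in causal_snps}      # {0, 3}
selected_blocks = {block_of[s] for s in selected_snps}  # {0, 1, 4}

fdr_m, tdr_m = discovery_rates(selected_snps, causal_snps)
fdr_c, tdr_c = discovery_rates(selected_blocks, causal_blocks)
fdr_l, tdr_l = discovery_rates(selected_blocks, lenient_set(causal_blocks, 5))
```

In this toy case the lenient criteria are more forgiving than the cluster criteria, because selecting a block next to a causal block still counts as a discovery.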

17 Simulation Implementation Simulation procedure: Randomly select five haplotype blocks in the KLK region, and within each block treat one or two SNPs as causal SNPs. Simulate the outcome variable, which follows a Bernoulli distribution, based on the causal SNPs selected above. Four scenarios for the data simulation:
1. Seven causal SNPs within five blocks, based on all imputed genotype data (p=898)
2. Five causal SNPs within five blocks, based on all imputed genotype data (p=898)
3. Seven causal SNPs within five blocks, with less correlated genotype data (p=306)
4. Five causal SNPs within five blocks, with less correlated genotype data (p=306)
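The outcome-simulation step can be sketched as follows, assuming an additive logistic model in which each copy of a causal minor allele multiplies the odds of disease by that SNP's odds ratio. The intercept, the independent toy genotypes and all function names are hypothetical; the study itself used the real (correlated) KLK genotype data rather than simulated genotypes.

```python
import random
from math import exp, log


def case_probability(row, causal_idx, odds_ratios, intercept):
    """P(Y = 1) under a logistic model that is additive on the log-odds scale."""
    eta = intercept + sum(log(o) * row[j] for j, o in zip(causal_idx, odds_ratios))
    return 1.0 / (1.0 + exp(-eta))


def simulate_genotypes(n, mafs, rng):
    """Toy genotypes: two independent allele draws per SNP (no LD between SNPs)."""
    return [[(rng.random() < f) + (rng.random() < f) for f in mafs]
            for _ in range(n)]


def simulate_outcomes(genotypes, causal_idx, odds_ratios, intercept, rng):
    """Draw one Bernoulli outcome per individual."""
    return [1 if rng.random() < case_probability(row, causal_idx, odds_ratios, intercept)
            else 0
            for row in genotypes]


rng = random.Random(2014)                      # fixed seed for reproducibility
mafs = [0.18, 0.16, 0.17, 0.14, 0.20, 0.30]    # illustrative values only
genotypes = simulate_genotypes(200, mafs, rng)
y = simulate_outcomes(genotypes, causal_idx=[0, 2], odds_ratios=[2.0, 2.0],
                      intercept=-1.0, rng=rng)
```

With an intercept of 0 and a single causal SNP with odds ratio 2, a carrier of one minor allele has disease probability 2/3, compared with 1/2 for a non-carrier.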

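18 Information for causal variants (entries marked [?] were lost in the transcription)
Scenario 1, 7 causal SNPs. Minor Allele Frequency (MAF): [?], 0.18, 0.16, 0.17, 0.14, 0.20, [?]. Odds Ratio: [?], 2, 2, 2, 2, 2, [?].
Scenario 2, 5 causal SNPs. Minor Allele Frequency (MAF): [?], 0.16, 0.17, 0.20, [?]. Odds Ratio: [?], 2, 2, 2, [?].
Scenario 3, 7 causal SNPs. Minor Allele Frequency (MAF): [?], 0.18, 0.16, 0.17, 0.14, 0.20, [?]. Odds Ratio: [?], 2, 2, 2, 2, 2, [?].
Scenario 4, 5 causal SNPs. Minor Allele Frequency (MAF): [?], 0.16, 0.17, 0.20, [?]. Odds Ratio: [?], 2, 2, 2, 2.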

19 Simulation Results [Table: False Discovery Rate (FDR_m, FDR_c, FDR_l) and True Discovery Rate (TDR_m, TDR_c, TDR_l) for MOSS, glasso, gbridge and SKAT under scenarios 1 and 2; the numeric values were not preserved in the transcription.]

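20 Simulation Results Cont'd [Table: False Discovery Rate (FDR_m, FDR_c, FDR_l) and True Discovery Rate (TDR_m, TDR_c, TDR_l) for MOSS, glasso, gbridge and SKAT under scenarios 3 and 4; the numeric values were not preserved in the transcription.]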

21 Conclusion When some of the causal SNPs are in high LD, gbridge performs best; when the multiple causal SNPs are from different blocks, MOSS performs best. SKAT and glasso perform worse than the other two approaches in all four scenarios. Model results depend on the correlation structure of the fine-mapped chromosome region.

22 Further Study Modify the MOSS algorithm to capture the complex block correlation structure of the fine-mapped region. Extend the use of MOSS to study SNP*SNP interactions. Generalize our model to analyze different study designs (case-control, case-only, prospective, longitudinal, etc.).

23 Acknowledgment
Supervisor: Dr. Laurent Briollais, Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital and Dalla Lana School of Public Health, University of Toronto
Other Principal Investigators:
Dr. Alexandre R. Zlotta, Department of Surgery, Urology, Mount Sinai Hospital and Department of Surgical Oncology, Urology, Princess Margaret Hospital
Dr. Bharati Bapat, Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital
Dr. Hilmi Ozcelik, Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital
Dr. Theodorus van der Kwast, Department of Pathology, Toronto General Hospital
Dr. Eleftherios P. Diamandis, Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital