AntEpiSeeker2.0: extending epistasis detection to epistasis-associated pathway inference using ant colony optimization


Yupeng Wang 1,*, Xinyu Liu 2 and Romdhane Rekaya 1,2

1 Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
2 Department of Statistics, University of Georgia, Athens, GA 30602, USA
* Corresponding author: wyp1125@uga.edu

Abstract

Genome-wide association studies (GWAS) have become a standard method for finding genetic variations that contribute to common, complex diseases. It has recently been suggested that these diseases may be caused by epistatic interactions among multiple genetic variations. Although tens of software tools have been developed for epistasis detection, few are able to infer pathway importance from the identified epistatic interactions. AntEpiSeeker was originally an algorithm for detecting epistatic interactions in case-control studies using a two-stage ant colony optimization (ACO) algorithm. We have developed AntEpiSeeker2.0, which extends the AntEpiSeeker algorithm to the inference of epistasis-associated pathways through a natural use of the ACO pheromones. By examining the pheromone distribution across pathways, epistasis-associated pathways can be readily identified. The effectiveness of AntEpiSeeker2.0 in inferring epistasis-associated pathways is demonstrated through a simulation study and a real data application. AntEpiSeeker2.0 was designed to provide efficient inference of epistasis-associated pathways based on ant colony optimization and is freely available.

Keywords

Epistasis, pathway, ant colony optimization, software

Introduction

Genome-wide association studies (GWAS) have become a standard method for finding genetic variations that contribute to common, complex diseases [1]. However, single-locus analyses often do not reveal significant associations. Complex diseases have long been suspected to be caused by the joint effects of multiple genetic variations. Such variations may show little or no individual effect but strong interactions, which are often referred to as epistasis or epistatic interactions [2]. Recently, biological pathway analysis has become a popular approach to prioritizing GWAS results [1, 3]. However, epistatic interactions are often difficult to detect in GWAS because of computational intractability.

Ant colony optimization (ACO) is an algorithm for solving difficult optimization problems such as the traveling salesman problem [4]. ACO simulates the positive feedback process by which real-world ant colonies find the shortest path to a food source by communicating through pheromones. ACO has been successfully applied to epistasis detection [5-7]. AntEpiSeeker1.0, a software tool for epistasis detection in GWAS based on a two-stage ACO algorithm, was shown to outperform its recent competitors in a series of simulation studies and a real GWAS example [6]. Recently, a comparison study of five representative epistasis detection methods showed that AntEpiSeeker1.0 performed best at detecting epistasis displaying marginal effects and was an efficient and effective method in terms of overall performance [8].

Because numerous correlated SNPs exist in a GWAS, there are often hundreds, thousands or even more SNP pairs showing epistatic interactions, most of which are false positives that are difficult to prune. Inference of epistasis-associated pathways could be a promising approach to interpreting the detected epistatic interactions.

In the AntEpiSeeker algorithm, the pheromone of a SNP represents its information content, i.e. its relative contribution to epistasis [6]. As a simple but reasonable approximation, the pheromone of a pathway can be taken as the average pheromone of its associated SNPs, which provides a method for ranking, without bias, the contribution of each investigated pathway to epistatic interactions. Based on this idea, we have developed AntEpiSeeker2.0, which provides efficient inference of epistasis-associated pathways.
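
The pathway scoring idea described above can be illustrated with a minimal Python sketch. It is not taken from the AntEpiSeeker2.0 source (which is written in C++); the function names, the `top_fraction` argument, the SNP and pathway names, and the pheromone values are all hypothetical and chosen only for illustration. The sketch assumes SNP pheromones are already available from a completed ACO run.

```python
import math
from statistics import mean


def pathway_pheromone(snp_pheromone, snps, top_fraction=1.0):
    """Average pheromone of a pathway's SNPs.

    If top_fraction < 1, only the highest-pheromone fraction of the
    pathway's SNPs is averaged. Returns None when no SNP has a pheromone.
    """
    values = sorted((snp_pheromone[s] for s in snps if s in snp_pheromone),
                    reverse=True)
    if not values:
        return None
    keep = max(1, math.ceil(len(values) * top_fraction))
    return mean(values[:keep])


def rank_pathways(snp_pheromone, pathways, top_fraction=1.0):
    """Return (pathway, pheromone) pairs sorted by descending pheromone."""
    scores = {}
    for name, snps in pathways.items():
        score = pathway_pheromone(snp_pheromone, snps, top_fraction)
        if score is not None:  # skip pathways without any scored SNP
            scores[name] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy values, not results from the paper.
snp_pheromone = {"rs1": 100.0, "rs2": 340.0, "rs3": 95.0,
                 "rs4": 310.0, "rs5": 102.0}
pathways = {"pathway_A": ["rs1", "rs3", "rs5"],
            "pathway_B": ["rs2", "rs4", "rs5"]}

ranking_all = rank_pathways(snp_pheromone, pathways)
ranking_top_half = rank_pathways(snp_pheromone, pathways, top_fraction=0.5)
```

In this toy input, `pathway_B` contains two high-pheromone SNPs and therefore ranks first under both settings; restricting the average to the top half of SNPs simply reduces the dilution from low-pheromone SNPs.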

Implementation

Inference of pathway importance

The epistasis detection procedure employed by the AntEpiSeeker algorithm is described in detail in the paper on version 1.0 [6]. In AntEpiSeeker2.0, once the ACO completes, the pheromone of each pre-defined pathway, defined as the average pheromone of its associated SNPs, is computed. All pathways are ranked in descending order of pheromone, and top-ranked pathways are more likely to be associated with epistatic interactions. Note that not all SNPs in a pathway are informative, because many SNPs may carry random pheromones. To alleviate the effect of random SNP pheromones, the pathway pheromone may be computed from only the top 25% or 50% of its associated SNPs (ranked by SNP pheromone).

Usage of AntEpiSeeker2.0

AntEpiSeeker2.0 is written in C++. The GNU Scientific Library (GSL) must be installed on the user's computer before compiling. The parameters for executing the program are specified in the parameters.txt file. The input SNP data should be tab-delimited, with the first row specifying the sample status (0 or 1). All subsequent rows should contain the genotypes (coded as 0, 1 and 2) for each SNP, with the first column specifying the SNP name. A tab-delimited pathway-SNP association file should also be provided, in which each row describes one pathway: the first column gives the name of the pathway and the following columns list its associated SNPs. There are four output files. "AntEpiSeeker.log" and "results_maximized.txt" record intermediate results and all detected epistatic interactions, respectively; two user-specified output files contain the epistatic interactions with minimized false positives and the sorted pathways with their pheromones, respectively. Parameter settings are described in the readme file.

Results

Simulation study

To evaluate the performance of AntEpiSeeker2.0 in detecting epistasis-associated pathways, the real-data-based simulation presented in [6] was adopted and extended.
The raw SNP data were the genotypes on human chromosome 1, retrieved from the 912 individuals of 11 populations in the International HapMap project (Phase 3) [9]. We removed loci with missing genotypes or a minor allele frequency < 0.1, leaving a total of 73,355 SNP markers for analysis. Because the HapMap samples are not in case-control format, we randomly selected half of the samples as cases (with the other half as controls). Then, we embedded 132 epistatic interactions with P-value< following 1) additive effects, 2) multiplicative effects and 3) threshold effects [10] into the data at randomly selected causative loci. Next, 200 pathways were simulated, each consisting of 2~500 randomly selected non-causative SNPs. The 11th~14th pathways were selected as epistasis-associated pathways. The 11th and 12th pathways were appended with 2~84 within-pathway epistatic interactions, respectively. The 13th and 14th pathways were appended with 2~84 cross-pathway epistatic interactions. Fifty datasets were generated as described above, and AntEpiSeeker2.0 was then run with the following parameters: 5,000 ants, SNP set size [3, 6], 200 iterations for each SNP set size and initial pheromone = 100. The 11th~14th pathways ranked as the top four pathways in all 50 datasets. To further test whether AntEpiSeeker2.0 assesses pathways without bias, all epistatic interactions were removed from the simulated data. All pathways then had nearly equal pheromones after the ACO procedure, in contrast to the pheromone distribution obtained from the datasets with epistatic interactions (Figure 1).

Real data application

AntEpiSeeker2.0 (parameters: 1000 iterations for each SNP set size and pathway pheromones computed from the top half of associated SNPs) was used to infer epistasis-associated pathways for the GWAS on rheumatoid arthritis (RA) from the Wellcome Trust Case Control Consortium (WTCCC) [11]. A total of 235 KEGG human pathways [12] were assessed. SNPs were assigned to a gene if they were located between 1 kb upstream and 1 kb downstream of the gene. To validate the result of epistasis-associated pathway inference, we compared the pheromones of top-ranking pathways between real and permuted genotype data (Figure 2).
The comparison suggests that there may be several tens of pathways associated with epistasis in RA. We found that the top six pathways (~2.5%) showed the highest biological relevance. Two were signaling pathways, Notch and MAPK, which have been suggested to play an important role in the progression of RA [13, 14]. The other four pathways were cardiac muscle contraction, hypertrophic cardiomyopathy, dilated cardiomyopathy and arrhythmogenic right ventricular cardiomyopathy, which is consistent with reports that RA is often accompanied by cardiovascular disorders [15].
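
The SNP-to-gene assignment rule used in the real data application (a SNP belongs to a gene if it lies within 1 kb upstream or downstream of it) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the coordinate representation, and the toy genes and SNPs are hypothetical, and the sketch does not reproduce how the authors actually built their pathway-SNP association file.

```python
from collections import defaultdict

WINDOW_BP = 1000  # 1 kb flank on each side of the gene, as stated in the text


def assign_snps_to_genes(snps, genes, window=WINDOW_BP):
    """Map each gene to the SNPs lying within `window` bp of the gene.

    snps:  dict of SNP name -> (chromosome, position)
    genes: dict of gene name -> (chromosome, start, end), with start <= end
    """
    gene_to_snps = defaultdict(list)
    for gene, (g_chrom, start, end) in genes.items():
        for snp, (s_chrom, pos) in snps.items():
            if s_chrom == g_chrom and start - window <= pos <= end + window:
                gene_to_snps[gene].append(snp)
    return dict(gene_to_snps)


def pathway_snps(pathway_genes, gene_to_snps):
    """Collect the distinct SNPs of all genes in a pathway, in first-seen order."""
    seen = {}
    for gene in pathway_genes:
        for snp in gene_to_snps.get(gene, []):
            seen.setdefault(snp, None)
    return list(seen)


# Hypothetical coordinates, for illustration only.
genes = {"GENE_A": ("1", 10_000, 12_000),
         "GENE_B": ("1", 50_000, 55_000)}
snps = {"rs_a": ("1", 9_200),    # 800 bp upstream of GENE_A: assigned
        "rs_b": ("1", 13_500),   # 1.5 kb downstream of GENE_A: not assigned
        "rs_c": ("1", 55_900),   # 900 bp downstream of GENE_B: assigned
        "rs_d": ("2", 10_500)}   # different chromosome: not assigned

gene_to_snps = assign_snps_to_genes(snps, genes)
example_pathway = pathway_snps(["GENE_A", "GENE_B"], gene_to_snps)
```

The nested loop is quadratic in the number of genes and SNPs, which is acceptable for a toy example; for genome-scale data, sorting positions per chromosome and using binary search would be a more practical choice.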

Discussion

AntEpiSeeker is a stochastic optimization method. Compared with brute-force optimization, AntEpiSeeker is much more efficient. For example, in the aforementioned simulation study, AntEpiSeeker identified 92 (69.9%) true epistatic interactions, while within the same computation time, brute-force optimization detected only one (0.7%) true epistatic interaction. Although tens of software tools have been developed for epistasis detection, few are able to relate identified epistasis to associated pathways. AntEpiSeeker2.0 was designed to provide efficient inference of epistasis-associated pathways.

AntEpiSeeker2.0 itself does not incorporate a permutation procedure for assessing the significance of reported pathways, rendering it able to report the pathway ranking as soon as the ACO procedure completes. However, AntEpiSeeker2.0 reports a pheromone value for each pathway. To quickly determine whether the result of pathway inference is meaningful, users may simply plot pathway pheromones against their ranks or IDs. Epistasis-associated pathways are expected to show obvious pheromone peaks, while non-associated pathways are expected to have low and comparable pheromones. In addition, we have shown that AntEpiSeeker2.0 can assess pathways of different sizes (i.e. numbers of genes/SNPs in a pathway) without bias. Users may construct a null distribution of pathway pheromones by applying AntEpiSeeker2.0 to permuted data (e.g. by permuting the sample status), but only to reconfirm the result of epistasis-associated pathway inference.

Conclusions

AntEpiSeeker2.0 extends epistasis detection to epistasis-associated pathway inference, based on sorting pathway pheromones. In this study, AntEpiSeeker2.0 is shown to be effective in inferring epistasis-associated pathways in a simulation study and a real data application.

Authors' contributions

YW conceived the project, wrote the software and analyzed the data. YW, XL and RR wrote the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This study was supported in part by resources and technical expertise from the University of Georgia Georgia Advanced Computing Resource Center, a partnership between the Office of the Vice President for Research and the Office of the Chief Information Officer.

References

1. Cantor RM, Lange K, Sinsheimer JS: Prioritizing GWAS results: A review of statistical methods and recommendations for their application. Am J Hum Genet 2010, 86(1):
2. Cordell HJ: Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Hum Mol Genet 2002, 11(20):
3. Wang K, Li M, Hakonarson H: Analysing biological pathways in genome-wide association studies. Nat Rev Genet 2010, 11(12):
4. Dorigo M, Gambardella LM: Ant colonies for the travelling salesman problem. Biosystems 1997, 43(2):
5. Greene CS, White BC, Moore JH: Ant Colony Optimization for Genome-Wide Genetic Analysis. Lect Notes Comput Sci 2008, 5217/2008:
6. Wang Y, Liu X, Robbins K, Rekaya R: AntEpiSeeker: detecting epistatic interactions for case-control studies using a two-stage ant colony optimization algorithm. BMC Res Notes 2010, 3:
7. Christmas J, Keedwell E, Frayling TM, Perry JRB: Ant colony optimisation to identify genetic variant association with type 2 diabetes. Inform Sciences 2011, 181(9):
8. Shang J, Zhang J, Sun Y, Liu D, Ye D, Yin Y: Performance analysis of novel methods for detecting epistasis. BMC Bioinformatics 2011, 12:
9. Thorisson GA, Smith AV, Krishnan L, Stein LD: The International HapMap Project Web site. Genome Res 2005, 15(11):
10. Marchini J, Donnelly P, Cardon LR: Genome-wide strategies for detecting multiple loci that influence complex diseases. Nature Genetics 2005, 37(4):
11. Burton PR, Clayton DG, Cardon LR, Craddock N, Deloukas P, Duncanson A, Kwiatkowski DP, McCarthy MI, Ouwehand WH, Samani NJ et al: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007, 447(7145):
12. Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M: KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res 2010, 38(Database issue):d
13. Schett G, Zwerina J, Firestein G: The p38 mitogen-activated protein kinase (MAPK) pathway in rheumatoid arthritis. Ann Rheum Dis 2008, 67(7):
14. Jiao Z, Wang W, Xu H, Wang S, Guo M, Chen Y, Gao J: Engagement of activated Notch signalling in collagen II-specific T helper type 1 (Th1)- and Th17-type expansion involving Notch3 and Delta-like1. Clin Exp Immunol 2011, 164(1):
15. Solomon DH, Kremer J, Curtis JR, Hochberg MC, Reed G, Tsao P, Farkouh ME, Setoguchi S, Greenberg JD: Explaining the cardiovascular risk associated with rheumatoid arthritis: traditional risk factors versus markers of rheumatoid arthritis severity. Ann Rheum Dis 2010, 69(11):

Figure legends

Figure 1. Comparison of pathway pheromones between the data with epistatic interactions and null data (averaged for 50 datasets).

Figure 2. Comparison of pheromones of top ranking pathways between real and permuted WTCCC RA data (averaged for 20 permuted datasets).

Figure 1

Figure 2