NETWORK BASED PRIORITIZATION OF DISEASE GENES

Size: px

Start display at page:

Download "NETWORK BASED PRIORITIZATION OF DISEASE GENES"

Luke Ferguson
5 years ago
Views:

1 NETWORK BASED PRIORITIZATION OF DISEASE GENES by MEHMET SİNAN ERTEN Submitted in partial fulfillment of the requirements for the degree of Master of Science Thesis Advisor: Mehmet Koyutürk Department of Electrical Engineering and Computer Science CASE WESTERN RESERVE UNIVERSITY January, 2010

2 CASE WESTERN RESERVE UNIVERSITY SCHOOL OF GRADUATE STUDIES We hereby approve the thesis/dissertation of candidate for the degree *. (signed) (chair of the committee) (date) *We also certify that written approval has been obtained for any proprietary material contained therein.

3 To my family

4 Contents List of Tables iii List of Figures iv 1 Introduction 1 2 Background Genetic Disorders Computational Disease Gene Prioritization Protein-Protein Interaction Networks Using Networks to Better Understand Diseases Problem Definition and Methods Problem Definition Existing Methods Based on PPI Networks Localized Methods Random Walk with Restarts Network Propagation Incorporating Disease Similarity Information Statistical Adjustment Reference Model Based on Seed Degrees Reference Model Based on Candidate Degree i

5 3.4.3 Reference Model Using Eigenvector Centrality Uniform Prioritization Experimental Results Experimental Setting Datasets Used Existing Methods Effect of the Restart Probability Comparison of Existing Methods Effect of Disease Similarity Effect of Statistical Adjustment Uniform Prioritization Size of the Candidate Set Case Examples Example 1: Example 2: Conclusion 40 Bibliography 42 ii

6 List of Tables 2.1 Shortest distance and direct interactions of disease pairs vs. random pairs The effect of the restart probability on the performance of random walk with restarts Comparison of the existing methods in terms of average ranks of the disease genes among the candidates and area under ROC curves. Global methods clearly outperform localized approaches in terms of both of the metrics used The effect of incorporating disease similarity The effect of statistical adjustment on performance Performance of proposed uniform prioritization methods for all combinations of statistical adjustment and hybrid scoring schemes Percentage of disease proteins ranked in top 1% and in top 5% among other candidates Comparison of the proposed method with existing approaches iii

7 List of Figures 3.1 Motivating example for consideration of multiple paths in assessing functional association between proteins Random walk with restarts vs. shortest distance in capturing functional association The effect of degree in global prioritization methods and the histogram of degrees of genes The standard deviation of the degrees of genes implicated in the same disease, divided by their mean ROC curves showing the effect of the restart probability on the performance of random walk with restarts The effect of incorporating disease similarity ROC curves of statistical correction methods ROC curves of the performance of proposed statistical correction schemes and extant prioritization methods in prioritizing known disease genes with degree Comparison of proposed method with existing approaches The effect of degree to the performance of both existing global approaches, adjusted method based on candidate degrees as well as the proposed hybrid model Performance comparison for different number of candidate proteins.. 36 iv

8 4.8 Case Example for the Azoospermia disease Case Example for the Microphthalmia disease v

9 Acknowledgements I would like to thank to my research advisor Mehmet Koyutürk for the initial idea and for his endless support during this work. vi

10 Network Based Prioritization of Disease Genes Abstract by MEHMET SİNAN ERTEN High-throughput molecular interaction data have been used effectively to prioritize candidate disease genes, based on the notion that genes associated with similar diseases are likely to interact with each other heavily in a network of protein-protein interactions (PPIs). An important challenge for these applications is the noisy nature of PPI data. Existing global methods alleviate these problems to a certain extent, by considering indirect interactions and multiplicity of paths. However, such methods favor highly connected genes, making prioritization sensitive to the degree distribution of PPI networks. Here, we propose several statistical adjustment schemes that aim to minimize these problems. While the proposed schemes are very effective in detecting loosely connected disease genes, they over-correct the highly connected genes. Motivated by these results, we develop hybrid models that integrate existing approaches with the proposed adjustment schemes. Experimental results show that proposed hybrid models significantly outperform existing methods in prioritizing disease genes.

11 Chapter 1 Introduction Identification of genes associated with diseases remains a challenge. Large scale experimental methods such as linkage analysis and genetic association studies generate large sets of candidates containing up to 300 genes [1]. Investigation of these candidates based on sequencing is an expensive task, thus not always a feasible option. Consequently, computational methods are primarily used to prioritize and identify the most likely disease-associated genes among the candidates by utilizing a variety of data sources such as protein-protein interaction (PPI) networks, gene expression data, functional annotations etc. With recent advances in biotechnology, availability of high-throughput biological data enables investigation of biological processes from a systems perspective. Biological processes are often orchestrated through interaction of multiple molecules. Considered together, these interactions form complex biological networks that underlie cellular organization [2]. Computational analysis of the structure of these networks provide significant insights into the mechanisms that drive complex biological systems [3]. In recent years, several attempts have been made to utilize the topological analyses of the PPI data in understanding human diseases [4 6]. These attempts mostly focus 1

12 on prioritization of candidate genes and they are mainly based on the notion that the products of genes associated with similar diseases are likely to be close to each other and interact heavily in a network of PPIs. However, an important challenge for these applications is the incomplete and noisy nature of the PPI data [7]. Vast amount of missing interactions and false positives effect the accuracy of methods based on local network information such as direct interactions and shortest distances. Few global methods based on simulation of random walk [5,6] and network propagation [8] get around this problem to a certain extent by considering multiple alternate paths and whole topology of PPI networks. In this work, we first investigate the performance of existing methods with respect to several parameters including the degree distribution of candidate genes, disease similarity threshold when incorporating similar diseases to calculations and the restart probability to be used in the model based on random walks. We show that existing algorithms have a rich gets richer effect on identification of new disease genes, in that highly connected genes are likely to be favored against loosely connected disease genes. We present new adjustment strategies based on statistical significance for the problem of computational disease gene prioritization and apply these strategies to enhance existing methods. We show that the proposed correction schemes are very effective in detecting loosely connected disease genes which are generally less studied, thus potentially more interesting for many applications. However, we observe that these schemes might overcorrect the highly connected disease genes, thus might perform less favorably in identifying them. Motivated by these results, we develop several uniform prioritization methods that effectively integrate existing algorithms with the proposed statistical adjustment schemes. Comprehensive experimental results indicate that the resulting hybrid schemes significantly outperform existing approaches in correctly identifying disease-associated genes. This thesis is organized as follows: In Chapter 2, we give a brief biological back- 2

13 ground in the relevant issues and prepare the reader for the rest of the chapters. In Chapter 3, we formally define the problem and discuss the existing approaches based on network in depth, followed by the presentation of proposed statistical adjustment schemes. Subsequently, we define our uniform prioritization strategies using these schemes. In Chapter 4, we first introduce the experimental setup and the performance evaluation metrics used, followed by the illustration of the effect of disease similarity and statistical adjustment on the performance of existing methods. We then present the results achieved by the uniform prioritization methods we presented. Finally in Chapter 5, we conclude the work. 3

14 Chapter 2 Background 2.1 Genetic Disorders A genetic disorder is a disease caused by changes and mutations that occur in a single gene or multiple genes in the genome. Some examples to well known singlegene disorders are; sick cell anemia, Marfan syndrome, Huntingtons disease, and hereditary hemochromatosis. On the other hand, some diseases such as heart disease, high blood pressure, Alzheimers disease, arthritis, diabetes, cancer, and obesity are complex diseases that require the interplay of multiple genes. Identification of the genes associated with the latter kind of diseases is a bigger challenge, since the impact of each gene involved can be minimal and difficult to identify separately. 2.2 Computational Disease Gene Prioritization Identification of disease genes is important to better understand gene functions as well as to improve medical care [9]. Genome-wide linkage and association studies in healthy and affected populations provide chromosomal regions that most likely contain disease-associated genes [10, 11]. Manual identification of most relevant genes among hundreds of candidates by utilizing a wide variety of data sources is an ex- 4

15 pensive and time consuming task. Consequently, computational methods based on sequence similarity [12 14], functional annotations [13 16], gene expression analysis [4, 14] and PPI [4 6, 8, 17 20] data are proposed so far. PROSPECTR [12] is an approach that is based on the intrinsic properties of candidate genes. In this method, candidate genes are prioritized by relying on the observation that genes associated with similar diseases are likely to share patterns of sequence based features. These features include the physical structure of the genes and their products as well as evolutionary information such as mutation rates. On the other hand, methods such as ENDEAVOUR [14], TOM [16] and POCUS [15] rely on the similarity of candidate genes to genes that are already associated with diseases, in terms of the functional annotations, gene expression profiles, regulatory information, etc. However, the scope of methods that rely on functional annotations is limited because only a small fraction of genes in the genome are currently annotated. Moreover, signals inferred from gene expression profiles are not easily utilized especially for diseases caused by multiple genes, where the impact of each contributor gene can be minimal. There are also other methods based on phenotypic analysis that utilize animal models which are later mapped to human data [21, 22]. 2.3 Protein-Protein Interaction Networks In this study, we primarily focus on PPI networks that model physical interactions and functional relationships between proteins. These interactions are captured via a variety of experimental and computational methods [23 25]. Reconstructing a reliable and complete protein interaction network for different species is one of the most important challenges in molecular biology. Detailed analyses on these networks are shown to be very important in understanding cells and diseases on a systemwide level. Despite the emergence of high-throughput technology, reconstruction of 5

16 a complete network is still far from a realistic goal. Existing public PPI databases only cover a fraction of these interactions. Consequently, compilations of data from several sources are used in many applications. PPI networks are often abstracted by graph models, in which the proteins are represented by nodes and the interactions (often physical) among them are represented by undirected edges. This abstraction enables application of graph theoretical approaches to the analysis of cellular organization. Existing applications of these analyses include functional module identification [26 28], prediction of functions of proteins [29, 30], identification of conserved substructures [31 33], evolutionary analyses [34, 35] and more. The latter analyses provide insights about the conservation of functional modules in the evolutionary timeline and how these modules can be utilized to reconstruct the phylogenetic tree of species. 2.4 Using Networks to Better Understand Diseases As mentioned previously, proteins associated with similar diseases are generally close to each other and interact frequently compared to random pairs in a network of PPIs. Indeed, the statistics shown in Table 2.1 clearly support this notion. The table compares the average shortest distances as well as the percentage of direct interactors among random protein pairs and among proteins that are associated with the same disease, based on a comprehensive human PPI network. It is evident from the table that disease pairs are significantly closer to each other in terms of network distance and interact more frequently, compared to pairs selected randomly across the network. Table 2.1: Average shortest distance and percentage of direct interactors among proteins associated with the same disease vs. random protein pairs in the PPI network. Avg. Shortest Distance Perc. of Direct Interactors Disease pairs Random pairs

17 Significant insights about disease-protein associations can be inferred using this notion based on PPI networks [4 6,8,17 20]. We discuss these approaches in a more detailed way in the following chapters. 7

18 Chapter 3 Problem Definition and Methods 3.1 Problem Definition A PPI network G =(V,E) consists of a set of nodes V and a set of undirected interactions E between these proteins where uv E represents an interaction between proteins u V and v V. We also define the set of interacting partners of a protein v V as N(v) ={u V : uv E}. The input to a candidate disease gene prioritization problem consists of a PPI network G, a seed set S which contains the prior information about the disease, i.e. genes known to be associated with the disease, and a set C of candidate genes to be prioritized, extracted from the linkage interval of that particular disease. Additionally, disease similarity information can also be utilized to improve the accuracy of the calculations. Details about how this similarity values are computed and incorporated into existing methods are discussed in the following sections. The overall objective is to calculate a score α(v) for each protein v C that represents the likelihood of the candidate protein to be associated with that disease. This score is computed by calculating the network proximity between v to the proteins in S by utilizing a variety of methods. Candidate proteins are then ranked according to these scores and new 8

19 disease associated genes are proposed for those diseases. 3.2 Existing Methods Based on PPI Networks As mentioned before, several approaches have been proposed to tackle this problem in the recent years. These algorithms utilize different algorithmic techniques and integrate a variety of different data sources. There is a wide range of methods based on the analysis on the topological properties of the PPI data, which can be classified into three main categories; (i) localized methods, i.e. methods based on direct interactions and shortest paths, (ii) methods based on random walks and (iii) methods based on network propagation. In the following sections, we present these methods as well as their limitations and formulate them since they are used as reference methods in our experiments. Afterwards, we explain how the disease similarity information is integrated to the calculations in these experiments Localized Methods Lage et al. [4] developed a method based on the observation that genes that have a role in similar diseases tend to interact heavily with each other in the PPI network. For a given set of candidate genes extracted from the linkage interval of a particular disease, they identify candidate complexes for each gene. These complexes consist of interacting partners of the candidate genes. Among the interacting partners, genes that are associated with the disease of interest or a similar disease are then identified and a score is computed for each complex based on the association level of their elements to the disease of interest. Candidate genes are then ranked according to the score of the complexes they make up. In order to represent this approach, we compute the proximity based on direct 9

20 interactions as follows: For each protein v in the candidate set, we calculate the proximity score α di (v), of v C to the seed set S of a disease as the total number of direct interactions of a protein with disease genes. Namely, for each protein v C we calculate the proximity score as α di (v) = n : n (N(v) S). Clearly, this approach is vulnerable to missing interactions and false positives, which are known to exist in vast amount in the publicly available PPI databases. Another major limitation of this method is that it does not take into account indirect interactions among proteins. However, previous work [36] show that proteins that do not directly interact, but share neighbors or lie close in the network tend to have similar biological functions and participate in common pathways. This limitation is partly addressed by methods that make use of network distance in terms of shortest paths. These methods utilize network distances by taking the average of the shortest distance of protein v C to each protein u S. Let π vu denote the length of the shortest path connecting proteins v and u in network G. Then, α sp (v) =( u S π vu)/ S. Despite this improvement, considering the shortest distance between two proteins and ignoring the possible alternate paths is still flawed. In previous studies, it has been shown that multiple alternate paths between proteins imply stronger functional association [37] and improve the robustness of these networks to mutations [38]. This idea is illustrated in Figure 3.1. Instead of these localized methods, global approaches that utilize the whole topology and multiple alternate paths in the network are favorable in assesing networkbased functional associations between proteins more accurately. Global methods are more robust to missing interactions and false positives in PPI data. In the next two sections, we discuss two different methods of this kind. 10

21 S T (a) S T S T (b) (c) Figure 3.1: Motivating example for consideration of multiple paths in assessing functional association between proteins. Although the distance of the source and target nodes is 2 in all three subgraphs, we expect the functional association of S and T to be greater in (b), compared to (a) since there are two connecting paths. On the other hand, the common neighbor in (c) is most likely a hub protein, thus may not imply any functional association between S and T because hub proteins are most likely involved in many different functions Random Walk with Restarts Random walk with restarts is a technique based on the simulation of a random walker, to compute the proximity between two nodes by exploiting the global structure of the network [39, 40]. This approach is used in a wide range of applications, including functional module identification [41] and candidate disease gene prioritization [5, 6]. The steps that take place in this simulation are as follows: Simulation of random walk starts from one of the proteins in S selected with the probability of 1/ S. At each step, random walk either continues with one of the interacting partners N(v) of the current protein v with probability 1 r or it jumps back to one of the seed proteins with probability r. Here, r is a user-defined parameter that is referred to as restart probability, where 0 r 1. Note that, r =0 corresponds to the random walk model used for assessing network centrality by 11

22 Google s page rank algorithm [42]. The next protein u N(v) to move from the current protein v at each step is selected uniformly at random with probability P (u, v) = 1/ N(v). After an infinitely many number of steps, the probability of being at a node provides a measure of the proximity of the candidate proteins to the seed set by taking into account the whole topology of the PPI network. In practice, this proximity score is computed by the following iterative formula; α t+1 =(1 r)pα t + rρ (3.1) where ρ denotes the restart vector with ρ(u) = 1/ S for each u S and P denotes the stochastic matrix derived from G =(V,E), i.e. P (u, v) =1/ N(v) if vu E. The elements in the resulting vector α represent the proximity of each protein to the proteins in the seed set. To put it another way, computed proximity measure is an estimation of the percentage of the time spent on a node after a sufficiently large number of steps in the simulation process. Consequently, this algorithm favors highly connected proteins located in centralized positions in the network topology. In order to compare the random walk with restarts to localized methods in terms of their ability to capture functional similarities among proteins, we compute the correlation of functional association of protein pairs to their network proximity using shortest distance and random walk with restarts separately. We use two sets of protein pairs, namely (i) OMIM pairs and (ii) all pairs. OMIM pairs represent pairs of proteins that are associated with the same disease. This data is extracted from the publicly available Online Mendelian Inheritance in Man (OMIM) [43] database. Details about this dataset are given in Results section. On the other hand, the other 12

23 Correlation with functional similarity human ppi all pairs shortest distance random walk with restarts OMIM pairs Figure 3.2: Comparison of shortest distance and random walk with restarts in terms of capturing the functional association between proteins. OMIM pairs include all protein pairs that are associated with the same disease whereas all pairs is the set of all other pairs in the PPI network. dataset is composed of all possible protein pairs in the PPI data, except for those in the OMIM pairs. In this experiment, functional similarity of proteins is calculated using an information content based similarity measure that uses similarity of Gene Ontology (GO) terms that are associated with these proteins, by also considering the hierarchical relationship of the GO terms [44]. As illustrated in Figure 3.2, the method based on random walk with restarts captures the functional association of proteins better compared to shortest distance, for all pairs of proteins, as well as those that are implicated in the same disease. Correlation values are much higher in OMIM pairs since proteins involved in same disease usually share more biological functions. This observation clearly represents the advantage of using global methods over localized approaches for capturing the proximity of proteins in the network. 13

24 3.2.3 Network Propagation In recent work, Vanunu and Sharan [8] propose a global network based propagation algorithm to compute the proximity of candidate proteins to disease sets. They define a prioritization function which models simulation of an information pump that originates at the seed sets. Although in practice this idea is very similar to simulating random walk with restarts, one of the key points of their algorithm is in the derivation of the weight matrix P from the adjacency matrix of the PPI network. Here, the idea is to normalize the weight of each edge by the degrees of both of its end points (as opposed to only one end point as in random walk with restarts). The iterative formulation resulting from the propagation model is the following: α t+1 =(1 r)p α t + rρ (3.2) where P is the normalized propagation matrix. In other words, each entry P ij of this matrix is computed as P ij =1/ N(v i ) N(v j ) where v i,v j V. 3.3 Incorporating Disease Similarity Information As mentioned previously, proteins associated with similar diseases or phenotypes lie in close proximity in the PPI network. This brings the idea of utilizing disease similarity information in to identification of disease genes [4, 8]. Driel et al. [45] incorporate disease similarity information using text mining with an algorithm that can be summarized as follows: First, each OMIM record is parsed and the keywords are searched for existence in the anatomy (A) and the disease (C) sections of the Medical Subject Headings Vocabulary (MeSH), which is a controlled vocabulary of U.S. National Library 14

25 of Medicine. MeSH is especially useful for applications that use information that contains different terminology for the same concepts. Each OMIM record is then represented by binary vectors where each entry of the vector corresponds to the existence of a term in that record. Similarity of two diseases is then computed by calculating the cosine of the angle between their representative vectors. After these calculations, a similarity score between 0 and 1 for every pair of OMIM diseases is available. In practice, this disease similarity information can be easily incorporated into the random walk with restart and network propagation models. This is achieved as follows: Let d c denote the disease of interest, θ = {d 1,d 2,d 3,..., d t } represent the set of other diseases for which information is available and φ(d i,d j ) denote the similarity between diseases d i and d j. Also, let S i denote the set of genes associated with disease d i. Note that, these sets are not always disjoint, i.e. there are some genes that are associated with multiple diseases. For the disease of interest d c, these methods consider other diseases d i in θ such that φ(d c,d i ) >σwhere σ is a user-defined threshold that is used to decide which diseases are considered similar. Now since vector ρ in Equations 3.1 and 3.2 represent the prior information on the association between each gene and the disease, ρ(v) for each v V can be set to the similarity of the disease that involves gene v and the disease of interest d c. In other words: First, we set ρ(u) =1foreachu S and ρ(v) =max{φ(d c,d i )} for each v S i where d i D(c) andd(c) represents all diseases that are similar to d c.inother words, D(c) ={d i : φ(d c,d i ) σ} Then, we normalize the vector ρ such that ρ = ρ/ ρ 1. 15

26 Observe that, if a protein is a member of more than one similar disease to the current disease, its apriori association with disease is set to the maximum among the similarity values of the diseases it is associated with. Please note that, Vanunu and Sharan [8] use a logistic function to represent the reliability of the disease similarity values. However, for simplicity, we use a user defined similarity threshold value as discussed above and incorporate disease similarity in the same manner for all models we compare. 3.4 Statistical Adjustment Detailed performance analyses for these existing approaches are presented in the next chapter. But to motivate our key approach, we evaluate here the performance of these methods with respect to degree distribution. As seen in Figure 3.3, existing global methods such as random walk with restarts and network propagation are clearly biased with the degree of the candidate proteins. Here, the performance measure is the average rank of the true candidate protein among other 99 proteins in the same linkage interval. Details about the experimental setting are explained in the next chapter. As seen in the figure, existing global methods work very well in predicting correctly highly connected proteins, whereas they perform poorly for loosely connected proteins, especially for those with a degree less than 6. This happens because highly connected proteins generally represent hub proteins that are easily reachable from every node in the network. Thus, the relative amount of time spent on these proteins are higher than loosely connected proteins in a simulation of random walks. Network propagation based method provides a slightly more balanced performance compared to the random walk based methods. This is achieved by normalizing all the edge weights by the degree of both of their incident nodes. However, its performance is still influenced heavily by node degrees. 16

27 Figure 3.3: (a)the effect of degree to the performance of both existing global approaches is shown. x-axis is the degree range while y-axis represents the average rank for the true disease genes. (b)histogram of the degree of disease genes. (c) Histogram of the degree of all genes. One should note that loosely connected proteins generally represent less studied proteins, thus inferring knowledge about these proteins can be more interesting for many applications in terms of generating novel biological knowledge. Therefore, these methods should be improved in a way to perform as good for these loosely connected proteins. In this thesis, we systematically tackle this problem and try to remove the bias of these methods with respect to the degree of candidate proteins by using statistical significance schemes. We show that our adjustment schemes clearly outperform existing 17

28 methods especially in identifying loosely connected proteins. However, as we present in the next chapter, these schemes may overcorrect for some of the highly connected proteins, thus causing these methods to be biased in the opposite direction. We get around this problem by developing several uniform prioritization methods, which are explained in detail in the following section. We now propose three alternate methods for statistical adjustment of disease association scores obtained by random walk with restarts Reference Model Based on Seed Degrees The objective here is to remove the bias by generating a reference model that captures the degree distribution of seed proteins accurately. We basically compare the proximity values achieved for each protein with respect to the values achieved by using random seed sets (by conserving the degree distribution of the seed elements). False positives that correspond to centralized and highly connected proteins are expected to have high proximity values even with respect to these randomly generated seed sets. This implies that their high proximity with respect to the actual seed set is most likely artificial, thus the significance of their proximity scores is low. On the other hand, loosely connected proteins that lie close to the seed proteins will be assigned high significance although their proximity scores are generally lower than that of highly connected proteins. Given a seed set S, and a candidate set of proteins C, the steps involved in this process are the following: First, the proximity vector α S for the original seed set S is computed using Equation 3.1. Then, given the original seed set S, we generate a random instance S (i) that represents S in terms of degree distribution. S (i) is generated as follows: 18

29 First, a bucket B(u) is created for each protein u S. Then, each protein v V is assigned to bucket B(u) if N(v) N(u) < N(v) N(u ) for all u S and ties are broken randomly. Subsequently, S (i) is generated by choosing a protein from each bucket, uniformly at random. It can be observed that each protein in S is represented by exactly one protein in S (i), thus the total degree of proteins in S is expected to be very close to that of S (i). The proximity vector α (i) using each seed set S (i) is computed using Equation 3.1. This experiment is repeated n times, where n is sufficiently large (we use n = 1000 in our experiments) to obtain a sampling {α (1),α (2),α (3),..., α (n) } of proximity vectors by using seed sets that are representative of S in terms of the degree distribution and size. We then calculate the mean vector μ as, μ S = 1 i n α(i) /n and the standard deviation σ as, σ 2 S = 1 i n ((α(i) μ S )(α (i) μ S ) T )/(n 1). Significance scores for each protein is then computed as, z S (v) =(α(v) μ S )/σ S, which is used as the proximity measure to compare and rank the candidate proteins Reference Model Based on Candidate Degree In this reference model, instead of generating random seed sets, we assess the statistical significance of the proximity α S (v) ofaproteinv V to seed set S by comparing it to the proteins that have a similar degree to that of v. For a particular protein v, if other proteins with similar degrees usually have high proximity values for a given seed set, the significance of the proximity score of v can be considered low. On the other hand, low degree proteins are compared to other 19

30 low degree proteins, which potentially have lower proximity scores, thus the artificial advantage of the high degree proteins is removed. The procedure for generation of this reference model can be summarized as follows: First, we compute the original proximity vector α S with respect to the given seed set S, again by using Equation 3.1. Then, for each candidate protein v C, we sort all other proteins u V with respect to the value N(v) N(u) in ascending order. We create a new set M that contains the first n most similar proteins to v in terms of degree. We use n = 1000 in our experiments. Subsequently, we calculate the significance of the proximity score α S (v) of each candidate protein v as z(v) = (α S (v) μ(m))/σ(m) whereμ(m) = u M α S(u)/n and σ 2 (M) = u M (α S(u) μ(m))/(n 1). This adjusted score is used as the proximity measure to compare and rank the candidate genes Reference Model Using Eigenvector Centrality Here, we normalize the proximity of a protein with respect to the seed set by dividing it to the eigenvector centrality score of that node. The objective is to remove the bias introduced by network centrality, since central nodes are favored in the prioritization process in the existing global methods. Centrality score of a protein is computed again by simulating random walks, but this time with the restart probability set to 0. More formally, the proximity vector α that contains the proximities of each protein to the seed set is calculated as in Equation 1 with the restart probability r =0.3 (the reason to select this particular value is explained in detail in Results section), while the centrality score of each protein is calculated by setting r = 0. Indeed, setting r = 0, corresponds to the case where the seed set is empty, thereby making 20

31 the resulting proximity score a function of the node s network centrality. For each candidate protein v C, adjusted score z(v) is computed by dividing the proximity score by its centrality score, i.e. z(v) =α (r=0.3) (v)/α (r=0) (v). Finally each candidate is ranked according to this adjusted score, thus the artificial advantage of the centralized proteins over loosely connected proteins is minimized. 3.5 Uniform Prioritization As mentioned before, although these adjustment strategies work well in identifying loosely connected proteins, they tend to overcorrect highly connected disease proteins, as supported by the experimental results presented in the next chapter. Based on this observation, we propose several hybrid scoring schemes to combine the advantage of both raw proximity scores and statistically adjusted scores. The objective is to obtain a uniform prioritization method that utilizes the adjusted scores only for those candidate proteins that are loosely connected, and use the raw proximity scores for highly connected candidate proteins. Let R raw (i) denote the rank calculated by using random walk with restarts and R adj (i) denote the rank achieved by using one of the statistical adjustment schemes for gene v i C. We propose three strategies to obtain a new ranking by merging these two ranking lists which are described in the following discussion. Based on the degree of candidate genes: We create another list, such that for each candidate gene v i C, if N(v) > λ we use R raw (i) and otherwise we use R adj (i), where λ is a user defined threshold to distinguish high and low degree proteins. Thus the combined rank consists of rankings taken from these two different lists, depending on the degree of each candidate in the candidate set. If a candidate is loosely connected, it makes more sense to use its rank achieved by the adjusted scheme, on the other hand for a highly connected candidate, the rank achieved by 21

32 stddev/mean observed random number of disease genes Figure 3.4: The standard deviation of the degrees of genes implicated in the same disease, divided by their mean, with respect to the number of genes associated with that particular disease. raw proximity calculations is utilized. If the rankings of two proteins are identical in the combined list, their ranks in the unused lists are utilized to break the tie. Assume, for example, that a protein v with degree 2 is ranked as number 1 in R adj and 15 in R raw, whereas another protein u with degree 20 is ranked as number 30 in R adj and 1 in R raw,whereλ is set to 5. The merged list will contain the value 1 for each of these proteins. In the prioritization process on the merged list, v will be ranked higher compared to u since its rank in its unutilized list R raw is higher than the rank of u in R adj. Best of each rank: For each candidate gene v i C, we create another ranking list composed of max{r raw (i),r adj (i)}. Again, we break the ties by using the rankings in the unutilized lists for each candidate in the merged list. 22

33 Based on degree of known disease genes Based on the notion that some diseases are studied more in detail compared to other diseases, we expect the degrees of genes associated with similar diseases to be somewhat close to each other. In fact, this expectation is supported by experimental evidence, as shown in Figure 3.4. In this figure, we plot the standard deviation of the degrees of genes implicated in the same disease, divided by their mean. Results indicate that, compared to randomly generated sets of genes, degrees of proteins associated with the same disease are closer to each other. We take advantage of this observation by using it as an indicator of the degree of the true candidate gene among other candidates. If all the known proteins associated with a disease have high degrees, we favor high-degree proteins among the candidates for that particular disease and vice versa. Specifically, for each candidate gene u C, we compute the average degree of the genes for that particular disease and use that value to decide which ranking to utilize for all candidates. In other words, for each disease, we compute d =( v S N(v) )/ S. After this step, if d>λ,weuser raw for each candidate gene, and otherwise, we use R adj. Observe that, this approach is global, i.e. rankings used do not depend on the degree of each candidate gene, but the average degree of the seed set. On the other hand, the other two hybrid schemes presented earlier are local, i.e. ranking used for each candidate gene depends on the degree of that particular gene. 23

34 Chapter 4 Experimental Results In this chapter, we comprehensively evaluate the performance of the methods we presented in the previous chapter. We start by introducing our experimental setup and the performance metrics we use. Next, we compare the performance of existing methods in terms of these metrics. Subsequently, we show how the results are enhancing by incorporation of disease similarity information. Then, we illustrate the effect of statistical correction schemes in detail. Finally, we present the proposed uniform prioritization strategies and show that our final proposed hybrid method outperforms existing methods for disease gene prioritization. 4.1 Experimental Setting In order to evaluate and compare different methods in terms of prioritizing diseaseassociated genes, we apply leave-one-out cross-validation. For each protein that is associated with a disease, we conduct the following experiment: We remove a gene from the set of genes associated with a particular disease. We generate an artificial linkage interval, containing this removed disease-associated gene with other 99 genes, located nearest in terms of the genomic distance on 24

35 the same chromosome. Please note that, later in this chapter we investigate the effect of selection of the size of candidate sets to the performance. Using each of the methods described in the previous chapter, we score each of these candidate genes based on its network proximity to the seed set. Then, we sort the candidate genes with respect to these scores. The rank of the true gene among all other candidates is used as an indicator of the accuracy of the scoring method used. In order to systematically compare the performance of different methods, we use the following metrics: Average Rank: After repeating the above steps for all proteins associated with at least one disease, we calculate the average rank of the true proteins among the other candidates. This value is the percentage of the rank of the true protein among other candidates. ROC Curves: We also plot ROC Curves, i.e. sensitivity vs 1-specificity, by varying the rank threshold from 1 to 100. Sensitivity is defined as the percentage of the true genes ranked above the particular threshold where as specificity is the percentage of all candidate genes ranked below the threshold. Area under ROC curve (AUROC) is used as another performance measure to compare different methods. Percentage of the disease proteins ranked in top 1%: This value gives the percentage of the true disease proteins that were ranked as one of the proteins in the top 1% among the candidates in the prioritization experiments. Percentage of the disease proteins ranked in top 5%: This value gives the percentage of the true disease proteins that were ranked as one of the proteins in the top 5% among the candidates in the prioritization experiments. 25

36 4.1.1 Datasets Used For our experiments, we use the human PPI data extracted from NCBI Entrez Gene Database [46]. This database integrates interaction data from several other databases available, such as HPRD, BioGrid and BIND. After the removal of nodes with no interactions, we work on a PPI data composed of 8959 proteins and distinct interactions among these proteins. Weextract disease information fromonline Mendelian Inheritance in Man (OMIM) dataset as mentioned before. OMIM provides a publicly available comprehensive database of genotype-phenotype relationship in humans. We map genes associated with diseases to the PPI data we use and remove those diseases for which we are unable to map more than two associated genes. After this step, we have a total of 206 diseases with at least 3 associated genes. Number of genes associated with these diseases ranges from 3 to 36, with the average number of associations for each disease being approximately Existing Methods Here, we compare the performance achieved by the localized methods, i.e. shortest distances and direct interactions, to the global methods, i.e. network propagation and random walk with restarts. But first, we conduct an experiment to decide on the restart probability to use for the global methods Effect of the Restart Probability The restart probability represented by r in Equations 3.1 and 3.2 is a parameter that is used to adjust the preference between the importance of a protein with respect to the seed set and network topology. We conduct experiments using different values for this restart probability to observe its effect on the performance of the method based 26

37 Sensitivity r=0.01 r=0.1 r=0.3 r=0.5 r=0.7 r= Specificity Figure 4.1: ROC curves showing the effect of the restart probability on the performance of random walk with restarts. on random walk with restarts. As illustrated in Table 4.1 and Figure 4.1, unless a very small value is used, the selection of the restart probability does not effect the performance significantly. The performance degrades significantly for the value r =0.01, which is expected because the effect of the seed genes is minimized in that case, thus the proximity of a protein is calculated based primarily on the centrality of that protein. Table 4.1: The effect of the restart probability on the performance of disease prioritization based on random walk with restarts in terms of the average rank of the true candidate proteins as well as the area under ROC curves. Restart Prob. Avg. Rank AUROC

38 4.2.2 Comparison of Existing Methods As listed in Table 4.2, global methods clearly outperform the localized methods in terms of the average rank and AUROC values achieved. As expected, shortest distance performs better than the method that is based solely on direct interactions. On the other hand, network propagation and random walk with restarts perform very similarly, with network propagation showing some marginal improvement. Please note that disease similarity information is not utilized in these experiments. Based on these results, in the rest of this chapter, we only focus on global methods and eliminate the localized approaches at this stage. Table 4.2: Comparison of the existing methods in terms of average ranks of the disease genes among the candidates and area under ROC curves. Global methods clearly outperform localized approaches in terms of both of the metrics used. METHOD Avg. Rank AUROC Network propagation Random walk w/ restarts Shortest distances Direct interactions Effect of Disease Similarity Low similarity scores for disease pairs do not usually contain any valuable information, as mentioned in [45]. Thus, selection of the similarity threshold is an important task in our work. We perform experiments with different thresholds on disease similarity, by ignoring similarities below that particular threshold. Detailed results on this threshold analysis can be found in Figure 4.2 and Table 4.3. Our results are in line with previous results in terms of the optimal disease threshold to use. Using a low threshold (< 0.3) apparently adds false positives to the calculation thus decreases the accuracy. Similarly, using a large threshold 0.8, we do not utilize all similar diseases, thus missing some valuable information. Based on these results, the threshold value 28

39 is set to 0.3 for all methods based on network propagation and random walk with restarts. For the sake of completeness, we perform one additional test, to investigate the effect of disease similarity with similar disease proteins selected randomly. In other words, for the diseases identified to be similar using the threshold value set to 0.3, we select the associated proteins to the similar diseases randomly by conserving the number of associations for each disease. As expected, this performs even worse than adding no disease similarity information at all. The purpose for this experiment is to show that disease similarity information does not add any artificial advantage to the performance. In Figure 4.2, the ROC curve clearly demonstrate the advantage of using disease similarity information when the threshold is set to 0.3. Table 4.3: The effect of incorporating disease similarity on the method based on random walk with restarts. The optimal threshold for disease similarity measures is observed to be 0.3. As expected, if genes associated with similar diseases are selected randomly, the performance degrades compared to not using any disease similarity information. Method Avg. Rank AUROC No Disease Similarity Disease Similarity with thres Disease Similarity with thres Disease Similarity with thres Disease Similarity with thres Disease Similarity with thres Random Disease Similarity Effect of Statistical Adjustment As mentioned before, the performance of the global methods are highly biased with the degree of the true candidate protein. The effect of the degree for global methods is previously demonstrated in Figure 3.3. Especially for candidate genes with degree 5, existing global approaches perform very poorly in terms of identifying the true 29

40 Sensitivity no disease similarity disease similarity using threshold 0.3 random disease similarity Specificity Figure 4.2: ROC curves for the method using optimal disease similarity threshold 0.3, no disease similarity and random disease similarity are plotted. As expected, incorporating disease similarity information improves the performance where as if genes associated with similar diseases are selected randomly, this degrades the performance. disease gene. However, candidate genes with degree 5 make up 51% of all known disease genes. Since known disease genes are likely to be studied more, they are also likely to have more known interactions. To this end, accurate ranking of low-degree genes is very important in terms of being able to extract new information on relatively less studied genes. Therefore, we tackle this problem by applying the statistical adjustment strategies described in the previous chapter by using the random walk with restarts as the base method. We conduct separate experiments by using all disease proteins as well as disease proteins with degree 5 only. By examining the Figures 4.3, 4.4 and Table 4.4, one can easily see the impact of our correction schemes. Proposed models significantly outperform existing methods especially when dealing with loosely connected proteins. However, this comes at a price of overcorrecting some highly connected disease pro- 30

41 Sensitivity random walk with restarts network propagation model based on candidate degree model based on centrality model based on seed degrees Specificity Figure 4.3: ROC curves of statistical correction methods. All three adjustment models work better compared to the existing global methods. teins. Although the correction schemes outperform existing methods even when we use all disease proteins, it is clear that these methods can even perform better by applying hybrid methods, as we demonstrate in detail in the following section. 4.4 Uniform Prioritization As shown, the proposed statistical schemes perform better for loosely connected proteins, however, existing global methods perform somewhat better for highly connected proteins. In this section, we evaluate the performance when we combine these two approaches as discussed in the previous chapter. We first examine the average rank and AUROC values for different hybrid methods we use, by combining our three statistical adjustment methods using three different hybrid models explained in depth in the previous chapter. Thus, we present uniform 31

42 Sensitivity random walk with restarts network propagation Model based on candidate degree Model based on centrality Model based on seed degrees Specificity Figure 4.4: ROC curves of the performance of proposed statistical correction schemes and extant prioritization methods in prioritizing known disease genes with degree 5. prioritization results of nine different methods here. In these experiments, we set the high-degree threshold λ, to 5. For convenience, we refer to the hybrid methods in the tables as explained below: Hyb1: Hybrid method based on degree of candidate genes. Hyb2: Hybrid method based on best of each rank. Hyb3: Hybrid method based on average degree of seed genes. As shown in Tables 4.5 and 4.6, hybrid models that combine raw proximity scores with statistical adjustments based on degree of the candidates outperform other adjustment techniques. We choose the hybrid method based on degree of candidate genes (Hyb1) that combines the raw scores with our adjustment scheme based on candidate degrees as our final proposed method, since this approach provides the best accuracy in correctly predicting the disease protein as the top candidate (21.7%) as well as the highest 32

43 Table 4.4: The effect of statistical adjustment on performance. Average Rank of the true disease genes and AUROC values are listed. For the sake of demonstrating our point, we list disease genes with degree 5and > 5. For the highly connected proteins, the performance of existing global methods are comparable to statistical correction schemes. On the other hand, for the loosely connected proteins, the proposed schemes provide significant improvement. As evident, compared to existing methods, these correction schemes are less affected by the degree of the of the true candidate gene. All Genes Degrees 5 Degrees> 5 Method Avg. Rank AUROC Avg. Rank AUROC Avg. Rank AUROC Network Propagation Random walk w/ restarts Based on seed degree Based on candidate degree Basedoncentrality Table 4.5: Performance of proposed uniform prioritization methods for all combinations of statistical adjustment and hybrid scoring schemes. Candidate deg. Seed deg. Centrality Avg. Rank AUROC Avg. Rank AUROC Avg. Rank AUROC Hyb Hyb Hyb average rank of the true candidate gene (23.22). Table 4.6: Percentage of disease proteins ranked in top 1% and in top 5% among other candidates. The most accurate method is the one based on degree of candidate proteins (hyb1) by using the adjustment scheme that utilizes eigenvalue centrality. Candidate deg. Seed deg. Centrality Hyb1 Hyb2 Hyb3 Hyb1 Hyb2 Hyb3 Hyb1 Hyb2 Hyb3 Perc. ranked in top 1% Perc. ranked in top 5% Our proposed hybrid method outperforms both of the existing global approaches in terms of average rank values achieved, AUROC values and accuracy in prediction, i.e. the percentage of disease proteins ranked as the top 1% and 5%. These final results comparing the proposed method with existing global approaches are shown in Figures 4.5, 4.6 and Table

44 Sensitivity random walk with restarts network propagation proposed hybrid method Specificity Figure 4.5: ROC curves to compare the proposed method with existing global approaches. Table 4.7: Comparison of the proposed method with existing global approaches in terms of different performance metrics. Proposed method outperforms others in terms of all metrics used. METHOD Avg. Rank AUROC Perc. Ranked in top 1% Perc. Ranked in top 5% Proposed Hybrid Method Network propagation Random walk w/ restarts Size of the Candidate Set As mentioned before, linkage analysis provides up to several hundreds of candidate genes for diseases. Thus, here we investigate the effect of candidate set size on the performance of our proposed method as well as the raw method based on random walk with restarts. As illustrated in Figure 4.7, the proposed method outperforms existing approaches for all different candidate set sizes tested. Please note that after a slight increase for the size 60, the performance of both methods do not change much for different candidate numbers up to

45 60 average rank of disease genes random walk with restarts network propagation adjusted by candidate degree proposed hybrid model degree range of disease genes Figure 4.6: The effect of degree to the performance of both existing global approaches, adjusted method based on candidate degrees as well as the proposed hybrid model are shown. 4.5 Case Examples In this section, we provide two real examples to show the power of our proposed hybrid method that uses statistical adjustment on existing global approaches in terms of capturing the disease protein better especially when it is loosely connected Example 1: The first example is for the disease called Azoospermia. This disease has 3 proteins directly associated with it in our PPI network, namely USP9Y, SYCP3 and SCP3. In our cross validation experiments, we remove USP9Y and try to predict that using the other two proteins. Existing methods fail to identify USP9Y because it is not a centralized protein with a degree of only 1. Thus, random walk with restarts model ranks this true protein as 57 th and network propagation ranks it as 27 th among 100 candidates. On the other hand, our method was able to correctly rank this protein as the 1 st candidate. This 35

46 random walk with restarts proposed hybrid method rank of the disease gene (%) number of candidate genes Figure 4.7: Performance comparison for different number of candidate proteins in terms of the average rank of the true disease protein among the candidates. case is illustrated in Figure 4.8 where we draw the 2-neighborhood of proteins USP9Y, SYCP3 and SCP3. Proteins in the seed set as well as the proteins associated with similar diseases are shown as green circles. The intensity of these green circles are proportional to the disease similarity between the disease of interest and the disease they are associated with. Subsequently, USP9Y and SYCP3 are the darkest green proteins. For this example, none of the diseases are identified as significantly similar to Azoospermia so the proteins used are from the seed set only. True candidate that we try to capture is represented by a yellow circle. The protein that is incorrectly ranked as the 1 st among the candidates using both of the existing global approaches is represented by a diamond. As expected, this incorrectly identified protein is a centralized node with a high degree connected to other centralized proteins. 36

47 4.5.2 Example 2: The second example is for the disease called Microphthalmia. This disease also has 3 proteins directly associated with it in our PPI network, namely SIX6, CHX10 and BCOR. In our cross validation experiments, we remove SIX6 and try to predict that using the other two proteins. Again, existing methods fail because SIX6 is not a centralized protein with a degree of only 1. Thus, random walk with restarts model ranks this true protein as 26 th and network propagation ranks it as 16 th among 100 candidates. Again, our method was able to correctly rank this protein as the 1 st candidate. This case is illustrated in Figure 4.9 where we draw the 2-neighborhood of proteins SIX6, CHX10 and BCOR. Proteins in the seed set as well as the proteins associated with similar diseases are shown as green circles. The intensity of these green circles are proportional to the disease similarity between the disease of interest and the disease they are associated with. Subsequently, CHX10 and BCOR are the darkest green proteins. Similar to the previous example, the protein that is ranked incorrectly as the 1 st among the candidates using both of the existing global approaches is represented by a diamond. Again, the incorrectly identified protein is a centralized node with a high degree connected to other centralized proteins. 37

48 Figure 4.8: Case example for the Azoospermia disease. 2-neighborhood of the disease genes is shown. Proteins in the seed set are shown as green circles whereas the true candidate that we try to capture is represented as a yellow circle. Diamond represents the protein that is incorrectly ranked first for both of the existing global approaches. 38

49 Figure 4.9: Case example for the Microphthalmia disease. 2-neighborhood of the disease genes is illustrated. Proteins associated with the disease of interest are illustrated as green circles where the intensity is proportional to the similarity of the disease that protein is associated with, to the disease of interest. The true candidate that we try to capture is represented as a yellow circle. Diamond represents the protein that is incorrectly ranked first for both of the existing global approaches. 39

Role of Centrality in Network Based Prioritization of Disease Genes

Role of Centrality in Network Based Prioritization of Disease Genes Sinan Erten 1 and Mehmet Koyutürk 1,2 Case Western Reserve University (1)Electrical Engineering & Computer Science (2)Center for Proteomics