NETWORK BASED PRIORITIZATION OF DISEASE GENES

Size: px
Start display at page:

Download "NETWORK BASED PRIORITIZATION OF DISEASE GENES"

Transcription

1 NETWORK BASED PRIORITIZATION OF DISEASE GENES by MEHMET SİNAN ERTEN Submitted in partial fulfillment of the requirements for the degree of Master of Science Thesis Advisor: Mehmet Koyutürk Department of Electrical Engineering and Computer Science CASE WESTERN RESERVE UNIVERSITY January, 2010

2 CASE WESTERN RESERVE UNIVERSITY SCHOOL OF GRADUATE STUDIES We hereby approve the thesis/dissertation of candidate for the degree *. (signed) (chair of the committee) (date) *We also certify that written approval has been obtained for any proprietary material contained therein.

3 To my family

4 Contents List of Tables iii List of Figures iv 1 Introduction 1 2 Background Genetic Disorders Computational Disease Gene Prioritization Protein-Protein Interaction Networks Using Networks to Better Understand Diseases Problem Definition and Methods Problem Definition Existing Methods Based on PPI Networks Localized Methods Random Walk with Restarts Network Propagation Incorporating Disease Similarity Information Statistical Adjustment Reference Model Based on Seed Degrees Reference Model Based on Candidate Degree i

5 3.4.3 Reference Model Using Eigenvector Centrality Uniform Prioritization Experimental Results Experimental Setting Datasets Used Existing Methods Effect of the Restart Probability Comparison of Existing Methods Effect of Disease Similarity Effect of Statistical Adjustment Uniform Prioritization Size of the Candidate Set Case Examples Example 1: Example 2: Conclusion 40 Bibliography 42 ii

6 List of Tables 2.1 Shortest distance and direct interactions of disease pairs vs. random pairs The effect of the restart probability on the performance of random walk with restarts Comparison of the existing methods in terms of average ranks of the disease genes among the candidates and area under ROC curves. Global methods clearly outperform localized approaches in terms of both of the metrics used The effect of incorporating disease similarity The effect of statistical adjustment on performance Performance of proposed uniform prioritization methods for all combinations of statistical adjustment and hybrid scoring schemes Percentage of disease proteins ranked in top 1% and in top 5% among other candidates Comparison of the proposed method with existing approaches iii

7 List of Figures 3.1 Motivating example for consideration of multiple paths in assessing functional association between proteins Random walk with restarts vs. shortest distance in capturing functional association The effect of degree in global prioritization methods and the histogram of degrees of genes The standard deviation of the degrees of genes implicated in the same disease, divided by their mean ROC curves showing the effect of the restart probability on the performance of random walk with restarts The effect of incorporating disease similarity ROC curves of statistical correction methods ROC curves of the performance of proposed statistical correction schemes and extant prioritization methods in prioritizing known disease genes with degree Comparison of proposed method with existing approaches The effect of degree to the performance of both existing global approaches, adjusted method based on candidate degrees as well as the proposed hybrid model Performance comparison for different number of candidate proteins.. 36 iv

8 4.8 Case Example for the Azoospermia disease Case Example for the Microphthalmia disease v

9 Acknowledgements I would like to thank to my research advisor Mehmet Koyutürk for the initial idea and for his endless support during this work. vi

10 Network Based Prioritization of Disease Genes Abstract by MEHMET SİNAN ERTEN High-throughput molecular interaction data have been used effectively to prioritize candidate disease genes, based on the notion that genes associated with similar diseases are likely to interact with each other heavily in a network of protein-protein interactions (PPIs). An important challenge for these applications is the noisy nature of PPI data. Existing global methods alleviate these problems to a certain extent, by considering indirect interactions and multiplicity of paths. However, such methods favor highly connected genes, making prioritization sensitive to the degree distribution of PPI networks. Here, we propose several statistical adjustment schemes that aim to minimize these problems. While the proposed schemes are very effective in detecting loosely connected disease genes, they over-correct the highly connected genes. Motivated by these results, we develop hybrid models that integrate existing approaches with the proposed adjustment schemes. Experimental results show that proposed hybrid models significantly outperform existing methods in prioritizing disease genes.

11 Chapter 1 Introduction Identification of genes associated with diseases remains a challenge. Large scale experimental methods such as linkage analysis and genetic association studies generate large sets of candidates containing up to 300 genes [1]. Investigation of these candidates based on sequencing is an expensive task, thus not always a feasible option. Consequently, computational methods are primarily used to prioritize and identify the most likely disease-associated genes among the candidates by utilizing a variety of data sources such as protein-protein interaction (PPI) networks, gene expression data, functional annotations etc. With recent advances in biotechnology, availability of high-throughput biological data enables investigation of biological processes from a systems perspective. Biological processes are often orchestrated through interaction of multiple molecules. Considered together, these interactions form complex biological networks that underlie cellular organization [2]. Computational analysis of the structure of these networks provide significant insights into the mechanisms that drive complex biological systems [3]. In recent years, several attempts have been made to utilize the topological analyses of the PPI data in understanding human diseases [4 6]. These attempts mostly focus 1

12 on prioritization of candidate genes and they are mainly based on the notion that the products of genes associated with similar diseases are likely to be close to each other and interact heavily in a network of PPIs. However, an important challenge for these applications is the incomplete and noisy nature of the PPI data [7]. Vast amount of missing interactions and false positives effect the accuracy of methods based on local network information such as direct interactions and shortest distances. Few global methods based on simulation of random walk [5,6] and network propagation [8] get around this problem to a certain extent by considering multiple alternate paths and whole topology of PPI networks. In this work, we first investigate the performance of existing methods with respect to several parameters including the degree distribution of candidate genes, disease similarity threshold when incorporating similar diseases to calculations and the restart probability to be used in the model based on random walks. We show that existing algorithms have a rich gets richer effect on identification of new disease genes, in that highly connected genes are likely to be favored against loosely connected disease genes. We present new adjustment strategies based on statistical significance for the problem of computational disease gene prioritization and apply these strategies to enhance existing methods. We show that the proposed correction schemes are very effective in detecting loosely connected disease genes which are generally less studied, thus potentially more interesting for many applications. However, we observe that these schemes might overcorrect the highly connected disease genes, thus might perform less favorably in identifying them. Motivated by these results, we develop several uniform prioritization methods that effectively integrate existing algorithms with the proposed statistical adjustment schemes. Comprehensive experimental results indicate that the resulting hybrid schemes significantly outperform existing approaches in correctly identifying disease-associated genes. This thesis is organized as follows: In Chapter 2, we give a brief biological back- 2

13 ground in the relevant issues and prepare the reader for the rest of the chapters. In Chapter 3, we formally define the problem and discuss the existing approaches based on network in depth, followed by the presentation of proposed statistical adjustment schemes. Subsequently, we define our uniform prioritization strategies using these schemes. In Chapter 4, we first introduce the experimental setup and the performance evaluation metrics used, followed by the illustration of the effect of disease similarity and statistical adjustment on the performance of existing methods. We then present the results achieved by the uniform prioritization methods we presented. Finally in Chapter 5, we conclude the work. 3

14 Chapter 2 Background 2.1 Genetic Disorders A genetic disorder is a disease caused by changes and mutations that occur in a single gene or multiple genes in the genome. Some examples to well known singlegene disorders are; sick cell anemia, Marfan syndrome, Huntingtons disease, and hereditary hemochromatosis. On the other hand, some diseases such as heart disease, high blood pressure, Alzheimers disease, arthritis, diabetes, cancer, and obesity are complex diseases that require the interplay of multiple genes. Identification of the genes associated with the latter kind of diseases is a bigger challenge, since the impact of each gene involved can be minimal and difficult to identify separately. 2.2 Computational Disease Gene Prioritization Identification of disease genes is important to better understand gene functions as well as to improve medical care [9]. Genome-wide linkage and association studies in healthy and affected populations provide chromosomal regions that most likely contain disease-associated genes [10, 11]. Manual identification of most relevant genes among hundreds of candidates by utilizing a wide variety of data sources is an ex- 4

15 pensive and time consuming task. Consequently, computational methods based on sequence similarity [12 14], functional annotations [13 16], gene expression analysis [4, 14] and PPI [4 6, 8, 17 20] data are proposed so far. PROSPECTR [12] is an approach that is based on the intrinsic properties of candidate genes. In this method, candidate genes are prioritized by relying on the observation that genes associated with similar diseases are likely to share patterns of sequence based features. These features include the physical structure of the genes and their products as well as evolutionary information such as mutation rates. On the other hand, methods such as ENDEAVOUR [14], TOM [16] and POCUS [15] rely on the similarity of candidate genes to genes that are already associated with diseases, in terms of the functional annotations, gene expression profiles, regulatory information, etc. However, the scope of methods that rely on functional annotations is limited because only a small fraction of genes in the genome are currently annotated. Moreover, signals inferred from gene expression profiles are not easily utilized especially for diseases caused by multiple genes, where the impact of each contributor gene can be minimal. There are also other methods based on phenotypic analysis that utilize animal models which are later mapped to human data [21, 22]. 2.3 Protein-Protein Interaction Networks In this study, we primarily focus on PPI networks that model physical interactions and functional relationships between proteins. These interactions are captured via a variety of experimental and computational methods [23 25]. Reconstructing a reliable and complete protein interaction network for different species is one of the most important challenges in molecular biology. Detailed analyses on these networks are shown to be very important in understanding cells and diseases on a systemwide level. Despite the emergence of high-throughput technology, reconstruction of 5

16 a complete network is still far from a realistic goal. Existing public PPI databases only cover a fraction of these interactions. Consequently, compilations of data from several sources are used in many applications. PPI networks are often abstracted by graph models, in which the proteins are represented by nodes and the interactions (often physical) among them are represented by undirected edges. This abstraction enables application of graph theoretical approaches to the analysis of cellular organization. Existing applications of these analyses include functional module identification [26 28], prediction of functions of proteins [29, 30], identification of conserved substructures [31 33], evolutionary analyses [34, 35] and more. The latter analyses provide insights about the conservation of functional modules in the evolutionary timeline and how these modules can be utilized to reconstruct the phylogenetic tree of species. 2.4 Using Networks to Better Understand Diseases As mentioned previously, proteins associated with similar diseases are generally close to each other and interact frequently compared to random pairs in a network of PPIs. Indeed, the statistics shown in Table 2.1 clearly support this notion. The table compares the average shortest distances as well as the percentage of direct interactors among random protein pairs and among proteins that are associated with the same disease, based on a comprehensive human PPI network. It is evident from the table that disease pairs are significantly closer to each other in terms of network distance and interact more frequently, compared to pairs selected randomly across the network. Table 2.1: Average shortest distance and percentage of direct interactors among proteins associated with the same disease vs. random protein pairs in the PPI network. Avg. Shortest Distance Perc. of Direct Interactors Disease pairs Random pairs

17 Significant insights about disease-protein associations can be inferred using this notion based on PPI networks [4 6,8,17 20]. We discuss these approaches in a more detailed way in the following chapters. 7

18 Chapter 3 Problem Definition and Methods 3.1 Problem Definition A PPI network G =(V,E) consists of a set of nodes V and a set of undirected interactions E between these proteins where uv E represents an interaction between proteins u V and v V. We also define the set of interacting partners of a protein v V as N(v) ={u V : uv E}. The input to a candidate disease gene prioritization problem consists of a PPI network G, a seed set S which contains the prior information about the disease, i.e. genes known to be associated with the disease, and a set C of candidate genes to be prioritized, extracted from the linkage interval of that particular disease. Additionally, disease similarity information can also be utilized to improve the accuracy of the calculations. Details about how this similarity values are computed and incorporated into existing methods are discussed in the following sections. The overall objective is to calculate a score α(v) for each protein v C that represents the likelihood of the candidate protein to be associated with that disease. This score is computed by calculating the network proximity between v to the proteins in S by utilizing a variety of methods. Candidate proteins are then ranked according to these scores and new 8

19 disease associated genes are proposed for those diseases. 3.2 Existing Methods Based on PPI Networks As mentioned before, several approaches have been proposed to tackle this problem in the recent years. These algorithms utilize different algorithmic techniques and integrate a variety of different data sources. There is a wide range of methods based on the analysis on the topological properties of the PPI data, which can be classified into three main categories; (i) localized methods, i.e. methods based on direct interactions and shortest paths, (ii) methods based on random walks and (iii) methods based on network propagation. In the following sections, we present these methods as well as their limitations and formulate them since they are used as reference methods in our experiments. Afterwards, we explain how the disease similarity information is integrated to the calculations in these experiments Localized Methods Lage et al. [4] developed a method based on the observation that genes that have a role in similar diseases tend to interact heavily with each other in the PPI network. For a given set of candidate genes extracted from the linkage interval of a particular disease, they identify candidate complexes for each gene. These complexes consist of interacting partners of the candidate genes. Among the interacting partners, genes that are associated with the disease of interest or a similar disease are then identified and a score is computed for each complex based on the association level of their elements to the disease of interest. Candidate genes are then ranked according to the score of the complexes they make up. In order to represent this approach, we compute the proximity based on direct 9

20 interactions as follows: For each protein v in the candidate set, we calculate the proximity score α di (v), of v C to the seed set S of a disease as the total number of direct interactions of a protein with disease genes. Namely, for each protein v C we calculate the proximity score as α di (v) = n : n (N(v) S). Clearly, this approach is vulnerable to missing interactions and false positives, which are known to exist in vast amount in the publicly available PPI databases. Another major limitation of this method is that it does not take into account indirect interactions among proteins. However, previous work [36] show that proteins that do not directly interact, but share neighbors or lie close in the network tend to have similar biological functions and participate in common pathways. This limitation is partly addressed by methods that make use of network distance in terms of shortest paths. These methods utilize network distances by taking the average of the shortest distance of protein v C to each protein u S. Let π vu denote the length of the shortest path connecting proteins v and u in network G. Then, α sp (v) =( u S π vu)/ S. Despite this improvement, considering the shortest distance between two proteins and ignoring the possible alternate paths is still flawed. In previous studies, it has been shown that multiple alternate paths between proteins imply stronger functional association [37] and improve the robustness of these networks to mutations [38]. This idea is illustrated in Figure 3.1. Instead of these localized methods, global approaches that utilize the whole topology and multiple alternate paths in the network are favorable in assesing networkbased functional associations between proteins more accurately. Global methods are more robust to missing interactions and false positives in PPI data. In the next two sections, we discuss two different methods of this kind. 10

21 S T (a) S T S T (b) (c) Figure 3.1: Motivating example for consideration of multiple paths in assessing functional association between proteins. Although the distance of the source and target nodes is 2 in all three subgraphs, we expect the functional association of S and T to be greater in (b), compared to (a) since there are two connecting paths. On the other hand, the common neighbor in (c) is most likely a hub protein, thus may not imply any functional association between S and T because hub proteins are most likely involved in many different functions Random Walk with Restarts Random walk with restarts is a technique based on the simulation of a random walker, to compute the proximity between two nodes by exploiting the global structure of the network [39, 40]. This approach is used in a wide range of applications, including functional module identification [41] and candidate disease gene prioritization [5, 6]. The steps that take place in this simulation are as follows: Simulation of random walk starts from one of the proteins in S selected with the probability of 1/ S. At each step, random walk either continues with one of the interacting partners N(v) of the current protein v with probability 1 r or it jumps back to one of the seed proteins with probability r. Here, r is a user-defined parameter that is referred to as restart probability, where 0 r 1. Note that, r =0 corresponds to the random walk model used for assessing network centrality by 11

22 Google s page rank algorithm [42]. The next protein u N(v) to move from the current protein v at each step is selected uniformly at random with probability P (u, v) = 1/ N(v). After an infinitely many number of steps, the probability of being at a node provides a measure of the proximity of the candidate proteins to the seed set by taking into account the whole topology of the PPI network. In practice, this proximity score is computed by the following iterative formula; α t+1 =(1 r)pα t + rρ (3.1) where ρ denotes the restart vector with ρ(u) = 1/ S for each u S and P denotes the stochastic matrix derived from G =(V,E), i.e. P (u, v) =1/ N(v) if vu E. The elements in the resulting vector α represent the proximity of each protein to the proteins in the seed set. To put it another way, computed proximity measure is an estimation of the percentage of the time spent on a node after a sufficiently large number of steps in the simulation process. Consequently, this algorithm favors highly connected proteins located in centralized positions in the network topology. In order to compare the random walk with restarts to localized methods in terms of their ability to capture functional similarities among proteins, we compute the correlation of functional association of protein pairs to their network proximity using shortest distance and random walk with restarts separately. We use two sets of protein pairs, namely (i) OMIM pairs and (ii) all pairs. OMIM pairs represent pairs of proteins that are associated with the same disease. This data is extracted from the publicly available Online Mendelian Inheritance in Man (OMIM) [43] database. Details about this dataset are given in Results section. On the other hand, the other 12

23 Correlation with functional similarity human ppi all pairs shortest distance random walk with restarts OMIM pairs Figure 3.2: Comparison of shortest distance and random walk with restarts in terms of capturing the functional association between proteins. OMIM pairs include all protein pairs that are associated with the same disease whereas all pairs is the set of all other pairs in the PPI network. dataset is composed of all possible protein pairs in the PPI data, except for those in the OMIM pairs. In this experiment, functional similarity of proteins is calculated using an information content based similarity measure that uses similarity of Gene Ontology (GO) terms that are associated with these proteins, by also considering the hierarchical relationship of the GO terms [44]. As illustrated in Figure 3.2, the method based on random walk with restarts captures the functional association of proteins better compared to shortest distance, for all pairs of proteins, as well as those that are implicated in the same disease. Correlation values are much higher in OMIM pairs since proteins involved in same disease usually share more biological functions. This observation clearly represents the advantage of using global methods over localized approaches for capturing the proximity of proteins in the network. 13

24 3.2.3 Network Propagation In recent work, Vanunu and Sharan [8] propose a global network based propagation algorithm to compute the proximity of candidate proteins to disease sets. They define a prioritization function which models simulation of an information pump that originates at the seed sets. Although in practice this idea is very similar to simulating random walk with restarts, one of the key points of their algorithm is in the derivation of the weight matrix P from the adjacency matrix of the PPI network. Here, the idea is to normalize the weight of each edge by the degrees of both of its end points (as opposed to only one end point as in random walk with restarts). The iterative formulation resulting from the propagation model is the following: α t+1 =(1 r)p α t + rρ (3.2) where P is the normalized propagation matrix. In other words, each entry P ij of this matrix is computed as P ij =1/ N(v i ) N(v j ) where v i,v j V. 3.3 Incorporating Disease Similarity Information As mentioned previously, proteins associated with similar diseases or phenotypes lie in close proximity in the PPI network. This brings the idea of utilizing disease similarity information in to identification of disease genes [4, 8]. Driel et al. [45] incorporate disease similarity information using text mining with an algorithm that can be summarized as follows: First, each OMIM record is parsed and the keywords are searched for existence in the anatomy (A) and the disease (C) sections of the Medical Subject Headings Vocabulary (MeSH), which is a controlled vocabulary of U.S. National Library 14

25 of Medicine. MeSH is especially useful for applications that use information that contains different terminology for the same concepts. Each OMIM record is then represented by binary vectors where each entry of the vector corresponds to the existence of a term in that record. Similarity of two diseases is then computed by calculating the cosine of the angle between their representative vectors. After these calculations, a similarity score between 0 and 1 for every pair of OMIM diseases is available. In practice, this disease similarity information can be easily incorporated into the random walk with restart and network propagation models. This is achieved as follows: Let d c denote the disease of interest, θ = {d 1,d 2,d 3,..., d t } represent the set of other diseases for which information is available and φ(d i,d j ) denote the similarity between diseases d i and d j. Also, let S i denote the set of genes associated with disease d i. Note that, these sets are not always disjoint, i.e. there are some genes that are associated with multiple diseases. For the disease of interest d c, these methods consider other diseases d i in θ such that φ(d c,d i ) >σwhere σ is a user-defined threshold that is used to decide which diseases are considered similar. Now since vector ρ in Equations 3.1 and 3.2 represent the prior information on the association between each gene and the disease, ρ(v) for each v V can be set to the similarity of the disease that involves gene v and the disease of interest d c. In other words: First, we set ρ(u) =1foreachu S and ρ(v) =max{φ(d c,d i )} for each v S i where d i D(c) andd(c) represents all diseases that are similar to d c.inother words, D(c) ={d i : φ(d c,d i ) σ} Then, we normalize the vector ρ such that ρ = ρ/ ρ 1. 15

26 Observe that, if a protein is a member of more than one similar disease to the current disease, its apriori association with disease is set to the maximum among the similarity values of the diseases it is associated with. Please note that, Vanunu and Sharan [8] use a logistic function to represent the reliability of the disease similarity values. However, for simplicity, we use a user defined similarity threshold value as discussed above and incorporate disease similarity in the same manner for all models we compare. 3.4 Statistical Adjustment Detailed performance analyses for these existing approaches are presented in the next chapter. But to motivate our key approach, we evaluate here the performance of these methods with respect to degree distribution. As seen in Figure 3.3, existing global methods such as random walk with restarts and network propagation are clearly biased with the degree of the candidate proteins. Here, the performance measure is the average rank of the true candidate protein among other 99 proteins in the same linkage interval. Details about the experimental setting are explained in the next chapter. As seen in the figure, existing global methods work very well in predicting correctly highly connected proteins, whereas they perform poorly for loosely connected proteins, especially for those with a degree less than 6. This happens because highly connected proteins generally represent hub proteins that are easily reachable from every node in the network. Thus, the relative amount of time spent on these proteins are higher than loosely connected proteins in a simulation of random walks. Network propagation based method provides a slightly more balanced performance compared to the random walk based methods. This is achieved by normalizing all the edge weights by the degree of both of their incident nodes. However, its performance is still influenced heavily by node degrees. 16

27 Figure 3.3: (a)the effect of degree to the performance of both existing global approaches is shown. x-axis is the degree range while y-axis represents the average rank for the true disease genes. (b)histogram of the degree of disease genes. (c) Histogram of the degree of all genes. One should note that loosely connected proteins generally represent less studied proteins, thus inferring knowledge about these proteins can be more interesting for many applications in terms of generating novel biological knowledge. Therefore, these methods should be improved in a way to perform as good for these loosely connected proteins. In this thesis, we systematically tackle this problem and try to remove the bias of these methods with respect to the degree of candidate proteins by using statistical significance schemes. We show that our adjustment schemes clearly outperform existing 17

28 methods especially in identifying loosely connected proteins. However, as we present in the next chapter, these schemes may overcorrect for some of the highly connected proteins, thus causing these methods to be biased in the opposite direction. We get around this problem by developing several uniform prioritization methods, which are explained in detail in the following section. We now propose three alternate methods for statistical adjustment of disease association scores obtained by random walk with restarts Reference Model Based on Seed Degrees The objective here is to remove the bias by generating a reference model that captures the degree distribution of seed proteins accurately. We basically compare the proximity values achieved for each protein with respect to the values achieved by using random seed sets (by conserving the degree distribution of the seed elements). False positives that correspond to centralized and highly connected proteins are expected to have high proximity values even with respect to these randomly generated seed sets. This implies that their high proximity with respect to the actual seed set is most likely artificial, thus the significance of their proximity scores is low. On the other hand, loosely connected proteins that lie close to the seed proteins will be assigned high significance although their proximity scores are generally lower than that of highly connected proteins. Given a seed set S, and a candidate set of proteins C, the steps involved in this process are the following: First, the proximity vector α S for the original seed set S is computed using Equation 3.1. Then, given the original seed set S, we generate a random instance S (i) that represents S in terms of degree distribution. S (i) is generated as follows: 18

29 First, a bucket B(u) is created for each protein u S. Then, each protein v V is assigned to bucket B(u) if N(v) N(u) < N(v) N(u ) for all u S and ties are broken randomly. Subsequently, S (i) is generated by choosing a protein from each bucket, uniformly at random. It can be observed that each protein in S is represented by exactly one protein in S (i), thus the total degree of proteins in S is expected to be very close to that of S (i). The proximity vector α (i) using each seed set S (i) is computed using Equation 3.1. This experiment is repeated n times, where n is sufficiently large (we use n = 1000 in our experiments) to obtain a sampling {α (1),α (2),α (3),..., α (n) } of proximity vectors by using seed sets that are representative of S in terms of the degree distribution and size. We then calculate the mean vector μ as, μ S = 1 i n α(i) /n and the standard deviation σ as, σ 2 S = 1 i n ((α(i) μ S )(α (i) μ S ) T )/(n 1). Significance scores for each protein is then computed as, z S (v) =(α(v) μ S )/σ S, which is used as the proximity measure to compare and rank the candidate proteins Reference Model Based on Candidate Degree In this reference model, instead of generating random seed sets, we assess the statistical significance of the proximity α S (v) ofaproteinv V to seed set S by comparing it to the proteins that have a similar degree to that of v. For a particular protein v, if other proteins with similar degrees usually have high proximity values for a given seed set, the significance of the proximity score of v can be considered low. On the other hand, low degree proteins are compared to other 19

30 low degree proteins, which potentially have lower proximity scores, thus the artificial advantage of the high degree proteins is removed. The procedure for generation of this reference model can be summarized as follows: First, we compute the original proximity vector α S with respect to the given seed set S, again by using Equation 3.1. Then, for each candidate protein v C, we sort all other proteins u V with respect to the value N(v) N(u) in ascending order. We create a new set M that contains the first n most similar proteins to v in terms of degree. We use n = 1000 in our experiments. Subsequently, we calculate the significance of the proximity score α S (v) of each candidate protein v as z(v) = (α S (v) μ(m))/σ(m) whereμ(m) = u M α S(u)/n and σ 2 (M) = u M (α S(u) μ(m))/(n 1). This adjusted score is used as the proximity measure to compare and rank the candidate genes Reference Model Using Eigenvector Centrality Here, we normalize the proximity of a protein with respect to the seed set by dividing it to the eigenvector centrality score of that node. The objective is to remove the bias introduced by network centrality, since central nodes are favored in the prioritization process in the existing global methods. Centrality score of a protein is computed again by simulating random walks, but this time with the restart probability set to 0. More formally, the proximity vector α that contains the proximities of each protein to the seed set is calculated as in Equation 1 with the restart probability r =0.3 (the reason to select this particular value is explained in detail in Results section), while the centrality score of each protein is calculated by setting r = 0. Indeed, setting r = 0, corresponds to the case where the seed set is empty, thereby making 20

31 the resulting proximity score a function of the node s network centrality. For each candidate protein v C, adjusted score z(v) is computed by dividing the proximity score by its centrality score, i.e. z(v) =α (r=0.3) (v)/α (r=0) (v). Finally each candidate is ranked according to this adjusted score, thus the artificial advantage of the centralized proteins over loosely connected proteins is minimized. 3.5 Uniform Prioritization As mentioned before, although these adjustment strategies work well in identifying loosely connected proteins, they tend to overcorrect highly connected disease proteins, as supported by the experimental results presented in the next chapter. Based on this observation, we propose several hybrid scoring schemes to combine the advantage of both raw proximity scores and statistically adjusted scores. The objective is to obtain a uniform prioritization method that utilizes the adjusted scores only for those candidate proteins that are loosely connected, and use the raw proximity scores for highly connected candidate proteins. Let R raw (i) denote the rank calculated by using random walk with restarts and R adj (i) denote the rank achieved by using one of the statistical adjustment schemes for gene v i C. We propose three strategies to obtain a new ranking by merging these two ranking lists which are described in the following discussion. Based on the degree of candidate genes: We create another list, such that for each candidate gene v i C, if N(v) > λ we use R raw (i) and otherwise we use R adj (i), where λ is a user defined threshold to distinguish high and low degree proteins. Thus the combined rank consists of rankings taken from these two different lists, depending on the degree of each candidate in the candidate set. If a candidate is loosely connected, it makes more sense to use its rank achieved by the adjusted scheme, on the other hand for a highly connected candidate, the rank achieved by 21

32 stddev/mean observed random number of disease genes Figure 3.4: The standard deviation of the degrees of genes implicated in the same disease, divided by their mean, with respect to the number of genes associated with that particular disease. raw proximity calculations is utilized. If the rankings of two proteins are identical in the combined list, their ranks in the unused lists are utilized to break the tie. Assume, for example, that a protein v with degree 2 is ranked as number 1 in R adj and 15 in R raw, whereas another protein u with degree 20 is ranked as number 30 in R adj and 1 in R raw,whereλ is set to 5. The merged list will contain the value 1 for each of these proteins. In the prioritization process on the merged list, v will be ranked higher compared to u since its rank in its unutilized list R raw is higher than the rank of u in R adj. Best of each rank: For each candidate gene v i C, we create another ranking list composed of max{r raw (i),r adj (i)}. Again, we break the ties by using the rankings in the unutilized lists for each candidate in the merged list. 22

33 Based on degree of known disease genes Based on the notion that some diseases are studied more in detail compared to other diseases, we expect the degrees of genes associated with similar diseases to be somewhat close to each other. In fact, this expectation is supported by experimental evidence, as shown in Figure 3.4. In this figure, we plot the standard deviation of the degrees of genes implicated in the same disease, divided by their mean. Results indicate that, compared to randomly generated sets of genes, degrees of proteins associated with the same disease are closer to each other. We take advantage of this observation by using it as an indicator of the degree of the true candidate gene among other candidates. If all the known proteins associated with a disease have high degrees, we favor high-degree proteins among the candidates for that particular disease and vice versa. Specifically, for each candidate gene u C, we compute the average degree of the genes for that particular disease and use that value to decide which ranking to utilize for all candidates. In other words, for each disease, we compute d =( v S N(v) )/ S. After this step, if d>λ,weuser raw for each candidate gene, and otherwise, we use R adj. Observe that, this approach is global, i.e. rankings used do not depend on the degree of each candidate gene, but the average degree of the seed set. On the other hand, the other two hybrid schemes presented earlier are local, i.e. ranking used for each candidate gene depends on the degree of that particular gene. 23

34 Chapter 4 Experimental Results In this chapter, we comprehensively evaluate the performance of the methods we presented in the previous chapter. We start by introducing our experimental setup and the performance metrics we use. Next, we compare the performance of existing methods in terms of these metrics. Subsequently, we show how the results are enhancing by incorporation of disease similarity information. Then, we illustrate the effect of statistical correction schemes in detail. Finally, we present the proposed uniform prioritization strategies and show that our final proposed hybrid method outperforms existing methods for disease gene prioritization. 4.1 Experimental Setting In order to evaluate and compare different methods in terms of prioritizing diseaseassociated genes, we apply leave-one-out cross-validation. For each protein that is associated with a disease, we conduct the following experiment: We remove a gene from the set of genes associated with a particular disease. We generate an artificial linkage interval, containing this removed disease-associated gene with other 99 genes, located nearest in terms of the genomic distance on 24

35 the same chromosome. Please note that, later in this chapter we investigate the effect of selection of the size of candidate sets to the performance. Using each of the methods described in the previous chapter, we score each of these candidate genes based on its network proximity to the seed set. Then, we sort the candidate genes with respect to these scores. The rank of the true gene among all other candidates is used as an indicator of the accuracy of the scoring method used. In order to systematically compare the performance of different methods, we use the following metrics: Average Rank: After repeating the above steps for all proteins associated with at least one disease, we calculate the average rank of the true proteins among the other candidates. This value is the percentage of the rank of the true protein among other candidates. ROC Curves: We also plot ROC Curves, i.e. sensitivity vs 1-specificity, by varying the rank threshold from 1 to 100. Sensitivity is defined as the percentage of the true genes ranked above the particular threshold where as specificity is the percentage of all candidate genes ranked below the threshold. Area under ROC curve (AUROC) is used as another performance measure to compare different methods. Percentage of the disease proteins ranked in top 1%: This value gives the percentage of the true disease proteins that were ranked as one of the proteins in the top 1% among the candidates in the prioritization experiments. Percentage of the disease proteins ranked in top 5%: This value gives the percentage of the true disease proteins that were ranked as one of the proteins in the top 5% among the candidates in the prioritization experiments. 25

36 4.1.1 Datasets Used For our experiments, we use the human PPI data extracted from NCBI Entrez Gene Database [46]. This database integrates interaction data from several other databases available, such as HPRD, BioGrid and BIND. After the removal of nodes with no interactions, we work on a PPI data composed of 8959 proteins and distinct interactions among these proteins. Weextract disease information fromonline Mendelian Inheritance in Man (OMIM) dataset as mentioned before. OMIM provides a publicly available comprehensive database of genotype-phenotype relationship in humans. We map genes associated with diseases to the PPI data we use and remove those diseases for which we are unable to map more than two associated genes. After this step, we have a total of 206 diseases with at least 3 associated genes. Number of genes associated with these diseases ranges from 3 to 36, with the average number of associations for each disease being approximately Existing Methods Here, we compare the performance achieved by the localized methods, i.e. shortest distances and direct interactions, to the global methods, i.e. network propagation and random walk with restarts. But first, we conduct an experiment to decide on the restart probability to use for the global methods Effect of the Restart Probability The restart probability represented by r in Equations 3.1 and 3.2 is a parameter that is used to adjust the preference between the importance of a protein with respect to the seed set and network topology. We conduct experiments using different values for this restart probability to observe its effect on the performance of the method based 26

37 Sensitivity r=0.01 r=0.1 r=0.3 r=0.5 r=0.7 r= Specificity Figure 4.1: ROC curves showing the effect of the restart probability on the performance of random walk with restarts. on random walk with restarts. As illustrated in Table 4.1 and Figure 4.1, unless a very small value is used, the selection of the restart probability does not effect the performance significantly. The performance degrades significantly for the value r =0.01, which is expected because the effect of the seed genes is minimized in that case, thus the proximity of a protein is calculated based primarily on the centrality of that protein. Table 4.1: The effect of the restart probability on the performance of disease prioritization based on random walk with restarts in terms of the average rank of the true candidate proteins as well as the area under ROC curves. Restart Prob. Avg. Rank AUROC

38 4.2.2 Comparison of Existing Methods As listed in Table 4.2, global methods clearly outperform the localized methods in terms of the average rank and AUROC values achieved. As expected, shortest distance performs better than the method that is based solely on direct interactions. On the other hand, network propagation and random walk with restarts perform very similarly, with network propagation showing some marginal improvement. Please note that disease similarity information is not utilized in these experiments. Based on these results, in the rest of this chapter, we only focus on global methods and eliminate the localized approaches at this stage. Table 4.2: Comparison of the existing methods in terms of average ranks of the disease genes among the candidates and area under ROC curves. Global methods clearly outperform localized approaches in terms of both of the metrics used. METHOD Avg. Rank AUROC Network propagation Random walk w/ restarts Shortest distances Direct interactions Effect of Disease Similarity Low similarity scores for disease pairs do not usually contain any valuable information, as mentioned in [45]. Thus, selection of the similarity threshold is an important task in our work. We perform experiments with different thresholds on disease similarity, by ignoring similarities below that particular threshold. Detailed results on this threshold analysis can be found in Figure 4.2 and Table 4.3. Our results are in line with previous results in terms of the optimal disease threshold to use. Using a low threshold (< 0.3) apparently adds false positives to the calculation thus decreases the accuracy. Similarly, using a large threshold 0.8, we do not utilize all similar diseases, thus missing some valuable information. Based on these results, the threshold value 28

39 is set to 0.3 for all methods based on network propagation and random walk with restarts. For the sake of completeness, we perform one additional test, to investigate the effect of disease similarity with similar disease proteins selected randomly. In other words, for the diseases identified to be similar using the threshold value set to 0.3, we select the associated proteins to the similar diseases randomly by conserving the number of associations for each disease. As expected, this performs even worse than adding no disease similarity information at all. The purpose for this experiment is to show that disease similarity information does not add any artificial advantage to the performance. In Figure 4.2, the ROC curve clearly demonstrate the advantage of using disease similarity information when the threshold is set to 0.3. Table 4.3: The effect of incorporating disease similarity on the method based on random walk with restarts. The optimal threshold for disease similarity measures is observed to be 0.3. As expected, if genes associated with similar diseases are selected randomly, the performance degrades compared to not using any disease similarity information. Method Avg. Rank AUROC No Disease Similarity Disease Similarity with thres Disease Similarity with thres Disease Similarity with thres Disease Similarity with thres Disease Similarity with thres Random Disease Similarity Effect of Statistical Adjustment As mentioned before, the performance of the global methods are highly biased with the degree of the true candidate protein. The effect of the degree for global methods is previously demonstrated in Figure 3.3. Especially for candidate genes with degree 5, existing global approaches perform very poorly in terms of identifying the true 29

40 Sensitivity no disease similarity disease similarity using threshold 0.3 random disease similarity Specificity Figure 4.2: ROC curves for the method using optimal disease similarity threshold 0.3, no disease similarity and random disease similarity are plotted. As expected, incorporating disease similarity information improves the performance where as if genes associated with similar diseases are selected randomly, this degrades the performance. disease gene. However, candidate genes with degree 5 make up 51% of all known disease genes. Since known disease genes are likely to be studied more, they are also likely to have more known interactions. To this end, accurate ranking of low-degree genes is very important in terms of being able to extract new information on relatively less studied genes. Therefore, we tackle this problem by applying the statistical adjustment strategies described in the previous chapter by using the random walk with restarts as the base method. We conduct separate experiments by using all disease proteins as well as disease proteins with degree 5 only. By examining the Figures 4.3, 4.4 and Table 4.4, one can easily see the impact of our correction schemes. Proposed models significantly outperform existing methods especially when dealing with loosely connected proteins. However, this comes at a price of overcorrecting some highly connected disease pro- 30

41 Sensitivity random walk with restarts network propagation model based on candidate degree model based on centrality model based on seed degrees Specificity Figure 4.3: ROC curves of statistical correction methods. All three adjustment models work better compared to the existing global methods. teins. Although the correction schemes outperform existing methods even when we use all disease proteins, it is clear that these methods can even perform better by applying hybrid methods, as we demonstrate in detail in the following section. 4.4 Uniform Prioritization As shown, the proposed statistical schemes perform better for loosely connected proteins, however, existing global methods perform somewhat better for highly connected proteins. In this section, we evaluate the performance when we combine these two approaches as discussed in the previous chapter. We first examine the average rank and AUROC values for different hybrid methods we use, by combining our three statistical adjustment methods using three different hybrid models explained in depth in the previous chapter. Thus, we present uniform 31

42 Sensitivity random walk with restarts network propagation Model based on candidate degree Model based on centrality Model based on seed degrees Specificity Figure 4.4: ROC curves of the performance of proposed statistical correction schemes and extant prioritization methods in prioritizing known disease genes with degree 5. prioritization results of nine different methods here. In these experiments, we set the high-degree threshold λ, to 5. For convenience, we refer to the hybrid methods in the tables as explained below: Hyb1: Hybrid method based on degree of candidate genes. Hyb2: Hybrid method based on best of each rank. Hyb3: Hybrid method based on average degree of seed genes. As shown in Tables 4.5 and 4.6, hybrid models that combine raw proximity scores with statistical adjustments based on degree of the candidates outperform other adjustment techniques. We choose the hybrid method based on degree of candidate genes (Hyb1) that combines the raw scores with our adjustment scheme based on candidate degrees as our final proposed method, since this approach provides the best accuracy in correctly predicting the disease protein as the top candidate (21.7%) as well as the highest 32

43 Table 4.4: The effect of statistical adjustment on performance. Average Rank of the true disease genes and AUROC values are listed. For the sake of demonstrating our point, we list disease genes with degree 5and > 5. For the highly connected proteins, the performance of existing global methods are comparable to statistical correction schemes. On the other hand, for the loosely connected proteins, the proposed schemes provide significant improvement. As evident, compared to existing methods, these correction schemes are less affected by the degree of the of the true candidate gene. All Genes Degrees 5 Degrees> 5 Method Avg. Rank AUROC Avg. Rank AUROC Avg. Rank AUROC Network Propagation Random walk w/ restarts Based on seed degree Based on candidate degree Basedoncentrality Table 4.5: Performance of proposed uniform prioritization methods for all combinations of statistical adjustment and hybrid scoring schemes. Candidate deg. Seed deg. Centrality Avg. Rank AUROC Avg. Rank AUROC Avg. Rank AUROC Hyb Hyb Hyb average rank of the true candidate gene (23.22). Table 4.6: Percentage of disease proteins ranked in top 1% and in top 5% among other candidates. The most accurate method is the one based on degree of candidate proteins (hyb1) by using the adjustment scheme that utilizes eigenvalue centrality. Candidate deg. Seed deg. Centrality Hyb1 Hyb2 Hyb3 Hyb1 Hyb2 Hyb3 Hyb1 Hyb2 Hyb3 Perc. ranked in top 1% Perc. ranked in top 5% Our proposed hybrid method outperforms both of the existing global approaches in terms of average rank values achieved, AUROC values and accuracy in prediction, i.e. the percentage of disease proteins ranked as the top 1% and 5%. These final results comparing the proposed method with existing global approaches are shown in Figures 4.5, 4.6 and Table

44 Sensitivity random walk with restarts network propagation proposed hybrid method Specificity Figure 4.5: ROC curves to compare the proposed method with existing global approaches. Table 4.7: Comparison of the proposed method with existing global approaches in terms of different performance metrics. Proposed method outperforms others in terms of all metrics used. METHOD Avg. Rank AUROC Perc. Ranked in top 1% Perc. Ranked in top 5% Proposed Hybrid Method Network propagation Random walk w/ restarts Size of the Candidate Set As mentioned before, linkage analysis provides up to several hundreds of candidate genes for diseases. Thus, here we investigate the effect of candidate set size on the performance of our proposed method as well as the raw method based on random walk with restarts. As illustrated in Figure 4.7, the proposed method outperforms existing approaches for all different candidate set sizes tested. Please note that after a slight increase for the size 60, the performance of both methods do not change much for different candidate numbers up to

45 60 average rank of disease genes random walk with restarts network propagation adjusted by candidate degree proposed hybrid model degree range of disease genes Figure 4.6: The effect of degree to the performance of both existing global approaches, adjusted method based on candidate degrees as well as the proposed hybrid model are shown. 4.5 Case Examples In this section, we provide two real examples to show the power of our proposed hybrid method that uses statistical adjustment on existing global approaches in terms of capturing the disease protein better especially when it is loosely connected Example 1: The first example is for the disease called Azoospermia. This disease has 3 proteins directly associated with it in our PPI network, namely USP9Y, SYCP3 and SCP3. In our cross validation experiments, we remove USP9Y and try to predict that using the other two proteins. Existing methods fail to identify USP9Y because it is not a centralized protein with a degree of only 1. Thus, random walk with restarts model ranks this true protein as 57 th and network propagation ranks it as 27 th among 100 candidates. On the other hand, our method was able to correctly rank this protein as the 1 st candidate. This 35

46 random walk with restarts proposed hybrid method rank of the disease gene (%) number of candidate genes Figure 4.7: Performance comparison for different number of candidate proteins in terms of the average rank of the true disease protein among the candidates. case is illustrated in Figure 4.8 where we draw the 2-neighborhood of proteins USP9Y, SYCP3 and SCP3. Proteins in the seed set as well as the proteins associated with similar diseases are shown as green circles. The intensity of these green circles are proportional to the disease similarity between the disease of interest and the disease they are associated with. Subsequently, USP9Y and SYCP3 are the darkest green proteins. For this example, none of the diseases are identified as significantly similar to Azoospermia so the proteins used are from the seed set only. True candidate that we try to capture is represented by a yellow circle. The protein that is incorrectly ranked as the 1 st among the candidates using both of the existing global approaches is represented by a diamond. As expected, this incorrectly identified protein is a centralized node with a high degree connected to other centralized proteins. 36

47 4.5.2 Example 2: The second example is for the disease called Microphthalmia. This disease also has 3 proteins directly associated with it in our PPI network, namely SIX6, CHX10 and BCOR. In our cross validation experiments, we remove SIX6 and try to predict that using the other two proteins. Again, existing methods fail because SIX6 is not a centralized protein with a degree of only 1. Thus, random walk with restarts model ranks this true protein as 26 th and network propagation ranks it as 16 th among 100 candidates. Again, our method was able to correctly rank this protein as the 1 st candidate. This case is illustrated in Figure 4.9 where we draw the 2-neighborhood of proteins SIX6, CHX10 and BCOR. Proteins in the seed set as well as the proteins associated with similar diseases are shown as green circles. The intensity of these green circles are proportional to the disease similarity between the disease of interest and the disease they are associated with. Subsequently, CHX10 and BCOR are the darkest green proteins. Similar to the previous example, the protein that is ranked incorrectly as the 1 st among the candidates using both of the existing global approaches is represented by a diamond. Again, the incorrectly identified protein is a centralized node with a high degree connected to other centralized proteins. 37

48 Figure 4.8: Case example for the Azoospermia disease. 2-neighborhood of the disease genes is shown. Proteins in the seed set are shown as green circles whereas the true candidate that we try to capture is represented as a yellow circle. Diamond represents the protein that is incorrectly ranked first for both of the existing global approaches. 38

49 Figure 4.9: Case example for the Microphthalmia disease. 2-neighborhood of the disease genes is illustrated. Proteins associated with the disease of interest are illustrated as green circles where the intensity is proportional to the similarity of the disease that protein is associated with, to the disease of interest. The true candidate that we try to capture is represented as a yellow circle. Diamond represents the protein that is incorrectly ranked first for both of the existing global approaches. 39

Role of Centrality in Network Based Prioritization of Disease Genes

Role of Centrality in Network Based Prioritization of Disease Genes Role of Centrality in Network Based Prioritization of Disease Genes Sinan Erten 1 and Mehmet Koyutürk 1,2 Case Western Reserve University (1)Electrical Engineering & Computer Science (2)Center for Proteomics

More information

A Propagation-based Algorithm for Inferring Gene-Disease Associations

A Propagation-based Algorithm for Inferring Gene-Disease Associations A Propagation-based Algorithm for Inferring Gene-Disease Associations Oron Vanunu Roded Sharan Abstract: A fundamental challenge in human health is the identification of diseasecausing genes. Recently,

More information

Protein-Protein-Interaction Networks. Ulf Leser, Samira Jaeger

Protein-Protein-Interaction Networks. Ulf Leser, Samira Jaeger Protein-Protein-Interaction Networks Ulf Leser, Samira Jaeger This Lecture Protein-protein interactions Characteristics Experimental detection methods Databases Protein-protein interaction networks Ulf

More information

Integrative omics approaches in systems biology of complex phenotypes

Integrative omics approaches in systems biology of complex phenotypes Integrative omics approaches in systems biology of complex phenotypes Mehmet Koyutürk Case Western Reserve University (1)Electrical Engineering & Computer Science (2)Center for Proteomics & Bioinformatics

More information

The Interaction-Interaction Model for Disease Protein Discovery

The Interaction-Interaction Model for Disease Protein Discovery The Interaction-Interaction Model for Disease Protein Discovery Ken Cheng December 10, 2017 Abstract Network medicine, the field of using biological networks to develop insight into disease and medicine,

More information

Protein-Protein-Interaction Networks. Ulf Leser, Samira Jaeger

Protein-Protein-Interaction Networks. Ulf Leser, Samira Jaeger Protein-Protein-Interaction Networks Ulf Leser, Samira Jaeger This Lecture Protein-protein interactions Characteristics Experimental detection methods Databases Biological networks Ulf Leser: Introduction

More information

Exploring Similarities of Conserved Domains/Motifs

Exploring Similarities of Conserved Domains/Motifs Exploring Similarities of Conserved Domains/Motifs Sotiria Palioura Abstract Traditionally, proteins are represented as amino acid sequences. There are, though, other (potentially more exciting) representations;

More information

Survival Outcome Prediction for Cancer Patients based on Gene Interaction Network Analysis and Expression Profile Classification

Survival Outcome Prediction for Cancer Patients based on Gene Interaction Network Analysis and Expression Profile Classification Survival Outcome Prediction for Cancer Patients based on Gene Interaction Network Analysis and Expression Profile Classification Final Project Report Alexander Herrmann Advised by Dr. Andrew Gentles December

More information

SOCIAL MEDIA MINING. Behavior Analytics

SOCIAL MEDIA MINING. Behavior Analytics SOCIAL MEDIA MINING Behavior Analytics Dear instructors/users of these slides: Please feel free to include these slides in your own material, or modify them as you see fit. If you decide to incorporate

More information

Data Mining for Biological Data Analysis

Data Mining for Biological Data Analysis Data Mining for Biological Data Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Data Mining Course by Gregory-Platesky Shapiro available at www.kdnuggets.com Jiawei Han

More information

Finding Compensatory Pathways in Yeast Genome

Finding Compensatory Pathways in Yeast Genome Finding Compensatory Pathways in Yeast Genome Olga Ohrimenko Abstract Pathways of genes found in protein interaction networks are used to establish a functional linkage between genes. A challenging problem

More information

Protein-Protein-Interaction Networks. Ulf Leser, Samira Jaeger

Protein-Protein-Interaction Networks. Ulf Leser, Samira Jaeger Protein-Protein-Interaction Networks Ulf Leser, Samira Jaeger SHK Stelle frei Ab 1.9.2015, 2 Jahre, 41h/Monat Verbundprojekt MaptTorNet: Pankreatische endokrine Tumore Insb. statistische Aufbereitung und

More information

Upstream/Downstream Relation Detection of Signaling Molecules using Microarray Data

Upstream/Downstream Relation Detection of Signaling Molecules using Microarray Data Vol 1 no 1 2005 Pages 1 5 Upstream/Downstream Relation Detection of Signaling Molecules using Microarray Data Ozgun Babur 1 1 Center for Bioinformatics, Computer Engineering Department, Bilkent University,

More information

Biomine: Predicting links between biological entities using network models of heterogeneous databases

Biomine: Predicting links between biological entities using network models of heterogeneous databases https://helda.helsinki.fi Biomine: Predicting links between biological entities using network models of heterogeneous databases Eronen, Lauri 202 Eronen, L & Toivonen, H 202, ' Biomine: Predicting links

More information

Disease Protein Pathway Discovery using Higher-Order Network Structures

Disease Protein Pathway Discovery using Higher-Order Network Structures Disease Protein Pathway Discovery using Higher-Order Network Structures Kevin Q Li, Glenn Yu, Jeffrey Zhang 1 Introduction A disease pathway is a set of proteins that influence the disease. The discovery

More information

1.2 Geometry of a simplified two dimensional model molecule Discrete flow of empty space illustrated for two dimensional disks.

1.2 Geometry of a simplified two dimensional model molecule Discrete flow of empty space illustrated for two dimensional disks. 1.1 Geometric models of protein surfaces. 3 1.2 Geometry of a simplified two dimensional model molecule. 5 1.3 The family of alpha shapes or dual simplicial complexes for a two-dimensional toy molecule.

More information

Near-Balanced Incomplete Block Designs with An Application to Poster Competitions

Near-Balanced Incomplete Block Designs with An Application to Poster Competitions Near-Balanced Incomplete Block Designs with An Application to Poster Competitions arxiv:1806.00034v1 [stat.ap] 31 May 2018 Xiaoyue Niu and James L. Rosenberger Department of Statistics, The Pennsylvania

More information

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University Machine learning applications in genomics: practical issues & challenges Yuzhen Ye School of Informatics and Computing, Indiana University Reference Machine learning applications in genetics and genomics

More information

TUTORIAL. Revised in Apr 2015

TUTORIAL. Revised in Apr 2015 TUTORIAL Revised in Apr 2015 Contents I. Overview II. Fly prioritizer Function prioritization III. Fly prioritizer Gene prioritization Gene Set Analysis IV. Human prioritizer Human disease prioritization

More information

Microarray Data Analysis in GeneSpring GX 11. Month ##, 200X

Microarray Data Analysis in GeneSpring GX 11. Month ##, 200X Microarray Data Analysis in GeneSpring GX 11 Month ##, 200X Agenda Genome Browser GO GSEA Pathway Analysis Network building Find significant pathways Extract relations via NLP Data Visualization Options

More information

Bioinformatics : Gene Expression Data Analysis

Bioinformatics : Gene Expression Data Analysis 05.12.03 Bioinformatics : Gene Expression Data Analysis Aidong Zhang Professor Computer Science and Engineering What is Bioinformatics Broad Definition The study of how information technologies are used

More information

ReCombinatorics. The Algorithmics and Combinatorics of Phylogenetic Networks with Recombination. Dan Gusfield

ReCombinatorics. The Algorithmics and Combinatorics of Phylogenetic Networks with Recombination. Dan Gusfield ReCombinatorics The Algorithmics and Combinatorics of Phylogenetic Networks with Recombination! Dan Gusfield NCBS CS and BIO Meeting December 19, 2016 !2 SNP Data A SNP is a Single Nucleotide Polymorphism

More information

Secondary Math Margin of Error

Secondary Math Margin of Error Secondary Math 3 1-4 Margin of Error What you will learn: How to use data from a sample survey to estimate a population mean or proportion. How to develop a margin of error through the use of simulation

More information

Homophily and Influence in Social Networks

Homophily and Influence in Social Networks Homophily and Influence in Social Networks Nicola Barbieri nicolabarbieri1@gmail.com References: Maximizing the Spread of Influence through a Social Network, Kempe et Al 2003 Influence and Correlation

More information

Title: Genome-Wide Predictions of Transcription Factor Binding Events using Multi- Dimensional Genomic and Epigenomic Features Background

Title: Genome-Wide Predictions of Transcription Factor Binding Events using Multi- Dimensional Genomic and Epigenomic Features Background Title: Genome-Wide Predictions of Transcription Factor Binding Events using Multi- Dimensional Genomic and Epigenomic Features Team members: David Moskowitz and Emily Tsang Background Transcription factors

More information

ACCURATE AND RELIABLE CANCER CLASSIFICATION BASED ON PATHWAY-MARKERS AND SUBNETWORK-MARKERS. A Thesis JUNJIE SU

ACCURATE AND RELIABLE CANCER CLASSIFICATION BASED ON PATHWAY-MARKERS AND SUBNETWORK-MARKERS. A Thesis JUNJIE SU ACCURATE AND RELIABLE CANCER CLASSIFICATION BASED ON PATHWAY-MARKERS AND SUBNETWORK-MARKERS A Thesis by JUNJIE SU Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment

More information

Application of Decision Trees in Mining High-Value Credit Card Customers

Application of Decision Trees in Mining High-Value Credit Card Customers Application of Decision Trees in Mining High-Value Credit Card Customers Jian Wang Bo Yuan Wenhuang Liu Graduate School at Shenzhen, Tsinghua University, Shenzhen 8, P.R. China E-mail: gregret24@gmail.com,

More information

Statistical Methods for Network Analysis of Biological Data

Statistical Methods for Network Analysis of Biological Data The Protein Interaction Workshop, 8 12 June 2015, IMS Statistical Methods for Network Analysis of Biological Data Minghua Deng, dengmh@pku.edu.cn School of Mathematical Sciences Center for Quantitative

More information

Inference and computing with decomposable graphs

Inference and computing with decomposable graphs Inference and computing with decomposable graphs Peter Green 1 Alun Thomas 2 1 School of Mathematics University of Bristol 2 Genetic Epidemiology University of Utah 6 September 2011 / Bayes 250 Green/Thomas

More information

Retrieval of gene information at NCBI

Retrieval of gene information at NCBI Retrieval of gene information at NCBI Some notes 1. http://www.cs.ucf.edu/~xiaoman/fall/ 2. Slides are for presenting the main paper, should minimize the copy and paste from the paper, should write in

More information

ChIP-seq and RNA-seq. Farhat Habib

ChIP-seq and RNA-seq. Farhat Habib ChIP-seq and RNA-seq Farhat Habib fhabib@iiserpune.ac.in Biological Goals Learn how genomes encode the diverse patterns of gene expression that define each cell type and state. Protein-DNA interactions

More information

Can Cascades be Predicted?

Can Cascades be Predicted? Can Cascades be Predicted? Rediet Abebe and Thibaut Horel September 22, 2014 1 Introduction In this presentation, we discuss the paper Can Cascades be Predicted? by Cheng, Adamic, Dow, Kleinberg, and Leskovec,

More information

Prioritizing Disease Genes by Bi-Random Walk

Prioritizing Disease Genes by Bi-Random Walk Prioritizing Disease Genes by Bi-Random Walk Maoqiang Xie 1, Taehyun Hwang 2, and Rui Kuang 3, 1 College of Software, Nankai University, Tianjin, China 2 Masonic Cancer Center, University of Minnesota,

More information

PREDICTING PREVENTABLE ADVERSE EVENTS USING INTEGRATED SYSTEMS PHARMACOLOGY

PREDICTING PREVENTABLE ADVERSE EVENTS USING INTEGRATED SYSTEMS PHARMACOLOGY PREDICTING PREVENTABLE ADVERSE EVENTS USING INTEGRATED SYSTEMS PHARMACOLOGY GUY HASKIN FERNALD 1, DORNA KASHEF 2, NICHOLAS P. TATONETTI 1 Center for Biomedical Informatics Research 1, Department of Computer

More information

ChIP-seq and RNA-seq

ChIP-seq and RNA-seq ChIP-seq and RNA-seq Biological Goals Learn how genomes encode the diverse patterns of gene expression that define each cell type and state. Protein-DNA interactions (ChIPchromatin immunoprecipitation)

More information

DNA Based Disease Prediction using pathway Analysis

DNA Based Disease Prediction using pathway Analysis 2017 IEEE 7th International Advance Computing Conference DNA Based Disease Prediction using pathway Analysis Syeeda Farah Dr.Asha T Cauvery B and Sushma M S Department of Computer Science and Shivanand

More information

By the end of this lecture you should be able to explain: Some of the principles underlying the statistical analysis of QTLs

By the end of this lecture you should be able to explain: Some of the principles underlying the statistical analysis of QTLs (3) QTL and GWAS methods By the end of this lecture you should be able to explain: Some of the principles underlying the statistical analysis of QTLs Under what conditions particular methods are suitable

More information

Cascading Behavior in Networks. Anand Swaminathan, Liangzhe Chen CS /23/2013

Cascading Behavior in Networks. Anand Swaminathan, Liangzhe Chen CS /23/2013 Cascading Behavior in Networks Anand Swaminathan, Liangzhe Chen CS 6604 10/23/2013 Outline l Diffusion in networks l Modeling diffusion through a network l Diffusion, Thresholds and role of Weak Ties l

More information

The two-stage recombination operator and its application to the multiobjective 0/1 knapsack problem: a comparative study

The two-stage recombination operator and its application to the multiobjective 0/1 knapsack problem: a comparative study The two-stage recombination operator and its application to the multiobjective 0/1 knapsack problem: a comparative study Brahim AGHEZZAF and Mohamed NAIMI Laboratoire d Informatique et d Aide à la Décision

More information

DRNApred, fast sequence-based method that accurately predicts and discriminates DNA- and RNA-binding residues

DRNApred, fast sequence-based method that accurately predicts and discriminates DNA- and RNA-binding residues Supporting Materials for DRNApred, fast sequence-based method that accurately predicts and discriminates DNA- and RNA-binding residues Jing Yan 1 and Lukasz Kurgan 2, * 1 Department of Electrical and Computer

More information

Cancer Informatics. Comprehensive Evaluation of Composite Gene Features in Cancer Outcome Prediction

Cancer Informatics. Comprehensive Evaluation of Composite Gene Features in Cancer Outcome Prediction Open Access: Full open access to this and thousands of other papers at http://www.la-press.com. Cancer Informatics Supplementary Issue: Computational Advances in Cancer Informatics (B) Comprehensive Evaluation

More information

This place covers: Methods or systems for genetic or protein-related data processing in computational molecular biology.

This place covers: Methods or systems for genetic or protein-related data processing in computational molecular biology. G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY Methods or systems for genetic

More information

An Integrative -omics Approach to Identify Functional Sub-Networks in Human Colorectal Cancer

An Integrative -omics Approach to Identify Functional Sub-Networks in Human Colorectal Cancer An Integrative -omics Approach to Identify Functional Sub-Networks in Human Colorectal Cancer Rod K. Nibbe 1,2 *, Mehmet Koyutürk 1,3, Mark R. Chance 1,4 1 Center for Proteomics & Bioinformatics, Case

More information

Computational Approaches for Disease Gene Identification

Computational Approaches for Disease Gene Identification Computational Approaches for Disease Gene Identification YANG PENG School of Computer Engineering A Thesis Submitted to Nanyang Technological University In Fulfillment of the Requirement For the Degree

More information

Knowledge-Guided Analysis with KnowEnG Lab

Knowledge-Guided Analysis with KnowEnG Lab Han Sinha Song Weinshilboum Knowledge-Guided Analysis with KnowEnG Lab KnowEnG Center Powerpoint by Charles Blatti Knowledge-Guided Analysis KnowEnG Center 2017 1 Exercise In this exercise we will be doing

More information

BIOINFORMATICS AND SYSTEM BIOLOGY (INTERNATIONAL PROGRAM)

BIOINFORMATICS AND SYSTEM BIOLOGY (INTERNATIONAL PROGRAM) BIOINFORMATICS AND SYSTEM BIOLOGY (INTERNATIONAL PROGRAM) PROGRAM TITLE DEGREE TITLE Master of Science Program in Bioinformatics and System Biology (International Program) Master of Science (Bioinformatics

More information

Course Announcements

Course Announcements Statistical Methods for Quantitative Trait Loci (QTL) Mapping II Lectures 5 Oct 2, 2 SE 527 omputational Biology, Fall 2 Instructor Su-In Lee T hristopher Miles Monday & Wednesday 2-2 Johnson Hall (JHN)

More information

An Analytical Upper Bound on the Minimum Number of. Recombinations in the History of SNP Sequences in Populations

An Analytical Upper Bound on the Minimum Number of. Recombinations in the History of SNP Sequences in Populations An Analytical Upper Bound on the Minimum Number of Recombinations in the History of SNP Sequences in Populations Yufeng Wu Department of Computer Science and Engineering University of Connecticut Storrs,

More information

2. Materials and Methods

2. Materials and Methods Identification of cancer-relevant Variations in a Novel Human Genome Sequence Robert Bruggner, Amir Ghazvinian 1, & Lekan Wang 1 CS229 Final Report, Fall 2009 1. Introduction Cancer affects people of all

More information

Bayesian Variable Selection and Data Integration for Biological Regulatory Networks

Bayesian Variable Selection and Data Integration for Biological Regulatory Networks Bayesian Variable Selection and Data Integration for Biological Regulatory Networks Shane T. Jensen Department of Statistics The Wharton School, University of Pennsylvania stjensen@wharton.upenn.edu Gary

More information

Changing Mutation Operator of Genetic Algorithms for optimizing Multiple Sequence Alignment

Changing Mutation Operator of Genetic Algorithms for optimizing Multiple Sequence Alignment International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 11 (2013), pp. 1155-1160 International Research Publications House http://www. irphouse.com /ijict.htm Changing

More information

Evaluating Workflow Trust using Hidden Markov Modeling and Provenance Data

Evaluating Workflow Trust using Hidden Markov Modeling and Provenance Data Evaluating Workflow Trust using Hidden Markov Modeling and Provenance Data Mahsa Naseri and Simone A. Ludwig Abstract In service-oriented environments, services with different functionalities are combined

More information

Predicting ratings of peer-generated content with personalized metrics

Predicting ratings of peer-generated content with personalized metrics Predicting ratings of peer-generated content with personalized metrics Project report Tyler Casey tyler.casey09@gmail.com Marius Lazer mlazer@stanford.edu [Group #40] Ashish Mathew amathew9@stanford.edu

More information

On of the major merits of the Flag Model is its potential for representation. There are three approaches to such a task: a qualitative, a

On of the major merits of the Flag Model is its potential for representation. There are three approaches to such a task: a qualitative, a Regime Analysis Regime Analysis is a discrete multi-assessment method suitable to assess projects as well as policies. The strength of the Regime Analysis is that it is able to cope with binary, ordinal,

More information

Data Mining and Applications in Genomics

Data Mining and Applications in Genomics Data Mining and Applications in Genomics Lecture Notes in Electrical Engineering Volume 25 For other titles published in this series, go to www.springer.com/series/7818 Sio-Iong Ao Data Mining and Applications

More information

BIOINF/BENG/BIMM/CHEM/CSE 184: Computational Molecular Biology. Lecture 2: Microarray analysis

BIOINF/BENG/BIMM/CHEM/CSE 184: Computational Molecular Biology. Lecture 2: Microarray analysis BIOINF/BENG/BIMM/CHEM/CSE 184: Computational Molecular Biology Lecture 2: Microarray analysis Genome wide measurement of gene transcription using DNA microarray Bruce Alberts, et al., Molecular Biology

More information

V 1 Introduction! Fri, Oct 24, 2014! Bioinformatics 3 Volkhard Helms!

V 1 Introduction! Fri, Oct 24, 2014! Bioinformatics 3 Volkhard Helms! V 1 Introduction! Fri, Oct 24, 2014! Bioinformatics 3 Volkhard Helms! How Does a Cell Work?! A cell is a crowded environment! => many different proteins,! metabolites, compartments,! On a microscopic level!

More information

Course Information. Introduction to Algorithms in Computational Biology Lecture 1. Relations to Some Other Courses

Course Information. Introduction to Algorithms in Computational Biology Lecture 1. Relations to Some Other Courses Course Information Introduction to Algorithms in Computational Biology Lecture 1 Meetings: Lecture, by Dan Geiger: Mondays 16:30 18:30, Taub 4. Tutorial, by Ydo Wexler: Tuesdays 10:30 11:30, Taub 2. Grade:

More information

BIOINFORMATICS THE MACHINE LEARNING APPROACH

BIOINFORMATICS THE MACHINE LEARNING APPROACH 88 Proceedings of the 4 th International Conference on Informatics and Information Technology BIOINFORMATICS THE MACHINE LEARNING APPROACH A. Madevska-Bogdanova Inst, Informatics, Fac. Natural Sc. and

More information

Estimation of Genetic Recombination Frequency with the Help of Logarithm Of Odds (LOD) Method

Estimation of Genetic Recombination Frequency with the Help of Logarithm Of Odds (LOD) Method ISSN(Online) : 2319-8753 ISSN (Print) : 237-6710 Estimation of Genetic Recombination Frequency with the Help of Logarithm Of Odds (LOD) Method Jugal Gogoi 1, Tazid Ali 2 Research Scholar, Department of

More information

Types of Databases - By Scope

Types of Databases - By Scope Biological Databases Bioinformatics Workshop 2009 Chi-Cheng Lin, Ph.D. Department of Computer Science Winona State University clin@winona.edu Biological Databases Data Domains - By Scope - By Level of

More information

Determining presence/absence threshold for your dataset

Determining presence/absence threshold for your dataset Determining presence/absence threshold for your dataset In PanCGHweb there are two ways to determine the presence/absence calling threshold. One is based on Receiver Operating Curves (ROC) generated for

More information

Introduction to Algorithms in Computational Biology Lecture 1

Introduction to Algorithms in Computational Biology Lecture 1 Introduction to Algorithms in Computational Biology Lecture 1 Background Readings: The first three chapters (pages 1-31) in Genetics in Medicine, Nussbaum et al., 2001. This class has been edited from

More information

Final Report: Local Structure and Evolution for Cascade Prediction

Final Report: Local Structure and Evolution for Cascade Prediction Final Report: Local Structure and Evolution for Cascade Prediction Jake Lussier (lussier1@stanford.edu), Jacob Bank (jbank@stanford.edu) December 10, 2011 Abstract Information cascades in large social

More information

Machine learning-based approaches for BioCreative III tasks

Machine learning-based approaches for BioCreative III tasks Machine learning-based approaches for BioCreative III tasks Shashank Agarwal 1, Feifan Liu 2, Zuofeng Li 2 and Hong Yu 1,2,3 1 Medical Informatics, College of Engineering and Applied Sciences, University

More information

Lab 1: A review of linear models

Lab 1: A review of linear models Lab 1: A review of linear models The purpose of this lab is to help you review basic statistical methods in linear models and understanding the implementation of these methods in R. In general, we need

More information

Final Report: Local Structure and Evolution for Cascade Prediction

Final Report: Local Structure and Evolution for Cascade Prediction Final Report: Local Structure and Evolution for Cascade Prediction Jake Lussier (lussier1@stanford.edu), Jacob Bank (jbank@stanford.edu) ABSTRACT Information cascades in large social networks are complex

More information

Negative Example Selection for Protein Function Prediction: The NoGO Database

Negative Example Selection for Protein Function Prediction: The NoGO Database Negative Example Selection for Protein Function Prediction: The NoGO Database Noah Youngs 1, Duncan Penfold-Brown 2, Richard Bonneau 1,3,4 *, Dennis Shasha 1,4 * 1 Department of Computer Science, New York

More information

Optimisation and Operations Research

Optimisation and Operations Research Optimisation and Operations Research Lecture 17: Genetic Algorithms and Evolutionary Computing Matthew Roughan http://www.maths.adelaide.edu.au/matthew.roughan/ Lecture_notes/OORII/

More information

Following text taken from Suresh Kumar. Bioinformatics Web - Comprehensive educational resource on Bioinformatics. 6th May.2005

Following text taken from Suresh Kumar. Bioinformatics Web - Comprehensive educational resource on Bioinformatics. 6th May.2005 Bioinformatics is the recording, annotation, storage, analysis, and searching/retrieval of nucleic acid sequence (genes and RNAs), protein sequence and structural information. This includes databases of

More information

Uncovering differentially expressed pathways with protein interaction and gene expression data

Uncovering differentially expressed pathways with protein interaction and gene expression data The Second International Symposium on Optimization and Systems Biology (OSB 08) Lijiang, China, October 31 November 3, 2008 Copyright 2008 ORSC & APORC, pp. 74 82 Uncovering differentially expressed pathways

More information

dmgwas: dense module searching for genome wide association studies in protein protein interaction network

dmgwas: dense module searching for genome wide association studies in protein protein interaction network dmgwas: dense module searching for genome wide association studies in protein protein interaction network Peilin Jia 1,2, Siyuan Zheng 1 and Zhongming Zhao 1,2,3 1 Department of Biomedical Informatics,

More information

Functional microrna targets in protein coding sequences. Merve Çakır

Functional microrna targets in protein coding sequences. Merve Çakır Functional microrna targets in protein coding sequences Martin Reczko, Manolis Maragkakis, Panagiotis Alexiou, Ivo Grosse, Artemis G. Hatzigeorgiou Merve Çakır 27.04.2012 microrna * micrornas are small

More information

Overview. Methods for gene mapping and haplotype analysis. Haplotypes. Outline. acatactacataacatacaatagat. aaatactacctaacctacaagagat

Overview. Methods for gene mapping and haplotype analysis. Haplotypes. Outline. acatactacataacatacaatagat. aaatactacctaacctacaagagat Overview Methods for gene mapping and haplotype analysis Prof. Hannu Toivonen hannu.toivonen@cs.helsinki.fi Discovery and utilization of patterns in the human genome Shared patterns family relationships,

More information

Random matrix analysis for gene co-expression experiments in cancer cells

Random matrix analysis for gene co-expression experiments in cancer cells Random matrix analysis for gene co-expression experiments in cancer cells OIST-iTHES-CTSR 2016 July 9 th, 2016 Ayumi KIKKAWA (MTPU, OIST) Introduction : What is co-expression of genes? There are 20~30k

More information

Bioinformatics Tools. Stuart M. Brown, Ph.D Dept of Cell Biology NYU School of Medicine

Bioinformatics Tools. Stuart M. Brown, Ph.D Dept of Cell Biology NYU School of Medicine Bioinformatics Tools Stuart M. Brown, Ph.D Dept of Cell Biology NYU School of Medicine Bioinformatics Tools Stuart M. Brown, Ph.D Dept of Cell Biology NYU School of Medicine Overview This lecture will

More information

Genetic Algorithm: An Optimization Technique Concept

Genetic Algorithm: An Optimization Technique Concept Genetic Algorithm: An Optimization Technique Concept 1 Uma Anand, 2 Chain Singh 1 Student M.Tech (3 rd sem) Department of Computer Science Engineering Dronacharya College of Engineering, Gurgaon-123506,

More information

GENETIC ALGORITHMS. Narra Priyanka. K.Naga Sowjanya. Vasavi College of Engineering. Ibrahimbahg,Hyderabad.

GENETIC ALGORITHMS. Narra Priyanka. K.Naga Sowjanya. Vasavi College of Engineering. Ibrahimbahg,Hyderabad. GENETIC ALGORITHMS Narra Priyanka K.Naga Sowjanya Vasavi College of Engineering. Ibrahimbahg,Hyderabad mynameissowji@yahoo.com priyankanarra@yahoo.com Abstract Genetic algorithms are a part of evolutionary

More information

Creation of a PAM matrix

Creation of a PAM matrix Rationale for substitution matrices Substitution matrices are a way of keeping track of the structural, physical and chemical properties of the amino acids in proteins, in such a fashion that less detrimental

More information

Network-free Inference of Knockout Effects in Yeast

Network-free Inference of Knockout Effects in Yeast Tel-Aviv University The Raymond and Beverly Sackler Faculty of Exact Sciences School of Computer Science Network-free Inference of Knockout Effects in Yeast This thesis is submitted in partial fulfillment

More information

Steering Information Diffusion Dynamically against User Attention Limitation

Steering Information Diffusion Dynamically against User Attention Limitation Steering Information Diffusion Dynamically against User Attention Limitation Shuyang Lin, Qingbo Hu, Fengjiao Wang, Philip S.Yu Department of Computer Science University of Illinois at Chicago Chicago,

More information

Genetic Algorithms for Optimizations

Genetic Algorithms for Optimizations Genetic Algorithms for Optimizations 1. Introduction Genetic Algorithms (GAs) are developed to mimic some of the processes observed in natural evolution. GAs use the concept of Darwin's theory of evolution

More information

Functional genomics + Data mining

Functional genomics + Data mining Functional genomics + Data mining BIO337 Systems Biology / Bioinformatics Spring 2014 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ of Texas/BIO337/Spring 2014 Functional genomics + Data

More information

Bioinformatics & Protein Structural Analysis. Bioinformatics & Protein Structural Analysis. Learning Objective. Proteomics

Bioinformatics & Protein Structural Analysis. Bioinformatics & Protein Structural Analysis. Learning Objective. Proteomics The molecular structures of proteins are complex and can be defined at various levels. These structures can also be predicted from their amino-acid sequences. Protein structure prediction is one of the

More information

Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Supplementary Material

Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Supplementary Material Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions Joshua N. Burton 1, Andrew Adey 1, Rupali P. Patwardhan 1, Ruolan Qiu 1, Jacob O. Kitzman 1, Jay Shendure 1 1 Department

More information

From genome-wide association studies to disease relationships. Liqing Zhang Department of Computer Science Virginia Tech

From genome-wide association studies to disease relationships. Liqing Zhang Department of Computer Science Virginia Tech From genome-wide association studies to disease relationships Liqing Zhang Department of Computer Science Virginia Tech Types of variation in the human genome ( polymorphisms SNPs (single nucleotide Insertions

More information

Final Project Report CS224W Fall 2015 Afshin Babveyh Sadegh Ebrahimi

Final Project Report CS224W Fall 2015 Afshin Babveyh Sadegh Ebrahimi Final Project Report CS224W Fall 2015 Afshin Babveyh Sadegh Ebrahimi Introduction Bitcoin is a form of crypto currency introduced by Satoshi Nakamoto in 2009. Even though it only received interest from

More information

Introduction To Genetic Algorithms

Introduction To Genetic Algorithms Introduction To Genetic Algorithms Cse634 DATA MINING Professor Anita Wasilewska Computer Science Department Stony Brook University 1 Overview Introduction To Genetic Algorithms (GA) GA Operators and Parameters

More information

Topic 2: Population Models and Genetics

Topic 2: Population Models and Genetics SCIE1000 Project, Semester 1 2011 Topic 2: Population Models and Genetics 1 Required background: 1.1 Science To complete this project you will require some background information on three topics: geometric

More information

A New Database of Genetic and. Molecular Pathways. Minoru Kanehisa. sequencing projects have been. Mbp) and for several bacteria including

A New Database of Genetic and. Molecular Pathways. Minoru Kanehisa. sequencing projects have been. Mbp) and for several bacteria including Toward Pathway Engineering: A New Database of Genetic and Molecular Pathways Minoru Kanehisa Institute for Chemical Research, Kyoto University From Genome Sequences to Functions The Human Genome Project

More information

The interactions that occur between two proteins are essential parts of biological systems.

The interactions that occur between two proteins are essential parts of biological systems. Gabor 1 Evaluation of Different Biological Data and Computational Classification Methods for Use in Protein Interaction Prediction in Signaling Pathways in Humans Yanjun Qi 1 and Judith Klein-Seetharaman

More information

Analysis of Biological Networks: Networks in Disease

Analysis of Biological Networks: Networks in Disease Analysis of Biological Networks: Networks in Disease Lecturer: Roded Sharan Scribe: Yaniv Bar and Naama Hazan Lecture 11. June 03, 2009 1 Disease and drug-related networks The idea of using networks to

More information

Robustness Analysis. Contents

Robustness Analysis. Contents Robustness Analysis Terrill L. Frantz 14 June 2006 CASOS Summer Institute Center for Computational Analysis of Social and Organizational Systems http://www.casos.cs.cmu.edu/ Contents Network Topology Basics

More information

296 IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, VOL. 10, NO. 3, JUNE 2006

296 IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, VOL. 10, NO. 3, JUNE 2006 296 IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, VOL. 10, NO. 3, JUNE 2006 An Evolutionary Clustering Algorithm for Gene Expression Microarray Data Analysis Patrick C. H. Ma, Keith C. C. Chan, Xin Yao,

More information

Part 1: Motivation, Basic Concepts, Algorithms

Part 1: Motivation, Basic Concepts, Algorithms Part 1: Motivation, Basic Concepts, Algorithms 1 Review of Biological Evolution Evolution is a long time scale process that changes a population of an organism by generating better offspring through reproduction.

More information

Chapter 2 Gene Prioritization Resources and the Evaluation Method

Chapter 2 Gene Prioritization Resources and the Evaluation Method Chapter 2 Gene Prioritization Resources and the Evaluation Method Abstract Different data sources have been successfully exploited to predict the disease relevance of candidate genes, so in this chapter,

More information

Classification in Parkinson s disease. ABDBM (c) Ron Shamir

Classification in Parkinson s disease. ABDBM (c) Ron Shamir Classification in Parkinson s disease 1 Parkinson s Disease The 2nd most common neurodegenerative disorder Impairs motor skills, speech, smell, cognition 1-3 sick per 1 >1% in individuals aged above 7

More information

Optimal, Efficient Reconstruction of Phylogenetic Networks with Constrained Recombination

Optimal, Efficient Reconstruction of Phylogenetic Networks with Constrained Recombination UC Davis Computer Science Technical Report CSE-2003-29 1 Optimal, Efficient Reconstruction of Phylogenetic Networks with Constrained Recombination Dan Gusfield, Satish Eddhu, Charles Langley November 24,

More information