Chapter 2 Gene Prioritization Resources and the Evaluation Method


Abstract

Different data sources have been successfully exploited to predict the disease relevance of candidate genes, so this chapter introduces the different types of data sources. To evaluate and compare gene prioritization algorithms, it also introduces the leave-one-out cross-validation method and the performance metrics used to compare methods.

2.1 Type of Data Sources

Different data sources have been successfully exploited to predict the disease relevance of candidate genes (Fig. 2.1). For a correct interpretation of the obtained prediction results, it is useful to consider what type of evidence has been used to derive them and to know about possible intrinsic problems, such as a potential bias toward well-characterized candidate genes. Also, the amount and quality of the utilized data often have a major impact on the reliability of the results (Piro and Cunto 2012). The classification of data sources in this section follows Piro and Cunto (2012). Some of the important data sources that are used in gene prioritization tools are listed in Table 2.1.

Text Mining of Biomedical Literature

The millions of biomedical abstracts provided by PubMed (Acland et al. 2013), or the thousands of phenotype and disease gene descriptions contained in OMIM and comparable databases, represent an enormous amount of knowledge that can be mined using dedicated natural language processing techniques. Indeed, text mining of such biomedical literature was among the first approaches to disease gene prediction. The problem with such full-text resources is the lack of a consistent representation or organization of key concepts, because the same concept can often be described or named in many different ways. This becomes clear when considering, for instance, the several aliases that can be used for most genes. Automated processing of full-text records therefore often relies on controlled vocabularies such as MeSH (Acland et al. 2013), UMLS (McInnes et al. 2013) or eVOC (Kelso et al. 2003) to map the encountered full-text expressions to well-defined and hierarchically organized biomedical terms (Piro and Cunto 2012).

A. Masoudi-Nejad, A. Meshkin, Gene Prioritization, SpringerBriefs in Systems Biology, 9 DOI / _2, The Authors 2014
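The role of a controlled vocabulary in text mining can be illustrated with a minimal sketch. It assumes a tiny hand-written alias table and three invented example sentences. A real system would load a curated vocabulary and use proper natural language processing rather than token matching. The names `ALIASES`, `DISEASE_TERMS`, `normalize_genes`, and `cooccurrence` are hypothetical and do not come from any of the tools discussed here.

```python
import re
from collections import Counter

# Toy alias table (illustrative only): every alias maps to one canonical symbol.
ALIASES = {
    "tp53": "TP53", "p53": "TP53", "trp53": "TP53",
    "brca1": "BRCA1", "rnf53": "BRCA1",
}
# Toy disease vocabulary (illustrative only), matched as lowercase substrings.
DISEASE_TERMS = {"breast cancer", "li-fraumeni syndrome"}


def normalize_genes(text):
    """Return the set of canonical gene symbols mentioned in a text."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return {ALIASES[t] for t in tokens if t in ALIASES}


def cooccurrence(abstracts):
    """Count abstracts in which a gene and a disease term appear together."""
    counts = Counter()
    for text in abstracts:
        lowered = text.lower()
        diseases = {d for d in DISEASE_TERMS if d in lowered}
        for gene in normalize_genes(text):
            for disease in diseases:
                counts[(gene, disease)] += 1
    return counts


# Invented example sentences standing in for abstracts.
abstracts = [
    "Mutations in p53 are frequent in breast cancer.",
    "TP53 germline variants cause Li-Fraumeni syndrome.",
    "RNF53 (BRCA1) loss predisposes to breast cancer.",
]
counts = cooccurrence(abstracts)
for (gene, disease), n in sorted(counts.items()):
    print(gene, disease, n)
```

Because every alias is mapped to one canonical symbol before counting, the mentions "p53" and "TP53" contribute to the same gene, which is the point of using a controlled vocabulary.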

Fig. 2.1 Different data sources that are used in gene prioritization methods (Tranchevent et al. 2008)

Table 2.1 Overview of the important data sources with their corresponding websites (Yu et al. 2008)
Category              Name
Gene-centered         Entrez Gene; Ensembl Human; Swiss-Prot; AceView; HuGE Navigator; OMIM; GeneCards; Genetics Home Reference; SOURCE
Literature            PubMed; HuGE Navigator; Genetic Association Database
Pharmacogenetics      PharmGKB
Variation/Prevalence  dbSNP; dbSNP-genotype; dbSNP-geneview; ALFRED; SNPper; Human Gene Mutation Database; International HapMap Project; The Cancer Genome Anatomy Project
Pathway               KEGG; BioCarta; Pathway Interaction Database
Microarray            NCBI Gene Expression Omnibus
Miscellaneous         NCBI Bookshelf; NCBI Gene Ontology Database; GeneTests

Functional Annotations, Pathways, and Ontologies

Functional annotations in a broad sense, including not only biological processes and molecular functions but also metabolic or signalling pathways, are another rich source of evidence that is frequently used for disease gene prediction. Like the biomedical literature discussed above, such information represents a logical way to initiate the search for good candidate genes, but it is, of course, inherently biased toward better characterized genes. Furthermore, only a small percentage of the functional annotations contained in many databases have actually been experimentally verified (Perez-Iratxeta et al. 2007). Nonetheless, even predicted functional annotations can be valuable if they can be confirmed at least in model organisms. Among the most widely used databases of functional annotations and pathways are the Gene Ontology (GO; Ashburner et al. 2000) annotations and KEGG (Kyoto Encyclopedia of Genes and Genomes; Kanehisa et al. 2008), respectively.

The GO is a controlled vocabulary, that is, a set of standard terms (words and phrases) used for indexing and retrieving information. In addition to defining terms, GO also defines the relationships between the terms, making it a structured vocabulary (Fig. 2.2).

Apart from those already mentioned (GO, MeSH, UMLS, eVOC), other ontologies are of interest for disease gene prediction and can be used to annotate genes and proteins. The Mammalian Phenotype Ontology (MPO; Smith et al. 2004) and the more recent Human Phenotype Ontology (HPO; Robinson et al. 2008) are good examples. Notably, in addition to a controlled vocabulary of over 8000 terms representing individual phenotypic anomalies, the HPO provides an annotation of all clinical OMIM entries with these terms, thus helping to standardize this important source of disease descriptions. Also, the Disease Ontology (DO; Osborne et al. 2009), based on UMLS, is used by some prediction tools (Piro and Cunto 2012).
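What makes an ontology such as GO a structured vocabulary, rather than a flat list of terms, can be sketched as follows. The example assumes a toy is_a hierarchy with a handful of term names chosen for illustration. It does not read any real ontology file, and the gene names `geneA` and `geneB` are hypothetical. It shows one common use of the structure: a gene annotated with a specific term is implicitly annotated with all of that term's ancestors, so two genes with different specific annotations can still be compared through shared, more general terms.

```python
# Toy is_a hierarchy (child -> list of parents); term names are illustrative only.
PARENTS = {
    "lipid metabolic process": ["metabolic process"],
    "metabolic process": ["biological process"],
    "signal transduction": ["cellular process"],
    "cellular process": ["biological process"],
    "biological process": [],
}


def ancestors(term):
    """All terms reachable by following is_a links upward (the term itself excluded)."""
    seen = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def propagate(annotations):
    """Expand each gene's direct annotations with all ancestor terms."""
    return {
        gene: set(terms) | {a for t in terms for a in ancestors(t)}
        for gene, terms in annotations.items()
    }


# Hypothetical direct annotations for two hypothetical genes.
direct = {
    "geneA": {"lipid metabolic process"},
    "geneB": {"signal transduction"},
}
expanded = propagate(direct)
shared = expanded["geneA"] & expanded["geneB"]
print(sorted(shared))
```

In this toy hierarchy the two genes share only the root term, which reflects that their annotations are related only at the most general level.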

Fig. 2.2 Hierarchical structure of the Gene Ontology

Phenotype Relationships

Databases or networks that describe relationships between phenotypes (Barabási et al. 2011) can also be used, for example, to define a set of reference genes. If the disease of interest is of unknown molecular basis, i.e., does not have any associated disease gene, taking reference genes from similar or related disorders (that are likely to arise from similar mechanisms; Brunner and van Driel 2004) may be appropriate. Phenotype relationships themselves are often derived from other types of data sources discussed here. MimMiner (van Driel et al. 2006), for example, relies upon text mining of OMIM phenotype entries and uses MeSH as a controlled vocabulary (Piro and Cunto 2012).

Intrinsic Gene Properties

Intrinsic gene (or protein) properties such as gene or protein length, phylogenetic breadth, degree of conservation, and paralogy may also provide a clue about a possible relevance for hereditary disorders, because these properties differ statistically between disease genes and genes not known to be involved in disease (López-Bigas and Ouzounis 2004). This is exploited by several prediction tools. However, as Tiffin et al. argue, the evaluation of the predictive power of such intrinsic properties relies on the classification of genes into disease genes and nondisease genes. This may be meaningful for monogenic (Mendelian) disorders but is less justified for more complex diseases where genes, rather than producing an obvious phenotype, contribute to disease susceptibility or act as modifiers, i.e., they affect the severity of disease-causing mutations in other genes (Tiffin et al. 2009).

Among intrinsic gene properties, the presence of protein domains is of particular significance because these may additionally hint at molecular functions in which a gene could be involved. If, for example, genes known to be involved in a disease or disease class (e.g., metabolic disorders) are significantly enriched for a particular protein domain, then the presence or absence of that domain in candidate genes may be a meaningful criterion for their evaluation. (In this example, no use of functional annotations is made. It is also possible, however, to additionally rely on knowledge about the molecular functions of protein domains. In this case both intrinsic gene properties and functional annotations must, of course, be considered as types of data sources on which predictions are based.) (Piro and Cunto 2012).

Sequence Data

A rarely utilized type of data source, whose importance will certainly increase in the future, is data obtained through next-generation sequencing techniques with the aim of directly identifying mutations in the genomes of patients and evaluating their potential disease relevance. Although coding sequences and their associated regulatory elements may in principle be considered as intrinsic properties of genes (described above), it is important to underscore the conceptual difference between general properties that genes or proteins show over the entire population (like their length, degree of conservation, etc.) and case- or patient-specific properties like structural variants and amino acid substitutions (Piro and Cunto 2012).

Protein-Protein Interactions

The protein interactome, i.e., the network that represents physical interactions between proteins, is one of the most frequently used types of data sources for disease gene prediction (Navlakha and Kingsford 2010) because it is intuitively clear that proteins that physically interact with each other will often do so to exert a common function.
Therefore, a deleterious alteration of any one of them is likely to lead to the development of similar phenotypes. In fact, this assumption is confirmed by the widespread association of protein complexes with human disease (Brunner and van Driel 2004). A major concern, however, is the amount and quality of the available experimental data. Most protein-protein networks contain very few curated and well-studied interactions, and many interactions are derived from experimental techniques like mass spectrometry and the yeast two-hybrid method, which still suffer from sensitivity and specificity problems. Often these experimentally inferred human interactions are complemented by interactions from model species and by protein domain based predictions. Generally, protein-protein interaction based methods suffer from the incompleteness and low quality of the data currently available for interaction networks in mammals (Kann 2010). This introduces some bias toward better characterized genes and proteins, although this bias is probably far less pronounced than the inherent bias of text mining of biomedical literature and many functional annotations. Frequently used, publicly available collections of protein-protein interactions include the Human Protein Reference Database (HPRD; Goel et al. 2011) and the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING; Jensen et al. 2009), the latter of which also integrates and weighs known and predicted functional interactions (Piro and Cunto 2012).

Gene Expression Information

Gene expression is an important aspect of gene function. Indeed, cellular functions are the result not only of the molecular functions of the single components of a cell, but to a large extent also of their coordinated expression both in space and time. In other words, even though the molecular function of a gene product is largely determined by its enzymatic function, its DNA binding capabilities, or more generally its interactions with other cellular molecules, gene expression is one of the major determinants of when and where this function is exerted (Piro and Cunto 2012). Accordingly, gene expression patterns can give valuable hints about functional relationships and interactions both between single genes and between gene groups (Eisen et al. 1998; Quackenbush 2001). Gene expression information is one of the least biased types of data sources. It is provided by high-throughput experiments with techniques such as serial analysis of gene expression (SAGE; Velculescu et al. 1995), cDNA and oligonucleotide microarrays (Quackenbush 2001; Brown and Botstein 1999), and next-generation sequencing applied to mRNA instead of DNA (RNA-Seq; Wang et al. 2009).

Regulatory Information

Gene regulatory networks (GRNs; Arda and Walhout 2010) are a common form of representation of direct regulatory interactions between genes and can be used by disease gene prediction tools. For instance, a transcription factor that regulates several known disease genes can itself be considered a good candidate for being involved in that disease. GRNs can themselves be inferred from other types of data sources, like gene expression or regulatory sequence information, because the number of experimentally confirmed regulatory interactions is still comparably low. Regulatory information, although potentially of great interest for disease gene prediction, suffers from the same (if not worse) incompleteness and low quality of available data as information on protein-protein interactions. Some tools, instead of using regulatory information in the form of GRNs, try to infer disease relevance directly from regulatory sequence information such as the presence or absence of transcription factor or microRNA binding sites (see for example Gefen et al. 2010). In any case, the exact origin and reliability of regulatory information should be taken into account for correct interpretation of the results obtained from prediction tools (Piro and Cunto 2012).

Orthology and Conservation

The knowledge gained from model organisms has always played a fundamental role in molecular biology. It is therefore straightforward to try to integrate some of this knowledge into disease gene prediction methodologies. Basically, all the data sources mentioned above can be combined with the notions of orthology and conservation. This may be important when human data are limited or not available at all (Yu et al. 2008). In this case, it can often be justified to use data from closely related species instead. Additionally, instead of simply replacing human data, knowledge from other species can be combined in various ways with available human data. On one hand, the data sources from different species can be directly integrated into a single, more comprehensive data source. On the other hand, data from other species can also be used to filter human data so as to reduce noise and/or shift the focus to essential aspects that have been preserved in the course of evolution (Piro and Cunto 2012).

2.2 Why Data Integration?

The application of a single type of data source to disease gene prediction is rare. Since the different data sources can provide quite complementary disease-relevant information, in many cases they are practically, and often even conceptually, combined. Protein-protein interactions, for example, can indicate functional relationships even when the transcriptional correlation between genes is not very strong. Likewise, a strong transcriptional coexpression can hint at a functional relationship even when gene products do not physically interact with each other.
Data sources are the core of the gene prioritization problem, since both high-coverage and high-quality data sources are needed to make accurate predictions. Most of the tools make use of a wide range of data sources. A fundamental issue in studies using a single data source is the potential bias of their results caused by the incompleteness and noise of one particular data set. The gene prioritization research environment is similar to an old story about a group of blind people touching an elephant, as shown in Fig. 2.3. Each one of them touches a different body part and draws a conclusion about what an elephant looks like. Their conclusions are all partly correct, but they fail to see the whole picture. Intuitively, multiple data sources tend to provide a better signal-to-noise ratio and thus may improve prediction accuracy.
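How complementary sources can be combined after each has produced its own ranking can be sketched with a simple mean-rank aggregation. This is only one of many possible schemes and is not claimed to be the method of any particular tool. The source names, gene names, and scores below are invented, and the choice to give a gene missing from a source the worst possible rank in that source is an assumption of this sketch.

```python
def ranks(scores):
    """Map each gene to its rank (1 = best) in one data source; higher score is better."""
    ordered = sorted(scores, key=lambda g: (-scores[g], g))
    return {gene: i + 1 for i, gene in enumerate(ordered)}


def mean_rank(per_source_scores):
    """Order genes by their average rank over all sources.

    Assumption of this sketch: a gene that a source does not cover
    receives the worst possible rank in that source.
    """
    genes = set().union(*per_source_scores.values())
    worst = len(genes)
    totals = {g: 0.0 for g in genes}
    for scores in per_source_scores.values():
        r = ranks(scores)
        for g in genes:
            totals[g] += r.get(g, worst)
    n = len(per_source_scores)
    return sorted(genes, key=lambda g: (totals[g] / n, g))


# Invented scores for three hypothetical candidate genes in three sources.
sources = {
    "interactions": {"G1": 0.9, "G2": 0.2, "G3": 0.4},
    "coexpression": {"G1": 0.5, "G2": 0.1, "G3": 0.8},
    "literature": {"G1": 0.7, "G3": 0.3},
}
combined = mean_rank(sources)
print(combined)
```

Here no single source has to be complete: the literature source does not cover G2 at all, yet the combined ordering still places every gene.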

Fig. 2.3 The old story about a group of blind people touching an elephant (Tranchevent et al. 2008)

2.3 Data Integration

There is a plethora of data sources that contain large amounts of relevant gene and protein data such as sequences, molecular functions, roles in pathways and biological processes, expression profiles, regulatory mechanisms, interactions with other biomolecules, and biomedical literature. Such biological data sources are at the core of gene prioritization methods, because prioritization algorithms sift through these data to create a computational model of promising candidates. The integration of high-quality biological data sources is necessary, but not sufficient, to obtain accurate predictions.

A typical workflow for integrating multiple data sources in the prioritization of candidate genes is shown in Fig. 2.4. Genome and phenome knowledge sources are considered to create different relationships among diseases/genes (Fig. 2.4a). Similarities between diseases are calculated and a phenome network is constructed as a weighted graph (Fig. 2.4b). Similarities between genes can be calculated in two ways:
(i) The relationships of gene pairs in all databases are combined as one final relationship and then a combined functional network is constructed.
(ii) The relationship of a gene pair in each database is calculated individually and multiple genotype networks are constructed.
The genes collected from linkage analysis or differentially expressed genes from microarray experiments are used as the test gene set (Fig. 2.4c). Candidate genes are ranked by using the calculated values output by computational tools (Fig. 2.4d; Chen et al. 2012).
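The difference between the two ways of calculating gene similarities can be made concrete with a small sketch. It assumes that each database contributes a dictionary of weighted gene pairs and that option (i) combines them by averaging over all databases, with a missing pair counted as weight 0. The averaging rule, the database names, and the weights are assumptions made for illustration and are not taken from Chen et al. (2012). Option (ii) corresponds to keeping the per-database dictionaries separate.

```python
from collections import defaultdict


def combine_networks(networks):
    """Option (i): merge per-database gene-pair weights into one functional network.

    Assumption of this sketch: the combined weight is the average over all
    databases, and a pair that a database does not list counts as weight 0.
    """
    totals = defaultdict(float)
    for edges in networks.values():
        for (a, b), weight in edges.items():
            # Treat (a, b) and (b, a) as the same undirected pair.
            totals[tuple(sorted((a, b)))] += weight
    return {pair: total / len(networks) for pair, total in totals.items()}


# Option (ii) keeps one network per database; these weights are invented.
networks = {
    "ppi": {("G1", "G2"): 1.0, ("G2", "G3"): 0.5},
    "pathway": {("G2", "G1"): 0.5},
    "expression": {("G1", "G3"): 0.9, ("G2", "G3"): 0.7},
}
combined = combine_networks(networks)
for pair, weight in sorted(combined.items()):
    print(pair, round(weight, 3))
```

A single combined network is simpler to rank on, whereas separate networks keep it visible which database supports each relationship.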

Fig. 2.4 A typical workflow for integrating multiple data sources in gene prioritization (Chen et al. 2012)

Acquiring and merging numerous sources of heterogeneous data presents severe technical challenges. First, multiple identifiers are available for genes, transcripts and proteins (such as Ensembl gene identifiers, Affymetrix probe identifiers or SwissProt identifiers), and there is not necessarily a one-to-one relationship between them. Thus, data from different sources need to be appropriately mapped and merged. Moreover, information about diseases, phenotypes, and biological processes is far from being fully standardized.

2.4 Utilized Data Sources in Gene Prioritization Tools

There are several data sources used by the tools, including text mining (co-occurrence and functional mining), protein-protein interaction (PPI), functional annotations, pathways, expression, sequence, phenotype, conservation, regulation, disease probabilities, and chemical components. The four data sources most commonly used are text mining (functional and interactions mining), protein-protein interactions, functional annotations and pathways (Table 2.2).

2.5 Validation Method

Leave-one-out cross-validation is used to evaluate different gene prioritization algorithms, so this method is discussed briefly.

Leave-One-Out Cross-Validation

As the name suggests, leave-one-out cross-validation (LOOCV) involves using a single observation from the original sample as the validation data, and the remaining observations as the training data. This is repeated such that each observation in the sample is used once as the validation data. This is the same as a K-fold cross-validation with K being equal to the number of observations in the original sample (Fig. 2.5). For example, as shown in Fig. 2.6, in each leave-one-out cross-validation fold for a given disease, a different gene is withheld from the set of known disease genes (red, blue, orange). The remaining genes known to be associated with that particular disease are mapped onto the network and used as prior knowledge (training set) to compute gene-disease scores for all the genes in the network. A test set, consisting of the left-out gene and a set of candidates previously sampled from a pool of genes (the genes in a network or the intersection of the sets of genes in different networks), is sorted according to the obtained gene-disease scores. The performance is then determined by assessing the position of the left-out gene in the ranked test set. The overall and per-disease results are then averaged over complete leave-one-out cross-validation runs, each using a distinct set of candidate genes (Gonçalves et al. 2012).
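The leave-one-out procedure for a single disease can be sketched as follows. The scoring function used here, which counts how many network neighbours of a gene are known disease genes, is a deliberately simple stand-in chosen for this illustration and is not the scoring method of Gonçalves et al. (2012) or of any specific tool. The network, the gene names, and the pessimistic handling of ties are likewise assumptions of the sketch.

```python
def build_network(edges):
    """Build an undirected adjacency map from a list of gene pairs."""
    network = {}
    for a, b in edges:
        network.setdefault(a, set()).add(b)
        network.setdefault(b, set()).add(a)
    return network


def neighbor_scores(network, seeds):
    """Stand-in scorer: number of a gene's neighbours that are seed (training) genes."""
    return {gene: len(neighbors & seeds) for gene, neighbors in network.items()}


def loocv_ranks(network, disease_genes, candidates):
    """Leave each known disease gene out once and record its rank in the test set."""
    result = {}
    for left_out in sorted(disease_genes):
        seeds = disease_genes - {left_out}          # training set for this fold
        scores = neighbor_scores(network, seeds)
        test_set = (set(candidates) - disease_genes) | {left_out}
        # Pessimistic tie handling: every tied candidate counts as ranked above.
        better_or_equal = sum(
            1 for g in test_set
            if g != left_out and scores.get(g, 0) >= scores.get(left_out, 0)
        )
        result[left_out] = 1 + better_or_equal
    return result


# Invented toy network: D* are known disease genes, C* are other candidates.
edges = [("D1", "D2"), ("D2", "D3"), ("D1", "C1"),
         ("D2", "C1"), ("D3", "C2"), ("C1", "C3")]
network = build_network(edges)
ranks = loocv_ranks(network, {"D1", "D2", "D3"}, ["C1", "C2", "C3"])
print(ranks)
```

The recorded ranks of the left-out genes are the raw material for the performance measures: a method that works well places the left-out gene near the top of its test set in most folds.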

Table 2.2 Repartition of the gene prioritization tools according to the data sources (Tranchevent et al. 2013)
Columns: Tools; Functional annotations; Expression; Regulatory information; Text (co-citation); Text (functional); Interactions; Pathways; Sequence; Phenotype; Conservation/homology; Disease probabilities; Chemical components
aGeneApart X
BioGraph X X X X X
Biomine X X X X X X X X
Bitola X
Caesar X X X X X X X
Candid X X X X X X
CGI X X
DGP X X
DIR X X X
DomainRBF X
ENDEAVOUR X X X X X X X X
eResponseNet X X X X
G2D X X X X
GeneDistiller X X X X X X
GeneFriends X
GeneProspector X
GeneRank X X
GeneRanker X X X X X
GeneSeeker X X X X
GeneWanderer X X X X X
Génie X X
GenTrepid X X X X
GLAD4U X
GPSy X X X X X X X X X X
GUILD X
MedSim X X X X
MetaRanker X X X
MimMiner X X X X X
Pandas X X X
PGMapper X
PhenoPred X X X X
Pinta X X X X X
Pocus X X
PolySearch X X X X X
PosMed X X X X X X X
PRINCE X X
Prioritizer X X X X
ProDiGe X X X X X X
ProphNet X X X
S2G X X X X X X X
SNPs3D X X X X X X
Suspects X X X
TargetMine X X X X X
Tom X X
ToppGene X X X X X X X
VAVIEN X

Fig. 2.5 The procedure of leave-one-out cross-validation method (Tranchevent et al. 2008)

Fig. 2.6 Evaluation scheme for leave-one-out cross-validation (Gonçalves et al. 2012)

Fig. 2.7 A sample ROC curve

2.6 Performance Measures

The following measures, in combination with the leave-one-out cross-validation method, are used to compare the performance of the different gene prioritization algorithms.

ROC

The receiver operating characteristic (ROC) can be applied to gene prioritization. Here, the true positive rate (TPR) is taken to be the proportion of true causative genes ranked within a threshold rank, and the false positive rate (FPR) is the proportion of noncausative genes ranked within the same threshold. The ROC curve plots TPR against FPR as the threshold varies. To compare different ROC curves, the area under the curve (AUC) is often used (Fig. 2.7). The higher the value, the better the predictor. A perfect predictor will have an AUC of 1, while a random predictor will get an average value of 0.5.
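The rank-based ROC construction can be sketched as follows, assuming a single ranked list without ties that contains at least one causative and one noncausative gene. The gene names and the ranking are invented, and the AUC is computed with the trapezoidal rule, which is one standard way of obtaining the area.

```python
def roc_points(ranked_genes, causative):
    """Walk down the ranking and record (FPR, TPR) after each rank threshold."""
    n_pos = sum(1 for g in ranked_genes if g in causative)
    n_neg = len(ranked_genes) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for gene in ranked_genes:
        if gene in causative:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Invented ranking (best first) with two causative genes.
ranking = ["G1", "G2", "G3", "G4", "G5", "G6"]
causative = {"G1", "G3"}
points = roc_points(ranking, causative)
print(points)
print(auc(points))  # 0.875
```

The value 0.875 can be checked by hand: of the 2 x 4 = 8 pairs of one causative and one noncausative gene, 7 have the causative gene ranked higher, and 7/8 = 0.875.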

Enrichment

Another way to measure performance is fold enrichment. If a method ranks known disease genes in the top m % of all candidate genes in n % of the test cases, it is said to have n/m-fold enrichment on average. For instance, if a method ranks 50 % of the known disease genes in the top 1 %, it is said to have 50-fold enrichment.

2.7 Summary

In this chapter, different types of data sources are described. To evaluate and compare different gene prioritization algorithms, the performance evaluation metrics and the leave-one-out cross-validation method are introduced.
