Predicting Gene Functions


1 Predicting Gene Functions
Shubhra Sankar Ray, Center for Soft Computing Research, Indian Statistical Institute, Kolkata, India

2 Tasks in Bioinformatics:
- Alignment, comparison and analysis of DNA, RNA, and protein sequences
- Interpretation and analysis of microarray gene expression data
- Gene mapping on chromosomes
- Gene finding and promoter identification from DNA sequences
- Predicting gene regulatory networks
- Construction of phylogenetic trees for studying evolutionary relationships
- Structure prediction and classification of DNA, RNA and protein
- Molecular and ligand design with molecular docking
- Predicting the function of unclassified genes
These tasks need pattern recognition and structure prediction algorithms to understand the meaning of the data.

3
- Motivation
- Tasks involved in gene function prediction
- Basics of data sources
- Related problems
- Combining multi-source information
- Validation
- Gene function prediction
- Summary
- References

4 Motivation
One of the important goals of biological investigation is to predict the function of unclassified genes. One approach involves identifying the nearest classified genes using different data sources, such as microarray gene expression, protein sequences, protein-protein interaction data, and pathway information from the Kyoto Encyclopedia of Genes and Genomes (KEGG). Even in a model organism like Yeast, more than 800 genes have unknown biological function in the Munich Information Center for Protein Sequences (MIPS) and the Saccharomyces Genome Database (SGD).

5 A single data set can assess functional relationships between genes and can assign functional annotation to a significant number of unclassified genes, but a single source alone often lacks the degree of specificity needed for accurate gene function prediction. Specificity can be improved by combining multiple data sets in an integrated analysis.

6 Tasks involving Multi-Source Information Integration for Gene Function Prediction
- Choosing informative data sources
- Extracting similarities/scores from individual sources
- Benchmarking data sources in a common framework
- Finding a method for integrating different similarities/scores
- Finding highly similar gene pairs
- Predicting networks/clusters of highly similar genes
- Predicting the function of unclassified genes

7 Data Sources
The types of data source that can be used for gene function prediction are:
- Microarray gene expression
- Protein similarity through transitive homology
- Protein-protein interaction information
- KEGG pathway information
According to MIPS, the accuracies of all these data sources are above 50%. Phylogenetic profiles, Rosetta Stone linkages, and Medline abstracts are avoided because of their low accuracies.

8 Measuring Gene Expression with Microarray
- In the biological experiment, mRNA from the experimental samples is labeled during reverse transcription with the red-fluorescent dye Cy5.
- The Cy5/Cy3 fluorescence ratio (the gene expression value) is obtained from the microarray by measuring the spot intensities with a fluorescence scanner.
- Many important unanswered questions could potentially be answered by correctly selecting, assembling, analyzing, and interpreting microarray data.

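9 Microarray Gene Expression Values
[Table: expression values for the genes YDR029W, YBL052C, YOR337W, YMR183C, YKR021W, YHR023W and YHR029C (rows) under the conditions Cell Cycle 1, Cell Cycle 2, Sporulation 1, Sporulation 2, Shock 1, Shock 2 and Diauxic Shift 1 (columns); the numeric entries are not recoverable. Dynamic range of PMT: -1.2 to ... to 2.0.]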

10 Similarity Extraction through Gene Expression
Gene similarity is extracted using the Pearson correlation, defined as
$$P_{X,Y} = \frac{1}{k} \sum_{i=1}^{k} \left(\frac{x_i - \bar{X}}{\sigma_X}\right) \left(\frac{y_i - \bar{Y}}{\sigma_Y}\right)$$
where $x_i$ and $y_i$ are the expression values of genes X and Y at the $i$-th of $k$ time points, and $\bar{X}$ and $\sigma_X$ are the mean and standard deviation of the expression profile of gene X (likewise $\bar{Y}$ and $\sigma_Y$ for gene Y).
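The Pearson similarity can be sketched in a few lines of Python. This is a minimal illustration, assuming the 1/k (population) form of the mean and standard deviation; the function name and the two expression profiles are hypothetical and not taken from the slides.

```python
import math


def pearson_similarity(x, y):
    """Pearson correlation with 1/k (population) mean and standard deviation."""
    if len(x) != len(y) or not x:
        raise ValueError("profiles must be non-empty and of equal length")
    k = len(x)
    mean_x = sum(x) / k
    mean_y = sum(y) / k
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / k)
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / k)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    # Average of the products of the standardized values.
    return sum(((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
               for a, b in zip(x, y)) / k


# Hypothetical expression profiles over four time points.
gene_x = [0.1, 0.5, -0.3, 1.2]
gene_y = [0.2, 1.0, -0.6, 2.4]       # exactly 2 * gene_x
gene_z = [-v for v in gene_x]        # mirror image of gene_x

print(pearson_similarity(gene_x, gene_y))   # perfectly correlated profiles
print(pearson_similarity(gene_x, gene_z))   # perfectly anti-correlated profiles
```

Profiles that rise and fall together score close to 1 regardless of their scale, which is why correlation rather than raw distance is a natural choice for expression ratios.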

11 Protein Sequence Similarity Extraction through Transitive Homologues
- 6221 Yeast protein sequences, corresponding to 6221 ORFs/genes, are downloaded from SGD and compared, using BLAST, with 3,357,450 protein sequences downloaded from the UniProt database.
- Proteins related by direct homology are all classified.
- Transitive homology can identify new relations through the 3,357,450 sequences.
- The BLAST score $B_{X,Y}$ is used as the similarity between two protein sequences X and Y.

12 How BLAST (Basic Local Alignment Search Tool) Works
- Given a query sequence, look for high-scoring words of length W in the database.
- Compile a list L of all words that score > T.
- When a word is found, extend the alignment.
- When the score drops by more than X from its best value, stop the extension.
- Report all extended words (segment pairs) with a large score S.
Parameters:
- W: word size, the minimum number of aligned amino acids (3)
- T: threshold, focus on pairs scoring > T (11)
- X: drop-off, stop extending when the loss > X (15)
- S: score, the final score of the segment pair
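The seed-and-extend idea behind these steps can be illustrated with a toy sketch. It is not the BLAST implementation: it assumes exact word matches as seeds (instead of neighborhood words scoring above T), a simple match/mismatch score (instead of a BLOSUM matrix), and ungapped extension only. All function names, default values other than W and X, and the example sequences are hypothetical.

```python
def score(a, b, match=5, mismatch=-4):
    """Toy residue score; a stand-in for a substitution matrix such as BLOSUM."""
    return match if a == b else mismatch


def extend(query, subject, qi, si, step, x_drop):
    """Ungapped extension in one direction; returns the best score gained."""
    best = running = 0
    qi, si = qi + step, si + step
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        running += score(query[qi], subject[si])
        if running > best:
            best = running
        elif best - running > x_drop:   # loss exceeds X: stop extending
            break
        qi, si = qi + step, si + step
    return best


def seed_and_extend(query, subject, w=3, x_drop=15, s_min=20):
    """Return (score, query_seed_start, subject_seed_start), best hit per diagonal."""
    # Index every word of length w in the query (exact words only).
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)

    best_by_diagonal = {}
    for j in range(len(subject) - w + 1):
        for i in words.get(subject[j:j + w], []):
            seed = sum(score(query[i + k], subject[j + k]) for k in range(w))
            total = (seed
                     + extend(query, subject, i + w - 1, j + w - 1, +1, x_drop)
                     + extend(query, subject, i, j, -1, x_drop))
            diagonal = i - j
            # Keep only segment pairs that reach the reporting score s_min.
            if total >= s_min and total > best_by_diagonal.get(diagonal, (0,))[0]:
                best_by_diagonal[diagonal] = (total, i, j)
    return sorted(best_by_diagonal.values(), reverse=True)


hits = seed_and_extend("FDSKTHRGHR", "AAFDSKTHAA")
print(hits)   # one hit covering the shared stretch FDSKTH
```

Seeds on the same diagonal extend to the same segment, so the sketch keeps only the best hit per diagonal.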

13 There are twenty types of amino acids; each pair of amino acids has a similarity score, which varies from pair to pair (BLOSUM matrix).
Aligning protein sequences (gap = -5):
FDSK-THRGHR
FDSYWTH-GHR
Score: 36
FDS = 16, GHR = 19. No loss in extension (+1).
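The score of 36 can be checked column by column. The sketch below assumes that the slide's alignment places a gap where the transcript shows a space (FDSK-THRGHR over FDSYWTH-GHR) and that the matrix is BLOSUM62; only the BLOSUM62 entries needed for this example are included, and the names are hypothetical.

```python
# Subset of BLOSUM62 entries needed for this example only (assumed matrix).
BLOSUM62_SUBSET = {
    ("F", "F"): 6, ("D", "D"): 6, ("S", "S"): 4, ("K", "Y"): -2,
    ("T", "T"): 5, ("H", "H"): 8, ("G", "G"): 6, ("R", "R"): 5,
}
GAP = -5  # gap penalty given on the slide


def pair_score(a, b):
    """Score one alignment column; '-' marks a gap."""
    if a == "-" or b == "-":
        return GAP
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]   # the matrix is symmetric


def alignment_score(top, bottom):
    """Sum of column scores for two aligned strings of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    return sum(pair_score(a, b) for a, b in zip(top, bottom))


total = alignment_score("FDSK-THRGHR", "FDSYWTH-GHR")
left = alignment_score("FDS", "FDS")
right = alignment_score("GHR", "GHR")
print(total, left, right, total - left - right)   # 36 16 19 1
```

Under these assumptions the middle columns (K/Y, gap, T/T, H/H, gap) net +1, which agrees with the slide's remark that the extension loses nothing.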

14 Example of Transitive Homology
For three sequences a, b and c: $B_{a,b} = 0.8$, $B_{b,c} = 0.2$, $B_{a,c} = 0.9$.
The homology between sequence b and sequence c can be detected through the third sequence a, and now $BT_{b,c} = B_{a,b} \times B_{a,c} = 0.72$.
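A small sketch of this calculation follows. It assumes that the transitive score is the best product over all candidate intermediate sequences and that it never falls below the direct score; the slide only shows the single-intermediate product, so both choices, like the function name, are assumptions.

```python
def transitive_score(b, c, direct, intermediates):
    """Best of the direct score and the products through each intermediate.

    direct maps frozenset({p, q}) to the normalized BLAST score of p and q.
    """
    def s(p, q):
        return direct.get(frozenset((p, q)), 0.0)

    best = s(b, c)
    for a in intermediates:
        if a in (b, c):
            continue
        best = max(best, s(a, b) * s(a, c))
    return best


# Scores from the slide's example.
direct = {
    frozenset(("a", "b")): 0.8,
    frozenset(("b", "c")): 0.2,
    frozenset(("a", "c")): 0.9,
}
bt_bc = transitive_score("b", "c", direct, intermediates=["a"])
print(round(bt_bc, 2))   # 0.72
```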

15 KEGG PATHWAY is a collection of manually drawn pathway maps representing our knowledge on the molecular interaction and reaction networks for:
1. Metabolism
2. Genetic Information Processing
3. Environmental Information Processing
4. Cellular Processes
5. Human Diseases
and also on the structure relationships (KEGG drug structure maps) in:
6. Drug Development

16 Protein Similarity through KEGG Pathway Information
- 210 biological pathways are defined in KEGG, e.g. Metabolism.
- All protein sequences corresponding to each pathway are downloaded (except Yeast proteins).
- 210 pathway databases are created.
- A profile for each Yeast protein is computed by searching for homologues across the 210 pathway databases.
- The profile dimension is 210; an element of a profile, corresponding to a particular pathway database, is 1 if a homologue is present in that database, otherwise 0.

17 KEGG Pathway (cont.)
Profile similarity between two proteins X and Y is calculated by taking the dot product of their profiles.
Example: [Table: Profile X and Profile Y over the pathways Metabolism, Cell Cycle, Energy and Transcription; the 0/1 entries are not recoverable.]
The dot product between the profiles of the two proteins in this example is 2. Dot products between KEGG profiles are denoted $K_{X,Y}$.
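The profile similarity can be sketched as follows. The four pathway names come from the slide's example, but the 0/1 entries of the two profiles are hypothetical, chosen so that the dot product is 2; a real profile would have 210 elements.

```python
# Pathway databases used as profile dimensions (four shown; 210 in the study).
PATHWAYS = ["Metabolism", "Cell Cycle", "Energy", "Transcription"]


def pathway_profile(pathways_with_homologue):
    """Binary profile: 1 if a homologue was found in that pathway database."""
    return [1 if p in pathways_with_homologue else 0 for p in PATHWAYS]


def profile_similarity(profile_x, profile_y):
    """Dot product of two binary profiles: the number of shared pathways."""
    return sum(a * b for a, b in zip(profile_x, profile_y))


# Hypothetical homologue hits for two proteins.
profile_x = pathway_profile({"Metabolism", "Cell Cycle", "Energy"})
profile_y = pathway_profile({"Metabolism", "Energy", "Transcription"})
k_xy = profile_similarity(profile_x, profile_y)
print(profile_x, profile_y, k_xy)   # [1, 1, 1, 0] [1, 0, 1, 1] 2
```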

18 Protein-Protein Interaction Information
- The information is downloaded from the Database of Interacting Proteins (DIP), which contains Yeast protein-protein interactions.
- For a given pair of genes/proteins the similarity value is 1 or 0, indicating that an interaction is present or absent, respectively.
- The similarity value is denoted as I.

19 Problems in Multi-Data Source Integration
1) The degree of biological accuracy differs between data sources. To make the similarities comparable in biological accuracy, those arising from each data source are benchmarked separately against the super GO-Slim process annotations of genes in the SGD database. The positive predictive value (PPV) of gene pairs at a particular similarity value can be used as the benchmark.

20 TP pairs are defined as pairs of genes having overlapping GO (Gene Ontology) term classification/annotation. The positive predictive value of gene pairs is defined as
$$\mathrm{PPV} = \frac{\text{no. of predicted pairs that share a common GO term}}{\text{total no. of predicted pairs}}$$
A gene pair is considered a predicted pair if its similarity value is non-zero and both genes in the pair are classified in SGD. The higher the PPV, the greater the functional similarity between the gene pairs predicted by a similarity measure or method. PPV can be used as the fitness function.
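The PPV definition translates directly into code. In this minimal sketch the gene names, GO terms, and similarity values are hypothetical; a gene missing from the annotation dictionary stands for an unclassified gene.

```python
def ppv(similarities, go_terms):
    """Fraction of predicted pairs that share at least one GO term.

    similarities maps (gene_x, gene_y) to a similarity value.
    go_terms maps each classified gene to its set of GO terms.
    """
    predicted = true_positive = 0
    for (x, y), sim in similarities.items():
        # A predicted pair needs a non-zero similarity and two classified genes.
        if sim == 0 or x not in go_terms or y not in go_terms:
            continue
        predicted += 1
        if go_terms[x] & go_terms[y]:
            true_positive += 1
    return true_positive / predicted if predicted else 0.0


# Hypothetical annotations; g4 is unclassified.
go_terms = {"g1": {"A"}, "g2": {"A", "B"}, "g3": {"C"}}
similarities = {
    ("g1", "g2"): 0.9,   # predicted, shares term A
    ("g1", "g3"): 0.4,   # predicted, no shared term
    ("g2", "g3"): 0.0,   # zero similarity: not a predicted pair
    ("g1", "g4"): 0.8,   # g4 unclassified: not a predicted pair
}
print(ppv(similarities, go_terms))   # 0.5
```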

21 [Figure: proportion of TP pairs (PPV) versus similarity value for transitive homology, KEGG pathway profile, and microarray.]
Benchmarking the similarity values obtained from different data sources in terms of their PPV. The PPV values for intermediate similarity values that are not plotted in the figure are calculated from the slopes of the respective curves. The similarities extracted from protein-protein interactions are binary relations in our study. Therefore, the PPV for protein-protein interactions has a constant value of 0.69 at a similarity value of 1, and hence it is not shown in the figure.
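One way to read "calculated from the slopes" is piecewise-linear interpolation between the benchmarked points of a curve. The sketch below makes that assumption; the benchmark points are hypothetical (the figure's values are not available here), and clamping outside the benchmarked range is a further assumption.

```python
from bisect import bisect_right


def interpolate_ppv(similarity, points):
    """Piecewise-linear PPV lookup.

    points is a list of (similarity, ppv) pairs sorted by similarity.
    Values outside the benchmarked range are clamped to the end points.
    """
    xs = [p[0] for p in points]
    if similarity <= xs[0]:
        return points[0][1]
    if similarity >= xs[-1]:
        return points[-1][1]
    i = bisect_right(xs, similarity)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (similarity - x0)


# Hypothetical benchmark points for one data source.
curve = [(0.2, 0.3), (0.6, 0.5), (1.0, 0.9)]
print(interpolate_ppv(0.4, curve))   # halfway between the first two points
```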

22 Biological Score using Linear Combination
The biological score BS between two genes X and Y is computed by integrating the PPV values $PP_{X,Y}$, $PB_{X,Y}$, $PK_{X,Y}$, and $PI_{X,Y}$. The relative contribution (weight) of each information source is determined adaptively by maximizing the PPV, which depends on the SGD classification of Yeast genes. BS is defined as
$$BS_{X,Y} = \frac{a\,PP_{X,Y}}{a+b+c+d} + \frac{b\,PB_{X,Y}}{a+b+c+d} + \frac{c\,PK_{X,Y}}{a+b+c+d} + \frac{d\,PI_{X,Y}}{a+b+c+d}$$
where a, b, c, and d are varied in steps of 1 to find the combination that maximizes the PPV for classified genes of the top gene pairs.
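The weighted combination and the weight search can be sketched as below. The slides do not state the range of the weights or the number of top pairs, so `max_weight` and `top_n` are assumptions, as are the function names and the four example gene pairs.

```python
from itertools import product


def biological_score(pp, pb, pk, pi, weights):
    """Weighted average of the four PPV values with weights (a, b, c, d)."""
    a, b, c, d = weights
    total = a + b + c + d
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return (a * pp + b * pb + c * pk + d * pi) / total


def top_pair_ppv(pairs, weights, top_n):
    """PPV among the top_n pairs ranked by biological score.

    Each pair is (pp, pb, pk, pi, shares_go_term).
    """
    ranked = sorted(pairs, key=lambda p: biological_score(*p[:4], weights),
                    reverse=True)
    top = ranked[:top_n]
    if not top:
        return 0.0
    return sum(1 for p in top if p[4]) / len(top)


def best_weights(pairs, top_n, max_weight=3):
    """Grid search over integer weights 0..max_weight in steps of 1."""
    candidates = (w for w in product(range(max_weight + 1), repeat=4) if any(w))
    return max(candidates, key=lambda w: top_pair_ppv(pairs, w, top_n))


# Hypothetical classified gene pairs: (PP, PB, PK, PI, shares a GO term).
pairs = [
    (0.9, 0.1, 0.1, 0.0, True),
    (0.1, 0.9, 0.1, 0.0, False),
    (0.8, 0.2, 0.1, 0.0, True),
    (0.2, 0.8, 0.2, 0.0, False),
]
weights = best_weights(pairs, top_n=2)
print(weights, top_pair_ppv(pairs, weights, top_n=2))
```

In this toy data the expression-based PPV separates true from false pairs, so the search settles on weights that favor it.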


24 Non-Linear Score
The Non-Linear Score is defined as
$$NLS_{X,Y} = \frac{1}{n}\left(PP_{X,Y}^{\,a} + PB_{X,Y}^{\,b} + PK_{X,Y}^{\,c} + PI_{X,Y}^{\,d}\right)$$
where a, b, c, and d are varied in steps of -1 to find the combination that maximizes the PPV for classified genes, and n is the total number of data sources. The degree of contribution of each information source is determined by maximizing the PPV, which depends on the SGD classification of Yeast genes.
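A minimal sketch of the non-linear score follows, assuming the form "mean of the PPV values, each raised to its own exponent". The exponent values and PPV inputs are hypothetical, and the sketch assumes positive exponents, since Python raises ZeroDivisionError for a zero PPV raised to a negative power.

```python
def non_linear_score(ppvs, exponents):
    """Mean of the PPV values, each raised to its own exponent."""
    if len(ppvs) != len(exponents) or not ppvs:
        raise ValueError("need one exponent per data source")
    n = len(ppvs)   # total number of data sources
    return sum(p ** e for p, e in zip(ppvs, exponents)) / n


# Hypothetical PPVs for expression, BLAST, KEGG and interaction data.
ppvs = (0.81, 0.25, 0.5, 0.69)
nls = non_linear_score(ppvs, exponents=(0.5, 0.5, 1, 1))
mean = non_linear_score(ppvs, exponents=(1, 1, 1, 1))
print(nls, mean)
```

With all exponents equal to 1 the score reduces to the plain average; an exponent below 1 raises the contribution of a source whose PPV lies in (0, 1).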

25 Evaluation of BS and NLS
- Because the GO (Gene Ontology) classification is used to determine the degrees of contribution, the MIPS annotation can be used to evaluate BS and NLS.
- Performance can be compared with individual data sources and related methods by plotting the total number of predicted gene pairs against the positive predictive value (shown in the next slide) while increasing the threshold on the individual similarity value.
- For any similarity measure, the number of predicted gene pairs decreases as the threshold increases.
- For any similarity measure, the PPV of gene pairs increases as the threshold increases.
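The threshold sweep behind such a plot can be sketched as follows. The pair list and thresholds are hypothetical; each pair carries its similarity value and a flag saying whether the two genes share an annotation in the evaluation set.

```python
def threshold_sweep(pairs, thresholds):
    """For each threshold return (threshold, no. of predicted pairs, PPV).

    pairs is a list of (similarity, shares_annotation) tuples.
    """
    rows = []
    for t in thresholds:
        kept = [shares for sim, shares in pairs if sim >= t]
        if kept:
            rows.append((t, len(kept), sum(kept) / len(kept)))
    return rows


# Hypothetical scored gene pairs.
pairs = [(0.9, True), (0.8, True), (0.6, False), (0.5, True), (0.3, False)]
rows = threshold_sweep(pairs, thresholds=[0.2, 0.55, 0.75])
for threshold, n_pairs, value in rows:
    print(threshold, n_pairs, round(value, 3))
```

In this toy data the number of predicted pairs falls from 5 to 2 while the PPV rises from 0.6 to 1.0, the trade-off described on the slide.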

26 [Figure: PPV versus number of top relations for transitive homology, Lee et al. probabilistic network with same source, Lee et al. probabilistic network, microarray, phenotypic profile, Non-Linear Score, Linear Combination Score, and KEGG pathway profiles.]

27 Influence of number of classified genes on Positive Predictive Value

28 Top Predictions
- Gene functions are predicted from the first 10 neighbors, using NLS.
- The top function predictions comprise the functions of 12 unclassified genes and 417 classified genes, with 98.2% PPV.
- The prediction is performed with the MIPS 2008 classification and validated with the 2011 classification.
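Neighbor-based prediction can be sketched as below. The slides do not say how the functions of the 10 nearest neighbors are combined, so the majority vote is an assumption, as is the simplification of one function label per gene; gene names and scores are hypothetical.

```python
from collections import Counter


def predict_function(neighbor_scores, annotations, k=10):
    """Most common function among the k highest-scoring classified neighbors.

    neighbor_scores maps neighbor gene to its NLS with the query gene.
    annotations maps classified genes to a single function label.
    """
    neighbors = sorted(neighbor_scores, key=neighbor_scores.get, reverse=True)[:k]
    votes = Counter(annotations[g] for g in neighbors if g in annotations)
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical neighbors of one unclassified gene.
neighbor_scores = {"g1": 0.9, "g2": 0.8, "g3": 0.7, "g4": 0.1}
annotations = {"g1": "Transcription", "g2": "Transcription", "g3": "Cell Cycle"}
print(predict_function(neighbor_scores, annotations, k=3))   # Transcription
```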

29 Top 12 Function Predictions for Unclassified Genes
Out of these 12 unclassified genes, YIL080W, YHR218W, YHR219W, and YHR049W are now classified in MIPS, and our predictions are in agreement with MIPS.

30 Summary
- Frameworks for multiple data-source integration, which combine pairwise similarities from different sources, are presented.
- Functional categories of 12 unclassified Yeast genes are predicted.
- Evaluation of the 12 unclassified genes against the Saccharomyces Genome Database (SGD) confirmed the validity and potential value of the framework for gene function prediction.


33 Selected References
1. S. S. Ray, S. Bandyopadhyay, and S. K. Pal, A Weighted Power Framework for Integrating Multi-Source Information: Gene Function Prediction in Yeast, IEEE Transactions on Biomedical Engineering, vol. preprint, no. 00, pp. 1-7.
2. S. S. Ray, S. Bandyopadhyay, and S. K. Pal, Combining Multi-Source Information through Functional Annotation based Weighting: Gene Function Prediction in Yeast, IEEE Transactions on Biomedical Engineering, vol. 56, no. 2.
3. I. Lee, S. V. Date, A. T. Adai, and E. M. Marcotte, A probabilistic functional network of yeast genes, Science, vol. 306.
4. C. V. Mering, R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields, and P. Bork, Comparative assessment of large-scale data sets of protein-protein interactions, Nature, vol. 417.
5. E. M. Marcotte, M. Pellegrini, M. J. Thompson, T. O. Yeates, and D. Eisenberg, A combined algorithm for genome-wide prediction of protein function, Nature, vol. 402.
6. E. M. Marcotte, M. Pellegrini, H. L. Ng, D. W. Rice, T. O. Yeates, and D. Eisenberg, Detecting protein function and protein-protein interactions from genome sequences, Science, vol. 285.
7. M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, Cluster analysis and display of genome-wide expression patterns, Proc. National Academy of Sciences, vol. 95, 1998.

34
8. M. Kanehisa, S. Goto, M. Hattori, K. F. Aoki-Kinoshita, M. Itoh, S. Kawashima, T. Katayama, M. Araki, and M. Hirakawa, From genomics to chemical genomics: new developments in KEGG, Nucleic Acids Research, vol. 34, pp. D354-D357.
9. L. Salwinski, C. S. Miller, A. J. Smith, F. K. Pettit, J. U. Bowie, and D. Eisenberg, The database of interacting proteins, Nucleic Acids Research, vol. 32.
10. O. G. Troyanskaya, K. Dolinski, A. B. Owen, R. B. Altman, and D. Botstein, A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae), Proc. Natl. Acad. Sci. USA, vol. 100, no. 14.
11. Q. Ma, G. W. Chirn, R. Cai, J. D. Szustakowski, and N. Nirmala, Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks, BMC Bioinformatics, vol. 6, no. 242.
12. S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Research, vol. 25, no. 17.
13. P. Pipenbacher, A. Schliep, S. Schneckener, A. Schonhuth, D. Schomburg, and R. Schrader, ProClust: improved clustering of protein sequences with an extended graph-based approach, Bioinformatics, vol. 18, no. 2, pp. S182-S191, 2002.

35 Thank You

36 Handling Gene Expressions
Shubhra Sankar Ray, Center for Soft Computing Research, Indian Statistical Institute, Kolkata, India

37 Gene Expression
- Gene expression is the process by which a gene's coded information is converted into the structures present and operating in the cell.
- Expressed genes include those that are transcribed into mRNA and then translated into protein, and those that are transcribed into RNA but not translated into protein (e.g., transfer and ribosomal RNAs).
- Not all genes are expressed, and gene expression analysis involves studying the expression level of genes in the cells under different conditions.

38 Measuring Gene Expression with Microarray
- A microarray enables measuring the expression levels of thousands of genes at the same time.
- It is typically a glass slide on which DNA molecules are attached at fixed locations (spots) and labeled with the green-fluorescent dye Cy3.
- There may be tens of thousands of spots on an array, each containing identical DNA molecules.
- For gene expression studies, each of these molecules should ideally identify one gene or one exon in the genome.
- The spots are either printed on the microarray by a robot, or synthesized by photolithography or by ink-jet printing.
- In the biological experiment, mRNA from the experimental samples is labeled during reverse transcription with the red-fluorescent dye Cy5.
- The Cy5/Cy3 fluorescence ratio (the gene expression value) is obtained from the microarray by measuring the spot intensities with a fluorescence scanner.
- Many important unanswered questions could potentially be answered by correctly selecting, assembling, analyzing, and interpreting microarray data.

39 A gene expression database can be regarded as consisting of three parts: the gene expression data matrix, the gene annotation, and the sample annotation.
Figure: Gene expression array


41 Average linkage hierarchical clustering is one of the first clustering algorithms applied to microarray data. Using a distance metric, the method builds a hierarchical binary tree (called a dendrogram). Given a set of N data points to be clustered, and an N × N distance (or similarity) matrix, the basic steps of hierarchical clustering are:
S1) Start by assigning each item to a cluster, so that if there are N items there are N clusters, each containing just one item. The distances (similarities) between the clusters are then the same as the distances (similarities) between the items they contain.
S2) Find the closest (most similar) pair of clusters and merge them into a single cluster, so that there is one less cluster.
S3) Compute the distances (similarities) between the new cluster and each of the old clusters.
S4) Repeat S2 and S3 until all items are clustered into a single cluster of size N.
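Steps S1 to S4 can be sketched in plain Python. This is a simple illustration with average linkage, not an optimized implementation: cluster distances are recomputed on demand rather than cached, and the four example points are hypothetical.

```python
def hierarchical_clustering(dist):
    """Average-linkage agglomerative clustering.

    dist is a symmetric N x N matrix (list of lists) of item distances.
    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    def average_distance(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    clusters = [(i,) for i in range(len(dist))]       # S1: one item per cluster
    merges = []
    while len(clusters) > 1:                          # S4: repeat until one cluster
        best = None
        for p in range(len(clusters)):                # S2: find the closest pair
            for q in range(p + 1, len(clusters)):
                d = average_distance(clusters[p], clusters[q])   # S3, on demand
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        merges.append((clusters[p], clusters[q], d))
        merged = clusters[p] + clusters[q]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (p, q)]
        clusters.append(merged)
    return merges


# Hypothetical one-dimensional points; distance is the absolute difference.
points = [0, 1, 5, 6]
dist = [[abs(a - b) for b in points] for a in points]
merges = hierarchical_clustering(dist)
for merge in merges:
    print(merge)
```

The order of the merges, read from first to last, is the dendrogram from the leaves up to the root.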

42 Hierarchical Clustering
1. Single Linkage
2. Average Linkage
3. Complete Linkage

43
- In single-linkage clustering (also called the connectedness or minimum method), the distance between two clusters is the shortest distance from any member of one cluster to any member of the other.
- In complete-linkage clustering (also called the diameter or maximum method), the distance between two clusters is the largest distance from any member of one cluster to any member of the other.
- In average-linkage clustering, the distance between two clusters is the average distance from any member of one cluster to any member of the other.
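The three linkage rules differ only in how the pairwise item distances are reduced to one number, as this short sketch shows. The function names and the four example points are hypothetical.

```python
def pairwise(c1, c2, dist):
    """All item-to-item distances between two clusters of item indices."""
    return [dist[i][j] for i in c1 for j in c2]


def single_linkage(c1, c2, dist):
    return min(pairwise(c1, c2, dist))       # shortest member-to-member distance


def complete_linkage(c1, c2, dist):
    return max(pairwise(c1, c2, dist))       # largest member-to-member distance


def average_linkage(c1, c2, dist):
    values = pairwise(c1, c2, dist)
    return sum(values) / len(values)         # mean member-to-member distance


# Hypothetical one-dimensional points; distance is the absolute difference.
points = [0, 1, 5, 6]
dist = [[abs(a - b) for b in points] for a in points]
left, right = (0, 1), (2, 3)
print(single_linkage(left, right, dist),
      complete_linkage(left, right, dist),
      average_linkage(left, right, dist))    # 4 6 5.0
```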

44 Gene Ordering in Clustering Solutions

45 Subclusters found by Gene Ordering
3: Cell Cycle and DNA Processing
4: Transcription
6: Protein Fate (folding, modification, destination)
7: Protein with Binding Function or Cofactor Requirement
9: Cellular Transport, Transport Facilitation and Transport Routes
- Gene ordering helps to identify the number and size of subclusters.
- Relations among genes within a cluster can be identified with gene ordering.

46 Figure 2: Comparing Average Linkage (Fig. a), Average Linkage + Leaf Ordering (Fig. b), K-means (Fig. c), and K-means +Gene ordering (Fig. d) for Herpes data (106 x 21)