KEGG Kyoto Encyclopedia of Genes and Genomes


1 KEGG: Kyoto Encyclopedia of Genes and Genomes
Objectives of KEGG:
Computerize current knowledge of biological systems in terms of pathways of interacting molecules or genes.
Maintain gene catalogs for sequenced genomes and provide consistent, standardized annotations.
Maintain a catalog of chemical reactions in living cells in the LIGAND database, linked to pathways.
Provide new informatics technologies for predicting biological systems and designing further experiments.

2 KEGG Main Components (release 8.0)
Databases and their contents:
PATHWAY: GIF image maps.
GENES: nomenclature, classification, codon frequency, sequence.
LIGAND/COMPOUND: nomenclature, chemical formula, structure, CAS number.
LIGAND/ENZYME: EC number, nomenclature, reaction formula, substrate, product, inhibitor, effector, cofactor.

3 Bricks of KEGG
Binary relations: represent pairwise interactions, and decompose higher-level interactions into pairwise ones.
Relational databases: LIGAND, PATHWAY, GENES.

4 LIGAND DATABASE
Represents the binary relation between a substrate and a product catalyzed by an enzyme.
Three sections: enzyme, compound, reaction.
LIGAND relations (Enzyme, Substrate, Product), listed here as substrate and product: Prephenate, Pretyrosine; Prephenate, Phenylpyruvate; Chorismate, Prephenate; ...
Flat-file database with a data format similar to GenBank. Reactions are not organized as a flat file.
Examples:
ATP + Glycerol <=> ADP + sn-glycerol 3-phosphate
Glycerol => sn-glycerol 3-phosphate
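
The idea of reducing a reaction to substrate-product pairs can be sketched in Python. This is only an illustration, not the LIGAND file format: the `BinaryRelation` type, the function names, and the enzyme label `E1` are hypothetical placeholders (the slide's enzyme identifiers are not given here). The sketch enumerates every pair implied by an equation; selecting the main pair, such as Glycerol => sn-glycerol 3-phosphate, is assumed to be a curation step that the code does not attempt.

```python
from typing import NamedTuple


class BinaryRelation(NamedTuple):
    enzyme: str      # placeholder label; a real entry would carry an EC number
    substrate: str
    product: str


def parse_reaction(equation):
    """Split 'A + B <=> C + D' into (left compounds, right compounds, reversible)."""
    if "<=>" in equation:
        left, right = equation.split("<=>")
        reversible = True
    else:
        left, right = equation.split("=>")
        reversible = False

    def compounds(side):
        # Compounds are separated by ' + '; names themselves may contain spaces.
        return [name.strip() for name in side.split(" + ")]

    return compounds(left), compounds(right), reversible


def all_pairs(enzyme, equation):
    """Every substrate-product pair implied by one reaction equation."""
    left, right, _ = parse_reaction(equation)
    return [BinaryRelation(enzyme, s, p) for s in left for p in right]


EQUATION = "ATP + Glycerol <=> ADP + sn-glycerol 3-phosphate"
PAIRS = all_pairs("E1", EQUATION)  # "E1" is a hypothetical enzyme label
```

Here `PAIRS` holds four candidate relations, one of which is the Glycerol to sn-glycerol 3-phosphate pair shown on the slide.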

5 What is its data like?

6 Path computation using LIGAND

7 Diagram shown after a while

8 PATHWAY DATABASE
Relations in KEGG PATHWAY pair a map with an enzyme, for example the map "Phe, Tyr and Trp biosynthesis" with each of its enzymes.
Collection of graphical pathway diagrams.
Highly biased toward metabolism.
Metabolic pathways are well conserved, while regulatory pathways are divergent.
Proper identifiers for functions in regulatory pathways are absent.

9 PATHWAY DATABASE A Pathway Example

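10 Enzymes were originally named mostly with suffixes such as -in, -ase, or -zyme, for example trypsin, pepsin, amylase, DNase, and lysozyme. Later, an enzyme was named after the reaction it catalyzes plus the suffix -ase, preceded by the name of that reaction's substrate, to distinguish different enzymes.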

11 Enzyme nomenclature and classification

12 How is this done?
Construct ~90 reference metabolic pathways manually.
Automatically generate organism-specific pathways by partial join operation and path computation.
Matching: organism:gene, EC:number, map:accession. A wildcard is permitted at the lowest level of the EC numbering scheme.
Query relaxation: if a connection is missing, go up the hierarchy to the superfamily (from PIR), then go down, examining all EC numbers in that superfamily.
Path computation: path(X, Y, [E|EL]) :- reaction(E, X, Z), path(Z, Y, EL).
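
The path computation rule can be read as a recursive search over binary relations: a path from X to Y starts with a reaction from X to some Z and continues with a path from Z to Y. The Python sketch below follows that reading. The base case (a compound reaches itself with an empty enzyme list) and the visited set that stops cycles are assumptions added so the search terminates, and the enzyme labels `E1` to `E3` are hypothetical placeholders.

```python
def paths(reactions, start, goal, visited=None):
    """Yield enzyme lists [E, ...] that lead from start to goal.

    Mirrors: path(X, Y, [E|EL]) :- reaction(E, X, Z), path(Z, Y, EL).
    reactions is an iterable of (enzyme, substrate, product) triples.
    """
    if visited is None:
        visited = {start}
    if start == goal:
        # Assumed base case: nothing left to catalyze.
        yield []
        return
    for enzyme, substrate, product in reactions:
        if substrate == start and product not in visited:
            # Recurse from the product, remembering it to avoid cycles.
            for rest in paths(reactions, product, goal, visited | {product}):
                yield [enzyme] + rest


# Binary relations taken from the LIGAND slide; enzyme labels are placeholders.
REACTIONS = [
    ("E1", "Chorismate", "Prephenate"),
    ("E2", "Prephenate", "Pretyrosine"),
    ("E3", "Prephenate", "Phenylpyruvate"),
]

ROUTES = list(paths(REACTIONS, "Chorismate", "Phenylpyruvate"))
```

With these three relations, the only route from Chorismate to Phenylpyruvate uses `E1` followed by `E3`.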

13 GENES DATABASE
Collection of genes for all organisms in KEGG.
Each entry contains: organism name, gene name, functional description, functional hierarchy, chromosomal position, codon usage, amino acid sequence, nucleotide sequence.
Constructed by fetching information on all genes from GenBank, then assigning EC numbers with GFIT*, with manual verification.
* Reconstruction of amino acid biosynthesis pathways from the complete genome sequence, Genome Res., 8, 1998.

14 Independent gene database for each organism

15 Genome map and gene browser

16 Application of KEGG (I*)
Objective: use KEGG to detect functionally related enzyme clusters.
FREC: a set of enzymes that catalyze successive reactions in a metabolic pathway and that are encoded in close locations on the chromosome.
Perspective: graph comparison.
Pathway: certainly a connected graph.
Genome: a graph? I think of it as a line.
In general: G(V, E), where V is a set of vertices (nodes) and E is a set of edges.
* A heuristic graph comparison algorithm and its application to detect functionally related enzyme clusters, Nucleic Acids Research, 2000, vol. 28.

17 Graph Comparison Algorithm
Consider two graphs, G_1(V_1, E_1) and G_2(V_2, E_2), and a correspondence matrix whose rows are vertices of G_1 (v_11, v_12, ...), whose columns are vertices of G_2 (v_21, v_22, v_23, ...), and whose entries are 1 where two vertices correspond.
Start with one cluster for each row.
Single linkage clustering algorithm.
Distance between two initial clusters i and j: d_1(i,j) is the length of the shortest path between v_1i and v_1j in G_1; d_2(i,j) is defined the same way in G_2.
Then merge clusters recursively, as decided by an indicator function.

18 Graph Comparison Algorithm
Merge when the indicator function is 1; do nothing when it is 0.
Gap1 and Gap2 are non-negative integers chosen empirically. In this paper they are 1 and 3 for the genome and the pathway, respectively.
d_1 and d_2 are pre-computed for the two graphs from the sets of binary relations, using the Floyd-Warshall algorithm.
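
A minimal Python sketch of the clustering step follows. The indicator function itself is not spelled out in the text here, so the merge test used below, d_1(i,j) <= 1 + Gap1 and d_2(i,j) <= 1 + Gap2, is an assumption about its form, and the function name and toy data are hypothetical. Single linkage is implemented with a union-find structure: any two correspondences that pass the test end up in the same cluster.

```python
from itertools import combinations


def cluster_correspondences(pairs, d1, d2, gap1, gap2):
    """Single-linkage clustering of correspondences (v1, v2).

    pairs: list of (vertex index in G1, vertex index in G2).
    d1, d2: pre-computed shortest-path matrices for G1 and G2.
    Assumed merge test: d1 <= 1 + gap1 and d2 <= 1 + gap2.
    """
    parent = list(range(len(pairs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(pairs)), 2):
        (a, b), (c, d) = pairs[i], pairs[j]
        if d1[a][c] <= 1 + gap1 and d2[b][d] <= 1 + gap2:
            parent[find(i)] = find(j)

    clusters = {}
    for i, pair in enumerate(pairs):
        clusters.setdefault(find(i), []).append(pair)
    return sorted(clusters.values())


# Toy data: a genome treated as a line of 6 genes, a pathway as a chain of 4 enzymes.
D1 = [[abs(i - j) for j in range(6)] for i in range(6)]
D2 = [[abs(i - j) for j in range(4)] for i in range(4)]
PAIRS = [(0, 0), (1, 1), (5, 3)]

# Gap values 1 (genome) and 3 (pathway) are the ones quoted on the slide.
CLUSTERS = cluster_correspondences(PAIRS, D1, D2, gap1=1, gap2=3)
```

In this toy case the first two correspondences are adjacent on both graphs and merge, while the third is too far away on the genome and stays alone.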

19 Floyd-Warshall algorithm
Motivation: solve the all-pairs shortest-path problem.
Main recursion: dynamic programming!
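
The standard Floyd-Warshall recursion allows paths through one more intermediate vertex k at each stage: d_k(i,j) = min(d_{k-1}(i,j), d_{k-1}(i,k) + d_{k-1}(k,j)). The Python sketch below applies it to an undirected graph with unit edge lengths, which is an assumption that matches distances counted as numbers of edges; the function name and the example graph are illustrative only.

```python
import math


def floyd_warshall(n, edges):
    """All-pairs shortest path lengths for an undirected, unit-weight graph.

    n: number of vertices, labelled 0 .. n-1.
    edges: iterable of (u, v) pairs.
    Unreachable pairs keep the distance math.inf.
    """
    dist = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for u, v in edges:
        dist[u][v] = dist[v][u] = 1
    # After iteration k, dist[i][j] is the shortest length using only
    # intermediate vertices from {0, ..., k}.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                through_k = dist[i][k] + dist[k][j]
                if through_k < dist[i][j]:
                    dist[i][j] = through_k
    return dist


# A chain 0-1-2-3 plus an isolated vertex 4.
DIST = floyd_warshall(5, [(0, 1), (1, 2), (2, 3)])
```

The triple loop runs in O(n^3) time, which is acceptable when the distance matrices are computed once and reused.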

20 Results: FRECs and operons
An example of a FREC in E. coli.
The total number of FRECs detected in E. coli is 100.
Compared with experimental data, 89% of FRECs share at least 2 genes; among 118 operons, 75.4% were partially detected; complete matches: 39 out of 100.
FRECs: functionally related enzyme clusters.

21 FRECs and operons (continued)
FREC formation in 10 microorganisms.

22 Ortholog: a gene in two or more species that has evolved from a common ancestor.
Ortholog clusters of enzyme genes: superimpose multiple genome-pathway alignments and obtain a multiple alignment.
Example of an ortholog group table.

23 Application (II)*
We've seen how graph comparison enables us to detect correlated clusters between two graphs.
What if both graphs are genome graphs? Can we make the correspondence matrix? Yes, it is a similarity matrix. The genes are ordered on each axis. If the similarity score of two genes is over 100, the matrix element is 1; otherwise it is 0.
Can we make sense of the clusters detected? Yes, they are conserved gene clusters with positional coupling, defined as groups of homologous genes located at contiguous positions in different genomes.
* Automatic detection of conserved gene clusters in multiple genomes by graph comparison and P-quasi grouping, Nucleic Acids Research, 2000, vol. 28.
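
The thresholding rule for the genome-genome case is simple enough to sketch directly. In the Python below, `scores[i][j]` is assumed to hold the similarity score between gene i of one genome and gene j of the other, both in chromosomal order; the function names and the sample scores are hypothetical.

```python
def correspondence_matrix(scores, threshold=100):
    """Binarize a similarity score matrix: 1 if the score is over the threshold."""
    return [[1 if s > threshold else 0 for s in row] for row in scores]


def correspondences(matrix):
    """List the (row, column) index pairs whose matrix element is 1."""
    return [(i, j)
            for i, row in enumerate(matrix)
            for j, value in enumerate(row)
            if value == 1]


# Made-up scores for two genomes with two genes each.
SCORES = [[250, 40],
          [100, 101]]
MATRIX = correspondence_matrix(SCORES)
```

A score of exactly 100 is not "over 100", so it maps to 0 in this sketch. The pairs returned by `correspondences` are the starting points that a graph comparison step would then cluster.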

24 Method
First: extraction of gene clusters from pairwise graph comparison.
Second: identification of related clusters in multiple genomes by P-quasi analysis.

25 P-quasi grouping
A set of gene cluster pairs is obtained. This set contains overlapping clusters.
Clustering algorithms:
Single linkage: tends to form large clusters.
Complete linkage: only captures uniform gene clusters.
P-quasi grouping: any member of a group has linkage to >= P% of all the members within the group.
P = 100 gives complete linkage; P = 0 gives single linkage.
P is a tunable parameter.
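
The P-quasi condition can be checked for a candidate group with a few lines of Python. This sketch only tests the percentage condition for a given group; it does not build groups and does not check connectivity, so it is not the grouping procedure itself. It also assumes that "all the members" means the other members of the group, and the function name and example links are hypothetical.

```python
def is_p_quasi_group(members, linked, p):
    """True if every member is linked to at least p% of the other members.

    members: iterable of member identifiers.
    linked: set of frozenset({a, b}) for every linked pair.
    p: percentage between 0 and 100.
    """
    members = list(members)
    others = len(members) - 1
    if others <= 0:
        return True  # a group of one member satisfies the condition trivially
    for m in members:
        n_links = sum(1 for o in members
                      if o != m and frozenset((m, o)) in linked)
        # Integer comparison avoids floating-point rounding of the percentage.
        if 100 * n_links < p * others:
            return False
    return True


LINKS = {frozenset(pair) for pair in [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]}
GROUP = ["a", "b", "c", "d"]
```

In this example member `d` is linked to only one of its three partners (about 33%), so the group passes at P = 30 but fails at P = 50, which shows how raising P pushes toward the uniform clusters of complete linkage.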

26 Final Step
Identification of orthologous, paralogous and fused genes by P-quasi and COG grouping.

27 Results
Conserved gene clusters: size and quality depend on P.
The optimal P value varies for different gene clusters (reason?).
Used only as a rough approximation for ortholog group tables.
Fused genes.

28 Application (III)*
EXPRESSION database: a flat-file database, under construction.
Graphical view obtained by taking the ratio of two channels (Cy5- and Cy3-labeled spots).

29 Pathway reconstruction
A general strategy: get gene clusters from an array experiment and search for possible functional connections among the genes in these clusters.
Assumption: co-regulated genes have functional correlation in pathways. (What does co-regulated mean? Positive or negative feedback?)
Their approach: based on the distance between two genes, which is defined by computing the correlation of their expression patterns.
Applied to the analysis of Saccharomyces cerevisiae.
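
A correlation-based distance between two expression profiles can be sketched as follows. The text does not say how the correlation is turned into a distance, so using 1 minus the Pearson correlation coefficient is an assumption (one common choice), and the function names and profiles are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def expression_distance(x, y):
    """Assumed distance: 0 for perfectly correlated profiles, 2 for anti-correlated."""
    return 1.0 - pearson(x, y)


# Hypothetical expression ratios for three genes over four time points.
GENE_A = [1.0, 2.0, 3.0, 4.0]
GENE_B = [2.0, 4.0, 6.0, 8.0]   # same shape as GENE_A, different scale
GENE_C = [4.0, 3.0, 2.0, 1.0]   # opposite trend
```

Under this choice, genes whose profiles rise and fall together are close regardless of absolute expression level, which fits the assumption that co-regulated genes are functionally related. A profile with no variation would make the denominator zero and needs separate handling.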

30 Result
Clustered genes coding for proteins in glycolysis were mapped (by EC number) onto the KEGG pathway.
Expression patterns are similar along this pathway.

31 Discussion
Integration of pathway information is difficult:
Pathways are divergent.
Interaction data are insufficient and inaccurate.
In vivo, pathways are not independent.
A graph perspective may help, and so may some applications of graph theory.