Chapter 8 Data Analysis, Modelling and Knowledge Discovery in Bioinformatics

1 Chapter 8 Data Analysis, Modelling and Knowledge Discovery in Bioinformatics
Prof. Nik Kasabov, nkasabov@aut.ac.nz
12/16/2002, Nik Kasabov - Evolving Connectionist Systems

2 Overview
Bioinformatics: an area of information growth and emergence of knowledge
Dynamic DNA and RNA sequence data analysis and knowledge discovery
Gene expression data analysis, rule extraction, and disease profiling
Fuzzy evolving clustering of genes according to their time-course expression
Protein secondary structure prediction
Dynamic cell modelling

3 Biology Basics
DNA (deoxyribonucleic acid) is a chemical chain present in the nucleus of each cell of an organism.
The whole process of DNA transcription, gene translation, and protein production is continuous and evolves over time.
RNA (ribonucleic acid) has a structure similar to that of DNA except for one chemical molecule.
Genes are complex chemical structures that cause dynamic transformation of one substance into another during the whole life of an individual, as well as over the life of the human population across many generations.
Modelling these interactions, learning about them, and extracting knowledge is a major goal of bioinformatics.

4 Bioinformatics
The first draft of the human genome is completed. The challenge now is to process the vast amount of dynamic information and to create intelligent systems for prediction and knowledge discovery at different levels of life, from cells to whole organisms and species.
Bioinformatics is concerned with applying the methods of the information sciences to the analysis, modelling and knowledge discovery of biological processes in living organisms.

5 Bioinformatics
Fig 8.1: A schematic representation of the central dogma of molecular biology: from DNA to RNA (transcription) and from RNA to proteins (translation).
The central dogma of molecular biology states that DNA is transcribed into RNA, which is translated into proteins.

6 Life-long Learning & Evolution
Through evolution, genes are slowly modified over many generations of populations of individuals and through selection processes (e.g. natural selection).
Evolutionary processes imply the development of generations of populations of individuals, where crossover, mutation, and selection of individuals based on fitness criteria are applied in addition to the learning processes of each individual.
A biological system evolves its structure and functionality through both the life-long learning of an individual and the evolution of populations of many such individuals.

7 Computational Modelling in Molecular Biology
There are five main phases of information processing and problem solving in most bioinformatics systems:
1. Data collection, e.g. collecting biological samples and processing them.
2. Feature analysis and feature extraction.
3. Modelling the problem.
4. Knowledge discovery in silico.
5. Verifying the discovered knowledge in vitro and in vivo.

8 Computational Modelling in Molecular Biology
Some modelling techniques (decision trees, KBNN) allow knowledge, e.g. rules, to be extracted from the models and used for explanation or for knowledge discovery.
For large data sets and for continuously incoming data streams that require the model and the system to adapt rapidly to new data, it is more appropriate to use on-line, knowledge-based techniques, and ECOS in particular, as demonstrated in this chapter.
Many problems in bioinformatics require solutions in the form of a dynamic, learning, knowledge-based system.
An ultimate task for bioinformatics would be predicting the development of an organism from its DNA code.

9 Dynamic DNA & RNA Sequence Analysis
Analysing a DNA sequence and identifying promoter regions
Identifying splice junctions (E/I, I/E, or none)

10 On-line learning of ribosome binding site data (Fig 8.3)
[Figure 8.3 panels: Desired and Actual; Number of rule nodes]

11 Identify intron/exon splice junction
EXTRACTION OF RULES:
Rule1: if AGGT-AG then [EI]
Rule8: if T------T-CAG then [IE]
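The two extracted rules can be read as sequence patterns. The sketch below is only an illustration of that reading: it assumes that "-" stands for "any nucleotide" and that a pattern may be slid along a sequence, whereas in the original experiment the patterns were presumably tied to fixed positions around the junction. The names `matches` and `find_junctions` are hypothetical.

```python
# Hypothetical reading of the two rules quoted on the slide:
# '-' is assumed to be a "don't care" position.
RULES = [("AGGT-AG", "EI"), ("T------T-CAG", "IE")]


def matches(pattern: str, window: str) -> bool:
    """Return True if `window` fits `pattern`, treating '-' as a wildcard."""
    if len(pattern) != len(window):
        return False
    return all(p == "-" or p == base for p, base in zip(pattern, window))


def find_junctions(sequence: str):
    """Slide every rule pattern along `sequence`; return sorted (position, label) hits."""
    hits = []
    for pattern, label in RULES:
        width = len(pattern)
        for start in range(len(sequence) - width + 1):
            if matches(pattern, sequence[start:start + width]):
                hits.append((start, label))
    return sorted(hits)


if __name__ == "__main__":
    print(find_junctions("CCAGGTAAGTT"))
```

Exact matching like this discards the membership degrees that an evolving fuzzy model attaches to each rule; it only shows how the rule text maps onto a sequence window.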

12 Gene Expression Data: Biological Perspective
Microarray equipment is widely used at present to evaluate the level of gene expression in a tissue or in a living cell.
Each point (pixel, cell) in a microarray represents the level of expression of a single gene.
Microarray analysis might not identify unique markers (e.g. a single gene) of clinical utility for a disease because of the heterogeneity of the disease, but prediction of the biological state of a disease is likely to be more sensitive when clusters of gene expression (profiles) are identified.
Gene expression clustering has been used to distinguish normal colon samples from tumours within a 6,500-gene set.
Another example of profiling developed in this chapter is the distinction between two subtypes of leukaemia, namely AML and ALL.

13 Gene Expression Data Analysis
A gene profile is a pattern of expression of a number of genes that is typical for all, or for some, of the known samples of a particular disease.
A disease profile would look like:
» IF (gene g1 is highly expressed) AND (gene g37 is low expressed) AND (gene g134 is very highly expressed) THEN most probably this is cancer type C (123 out of the available 130 samples have this profile).
This profile can be matched against existing gene profiles and, based on similarity, it can be predicted with a certain probability whether the patient is in an early phase of a disease or is at risk of developing it in the future.
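A minimal sketch of how such a profile could be matched against samples and how its support (the "123 out of 130" figure) could be counted. It assumes expression levels have already been discretised into labels such as "low" and "high"; the function names and the toy samples are hypothetical and are not taken from the chapter.

```python
# Hypothetical discretised profile corresponding to the example rule.
PROFILE = {"g1": "high", "g37": "low", "g134": "very high"}


def matches_profile(sample: dict, profile: dict) -> bool:
    """True if every gene in the profile has the required label in the sample."""
    return all(sample.get(gene) == level for gene, level in profile.items())


def support(samples: list, profile: dict):
    """Return (number of samples matching the profile, total number of samples)."""
    matched = sum(1 for sample in samples if matches_profile(sample, profile))
    return matched, len(samples)


if __name__ == "__main__":
    toy_samples = [
        {"g1": "high", "g37": "low", "g134": "very high"},
        {"g1": "high", "g37": "low", "g134": "very high"},
        {"g1": "low", "g37": "low", "g134": "very high"},
    ]
    matched, total = support(toy_samples, PROFILE)
    print(f"{matched} out of {total} samples have this profile")
```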

14 Gene expression data analysis, modelling and knowledge discovery
Goal: identify a gene or a group of genes associated with the state of the cell (tissue), e.g. cancer.
A large number of genes (approx. 30,000) are expressed in a microarray (in vitro) from a single tissue.
It is difficult to find consistent patterns of gene expression for a class of tissue.
After all, microarray data is just a snapshot, a few microseconds long, of what is happening in the cell.
Genes interact; how do we find out about that?

15 Fuzzy representation of gene expression data
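One common way to obtain a fuzzy representation is to map each normalised expression value to membership degrees in a few linguistic terms. The sketch below assumes expression values scaled to [0, 1] and three triangular membership functions (Low, Medium, High) centred at 0, 0.5 and 1; these choices are illustrative assumptions, not the settings used in the chapter.

```python
def triangular(x: float, left: float, centre: float, right: float) -> float:
    """Triangular membership: 1 at `centre`, falling linearly to 0 at `left` and `right`."""
    if x == centre:
        return 1.0
    if x <= left or x >= right:
        return 0.0
    if x < centre:
        return (x - left) / (centre - left)
    return (right - x) / (right - centre)


def fuzzify(x: float) -> dict:
    """Membership degrees of a normalised expression value in Low, Medium and High."""
    return {
        "Low": triangular(x, -0.5, 0.0, 0.5),
        "Medium": triangular(x, 0.0, 0.5, 1.0),
        "High": triangular(x, 0.5, 1.0, 1.5),
    }


if __name__ == "__main__":
    for value in (0.0, 0.25, 0.8, 1.0):
        print(value, fuzzify(value))
```

With this partition the three degrees sum to 1 for any value in [0, 1], so a gene that is "High 0.9" is at the same time "Medium 0.1".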

16 Gene Profiling Methodology
Phases:
1. Microarray data pre-processing.
2. Selecting a set of significant differentially expressed genes across the classes.
3. Finding subsets of (a) under-expressed genes, and (b) over-expressed genes, from the ones selected in the previous step.
4. Clustering of the gene sets from (3) that would reveal preliminary profiles of jointly over-expressed/under-expressed genes across the classes.
5. Building a classification model and extracting rules that define the profiles for each class.
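Phases 2 and 3 can be sketched as follows. The slide does not say which statistic was used to rank genes, so Welch's t-statistic is swapped in here purely as an assumption; `t_statistic`, `select_genes` and the threshold are hypothetical names and values.

```python
from statistics import mean, stdev


def t_statistic(a, b):
    """Welch's t-statistic for two groups of expression values (assumes non-zero variance)."""
    var_a = stdev(a) ** 2 / len(a)
    var_b = stdev(b) ** 2 / len(b)
    return (mean(a) - mean(b)) / (var_a + var_b) ** 0.5


def select_genes(class_a: dict, class_b: dict, threshold: float):
    """Split genes into those over- and under-expressed in class A relative to class B.

    class_a and class_b map a gene name to its expression values in each class.
    Genes whose |t| is below `threshold` are dropped.
    """
    over, under = [], []
    for gene in class_a:
        t = t_statistic(class_a[gene], class_b[gene])
        if t >= threshold:
            over.append(gene)
        elif t <= -threshold:
            under.append(gene)
    return over, under


if __name__ == "__main__":
    tumour = {"gA": [5.0, 5.2, 4.8, 5.1], "gB": [1.0, 1.1, 0.9, 1.2], "gC": [3.0, 2.8, 3.2, 3.1]}
    normal = {"gA": [1.0, 1.2, 0.8, 1.1], "gB": [4.0, 4.2, 3.9, 4.1], "gC": [3.1, 2.9, 3.0, 3.2]}
    print(select_genes(tumour, normal, threshold=3.0))
```

The two returned lists are the inputs to phase 4 (clustering) and, after that, to the rule-extracting classifier of phase 5.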

17 Gene Expression Knowledge Discovery
Goal: identify a gene or a group of genes associated with the state of the cell (tissue), e.g. cancer.
A large number of genes (approx. 30,000) are expressed in a microarray (in vitro) from a single tissue.
It is difficult to find consistent patterns of gene expression for a class of tissue.
After all, microarray data is just a snapshot, a few microseconds long, of what is happening in the cell.
Genes interact; how do we find out about that?
Growing number of examples and complexity.

18 Case Study: Gene Profiling of Colon Cancer using EFuNN
Rule 1: IF M24902 (High 0.988) and H13238 (Low 0.991) and H16758 (High 0.995) and X90908 (Low 0.992) and T55255 (Low 0.998) THEN COLON CANCER (High 1.0) (receptive field 0.5; examples explained by the rule 23/40)
Rule 2: IF T71662 (Low 0.984) and X76383 (High 0.985) and X54938 (Low 0.989) and H88522 (Low 0.987) and H92523 (High 0.989) THEN NORMAL TISSUE (High 1.0) (receptive field 0.19; examples explained by this rule 13/22)
Thresholds used: 0.98 for the condition membership degrees and 0.95 for the conclusion membership degrees.
Two of the 12 extracted rules that reveal some conditions for colon cancer against normal tissue. Each rule represents a sub-class (cluster) of one of the two classes.

19 Disease Profiling Through Rule Extraction from EFuNN
Rule extraction from EFuNNs:
» Input space restricted to genes with high significance (e.g. 98 genes for the colon cancer data set (Alon et al))
» Rule extraction after learning in an EFuNN
» Rules represent disease profiles
» Proper visualization for a better understanding

20 Dynamic modeling and knowledge discovery from 14 cancer type gene expression data
A continuous flow of data
An adaptive mother model is created and updated over time: new data; new genes; new classes
At any time, an optimal simple model is extracted and analyzed
Rules are extracted and genes are analyzed
Example: Ramaswami's data (PNAS, January 2002) of 14 types of cancer
Future work: dynamic modeling of gene interaction networks and cell development

21 Using Evolving Self-organising Maps (ESOM) for clustering of time-course gene expression data
On-line clustering of time-course gene expression data by ESOMs (Da Deng and N. Kasabov, 2002, Neurocomputing)
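The idea of on-line clustering can be sketched with a much-simplified one-pass procedure. This is not the ESOM algorithm itself: it keeps only the "create a new node when no existing node is close enough, otherwise move the winner towards the input" behaviour and leaves out neighbourhood connections and pruning. The function name, distance threshold and learning rate are illustrative assumptions.

```python
import math


def online_cluster(profiles, dist_threshold: float, learning_rate: float = 0.1):
    """One pass over time-course profiles; returns the evolved prototype nodes."""
    nodes = []
    for profile in profiles:
        if nodes:
            distances = [math.dist(profile, node) for node in nodes]
            best = min(range(len(nodes)), key=distances.__getitem__)
        if not nodes or distances[best] > dist_threshold:
            # No prototype is close enough: evolve a new node at the input.
            nodes.append(list(profile))
        else:
            # Move the winning prototype a small step towards the input.
            nodes[best] = [
                w + learning_rate * (x - w) for w, x in zip(nodes[best], profile)
            ]
    return nodes


if __name__ == "__main__":
    time_courses = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [5.0, 5.0, 5.0], [5.1, 5.0, 5.0]]
    print(online_cluster(time_courses, dist_threshold=1.0))
```

Because each profile is seen once and the number of nodes is not fixed in advance, the same loop can keep running as new time-course data arrives.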

22 Amino Acid codons
The codons of each of the 20 amino acids. The first column represents the first base in the triplet, the first row the second base, and the last column the last base (Table 8.6).

23 Protein Structure Prediction
The mRNA is translated by ribosomes into proteins
A protein is a sequence of amino acids, each of them defined by a group of 3 nucleotides (codons)
20 amino acids altogether (A, C-H, I, K-N, P-T, V, W, Y)
Initiation and stop codons
Proteins have complex structures:
» Primary (linear),
» Secondary (3D, defining functionality),
» Tertiary (high level energy minimisation packing),
» Quaternary (interaction between molecules)
The Protein Data Bank ,000 hits a day on average
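Translation from codons to amino acids can be illustrated with the standard genetic code. In the sketch below the 64 codons are enumerated in U, C, A, G order and paired with the usual one-letter amino acid string, with "*" marking stop codons; the function name `translate` and the example sequence are hypothetical.

```python
BASES = "UCAG"
# Standard genetic code, codons enumerated with first, second, third base in UCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG (initiation codon) up to the first stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon: UAA, UAG or UGA
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("GGAUGGCCUGGUAAGG"))
```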

24 Protein Structure Prediction
Predicting the secondary structure from the primary
Segments from a protein can have different shapes:
» Helix
» Sheet
» Coil (loop)
An ANN is trained on existing data to predict the shape of an arbitrary new segment: a window of 13 amino acids; 273 inputs; 3 outputs; 18,000 examples for training
Research done mainly by Mike Watts in collaboration with Natural Selection Inc., based in La Jolla, California.
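The figure of 273 inputs is consistent with a one-hot encoding of 13 window positions over 21 symbols (13 x 21 = 273). The sketch below assumes exactly that: the 20 amino acids plus one extra padding symbol for positions that fall outside the sequence. The slide does not state what the 21st symbol was, so the padding interpretation, like the names used here, is an assumption.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 one-letter codes
PAD = "-"                             # assumed 21st symbol for positions off the ends
ALPHABET = AMINO_ACIDS + PAD
WINDOW = 13


def encode_window(sequence: str, centre: int) -> list:
    """One-hot encode the 13-residue window centred on `centre` (13 * 21 = 273 values)."""
    half = WINDOW // 2
    vector = []
    for pos in range(centre - half, centre + half + 1):
        symbol = sequence[pos] if 0 <= pos < len(sequence) else PAD
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(symbol)] = 1
        vector.extend(one_hot)
    return vector


if __name__ == "__main__":
    inputs = encode_window("MKTAYIAKQRQISFVKSHFSRQ", centre=10)
    print(len(inputs), sum(inputs))
```

The three network outputs would then correspond to helix, sheet and coil for the residue at the centre of the window.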

25 Proteins and protein structure prediction
The mRNA is translated into proteins
A protein is a sequence of amino acids, each of them defined by a group of 3 nucleotides (codons)
20 amino acids altogether (A, C-H, I, K-N, P-T, V, W, Y)
Initiation and stop codons
Proteins have complex structures:
» Primary (linear),
» Secondary (3D, defining functionality),
» Tertiary (energy minimisation packing),
» Quaternary (interaction between molecules)
The Protein Data Bank ,000 hits a day on average

26 Towards comprehensive EI for bioinformatics applications
Hybrid models
Using all available information: gene expression, biological, clinical, etc.
Comprehensive simulation systems
[Figure labels: Cell Parameters; System Parameters; DNA data of a living cell; RNA data; Protein data; Existing data bases; Evolving model of a cell; Output information; New knowledge extracted (DNA, Genes, Proteins, Metabolic networks)]

27 Dynamic Cell Modelling
"The cell is never conquered until its total behaviour is understood, and the total behaviour of the cell is never understood until it is modelled and simulated." (Tomita, 2001)
Computer modelling of processes in living cells is an extremely difficult task.
» The processes in a cell are dynamic and depend on many variables, some of them related to a changing environment.
» The processes of DNA transcription and protein translation are not fully understood.
Several cell models have been created and experimented with.
A starting point for dynamic modelling of a cell would be dynamic modelling of a single gene regulation process.
The next step in dynamic cell modelling would be to try to model the regulation of more genes, hopefully a large set of genes.

28 Genetic networks and reverse engineering
Genetic networks (GN) describe the regulatory interactions between genes
Reverse engineering: from gene expression data to GN
It is assumed that gene expression data reflects the underlying genetic regulatory network
When genes are co-expressed over time, either one regulates the other, or both are regulated by the same other genes
What is the time unit?
Appropriate data needed
Validation procedure
Correct interpretation of the models may generate new biological knowledge
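A simple first step towards reverse engineering is to look for genes that are co-expressed over time. The sketch below uses the Pearson correlation between time-course profiles as the co-expression measure; that choice, the threshold and the function names are assumptions made for illustration, and a high correlation only suggests a candidate relationship that still needs validation.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y) -> float:
    """Pearson correlation of two equally long series (assumes neither is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def co_expressed_pairs(series: dict, threshold: float = 0.9):
    """Gene pairs whose time courses are strongly (anti-)correlated."""
    return [
        (g, h)
        for g, h in combinations(sorted(series), 2)
        if abs(pearson(series[g], series[h])) >= threshold
    ]


if __name__ == "__main__":
    toy = {"g1": [1, 2, 3, 4, 5], "g2": [2, 4, 6, 8, 10], "g3": [5, 1, 4, 2, 3]}
    print(co_expressed_pairs(toy))
```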

29 Evolving fuzzy neural networks for GRN modeling
G(t) -> EFuNN -> G(t+dt)
On-line, incremental learning of a GN
Adding new inputs/outputs (new genes)
The rule nodes capture clusters of input genes that are related to the output genes
Rules can be extracted that explain the relationship between G(t) and G(t+dt), e.g.:
IF g13(t) is High (0.87) and g23(t) is Low (0.9) THEN g87(t+dt) is High (0.6) and g103(t+dt) is Low
Adjusting the threshold gives stronger or weaker patterns of relationship
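Two small pieces of this setup can be sketched without the EFuNN itself: building the (G(t), G(t+dt)) training pairs from an equally spaced time course, and thresholding the condition part of an extracted rule. The data layout and both function names are hypothetical; the example rule reuses the condition degrees quoted above.

```python
def make_training_pairs(time_course):
    """Pair each expression vector G(t) with its successor G(t+dt)."""
    return list(zip(time_course[:-1], time_course[1:]))


def prune_conditions(conditions: dict, threshold: float) -> dict:
    """Keep only rule conditions whose membership degree reaches the threshold."""
    return {
        gene: (label, degree)
        for gene, (label, degree) in conditions.items()
        if degree >= threshold
    }


if __name__ == "__main__":
    # Three time points of a toy three-gene expression vector.
    course = [[0.1, 0.9, 0.4], [0.2, 0.8, 0.5], [0.4, 0.6, 0.7]]
    for g_t, g_next in make_training_pairs(course):
        print(g_t, "->", g_next)

    rule_conditions = {"g13": ("High", 0.87), "g23": ("Low", 0.9)}
    print(prune_conditions(rule_conditions, threshold=0.88))
```

Raising the threshold leaves fewer but stronger conditions in a rule; lowering it keeps weaker relationships as well.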

30 DENFIS: Dynamic, evolving neuro-fuzzy inference systems for GN modeling (IEEE Trans. FS, April, 2002)
G(t) -> gj(t+dt)
Dynamic partitioning of the input space
Takagi-Sugeno fuzzy rules, e.g.:
if G1 is ( ) and G2 is ( ) and G3 is ( ) and G4 is ( ) and then Gy = X1-1.22X X X4
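First-order Takagi-Sugeno inference, the rule form named above, can be sketched as a weighted average of linear consequents. This is a generic illustration rather than the DENFIS implementation: Gaussian membership functions are used here for brevity, and every centre, width and coefficient below is made up, since the numbers in the example rule did not survive in the transcription.

```python
import math


def gaussian(x: float, centre: float, width: float) -> float:
    """Gaussian membership degree of x in a fuzzy set with the given centre and width."""
    return math.exp(-((x - centre) ** 2) / (2 * width ** 2))


def ts_predict(x, rules) -> float:
    """Weighted average of the linear rule outputs, weighted by rule firing strength."""
    numerator = denominator = 0.0
    for rule in rules:
        strength = 1.0
        for xi, (centre, width) in zip(x, rule["antecedents"]):
            strength *= gaussian(xi, centre, width)  # product as the fuzzy AND
        bias, *weights = rule["coefficients"]
        output = bias + sum(w * xi for w, xi in zip(weights, x))
        numerator += strength * output
        denominator += strength
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical rules over two input genes.
    rules = [
        {"antecedents": [(0.2, 0.3), (0.8, 0.3)], "coefficients": [0.1, 0.5, -0.2]},
        {"antecedents": [(0.9, 0.3), (0.1, 0.3)], "coefficients": [0.7, -0.4, 0.3]},
    ]
    print(ts_predict([0.3, 0.7], rules))
```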

31 Summary
Modelling biological processes aims at creating models that trace these processes over time.
The models should reveal the steps of development, the metamorphoses that occur at different points in time, and the trajectories of the developed patterns.
Biological processes evolve dynamically and require appropriate techniques, such as evolving connectionist systems.

32 Further Readings
Computational Molecular Biology (Pevzner, 2001).
Applications of neural network methods, mainly multilayer perceptrons and self-organising maps, in the general area of genome informatics (Wu and McLarty, 2000).
Microarray gene technologies (Schena, 2000).
Data mining in biotechnology (Persidis, 2000).
Application of the theory of complex systems for dynamic gene modelling (Bar-Yam, 1997).
Computational modelling of genetic and biochemical networks (Bower and Bolouri, 2001).
Dynamic modelling of the regulation of a large set of genes (Somogyi et al, 2001; D'haeseleer et al, 2000).
Methodology for gene expression profiling (Futschik et al, 2002; Futschik, 2002).
Using fuzzy neural networks and evolving fuzzy neural networks in bioinformatics (Kasabov, Futschik and Middlemiss, 2000).
Fuzzy clustering for gene expression analysis (Futschik and Kasabov, 2002).
Artificial neural filters for pattern recognition in protein sequences (Schneider and Wrede, 1993).
Dynamic models of the cell (Schaff and Loew, 1999; Tomita et al, 1999; Kohn and Dimitrov, 2000).