Data Mining in Bioinformatics. Prof. André de Carvalho ICMC-Universidade de São Paulo


Main topics
- Motivation
- Data Mining
- Prediction
- Bioinformatics
- Molecular Biology
- Using DM in Molecular Biology
- Case studies
- Gene Expression Analysis
- Protein function prediction

Motivation
- Genome research is producing a very large amount of data
- The number of stored base pairs (bp) has grown exponentially over the last 10 years
- At the beginning of the decade, it doubled every 12-15 months

[Figure: GenBank growth, Spring 2009. Source: http://www.ncbi.nlm.nih.gov/]

Motivation: GenBank contents by year (source: http://www.ncbi.nlm.nih.gov/)

Year  Base pairs      Sequences
2000  11,101,066,288  10,106,023
2001  15,849,921,438  14,976,310
2002  28,507,990,166  22,318,883
2003  36,553,368,485  30,968,418
2004  44,575,745,176  40,604,319
2005  56,037,734,462  52,016,762
2006  69,019,290,705  64,893,747
2007  83,874,179,730  80,388,382
2008  99,116,431,942  98,868,465
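The doubling-time claim can be checked against the table with a few lines of Python. This is a minimal sketch, assuming steady exponential growth between two rows and exactly one year between consecutive rows; the function name `doubling_time_months` is introduced here for illustration only.

```python
import math

# Base-pair counts copied from the GenBank table (year -> base pairs).
GENBANK_BP = {
    2000: 11_101_066_288,
    2001: 15_849_921_438,
    2002: 28_507_990_166,
    2007: 83_874_179_730,
    2008: 99_116_431_942,
}

def doubling_time_months(bp_start, bp_end, years):
    """Doubling time in months, assuming exponential growth over `years`."""
    growth = math.log(bp_end / bp_start)
    return 12.0 * years * math.log(2) / growth

# Early in the decade versus the last pair of rows.
early = doubling_time_months(GENBANK_BP[2001], GENBANK_BP[2002], 1)
late = doubling_time_months(GENBANK_BP[2007], GENBANK_BP[2008], 1)
print(f"2001-2002: {early:.1f} months, 2007-2008: {late:.1f} months")
```

Under these assumptions the 2001-2002 interval falls inside the 12-15 month range quoted for the beginning of the decade, while the later interval shows a slower doubling rate.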

Bioinformatics
- Several sequencing projects have been concluded recently, producing a large amount of data
- Up to 2009: 4370 projects, almost 1000 of them completed

[Figures: genome project statistics, including charts for 2009 and 2007. Source: http://www.genomesonline.org/gold_statistics.htm]

Bioinformatics: genome projects
- Complete genomes published (eukaryotes):
  - Human
  - Mouse
  - Drosophila
  - Arabidopsis thaliana
  - Domestic animals (bovine)

[Figure: genome project statistics. Source: http://www.genomesonline.org/gold_statistics.htm]

Motivation
- Emphasis is progressively moving from data accumulation to data interpretation
- The data produced by sequencing projects need to be analysed
- Analysis in laboratories is difficult and expensive
- Sophisticated computational tools are needed: data mining

DM and Machine Learning
- Most DM methods are based on Machine Learning (ML) techniques:
  - Decision Trees
  - Regression
  - Clustering
  - Association rules
  - Artificial Neural Networks
  - Support Vector Machines
  - Evolutionary Computation
  - Hybrid Intelligent Systems

Bioinformatics: definition
- Research and development of computational tools able to solve problems from Biology, in particular Molecular Biology
- "Computers are to biology what mathematics is to physics" (Harold Morowitz)

Bioinformatics
- Several areas may benefit: Medicine, Pharmacy, Agriculture
- Molecular Medicine:
  - Improve diagnosis of diseases
  - Detect genetic predisposition to pathologies
  - Create drugs based on molecular information
  - Use gene therapy as drugs
  - Design custom drugs based on individual genetic profiles

Molecular biology
- Study of cells and molecules, particularly the genome of organisms
- Main structures investigated:
  - Genes
  - Chromosomes
  - DNA (nucleotides)
  - RNA
  - Proteins (amino acids)
  - Gene expression


Molecular biology: Central Dogma of Molecular Biology
- Describes the transfer of information:
  - Replication: DNA -> DNA
  - Transcription: DNA -> RNA
  - Translation: RNA -> proteins
- These are only assumptions

Molecular biology
- Some discoveries contradict this dogma:
  - RNA can be replicated in some viruses and plants
  - Viral RNA can be transcribed into DNA by an enzyme named reverse transcriptase
  - DNA can directly produce specific proteins, without going through the transcription process

Molecular biology
- A genome is all the DNA in an organism, including its genes
- Genes carry the information for making all the proteins an organism requires
- These proteins determine, among other things, what an organism looks like, how well its body metabolizes food or fights infection, and sometimes even how it behaves

Molecular biology: DNA (deoxyribonucleic acid)
- Molecule made up of two antiparallel, twisted chains of alternating units of phosphoric acid and deoxyribose sugar
- Each sugar carries one of four types of bases: A (adenine), C (cytosine), G (guanine) and T (thymine)
- The chains are held together by hydrogen bonds that connect each nucleotide in one chain to its complement in the other chain
  - A pairs with T, and C pairs with G
- This gives DNA its double-helix appearance
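The base-pairing rule (A with T, C with G) can be expressed as a short Python sketch. The function names `complement` and `reverse_complement` are chosen here for illustration; the second one reads the partner strand in the opposite direction, which is how the complementary chain is conventionally written.

```python
# Watson-Crick pairing: A with T, C with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Base-by-base complement of a DNA strand, in the same order."""
    return "".join(COMPLEMENT[base] for base in strand)

def reverse_complement(strand):
    """Complementary strand read in the opposite direction."""
    return complement(strand)[::-1]

print(complement("TGCA"))          # ACGT
print(reverse_complement("AAGC"))  # GCTT
```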

Molecular biology: RNA (ribonucleic acid)
- Differs from DNA in several respects:
  - Single-stranded molecule
  - Contains ribose sugars instead of deoxyribose
  - Contains U (uracil) instead of T (thymine)
  - RNA molecules are typically smaller than DNA molecules

Molecular biology: genes
- Subsequences of DNA
- Located in chromosomes
- Used as templates for the production of proteins
- The segments located between genes are called non-coding regions

Molecular biology: proteins
- Define structure, function and regulatory mechanisms in the cells
  - Examples of regulatory mechanisms: cell cycle control, genetic transcription
- Can be represented by linear sequences built from 20 different amino acids
- Three nucleotides (a codon) are mapped to one amino acid


Gene expression process

[Figure, built up over four slides: RNA polymerase binds at the promoter and transcribes the DNA strand T G C A G C T C C G G A C T C C A T ... into the mRNA A C G U C G A G G C C U G A G G U A ...; the ribosome then translates the mRNA into a growing chain of amino acids (residue labels shown on the slides: Thr, Ser, His, Cys, Ser, Gly, Leu).]
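The two steps of the figure can be traced in Python using the sequence shown on the slides. This is a minimal sketch: strand direction is ignored, as in the figure, and `CODON_TABLE` is a deliberately partial excerpt of the standard genetic code covering only the six codons of this example. Under the standard code the first codon reads Thr, as on the slides; the remaining residue labels drawn in the figure are best read as schematic.

```python
# Template base -> mRNA base (like DNA pairing, but with U instead of T).
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

# Partial excerpt of the standard genetic code (only codons used below).
CODON_TABLE = {
    "ACG": "Thr", "UCG": "Ser", "AGG": "Arg",
    "CCU": "Pro", "GAG": "Glu", "GUA": "Val",
}

def transcribe(template):
    """Build the mRNA complementary to a DNA template strand."""
    return "".join(DNA_TO_RNA[base] for base in template)

def translate(mrna):
    """Map consecutive codons (triplets) to amino acids."""
    usable = len(mrna) - len(mrna) % 3
    codons = [mrna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TABLE[codon] for codon in codons]

template = "TGCAGCTCCGGACTCCAT"   # DNA strand from the slides
mrna = transcribe(template)
protein = translate(mrna)
print(mrna)
print("-".join(protein))
```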

DM and molecular biology
- Problems in Molecular Biology where ML techniques have been used:
  - Gene recognition
  - Reconstruction of phylogenies
  - Gene expression analysis
  - Protein structure prediction
  - Protein function prediction
  - Gene regulation analysis
  - Sequence alignment

Gene expression analysis
- Concerned with identifying the function of genes
- Main goals:
  - Reveal patterns in genetic datasets, looking for patterns of similarity and dissimilarity
  - Analyze the expression levels of thousands of genes collected from different tissues

Gene expression analysis
- Several techniques are used to detect gene expression in a tissue:
  - Microarrays
  - SAGE
  - PCR
  - MPSS

[Figure: microarray experiment comparing Tissue A and Tissue B]

Gene expression analysis: reading a microarray
- Green spot: the gene is abundant in the healthy state
- Red spot: the gene is abundant in the disease state
- Yellow spot: the gene is abundant in both states
- Black spot: the gene is expressed in neither the healthy nor the disease state
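The colour coding amounts to a two-channel decision rule, sketched below. The `threshold` parameter and the intensity scale are assumptions made for illustration; real microarray pipelines normalize the two channels and usually work with log ratios rather than a hard cut-off.

```python
def spot_colour(healthy, disease, threshold=1.0):
    """Colour of a spot given the expression level in each state.

    `threshold` is a hypothetical cut-off above which a gene is
    considered abundant; it is not a value taken from the slides.
    """
    in_healthy = healthy >= threshold
    in_disease = disease >= threshold
    if in_healthy and in_disease:
        return "yellow"
    if in_healthy:
        return "green"
    if in_disease:
        return "red"
    return "black"

print(spot_colour(3.2, 0.1))  # green
print(spot_colour(2.5, 2.7))  # yellow
```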


Gene expression analysis
- Measures expression levels under several conditions, giving a matrix with one row per gene (Gene 1, Gene 2, ..., Gene n) and one column per condition (Condition 1, ..., Condition m)
- Examples of conditions:
  - Normal and cancer tissue
  - Different treatments
  - Before and after using a drug
  - Different periods of treatment
  - Different diseases
- The data are costly to obtain

Gene expression analysis
- Data mining techniques have been widely used, for classification or clustering
- Microarray data are challenging:
  - High dimensionality
  - Irrelevant features
  - Redundant features
  - Noisy data
  - Small number of tissue samples

Gene expression analysis
- Usually works with a subset of genes, in order to:
  - Identify important genes
  - Improve classification accuracy
  - Minimize the effects of noise
  - Make the technology more accessible, so that it can become a common clinical tool

Gene expression analysis: gene selection
- Not all genes are relevant for tissue classification or clustering
- Use only the most relevant genes
- Each gene can be seen as an attribute, so the problem becomes attribute selection
- Two approaches are used:
  - Ranking of attributes
  - Selection of the best subset of attributes
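The ranking approach can be sketched in a few lines of Python. The scoring function below is a signal-to-noise style statistic chosen purely for illustration (the slides do not name a specific criterion), the data are invented toy values, and the sketch assumes exactly two classes.

```python
from statistics import mean, stdev

def signal_to_noise(values_a, values_b):
    """Absolute difference of class means over the summed class spreads."""
    spread = stdev(values_a) + stdev(values_b)
    if spread == 0:
        return 0.0
    return abs(mean(values_a) - mean(values_b)) / spread

def rank_genes(expression, labels, top_k):
    """Return the top_k gene names, most discriminative first.

    expression: dict mapping gene name -> list of values, one per sample
    labels:     list with the class of each sample (two classes assumed)
    """
    class_a, class_b = sorted(set(labels))
    scores = {}
    for gene, values in expression.items():
        a = [v for v, y in zip(values, labels) if y == class_a]
        b = [v for v, y in zip(values, labels) if y == class_b]
        scores[gene] = signal_to_noise(a, b)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy data: three genes measured on three normal and three cancer samples.
expression = {
    "gene1": [1.0, 1.2, 0.9, 5.0, 5.3, 4.8],
    "gene2": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
    "gene3": [0.5, 3.0, 1.5, 1.0, 2.5, 2.0],
}
labels = ["normal"] * 3 + ["cancer"] * 3
print(rank_genes(expression, labels, top_k=2))
```

Ranking scores each gene independently, so it is cheap but cannot detect redundancy between genes; subset selection addresses that at a higher computational cost.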

Experiment 1
- Several ML techniques have been used for gene expression analysis, for example in tissue classification
- Given a gene expression dataset, which technique should be used?
  - Trial and error
  - Meta-learning

Experiment 1: examples
- 49 datasets
- 7 ML algorithms
- Relative performances: no clear winner

Metalearning
- Issues with algorithm selection:
  - The choice of ML algorithm should be data driven
  - Trial and error may be very time consuming
- Metalearning:
  - Learn from the past to predict the future
  - Relate data characteristics to the preference for particular algorithms
  - Construct rankings of algorithms
  - Fast and easy to apply

Metalearning: 3 main steps
- Generation of metadata
- Induction of the meta-learning model
- Application of the metamodel

Metalearning: generation of metadata
- Synthesize data characteristics and algorithm performances into metaexamples
  - Metafeatures: general, statistical and information-theoretic measures
  - Target: ranking of the estimated performances for a set of algorithms
- A flexible recommendation allows users to try out algorithms according to their preference

Metalearning: induction of the meta-learning model
- k-NN ranking method:
  - Find the nearest metaexamples (Euclidean distance)
  - Combine their target rankings (average rank)
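The k-NN ranking method can be sketched as follows. This is a minimal illustration with invented metafeature vectors and rankings; the data layout (a list of `(metafeatures, ranks)` pairs) and the function names are assumptions, not the authors' implementation.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two metafeature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_ranking(query, metaexamples, k):
    """Recommend a ranking of algorithms for a new dataset.

    metaexamples: list of (metafeature_vector, {algorithm: rank}) pairs,
                  where rank 1 is the best algorithm on that dataset.
    Returns the algorithms ordered by average rank over the k nearest
    metaexamples (lowest average rank first).
    """
    nearest = sorted(metaexamples, key=lambda ex: euclidean(query, ex[0]))[:k]
    algorithms = nearest[0][1].keys()
    average = {
        name: sum(ranks[name] for _, ranks in nearest) / len(nearest)
        for name in algorithms
    }
    return sorted(average, key=average.get)

# Toy metadata: two metafeatures per dataset, three candidate algorithms.
metaexamples = [
    ([0.0, 0.0], {"SVM": 1, "3-NN": 2, "PAM": 3}),
    ([0.1, 0.2], {"SVM": 2, "3-NN": 1, "PAM": 3}),
    ([5.0, 5.0], {"SVM": 3, "3-NN": 2, "PAM": 1}),
    ([0.2, 0.1], {"SVM": 1, "3-NN": 3, "PAM": 2}),
]
print(knn_ranking([0.1, 0.1], metaexamples, k=3))
```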

Metalearning: application of the metamodel
- Supports the user in the algorithm selection process
- Compute the metaexample for new data:
  - Metafeatures
  - Target
- Evaluation of the metamodel:
  - Leave-one-out (LOO)
  - Spearman's rank correlation for ranking accuracy
  - Default ranking as the baseline method, in which all metaexamples are considered
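Spearman's rank correlation compares a recommended ranking with the true one. The sketch below uses the textbook formula for rankings without ties, rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference between the two ranks of each algorithm; the dictionary layout is an assumption made for illustration.

```python
def spearman(rank_a, rank_b):
    """Spearman's rank correlation between two rankings without ties.

    rank_a, rank_b: dicts mapping each algorithm to its rank (1 = best).
    Returns 1.0 for identical rankings and -1.0 for reversed rankings.
    """
    n = len(rank_a)
    d_squared = sum((rank_a[name] - rank_b[name]) ** 2 for name in rank_a)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

recommended = {"SVM": 1, "3-NN": 2, "PAM": 3}
observed = {"SVM": 2, "3-NN": 1, "PAM": 3}
print(spearman(recommended, observed))  # 0.5
```

In a leave-one-out evaluation, each metaexample is held out in turn, a ranking is recommended from the remaining ones, and the correlations are averaged.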

Experimental results: data
- 49 cancer-related datasets
  - Mainly related to disease diagnosis
  - Diverse data characteristics
- Data were standardized to mean 0 and variance 1, as usual
- The calculation of the metafeatures was preceded by a data reduction step
  - PLS (partial least squares) reduced the number of attributes to 3 components

Experimental results: ML algorithms
- Selection criteria:
  - Common approaches
  - Moderate computational burden
  - Easy availability
- Algorithms: SVM Linear, SVM RBF, DLDA, DQDA, PAM, 3-NN, PDA
  - Default parameters
- Performance estimation: .632+ bootstrap estimator with 50 examples

Experimental results: metafeatures
- 10 Statlog continuous measures:
  - Log of the number of examples
  - Log of the number of attributes
  - Log of the number of classes
  - Mean absolute skewness
  - Mean kurtosis
  - Geometric mean ratio of the standard deviations of individual populations to the pooled standard deviations
  - First canonical correlation
  - Proportion of total variation explained by the first canonical correlation
  - Normalized class entropy
  - Average absolute correlation between continuous attributes, per class
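A few of the simpler measures in this list can be computed directly, as sketched below. The sketch assumes natural logarithms for the three count-based measures (the slide does not state the base) and normalizes the class entropy by its maximum, log2 of the number of classes; the function names are introduced here for illustration.

```python
import math
from collections import Counter

def normalized_class_entropy(labels):
    """Class entropy divided by its maximum, so the result lies in [0, 1]."""
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0
    n = len(labels)
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(len(counts))

def simple_metafeatures(rows, labels):
    """Four of the listed measures for a dataset given as a list of rows."""
    return {
        "log_examples": math.log(len(rows)),        # natural log assumed
        "log_attributes": math.log(len(rows[0])),
        "log_classes": math.log(len(set(labels))),
        "class_entropy": normalized_class_entropy(labels),
    }

# Toy dataset: four examples, three attributes, two balanced classes.
rows = [[0.1, 1.2, 3.3], [0.4, 1.0, 2.9], [2.1, 0.2, 0.5], [1.9, 0.3, 0.7]]
labels = ["tumour", "tumour", "normal", "normal"]
print(simple_metafeatures(rows, labels))
```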

Experimental results
- Mean ranking LOO accuracies, varying k from 1 to 20
- Always better than the default ranking
- Performance degrades smoothly as k increases
- More homogeneous datasets

Analysis of the results
- Previously, metalearning had been applied to general classification domains
- Here, a successful application of metalearning to the gene expression analysis domain is presented
- Future steps:
  - Compare metalearning approaches
  - Employ domain-specific metafeatures

Protein function prediction
- Allows the assignment of functions to newly discovered proteins
- Important problem in proteomics
- Common approach: search for similar sequences
- Alternative approach: induce a classification model

Protein function prediction
- Difficulties associated with the prediction of protein function:
  - The same protein may have more than one function: multi-label classification
  - Functions may vary from more generic to more specific: hierarchical classification

[Figure: class hierarchy of enzymes]

Multi-label classification
- Examples may belong to more than one class simultaneously

[Figure: examples plotted on axes labelled Pres. and Temp., with overlapping regions for Cough, Aches and Fever: Cough; Cough and Aches; Aches; Cough and Fever; Fever; Aches and Fever; Cough, Aches and Fever]

Multi-label classification: two main approaches
- Algorithm independent:
  - Transformation into a single-label problem
  - Combination of conventional single-label classifiers
- Algorithm dependent:
  - Modification of single-label classifiers (their internal mechanisms)
  - Development of new multi-label classification algorithms
  - Encoding of the multi-label output

Multi-label classification: algorithm-independent transformation
- Label-based
- Instance-based:
  - Instance elimination
  - Creation of new labels
  - Label conversion:
    - Label elimination
    - Label decomposition

Multi-label classification: label-based transformation
- A classifier is associated with each label (class), giving one binary classification problem per label

Multi-label problem:

Instance  Classes
1         A and B
2         A
3         A and B
4         C
5         B
6         A

Single-label (binary) problems:

Classifier  Positive    Negative
A           1, 2, 3, 6  4, 5
B           1, 3, 5     2, 4, 6
C           4           1, 2, 3, 5, 6
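The label-based transformation can be reproduced on the six-instance example with a short Python sketch. The data structure (a dict from instance number to a set of classes) and the function name `label_based` are assumptions made for illustration.

```python
# The six-instance multi-label example from the slide.
DATA = {1: {"A", "B"}, 2: {"A"}, 3: {"A", "B"}, 4: {"C"}, 5: {"B"}, 6: {"A"}}

def label_based(data):
    """One binary problem per label: (positive instances, negative instances)."""
    labels = sorted(set().union(*data.values()))
    problems = {}
    for label in labels:
        positive = [i for i in sorted(data) if label in data[i]]
        negative = [i for i in sorted(data) if label not in data[i]]
        problems[label] = (positive, negative)
    return problems

for label, (positive, negative) in label_based(DATA).items():
    print(label, positive, negative)
```

Each binary problem can then be handed to any conventional classifier; at prediction time an instance receives every label whose classifier answers positive.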

Multi-label classification: instance-based transformation, instance elimination

Instance  Multi-label classes  Single-label class
1         A and B              (instance eliminated)
2         A                    A
3         A and B              (instance eliminated)
4         C                    C
5         B                    B
6         A                    A

Multi-label classification: instance-based transformation, label creation (label-powerset)

Instance  Multi-label classes  Single-label class
1         A and B              D
2         A                    A
3         A and B              D
4         C                    C
5         B                    B
6         A                    A
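Label creation can be sketched in Python as follows. Where the slide introduces a new label named D for the combination of A and B, this sketch builds the new name by joining the original labels ("A+B"), which is an illustrative naming choice rather than part of the method.

```python
# The six-instance multi-label example from the slide.
DATA = {1: {"A", "B"}, 2: {"A"}, 3: {"A", "B"}, 4: {"C"}, 5: {"B"}, 6: {"A"}}

def label_powerset(data):
    """Map every distinct label set to a single new class."""
    return {i: "+".join(sorted(labels)) for i, labels in data.items()}

single_label = label_powerset(DATA)
print(single_label)
```

The number of new classes can grow quickly with the number of distinct label combinations, and rare combinations leave few training examples per class.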

Multi-label classification: instance-based transformation, label elimination

Instance  Multi-label classes  Single-label class
1         A and B              A
2         A                    A
3         A and B              B
4         C                    C
5         B                    B
6         A                    A
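The two elimination strategies, instance elimination and label elimination, can be sketched together. In the slide's label-elimination table the surviving label looks arbitrary (A for instance 1, B for instance 3); the sketch takes the selection rule as a parameter, `choose`, which is a hypothetical hook that defaults to picking the alphabetically first label.

```python
# The six-instance multi-label example from the slides.
DATA = {1: {"A", "B"}, 2: {"A"}, 3: {"A", "B"}, 4: {"C"}, 5: {"B"}, 6: {"A"}}

def instance_elimination(data):
    """Drop every instance that has more than one label."""
    return {i: next(iter(labels)) for i, labels in data.items() if len(labels) == 1}

def label_elimination(data, choose=min):
    """Keep a single label per instance, selected by `choose`."""
    return {i: choose(labels) for i, labels in data.items()}

print(instance_elimination(DATA))
print(label_elimination(DATA))
print(label_elimination(DATA, choose=max))
```

Both strategies discard information: the first loses whole instances, the second loses the extra labels of multi-label instances.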

Multi-label classification: instance-based transformation, label decomposition (cross-training method)

Instance  Multi-label classes  Single-label problem 1  Single-label problem 2
1         A, B                 A                       B
2         A                    A                       A
3         A, B                 A                       B
4         C                    C                       C
5         B                    B                       B
6         A                    A                       A

Multi-label classification: instance-based transformation, label decomposition (multiplicative method)

Instance  Multi-label classes  Problem 1  Problem 2  Problem 3  Problem 4
1         A, B                 A          A          B          B
2         A                    A          A          A          A
3         A, B                 A          B          A          B
4         C                    C          C          C          C
5         B                    B          B          B          B
6         A                    A          A          A          A
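Both label-decomposition variants can be reproduced on the six-instance example. The sketch follows the tables shown on the slides rather than any particular published implementation: the cross-training variant builds one single-label problem per label position, and the multiplicative variant builds one problem per combination of label choices. Function names and the data layout are assumptions.

```python
from itertools import product

# The six-instance multi-label example from the slides.
DATA = {1: {"A", "B"}, 2: {"A"}, 3: {"A", "B"}, 4: {"C"}, 5: {"B"}, 6: {"A"}}

def cross_training(data):
    """Problem k keeps the k-th label of each instance.

    Instances with fewer labels reuse their last label, so
    single-label instances appear unchanged in every problem.
    """
    width = max(len(labels) for labels in data.values())
    problems = []
    for k in range(width):
        problems.append({
            i: sorted(labels)[min(k, len(labels) - 1)]
            for i, labels in data.items()
        })
    return problems

def multiplicative(data):
    """One problem for every combination of one label per instance."""
    ids = sorted(data)
    choices = [sorted(data[i]) for i in ids]
    return [dict(zip(ids, combo)) for combo in product(*choices)]

print(len(cross_training(DATA)), "cross-training problems")
print(len(multiplicative(DATA)), "multiplicative problems")
```

The multiplicative variant grows as the product of the label-set sizes, which is why it quickly becomes impractical on datasets with many multi-label instances.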

Output encoding
- Encode the desired output as binary vectors, one position per class

Instance  Classes  Code (A B C)
1         A and B  1 1 0
2         A        1 0 0
3         A and B  1 1 0
4         C        0 0 1
5         B        0 1 0
6         A        1 0 0
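The encoding can be generated mechanically, as in this short sketch; the data layout and the function name `encode` are illustrative assumptions.

```python
# The six-instance multi-label example from the slide.
DATA = {1: {"A", "B"}, 2: {"A"}, 3: {"A", "B"}, 4: {"C"}, 5: {"B"}, 6: {"A"}}

def encode(data, classes=("A", "B", "C")):
    """Binary indicator vector per instance, one position per class."""
    return {i: [1 if c in labels else 0 for c in classes]
            for i, labels in data.items()}

for instance, code in encode(DATA).items():
    print(instance, code)
```

A model with one output per class, such as a neural network with three output units, can then be trained directly on these vectors.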

Multi-label classification: evaluation
- Requires specific measures
  - An example can be classified partially correctly or partially incorrectly
  - The classification may use a ranking
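Two commonly used measures that give partial credit are sketched below: a set-overlap (Jaccard-style) accuracy and the Hamming loss. The slides do not name the measures used in the experiments, so these are offered only as representative examples.

```python
def multilabel_accuracy(true_sets, predicted_sets):
    """Mean over examples of |true & predicted| / |true | predicted|."""
    total = 0.0
    for true, predicted in zip(true_sets, predicted_sets):
        union = true | predicted
        total += len(true & predicted) / len(union) if union else 1.0
    return total / len(true_sets)

def hamming_loss(true_sets, predicted_sets, n_labels):
    """Fraction of (example, label) pairs that are predicted wrongly."""
    errors = sum(len(true ^ predicted)
                 for true, predicted in zip(true_sets, predicted_sets))
    return errors / (len(true_sets) * n_labels)

# The first prediction misses label B, so it is only partially correct.
true_sets = [{"A", "B"}, {"C"}]
predicted_sets = [{"A"}, {"C"}]
print(multilabel_accuracy(true_sets, predicted_sets))  # 0.75
print(hamming_loss(true_sets, predicted_sets, n_labels=3))
```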

Experiment 2
- Comparison of three algorithm-independent methods for multi-label classification:
  - One-against-all (OAA)
  - Label-Powerset
  - Cross-Training
- Datasets:
  - Yeasts: proteins found in the organism Saccharomyces cerevisiae
  - Sequences: protein sequences classified into structural families

Experiments: datasets
- Yeasts:
  - 2417 examples (2385 multi-label)
  - 103 numerical attributes
  - Distribution (34, 731.5, 1816) and 4.23 classes/example
- Sequences:
  - 662 examples (69 multi-label)
  - 1186 nominal attributes
  - Distribution (16, 54.5, 171) and 1.15 classes/example

[Figure: Yeasts dataset. Three bar charts (OAA, Label-Powerset, Cross-Training) showing the accuracy of KNN, SVM, C4.5, Ripper and BN; accuracy axis from 0 to 60]

[Figure: Sequences dataset. Three bar charts (OAA, Label-Powerset, Cross-Training) showing the accuracy of KNN, SVM, C4.5, Ripper and BN; accuracy axis from 70 to 100]

Analysis of the results
- Each method favours a specific feature of the dataset
  - High or low frequency of multi-label examples
- SVMs usually present better predictive accuracy
- Similar results were obtained for other datasets and performance metrics

Hierarchical classification
- Classification problems where:
  - Classes can be partitioned into subclasses
  - Classes can be grouped into superclasses
- Data are hierarchically organized: {1, 1.1, 1.2, ..., k, k.1, k.2}
- Classes assume a hierarchical organization

Hierarchical classification: types of hierarchy-based classification
- Mandatory prediction to leaf nodes
- Prediction to any node

[Figure: example class hierarchy with the nodes Sick, Cough, Aches, Fever, SARS, Spanish and Common]

Types of hierarchy
- (a) Trees
- (b) Directed Acyclic Graphs (DAGs)

Hierarchical classification: main approaches
- Transformation into a flat classification problem
- Hierarchical prediction with flat classification algorithms
- Top-down
- Big-bang (one-shot)

Hierarchical classification: transformation into a flat classification problem
- Reduces the complexity of the problem
- Most used method: select a hierarchy level and perform flat classification at this level
- Advantage: the simplest approach
- Disadvantage: classification at the other levels of the hierarchy is lost

[Figure: the original hierarchical problem and the resulting flat classification problem]

Hierarchical classification: hierarchical prediction with flat classification algorithms
- Divides the original problem into a set of flat classification problems
  - One flat classification for each level
- Advantage: no need to modify flat classification algorithms
- Disadvantage: classifications at different levels may be inconsistent
  - Example: class 2 (level 2) and class 3.4 (level 3)

[Figure: one flat problem per level (Problem 1 at Level 1, Problem 2 at Level 2)]

Hierarchical classification: top-down
- Divides the original problem into a set of flat classification problems
  - These are dealt with sequentially, level by level, starting from the root
  - Classification proceeds in the sub-tree associated with the previously chosen node
- Advantage: no need to modify flat classification algorithms
- Disadvantage: risk of classification error propagation
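The top-down strategy can be sketched as a loop that descends the hierarchy, asking one local classifier per visited node. In this illustration the local classifiers are stand-in rules over two made-up attributes (`weight` and `length`); in practice each would be a trained flat classifier, and all names here are hypothetical.

```python
def top_down_predict(example, classifiers, root="root"):
    """Descend from the root, letting each node's classifier pick a child.

    classifiers: dict mapping a node to a function that returns one of
                 its children for the example. Nodes without an entry
                 are treated as leaves.
    Returns the path of predicted classes, one per level.
    """
    node = root
    path = []
    while node in classifiers:
        node = classifiers[node](example)
        path.append(node)
    return path

# Stand-in local classifiers for a small tree: root -> {1, 2}, 1 -> {1.1, 1.2}.
classifiers = {
    "root": lambda x: "1" if x["weight"] < 50 else "2",
    "1": lambda x: "1.1" if x["length"] < 300 else "1.2",
}

print(top_down_predict({"weight": 40, "length": 350}, classifiers))
print(top_down_predict({"weight": 60, "length": 350}, classifiers))
```

The second call illustrates error propagation: once the root classifier chooses class 2, no classifier below class 1 is ever consulted, so a mistake at the top cannot be corrected further down.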


Hierarchical classification: big-bang
- The classification algorithm considers the whole hierarchy
- Advantage: classification is carried out in just one go
- Disadvantage: complexity of the algorithms


Hierarchical classification: evaluation measures
- Uniform cost
  - Most used
- Distance-based cost
  - Based on the distance between the predicted class and the true class
- Depth-dependent cost
  - Errors at higher levels should have a higher cost
- Semantic-based cost
  - More similar classes receive smaller penalties
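Distance-based and depth-dependent costs can be sketched for a tree hierarchy whose classes are written as dotted codes such as "1.2". The distance counts the edges on the path between the two classes; the depth-dependent variant weights an edge at depth d by `decay ** (d - 1)`, where the `decay` parameter is an assumption chosen for illustration so that errors near the root cost more.

```python
def _split(true_class, predicted_class):
    """Split both codes and count the levels they share from the root."""
    a, b = true_class.split("."), predicted_class.split(".")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return a, b, common

def tree_distance(true_class, predicted_class):
    """Number of edges between two classes in a tree hierarchy."""
    a, b, common = _split(true_class, predicted_class)
    return (len(a) - common) + (len(b) - common)

def depth_weighted_distance(true_class, predicted_class, decay=0.5):
    """Like tree_distance, but an edge at depth d costs decay ** (d - 1)."""
    a, b, common = _split(true_class, predicted_class)
    cost = sum(decay ** d for d in range(common, len(a)))
    cost += sum(decay ** d for d in range(common, len(b)))
    return cost

print(tree_distance("1.1", "1.2"), tree_distance("1.1", "2.1"))
print(depth_weighted_distance("1", "2"), depth_weighted_distance("1.1", "1.2"))
```

With the plain distance, confusing classes 1 and 2 costs the same as confusing 1.1 and 1.2; with the weighted version the top-level confusion costs twice as much.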

Hierarchical classification: evaluation measures might
- Report an accuracy rate for the whole hierarchy
- Report an accuracy rate for each level
- Report an accuracy rate for each class

Experiment 3
- Two datasets:
  - G-Protein-Coupled Receptors (GPCRs)
  - Enzymes
- Data extracted from UniProt and GPCRDB
- Attributes: InterPro entries, along with molecular weight and sequence length

Datasets
- G-Protein-Coupled Receptors (GPCRs):
  - 40-50% of current drugs target GPCR activity
  - 7461 instances
  - Class hierarchy: 12/54/82/50 classes per level
- Enzymes:
  - Catalysts that speed up chemical reactions within the cell
  - 6925 instances
  - Class hierarchy: 2/21/48/87 classes per level

Investigated approaches
- Classification technique: decision trees
- Four models were used:
  - Flat, based on leaves
  - Flat, all levels
  - Top-Down
  - Big-Bang (Clare and King, 2003)

[Figure: hierarchical classification of proteins]

Analysis of the results
- Hierarchical approaches are better than flat approaches
  - GPCR: Top-Down
- Other classifiers
- Ensembles
- Different metrics

Conclusion
- Data Mining
- Motivation
- Molecular Biology
- Bioinformatics problems:
  - Analysis of Gene Expression
  - Protein function classification
- DM solutions

Conclusion
- Classification can be a complex task
- New types of problems are being investigated, and novel demands may arise
- New techniques are needed, together with metrics to evaluate them in these nontrivial classification problems

Acknowledgements
- Alex Freitas, University of Kent
- Ana Carolina Lorena, UFABC
- André Rossi, ICMC-USP
- Bruno Feres de Souza, ICMC-USP
- Katti Faceli, UFSCar
- Eduardo Costa, UKL
- Eduardo J. Spinosa, ICMC-USP
- Carlos Soares, Universidade do Porto
- João Gama, Universidade do Porto
