Mining Association Rule Bases from Integrated Genomic Data and Annotations

Ricardo Martinez (1), Nicolas Pasquier (1) and Claude Pasquier (2)

(1) Laboratoire I3S, Université de Nice / CNRS UMR-6070, Sophia-Antipolis, France. rmartine@i3s.unice.fr, pasquier@i3s.unice.fr
(2) Institute of Developmental Biology and Cancer, Université de Nice / CNRS UMR-6543, Nice, France. claude.pasquier@unice.fr

GenMiner General Framework

[Figure: framework diagram. Biological data sources: gene expression data, phenotypes, molecular pathways, bibliography, transcriptional regulators, Gene Ontology. Gene Ontology example terms: Biological process, Response to stimulus, Response to temperature stimulus, Response to stress, Response to cold. Remaining labels: Data mining, Information.]

GenMiner Process

1. Data integration
2. Normal Discretization (NorDi)
3. Data mining (JClose)
4. Exploration and interpretation

[Figure: process diagram. Data labels: gene expression measures, annotations, annotations enriched dataset, gene expression levels and annotations, bases of association rules, knowledge.]

The Normal Discretization (NorDi) Algorithm

1. Removal of outliers as long as it induces an improvement of normality
   Outliers detected with Grubbs' test
   Normality tested with the Jarque-Bera test
2. Verification of the normality of the cleaned distribution
   Performed with the Lilliefors test
3. Calculation of the over-expression and under-expression cutoffs using the Z-score
4. Discretization of the initial distribution: under-expressed, unchanged, over-expressed

[Figure: distribution of gene expression measures, with the under-expression and over-expression cutoffs marked]
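The steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' NorDi implementation: the candidate outlier is simply the value farthest from the mean (the quantity Grubbs' test examines, but without its significance threshold), normality is scored only with the Jarque-Bera statistic, the Lilliefors check of step 2 is omitted, and the function names and the `tail` parameter (the fraction of the fitted normal curve placed beyond each cutoff) are hypothetical.

```python
import random
import statistics
from statistics import NormalDist


def jarque_bera(xs):
    """Jarque-Bera statistic n/6 * (S^2 + (K - 3)^2 / 4); smaller means closer to normal."""
    n = len(xs)
    mean = statistics.fmean(xs)
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)


def trim_outliers(xs, max_removed=None):
    """Step 1 (simplified): drop the most extreme value while that lowers the JB statistic."""
    data = list(xs)
    limit = len(data) // 10 if max_removed is None else max_removed
    for _ in range(limit):
        mean = statistics.fmean(data)
        candidate = max(data, key=lambda x: abs(x - mean))
        trimmed = list(data)
        trimmed.remove(candidate)
        if jarque_bera(trimmed) >= jarque_bera(data):
            break  # removal no longer improves normality
        data = trimmed
    return data


def nordi_cutoffs(xs, tail=0.025):
    """Step 3: cutoffs at mean -/+ z * sd of the cleaned distribution (tail is an assumed value)."""
    cleaned = trim_outliers(xs)
    mean = statistics.fmean(cleaned)
    sd = statistics.stdev(cleaned)
    z = NormalDist().inv_cdf(1.0 - tail)
    return mean - z * sd, mean + z * sd


def discretize(xs, low, high):
    """Step 4: label every value of the initial distribution, outliers included."""
    return ["under" if x < low else "over" if x > high else "unchanged" for x in xs]


# Synthetic expression measures with two planted outliers (illustrative data only).
random.seed(0)
values = [random.gauss(0.0, 1.0) for _ in range(200)] + [9.0, -8.5]
low, high = nordi_cutoffs(values)
labels = discretize(values, low, high)
```

Note that the cutoffs are estimated on the cleaned distribution but applied to the initial one, so the removed outliers still receive an over-expressed or under-expressed label.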

Association Rules Extraction (JClose)

Based on the closure of the Galois connection

Problem of efficiency
  Apriori-based approaches are efficient for sparse, weakly correlated data (e.g. market basket data)
  Execution times and memory usage become high for dense, correlated data (e.g. genomic data)
  The frequent closed itemsets approach improves both when data are dense and correlated
    Reduces the search space
    Reduces the number of dataset scans
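To make the closure idea concrete, here is a brute-force sketch on a toy dataset. It is not the JClose algorithm, which avoids enumerating every itemset; it only illustrates why closed itemsets are a smaller search space. The closure of an itemset is taken as the intersection of all transactions that contain it. The function names and the five toy transactions (one per gene) are hypothetical.

```python
from itertools import combinations


def covering(itemset, transactions):
    """Transactions that contain every item of the itemset."""
    return [t for t in transactions if itemset <= t]


def closure(itemset, transactions):
    """Galois closure: the items shared by all transactions containing the itemset."""
    matched = covering(itemset, transactions)
    if not matched:
        return frozenset(itemset)
    return frozenset.intersection(*matched)


def frequent_itemsets(transactions, minsup_count):
    """Naive enumeration of every itemset with support >= minsup_count."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = len(covering(candidate, transactions))
            if count >= minsup_count:
                frequent[candidate] = count
    return frequent


def frequent_closed_itemsets(transactions, minsup_count):
    """Map each frequent itemset to its closure; duplicates collapse to one entry."""
    closed = {}
    for itemset, count in frequent_itemsets(transactions, minsup_count).items():
        closed[closure(itemset, transactions)] = count
    return closed


# Toy dataset: each transaction holds one gene's expression level and annotations.
transactions = [
    frozenset({"Exp1(+)", "GO:001", "GO:121"}),
    frozenset({"Exp1(+)", "GO:001", "GO:121"}),
    frozenset({"GO:001"}),
    frozenset({"GO:121"}),
    frozenset({"GO:001", "GO:121"}),
]
all_frequent = frequent_itemsets(transactions, minsup_count=2)
closed = frequent_closed_itemsets(transactions, minsup_count=2)
```

On this toy data the seven frequent itemsets collapse to four closed ones, because every itemset containing Exp1(+) has the same closure. In dense, correlated data many itemsets share a closure in this way, which is the reduction the slide refers to.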

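Experimental Results: Execution Times and Memory Usage

Annotations enriched Eisen et al. dataset
(cells marked "?" held numeric values that are missing from this transcript)

minsup (#)   minconf   JClose (s)      Apriori (s)
? (50)       ?         ?               ?
? (37)       ?         ?               ?
? (25)       ?         ?               ?
? (22)       ?         ?               ?
? (19)       ?         ?               ?
? (17)       ?         ?               ?
? (14)       ?         ?               Out of memory
? (12)       ?         ?               Out of memory
? (9)        ?         ?               Out of memory
? (7)        ?         ?               Out of memory
? (4)        0.3       Out of memory   Out of memory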

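Experimental Results: Scalability

Varying minsup and minconf thresholds
(cells marked "?" held numeric values that are missing from this transcript)

minsup (#)   minconf   Time (s)   minconf   Time (s)   minconf   Time (s)
? (50)       ?         ?          ?         ?          ?         ?
? (37)       ?         ?          ?         ?          ?         ?
? (25)       ?         ?          ?         ?          ?         ?
? (22)       ?         ?          ?         ?          ?         ?
? (19)       ?         ?          ?         ?          ?         ?
? (17)       ?         ?          ?         ?          ?         ?
? (14)       ?         ?          ?         ?          ?         ?
? (12)       ?         ?          ?         ?          ?         ?
? (9)        ?         ?          ?         ?          ?         ?
? (7)        ?         ?          ?         ?          ?         ?
? (4)        0.9       NA         0.5       NA         0.3       NA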

Association Rules Extraction (JClose)

Problem of relevance of extracted rules
  Apriori generates numerous associations from dense, correlated data
  Several redundant rules represent the same information
    Ex:
      Exp1(+) → GO:121
      Exp1(+), GO:001 → GO:121
      Exp1(+) → GO:001
      Exp1(+) → GO:001, GO:121
      Exp1(+), GO:121 → GO:001
  The Informative Basis contains only non-redundant associations with minimal antecedent (LHS)
    Ex: Exp1(+) → GO:001, GO:121
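The redundancy in the example above can be reproduced on a toy dataset. The sketch below is a brute-force illustration, not the JClose implementation, and it covers only exact rules (confidence 1); partial rules are left out. It lists every exact rule, then keeps only rules whose antecedent is minimal (no proper subset has the same closure) and whose consequent is the rest of that closure. The function names and the toy transactions are hypothetical.

```python
from itertools import combinations


def covering(itemset, transactions):
    """Transactions that contain every item of the itemset."""
    return [t for t in transactions if itemset <= t]


def closure(itemset, transactions):
    """Items shared by all transactions containing the itemset."""
    matched = covering(itemset, transactions)
    return frozenset.intersection(*matched) if matched else frozenset(itemset)


def frequent_antecedents(transactions, minsup_count):
    """Yield every itemset (as a sorted tuple) with support >= minsup_count."""
    items = sorted(set().union(*transactions))
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            if len(covering(frozenset(combo), transactions)) >= minsup_count:
                yield combo


def exact_rules(transactions, minsup_count):
    """All rules lhs -> rhs with confidence 1 and a non-empty rhs."""
    rules = []
    for combo in frequent_antecedents(transactions, minsup_count):
        lhs = frozenset(combo)
        extra = closure(lhs, transactions) - lhs
        for k in range(1, len(extra) + 1):
            for rhs in combinations(sorted(extra), k):
                rules.append((lhs, frozenset(rhs)))
    return rules


def minimal_antecedent_rules(transactions, minsup_count):
    """Keep lhs -> closure(lhs) - lhs only when no proper subset of lhs has that closure."""
    basis = []
    for combo in frequent_antecedents(transactions, minsup_count):
        lhs = frozenset(combo)
        closed = closure(lhs, transactions)
        is_minimal = all(
            closure(frozenset(sub), transactions) != closed
            for k in range(1, len(combo))
            for sub in combinations(combo, k)
        )
        if closed != lhs and is_minimal:
            basis.append((lhs, closed - lhs))
    return basis


# Toy dataset: each transaction holds one gene's expression level and annotations.
transactions = [
    frozenset({"Exp1(+)", "GO:001", "GO:121"}),
    frozenset({"Exp1(+)", "GO:001", "GO:121"}),
    frozenset({"GO:001"}),
    frozenset({"GO:121"}),
    frozenset({"GO:001", "GO:121"}),
]
all_rules = exact_rules(transactions, minsup_count=2)
basis = minimal_antecedent_rules(transactions, minsup_count=2)
```

With this data `all_rules` holds the five redundant rules of the example and `basis` holds the single rule Exp1(+) → GO:001, GO:121, from which the other four can be derived.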

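Experimental Results: Number of Rules

Annotations enriched Eisen et al. dataset
(cells marked "?" held numeric values that are missing from this transcript)

             Informative basis
minsup (#)   Exact   Partial   Total   All associations
? (50)       ?       ?         ?       ?
? (37)       ?       ?         ?       ?
? (25)       ?       ?         ?       ?
? (22)       ?       ?         ?       ?
? (19)       ?       ?         ?       ?
? (17)       ?       ?         ?       ?
? (14)       ?       ?         ?       Out of memory
? (12)       ?       ?         ?       Out of memory
? (9)        ?       ?         ?       Out of memory
? (7)        ?       ?         ?       Out of memory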

Conclusion

Normal Discretization of gene expression measures
Ability to process large, dense genomic datasets integrating gene expression levels and annotations
Ability to extract associations for small groups of genes
Suppression of irrelevant association rules

Slides and implementations of GenMiner and NorDi available at

Thank you!