TEXT MINING FOR BONE BIOLOGY

Andrew Hoblitzell, Snehasis Mukhopadhyay, Qian You, Shiaofen Fang, Yuni Xia, and Joseph Bidwell
Indiana University-Purdue University Indianapolis

Outline. Introduction; Background Literature; Methodology; Results and Discussion; Conclusion.

INTRODUCTION

Introduction. Bone diseases affect tens of millions of people and include bone cysts, osteoarthritis, fibrous dysplasia, and osteoporosis, among others. Osteoporosis affects an estimated 75 million people in Europe, the USA, and Japan, including 10 million in the United States alone.

Introduction. Goal: the extraction and visualization of relationships between biological entities related to bone biology that appear in biological databases. Benefit: keep biologists up to date on the research and possibly uncover new relationships among biological entities.

Key Terms. Bioinformatics: the application of information technology and computer science to the field of molecular biology. Text mining: the extraction of knowledge contained in the text-based literature.

BACKGROUND LITERATURE

Background Literature. Computer science is still a relatively young science, and text mining is an even younger subfield of it. Nonetheless, text mining has developed rapidly, and its application to the biomedical domain in particular has attracted considerable attention. The PubMed resource maintained by NIH holds more than 20 million research articles, which necessitates the development of automated analysis methods.

Some Relevant Background. "Complementary Literatures: A Stimulus to Scientific Discovery", 1997 paper by Swanson et al. Begins with a list of viruses that have potential for development as weapons and presents findings meant to act as a guide to the virus literature, in support of further studies of defensive measures. Initially promising results.

Background Literature. "Automatic Term Identification and Classification in Biology Texts", 1999 paper by Collier et al. Made use of a decision tree for classification and term candidate identification. Results indicated that while identifying term boundaries was non-trivial, a high success rate could eventually be obtained in term classification.

Background Literature. "Accomplishments and challenges in literature data mining for biology", 2002 paper by Hirschman et al. Traces literature data mining from the recognition of protein interactions to solutions for improving homology search, identifying cellular location, and more. Notes that the field has progressed from simple term recognition to much more complex interactions between entities.

Background Literature. "Support tools for literature-based information access in molecular biology", 2009 paper by Fabio Rinaldi and Dietrich Rebholz-Schuhmann. Presents different tools developed by the authors to support professional biologists in accessing information. High performance on gold standard data does not necessarily translate into high performance for database annotation.

Background Literature. "An application of bioinformatics and text mining to the discovery of novel genes related to bone biology", 2007 paper by Gajendran, Lin, and Fyhrie. Reports the results of text mining for a bone biology pathway including SMAD genes. Proposes a ranking system for relevant genes based on text mining.

METHODOLOGY

Extraction. To extract entity relationships from the biological literature, we examined flat relationships, which simply state that a relationship exists between two biological entities. A thesaurus-based text analysis approach is used to discover the existence of relationships.

Extraction. The document representation step then converts the downloaded text documents into data structures that can be processed without the loss of any meaningful information. The process uses a thesaurus: an array T of atomic tokens (or terms), each identified by a unique numeric identifier.
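The document representation step might be sketched as follows. This is only an illustration, not the authors' implementation: the function name `term_counts`, the tokenization rule, and the restriction to single-token thesaurus terms are all assumptions. The position of a term in the list plays the role of its numeric identifier.

```python
import re
from collections import Counter


def term_counts(text, thesaurus):
    """Return one count per thesaurus term, indexed by the term's position.

    Assumptions (not from the slides): matching is case-insensitive and
    every thesaurus term is a single token.
    """
    # Tokens are runs of letters, digits, slashes, and hyphens.
    tokens = re.findall(r"[a-z0-9/-]+", text.lower())
    counts = Counter(tokens)
    return [counts[term.lower()] for term in thesaurus]


# Illustrative thesaurus and text only.
thesaurus = ["Nmp4", "SMAD", "osteoblast"]
text = ("Nmp4 represses genes that support the osteoblast phenotype. "
        "Nmp4 is a transcription factor.")
row = term_counts(text, thesaurus)
```

Each row produced this way is the vector of raw term frequencies for one document.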

Tf*idf method. The tf*idf algorithm (term frequency multiplied by inverse document frequency) is applied to achieve a refined discrimination at the term representation level. The inverse document frequency (idf) component acts as a weighting factor by taking into account the inter-document term distribution.

Normalized weighting. In the weighting formula, $T_{ik}$ is the number of occurrences of term $T_k$ in document $i$; $I_k = \log(N / n_k)$ is the inverse document frequency of term $T_k$ in the document collection; $N$ is the number of documents in the collection; and $n_k$ is the number of documents in the collection that contain the term $T_k$.
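The weighting formula itself did not survive in this transcript. One reading that fits the symbol definitions and the slide title is the standard cosine-normalized tf*idf weight. This form is an assumption, in particular the normalizing denominator, with $M$ taken to be the number of thesaurus terms:

```latex
W_{ik} = \frac{T_{ik}\, I_k}{\sqrt{\sum_{j=1}^{M} T_{ij}^{2}\, I_j^{2}}},
\qquad I_k = \log\!\left(\frac{N}{n_k}\right)
```

Under this reading, each document's weight vector has unit length, so long documents do not dominate short ones.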

Weight vector. Each document $d_i$ is converted to an $M$-dimensional vector $W_i = (W_{i1}, \ldots, W_{iM})$, where $W_{ik}$ denotes the weight of the $k$-th gene or protein term in the document and $M$ is the total number of terms in the thesaurus. $W_{ik}$ increases with the term frequency ($T_{ik}$) and decreases with the total number of documents in the collection that contain the given term ($n_k$).
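A minimal sketch of turning raw term counts into weight vectors, assuming cosine-normalized tf*idf (the normalization is an assumption; the slides define only the symbols). The function name `tfidf_weights` and the sample counts are illustrative.

```python
import math


def tfidf_weights(counts):
    """counts[i][k] is T_ik; returns W with W[i][k] the normalized weight.

    Assumption: weights are tf*idf values divided by the Euclidean norm
    of the document's tf*idf vector.
    """
    n_docs = len(counts)
    n_terms = len(counts[0])
    # n_k: number of documents that contain term k.
    doc_freq = [sum(1 for row in counts if row[k] > 0) for k in range(n_terms)]
    # I_k = log(N / n_k); a term found in no document gets idf 0.
    idf = [math.log(n_docs / nk) if nk else 0.0 for nk in doc_freq]
    weights = []
    for row in counts:
        raw = [t * i for t, i in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in raw))
        # An all-zero document keeps an all-zero weight vector.
        weights.append([x / norm if norm else 0.0 for x in raw])
    return weights


# Three documents, three thesaurus terms (made-up counts).
W = tfidf_weights([[2, 0, 1], [0, 1, 1], [1, 0, 0]])
```

Note that a term present in every document gets idf 0 and so contributes nothing, which matches the statement that the weight decreases as $n_k$ grows.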

Association matrix. The association between entities $k$ and $l$ is computed using the following equation: $\mathrm{association}[k][l] = \sum_{i=1}^{N} W_{ik} W_{il}$. The value association[k][l] is always greater than or equal to zero. Its relative value reflects the product of the importance of the $k$-th and $l$-th terms in each document, accumulated over all documents.
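A short sketch of the association computation, assuming the association between two terms is the sum over documents of the product of their weights. The function name `association_matrix` and the sample weights are illustrative.

```python
def association_matrix(weights):
    """weights[i][k] is W_ik; returns the M x M matrix of term associations.

    Assumption: association[k][l] = sum over documents i of W_ik * W_il.
    """
    n_terms = len(weights[0])
    assoc = [[0.0] * n_terms for _ in range(n_terms)]
    for row in weights:
        for k in range(n_terms):
            if row[k] == 0.0:
                continue  # term k is absent, so it adds nothing for this document
            for l in range(n_terms):
                assoc[k][l] += row[k] * row[l]
    return assoc


# Two documents, three terms (made-up weights).
A = association_matrix([[0.5, 0.5, 0.0], [0.0, 1.0, 2.0]])
```

Because the weights are non-negative, every entry is non-negative and the matrix is symmetric. Terms 0 and 2 never co-occur in this example, so their direct association is zero.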

Transitive text mining. The basic premise of transitive text mining is that if there are direct associations between objects A and B, as well as direct associations between objects B and C, then an association between A and C may be hypothesized even if the latter has not been explicitly seen in the literature. Such transitive associations may be efficiently determined by computing the transitive closure of the association matrix.

Floyd-Warshall algorithm. The transitive closure of a binary relation R on a set X is the smallest transitive relation on X that contains R. The Floyd-Warshall algorithm may be used to find the transitive closure.
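The closure step can be sketched with the boolean (Warshall) form of the Floyd-Warshall algorithm. How the weighted association matrix is turned into a binary relation is not stated in the slides; the sketch assumes an entry is `True` wherever a direct association exists. The function name `transitive_closure` is illustrative.

```python
def transitive_closure(adj):
    """adj[a][b] is True when a direct association a-b exists.

    Returns reach, where reach[a][b] is True when b can be reached from a
    through one or more direct associations.
    """
    n = len(adj)
    reach = [row[:] for row in adj]  # work on a copy, leave the input intact
    for mid in range(n):             # allow `mid` as an intermediate entity
        for a in range(n):
            if not reach[a][mid]:
                continue
            for b in range(n):
                if reach[mid][b]:
                    reach[a][b] = True
    return reach


# A-B and B-C are directly associated; D is isolated.
direct = [
    [False, True,  False, False],
    [True,  False, True,  False],
    [False, True,  False, False],
    [False, False, False, False],
]
closure = transitive_closure(direct)
```

Here the pair A-C appears in the closure although it is absent from the direct associations, which is exactly the kind of hypothesis transitive mining is meant to surface.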

Separation of evidence principle. Evidence (i.e., a part of the capacities) once used along a transitive path may not be used again along another transitive path in defining the confidence measure of a transitive association. This allows us to find association strength using a flow model.

Maximum flow. The maximum flow problem can be seen as a special case of the circulation problem. The Edmonds-Karp algorithm is applied to each transitive association (a, b) to find the maximum flow through the graph.
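A compact sketch of Edmonds-Karp (augmenting along shortest paths found by breadth-first search). It assumes the direct association strengths serve as edge capacities in a symmetric matrix, which is one reading of the separation of evidence principle rather than a confirmed detail of the authors' system. The function name `edmonds_karp` and the sample capacities are illustrative.

```python
from collections import deque


def edmonds_karp(capacity, source, sink):
    """Return the maximum flow from source to sink.

    capacity[u][v] is the capacity of edge u -> v (0 when there is no edge).
    """
    if source == sink:
        raise ValueError("source and sink must differ")
    n = len(capacity)
    residual = [row[:] for row in capacity]
    total = 0.0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = [-1] * n
        parent[source] = source
        queue = deque([source])
        while queue and parent[sink] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and residual[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[sink] == -1:
            return total  # no augmenting path remains
        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while v != source:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push flow: used capacity cannot be reused by a later path.
        v = sink
        while v != source:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Symmetric capacities over four entities (made-up values).
cap = [
    [0, 3, 2, 0],
    [3, 0, 1, 2],
    [2, 1, 0, 3],
    [0, 2, 3, 0],
]
strength = edmonds_karp(cap, 0, 3)
```

Entities 0 and 3 have no direct edge, yet the flow value gives them a confidence score that combines every path between them without counting any capacity twice.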

RESULTS AND DISCUSSION

Results and Discussion. To test our search strategy we chose to explore potential novel relationships between NMP4/CIZ (nuclear matrix protein 4/cas interacting zinc finger protein; hereafter referred to as Nmp4 for clarity) and proteins that may interact with this signalling pathway. Nmp4 is a nuclear matrix architectural transcription factor that represses genes that support the osteoblast phenotype.

Terms used A summary of the terms used is presented in the following legend:

Direct Association Matrix The following direct association matrix was generated:

Transitive matrix Transitive closure and the Edmonds-Karp algorithm provided the following results:

Normalization. The direct association matrix was then normalized. A thresholding value of 152.1 was obtained and used for examining and analyzing the data. The maximum network flow (MNF) matrix was then normalized. A thresholding value of 7000.2 was obtained from inspection of the scores. The normalized data were used to generate heat maps.

Direct Association Heat Map

MNF Heat Map

Expert Heat Map

Error computation. The results were then compared against expert-provided scores. The average error was computed as $\frac{1}{N_r} \sum_{l,k} \left| \mathrm{Expert}(l,k) - \mathrm{Predicted}(l,k) \right|$, where Expert(l,k) is the expert-provided score of the relationship between entities $l$ and $k$, Predicted(l,k) is the predicted score of that relationship, $l$ and $k$ are two entities, and $N_r$ is the total number of relations.
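The error measure can be sketched as a mean absolute difference over entity pairs. The sketch assumes both score matrices are on the same scale and that each unordered pair counts once; neither detail is stated in the slides. The function name `average_error` and the sample scores are illustrative.

```python
def average_error(expert, predicted):
    """Mean absolute difference between expert and predicted scores.

    Assumption: scores are symmetric matrices on a common scale, and each
    unordered pair (l, k) with l < k is one relation.
    """
    n = len(expert)
    pairs = [(l, k) for l in range(n) for k in range(l + 1, n)]
    total = sum(abs(expert[l][k] - predicted[l][k]) for l, k in pairs)
    return total / len(pairs)


# Made-up scores for three entities.
expert_scores = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.5], [0.0, 0.5, 0.0]]
predicted_scores = [[0.0, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.5, 0.0]]
err = average_error(expert_scores, predicted_scores)
```

A lower value means closer agreement with the expert, so the same function can compare random guessing, direct associations, and the flow-based scores.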

Error results. Using random guessing, an average error rate of 0.58 was obtained. Using the corresponding direct association matrix, an error rate of 0.35 was obtained. Using the maximum network flow method, an error rate of 0.24 was obtained. Applying the maximum flow algorithm to this problem therefore lowers the error from 0.35 to 0.24 relative to direct associations alone.

CONCLUSION

Conclusion. The biological literature is a huge and constantly growing source of information that biologists may consult about their field, but the vast amount of data can become overwhelming. Text mining, a solution to this problem, has seen a great amount of development.

Conclusion. The aim was to present a method that uses MNF to determine a confidence score for the derived transitive associations. A specific pathway in bone biology consisting of a number of important proteins was subjected to the text mining approach. A significantly higher agreement with an expert's knowledge can be obtained with transitive mining than with direct associations alone.

Extension: Hypergraphs. A hypergraph is a generalization of a graph in which edges can connect any number of vertices. Numerous problems have been studied on hypergraphs, including transitive closure, transitive reduction, flow and cut problems, and minimum weight traversal problems. Extending the method to hypergraphs could offer improved accuracy.

Other Future Work. Causal model development: a systematic procedure for constructing causality models from text mining knowledge could also be developed using Bayesian networks. Biomedical knowledge visualization: a visualization environment would assist biologists in understanding the data. It would also aid the knowledge discovery and hypothesis generation process.