Data Mining for Biological Data Analysis

Similar documents
BIOINF/BENG/BIMM/CHEM/CSE 184: Computational Molecular Biology. Lecture 2: Microarray analysis

What is Bioinformatics? Bioinformatics is the application of computational techniques to the discovery of knowledge from biological databases.

Course Information. Introduction to Algorithms in Computational Biology Lecture 1. Relations to Some Other Courses

Biology Chapter 12 Test: Molecular Genetics

DNA and Biotechnology Form of DNA Form of DNA Form of DNA Form of DNA Replication of DNA Replication of DNA

Introduction to Algorithms in Computational Biology Lecture 1

Following text taken from Suresh Kumar. Bioinformatics Web - Comprehensive educational resource on Bioinformatics. 6th May.2005

Study on the Application of Data Mining in Bioinformatics. Mingyang Yuan

DNA. translation. base pairing rules for DNA Replication. thymine. cytosine. amino acids. The building blocks of proteins are?

Bioinformatics : Gene Expression Data Analysis

Introduction to Bioinformatics

Introduction to Microarray Data Analysis and Gene Networks. Alvis Brazma European Bioinformatics Institute

DNA REPLICATION & BIOTECHNOLOGY Biology Study Review

Machine Learning. HMM applications in computational biology

Chapter 8. Microbial Genetics. Lectures prepared by Christine L. Case. Copyright 2010 Pearson Education, Inc.

COMP 555 Bioalgorithms. Fall Lecture 1: Course Introduction

Adv Biology: DNA and RNA Study Guide

This place covers: Methods or systems for genetic or protein-related data processing in computational molecular biology.

Biology Celebration of Learning (100 points possible)

Introduction to Cellular Biology and Bioinformatics. Farzaneh Salari

Introduction to Bioinformatics

translation The building blocks of proteins are? amino acids nitrogen containing bases like A, G, T, C, and U Complementary base pairing links

Living Environment. Directions: Use Aim # (Unit 4) to complete this study guide.

Griffith and Transformation (pages ) 1. What hypothesis did Griffith form from the results of his experiments?

Sections 12.3, 13.1, 13.2

24.5. Lesson 24.5 Nucleic Acids. Overview. In this lesson, you will cover the topics of DNA replication, gene mutation, and DNA technologies.

Textbook Reading Guidelines

ENGR 213 Bioengineering Fundamentals April 25, A very coarse introduction to bioinformatics

COS 597c: Topics in Computational Molecular Biology. DNA arrays. Background

Microarrays & Gene Expression Analysis

Chapter 12 Reading Questions

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University

CAP BIOINFORMATICS Su-Shing Chen CISE. 10/5/2005 Su-Shing Chen, CISE 1

Videos. Lesson Overview. Fermentation

Write: Unit 5 Review at the top.

Supervised Learning from Micro-Array Data: Datamining with Care

1. The diagram below shows an error in the transcription of a DNA template to messenger RNA (mrna).

Lecture 2: Central Dogma of Molecular Biology & Intro to Programming

Ch 12.DNA and RNA.Biology.Landis

MICROARRAYS: CHIPPING AWAY AT THE MYSTERIES OF SCIENCE AND MEDICINE

Lecture 10. Ab initio gene finding

13.1 RNA Lesson Objectives Contrast RNA and DNA. Explain the process of transcription.

From Gene to Protein

Outline. Evolution. Adaptive convergence. Common similarity problems. Chapter 7: Similarity searches on sequence databases

Lecture Overview. Overview of the Genetic Information. Marieb s Human Anatomy and Physiology. Chapter 3 DNA & RNA Protein Synthesis Lecture 6

Classification of DNA Sequences Using Convolutional Neural Network Approach

Name Class Date. Practice Test

Name Date Class. The Central Dogma of Biology

3.1.4 DNA Microarray Technology

Higher Human Biology Unit 1: Human Cells Pupils Learning Outcomes

Warm Up #15: Using the white printer paper on the table:

Challenging algorithms in bioinformatics

DNA/RNA STUDY GUIDE. Match the following scientists with their accomplishments in discovering DNA using the statement in the box below.

Gene Expression. Student:

Videos. Bozeman Transcription and Translation: Drawing transcription and translation:

Computational Genomics. Irit Gat-Viks & Ron Shamir & Haim Wolfson Fall

UAMS ADVANCED DIAGNOSTICS FOR ADVANCING CURE

Bioinformatics. Lecturer: Antinisca Di Marco Tutor: Francesco Gallo

DNA Structure and Replication, and Virus Structure and Replication Test Review

Resources. How to Use This Presentation. Chapter 10. Objectives. Table of Contents. Griffith s Discovery of Transformation. Griffith s Experiments

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

What Are the Chemical Structures and Functions of Nucleic Acids?

MATH 5610, Computational Biology

PROTEIN SYNTHESIS Flow of Genetic Information The flow of genetic information can be symbolized as: DNA RNA Protein

Name: Period: Date: BIOLOGY HONORS DNA REVIEW GUIDE (extremely in detail) by Trung Pham. 5. What two bases are classified as purines? pyrimidine?

A DNA molecule consists of two strands of mononucleotides. Each of these strands

Bioinformatics. Microarrays: designing chips, clustering methods. Fran Lewitter, Ph.D. Head, Biocomputing Whitehead Institute

Basic Bioinformatics: Homology, Sequence Alignment,

Match the Hash Scores

DNA is the genetic material. DNA structure. Chapter 7: DNA Replication, Transcription & Translation; Mutations & Ames test

Motivation From Protein to Gene

UNIT 3 GENETICS LESSON #41: Transcription

AGRO/ANSC/BIO/GENE/HORT 305 Fall, 2016 Overview of Genetics Lecture outline (Chpt 1, Genetics by Brooker) #1

Lesson Overview. Fermentation 13.1 RNA

Replication Transcription Translation

Introduction to human genomics and genome informatics

Developmental Biology BY1101 P. Murphy

DNA & Genetics. Chapter Introduction DNA 6/12/2012. How are traits passed from parents to offspring?

MATH 5610, Computational Biology

Intro to Microarray Analysis. Courtesy of Professor Dan Nettleton Iowa State University (with some edits)

4/3/2013. DNA Synthesis Replication of Bacterial DNA Replication of Bacterial DNA

Unit 3: DNA and Genetics Module 6: Molecular Basis of Heredity

Gene Expression Data Analysis

Central Dogma. 1. Human genetic material is represented in the diagram below.

Chapter 12. DNA TRANSCRIPTION and TRANSLATION

Protein Synthesis 101

Grundlagen der Bioinformatik Summer Lecturer: Prof. Daniel Huson

Transcription and Translation

Name: Family: Date: Monday/Tuesday, March 9,

Lecture Overview. Overview of the Genetic Information. Chapter 3 DNA & RNA Lecture 6

DNA/RNA STUDY GUIDE. Match the following scientists with their accomplishments in discovering DNA using the statement in the box below.

Genes are coded DNA instructions that control the production of proteins within a cell. The first step in decoding genetic messages is to copy a part

Protein Synthesis. Lab Exercise 12. Introduction. Contents. Objectives

advanced analysis of gene expression microarray data aidong zhang World Scientific State University of New York at Buffalo, USA

Basic Local Alignment Search Tool

BIO 101 : The genetic code and the central dogma

Molecular Genetics Quiz #1 SBI4U K T/I A C TOTAL

CS262 Lecture 12 Notes Single Cell Sequencing Jan. 11, 2016

I. Gene Expression Figure 1: Central Dogma of Molecular Biology

In silico prediction of novel therapeutic targets using gene disease association data

Transcription:

Data Mining for Biological Data Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)

References Data Mining Course by Gregory-Platesky Shapiro available at www.kdnuggets.com Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann Series in Data Management Systems (Second Edition) Chapter 8

Introduction to Biology

The DNA

DNA components Four nucleotide types: Adenine Guanine Cytosine Thymine Hydrogen bonds: A-T C-G

Different cell types All cells of an organism contain the same DNA content (and the same genes) yet there is a variety of cell types.

So, how does the cell use DNA?

The Central Dogma DNA contains thousands of particular segments called genes Genes contain instructions for making proteins In order to be executed these instructions have to be transcribed into mrna (similar to DNA, with Uracil instead of Thymine) Proteins are defined by a sequence of amino acids (20 types) There are almost one million of proteins that act alone or in complexes to perform many cellular functions

Gene expression Transcription Translation Gene mrna Protein Cells express different subset of the genes in different tissues and under different conditions Muscle, nervous, blood Disease, mutation

Genomic and Proteomic Thousands of genes (~25K in human DNA) function in a complicated and orchestrated way that creates the mystery of life. Genomic studies the functionality of specific genes, their relations to diseases, their associated proteins and their participation in biological processes Proteins (~1M in human organism) are responsible for many regulatory functions in cells, tissues and organism Proteome, the collection of proteins produced, evolves dynamically during time depending on environmental signals. Proteomic studies the sequences of proteins and their functionalities

Data Mining of Biological Data (1) Semantic integration of genomic and proteomic databases Data produced by different labs need to be integrated Data mining can be used to perform data cleaning, integration, object reconciliation to merge heterogeneous databases Alignment of nucleotide/protein sequences Build phylogenetic trees Similarity search Difference search Protein structure analysis 3D structure of proteins heavily affects their functionalities Prediction of protein structures Discovery of regularities

Data Mining of Biological Data (2) Association and path analysis of gene sequences Analysis of gene associations in diseases Discovery of sequential patterns of genes correlated to different stages of diseases Visualization Support to knowledge discovery Interactive data explortation

DNA Microarray Analysis

Microarray Data Analysis The DNA Microarray is a technology that allows the analysis of the gene expression levels in samples collected Such an analysis has many potential applications Earlier and more accurate diagnostics New molecular targets for therapy Improved and individualized treatments Fundamental biological discovery (e.g. finding and refining biological pathways) Examples Molecular diagnosis of leukemia, breast cancer,... Discovery that genetic signature strongly predicts outcome A few new drugs, many new promising drug targets

Motivation for DNA Microarrays Traditional methods in molecular biology generally work on a one gene in one experiment basis, which means that the throughput is very limited and the whole picture of genes function is hard to obtain In early 1997, scientists never envisioned looking at more than 25 to 50 gene-expression levels simultaneously. Today everybody tells us that they want to look at the whole genome. Kreiner, Affymetrics With a technology for simultaneously analyzing the expression levels of large numbers of genes we can: Study the behavior of co-regulated gene networks. Look for groups of genes involved in a particular biological process or in a specific disease by identifying genes whose expression levels change under certain circumstances. Detecting changes in gene expression level in order to have clues on its product function. Compare normal organism and mutant RNA transcription profiles.

The technology: hybridization DNA double strands form by gluing of complementary single strands. RNA transcript, introduced during the renaturation process, competes with the coding DNA strand and forms double-stranded DNA/RNA hybrid molecule

The technology: the whole picture

The technology: Affymetrix Chip

The technology: scanning An observation i g k g k+2 g k+3 g k+4 g k+4 Genes expression levels

Microarray Data Analysis Types Gene Selection Find genes for therapeutic targets (new drugs) Classification (supervised) Identify disease Predict outcome / select best treatment Clustering (unsupervised) Find new biological classes / refine existing ones Exploration (discovery of unknown classes)

Challenges Main challenges Few samples (usually < 100) but many features (usually genes > 1000) High probability of finding false positives, that are knowledge discovered due to random noise Models discovered need to be explainable to biologists Main steps Data preparation Feature selection Apply a classification methods Tuning parameters with crossvalidation

Preparing data Microarray data is translated in a n x p table, where n is the number of observations and p is the number of genes tested Each element <i,j> of the table is the expression level of gene j in the observation I Thresholds and transformations are applied to data Genes with a not significant variability through the whole dataset are excluded Gene 1 Gene 2 Gene 3 Sample 1 104 3208 40 Sample 2 32 1095 41

Genes Selection Most learning algorithms look for non-linear combinations of features Can easily find spurious combinations given few records and many genes ( false positives problem ) Classification accuracy improves if we first reduce number of genes by a linear method e.g. T-values of mean difference ( Avg 2 2 ( σ / N + σ 1 1 Avg 1 2 2 / ) N 2 ) Select the top N genes from each class

Genes Selection: Randomization Approach Class Gene i Healthy 3208 Sick Sick 56 80 T-Value Healty 4560 Is T-Value outcome due to chance?

Genes Selection: Randomization Approach Class Healthy Sick Gene i 3208 56 T-Value 1 T-Value 2 Sick Healty 80 4560 T-Value M Is T-Value outcome due to chance? Randomization approach Generate M random permutations of the class columns Compute T-values for each permutation and for each gene How frequent a big T-value occurs for a random permutation? Keep genes with high T-value and desired significance Limitations Genes are assumed independent Randomization is a conservative approach

Genes Selection: Wrapper Approach Generate several models and evaluate them Issues Apply T-Values to identify the top N genes Evaluate (with crossvalidation) the accuracy of the model learned using all the subset of genes selected Choose the simplest model that reaches the best performance Computationally expensive Validation sets used in the genes selection process cannot be used to assess the final performance of the model!

Classification Methods Decision Trees/Rules Model easy to understand Find smallest gene sets, but not robust Poor performance Neural Nets Work well for reduced number of genes Model is difficult to understand K-nearest neighbor Good results for small number of genes, but no model Naïve Bayes Simple, robust, but ignores gene interactions Support Vector Machines (SVM) Good accuracy, does own gene selection, but hard to understand

Biological Sequence Alignment

Alignment of biological sequence (1) Given two or more input biological sequences, identify similar sequences with long conserved subsequences Sequences can be either nucleotides (DNA/RNA) or amino acids (proteins) Tasks Nucleotides align with if they are identical Amino acids align if identical or if one can be derived from the other Pairwise sequence alignment Multiple sequence alignment Applications Discovering phylogentic trees Similarity searches

Alignment of biological sequence (2) Substitution matrix is used to define cost of substitutions cost of insertions and deletions Cost is inversely proportional to the probability that a substitution/insertion/deletion occurred Gaps ( ) can be used to indicate positions where it is preferable not to align two symbols The introduction of a gap ( ) is usually associated to a negative cost (penalty)

Example Align the following sequences: HEAGAWGHEE PAWHEAE Evaluate the following alignments P -1-1 -2-2 -4 according to the substitution matrix provided and the a gap penalty of -8 W -3-3 -3-3 15 A E H A 5-1 -2 E -1 6 0 G 0-3 -2 H -2 0 10 W -3-3 -3-2 -8 +5-8 -8 +15-8 +10 +6-8 +6 = 0-8 -8-1 -8 +5 +15-8 +10 +6-8 +6 = +1

Pairwise sequence alignment Two major approaches Local alignment, works on segments and merge them Global alignment, works on entire sequence Global alignment approaches search for the optimal alignment starting from optimal subsequences Needleman-Wunsch and Smith-Waterman algorithms exploit dynamic programming to find the optimal solution Both these algorithm have a computational complexity that is quadratic w.r.t. sequences length! Local alignment approaches (e.g. BLAST and FASTA) may be not able to find the best alignment but are more suitable to deal with long sequencess

BLAST BLAST breaks the sequences in small fragments called words A word is a k-tuple of elements (typically 11 nucleotides or 3 amino acids) BLAST first builds an hash tables of neighborhood words, that are closely matching A closeness threshold is used and statistics is applied to define how significant the matches are Starting from a fragment, the alignment is extended in both the direction by choosing the best scoring matches BLAST has computationally complexity linear w.r.t. to the sequence length Several specialized versions of BLAST have been introduced Protein similarity searches (BLASTP) Variable word size (BLASTN) Discontiguous alignments (MEGABLAST)

Multiple Sequence Alignment Methods Is important both in phylogenetic analysis and in the discovery of protein structures Multiple alignment is computationally more challenging Freng-Doolittle alignment Performs the pairwise alignments Merge them following a guide tree generated with a hierarchical clustering approach Hidden Markov Models More sophisticated probabilistic approach to represent statistical regularities in the sequences