Hybrid Intelligent Systems for DNA Microarray Data Analysis


1 Hybrid Intelligent Systems for DNA Microarray Data Analysis
November 27, 2007
Sung-Bae Cho
Computer Science Department, Yonsei University
Soft Computing Lab

2 What do I think with Bioinformatics?
[Diagram labels: biological objects; cause; function; blackbox; disease identification; modeling; expression data; clustering; classification; predict cancer (classify disease); drug design (personal medicine); identify risk factors; optimal features and classifiers; ensemble approach]

3 Acknowledgements
Bioinformatics team members (including OBs): C.-H. Park, K.-J. Kim, J.-H. Hong, H.-S. Park, S.-H. Yoo, H.-H. Won, J. Ryu, and H.-J. Kwon

4 Outline
Overview of DNA microarray technology
Classification
  Comprehensive comparisons
  Ensemble approaches

5 DNA Microarray Technology

6 Data Mining in Biological Data
[count missing in source] cells in the human body
3 × 10^9 letters in the DNA code in every cell in the human body
Only 0.2% differ between humans
Human DNA is 98% identical to that of chimpanzees
97% of human DNA has no known function
Bioinformatics
  Solving problems arising from biology using methodology from computer science
  Drug design, identification of risk factors, personal medicine, etc.
Related topics
  Classification, clustering, gene modeling, gene identification

7 New Paradigm in Biology
One-gene analysis: very slow, local analysis
Microarray technology: analysis of thousands of genes, very fast, global analysis
Needs computational methods: machine learning

8 DNA Microarray: Overview
DNA microarray
  A chip or slide that has been printed with a large number of DNA spots
DNA microarray technology
  Enables the simultaneous analysis of thousands of gene expression levels for genetic and genomic research and for diagnostics
  Gene: a sequence of DNA that includes genetic information
Two major techniques
  Hybridization method: cDNA microarray / oligonucleotide microarray
  Sequencing method: serial analysis of gene expression (SAGE)

9 DNA Microarray: Data Acquisition
[Figure: microarray image -> accumulated microarray image (colors) -> gene expression data matrix (numbers), with axes labeled genes and samples (sample 1, sample 2, sample 3, ...)]
Matrix entries are log ratios: $\log_2 \frac{\mathrm{Int}(\mathrm{Cy5})}{\mathrm{Int}(\mathrm{Cy3})}$
Microarray data consist of a large number of genes in a small number of samples
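The log-ratio entry on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the intensity values are hypothetical, and background correction and normalization are not shown.

```python
import math

def log_ratio(cy5_intensity, cy3_intensity):
    """Return log2(Int(Cy5) / Int(Cy3)) for a single spot."""
    if cy5_intensity <= 0 or cy3_intensity <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(cy5_intensity / cy3_intensity)

def expression_matrix(cy5, cy3):
    """Turn two genes-by-samples intensity tables into a log-ratio matrix."""
    return [[log_ratio(r, g) for r, g in zip(row5, row3)]
            for row5, row3 in zip(cy5, cy3)]

# Hypothetical intensities: 2 genes (rows) x 3 samples (columns).
cy5 = [[800.0, 200.0, 400.0], [100.0, 100.0, 50.0]]
cy3 = [[200.0, 200.0, 100.0], [100.0, 400.0, 200.0]]
matrix = expression_matrix(cy5, cy3)
```

A positive entry means the Cy5 channel is brighter than the Cy3 channel for that gene and sample; a negative entry means the opposite.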

10 DNA Microarray: Example
A part of the leukemia dataset, before log transformation (Golub et al., 1999)
[Table: gene description, gene accession number, and expression values for six samples (AML, AML, ALL, AML, AML, ALL); the expression values were not preserved]
AC000066_at: GB DEF = BAC clone RG293F11 from 7q21-7q22, complete sequence
AC000099_at: Metabotropic glutamate receptor 8 mRNA
AC000115_cds1_at: WUGSC:H_GS188P18.1a gene extracted from Human BAC clone GS188P18
AC002045_xpt1_at: A-589H1.1 from Homo sapiens Chromosome 16 BAC clone CIT987-SKA-589H1 ~complete genomic sequence, complete sequence. /ntype=dna /annot=mrna
AC002073_cds1_at: WUGSC:DJ515N1.2 gene extracted from Human PAC clone DJ515N1 from 22q11.2-q22
AC002077_at: GUANINE NUCLEOTIDE-BINDING PROTEIN G(T), ALPHA-1 SUBUNIT
AC002086_at: GB DEF = PAC clone DJ525N14 from Xq23, complete sequence
AC002115_cds1_at: COX6B gene (COXG) extracted from Human DNA from overlapping chromosome 19 cosmids R31396, F25451, and R31076 containing COX6B and UPKA, genomic sequence
AC002115_cds3_at: F25451_3 gene extracted from Human DNA from overlapping chromosome 19 cosmids R31396, F25451, and R31076 containing COX6B and UPKA, genomic sequence
AC002115_cds4_at: UPKA gene extracted from Human DNA from overlapping chromosome 19 cosmids R31396, F25451, and R31076 containing COX6B and UPKA, genomic sequence

11 DNA Microarray: Two Types of Data
Single time point in different states
  States: disease or tumor type
  Goal: classifying samples using informative genes
  Can be used for gene identification
  Feature selection/extraction
  Classification problem
Monitoring each gene at multiple times
  Time series data
  Goal: identifying functionally related genes
  Can be used for gene regulatory networks
  Clustering problem

12 DNA Microarray: Challenges
Noise
  Microarray data contain a high level of noise due to experimental procedures
  The labeling of cDNA and the scanning of the slides frequently show non-linear characteristics
Sparseness
  Microarray data are sparse
  Several thousand genes are monitored, while the number of samples is often restricted to hundreds or fewer
High redundancy
  Many genes are highly correlated, which leads to redundancy in the data
  Adding coexpressed genes to the classification system does not add information to the system

13 Classification
  Comprehensive comparisons
  Ensemble approaches

14 Motivation
Many researchers have studied cancer classification using gene expression profiles and have tried to propose the optimal classification technique for it
A thorough evaluation of the candidate methods for analyzing gene expression data is needed
Several microarray datasets are available
  Leukemia cancer, colon cancer, lymphoma, breast cancer, NCI60, and ovarian cancer datasets
Three datasets for our study
  Leukemia cancer dataset
  Colon cancer dataset
  Lymphoma cancer dataset

15 Classification Scheme
[Diagram: DNA microarray data -> feature selection -> selected features -> classification -> class 1 / class 2]

16 Feature Selection: Overview
Selecting informative features appropriate to a specific goal
  Also called variable selection or gene selection
Microarray data consist of a large number of genes in a small number of samples
  Not all genes are needed for classification
  It is essential to select genes that are highly related to particular classes, which are called informative genes (Golub et al., 1999)
Many selection/extraction techniques based on measures
  Correlation-based measures
  Similarity-based measures
  Information theory-based measures
  Principal component analysis

17 Feature Selection: Top 50 Genes Selected
[Figure: top 50 genes selected by Pearson's correlation (PC) on the leukemia dataset; axes: gene, sample; classes: ALL, AML]

18 Feature Selection: Rank-based Selection
Representative feature selection method
Gene selection according to the significance order of each gene
[Table: gene number and significance for four genes (values not preserved); selecting order: Gene 3, Gene 2, Gene 4, Gene 1]
How can we calculate the significance?

19 Feature Selection: Correlation Measures
Measuring how much each gene is correlated with the class pattern
  $g_{ideal} = (0, 0, 0, \ldots, 1, 1, 1)$, the class pattern over class 1 and class 2
Pearson correlation coefficient (PC): parametric
Spearman correlation coefficient (SC): non-parametric
[Figure: scatter plots of Feature 2 against Feature 1 showing negative correlation, positive correlation, and no correlation]

20 Feature Selection: Similarity Measures
Calculating geometrical similarity between the ideal gene vector and each gene vector
Euclidean distance (ED): geometric distance $d$
Cosine coefficient (CC): difference of direction $\theta$

21 Feature Selection: Information Theoretic Measures
Measuring feature goodness based on the frequency with which the feature satisfies condition Q (whether genes are induced or not)
Using the frequency, or the mean and standard deviation, of the data to calculate the significance of genes
  Information gain (IG)
  Mutual information (MI)
  Signal to noise ratio (SN)
[Figure: two class distributions with means $\mu_1$, $\mu_2$ and standard deviations $\sigma_1$, $\sigma_2$]

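22 Feature Selection: Mathematical Definitions
Here $X$ is a gene's expression vector, $Y$ is the ideal gene vector, $N$ is the number of samples, $D_x$ and $D_y$ are the rank vectors of $X$ and $Y$, and $\mu_k(g)$, $\sigma_k(g)$ are the mean and standard deviation of gene $g$ in class $k$.
Pearson's correlation coefficient (PC):
$r_{pearson} = \frac{\sum XY - \frac{\sum X \sum Y}{N}}{\sqrt{\left(\sum X^2 - \frac{(\sum X)^2}{N}\right)\left(\sum Y^2 - \frac{(\sum Y)^2}{N}\right)}}$
Spearman's correlation coefficient (SC):
$r_{spearman} = 1 - \frac{6 \sum (D_x - D_y)^2}{N (N^2 - 1)}$
Euclidean distance (ED):
$r_{euclidean} = \sqrt{\sum (X - Y)^2}$
Cosine coefficient (CC):
$r_{cosine} = \frac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}}$
Signal to noise ratio (SN):
$SN(g) = \frac{\mu_1(g) - \mu_2(g)}{\sigma_1(g) + \sigma_2(g)}$
Information gain (IG) and mutual information (MI):
[Formulas in terms of $P(g)$, $P(c)$, and the counts $A$, $B$, $C$, $D$; not recoverable from the transcription]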
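The recoverable measures can be sketched as plain functions, together with the rank-based selection described earlier. This is a minimal sketch, not the authors' implementation: the function names, the toy gene vectors, the use of average ranks for ties, and the use of the population standard deviation in SN are all assumptions.

```python
import math
import statistics

def pearson(x, y):
    """Pearson's correlation coefficient (PC), computational form."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    num = sum(a * b for a, b in zip(x, y)) - sx * sy / n
    den_sq = ((sum(a * a for a in x) - sx * sx / n) *
              (sum(b * b for b in y) - sy * sy / n))
    den = math.sqrt(max(den_sq, 0.0))  # guard against rounding below zero
    return num / den if den else 0.0

def ranks(v):
    """1-based ranks; tied values share their average rank (an assumption)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's coefficient (SC); the formula is exact only without ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return sum(a * b for a, b in zip(x, y)) / den if den else 0.0

def signal_to_noise(values, labels):
    """SN with class 1 = label 0 and class 2 = label 1 (labeling assumed)."""
    c1 = [v for v, y in zip(values, labels) if y == 0]
    c2 = [v for v, y in zip(values, labels) if y == 1]
    spread = statistics.pstdev(c1) + statistics.pstdev(c2)
    return (statistics.mean(c1) - statistics.mean(c2)) / spread if spread else 0.0

def rank_genes(genes, ideal, measure=pearson):
    """Order gene names by decreasing similarity to the ideal gene vector."""
    return sorted(genes, key=lambda g: measure(genes[g], ideal), reverse=True)

# Hypothetical data: three genes measured on six samples.
labels = [0, 0, 0, 1, 1, 1]
ideal = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
genes = {
    "g_up": [1.0, 2.0, 1.5, 8.0, 9.0, 8.5],
    "g_flat": [5.0, 5.1, 4.9, 5.0, 5.1, 4.9],
    "g_down": [9.0, 8.0, 8.5, 1.0, 2.0, 1.5],
}
ranking = rank_genes(genes, ideal)
```

For a distance such as ED, smaller means more similar, so a ranking based on it would sort in increasing order instead.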

23 Feature Selection: Principal Component Analysis
Widely used for dimensionality reduction
Given N vectors in k dimensions, find c (<= k) orthogonal vectors that best represent the data
  The original data set is reduced to one consisting of N vectors on c principal components (reduced dimensions)
  Each vector is a linear combination of the c principal components
Principal components are directions ordered by decreasing variance
  The first principal component (PC) is the direction of maximum variance, the second is that of the next highest variance, etc.
$t_{ij} = \sum_{k=1}^{n} p_{ik} m_{kj}$
  $n$: the number of significant principal components
  $p_{ik}$: the score of sample $i$ on component $k$
  $m_{kj}$: the loading on component $k$ of variable $j$
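As an illustration of scores and loadings, the first principal component can be found by power iteration on the covariance matrix. This is a sketch under assumptions: power iteration is a technique chosen here for brevity (the slides do not say how the components were computed), the function name and data are hypothetical, and the fixed starting vector would fail if it happened to be orthogonal to the first component.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) for the first PC of the given samples."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centered = [[r[j] - means[j] for j in range(k)] for r in rows]
    # Sample covariance matrix (k x k).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(k)]
           for a in range(k)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * k
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Score of each sample: projection of the centered sample on the loadings.
    scores = [sum(c[j] * v[j] for j in range(k)) for c in centered]
    return v, scores

# Hypothetical data: four samples lying on the line y = 2x.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
loadings, scores = first_principal_component(samples)
```

Multiplying each score by the loadings and adding the column means back reproduces the data here, because one component carries all of the variance.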

24 Classifier: Overview
Supervised learning
Reliable and precise classification is essential for successful cancer treatment
Current methods for classifying human malignancies rely on a variety of morphological, clinical, and molecular variables
Uncertainties in diagnosis remain; existing classes are likely to be heterogeneous
Characterize molecular variations among tumors by monitoring gene expression (microarray)
Hope: microarrays will lead to more reliable tumor classification (and therefore more appropriate treatments and better outcomes)
[Figure: decision boundary separating class 1 from class 2]

25 Classifier: Classifiers
Multilayer perceptron
K-nearest neighbor
Support vector machine
Decision tree
Structure adaptive self-organizing map

26 Classifier: Multilayer Perceptron
Updates the weights recursively to minimize the error computed from the desired output
  A local method for updating the synaptic weights and biases
  An efficient method for computing all the partial derivatives of the cost function with respect to these free parameters
[Figure: network with an input layer ($x_1, x_2, x_3, \ldots, x_N$), a hidden layer (weights $w_{11}, w_{21}, \ldots, w_{KN}$), and an output layer ($o_1, o_2$)]

27 Classifier: K-Nearest Neighbor
One of the most common methods in memory-based induction
Decides the label of an unknown sample from its similarities to the k nearest known exemplars
$P(X, c_j) = \sum_{d_i \in kNN} \mathrm{Sim}(X, d_i)\, P(d_i, c_j) - b_j$
  $\mathrm{Sim}(X, d_i)$: Pearson's correlation similarity function
  $k$: number of neighbors
  $b_j$: a bias term
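The scoring rule can be sketched as follows. This is a minimal sketch, not the lab's code: it assumes that $P(d_i, c_j)$ is 1 when exemplar $d_i$ belongs to class $c_j$ and 0 otherwise, that the bias terms default to zero, and all names and data are hypothetical.

```python
import math

def pearson_sim(x, y):
    """Pearson's correlation between two expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0

def knn_classify(x, train, k=3, bias=None):
    """train is a list of (vector, label); return the best-scoring label."""
    bias = bias or {}
    # The k exemplars most similar to x.
    neighbours = sorted(train, key=lambda d: pearson_sim(x, d[0]),
                        reverse=True)[:k]
    scores = {}
    for vector, label in neighbours:
        # Assumed: P(d_i, c_j) is 1 for the exemplar's own class, else 0.
        scores[label] = scores.get(label, 0.0) + pearson_sim(x, vector)
    for label in scores:
        scores[label] -= bias.get(label, 0.0)
    return max(scores, key=scores.get)

# Hypothetical training exemplars with four genes each.
train = [
    ([1.0, 2.0, 3.0, 4.0], "ALL"),
    ([2.0, 3.0, 4.0, 6.0], "ALL"),
    ([4.0, 3.0, 2.0, 1.0], "AML"),
    ([6.0, 4.0, 3.0, 2.0], "AML"),
    ([5.0, 5.0, 2.0, 1.0], "AML"),
]
prediction = knn_classify([1.0, 3.0, 4.0, 5.0], train, k=3)
```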

28 Classifier: Support Vector Machine
Introduced by Vapnik in 1995
Constructs a hyperplane as the decision surface in such a way that the margin of separation between positive and negative examples is maximized
Given a labeled set of $M$ training samples $(X_i, y_i)$, where $X_i \in R^N$ and $y_i \in \{-1, 1\}$ is the associated label, the discriminant hyperplane is defined by
$f(X) = \sum_{i=1}^{M} y_i \alpha_i k(X, X_i) + b$
Linear and RBF kernels are used
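Evaluating the discriminant function for an already trained machine takes only a few lines. The sketch below assumes the coefficients are given: the support vectors, the $\alpha_i$ values, $b$, and the RBF parameterization $\exp(-\gamma \lVert a - b \rVert^2)$ with $\gamma = 0.5$ are hypothetical choices, and training is not shown.

```python
import math

def linear_kernel(a, b):
    return sum(x * y for x, y in zip(a, b))

def rbf_kernel(a, b, gamma=0.5):
    """RBF kernel exp(-gamma * ||a - b||^2); gamma is an assumed setting."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def decision_value(x, support_vectors, labels, alphas, b, kernel):
    """f(X) = sum_i y_i * alpha_i * k(X, X_i) + b."""
    return sum(y * a * kernel(x, sv)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b

def classify(x, support_vectors, labels, alphas, b, kernel):
    value = decision_value(x, support_vectors, labels, alphas, b, kernel)
    return 1 if value >= 0 else -1

# Hypothetical trained machine with two support vectors.
svs = [[1.0, 1.0], [-1.0, -1.0]]
ys = [1, -1]
alphas = [0.25, 0.25]
bias = 0.0
value = decision_value([2.0, 0.0], svs, ys, alphas, bias, linear_kernel)
```

With the linear kernel these coefficients give the hyperplane $0.5 x_1 + 0.5 x_2 = 0$, so both support vectors sit at decision values of +1 and -1.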

29 Classifier: Decision Tree
A graph (tree) based model used primarily for classification
Popular method for inductive inference
A method for approximating discrete-valued target functions
Easy to convert a learned tree into if-then rules
[Figure: example tree; node and branch labels in reading order: P2 (P2 <= 0.03, P2 > 0.03): tumor, P21; P21 (P21 <= 0.2, P21 > 0.2): P32, normal; P32 (P32 <= 0.22, P32 > 0.22): normal, tumor]

30 Classifier: Structure Adaptive SOM
Dynamic node-splitting classifier based on the self-organizing map (SOM)
Overcomes a shortcoming of SOM
  The structure of nodes does not have to be determined before training
[Figure: a map with nodes $P_0$ to $P_4$, and the map after a node is split into child nodes $C_0$ to $C_3$]

31 Classification Performance Comparisons
Lymphoma cancer dataset
[Table: rows are feature selection methods (PC, SC, ED, CC, IG, MI, SN, Avg.); columns are MLP, SASOM, SVM (Linear, RBF), KNN (Cosine, Pearson), Avg.; values not preserved]

32 Classification Performance Comparisons
Colon cancer dataset
[Table: rows are feature selection methods (PC, SC, ED, CC, IG, MI, SN, Avg.); columns are MLP, SASOM, SVM (Linear, RBF), KNN (Cosine, Pearson), DT, Avg.; values not preserved]

33 Classification
  Comprehensive comparisons
  Ensemble approaches

34 Ensemble Classifier: Overview
Limitations of machine learning classifiers in solving practical problems
  Incomplete dataset
  Noise in data
  Imperfection of the classification algorithm
Solution
  Searching for effective features of input patterns
    Utilizing multiple features
    Providing multiple pathways (more chances) to the optimal solution
  Improving classification performance
    Combining multiple classifiers
    Combining several prospective models may produce better prediction

35 Ensemble Classifier: Rationale
[Diagram: feature space (high and complex space) -> feature selection ($\Phi_1$, $\Phi_2$, $\Phi_3$) -> selected features ($F_1$, $F_2$, $F_3$) -> classification -> solution space, showing the optimal solution and the solution estimated by the ensemble]

36 Ensemble Classifier: Ensemble Approach
A good ensemble includes base classifiers that
  Are accurate (easy)
  Make their errors in different parts of the problem domain (difficult)
Issues for ensemble classifiers
  How to generate good base classifiers
    From combinations of features and classifiers
  How to combine the base classifiers
    Majority voting
    Weighted voting
    Borda count
    BKS, ...
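Three of the combining rules listed above can be sketched briefly. These are generic textbook versions written for illustration, not the exact procedures used in the experiments; the function names, class labels, and weights are hypothetical, and ties are resolved in favor of the class seen first.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class predicted by the most base classifiers."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Each classifier votes with its weight, e.g. its validation accuracy."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)

def borda_count(rankings):
    """Each ranking lists the classes from most to least likely.

    A class earns (number of classes - 1 - position) points per ranking.
    """
    points = {}
    for ranking in rankings:
        for position, label in enumerate(ranking):
            points[label] = points.get(label, 0) + len(ranking) - 1 - position
    return max(points, key=points.get)

# Hypothetical outputs of three base classifiers for one sample.
by_majority = majority_vote(["tumor", "normal", "tumor"])
by_weight = weighted_vote(["tumor", "normal", "normal"], [0.9, 0.3, 0.4])
by_borda = borda_count([["c1", "c2", "c3"],
                        ["c2", "c1", "c3"],
                        ["c2", "c3", "c1"]])
```

In the weighted example a single reliable classifier overrules two weaker ones, which is the main behavioral difference from majority voting.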

37 Ensemble Classifier: Ensemble Generation
Feature selection (m methods)
  Pearson correlation coefficients (PC)
  Spearman correlation coefficients (SC)
  Cosine coefficients (CC)
  Euclidean distance (ED)
  Information gain (IG)
  Mutual information (MI)
  Signal to noise ratio (SN)
  Principal component analysis (PCA)
Classification (n classifiers)
  Multilayer perceptron (MLP)
  K-nearest neighbor (KNN(C), KNN(P))
  Support vector machine (SVM(L), SVM(R))
  Structure adaptive self-organizing map (SASOM)
Combination
  Feature-classifier pair 1, feature-classifier pair 2, ..., feature-classifier pair mn
  Huge number of available ensembles: $2^{mn}$

38 Ensemble Classifier: Ensemble Strategies
Mutually exclusive features
Negatively correlated features
Combinatorial ensemble
GA optimization
Speciated GA optimization

39 Mutually Exclusive Features: Overview
Combining classifiers with mutually exclusive features through the analysis of the correlation of features
[Diagram: input pattern -> feature a and feature b (mutually exclusive); each feature feeds MLP, KNN, SVM linear, and SVM RBF; all outputs go to a combining module]

40 Mutually Exclusive Features: Classification Rates
Leukemia dataset
[Figure: recognition rate (%) for MLP, KNN, SVM RBF, SVM linear, KNN cosine, SOM, and DT; values not preserved]

41 Mutually Exclusive Features: Correlation of Features
Three representative cases of correlations
Pearson's correlation between features has been calculated
[Figure: three scatter plots with axis labels Euclidean distance, signal to noise ratio, cosine coefficient, and Pearson's correlation]
  (a) Negative correlation (coefficient: -0.52)
  (b) Neutral (coefficient: -0.03)
  (c) Positive correlation (coefficient: 0.80)

42 Mutually Exclusive Features: Comparison of Accuracy
[Figure: recognition accuracy (%) of neural network and majority voting combination for case (a) negative correlation, case (b) neutral, case (c) positive correlation, and all features; values not preserved]

43 Negatively Correlated Features: Overview
Idea
  With two ideal gene vectors, select features whose expression patterns are similar to one of the ideal gene vectors
  Train classifiers with the two feature sets and combine them
Method
  Sim(X, Y): similarity between vectors X and Y
  Ideal gene vector A
    Gene set whose expression pattern is similar to $(1, 1, 1, \ldots, 0, 0, 0)$
    $\mathrm{SGS\ I} = \arg\max \{\mathrm{Sim}(gene_i, \text{Ideal Gene Vector A})\}$
  Ideal gene vector B
    Gene set whose expression pattern is similar to $(0, 0, 0, \ldots, 1, 1, 1)$
    $\mathrm{SGS\ II} = \arg\max \{\mathrm{Sim}(gene_i, \text{Ideal Gene Vector B})\}$
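Building the two gene sets can be sketched as two rankings against the two ideal vectors. This is an illustrative sketch: Pearson's correlation is assumed as Sim, the gene names and expression values are hypothetical, and `k` (the number of genes kept per set) is a free parameter.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0

def select_gene_sets(genes, labels, k, sim=pearson):
    """Return (SGS I, SGS II): the k genes most similar to A and to B."""
    ideal_a = [1.0 if y == 0 else 0.0 for y in labels]  # (1,1,1,...,0,0,0)
    ideal_b = [1.0 - v for v in ideal_a]                # (0,0,0,...,1,1,1)
    by_a = sorted(genes, key=lambda g: sim(genes[g], ideal_a), reverse=True)
    by_b = sorted(genes, key=lambda g: sim(genes[g], ideal_b), reverse=True)
    return by_a[:k], by_b[:k]

# Hypothetical data: four genes on six samples (three per class).
labels = [0, 0, 0, 1, 1, 1]
genes = {
    "g_up": [1.0, 2.0, 1.5, 8.0, 9.0, 8.5],
    "g_flat": [5.0, 5.1, 4.9, 5.0, 5.1, 4.9],
    "g_down": [9.0, 8.0, 8.5, 1.0, 2.0, 1.5],
    "g_noisy": [3.0, 7.0, 2.0, 6.0, 1.0, 5.0],
}
sgs_1, sgs_2 = select_gene_sets(genes, labels, k=1)
```

One classifier would then be trained on each set and the two outputs combined, as the slide describes.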

44 Negatively Correlated Features: Example
Ideal Gene A = (1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
Ideal Gene B = (0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
[Figure: negative correlation between Gene 1, Gene 2 and Gene 1', Gene 2']

45 Negatively Correlated Features: Selected Features
Leukemia dataset, Pearson correlation coefficients
[Figure: expression of the selected genes over ALL and AML samples, grouped into SGS II and SGS I]
Genes in figure order: gene_3320, gene_4847, gene_2020, gene_1745, gene_5039, gene_1834, gene_461, gene_4196, gene_3847, gene_2288, gene_1249, gene_6201, gene_2242, gene_3258, gene_1882, gene_2111, gene_2121, gene_6200, gene_6373, gene_6539, gene_2043, gene_2759, gene_6803, gene_1674, gene_2402, gene_5772, gene_2301, gene_6055, gene_387, gene_4167, gene_4230, gene_6990, gene_4328, gene_6281, gene_5593, gene_2543, gene_1306, gene_6064, gene_2050, gene_3386, gene_2441, gene_4289, gene_4389, gene_1928, gene_515, gene_2354, gene_6471, gene_6515, gene_149, gene_3070

46 Negatively Correlated Features: PCA 3D Plot
Select 25 genes from SGS I and 25 genes from SGS II by Pearson correlation coefficients and extract 3 principal components
AML and ALL are well separated
[Figure: 3D plot with axes first PC, second PC, third PC; red: ALL, blue: AML]

47 Negatively Correlated Features: Comparison of Performance
[Table: accuracy (%), sensitivity (%), and specificity (%) of MLP I, MLP II, and MLP I + MLP II on the leukemia, colon, and lymphoma datasets; values not preserved]

48 Combinatorial Ensemble: Overview
In theory, a good ensemble should include base classifiers that
  Are accurate
  Make their errors in different parts of the problem domain
In practice
  Easy to obtain weak classifiers whose accuracy is about 50%
  Very difficult to get uncorrelated classifiers
  A large number of classifiers does not guarantee good ensemble performance
Testing ensembles combinatorially up to a promising number of ensembles, instead of testing all available ensembles

49 Combinatorial Ensemble: Structure
[Diagram: gene expression data -> feature selection methods ($F_1$, $F_2$, $F_3$, ..., $F_i$) and classifiers ($C_1$, $C_2$, $C_3$, ..., $C_j$) -> n feature-classifier sets ($F_1C_1$, $F_1C_2$, ..., $F_iC_j$) -> combinatorial selection ($_nC_5$) -> ensemble method -> prediction (class 1, ..., class c)]

50 Combinatorial Ensemble: Comparison of Accuracy
[Table: accuracy on the leukemia, colon, and lymphoma datasets by combining method (majority voting, weighted voting, Bayesian combination) and number of classifiers, including "All"; values not preserved]
"All" is less accurate, 7 is expensive

51 GA Optimization: Overview
Many ensembles can be formed from several classifiers
  The number increases exponentially with the number of classifiers
  48 base feature-classifier pairs make $2^{48}$ ensembles
  Exhaustive search is very time-consuming
Use a genetic algorithm (GA) to search for the optimal ensemble in a short time
  An ensemble is made from the 48 base feature-classifier pairs, which come from 8 feature selection methods and 6 classifiers
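The size of the search space follows directly from the counts on this slide. The short sketch below enumerates the pairs from the abbreviations used in the deck; the evaluation rate of one million ensembles per second is a purely hypothetical figure used to show why exhaustive search is impractical.

```python
from itertools import product

features = ["PC", "SC", "ED", "CC", "IG", "MI", "SN", "PCA"]
classifiers = ["MLP", "KNN(C)", "KNN(P)", "SVM(L)", "SVM(R)", "SASOM"]

# Every feature selection method paired with every classifier.
pairs = [f"{f}-{c}" for f, c in product(features, classifiers)]

# Each pair is either in or out of an ensemble (the empty set is counted too).
n_ensembles = 2 ** len(pairs)

# Hypothetical rate: one million ensembles evaluated per second.
years = n_ensembles / 1_000_000 / (3600 * 24 * 365)
print(len(pairs), n_ensembles, round(years, 1))
```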

52 GA Optimization: Structure
[Diagram: normalized gene expression profiles -> feature selectors 1 to m and classifiers 1 to n -> feature-classifier pairs -> GA searching with fitness evaluation -> ensemble -> cancer / normal]

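53 GA Optimization: GA Chromosome
Genotype (chromosome): 48 bits
Phenotype (feature-classifier): CC-MLP, ED-MLP, IG-MLP, MI-MLP, PC-MLP, PCA-MLP, SN-MLP, SC-MLP, ..., SC-SVM(RBF)
[Figure: example chromosome with bits 0 (CC-MLP), 1 (ED-MLP), 1 (IG-MLP), 0 (MI-MLP), 0 (PC-MLP), 1 (PCA-MLP), 0 (SN-MLP), 0 (SC-MLP), ...; the results of the feature-classifier pairs (percentages not preserved) are combined by majority voting into the ensemble result, which is compared with the actual class]
Fitness of a chromosome ch:
$\mathrm{Fit}(ch) = \frac{\#\text{ of samples correctly classified by } ch}{\#\text{ of total samples classified by } ch}$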
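The mapping from chromosome to fitness can be sketched as below. This is a minimal sketch under assumptions: a 1 bit is taken to mean that the pair is included in the ensemble, the predictions of every pair are assumed to be precomputed, an empty ensemble is given fitness 0, and the four-pair example data are hypothetical.

```python
from collections import Counter

def ensemble_predictions(chromosome, pair_predictions):
    """Majority vote over the pairs whose bit is 1.

    pair_predictions[i][s] is the class predicted by pair i for sample s.
    """
    selected = [p for bit, p in zip(chromosome, pair_predictions) if bit]
    if not selected:
        return None
    n_samples = len(selected[0])
    return [Counter(p[s] for p in selected).most_common(1)[0][0]
            for s in range(n_samples)]

def fitness(chromosome, pair_predictions, actual):
    """Fraction of samples the encoded ensemble classifies correctly."""
    predicted = ensemble_predictions(chromosome, pair_predictions)
    if predicted is None:
        return 0.0  # assumed convention for an empty ensemble
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical example: 4 feature-classifier pairs, 4 validation samples.
actual = [1, 0, 1, 0]
pair_predictions = [
    [1, 0, 0, 0],  # pair 0: wrong on sample 2
    [1, 1, 1, 0],  # pair 1: wrong on sample 1
    [0, 0, 1, 0],  # pair 2: wrong on sample 0
    [0, 1, 0, 1],  # pair 3: wrong everywhere
]
fit_three = fitness([1, 1, 1, 0], pair_predictions, actual)
fit_single = fitness([1, 0, 0, 0], pair_predictions, actual)
```

Each of the first three pairs alone reaches 0.75, yet their majority vote is correct on every sample because they err on different samples.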

54 GA Optimization: Change of Average Fitness
[Figure: average fitness versus iteration]
Fitness increases until the number of iterations reaches 150
Saturated after 150 iterations

55 GA Optimization: Leave-one-out Cross-Validation
[Figure: accuracy (%), average and range, on the lymphoma and colon datasets for training, validation (single), validation (ensemble), and test; values not preserved]
The optimal ensemble searched by GA outperforms

56 GA Optimization: Comparison of Accuracy
[Figure: accuracy per experiment; legend: best single; ensemble of good classifiers; best ensemble among 1 million random ensembles; best ensemble among 1 million (simple GA, sharing); best ensemble among 1 million (crowding); values not preserved]
GA > best single classifier > ensemble of good classifiers

57 GA Optimization: Some Optimal Ensembles
Majority voting
  Feature-classifier pair    Accuracy (%)
  CC-KNN(P)                  75.0
  MI-KNN(C)                  83.3
  SN-KNN(C)                  79.2
  SC-SASOM                   62.5
  IG-SVM(L)                  91.7
  Ensemble                   100
Weighted voting
  Feature-classifier pair    Accuracy (%)
  IG-KNN(C)                  91.7
  MI-KNN(C)                  83.3
  SN-KNN(C)                  79.2
  SN-KNN(P)                  79.2
  CC-SASOM                   54.2
  IG-SASOM                   83.3
  PC-SVM(R)                  62.5
  Ensemble                   100

58 Speciated GA Optimization: Overview
Among all the $2^{mn}$ ensembles
  Standard GA does not guarantee the optimal solution
    GA usually converges to local optima
  There may be many optimal ensembles
    The number is unknown
    GA just finds one of them
Use of speciated GA instead of standard GA
  Fitness sharing
  Deterministic crowding

59 Speciated GA Optimization: Concept
[Diagram: solution space $\Omega$ and observation space, with genetic drift; solutions searched by simple GA compared with solutions searched by speciated GA]

60 Speciated GA Optimization: Structure
[Diagram: microarray data -> preprocessing -> gene expression data matrix -> feature selection (PC, SC, ED, CC, IG, MI, SN, PCA, ...) and classifier (MLP, KNN(C), KNN(P), SVM(L), SVM(R), SASOM, ...) -> training of feature-classifier pairs (FC1, FC2, ..., FC48) -> ensemble maker -> speciated GA searching (validation) -> optimal ensemble -> evaluation of a new instance (test) -> tumor / normal]

61 Speciated GA Optimization: Fitness Function
Fitness of a chromosome ch:
$\mathrm{Fitness}(ch) = \mathrm{Acc}(ch) - \alpha \cdot \mathrm{Num}_1(ch)$
where
$\mathrm{Acc}(ch) = \frac{\#\text{ of samples correctly classified by } ch}{\#\text{ of total samples classified by } ch}$
$\mathrm{Num}_1(ch)$ = # of 1 bits in chromosome ch (the shorter, the better)
$\alpha$: constant
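The effect of the size penalty is easy to see in code. In this sketch the accuracy is passed in as a precomputed number, and the value of `alpha` is an arbitrary assumption, since the slides only say that it is a constant.

```python
def penalized_fitness(chromosome, accuracy, alpha=0.01):
    """Fitness(ch) = Acc(ch) - alpha * Num1(ch).

    chromosome is a list of 0/1 bits; accuracy is Acc(ch) in [0, 1].
    alpha = 0.01 is a hypothetical setting.
    """
    return accuracy - alpha * sum(chromosome)

# Two hypothetical ensembles with the same accuracy but different sizes.
small = penalized_fitness([1, 1, 1, 0, 0, 0, 0, 0], accuracy=1.0)
large = penalized_fitness([1, 1, 1, 1, 1, 1, 1, 0], accuracy=1.0)
```

With equal accuracy the ensemble with fewer members scores higher, which steers the search toward compact ensembles.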

62 Speciated GA Optimization: Deterministic Crowding
Input: g - number of generations to run, s - population size
Output: P(g) - the final population
P(0) <- initialize()
for t <- 1 to g do
  P(t) <- shuffle(P(t-1))
  for i <- 0 to s/2 - 1 do
    p1 <- a_{2i+1}(t)
    p2 <- a_{2i+2}(t)
    {c1, c2} <- recombination(p1, p2)
    c1' <- mutate(c1)
    c2' <- mutate(c2)
    if [d(p1, c1') + d(p2, c2')] <= [d(p1, c2') + d(p2, c1')] then
      if F(c1') > F(p1) then a_{2i+1}(t) <- c1' fi
      if F(c2') > F(p2) then a_{2i+2}(t) <- c2' fi
    else
      if F(c2') > F(p1) then a_{2i+1}(t) <- c2' fi
      if F(c1') > F(p2) then a_{2i+2}(t) <- c1' fi
    fi
  od
od
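The pseudocode translates almost line for line into Python. The sketch below fills in details the slide leaves open, all of which are assumptions: bit-string chromosomes, one-point crossover as the recombination, bit-flip mutation with rate `p_mut`, Hamming distance as d, and a toy fitness (the number of 1 bits) in the usage example.

```python
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def deterministic_crowding(fitness, n_bits, pop_size, generations,
                           p_mut=0.02, rng=None):
    """Run deterministic crowding and return the final population."""
    if rng is None:
        rng = random.Random(0)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    for _ in range(generations):
        rng.shuffle(pop)
        for i in range(0, pop_size - 1, 2):
            p1, p2 = pop[i], pop[i + 1]
            # One-point crossover followed by bit-flip mutation.
            cut = rng.randrange(1, n_bits)
            c1 = p1[:cut] + p2[cut:]
            c2 = p2[:cut] + p1[cut:]
            c1 = [b ^ 1 if rng.random() < p_mut else b for b in c1]
            c2 = [b ^ 1 if rng.random() < p_mut else b for b in c2]
            # Each child competes with the parent it resembles most.
            if (hamming(p1, c1) + hamming(p2, c2)
                    <= hamming(p1, c2) + hamming(p2, c1)):
                if fitness(c1) > fitness(p1):
                    pop[i] = c1
                if fitness(c2) > fitness(p2):
                    pop[i + 1] = c2
            else:
                if fitness(c2) > fitness(p1):
                    pop[i] = c2
                if fitness(c1) > fitness(p2):
                    pop[i + 1] = c1
    return pop

# Toy usage: maximize the number of 1 bits in a 12-bit chromosome.
final = deterministic_crowding(fitness=sum, n_bits=12, pop_size=10,
                               generations=40)
best = max(final, key=sum)
```

Because a child replaces only the parent it competes with, and only when it is fitter, no slot in the population ever loses fitness, while distant niches are left undisturbed.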

63 Speciated GA Optimization: Fitness Sharing
A strategy that maintains the diversity of chromosomes by lowering the fitness of individuals that are located close to each other
Use the shared fitness $F'(i)$ instead of the original fitness $F(i)$
$F'(i) = \frac{F(i)}{m(i)}$
$m(i) = \sum_{j=1}^{\mu} sh(d(i, j))$
$sh(d) = 1 - (d / \sigma_{share})^{\alpha}$ if $d < \sigma_{share}$, and 0 otherwise
[Figure: fitness compared with shared fitness]
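Shared fitness can be computed directly from these three formulas. In the sketch below the distance d is assumed to be the Hamming distance between bit strings, and the values of `sigma_share` and `alpha` as well as the example population are hypothetical.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def sharing(d, sigma_share, alpha=1.0):
    """sh(d) = 1 - (d / sigma_share)^alpha if d < sigma_share, else 0."""
    return 1.0 - (d / sigma_share) ** alpha if d < sigma_share else 0.0

def shared_fitness(population, raw_fitness, sigma_share, alpha=1.0):
    """F'(i) = F(i) / m(i), with m(i) summed over the whole population."""
    result = []
    for i, individual in enumerate(population):
        # The niche count includes the individual itself, so it is >= 1.
        niche_count = sum(sharing(hamming(individual, other),
                                  sigma_share, alpha)
                          for other in population)
        result.append(raw_fitness[i] / niche_count)
    return result

# Hypothetical population: two identical chromosomes and one distant one.
population = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
raw = [1.0, 1.0, 1.0]
shared = shared_fitness(population, raw, sigma_share=2)
```

The two identical chromosomes split their fitness while the isolated one keeps all of it, which is how sharing discourages the population from collapsing onto a single optimum.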

64 Speciated GA Optimization: Comparison of Diversity
The number of optimal ensembles found by each method on one dataset
[Table: counts per experiment for sGA, sharing, and crowding; values not preserved]
crowding >> sGA, sharing

65 Speciated GA Optimization: Change of Fitness and Accuracy
[Figure: fitness and accuracy versus iteration for simple GA, sharing, and crowding]
crowding >> sGA, sharing

66 Speciated GA Optimization: Search Efficiency
[Table: iterations for common GA, sharing, and crowding; values not preserved]
Execution time per iteration: simple GA < crowding < sharing

67 Conclusion
Classification
  Comparisons of features/classifiers
  Exploration of ensemble approaches


More information

Bioinformatics and Genomics: A New SP Frontier?

Bioinformatics and Genomics: A New SP Frontier? Bioinformatics and Genomics: A New SP Frontier? A. O. Hero University of Michigan - Ann Arbor http://www.eecs.umich.edu/ hero Collaborators: G. Fleury, ESE - Paris S. Yoshida, A. Swaroop UM - Ann Arbor

More information

Methods for Multi-Category Cancer Diagnosis from Gene Expression Data: A Comprehensive Evaluation to Inform Decision Support System Development

Methods for Multi-Category Cancer Diagnosis from Gene Expression Data: A Comprehensive Evaluation to Inform Decision Support System Development 1 Methods for Multi-Category Cancer Diagnosis from Gene Expression Data: A Comprehensive Evaluation to Inform Decision Support System Development Alexander Statnikov M.S., Constantin F. Aliferis M.D.,

More information

Functional genomics + Data mining

Functional genomics + Data mining Functional genomics + Data mining BIO337 Systems Biology / Bioinformatics Spring 2014 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ of Texas/BIO337/Spring 2014 Functional genomics + Data

More information

APPLICATION OF COMMITTEE k-nn CLASSIFIERS FOR GENE EXPRESSION PROFILE CLASSIFICATION. A Thesis. Presented to

APPLICATION OF COMMITTEE k-nn CLASSIFIERS FOR GENE EXPRESSION PROFILE CLASSIFICATION. A Thesis. Presented to APPLICATION OF COMMITTEE k-nn CLASSIFIERS FOR GENE EXPRESSION PROFILE CLASSIFICATION A Thesis Presented to The Graduate Faculty of The University of Akron In Partial Fulfillment of the Requirements for

More information

Statistical Machine Learning Methods for Bioinformatics VI. Support Vector Machine Applications in Bioinformatics

Statistical Machine Learning Methods for Bioinformatics VI. Support Vector Machine Applications in Bioinformatics Statistical Machine Learning Methods for Bioinformatics VI. Support Vector Machine Applications in Bioinformatics Jianlin Cheng, PhD Computer Science Department and Informatics Institute University of

More information

Machine Learning. HMM applications in computational biology

Machine Learning. HMM applications in computational biology 10-601 Machine Learning HMM applications in computational biology Central dogma DNA CCTGAGCCAACTATTGATGAA transcription mrna CCUGAGCCAACUAUUGAUGAA translation Protein PEPTIDE 2 Biological data is rapidly

More information

Following text taken from Suresh Kumar. Bioinformatics Web - Comprehensive educational resource on Bioinformatics. 6th May.2005

Following text taken from Suresh Kumar. Bioinformatics Web - Comprehensive educational resource on Bioinformatics. 6th May.2005 Bioinformatics is the recording, annotation, storage, analysis, and searching/retrieval of nucleic acid sequence (genes and RNAs), protein sequence and structural information. This includes databases of

More information

Classification Study on DNA Microarray with Feedforward Neural Network Trained by Singular Value Decomposition

Classification Study on DNA Microarray with Feedforward Neural Network Trained by Singular Value Decomposition Classification Study on DNA Microarray with Feedforward Neural Network Trained by Singular Value Decomposition Hieu Trung Huynh 1, Jung-Ja Kim 2 and Yonggwan Won 1 1 Department of Computer Engineering,

More information

Gene set based ensemble methods for cancer classification

Gene set based ensemble methods for cancer classification Louisiana State University LSU Digital Commons LSU Doctoral Dissertations Graduate School 2013 Gene set based ensemble methods for cancer classification William Evans Duncan Louisiana State University

More information

Classification and Learning Using Genetic Algorithms

Classification and Learning Using Genetic Algorithms Sanghamitra Bandyopadhyay Sankar K. Pal Classification and Learning Using Genetic Algorithms Applications in Bioinformatics and Web Intelligence With 87 Figures and 43 Tables 4y Spri rineer 1 Introduction

More information

Learning theory: SLT what is it? Parametric statistics small number of parameters appropriate to small amounts of data

Learning theory: SLT what is it? Parametric statistics small number of parameters appropriate to small amounts of data Predictive Genomics, Biology, Medicine Learning theory: SLT what is it? Parametric statistics small number of parameters appropriate to small amounts of data Ex. Find mean m and standard deviation s for

More information

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University Machine learning applications in genomics: practical issues & challenges Yuzhen Ye School of Informatics and Computing, Indiana University Reference Machine learning applications in genetics and genomics

More information

DNA Based Disease Prediction using pathway Analysis

DNA Based Disease Prediction using pathway Analysis 2017 IEEE 7th International Advance Computing Conference DNA Based Disease Prediction using pathway Analysis Syeeda Farah Dr.Asha T Cauvery B and Sushma M S Department of Computer Science and Shivanand

More information

Neural Networks and Applications in Bioinformatics. Yuzhen Ye School of Informatics and Computing, Indiana University

Neural Networks and Applications in Bioinformatics. Yuzhen Ye School of Informatics and Computing, Indiana University Neural Networks and Applications in Bioinformatics Yuzhen Ye School of Informatics and Computing, Indiana University Contents Biological problem: promoter modeling Basics of neural networks Perceptrons

More information

Prediction of Success or Failure of Software Projects based on Reusability Metrics using Support Vector Machine

Prediction of Success or Failure of Software Projects based on Reusability Metrics using Support Vector Machine Prediction of Success or Failure of Software Projects based on Reusability Metrics using Support Vector Machine R. Sathya Assistant professor, Department of Computer Science & Engineering Annamalai University

More information

Neural Networks and Applications in Bioinformatics

Neural Networks and Applications in Bioinformatics Contents Neural Networks and Applications in Bioinformatics Yuzhen Ye School of Informatics and Computing, Indiana University Biological problem: promoter modeling Basics of neural networks Perceptrons

More information

Gene Reduction for Cancer Classification using Cascaded Neural Network with Gene Masking

Gene Reduction for Cancer Classification using Cascaded Neural Network with Gene Masking Gene Reduction for Cancer Classification using Cascaded Neural Network with Gene Masking Raneel Kumar, Krishnil Chand, Sunil Pranit Lal School of Computing, Information, and Mathematical Sciences University

More information

Support Vector Machines (SVMs) for the classification of microarray data. Basel Computational Biology Conference, March 2004 Guido Steiner

Support Vector Machines (SVMs) for the classification of microarray data. Basel Computational Biology Conference, March 2004 Guido Steiner Support Vector Machines (SVMs) for the classification of microarray data Basel Computational Biology Conference, March 2004 Guido Steiner Overview Classification problems in machine learning context Complications

More information

A Protein Secondary Structure Prediction Method Based on BP Neural Network Ru-xi YIN, Li-zhen LIU*, Wei SONG, Xin-lei ZHAO and Chao DU

A Protein Secondary Structure Prediction Method Based on BP Neural Network Ru-xi YIN, Li-zhen LIU*, Wei SONG, Xin-lei ZHAO and Chao DU 2017 2nd International Conference on Artificial Intelligence: Techniques and Applications (AITA 2017 ISBN: 978-1-60595-491-2 A Protein Secondary Structure Prediction Method Based on BP Neural Network Ru-xi

More information

Comparative Genomic Hybridization

Comparative Genomic Hybridization Comparative Genomic Hybridization Srikesh G. Arunajadai Division of Biostatistics University of California Berkeley PH 296 Presentation Fall 2002 December 9 th 2002 OUTLINE CGH Introduction Methodology,

More information

A STUDY ON STATISTICAL BASED FEATURE SELECTION METHODS FOR CLASSIFICATION OF GENE MICROARRAY DATASET

A STUDY ON STATISTICAL BASED FEATURE SELECTION METHODS FOR CLASSIFICATION OF GENE MICROARRAY DATASET A STUDY ON STATISTICAL BASED FEATURE SELECTION METHODS FOR CLASSIFICATION OF GENE MICROARRAY DATASET 1 J.JEYACHIDRA, M.PUNITHAVALLI, 1 Research Scholar, Department of Computer Science and Applications,

More information

Predicting prokaryotic incubation times from genomic features Maeva Fincker - Final report

Predicting prokaryotic incubation times from genomic features Maeva Fincker - Final report Predicting prokaryotic incubation times from genomic features Maeva Fincker - mfincker@stanford.edu Final report Introduction We have barely scratched the surface when it comes to microbial diversity.

More information

ROAD TO STATISTICAL BIOINFORMATICS CHALLENGE 1: MULTIPLE-COMPARISONS ISSUE

ROAD TO STATISTICAL BIOINFORMATICS CHALLENGE 1: MULTIPLE-COMPARISONS ISSUE CHAPTER1 ROAD TO STATISTICAL BIOINFORMATICS Jae K. Lee Department of Public Health Science, University of Virginia, Charlottesville, Virginia, USA There has been a great explosion of biological data and

More information

Measuring gene expression (Microarrays) Ulf Leser

Measuring gene expression (Microarrays) Ulf Leser Measuring gene expression (Microarrays) Ulf Leser This Lecture Gene expression Microarrays Idea Technologies Problems Quality control Normalization Analysis next week! 2 http://learn.genetics.utah.edu/content/molecules/transcribe/

More information

PCA and SOM based Dimension Reduction Techniques for Quaternary Protein Structure Prediction

PCA and SOM based Dimension Reduction Techniques for Quaternary Protein Structure Prediction PCA and SOM based Dimension Reduction Techniques for Quaternary Protein Structure Prediction Sanyukta Chetia Department of Electronics and Communication Engineering, Gauhati University-781014, Guwahati,

More information

2. Materials and Methods

2. Materials and Methods Identification of cancer-relevant Variations in a Novel Human Genome Sequence Robert Bruggner, Amir Ghazvinian 1, & Lekan Wang 1 CS229 Final Report, Fall 2009 1. Introduction Cancer affects people of all

More information

Smart India Hackathon

Smart India Hackathon TM Persistent and Hackathons Smart India Hackathon 2017 i4c www.i4c.co.in Digital Transformation 25% of India between age of 16-25 Our country needs audacious digital transformation to reach its potential

More information

Supervised Learning from Micro-Array Data: Datamining with Care

Supervised Learning from Micro-Array Data: Datamining with Care November 18, 2002 Stanford Statistics 1 Supervised Learning from Micro-Array Data: Datamining with Care Trevor Hastie Stanford University November 18, 2002 joint work with Robert Tibshirani, Balasubramanian

More information

Network System Inference

Network System Inference Network System Inference Francis J. Doyle III University of California, Santa Barbara Douglas Lauffenburger Massachusetts Institute of Technology WTEC Systems Biology Final Workshop March 11, 2005 What

More information

Reliable classification of two-class cancer data using evolutionary algorithms

Reliable classification of two-class cancer data using evolutionary algorithms BioSystems 72 (23) 111 129 Reliable classification of two-class cancer data using evolutionary algorithms Kalyanmoy Deb, A. Raji Reddy Kanpur Genetic Algorithms Laboratory (KanGAL), Indian Institute of

More information

BIOINFORMATICS AND SYSTEM BIOLOGY (INTERNATIONAL PROGRAM)

BIOINFORMATICS AND SYSTEM BIOLOGY (INTERNATIONAL PROGRAM) BIOINFORMATICS AND SYSTEM BIOLOGY (INTERNATIONAL PROGRAM) PROGRAM TITLE DEGREE TITLE Master of Science Program in Bioinformatics and System Biology (International Program) Master of Science (Bioinformatics

More information

Predicting Corporate Influence Cascades In Health Care Communities

Predicting Corporate Influence Cascades In Health Care Communities Predicting Corporate Influence Cascades In Health Care Communities Shouzhong Shi, Chaudary Zeeshan Arif, Sarah Tran December 11, 2015 Part A Introduction The standard model of drug prescription choice

More information

Data Mining in Bioinformatics. Prof. André de Carvalho ICMC-Universidade de São Paulo

Data Mining in Bioinformatics. Prof. André de Carvalho ICMC-Universidade de São Paulo Data Mining in Bioinformatics Prof. André de Carvalho ICMC-Universidade de São Paulo Main topics Motivation Data Mining Prediction Bioinformatics Molecular Biology Using DM in Molecular Biology Case studies

More information

GA-SVM WRAPPER APPROACH FOR GENE RANKING AND CLASSIFICATION USING EXPRESSIONS OF VERY FEW GENES

GA-SVM WRAPPER APPROACH FOR GENE RANKING AND CLASSIFICATION USING EXPRESSIONS OF VERY FEW GENES GA-SVM WRAPPER APPROACH FOR GENE RANKING AND CLASSIFICATION USING EXPRESSIONS OF VERY FEW GENES N.REVATHY 1, Dr.R.BALASUBRAMANIAN 2 1 Assistant Professor, Department of Computer Applications, Karpagam

More information

Introduction to Microarray Analysis

Introduction to Microarray Analysis Introduction to Microarray Analysis Methods Course: Gene Expression Data Analysis -Day One Rainer Spang Microarrays Highly parallel measurement devices for gene expression levels 1. How does the microarray

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics If the 19 th century was the century of chemistry and 20 th century was the century of physic, the 21 st century promises to be the century of biology...professor Dr. Satoru

More information

Iterated Conditional Modes for Cross-Hybridization Compensation in DNA Microarray Data

Iterated Conditional Modes for Cross-Hybridization Compensation in DNA Microarray Data http://www.psi.toronto.edu Iterated Conditional Modes for Cross-Hybridization Compensation in DNA Microarray Data Jim C. Huang, Quaid D. Morris, Brendan J. Frey October 06, 2004 PSI TR 2004 031 Iterated

More information

Feature selection methods for SVM classification of microarray data

Feature selection methods for SVM classification of microarray data Feature selection methods for SVM classification of microarray data Mike Love December 11, 2009 SVMs for microarray classification tasks Linear support vector machines have been used in microarray experiments

More information

Survival Outcome Prediction for Cancer Patients based on Gene Interaction Network Analysis and Expression Profile Classification

Survival Outcome Prediction for Cancer Patients based on Gene Interaction Network Analysis and Expression Profile Classification Survival Outcome Prediction for Cancer Patients based on Gene Interaction Network Analysis and Expression Profile Classification Final Project Report Alexander Herrmann Advised by Dr. Andrew Gentles December

More information

Gene expression analysis. Biosciences 741: Genomics Fall, 2013 Week 5. Gene expression analysis

Gene expression analysis. Biosciences 741: Genomics Fall, 2013 Week 5. Gene expression analysis Gene expression analysis Biosciences 741: Genomics Fall, 2013 Week 5 Gene expression analysis From EST clusters to spotted cdna microarrays Long vs. short oligonucleotide microarrays vs. RT-PCR Methods

More information

Bayesian Variable Selection and Data Integration for Biological Regulatory Networks

Bayesian Variable Selection and Data Integration for Biological Regulatory Networks Bayesian Variable Selection and Data Integration for Biological Regulatory Networks Shane T. Jensen Department of Statistics The Wharton School, University of Pennsylvania stjensen@wharton.upenn.edu Gary

More information

296 IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, VOL. 10, NO. 3, JUNE 2006

296 IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, VOL. 10, NO. 3, JUNE 2006 296 IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, VOL. 10, NO. 3, JUNE 2006 An Evolutionary Clustering Algorithm for Gene Expression Microarray Data Analysis Patrick C. H. Ma, Keith C. C. Chan, Xin Yao,

More information

Customer Relationship Management in marketing programs: A machine learning approach for decision. Fernanda Alcantara

Customer Relationship Management in marketing programs: A machine learning approach for decision. Fernanda Alcantara Customer Relationship Management in marketing programs: A machine learning approach for decision Fernanda Alcantara F.Alcantara@cs.ucl.ac.uk CRM Goal Support the decision taking Personalize the best individual

More information

Finding molecular signatures from gene expression data: review and a new proposal

Finding molecular signatures from gene expression data: review and a new proposal Finding molecular signatures from gene expression data: review and a new proposal Ramón Díaz-Uriarte rdiaz@cnio.es http://bioinfo.cnio.es/ rdiaz Unidad de Bioinformática Centro Nacional de Investigaciones

More information

ISTITUTO DI ANALISI DEI SISTEMI ED INFORMATICA Antonio Ruberti CONSIGLIO NAZIONALE DELLE RICERCHE

ISTITUTO DI ANALISI DEI SISTEMI ED INFORMATICA Antonio Ruberti CONSIGLIO NAZIONALE DELLE RICERCHE ISTITUTO DI ANALISI DEI SISTEMI ED INFORMATICA Antonio Ruberti CONSIGLIO NAZIONALE DELLE RICERCHE P. Bertolazzi, G. Felici, G. Lancia APPLICATION OF FEATURE SELECTION AND CLASSIFICATION TO COMPUTATIONAL

More information

Measuring gene expression

Measuring gene expression Measuring gene expression Grundlagen der Bioinformatik SS2018 https://www.youtube.com/watch?v=v8gh404a3gg Agenda Organization Gene expression Background Technologies FISH Nanostring Microarrays RNA-seq

More information

Gene expression analysis: Introduction to microarrays

Gene expression analysis: Introduction to microarrays Gene expression analysis: Introduction to microarrays Adam Ameur The Linnaeus Centre for Bioinformatics, Uppsala University February 15, 2006 Overview Introduction Part I: How a microarray experiment is

More information

Introduction to Microarray Data Analysis and Gene Networks. Alvis Brazma European Bioinformatics Institute

Introduction to Microarray Data Analysis and Gene Networks. Alvis Brazma European Bioinformatics Institute Introduction to Microarray Data Analysis and Gene Networks Alvis Brazma European Bioinformatics Institute A brief outline of this course What is gene expression, why it s important Microarrays and how

More information

Including prior knowledge in shrinkage classifiers for genomic data

Including prior knowledge in shrinkage classifiers for genomic data Including prior knowledge in shrinkage classifiers for genomic data Jean-Philippe Vert Jean-Philippe.Vert@mines-paristech.fr Mines ParisTech / Curie Institute / Inserm Statistical Genomics in Biomedical

More information

Estimating Cell Cycle Phase Distribution of Yeast from Time Series Gene Expression Data

Estimating Cell Cycle Phase Distribution of Yeast from Time Series Gene Expression Data 2011 International Conference on Information and Electronics Engineering IPCSIT vol.6 (2011) (2011) IACSIT Press, Singapore Estimating Cell Cycle Phase Distribution of Yeast from Time Series Gene Expression

More information

Evolving connectionist systems for knowledge discovery from gene expression data of cancer tissue

Evolving connectionist systems for knowledge discovery from gene expression data of cancer tissue Artificial Intelligence in Medicine 28 (2003) 165 189 Evolving connectionist systems for knowledge discovery from gene expression data of cancer tissue Matthias E. Futschik a,*, Anthony Reeve b, Nikola

More information

STATISTICAL CHALLENGES IN GENE DISCOVERY

STATISTICAL CHALLENGES IN GENE DISCOVERY STATISTICAL CHALLENGES IN GENE DISCOVERY THROUGH MICROARRAY DATA ANALYSIS 1 Central Tuber Crops Research Institute,Kerala, India 2 Dept. of Statistics, St. Thomas College, Pala, Kerala, India email:sreejyothi

More information

Machine Learning Methods for Microarray Data Analysis

Machine Learning Methods for Microarray Data Analysis Harvard-MIT Division of Health Sciences and Technology HST.512: Genomic Medicine Prof. Marco F. Ramoni Machine Learning Methods for Microarray Data Analysis Marco F. Ramoni Children s Hospital Informatics

More information

Statistical Methods for Network Analysis of Biological Data

Statistical Methods for Network Analysis of Biological Data The Protein Interaction Workshop, 8 12 June 2015, IMS Statistical Methods for Network Analysis of Biological Data Minghua Deng, dengmh@pku.edu.cn School of Mathematical Sciences Center for Quantitative

More information

Machine Learning in Computational Biology CSC 2431

Machine Learning in Computational Biology CSC 2431 Machine Learning in Computational Biology CSC 2431 Lecture 9: Combining biological datasets Instructor: Anna Goldenberg What kind of data integration is there? What kind of data integration is there? SNPs

More information

Introduction to gene expression microarray data analysis

Introduction to gene expression microarray data analysis Introduction to gene expression microarray data analysis Outline Brief introduction: Technology and data. Statistical challenges in data analysis. Preprocessing data normalization and transformation. Useful

More information

Homework : Data Mining. Due at the start of class Friday, 25 September 2009

Homework : Data Mining. Due at the start of class Friday, 25 September 2009 Homework 4 36-350: Data Mining Due at the start of class Friday, 25 September 2009 This homework set applies methods we have developed so far to a medical problem, gene expression in cancer. In some questions

More information

Single-cell sequencing

Single-cell sequencing Single-cell sequencing Harri Lähdesmäki Department of Computer Science Aalto University December 5, 2017 Contents Background & Motivation Single cell sequencing technologies Single cell sequencing data

More information

Syllabus for BIOS 101, SPRING 2013

Syllabus for BIOS 101, SPRING 2013 Page 1 Syllabus for BIOS 101, SPRING 2013 Name: BIOSTATISTICS 101 for Cancer Researchers Time: March 20 -- May 29 4-5pm in Wednesdays, [except 4/15 (Mon) and 5/7 (Tue)] Location: SRB Auditorium Background

More information

Cancer Classification using Support Vector Machines and Relevance Vector Machine based on Analysis of Variance Features

Cancer Classification using Support Vector Machines and Relevance Vector Machine based on Analysis of Variance Features Journal of Computer Science 7 (9): 1393-1399, 2011 ISSN 1549-3636 2011 Science Publications Cancer Classification using Support Vector Machines and Relevance Vector Machine based on Analysis of Variance

More information

In silico prediction of novel therapeutic targets using gene disease association data

In silico prediction of novel therapeutic targets using gene disease association data In silico prediction of novel therapeutic targets using gene disease association data, PhD, Associate GSK Fellow Scientific Leader, Computational Biology and Stats, Target Sciences GSK Big Data in Medicine

More information

Microarray analysis challenges.

Microarray analysis challenges. Microarray analysis challenges. While not quite as bad as my hobby of ice climbing you, need the right equipment! T. F. Smith Bioinformatics Boston Univ. Experimental Design Issues Reference and Controls

More information

Identifying Splice Sites Of Messenger RNA Using Support Vector Machines

Identifying Splice Sites Of Messenger RNA Using Support Vector Machines Identifying Splice Sites Of Messenger RNA Using Support Vector Machines Paige Diamond, Zachary Elkins, Kayla Huff, Lauren Naylor, Sarah Schoeberle, Shannon White, Timothy Urness, Matthew Zwier Drake University

More information

Estoril Education Day

Estoril Education Day Estoril Education Day -Experimental design in Proteomics October 23rd, 2010 Peter James Note Taking All the Powerpoint slides from the Talks are available for download from: http://www.immun.lth.se/education/

More information

Measuring and Understanding Gene Expression

Measuring and Understanding Gene Expression Measuring and Understanding Gene Expression Dr. Lars Eijssen Dept. Of Bioinformatics BiGCaT Sciences programme 2014 Why are genes interesting? TRANSCRIPTION Genome Genomics Transcriptome Transcriptomics

More information

A Hybrid Approach for Gene Selection and Classification using Support Vector Machine

A Hybrid Approach for Gene Selection and Classification using Support Vector Machine The International Arab Journal of Information Technology, Vol. 1, No. 6A, 015 695 A Hybrid Approach for Gene Selection and Classification using Support Vector Machine Jaison Bennet 1, Chilambuchelvan Ganaprakasam

More information

Methods of Biomaterials Testing Lesson 3-5. Biochemical Methods - Molecular Biology -

Methods of Biomaterials Testing Lesson 3-5. Biochemical Methods - Molecular Biology - Methods of Biomaterials Testing Lesson 3-5 Biochemical Methods - Molecular Biology - Chromosomes in the Cell Nucleus DNA in the Chromosome Deoxyribonucleic Acid (DNA) DNA has double-helix structure The

More information

Learning Methods for DNA Binding in Computational Biology

Learning Methods for DNA Binding in Computational Biology Learning Methods for DNA Binding in Computational Biology Mark Kon Dustin Holloway Yue Fan Chaitanya Sai Charles DeLisi Boston University IJCNN Orlando August 16, 2007 Outline Background on Transcription

More information

Identification of biological themes in microarray data from a mouse heart development time series using GeneSifter

Identification of biological themes in microarray data from a mouse heart development time series using GeneSifter Identification of biological themes in microarray data from a mouse heart development time series using GeneSifter VizX Labs, LLC Seattle, WA 98119 Abstract Oligonucleotide microarrays were used to study

More information