CS 5984: Application of Basic Clustering Algorithms to Find Expression Modules in Cancer

CS 5984: Application of Basic Clustering Algorithms to Find Expression Modules in Cancer
T. M. Murali
January 31, 2006

Innovative Application of Hierarchical Clustering
Paper: "A module map showing conditional activity of expression modules in cancer", Eran Segal, Nir Friedman, Daphne Koller and Aviv Regev, Nature Genetics 36, 1090-1098, 2004.
Analyse gene expression data to find groups of genes expressed in concert across different cancers.
Use hierarchical clustering innovatively.

Key Ideas
Group genes into predefined gene sets, e.g., groups of genes with the same functional annotation.
Convert the gene-by-array matrix into a gene-set-by-array matrix.
Hierarchically cluster the gene sets in this matrix.
Identify interesting gene set clusters (nodes) in the tree.
In each gene set cluster, remove genes not expressed consistently with the cluster.

Gene Expression Data Sets

Data Normalisation
Needed because some arrays measure absolute values of gene expression and others measure relative values.
Affymetrix microarrays: take the base-2 logarithm and zero-transform within each data set.
cDNA microarrays: zero-transform within each data set.
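The two normalisation recipes can be sketched in a few lines of NumPy. The slide does not define "zero transform"; this sketch assumes it means subtracting each gene's mean across the arrays of one data set, so that every gene has mean zero. The function names and the genes-by-arrays layout are illustrative choices, not part of the lecture.

```python
import numpy as np

def normalise_affymetrix(intensities):
    """Base-2 logarithm of absolute intensities, then zero-transform.

    Assumption: "zero transform" = subtract each gene's (row's) mean
    across the arrays (columns) of this data set.
    """
    logged = np.log2(np.asarray(intensities, dtype=float))
    return logged - logged.mean(axis=1, keepdims=True)

def normalise_cdna(log_ratios):
    """cDNA arrays already report relative values, so only zero-transform."""
    ratios = np.asarray(log_ratios, dtype=float)
    return ratios - ratios.mean(axis=1, keepdims=True)

# Two genes measured on two Affymetrix arrays.
example = normalise_affymetrix([[1.0, 4.0], [2.0, 8.0]])
```

After this step every row sums to zero, so values from both platforms can be read as changes relative to a gene's typical level in its own data set.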

Pre-defined Gene Sets

Computing Gene-Set-By-Array Matrix
Goal is to construct a gene-set-by-array matrix.
For each gene set-array pair, find an average expression value of that gene set in that array.
A gene is induced (respectively, repressed) in an array if its change in expression is $\geq 2$ (respectively, $\leq -2$).
For each gene set-array pair, compute the fraction of genes induced or repressed.
Use these values in the gene-set-by-array matrix.
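As a rough illustration, the conversion from a gene-by-array matrix to a gene-set-by-array matrix might look like the sketch below. It assumes the thresholds are +2 and -2 on the normalised values, stores repression as a negative fraction (the convention the hierarchical clustering slide states), and keeps whichever of the two fractions is larger for each entry. That last rule is an assumption, since the slides do not say how the two fractions are combined. `gene_set_by_array` and its arguments are hypothetical names.

```python
import numpy as np

def gene_set_by_array(expr, gene_sets, threshold=2.0):
    """Build the gene-set-by-array matrix of signed fractions.

    expr      : genes x arrays matrix of expression changes
    gene_sets : dict mapping a gene-set name to a list of row indices
    Returns (names, matrix); matrix[s, a] > 0 is the fraction of genes
    induced, matrix[s, a] < 0 is minus the fraction repressed.
    """
    expr = np.asarray(expr, dtype=float)
    induced = expr >= threshold
    repressed = expr <= -threshold
    names = sorted(gene_sets)
    matrix = np.zeros((len(names), expr.shape[1]))
    for row, name in enumerate(names):
        members = list(gene_sets[name])
        frac_induced = induced[members].mean(axis=0)
        frac_repressed = repressed[members].mean(axis=0)
        # Assumption: keep the dominant direction for each array.
        matrix[row] = np.where(frac_induced >= frac_repressed,
                               frac_induced, -frac_repressed)
    return names, matrix

expr = np.array([[2.5, -3.0],
                 [0.1, -2.0],
                 [3.0,  0.0]])
names, matrix = gene_set_by_array(expr, {"set_A": [0, 1]})
```

In the toy input, half of `set_A` is induced in the first array and all of it is repressed in the second, so the single row of `matrix` is 0.5 and -1.0.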

Computing Significant Entries in the Gene-Set-By-Array Matrix
Many entries in the gene-set-by-array matrix may not be statistically significant.
For a given array, the fraction of induced genes in a gene set may be close to the fraction of induced genes in the array.
Statistical test: for a given array, is the fraction of induced genes in a gene set much larger than the fraction of induced genes in the entire array?
Compute the p-value of the fraction using the hypergeometric test. Do so for every gene-set-array pair.
Use false discovery rate correction to account for multiple hypothesis testing.
Replace insignificant entries by 0.
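The slide does not name a particular false discovery rate procedure. As one common choice, the sketch below assumes the Benjamini-Hochberg step-up rule and then zeroes the entries that fail it. The function names and the 0.05 default are illustrative assumptions.

```python
import numpy as np

def benjamini_hochberg(p_values, fdr=0.05):
    """Boolean mask of p-values kept by the Benjamini-Hochberg rule."""
    shape = np.shape(p_values)
    p = np.asarray(p_values, dtype=float).ravel()
    n = p.size
    order = np.argsort(p)
    ranked = p[order]
    # The k-th smallest p-value is compared against fdr * k / n.
    thresholds = fdr * np.arange(1, n + 1) / n
    passing = np.nonzero(ranked <= thresholds)[0]
    significant = np.zeros(n, dtype=bool)
    if passing.size:
        # Keep everything up to the largest rank that passes.
        significant[order[: passing[-1] + 1]] = True
    return significant.reshape(shape)

def zero_insignificant(matrix, p_values, fdr=0.05):
    """Replace entries whose p-value fails the FDR rule by 0."""
    mask = benjamini_hochberg(p_values, fdr)
    return np.where(mask, np.asarray(matrix, dtype=float), 0.0)

filtered = zero_insignificant([[0.5, -1.0], [0.2, 0.3]],
                              [[0.001, 0.04], [0.03, 0.5]])
```

Here all gene-set-array pairs are corrected together; correcting separately per array would be an equally plausible reading of the slide.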

Computing the Significance of an Entry in the Gene-Set-By-Array Matrix
Let $m$ be the number of genes in the data set and let $m_G$ be the number of genes in a gene set $G$.
Let $u_a$ be the number of induced genes in an array $a$ and let $u_{G,a}$ be the number of genes in $G$ induced in $a$.
Informally, $u_{G,a}/m_G \approx u_a/m$ is not statistically significant.
Formally, what is the probability that if we pick $m_G$ genes at random from $m$ genes, we will select $u_{G,a}$ or more that are induced in $a$?
$$\sum_{i \geq u_{G,a}} \frac{\binom{u_a}{i}\binom{m - u_a}{m_G - i}}{\binom{m}{m_G}}$$
If this probability is at most a user-specified threshold, we deem that entry to be statistically significant.
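The hypergeometric tail probability can be evaluated directly from its definition with exact integer arithmetic. The sketch below is a minimal illustration; the function name and argument names (`m`, `m_G`, `u_a`, `u_Ga`) simply mirror the symbols on the slide.

```python
from math import comb

def hypergeom_tail(m, m_G, u_a, u_Ga):
    """P(at least u_Ga induced genes among m_G genes drawn from m).

    m    : number of genes in the data set
    m_G  : number of genes in the gene set G
    u_a  : number of genes induced in array a
    u_Ga : number of genes of G induced in array a
    """
    upper = min(m_G, u_a)  # cannot draw more induced genes than exist
    numerator = sum(comb(u_a, i) * comb(m - u_a, m_G - i)
                    for i in range(u_Ga, upper + 1))
    return numerator / comb(m, m_G)

# 10 genes, 4 induced in the array; a 3-gene set contains 2 induced genes.
p = hypergeom_tail(m=10, m_G=3, u_a=4, u_Ga=2)
```

For the toy numbers the sum is (36 + 4) / 120 = 1/3, so two induced genes out of three would not be called significant at any usual threshold.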

Hierarchical Clustering
Start from a gene-set-by-array matrix containing the fraction of induced/repressed genes. The fraction is negative if repressed.
Apply bottom-up hierarchical clustering.
The vector at an internal node is the average of the vectors at its descendant leaves.
Which nodes do we select as clusters in the tree?
Associate each interior node with the Pearson correlation between its two children.
Select as a cluster each node whose Pearson correlation differs by more than 0.05 from the Pearson correlation of its parent.
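A small, unoptimised sketch of this procedure follows. It assumes that the pair of nodes with the highest Pearson correlation is merged at each step, that no row of the matrix is constant (otherwise the correlation is undefined), and that the root, having no parent, is never selected. These details and all names are illustrative assumptions; the slides fix only the averaging rule and the 0.05 criterion.

```python
import numpy as np

def cluster_gene_sets(matrix):
    """Bottom-up clustering; returns a list of tree nodes (leaves first)."""
    matrix = np.asarray(matrix, dtype=float)
    nodes = [{"leaves": [i], "corr": None, "parent": None}
             for i in range(len(matrix))]
    active = list(range(len(matrix)))
    while len(active) > 1:
        best = None
        for pos, i in enumerate(active):
            for j in active[pos + 1:]:
                # Node vector = average of the vectors at its descendant leaves.
                vi = matrix[nodes[i]["leaves"]].mean(axis=0)
                vj = matrix[nodes[j]["leaves"]].mean(axis=0)
                r = float(np.corrcoef(vi, vj)[0, 1])
                if best is None or r > best[0]:
                    best = (r, i, j)
        r, i, j = best
        nodes.append({"leaves": nodes[i]["leaves"] + nodes[j]["leaves"],
                      "corr": r, "parent": None})
        new = len(nodes) - 1
        nodes[i]["parent"] = nodes[j]["parent"] = new
        active = [k for k in active if k not in (i, j)] + [new]
    return nodes

def select_clusters(nodes, min_gap=0.05):
    """Indices of interior nodes whose correlation differs from the parent's."""
    selected = []
    for idx, node in enumerate(nodes):
        if node["corr"] is None or node["parent"] is None:
            continue  # skip leaves and the root
        if abs(node["corr"] - nodes[node["parent"]]["corr"]) > min_gap:
            selected.append(idx)
    return selected

toy = [[1.0, 0.0, -1.0, 0.5],
       [0.9, 0.1, -1.0, 0.4],
       [-1.0, 0.5, 1.0, 0.0],
       [-0.8, 0.6, 1.0, -0.1]]
tree = cluster_gene_sets(toy)
clusters = [sorted(tree[i]["leaves"]) for i in select_clusters(tree)]
```

On the toy matrix the first two rows and the last two rows form tight pairs that are negatively correlated with each other, so the two pair nodes are selected.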

Turning Clusters into Modules
Each cluster is the union of its descendant gene sets (leaves).
Module = cluster minus the genes whose expression is not consistent with the rest of the cluster.

Testing Consistency of a Gene with a Gene Set
Let $g$ be the gene and $G$ be the gene set.
Let $I$ (respectively, $R$) be the set of arrays in which $G$ is significantly induced (respectively, repressed).
For an array $a$ in $I$ (or $R$), let $p_a$ be the fraction of genes that are induced (or repressed) by two-fold or more in $a$.
Measure the extent to which $g$'s expression changed by more (or less) than two-fold in the arrays in $I$ (or $R$):
$$\mathrm{Score}(g) = \sum_{\substack{a \in I \\ g \text{ is induced in } a}} -\log(p_a) + \sum_{\substack{a \in R \\ g \text{ is repressed in } a}} -\log(p_a)$$
No contribution from an array in $I$ (or $R$) if $g$ is not induced (or not repressed) in $a$.
Larger contribution from arrays with fewer induced genes.
Compute the statistical significance of this score.
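The score can be computed with two short loops. The sketch assumes each term is $-\log(p_a)$ with a natural logarithm, which matches the remark that arrays with fewer induced genes contribute more. It also assumes $I$ and $R$ are disjoint, so that one dictionary can hold $p_a$ for both. The function and argument names are hypothetical.

```python
import math

def consistency_score(I, R, induced_in, repressed_in, p):
    """Score(g) for one gene g.

    I, R         : arrays in which the gene set is significantly
                   induced / repressed
    induced_in   : set of arrays in which g itself is induced
    repressed_in : set of arrays in which g itself is repressed
    p            : dict array -> p_a (fraction of genes induced for
                   arrays in I, fraction repressed for arrays in R)
    """
    score = 0.0
    for a in I:
        if a in induced_in:       # g agrees with the gene set in a
            score += -math.log(p[a])
    for a in R:
        if a in repressed_in:
            score += -math.log(p[a])
    return score

p = {"a1": 0.1, "a2": 0.5, "a3": 0.2}
score = consistency_score(I=["a1", "a2"], R=["a3"],
                          induced_in={"a1"}, repressed_in={"a3"}, p=p)
```

In the example, array `a2` contributes nothing because the gene is not induced there, and `a1` contributes more than `a3` because fewer genes are induced in it.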

Computing Statistical Significance of Score(g)
$$\mathrm{Score}(g) = \sum_{\substack{a \in I \\ g \text{ is induced in } a}} -\log(p_a) + \sum_{\substack{a \in R \\ g \text{ is repressed in } a}} -\log(p_a)$$
Null hypothesis: genes in each array are randomly permuted, i.e., the fraction $p_a$ of induced genes in an array $a \in I$ is chosen randomly.
Each term in $\mathrm{Score}(g)$ is an independent binary random variable that takes the value $-\log(p_a)$ with probability $p_a$ and the value 0 with probability $1 - p_a$.
The mean of $\mathrm{Score}(g)$ is $-\sum_{a \in I \cup R} p_a \log p_a$ and the variance is $\sum_{a \in I \cup R} p_a (1 - p_a) \log^2 p_a$.
By the central limit theorem, the distribution of $\mathrm{Score}(g)$ is well-approximated by a Gaussian distribution with this mean and variance.
Assess statistical significance by computing the tail of this Gaussian.
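Under the Gaussian approximation, the significance of a score reduces to a z-score and an upper-tail probability. The sketch below assumes natural logarithms and a one-sided test in which large scores are the interesting ones. Both function names are illustrative.

```python
import math

def null_moments(p_values):
    """Mean and variance of Score(g) under the null hypothesis.

    p_values : the p_a of every array in the union of I and R
    """
    p_values = list(p_values)
    mean = sum(-p * math.log(p) for p in p_values)
    variance = sum(p * (1.0 - p) * math.log(p) ** 2 for p in p_values)
    return mean, variance

def gaussian_tail(score, mean, variance):
    """P(X >= score) for a Gaussian X with the given mean and variance."""
    z = (score - mean) / math.sqrt(variance)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

mean, variance = null_moments([0.1, 0.5, 0.2])
p_at_mean = gaussian_tail(mean, mean, variance)
p_high = gaussian_tail(math.log(50.0), mean, variance)
```

A score equal to the null mean gives a tail probability of 0.5, and larger scores give smaller probabilities. With only a handful of arrays the Gaussian approximation is rough, so this is best read as an illustration of the calculation rather than a calibrated test.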

Further Analysis
Statistical significance of computed modules using leave-one-out cross validation (read supplement).
Compute enrichment of clinical annotations of the arrays in a module.
Visualisation of modules.
Literature-based analysis of modules.

Conclusions
Used pre-defined gene sets to drive the hierarchical clustering algorithm.
Removed genes from a cluster of gene sets if the gene's expression profile deviates from the cluster.
Automatically decided which arrays are part of a module.
Natural segue into lectures on biclustering, where we will automatically decide which arrays and which genes to include in a bicluster.