GeneQuery: A phenotype search tool based on gene co-expression clustering. Alexander Predeus 21-oct-2015


1 GeneQuery: A phenotype search tool based on gene co-expression clustering Alexander Predeus 21-oct-2015

2 About myself. Graduated from Moscow State University. PhD: Michigan State University, organometallic chemistry. Post-doc #1, MSU: quantitative biology (molecular dynamics). Post-doc #2, Wash U: next-generation sequencing, systems biology.

3 Outline. What is gene clustering? WGCNA: Mjölnir of clustering. What is gene set enrichment analysis? How can we find experiments biologically similar to ours? GEO database universe. Cmap perturbagen database and how it's useful to us.

4 Outline. What is gene clustering? WGCNA: Mjölnir of clustering. What is gene set enrichment analysis? How can we find experiments biologically similar to ours? GEO database universe. Cmap perturbagen database and how it's useful to us.

5 Clustering. The goal of any clustering is to group objects by similarity; thus clustering reveals the inner structure of the data. Two questions arise: how do you measure similarity, and how do you use that measure to find the groups?

6 Gene clustering. For gene expression you can cluster individual samples (columns). Samples from the same conditions are expected to cluster together (if not, suspect a batch effect!). You can also cluster genes (rows). Genes that are regulated in the same pathway tend to be co-expressed.

7 Ways to measure distance & cluster. Metrics can vary depending on the goal. Euclidean distance and correlation are commonly used. Euclidean distance is sensitive to scaling and average expression; correlation is not. Clustering approaches: a. groups b. hierarchical c. k-means d. SOM (self-organizing map).
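The difference between the two metrics can be sketched in a few lines. This is a minimal illustration, not code from the talk: the toy profiles `gene_a` and `gene_b` are made up, and the helper names are hypothetical. The second gene has the same profile shape as the first but a different scale and baseline.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_distance(x, y):
    """1 - r: zero for perfectly correlated profiles."""
    return 1.0 - pearson(x, y)

# Made-up profiles: gene_b is gene_a rescaled (x10) and shifted (+5).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0 * v + 5.0 for v in gene_a]

print("Euclidean distance:  ", euclidean(gene_a, gene_b))
print("Correlation distance:", correlation_distance(gene_a, gene_b))
```

The Euclidean distance is large because of the scale and baseline difference, while the correlation distance is essentially zero because the shape of the two profiles is identical.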

8 Real-life datasets are hard to cluster because they are messy. Outlier samples and genes are most often the problem; cluster shape is also important.

9 Outline. What is gene clustering? WGCNA: Mjölnir of clustering. What is gene set enrichment analysis? How can we find experiments biologically similar to ours? GEO database universe. Cmap perturbagen database and how it's useful to us.

10 Mjölnir. Mjölnir is Thor's hammer, which cannot be lifted by anyone who is not a god.

11 An astronomer's perspective: it's not necessarily magical, it could just be very heavy.

12 WGCNA algorithm 1. WGCNA stands for Weighted Gene Co-expression Network Analysis (also known as weighted correlation network analysis). It uses a Pearson correlation-based metric. As the name suggests, it is closely tied to the network analysis paradigm, and more concretely to the concept of a scale-free network. If we have m samples, $x_i$ is the vector of expression values of gene i (length m), and we are comparing genes i and j, then the similarity is $s_{ij} = |\mathrm{cor}(x_i, x_j)|$ and the unweighted adjacency is $a_{ij} = 1$ if $s_{ij} \ge \tau$ and $a_{ij} = 0$ otherwise. If τ (the cutoff correlation) is set to 0.8, no pair of genes with a correlation below that cutoff is considered adjacent.

13 WGCNA algorithm 2. Microarrays are noisy, so a hard cutoff is fragile. Instead, the weighted adjacency is $a_{ij} = s_{ij}^{\beta}$, where $s_{ij}$ is the correlation-based similarity of genes i and j and β > 1 (usually 6-20 in real applications). This approach is called soft thresholding. Gene significance can range from 0 to 1 and is defined via a clinical trait that can be quantitative (body weight) or qualitative (treatment vs. control). By calculating all $a_{ij}$ we have constructed an n x n adjacency matrix, where n is the number of genes.

14 Soft thresholding illustrated As the power beta increases, adjacency of lowly correlated genes becomes negligible
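A small numeric sketch of the two thresholding schemes, assuming the standard WGCNA definitions (hard: 1 if the similarity reaches the cutoff τ, else 0; soft: similarity raised to the power β). The function names and the sample similarity values are illustrative only.

```python
def hard_adjacency(s, tau=0.8):
    """Unweighted adjacency: 1 if similarity s reaches the cutoff tau, else 0."""
    return 1.0 if s >= tau else 0.0

def soft_adjacency(s, beta=6):
    """Weighted adjacency: similarity raised to the power beta."""
    return s ** beta

# Compare both schemes over a range of similarities (illustrative values).
for s in (0.2, 0.5, 0.79, 0.8, 0.95):
    row = [hard_adjacency(s)] + [round(soft_adjacency(s, b), 4) for b in (1, 6, 12)]
    print(s, row)
```

With the hard cutoff, similarities of 0.79 and 0.8 end up on opposite sides of the threshold even though they are nearly identical. With the soft threshold they receive nearly identical weights, while weakly correlated pairs shrink toward zero as β grows.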

15 What does it have to do with a network? If two genes are adjacent, that can be represented as a connection (an edge) between them; if not, there is no connection. The adjacency matrix is thus equivalent to a network.

16 Why bring the network into this at all? A scale-free network is defined as one for which the probability of a node having k connections decays as a power law: $p(k) \sim k^{-\gamma}$. Scale-free topology is a philosophical phenomenon.

17 WGCNA algorithm 3. So we want our network to be scale-free; how do we achieve it? First, we calculate the connectivities: $k_i = \sum_{j \neq i} a_{ij}$. Then we simply vary β in the range from 1 to 20, calculate the distribution p(k) of the connectivities for each β, and see how linear the log(p(k)) vs. log(k) plot is (as measured by R-squared). We want the fit to be very close to linear, because a scale-free network has $p(k) \sim k^{-\gamma}$, i.e. $\log p(k) = -\gamma \log k + \mathrm{const}$.
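The β scan can be sketched as follows. This is a simplified illustration under stated assumptions, not the WGCNA package implementation: connectivities are grouped into equal-width bins (10 bins is an arbitrary choice here), the fit is ordinary least squares on the log-log points, and the sign of the slope is not checked. The names `connectivity`, `scale_free_r2`, and `pick_beta` are hypothetical.

```python
import math

def connectivity(adj):
    """k_i = sum over j != i of a_ij."""
    return [sum(a for j, a in enumerate(row) if j != i)
            for i, row in enumerate(adj)]

def scale_free_r2(k, n_bins=10):
    """R^2 of a straight-line fit of log10 p(k) against log10 k."""
    lo, hi = min(k), max(k)
    if hi == lo:
        return 0.0  # all genes equally connected: nothing to fit
    width = (hi - lo) / n_bins
    counts, sums = [0] * n_bins, [0.0] * n_bins
    for v in k:
        b = min(int((v - lo) / width), n_bins - 1)
        counts[b] += 1
        sums[b] += v
    # One point per non-empty bin: (log mean connectivity, log frequency).
    pts = [(math.log10(s / c), math.log10(c / len(k)))
           for c, s in zip(counts, sums) if c > 0 and s > 0]
    if len(pts) < 2:
        return 0.0
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    syy = sum((y - my) ** 2 for _, y in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def pick_beta(similarity, betas=range(1, 21), r2_target=0.8):
    """First beta whose soft-thresholded network reaches the target R^2, else None."""
    for beta in betas:
        adj = [[s ** beta for s in row] for row in similarity]
        if scale_free_r2(connectivity(adj)) >= r2_target:
            return beta
    return None

# Synthetic connectivities whose frequencies follow k^-2 (made-up data).
degrees = [k for k in range(1, 11) for _ in range(round(1000 / k ** 2))]
print("R^2 of the log-log fit:", round(scale_free_r2(degrees), 3))
```

On the synthetic power-law connectivities the log-log fit is close to a straight line, which is the situation the scan over β is looking for.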

18 WGCNA algorithm 4. So, we chose the β that gives us an R-squared of at least 0.8, i.e. we have constructed a network. What now? We identify modules using TOM, the topological overlap measure. When two genes (nodes) connect to the same large group of other nodes, they have high topological overlap.

19 WGCNA algorithm 5. The TOM can thus be represented as a matrix with the same dimensions as the adjacency matrix, n x n. The TOM-based dissimilarity measure is then $d_{ij} = 1 - \omega_{ij}$, where $\omega_{ij}$ is the topological overlap of genes i and j. Using this dissimilarity measure as the metric, we perform hierarchical clustering.
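A minimal sketch of the topological overlap computation, assuming the standard WGCNA definition $\omega_{ij} = \frac{\sum_{u \neq i,j} a_{iu} a_{uj} + a_{ij}}{\min(k_i, k_j) + 1 - a_{ij}}$ with $k_i = \sum_{u \neq i} a_{iu}$. The formula on the original slide did not survive transcription, so this definition is an assumption. The function names and the four-node toy network are illustrative.

```python
def topological_overlap(adj):
    """TOM matrix for a symmetric adjacency matrix with entries in [0, 1]."""
    n = len(adj)
    k = [sum(adj[i][j] for j in range(n) if j != i) for i in range(n)]
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Shared neighbours: paths i - u - j through any third node u.
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u != i and u != j)
            w = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
            tom[i][j] = tom[j][i] = w
    return tom

def tom_dissimilarity(adj):
    """d_ij = 1 - omega_ij, usable as a distance for hierarchical clustering."""
    return [[1.0 - w for w in row] for row in topological_overlap(adj)]

# Toy network: nodes 0 and 1 are not connected to each other,
# but both connect to nodes 2 and 3.
adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
for row in topological_overlap(adj):
    print([round(w, 3) for w in row])
```

Nodes 0 and 1 share all of their neighbours, so their overlap is high even though they are not directly connected. This is the property that makes TOM more robust than raw adjacency for module detection.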

20 Final touch: dynamic and hybrid dynamic tree cut. Notice the presence of the null module, which de facto consists of genes rejected from all modules.

21 Eigengenes. An eigengene is the first principal component of a module's expression matrix. Eigengene expression can be used as a measure of how module expression changes across the samples.
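The eigengene computation can be sketched without any numerical library by using power iteration. This is an illustrative sketch under several assumptions: each gene is standardized across samples first, no gene is constant, the iteration count is arbitrary, and the sign is chosen so that the eigengene follows the average profile. The function names and the toy module are hypothetical.

```python
import math

def standardize(row):
    """Center a gene's profile and scale it to unit sample standard deviation."""
    n = len(row)
    mean = sum(row) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in row) / (n - 1))
    return [(v - mean) / sd for v in row]

def eigengene(module_expr, n_iter=200):
    """First principal component (across samples) of a genes x samples matrix."""
    z = [standardize(row) for row in module_expr]
    n_samples = len(z[0])
    # Start from one gene's profile; an all-ones start would be
    # annihilated because every standardized row sums to zero.
    v = z[0][:]
    for _ in range(n_iter):
        scores = [sum(a * b for a, b in zip(row, v)) for row in z]
        v = [sum(scores[g] * z[g][s] for g in range(len(z)))
             for s in range(n_samples)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    # Orient the eigengene so it correlates positively with the mean profile.
    mean_profile = [sum(col) / len(z) for col in zip(*z)]
    if sum(a * b for a, b in zip(v, mean_profile)) < 0:
        v = [-x for x in v]
    return v

# Made-up module: four genes measured in four samples.
module = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [10.0, 11.0, 12.0, 13.0],
    [1.0, 2.0, 4.0, 3.0],
]
print([round(x, 3) for x in eigengene(module)])
```

The result is one value per sample, which can be plotted or correlated with a trait in place of the individual genes of the module.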

22 Outline. What is gene clustering? WGCNA: Mjölnir of clustering. What is gene set enrichment analysis? How can we find experiments biologically similar to ours? GEO database universe. Cmap perturbagen database and how it's useful to us. The wonderful story of ciclopirox.

23 Hypergeometric probability and gene sets. Consider a protein-coding gene repertoire of ~20k genes and a signaling pathway X containing 100 genes. Differential expression (DE): 200 genes are up-regulated. How many genes from pathway X would be included in our 200 simply by chance? What is the p-value of having 50 or more?

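24 Use the mathematically identical model of drawing without replacement. The exact solution is known as Fisher's exact test: $P(X = k) = \binom{K}{k}\binom{N-K}{n-k} / \binom{N}{n}$, where N is the size of the gene repertoire, K the number of genes in the pathway, n the number of up-regulated genes, and k the size of the overlap.

25 We want the right-sided p-value, i.e. the probability of an overlap at least as large as the one observed.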
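The example from the slides (about 20k genes, a 100-gene pathway, 200 up-regulated genes, 50 or more in the overlap) can be worked through with exact integer arithmetic. This is a sketch using only the Python standard library; the function name is hypothetical, and the repertoire size is taken as exactly 20000 for the sake of the example.

```python
from fractions import Fraction
from math import comb

def hypergeom_right_tail(universe, pathway, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `universe` genes, of which `pathway` belong to the pathway."""
    total = comb(universe, selected)
    upper = min(pathway, selected)
    tail = sum(comb(pathway, k) * comb(universe - pathway, selected - k)
               for k in range(overlap, upper + 1))
    return float(Fraction(tail, total))

# Expected overlap by chance: n * K / N.
expected = 200 * 100 / 20000
print("expected overlap by chance:", expected)

# Right-sided p-value of seeing 50 or more pathway genes among the 200.
p = hypergeom_right_tail(20000, 100, 200, 50)
print("p-value for an overlap of 50 or more:", p)
```

By chance we expect a single pathway gene among the 200, so an overlap of 50 is extremely unlikely under the null model, and the right-sided p-value is correspondingly tiny.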

26 MSigDB. MSigDB is similar, but takes a limited gene signature and returns standard gene signatures ranked by FDR.

27 What about broader scope? We wanted something that can look at all known expression datasets - without tedious/impossible manual curation

28 Outline. What is gene clustering? WGCNA: Mjölnir of clustering. What is gene set enrichment analysis? How can we find experiments biologically similar to ours? GEO database universe. Cmap perturbagen database and how it's useful to us.

29 Proposed tool. Input: differential gene expression (RNA-Chip, RNA-Seq, exome sequencing) -> gene selection -> gene list. Reference: massive database (GEO) of expression experiments (matrices) -> each independently clustered by the clustering algorithm -> static database of clustered expression matrices, i.e. clusters (aka modules). Search algorithm: find overlaps between the gene list and the modules.
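The "find overlaps" step can be sketched as a loop over pre-computed modules, scoring each overlap with a right-sided hypergeometric p-value. This is an assumption about how such a search could work, not the actual GeneQuery implementation. The dataset and gene identifiers are invented, a single shared gene universe is assumed for simplicity, and no multiple-testing adjustment is applied.

```python
from fractions import Fraction
from math import comb

def right_tail_p(universe, module_size, query_size, overlap):
    """Right-sided hypergeometric p-value for an overlap of at least `overlap`."""
    total = comb(universe, query_size)
    upper = min(module_size, query_size)
    tail = sum(comb(module_size, k) * comb(universe - module_size, query_size - k)
               for k in range(overlap, upper + 1))
    return float(Fraction(tail, total))

def search_modules(query, modules, universe):
    """Rank modules by the significance of their overlap with the query list."""
    query = set(query) & universe
    hits = []
    for module_id, genes in modules.items():
        overlap = len(query & genes)
        if overlap == 0:
            continue
        p = right_tail_p(len(universe), len(genes), len(query), overlap)
        hits.append((p, module_id, overlap))
    return sorted(hits)

# Hypothetical toy database: (dataset, module number) -> set of gene IDs.
universe = {f"g{i}" for i in range(100)}
modules = {
    ("dataset_A", 1): {f"g{i}" for i in range(0, 10)},
    ("dataset_A", 2): {f"g{i}" for i in range(10, 40)},
    ("dataset_B", 1): {f"g{i}" for i in range(5, 15)},
}
query = [f"g{i}" for i in range(8)] + ["g50"]

for p, module_id, overlap in search_modules(query, modules, universe):
    print(module_id, "overlap:", overlap, "p-value:", p)
```

Modules with no overlap are skipped, and the remaining ones are returned with the most significant overlap first.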

30 Test Set 2: Mouse samples
Two separate databases were assembled (Homo sapiens and Mus musculus).
Using GEO statistics, the most-used platforms were selected.
Search performed using the following criteria:
  expression profiling by array
  Mus musculus
  platforms listed below
  12:100 samples
  Years of publication:
results returned, saved as a meta-file, downloaded
1529 preprocessed CSV files obtained after running the pre-processing script
1496 successfully completed clustering and were usable in the database

Platform ID  Manufacturer  Type  Number of sets
GPL1261      Affymetrix    ISO   681
GPL6246      Affymetrix    ISO   317
GPL8321      Affymetrix    ISO   107
GPL339       Affymetrix    ISO   33
GPL81        Affymetrix    ISO   28
GPL7202      Agilent       ISO   60
GPL4134      Agilent       ISO   44
GPL6887      Illumina      OB    142
GPL6885      Illumina      OB    84

31 Test Set 2: Human samples
Using GEO statistics, the most-used platforms were selected.
Search performed using the following criteria:
  expression profiling by array
  Homo sapiens
  platforms listed below
  12:100 samples
  Years of publication:
results returned, saved as a meta-file, downloaded
2177 preprocessed CSV files obtained after running the pre-processing script
2110 successfully completed clustering and were usable in the database

Platform ID  Manufacturer  Type  Number of sets
GPL570       Affymetrix    ISO   982
GPL6244      Affymetrix    ISO   291
GPL96        Affymetrix    ISO   170
GPL571       Affymetrix    ISO   125
GPL8300      Affymetrix    ISO   17
GPL4133      Agilent       ISO   83
GPL6480      Agilent       ISO   53
GPL10558     Illumina      OB    149
GPL6947      Illumina      OB    130
GPL6884      Illumina      OB    57
GPL6883      Illumina      OB    53

32 Eigengene Expression Modules Normalize per-module expression

33 Do we need to adjust for multiple comparisons? Yes.

34 Distributions. The distributions are fairly close to normal, so we use the normal distribution to adjust the p-value.

35 Linear regressions for p-values
Linear regressions were used to calculate adjusted p-values on the fly for gene sets of arbitrary size.

Database  Average  Standard deviation
mm_2k     *x       *x
mm_4k     *x       *x
hs_2k     *x       *x
hs_4k     *x       *x

36 Website features. The website is operational! Waiting time is 20 s to 1 min; tested with queries of up to 1.5k genes. A human database (~5k experiments) and a mouse database (~3.5k experiments) are available. You can enter a gene list in the form of gene symbols, RefSeq IDs, or Entrez IDs.

37 Output

38 Example 1: M2 (IL-4 activated) macrophage-specific genes in mice

39 Example 2: M1 (LPS activated) macrophage-specific genes in mice

40 Outline. What is gene clustering? WGCNA: Mjölnir of clustering. What is gene set enrichment analysis? How can we find experiments biologically similar to ours? GEO database universe. Cmap perturbagen database and how it's useful to us.

41 GEO Expression Universe Human network dominated by small clusters, murine - by tissue-specific large ones

42 Outline. What is gene clustering? WGCNA: Mjölnir of clustering. What is gene set enrichment analysis? How can we find experiments biologically similar to ours? GEO database universe. Cmap perturbagen database and how it's useful to us.

43 Connectivity Map (Cmap). The Connectivity Map (Cmap) resource is available online. Drugs were used to treat 3 human cell lines, and the resulting perturbation of gene expression was measured. It allows one to connect a given gene signature to an appropriate drug.

44 Cmap vs. GeneQuery. The results were impressive: for over 95% of drugs (1245 out of 1303) the up-regulated gene set, and for 93% (1219 out of 1303) the down-regulated gene set, overlapped at least one module! Many matched expected phenotypes, while some matches were unexpected. This shows good potential for drug repurposing.

45 Digoxin. Digoxin is a cardiac glycoside (causes heart muscle contraction). It is used to treat heart conditions such as atrial fibrillation and heart failure.

46 Digoxin and GeneQuery. Distinct overlaps with many modules, implying interference with TLR4 but excluding NF-κB pathways. Example: genes up-regulated upon infection in a cell line with a disabled OspF gene (necessary for NF-κB signaling).

47 Newsflash: digoxin as prospective ALS drug! Last month in Wash U Record newspaper Thought to reduce cytokine release via Na/K ATPase inhibition

48 Reported link between digoxin and Th17

49 Ciclopirox. Topical antifungal. Known iron chelator; mimics hypoxia via HIF-1α up-regulation. Currently in clinical studies for anti-tumor activity.

50 Ciclopirox and GeneQuery. We see overlaps with hypoxia, cancerous tumors, and inflammatory phenotypes.

51 Ciclopirox and GeneQuery. We see overlaps with hypoxia, cancerous tumors, and inflammatory phenotypes.

52 Ciclopirox and GeneQuery. We see overlaps with hypoxia, cancerous tumors, and inflammatory phenotypes.

53 Testing the hypothesis. Cultures of bone-marrow-derived macrophages (BMDMs) were treated with ciclopirox. We then compared the LPS response of treated and untreated BMDMs. ELISA assays confirmed up-regulation of IL-1β and down-regulation of IL-6.

54 As you can see, it's all quite simple.

55 Thank you for your attention!