User Guide. MAGNET: MicroArray & RNAseq Gene expression Network Evaluation Toolkit


Case Western Reserve University, February 2012


1 - Introduction

This section will introduce MAGNET: MicroArray Gene expression and Network Evaluation Toolkit, developed at Case Western Reserve University's Center for Proteomics and Bioinformatics.

1.1 - Microarray Gene Expression and RNA Sequencing Count

Microarray Gene Expression is a high-throughput technique employed by researchers to find mRNA expression levels in a given sample. RNA Sequencing Count (RNAseq) is a rapidly adopted technique to measure the relative or absolute mRNA expression in a cell. The data generated from these experiments can be used to find the mRNA expression levels of tens of thousands of genes in a single experiment. This expression data can guide future research, serve as a validation technique, or support a variety of other purposes.

The Gene Expression Omnibus (GEO) at the National Center for Biotechnology Information (NCBI) is the largest public repository for high-throughput gene expression data. RNAseq data for many cancers can be found at The Cancer Genome Atlas project (TCGA). These databases host and freely disseminate high-throughput gene expression data generated and submitted by the research community using high-throughput technologies.

When researchers have obtained GEO data, as described in this tutorial, they are often faced with the task of analyzing the data, and often of integrating it with additional omics datasets. MAGNET is a toolset that allows users to analyze their expression data and draw meaningful conclusions from it. Additionally, MAGNET houses TCGA data on its server, allowing researchers to easily select and analyze a cancer using TCGA RNAseq data.

1.2 - MAGNET Services

MAGNET is a bioinformatics toolbox that offers three different services pertaining to microarray expression data analysis:

1. Generate a coexpression network between genes in a given microarray experiment
2. Generate a weighted protein-protein interaction network (PPIN), integrating various sources of high-throughput data
3. Find the bimodality of coexpression between two lists of genes, given a microarray experiment (hosted at bimodality.case.edu)

1.3 - MAGNET Features

- Open to the research community at magnet.case.edu
- Queuing system for load management
- Optimized for large data files, handles memory effectively
- Processor-intensive functions written in R, for greater speed
- Logging system, with AJAX frontend
- Results can be viewed on the site in a variety of formats and optionally emailed to the user
- Extensive documentation and tutorial

2 - Services

As described in the previous section, MAGNET offers three services to the research community, all pertaining to microarray expression data analysis. These services are described below.

2.1 - Correlation Matrix

Microarray data provides a researcher with an enormous amount of data that, at the most basic level, indicates a gene's level of expression. It is often useful for a researcher to find genes that are "coexpressed." Coexpression is the measure of correlation between two genes' expression data over at least 2 samples. For example, if APC has high expression in the samples where SRC has high expression, then the two genes are "coexpressed." The opposite is also meaningful: if APC has high expression whenever SRC has low expression, then the two genes are "differentially expressed."

Measuring the correlation between many genes can result in a "coexpression network." In a coexpression network, the correlation of every gene with every other gene is calculated, and the most notable correlations produce a network, where coexpressed or differentially expressed genes are connected, signifying a hypothesized relationship.

MAGNET calculates the correlation by using either Pearson's Product-Moment Correlation Coefficient or Spearman's Rank Correlation Coefficient. Both follow the formula below, except that in Spearman's, the expression data is ranked.
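For reference, the standard definition of Pearson's coefficient, which is presumably what the formula here shows, can be written as follows. The sample index i and the sample count n are notation assumed for this sketch; for Spearman's coefficient the same expression is applied to the ranks of X and Y rather than to the raw values.

```latex
r_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
              {\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}},
\qquad
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i,\quad
\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i
```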

Where X and Y represent the expression of gene 1 over a set of samples, and gene 2 over the same samples, respectively. Both correlation coefficients range from -1.0 to +1.0, where -1.0 represents the maximum negative correlation (differentially expressed) and +1.0 represents the maximum positive correlation (coexpression).

MAGNET allows a user to simply submit their expression data (described in detail here) and a gene list (if necessary), and let the system generate a correlation matrix from all genes specified by the user to all other genes specified by the user. Below is a guide to this process:

1. Navigate to the MAGNET homepage (magnet.case.edu), and click "Submit Job," under the header of "Generate Coexpression Matrix."
2. Submit job.
   a. Upload data in one of three ways:
      i. Using the dropdown box, select a type of cancer to use high-throughput TCGA data for that specific type of cancer.
      ii. Upload a GSE and GPL file in an accepted format for MAGNET to analyze the Microarray Gene Expression.
      iii. Upload an RNA Sequencing Count matrix.
   b. Specify the threshold for coexpression. This allows you to filter your coexpression matrix; otherwise you will have N^2 correlations for N genes. If you would like to filter out all correlations that are between -0.6 and 0.6, you would specify Less than: 0.6 and Greater than: -0.6. Or, leave it blank to include all correlations.
   c. Leave the gene list blank if you would like to calculate the correlation from all genes in an array to all other genes in the array. Uploading a gene list is encouraged, as calculating the correlation of tens of thousands of genes vs. tens of thousands of genes can take hours.

3. Filter samples. You can specify exactly which samples you would like to use for the coexpression calculation. Type in keywords, select the Boolean operator (AND/OR), and the samples will be filtered. In this case, we'd like to only work with samples whose title contains "villus later" and with a characteristic of "genotype: APC-". Leave empty to keep all samples.


4. The console output will update automatically whenever there is progress on your job.
5. A link to the results page will show up in the console output and be emailed if the user provided an email address.
6. Five different outputs are generated:
   a. Tabulated results
   b. Cytoscape EDA and SIF files
   c. Tab-delimited matrix of normalized sample values. These are the samples that you selected in the previous filtering step. Gene names are matched to probes, and the samples are then normalized.
   d. Graphical Network View
   e. Histogram of Correlation Values
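The correlation and threshold steps of this service can be illustrated with a minimal Python sketch. It is not MAGNET's implementation (the guide states that the processor-intensive functions are written in R); the function names, the expression values, and the genes TP53 and MYC are made up for illustration, and the reading of "Less than" / "Greater than" as the bounds of the interval to discard is an assumption.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average
        start = end + 1
    return result


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson applied to the ranked data."""
    return pearson(ranks(x), ranks(y))


def coexpression_edges(expression, method=pearson, less_than=None, greater_than=None):
    """Correlate every gene pair; drop pairs with greater_than < r < less_than."""
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = method(expression[g1], expression[g2])
        filtered_out = (
            less_than is not None
            and greater_than is not None
            and greater_than < r < less_than
        )
        if not filtered_out:
            edges.append((g1, g2, r))
    return edges


# Made-up expression values over four samples (illustrative only).
expression = {
    "APC": [1.0, 2.0, 3.0, 4.0],
    "SRC": [2.0, 4.0, 6.0, 8.0],
    "TP53": [4.0, 3.0, 2.0, 1.0],
    "MYC": [1.0, 3.0, 2.0, 4.0],
}
edges = coexpression_edges(expression, less_than=0.9, greater_than=-0.9)
for g1, g2, r in edges:
    print(f"{g1}\t{g2}\t{r:.2f}")
```

The sketch computes each unordered pair once, so N genes yield N(N-1)/2 correlations rather than the full N^2 matrix; the surviving pairs are the edges that a network file would list.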

2.2 - Weighted PPIN

While the first service generates coexpression networks, the second service generates weighted protein-protein interaction networks (PPINs). Protein-Protein Interaction Networks provide very useful insight into the function of proteins in a cell, as much of that function is related to the interaction of proteins. However, it is estimated that the known interactions compose only 10% of all of the interactions in the studied organisms. In addition, those that are known have a significant amount of false positives/negatives.

MAGNET's solution to this problem is to generate a PPIN based on predicted interactions, and then to weight each interaction, using a logistic regression model. The logistic regression model integrates four variables describing the interacting proteins, and provides a score from 0.0 to 1.0, which serves as a probability that that specific interaction exists. The four variables used in the logistic regression model are outlined below:

1. Subcellular localization data (e.g. Are both proteins in the same location?)
2. Small-World Co-Clustering values (e.g. Do the neighbors of both proteins have many hypothesized connections?)

3. Number of interactions observed (e.g. When integrating multiple interaction databases, how many times was this interaction reported?)
4. Coexpression data (e.g. To what extent are the proteins coexpressed/differentially expressed?)

MAGNET takes in expression data and a gene list, and outputs an interaction matrix and network, where each interaction is weighted as described above. Below is a guide for this process:

1. Navigate to the MAGNET homepage (magnet.case.edu), and click "Submit Job," under the header of "Generate Weighted Protein Protein Interaction Network (PPIN)."
2. Submit job.
   a. Upload data in one of three ways:
      i. Using the dropdown box, select a type of cancer to use high-throughput TCGA data for that specific type of cancer.
      ii. Upload a GSE and GPL file in an accepted format for MAGNET to analyze the Microarray Gene Expression.
      iii. Upload an RNA Sequencing Count matrix.
   b. Specify the threshold for coexpression. This allows you to filter your coexpression matrix; otherwise you will have N^2 correlations for N genes. If you would like to filter out all correlations that are between -0.6 and 0.6, you would specify Less than: 0.6 and Greater than: -0.6. Or, leave it blank to include all correlations.
   c. Leave the gene list blank if you would like to calculate the correlation from all genes in an array to all other genes in the array. Uploading a gene list is encouraged, as calculating the correlation of tens of thousands of genes vs. tens of thousands of genes can take hours.
   d. Select which logistic regression variables you would like to include in the analysis.

3. Filter samples. You can specify exactly which samples you would like to use for the coexpression calculation. Type in keywords, select the Boolean operator (AND/OR), and the samples will be filtered. In this case, we'd like to only work with samples whose title contains "villus later" and with a characteristic of "genotype: APC-". Leave empty to keep all samples.

4. The console output will update automatically whenever there is progress on your job.
5. A PPIN job generates four outputs:
   a. Tabulated Results
   b. Cytoscape SIF and EDA files
   c. Graphical Network Output
   d. Histogram of PPI Probabilities
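As a rough illustration of how a logistic regression model turns the four variables into a probability between 0.0 and 1.0, consider the sketch below. The coefficient values, the intercept, the feature names, and the feature encodings are all hypothetical; the guide does not state MAGNET's fitted coefficients or how each variable is scaled.

```python
from math import exp


def interaction_probability(features, weights, intercept):
    """Logistic model: sigmoid of the intercept plus the weighted feature sum."""
    z = intercept + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + exp(-z))


# Hypothetical coefficients; a real model would fit these to training data.
weights = {
    "same_localization": 1.2,
    "co_clustering": 0.8,
    "times_reported": 0.5,
    "coexpression": 1.5,
}
intercept = -3.0

# Hypothetical feature values for one candidate protein pair.
features = {
    "same_localization": 1.0,  # 1.0 if both proteins share a subcellular location
    "co_clustering": 0.4,      # small-world co-clustering value
    "times_reported": 2.0,     # how many databases reported the interaction
    "coexpression": 0.7,       # strength of the correlation between the two genes
}

p = interaction_probability(features, weights, intercept)
print(f"Interaction probability: {p:.3f}")
```

Because the function only sums over the features it is given, leaving a variable out of the `features` dictionary mimics deselecting it in step 2d, although a real model would normally be refitted for each subset of variables.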

2.3 - Find Bimodality of Coexpression

The bimodality of coexpression is a novel measure used to measure the association between two gene networks. Even if the expression data of two genes may not suggest any association, the networks to which they belong may be associated. Bimodality of coexpression can indicate that two networks are associated by comparing the distributions of their coexpression values. An example of this comparison can be found in figure 1.

In figure 1, the blue distribution is the coexpression of all genes in the array vs. all genes in the array. This represents the "background," or "expected," distribution. The red is the correlation of gene list 1 vs. gene list 2. This distribution is labeled the "sample" distribution. If the sample distribution has a higher frequency toward the left tail, then the networks are differentially expressed. Likewise, if the sample distribution has a higher frequency on the right tail, then the networks are coexpressed. Bimodality measures this association by summing over the difference of the cumulative distribution functions of both distributions, and then finding a p-value for the resulting score. Below is a guide to this process:

Figure 2 Expected and sample distributions of coexpression (Bebek et al. 2010, Fig. 5)

1. Navigate to the BiC homepage and click "Submit Job," under the header of "Find Bimodality between Gene Lists."
2. Complete form. The Platform and Expression data follow the SOFT file format as specified by GEO. This allows you to simply download files from the GEO repository and upload them directly to BiC. However, there are also templates available on the BiC website to format your own data in the SOFT format for processing by BiC. You are also required to upload at least one Gene List and a Target Gene List, between which the bimodality of coexpression will be calculated.

3. Filter samples. You can specify exactly which samples you would like to use for the coexpression calculation. Type in keywords, select the Boolean operator (AND/OR), and the samples will be filtered. In this case, we want to include only samples that have the keyword "epithelium" in their annotation.
4. Specify which samples are Case and Control. This information is used to calculate the t-score, which is part of the Bimodality of Coexpression algorithm.

5. The console output will update automatically whenever there is progress on your job. The results will be output to the console and emailed, if the user provided an email address.

6. The results include the Bimodality and the associated P-value, abbreviated as B and P.

3 - Conclusion

MAGNET offers three different services that can help researchers draw conclusions from their expression data. The first service is generating coexpression matrices, which allows users to find which genes are correlated, and to what extent. The second service is generating weighted PPINs, which can give insight into the function of proteins in a cell. The third and final service calculates a measure of association between coexpression networks. If you have any questions, comments, or concerns regarding MAGNET or its function, please contact Gurkan Bebek (magnet [at] case.edu).

References

Bebek, G., Patel, V., Chance, M.R.: Petals: Proteomic evaluation and topological analysis of a mutated locus signaling. BMC Bioinformatics 11, 596 (2010)
