Knowledge-Guided Analysis with KnowEnG Lab

Similar documents
From Variants to Pathways: Agilent GeneSpring GX s Variant Analysis Workflow

Next-Generation Sequencing Gene Expression Analysis Using Agilent GeneSpring GX

Gene-Level Analysis of Exon Array Data using Partek Genomics Suite 6.6

Analysis of Microarray Data

Understanding protein lists from comparative proteomics studies

Runs of Homozygosity Analysis Tutorial

IPA Advanced Training Course

Functional analysis using EBI Metagenomics

Agilent Genomic Workbench 7.0

Employee Walkthrough. Version 1.0. Last updated: 26 th January 2018 Author: Joe Sutcliffe E:

Stefano Monti. Workshop Format

Analysis of a Tiling Regulation Study in Partek Genomics Suite 6.6

BCHM 6280 Tutorial: Gene specific information using NCBI, Ensembl and genome viewers

Bioinformatics Analysis of Nano-based Omics Data

ACCELERATING GENOMIC ANALYSIS ON THE CLOUD. Enabling the PanCancer Analysis of Whole Genomes (PCAWG) consortia to analyze thousands of genomes

Machine Learning in Computational Biology CSC 2431

Nature Genetics: doi: /ng Supplementary Figure 1

Introduction. Highlights. Prepare Library Sequence Analyze Data

PM Created on 1/14/ :49:00 PM

Methods for network visualization and gene enrichment analysis February 22, Jeremy Miller Postdoctoral Fellow - Computational Data Analysis

Introduction to Next Generation Sequencing (NGS) Data Analysis and Pathway Analysis. Jenny Wu

MAKING WHOLE GENOME ALIGNMENTS USABLE FOR BIOLOGISTS. EXAMPLES AND SAMPLE ANALYSES.

Gene expression connectivity mapping and its application to Cat-App

Human Resources. The HR Guide to Retirement Manager. Version 1 Updated 3/5/2010

SUCCESSFACTORS RECRUITMENT: APPLICANT MANAGEMENT PART 2 Setting Up Interviews & Viewing Ratings

PIE Performance Process Dashboard & Grid

Exercise1 ArrayExpress Archive - High-throughput sequencing example

USER MANUAL for the use of the human Genome Clinical Annotation Tool (h-gcat) uthors: Klaas J. Wierenga, MD & Zhijie Jiang, P PhD

Gene List Enrichment Analysis - Statistics, Tools, Data Integration and Visualization

Setting Up and Running PowerCenter Reports

Manager Dashboard User Manual

Table of Contents HOL CMP

Bionano Access 1.1 Software User Guide

How to view Results with Scaffold. Proteomics Shared Resource

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

How to view Results with. Proteomics Shared Resource

Predictive and Causal Modeling in the Health Sciences. Sisi Ma MS, MS, PhD. New York University, Center for Health Informatics and Bioinformatics

Quick reference guide

Engagement Portal. Employee Engagement User Guide Press Ganey Associates, Inc.

Supervisor Training Supervisor Training

Tutorial. Whole Metagenome Functional Analysis (beta) Sample to Insight. November 21, 2017

ChroMoS Guide (version 1.2)

In silico prediction of novel therapeutic targets using gene disease association data

BI Portal User Guide

User Guide. Introduction. What s in this guide

A tutorial introduction into the MIPS PlantsDB barley&wheat databases. Manuel Spannagl&Kai Bader transplant user training Poznan June 2013

User Guide. ShipTrack Client Portal Release v4.1. Client Portal - User Guide

Introduction Reference System Requirements GIST (Gibbs sampler Infers Signal Transduction)... 4

Exploring Similarities of Conserved Domains/Motifs

Introducing a Highly Integrated Approach to Translational Research: Biomarker Data Management, Data Integration, and Collaboration

RIT ORACLE CAPITAL EQUIPMENT PHYSICAL INVENTORY

Lees J.A., Vehkala M. et al., 2016 In Review

Product Applications for the Sequence Analysis Collection

Data Intensive Scientific Discovery Vijay Chandru

Chapter 1 Analysis of ChIP-Seq Data with Partek Genomics Suite 6.6

New Frontiers in Personalized Medicine

my.scouting Tools Training-Home Trend Chart Training Summary Report

2/23/16. Protein-Protein Interactions. Protein Interactions. Protein-Protein Interactions: The Interactome

Contents OVERVIEW... 3

Using the Association Workflow in Partek Genomics Suite

Referral Training Exercise 3: Review and Hiring Manager

Applied Bioinformatics

User Guide Version

Package MADGiC. July 24, 2014

NCBI web resources I: databases and Entrez

Service Portal. TMT Class Course Code: SK1201 First Edition, March Managing Your Business for Service Managers Study Guide

ServicePRO + PartsPRO User Guide

Supplementary Fig. 1 related to Fig. 1 Clinical relevance of lncrna candidate

Applicant Reviewer Quick Guide

Create a Planned Run. Using the Ion AmpliSeq Pharmacogenomics Research Panel Plugin USER BULLETIN. Publication Number MAN Revision A.

Sanger vs Next-Gen Sequencing

Authorize.Net Mobile Application

Why learn sequence database searching? Searching Molecular Databases with BLAST

Getting Started with HLM 5. For Windows

Policy Data Manager User Guide

Multi-omics in biology: integration of omics techniques

Revised February 2017 CivicHR.com 317 Houston St, Suite E Manhattan KS

Scheduler Book Mode User Guide Version 4.81

Symphony Genomics Workflow Management System Managing, Integrating, and Delivering Your NGS Data and Results

GS1 Data Source Healthcare. Web interface user manual for data recipients in healthcare

Package goseq. R topics documented: December 23, 2017

Deltek Vision 6.2 SP1. Custom Reports and Microsoft SQL Server Reporting Services

Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Supplementary Material

Verizon Enterprise Center CALNET 3 Invoices User Guide

GS Analysis of Microarray Data

GEP Project Management System: Order Finishing Reactions

YEAR PERFORMANCE HEALTH TECHNOLOGY. CIM User Manual

BBISR Do it Yourself Workshops Bioinformatics & Biostatistics Tools for Cancer Research

Nature Methods: doi: /nmeth Supplementary Figure 1. Validation of RaPID with EDEN15

COMPUTER RESOURCES II:

PAYROLL PROCESSING User Guide

Chapter 4 Familiarizing Yourself With Q-interactive Assess

Web Time New Hire Packet

Sequence Based Function Annotation. Qi Sun Bioinformatics Facility Biotechnology Resource Center Cornell University

The ChIP-Seq project. Giovanna Ambrosini, Philipp Bucher. April 19, 2010 Lausanne. EPFL-SV Bucher Group

Central Portfolio Management System

Log In. 1. Navigate to 2. Input Company Name, Username, and Password. Click Sign In on the top right of the screen.

MedSavant: An open source platform for personal genome interpretation

IT portfolio management template User guide

e-recruitment Guide Job Posting and Publication

Transcription:

Han Sinha Song Weinshilboum Knowledge-Guided Analysis with KnowEnG Lab KnowEnG Center Powerpoint by Charles Blatti Knowledge-Guided Analysis KnowEnG Center 2017 1

Exercise In this exercise we will be doing the following: 1. Deploying the KnowEnG Analytics Suite to cluster somatic mutation data while integrating STRING network of protein interactions. 2. Using KnowEnG Analytics Suite to characterize gene sets of subtypes while integrating Protein Similarity network of evolutionary conservation. Knowledge-Guided Analysis KnowEnG Center 2017 2

Genes Genomic Data Analysis Using Prior Knowledge Knowledge Network User Interface Physical interactions, co-expression, pathways, biological processes, text mining, etc. Analysis Pipelines Subtype Stratification Samples Network Smoothing NMF Clustering Consensus Clustering RNA-seq, Somatic Mutations, etc.. User Spreadsheet 3

Genes Genomic Data Analysis Using Prior Knowledge in a Scalable Cloud User Interface Knowledge Network Physical interactions, co-expression, pathways, biological processes, text mining, etc. Analysis Pipelines Subtype Stratification Samples Network Smoothing NMF Clustering Consensus Clustering RNA-seq, Somatic Mutations, etc.. User Spreadsheet 4

General Inputs to KnowEnG Collection of Gene Signatures from multiple samples / experiments called the Genomic Spreadsheet Types of Gene Signatures Gene Sets Gene Importance (e.g. p-value) Gene Scores Examples z-scores of transcriptomic profile Differential expression gene sets/pvalues after drug treatment after gene knockdown after environmental stimuli Genes with non-silent somatic mutations Genes near regions of open chromatin Genes with promoter binding of TFs Etc. 5

Graph Mining Value of KnowEnG Tools Knowledge network + user spreadsheet Machine learning Integration of analysis with many and varied public data sets Called the Knowledge Network Gene Interactions STRING, BioGrid, Pathway Commons, Gene Annotations Gene Ontology, KEGG, MSigDB, LINCs, GEO 6

Three Pipelines Available Today Sample Clustering Given genomic data from many samples find clusters (subtypes) that relate to relevant phenotype Gene Set Characterization Find biological annotations related to your gene set(s) Gene Prioritization Find top ranking genes that relate to a phenotype ProGENI applied to drug response of LCLs 7

Step 0A - Form Groups of Two For this laboratory session we will be using the KnowEnG Center s interface for their Analytics Suite. Because the interfaces are not yet deployed at full scale, we will be working through these laboratory exercises as pairs Knowledge-Guided Analysis KnowEnG Center 2017 8

Step 0B: Download and Extract Data Files For viewing and manipulating the files needed for this laboratory exercise, download the following archive: http://veda.cs.uiuc.edu/compgen2017/labs/08 _Network_Analysis.zip Right Click and Extract the contents of the archive to your course directory. We will use the files found in: [course_directory]/08_network_analysi s/data/ Knowledge-Guided Analysis KnowEnG Center 2017 9

Robust Methods for Clustering Somatic Mutation Data In this exercise, we will use the analysis workflows implemented by the KnowEnG Center to find genomic subtypes of uterine corpus endometrial carcinoma patients from somatic mutation data. The approach we demonstrate here is called Network Based Stratification and was inspired by Hofree, et al. We will explore the ability of Network Based Stratification to identify clusters that related to compelling clinical features Knowledge-Guided Analysis KnowEnG Center 2017 10

Dataset Characteristics I Name uterine_clinical_data.tsv uterine_somatic_mutations.tsv Meaning A matrix of phenotype labels for 248 uterine cancer TCGA samples with phenotypes including survival, surgical outcome, tumor stage and grade, etc. A matrix that indicates non-silent somatic mutations that are identified in the protein coding region of a gene for all 248 uterine cancer samples across more than 16,000 genes Enrichment Map GSEA Tutorial Jun Song 11

Dataset Characteristics II Phenotypic data : Table with Samples as the rows Phenotypic categories as the columns Categorical or Quantitative values Enables searching for clusterings that are consistent with particular phenotypes survival, surgical success, treatment outcome, metastasis, etc. 12

Dataset Characteristics III Here is an analogous visualization of our non-silent somatic mutation cancer data (uterine_somatic_mutations.tsv) from the UCSC Cancer Genome Browser Mutated Genes Cancer patients Knowledge-Guided Analysis KnowEnG Center 2017 13

Dataset Characteristics IV Issue 1 Difficulty - Basic clustering methods will be susceptible to noisy data (passenger mutations) Hypothesis - Bootstrapping provides more robust clustering results by running method on many samples of the data and returning consensus results Issue 2 Difficulty - Typical clustering methods have difficulty because of the sparseness of the data Hypothesis - Although two tumors may not share the same somatic mutations, they may affect the same pathways and interaction networks Network Based Stratification (NBS) approach presented here will attempt to address both issues. Knowledge-Guided Analysis KnowEnG Center 2017 14

Issue 2: Value of Network-Guided Analysis Sparsity of Data Problem Distance between all samples is the same 15

Issue 2: Value of Network-Guided Analysis Sparsity of Data Problem Distance between all samples is the same Knowledge of pathways helps produce clusters 16

Issue 2: Value of Network-Guided Analysis Sparsity of Data Problem Distance between all samples is the same Knowledge of pathways helps produce clusters Network Smoothing 17

Step 1A - Sign into KnowEnG Scientific Collaboration Portal Follow the link to the appropriate HubZero login screen https://hub.knoweng.org/ Enter your user name and password for the course into the credentials boxes Click Sign In Knowledge-Guided Analysis KnowEnG Center 2017 18

Step 2A - Upload Cancer Data Files Click on Data among the options along the top Click on Upload New Data Click on browse link Find and highlight the data files in 08_Network_Analysis/data/sample_ clustering uterine_clinical_data.tsv uterine_somatic_mutations.tsv Click Open to load files Knowledge-Guided Analysis KnowEnG Center 2017 19

Step 3A - Configure NBS Analysis Click on Analysis Pipelines along the top In the Select A Pipeline section: Hover over Sample Clustering and click on the Start Pipeline button Knowledge-Guided Analysis KnowEnG Center 2017 20

Step 3B - Algorithm Overview Employs network smoothing to mitigate sparsity by transforming the binary gene-level somatic mutation vectors of patients into a continuous gene importance vector that captures the proximity of each gene in the Knowledge Network to all of the genes with somatic mutations in the patient sample Repeats clustering analysis using bootstrap sampling to build a consensus patient similarity matrix Knowledge-Guided Analysis KnowEnG Center 2017 21

Step 3C - Select User Data Files Leave the default species Human To choose the genomics spreadsheet: Click on Select genomic feature file Then click on the checkbox next to uterine_somatic_mutations.tsv To choose the clinical spreadsheet: Click on Select response file Then click on the checkbox next to uterine_clinical_data.tsv Click Next in the bottom right corner Knowledge-Guided Analysis KnowEnG Center 2017 22

Step 3D - Configure Algorithm Parameters We will use K-Means clustering as the final clustering algorithm to run on our patient similarity consensus matrix We will set the number of clusters (genomic subtypes) to find to be 4 Click Next Knowledge-Guided Analysis KnowEnG Center 2017 23

Step 3E - Configure KnowNet Parameters Click Yes for question about using the Knowledge Network The Knowledge Network we will use is the high confidence protein interactions from the STRING database ( STRING Experimental PPI ) A amount of network smoothing controls how much importance is put on network connections instead of the somatic mutations. We will use the default of 50% Click Next Knowledge-Guided Analysis KnowEnG Center 2017 24

Step 3F - Configure Bootstrap Parameters Click Yes for question about using the bootstrapping For the purposes of the demo, we will sample the data 20 times by setting the number of bootstraps to 20 We will keep the default bootstrap sample percentage of 80% Click Next Knowledge-Guided Analysis KnowEnG Center 2017 25

Step 3G Launch NBS Workflow Review your selections Change your Job Name if you wish Click Submit Job Knowledge-Guided Analysis KnowEnG Center 2017 26

Step 3H: Wait for Results to Complete Once submitted, should see Running icon. Click on Go to Results Page. When your NBS workflow status turns from Running (gray) to Completed (green), click on the name of the run Then click on the blue View Results button Results should be ready after several minutes Knowledge-Guided Analysis KnowEnG Center 2017 27

Step 4A: Consensus Patient Similarity Matrix Patients by Patients matrix The color of each cell indicates how often a pair of patients fell within the same cluster across all samplings Dark blocks show genomic subtypes of patients that may relate to a phenotype To the right we see the number of patients in each cluster Knowledge-Guided Analysis KnowEnG Center 2017 28

Step 4B: Exploring Genomic Subtypes Click on the Add Phenotypes to Compare One at a time, check the boxes next to the various clinical features. Click on the color scale to see the phenotype legend. Are there genomic subtypes that enrich with various phenotypes? Knowledge-Guided Analysis KnowEnG Center 2017 29

Step 5A: Genes Specific to a Subtype Download the detailed results. Click on the Results link at the top of the page Click on the job name of the desired run Click on the download icon A zip file with the same job name should download to your machine. Right Click and Extract the contents of the archive to your course directory. Knowledge-Guided Analysis KnowEnG Center 2017 30

Step 5A: Genes Specific to a Subtype The file top_genes_by_cluster.tsv contains genomic spreadsheet with four gene sets, one for each genomic subtype we found. Each of the gene sets contain the 100 most important genes per subtype based on the network-smoothed clustering. Many of these will have frequent somatic mutations in the subtype and some of these will be the network neighbors of frequently mutated genes. We will use Gene Set Characterization to find out what these genes are related to. Knowledge-Guided Analysis KnowEnG Center 2017 31

Network Based Methods for Gene Set Characterization In this exercise, we will use the analysis workflows implemented by the KnowEnG Center to find standard and network-based characterizations of the gene sets that relate to each genomic subtypes of uterine cancer. The network-guided approach we demonstrate here is based on the DRaWR algorithm published by Blatti, et al. Knowledge-Guided Analysis KnowEnG Center 2017 32

Step 6: Gene Set Characterization Purpose: Find relationships between novel gene sets of the user to known annotations in the Knowledge Network in order to anchor understanding of findings to previous literature and generate consistent hypotheses/mechanisms Related webtools DAVID - https://david.ncifcrf.gov/ Enrichr - http://amp.pharm.mssm.edu/enrichr/ Metascape - http://metascape.org/gp/index.html GSEA - http://software.broadinstitute.org/gsea/index.jsp Knowledge-Guided Analysis KnowEnG Center 2017 33

Step 6A - Upload Gene Set Spreadsheet Click on Data among the options along the top Click on Upload New Data Click on browse link Find and highlight the data file top_genes_by_cluster.tsv Click Open to load files Knowledge-Guided Analysis KnowEnG Center 2017 34

Step 6B - Configure Enrichment Analysis Click on Analysis Pipelines along the top In the Select A Pipeline section: Hover over Gene Set Characterization and click on the Start Pipeline button Knowledge-Guided Analysis KnowEnG Center 2017 35

Step 6C - Select User Data Files Leave the default species Human To choose the genomics spreadsheet: Click on Select a spreadsheet Then click on the checkbox next to top_genes_by_cluster.tsv Click Next in the bottom right corner Knowledge-Guided Analysis KnowEnG Center 2017 36

Step 6D - Configure Algorithm Parameters We will choose to use a subset of 3 possible gene set collections available in the knowledge network Ontologies: Gene Ontology (default) Pathways: KEGG Pathway (default) Disease/Drugs: Enrichr Phenotype Signature (needs to be added) (unclick Protein Domains: PFam Protein Domains) Click Next Knowledge-Guided Analysis KnowEnG Center 2017 37

Step 6E - Configure KnowNet Parameters Click No for question about using the Knowledge Network This will run the one-sided Fisher exact test that is used by the DAVID webserver which you used in the lab earlier Click Next Knowledge-Guided Analysis KnowEnG Center 2017 38

Step 6F Launch GSC Workflow Review your selections Change your Job Name if you wish Click Submit Job Knowledge-Guided Analysis KnowEnG Center 2017 39

Step 6G: Wait for Results to Complete Once submitted, should see Running icon. Click on Go to Results Page. When your NBS workflow status turns from Running (gray) to Completed (green), click on the name of the run Then click on the blue View Results button Results should be ready after several minutes Knowledge-Guided Analysis KnowEnG Center 2017 40

Step 7: Pathways that Overlap with Important Genes Move the Filter Matches by Score slider all of the way to the right to see more results. These pathways overlap with our 100 genes that are important for each genomic subtype Right click Cluster_0 to sort columns to see the relationship of this type to other biological processes Click on individual heatmap cell to see the number of shared genes Knowledge-Guided Analysis KnowEnG Center 2017 41

Step 8: Value of Network-Guided Analysis Take advantage of gene neighbors Integrating multiple data types Knowledge-Guided Analysis KnowEnG Center 2017 42

Step 8: DRaWR Method for GSC DRaWR using random walks on a network Construct a network of interest Find stationary distribution on network Find gene set specific distribution Return annotation nodes that are especially related to the query Knowledge-Guided Analysis KnowEnG Center 2017 43

Step 8: DRaWR Method for GSC DRaWR using random walks on a network Construct a network of interest Find stationary distribution on network Find gene set specific distribution Return annotation nodes that are especially related to the query Knowledge-Guided Analysis KnowEnG Center 2017 44

Step 8: DRaWR Method for GSC DRaWR using random walks on a network Construct a network of interest Find stationary distribution on network Find gene set specific distribution Return annotation nodes that are especially related to the query Knowledge-Guided Analysis KnowEnG Center 2017 45

Step 8A - Configure Enrichment Analysis Click on Analysis Pipelines along the top In the Select A Pipeline section: Hover over Gene Set Characterization and click on the Start Pipeline button Leave the default species Human To choose the genomics spreadsheet: Click on Select a spreadsheet Then click on the checkbox next to top_genes_by_cluster.tsv Click Next in the bottom right corner Knowledge-Guided Analysis KnowEnG Center 2017 46

Step 8B - Configure Algorithm Parameters We will choose to use a subset of 3 possible gene set collections available in the knowledge network Ontologies: Gene Ontology (default) Pathways: KEGG Pathway (default) Disease/Drugs: Enrichr Phenotype Signature (needs to be added) (unclick Protein Domains: PFam Protein Domains) Click Next Knowledge-Guided Analysis KnowEnG Center 2017 47

Step 8C - Configure KnowNet Parameters Click Yes for question about using the Knowledge Network The Knowledge Network we will use is the evolutionary conservation network from protein homology ( BlastP Protein Sequence Similarity ) A amount of network smoothing controls how much importance is put on network connections instead of the somatic mutations. We will use the default of 50% Click Next Knowledge-Guided Analysis KnowEnG Center 2017 48

Step 8D Launch GSC Workflow Review your selections Change your Job Name if you wish Click Submit Job Once submitted, should see Running icon. Click on Go to Results Page. When your NBS workflow status turns from Running (gray) to Completed (green), click on the name of the run Then click on the blue View Results button Knowledge-Guided Analysis KnowEnG Center 2017 49

Step 9: Network-based Pathways that Overlap with Important Genes Move the Filter Matches by Score slider all of the way to the right. These pathways are especially well connected to the 100 genes that are important for each genomic subtype and/or their most similar proteins Sort by Cluster_0 to see the additional heart and stress related terms that are found Click on the Cluster_0 and Heart Failure individual heatmap cell to see that the number of overlap genes is low Knowledge-Guided Analysis KnowEnG Center 2017 50

Logout and Wrapup: Feedback Survey: https://goo.gl/forms/l9l06tqobixpvium2 KnowEnG Future Pipelines: https://hub.knoweng.org/features Knowledge Network Contents: https://hub.knoweng.org/app/site/knownet.html Other Test Examples (must be logged in): https://hub.knoweng.org/quickstart Contacts: Charles Blatti, Postdoc, blatti@illinois.edu Subha Srinivasan, Program Manager, ssubha@illinois.edu Knowledge-Guided Analysis KnowEnG Center 2017 51

KnowEnG Pipelines 1 Sample Clustering 2 Gene Set Characterization 3 Gene Prioritization 4 Literature Mining Clusters samples and scores clusters for their relation to provided phenotypes. Finds systems-level properties, e.g., pathway, biological process, of gene set. Finds genes associated w/ phenotype, based on genomic profiles of a cohort and reports top ranking genes. Searches the corpus for occurrences of specified genes ranking the genes for relevance to each specified term. 5 Signature Analysis 6 Phenotype Prediction 7 GRN Reconstruction 8 Multi-Omics Analysis Maps transcriptomic and other omics profiles of samples to best matching signatures. Trains and uses stats model to predict numeric phenotype from genomic profile. Predicts each gene s expression as a function of expression levels of a few TFs. Reports TFs and TF-gene relationships that can be linked to the phenotype.