Gene List Enrichment Analysis - Statistics, Tools, Data Integration and Visualization

Similar documents
Gene List Enrichment Analysis

Александр Предеус. Институт Биоинформатики. Gene Set Analysis: почему интерпретировать глобальные генетические изменения труднее, чем кажется

Pathway Analysis. Min Kim Bioinformatics Core Facility 2/28/2018

Analyzing Gene Set Enrichment

Downstream analysis of transcriptomic data

Suberoylanilide Hydroxamic Acid Treatment Reveals. Crosstalks among Proteome, Ubiquitylome and Acetylome

Methods for network visualization and gene enrichment analysis February 22, Jeremy Miller Postdoctoral Fellow - Computational Data Analysis

Lecture 8: Predicting and analyzing metagenomic composition from 16S survey data

Understanding protein lists from proteomics studies. Bing Zhang Department of Biomedical Informatics Vanderbilt University

Gene Expression Data Analysis (I)

Microarray Data Analysis in GeneSpring GX 11. Month ##, 200X

Knowledge-Guided Analysis with KnowEnG Lab

Introduction to Microarray Technique, Data Analysis, Databases Maryam Abedi PhD student of Medical Genetics

Understanding protein lists from comparative proteomics studies

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE

Introduction to Next Generation Sequencing (NGS) Data Analysis and Pathway Analysis. Jenny Wu

Flagship Project: InterOmics

From Variants to Pathways: Agilent GeneSpring GX s Variant Analysis Workflow

Pathway analysis. Martina Kutmon. Department of Bioinformatics, Maastricht University, The Netherlands

Seven Keys to Successful Microarray Data Analysis

Comparative study on gene set and pathway topology-based enrichment methods

MFMS: Maximal Frequent Module Set mining from multiple human gene expression datasets

Scoring pathway activity from gene expression data

High-throughput Proteomic Data Analysis. Suh-Yuen Liang ( 梁素雲 ) NRPGM Core Facilities for Proteomics and Glycomics Academia Sinica Dec.

Alexander Statnikov, Ph.D.

BABELOMICS: Microarray Data Analysis

BIMM 143: Introduction to Bioinformatics (Winter 2018)

dmgwas: dense module searching for genome wide association studies in protein protein interaction network

Next-Generation Sequencing Gene Expression Analysis Using Agilent GeneSpring GX

less sensitive than RNA-seq but more robust analysis pipelines expensive but quantitiatve standard but typically not high throughput

TUTORIAL. Revised in Apr 2015

Build Your Own Gene Expression Analysis Panels

Accelerating Gene Set Enrichment Analysis on CUDA-Enabled GPUs. Bertil Schmidt Christian Hundt

BGGN 213: Foundations of Bioinformatics (Fall 2017)

Our website:

CCRD Proteomics facility RIH-Brown University

and Promoter Sequence Data

Uncovering differentially expressed pathways with protein interaction and gene expression data

GeneAnswers Integrated Interpretation of Genes. Dr. Gang (Gilbert) Feng Northwestern University Biomedical Informatics Center

Effects of Different Normalization Methods on Gene Set Enrichment Analysis (GSEA)

LARGE DATA AND BIOMEDICAL COMPUTATIONAL PIPELINES FOR COMPLEX DISEASES

Comparative evaluation of gene set analysis approaches for RNA-Seq data

Systematic and Integrative Analysis of Large Gene Lists Using DAVID

Final exam: Introduction to Bioinformatics and Genomics DUE: Friday June 29 th at 4:00 pm

Gene Set Enrichment Analysis! Robert Gentleman!

Analysis of Microarray Data

Annotation and Function of Switch-like Genes in Health and Disease. A Thesis. Submitted to the Faculty. Drexel University. Adam M.

Genetics and Bioinformatics

Experience with Weka by Predictive Classification on Gene-Expre

Functional Enrichment Analysis & Candidate Gene Ranking

Standard Data Analysis Report Agilent Gene Expression Service

Microarray Informatics

Introduc)on to Pathway and Network Analysis

CS 5984: Application of Basic Clustering Algorithms to Find Expression Modules in Cancer

Following text taken from Suresh Kumar. Bioinformatics Web - Comprehensive educational resource on Bioinformatics. 6th May.2005

Functional Enrichment Analysis & Candidate Gene Ranking

David Crossman, Ph.D. UAB Heflin Center for Genomic Science. Immersion Course

Introduction to Bioinformatics

Computational Challenges of Medical Genomics

Sanger vs Next-Gen Sequencing

Gene Set Enrichment Analysis

Package GeneExpressionSignature

Pathway analysis. Martina Kutmon Department of Bioinformatics Maastricht University

Bioinformatics in Genomic and Proteomic Data Analysis November 2009 Brno, the Czech Republic

Protein-Protein-Interaction Networks. Ulf Leser, Samira Jaeger

Microarray Informatics

Leonardo Mariño-Ramírez, PhD NCBI / NLM / NIH. BIOL 7210 A Computational Genomics 2/18/2015

Introduction to BIOINFORMATICS

Introduction to Bioinformatics

Protein-Protein-Interaction Networks. Ulf Leser, Samira Jaeger

Jianlin Cheng, PhD Informatics Institute, Computer Science Department University of Missouri, Columbia Fall, 2011

BIMM 143 Pathway Analysis and the Interpretation of Gene Lists Lecture 16 Barry Grant

BTRY 7210: Topics in Quantitative Genomics and Genetics

Pathway Analysis Adding Func2onal Context to High- Throughput Results

Introduction to microarrays. Overview The analysis process Limitations Extensions (NGS)

Drug target identification. Enabling our pharmaceutical and biotech partners to effectively discover proteins or genes as novel targets

Enabling Reproducible Gene Expression Analysis

Compendium of Immune Signatures Identifies Conserved and Species-Specific Biology in Response to Inflammation

Package DOSE. June 6, 2018

ECS 234: Introduction to Computational Functional Genomics ECS 234

Expression Analysis Systematic Explorer (EASE)

BioBridge: Bringing Data Exploration to Biologists

The New Thomson Reuters Cortellis Collection for Accelrys. Tom Mayo Partner Evangelist Business Development Team

Basics of RNA-Seq. (With a Focus on Application to Single Cell RNA-Seq) Michael Kelly, PhD Team Lead, NCI Single Cell Analysis Facility

Biomarker discovery. Enabling pharmaceutical and biotech partners to discover relevant biomarkers in diseases of interest

Gene Annotation and Gene Set Analysis

Lecture 8: Predicting metagenomic composition from 16S survey data

University of Athens - Medical School. pmedgr. The Greek Research Infrastructure for Personalized Medicine

RNA-Seq Now What? BIS180L Professor Maloof May 24, 2018

I have my list of differentially expressed genes, now what?

Research Powered by Agilent s GeneSpring

B I O I N F O R M A T I C S

This place covers: Methods or systems for genetic or protein-related data processing in computational molecular biology.

Introduction to ChIP Seq data analyses. Acknowledgement: slides taken from Dr. H

ChIP-Seq Tools. J Fass UCD Genome Center Bioinformatics Core Wednesday September 16, 2015

ChIP-Seq Data Analysis. J Fass UCD Genome Center Bioinformatics Core Wednesday 15 June 2015

Introduction to Bioinformatics

DRAGON DATABASE OF GENES ASSOCIATED WITH PROSTATE CANCER (DDPC) Monique Maqungo

How advancement in biological network analysis methods empowers proteomics

User Guide. MAGNET : MicroArray & RNAseq Gene expression Network Evalua=on Toolkit. Page 1

Transcription:

Gene List Enrichment Analysis - Statistics, Tools, Data Integration and Visualization Aik Choon Tan, Ph.D. Associate Professor of Bioinformatics Division of Medical Oncology Department of Medicine aikchoon.tan@ucdenver.edu 10/4/2016 http://tanlab.ucdenver.edu/labhomepage/teaching/canb7640/

Outline Why do enrichment analysis? Types of Enrichment analysis Statistics Annotation sources / databases and Tools Data Integration and Visualization tools

Why Do Enrichment Analysis? From omics data and analysis, generate list (s) of interesting gene lists (List A) Genes act in concert to drive biological processes (List B) Can we compare the two lists (Lists A and B) Is the overlap different than expected? (pvalue) Is List A contains genes involved in List B biological process? (biology)

Idea Assume Total Genes in Genome = 20,000 Your Gene List 300 genes 50 500 genes Biological Process X Overlap Genes

Different Tests for Enrichment Fisher s exact Chi-square Binomial Hypergeometric Kolmogorov-Smirnov (This is implemented in GSEA) Permutation

Statistical Test for Enrichment Assume Total Genes in Genome = 20,000 Your Gene List = 300 genes 250 50 Overlap Genes 450 Biological Process X = 500 genes P = 2.2e -16 Assume Total Genes in Genome = 20,000 Your Gene List = 1000 genes 995 5 Overlap Genes 45 Biological Process Y = 50 genes P = 0.1882

Statistics to Test for Enrichment What is the chance of observing enrichment at least this extreme due to chance? Different tests produce very different ranges of p-values All test for over-enrichment, some test for under-enrichment Recommendation: Use p-values as a tool to rank gene sets and don t take them literally (remember, biology trumps statistics) Useful to correct for multiple testing (e.g. FDR)

Input for Enrichment Test Background gene set All genes that could appear in your gene list (usually, assume all the genes in the genome, e.g. 20,000 genes for human genome) Annotations / Gene sets Which annotation sources / databases? Some annotation terms may be subsets of other terms (e.g. in Gene Ontology) Goal: To identify gene sets/concepts with biological significance from your gene list

In Practice Choose a tool / program that Includes your species Includes your genes or probe identifiers (and do the matching from probes to genes) Up-to-date annotation Able to define background (if possible) Able to select gene sets/annotations/concepts Try at least a few tools and compare If the biology is real, you will find it from different tools Get recommendations from bioinformatics/ statistics for interpreting the results

Gene List Enrichment Analysis Annotation Database Backend annotation database User to Input Gene List Algorithms (sort and organize annotation terms in different ways for different discovery ideas) Statistics (calculate enrichment p-values with statistical methods) Data Mining Results Result Presentation (Adapted from Huang et al NAR 2008)

So, which database(s) and tool(s) to use? 178 databases, Online: 1685 databases http://www.oxfordjournals.org/nar/database/a/ 94 web servers DATABASE ISSUE (Jan) WEB SERVER ISSUE (July)

Survey

Survey Class I Singular enrichment analysis (SEA) = ORA Class II Gene set enrichment analysis (GSEA) = FCS Class III Modular enrichment analysis (MEA) = PT

Enrichment Analysis Tools

Class I SEA / ORA Evaluates the statistical significance on a fraction of genes in a particular pathway/ gene set/annotation term 2x2 table method Fisher s exact test, chi-square test, binomial test, hypergeometric test Use p-value (and FDR) to sort the gene sets/pathways/annotation terms enriched in your gene list

Class II GSEA / FCS Evaluates the statistical significance at particular pathway/gene set/annotation term instead of individual genes Covered in previous lecture GSEA Kolmogorov-Smirnov test, permutation Use p-value (and FDR) to sort the gene sets/pathways/annotation terms enriched from high-throughput omics data

Class III MEA / PT Pathway topology (PT)-based or MEA methods have been developed to utilize the additional information. PT-based methods are essentially the same as FCS methods in that they perform the same three steps as FCS methods. The key difference between the two is the use of pathway topology to compute genelevel statistics.

Limitations Class I ORA / SEA Class II FCS / GSEA Class III MEA / PT Limitations Evaluate each gene independently Different tests give different results that may alter the interpretation of the data No interactions are considered Evaluate each pathway independently No consideration of topology or interactions with the gene sets/ pathways Very specific to cell type and cellspecific expression based on conditions Topology dependency is hard to extract Dynamic?

Recommendations / Rules 1. Realistically positioning the role of enrichment P-values in the current datamining environment 2. Understanding the limitation of multiple testing correction on enrichment P-values 3. Cross-comparing enrichment analysis results derived from multiple gene lists

Recommendations / Rules 4. Setting up the right gene reference background 5. Extending backend annotation databases 6. Efficiently mapping users input gene identifiers to the available annotation 7. Enhancing the exploratory capability and graphical presentation

NIH DAVID http://david.abcc.ncifcrf.gov/

NIH DAVID http://david.abcc.ncifcrf.gov/

Func7onal Annota7on Clustering

Functional Related Terms

2D View of Functional Annotation Clusters

Functional Annotation Chart

Linking Enriched Genes to Pathways (KEGG only)

WEBGESTAL http://bioinfo.vanderbilt.edu/webgestalt/

WEBGESTAL http://bioinfo.vanderbilt.edu/webgestalt/

KEGG http://www.genome.jp/kegg/

KEGG

KEGG

KEGG

KEGG

Other Pathway Databases hap://www.reactome.org/ hap://www.pathwaycommons.org/

STRING hap://string- db.org/

Cytoscape hap://www.cytoscape.org/

Small and Large networks

BiNGO

BEReX BEReX (Collaboration with Dr. Jaewoo Kang, Korea University)

BEReX: Biomedical Entity-Relationship explorer (Collaboration with Dr. Jaewoo Kang, Korea University)

iberex (http://iberex.korea.ac.kr/) (Collaboration with Dr. Jaewoo Kang, Korea University)

Take Home Message Motivation to do gene list enrichment analysis. Choose the right statistics for the enrichment analysis depending on what you want to ask. Use p-values carefully. Many enrichment tools are available, test multiple tools. Use interactive visualization tools to present your enriched lists of genes.