BIMM 143 Pathway Analysis and the Interpretation of Gene Lists Lecture 16 Barry Grant

Size: px
Start display at page:

Download "BIMM 143 Pathway Analysis and the Interpretation of Gene Lists Lecture 16 Barry Grant"

Transcription

1 BIMM 143 Pathway Analysis and the Interpretation of Gene Lists Lecture 16 Barry Grant

2 Inputs Steps Control Treatment Reads R1 FastQ Reads R1 FastQ Reads R2 [optional] FastQ 1. Reads R2 [optional] FastQ Quality Control FastQC UCSC Reference Genome Fasta 2. Alignment (Mapping) TopHat2 Count Table Last day s step UCSC Annotation GTF 3. Read Counting CuffLinks data.frame Count Table 4. Differential expression analysis! DESeq2 data.frame

3

4 Volcano Plot Fold change vs P-value Significant (P < 0.01 & log2 > 2)

5 My high-throughput experiment generated a long list of genes/proteins What do I do now?

6 Pathway analysis! (a.k.a. geneset enrichment) Use bioinformatics methods to help extract biological meaning from such lists

7 Inputs Steps Control Treatment Reads R1 FastQ Reads R1 FastQ UCSC Reads R2 [optional] FastQ 1. Reads R2 [optional] FastQ Reference Genome Fasta Quality Control FastQC 2. Alignment (Mapping) TopHat2 Count Table 5. Gene set enrichment analysis nalysis! KEGG, GO, UCSC Annotation GTF 3. Read Counting CuffLinks data.frame Count Table 4. Differential expression analysis! DESeq2 data.frame

8 Basic idea Differentially Expressed Genes (DEGs) Gene-sets (Pathways, annotations, etc...) Annotate...

9 Basic idea Differentially Expressed Genes (DEGs) Gene-sets (Pathways, annotations, etc...) Annotate... Differentially Expressed Genes (DEGs) Overlap... Pathway analysis (geneset enrichment) Pathways

10 Pathway analysis (a.k.a. geneset enrichment) Principle Differentially Expressed Genes (DEGs) Pathway (geneset) DEGs Pathway Enriched Not enriched DEGs come from your experiment Pathway genes ( geneset ) come from annotations Variations of the math: overlap, ranking, networks... Critical, needs to be as clean as possible Important, but typically not a competitive advantage Not critical, different algorithms show similar performances

11 Pathway analysis (a.k.a. geneset enrichment) Side-note: Limitations Geneset annotation bias: can only discover what is already known Non-model organisms: no high-quality genesets available Post-transcriptional regulation is neglected Tissue-specific variations of pathways are not annotated e.g. NF-κB regulates metabolism, not inflammation, in adipocytes Size bias: stats are influenced by the size of the pathway Many pathways/receptors converge to few regulators e.g. Tens of innate immune receptors activate four TFs: NF-kB, AP-1, IRF3/7, NFAT

12 Starting point for pathway analysis: Your gene list You have a list of genes/proteins of interest You have quantitative data for each gene/protein Fold change p-value Spectral counts Presence/absence _at _at _s_at _at _x_at _a_at _at _at _s_at _at _s_at _at _at _at _at _at _at _s_at _s_at _s_at _s_at F26A1.8 F26A1.8 F53F C09H C09H F22E F09F C34F F07F F35B W08E B C03A Y40B10A M H04M C03A C03A7.7 F44F4.4 F32A5.2 W03F8.6 NP_ NP_ NP_ NP_ NP_ NP_ NP_ NP_ NP_ NP_ NP_ NP_ NP_ NP_ NP_ NP_ NP_ NP_ NP_ ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG C20orf58 ACPL2 TNF EMP2 GDF15 ASPHD1 PANK3 EMP2 ICAM1 C20orf112 C20orf58 GPNMB IMPA2 TMEM50B EMP2 MSI2 C20orf58 C8orf4 ETV7 LTB ICAM1

13 Translating between identifiers Many different identifiers exist for genes and proteins, e.g. UniProt, Entrez, etc. Often you will have to translate one set of ids into another A program might only accept certain types of ids You might have a list of genes with one type of id and info for genes with another type of id

14 Translating between identifiers Many different identifiers exist for genes and proteins, e.g. UniProt, Entrez, etc. Often you will have to translate one set of ids into another A program might only accept certain types of ids You might have a list of genes with one type of id and info for genes with another type of id Various web sites translate ids -> best for small lists UniProt < IDConverter < idconverter.bioinfo.cnio.es >

15 Translating between identifiers: UniProt < >

16 Translating between identifiers Many different identifiers exist for genes and proteins, e.g. UniProt, Entrez, etc. Often you will have to translate one set of ids into another A program might only accept certain types of ids You might have a list of genes with one type of id and info for genes with another type of id Various web sites translate ids -> best for small lists UniProt < IDConverter < idconverter.bioinfo.cnio.es > VLOOKUP in Excel - good if you are an excel whizz - I am not! Download flat file from Entrez, Uniprot, etc; Open in Excel; Find columns that correspond to the 2 IDs you want to convert between; Sort by ID; Use vlookup to translate your list

17 Translating between identifiers: Excel VLOOKUP VLOOKUP(lookup_value, table_array, col_index_num)

18 Translating between identifiers Many different identifiers exist for genes and proteins, e.g. UniProt, Entrez, etc. Often you will have to translate one set of ids into another A program might only accept certain types of ids You might have a list of genes with one type of id and info for genes with another type of id Various web sites translate ids -> best for small lists UniProt < >; IDConverter < idconverter.bioinfo.cnio.es > VLOOKUP in Excel -> good if you are an excel whizz - I am not! Download flat file from Entrez, Uniprot, etc; Open in Excel; Find columns that correspond to the two ids you want to convert between; Use vlookup to translate your list Use the merge() or mapids() functions in R - fast, versatile & reproducible! Also clusterprofiler::bitr() function and many others [Link to clusterprofiler vignette]

19 Reminder Using the merge() function anno <- read.csv("data/annotables_grch38.csv") This is an annotation file merge(mygenes, anno, by.x="row.names", by.y= "ensgene") This is our differential expressed genes Using mapids() function from bioconductor library("annotationdbi") library("org.hs.eg.db") mygenes$symbol <- mapids( org.hs.eg.db, keys=row.names(mygenes), column="symbol", keytype="ensembl" )

20 Reminder Using the merge() function anno <- read.csv("data/annotables_grch38.csv") merge(mygenes, anno, by.x="row.names", by.y= "ensgene") Using mapids() function from bioconductor library("annotationdbi") library("org.hs.eg.db") Load the required Bioconductor packages mygenes$symbol <- mapids( org.hs.eg.db, column="symbol", keys=row.names(mygenes), keytype="ensembl" ) Annotation we want to add Our vector of gene names & their format

21 See package vignette: Alternative...

22 What functional set databases do you want? Most commonly used: Gene Ontology (GO) KEGG Pathways (mostly metabolic) GeneGO MetaBase Ingenuity Pathway Analysis (IPA) DEGs Pathway GO KEGG IPA etc... Many others... Enzyme Classification, PFAM, Reactome, Disease Ontology, MSigDB, Chemical Entities of Biological Interest, Network of Cancer Genes etc See: Open Biomedical Ontologies ( )

23 GO < > What function does HSF1 perform? response to heat; sequence-specific DNA binding; transcription; etc Ontology => a structured and controlled vocabulary that allows us to annotate gene products consistently, interpret the relationships among annotations, and can easily be handled by a computer GO database consists of 3 ontologies that describe gene products in terms of their associated biological processes, cellular components and molecular functions

24 GO Annotations GO is not a stand-alone database of genes/proteins or sequences Rather gene products get annotated with GO terms by UniProt and other organism specific databases, such as Flybase, Wormbase, MGI, ZFIN, etc. Annotations are available through AmiGO < amigo.geneontology.org >

25 GO is structured as a directed graph Relationships (edges) Parent terms are more general & child terms more specific GO Terms (nodes)

26 GO evidence codes Use and misuse of the gene ontology annotations Seung Yon Rhee, Valerie Wood, Kara Dolinski & Sorin Draghici Nature Reviews Genetics 9, (2008)

27 See AmiGO for details:

28 Can now do gene list analysis with GeneGO online!

29 Another popular online tool: DAVID at NIAID < david.abcc.ncifcrf.gov >

30 DAVID Functional Annotation Chart Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources Da Wei Huang, Brad T Sherman & Richard A Lempicki Nature Protocols 4, (2009)

31 Overlapping functional sets Many functional sets overlap In particular those from databases that are hierarchical in nature (e.g. GO) Hierarchy enables: Annotation flexibility (e.g. allow different degrees of annotation completeness based on what is known) Computational methods to understand function relationships (e.g. ATPase function is a subset of enzyme function) Unfortunately, this also makes functional profiling trickier Clustering of functional sets can be helpful in these cases

32 DAVID DAVID now offers functional annotation clustering:

33 DAVID Functional Annotation Clustering Based on shared genes between functional sets

34 Want more? GeneGO < portal.genego.com > MD/PhD curated annotations, great for certain domains (eg, Cystic Fibrosis) Nice network analysis tools us for access Oncomine < > Extensive cancer related expression datasets Nice concept analysis tools Research edition is free for academics, Premium edition $$$ Lots and lots other R/Bioconductor packages in this area!!!

35 Do it Yourself! Hands-on time! Also: R Quiz Online

36 Inputs Steps Control Treatment Reads R1 FastQ Reads R1 FastQ UCSC Reads R2 [optional] FastQ 1. Reads R2 [optional] FastQ Reference Genome Fasta Quality Control FastQC 2. Alignment (Mapping) TopHat2 Count Table 5. Gene set enrichment analysis nalysis! KEGG, GO, UCSC Annotation GTF 3. Read Counting CuffLinks data.frame Count Table 4. Differential expression analysis! DESeq2 data.frame

37 Data structure: counts + metadata countdata gene ctrl_1 ctrl_2 exp_1 exp_1 genea geneb genec gened genee genef countdata is the count matrix (number of reads coming from each gene for each sample) coldata id treatment sex... ctrl_1 control male... ctrl_2 control female... exp_1 treatment male... exp_2 treatment female... Sample names: ctrl_1, ctrl_2, exp_1, exp_2 coldata describes metadata about the columns of countdata First column of coldata must match column names of countdata (-1st)

38 Advice: Figure out What do I want to do with my list? Organize/summarize data for presentation or manuscript DAVID: GO_FAT -> Functional Annotation Clustering -> Pick threshold Infer biological processes from the list DAVID: Functional Annotation Chart -> explore functional databases and see which make sense GSEA: Select MSigDB sets of interest -> e.g., immunologic signatures Use domain specific database it at all possible! Find missing genes/proteins not detected by experiment ConceptGen: Gene-gene enrichment

39 Inputs Steps Control Treatment Reads R1 FastQ Reads R1 FastQ UCSC Reads R2 [optional] FastQ 1. Reads R2 [optional] FastQ Reference Genome Fasta Quality Control FastQC 2. Alignment (Mapping) TopHat2 Count Table 5. Gene set enrichment analysis nalysis! KEGG, GO, UCSC Annotation GTF 3. Read Counting CuffLinks data.frame Count Table 4. Differential expression analysis! DESeq2 data.frame

40 Pathways vs Networks Next Class - Detailed, high-confidence consensus - Biochemical reac2ons - Small-scale, fewer genes - Concentrated from decades of literature - Simplified cellular logic, noisy - Abstrac2ons: directed, undirected - Large-scale, genome-wide - Constructed from omics data integra2on

41 Next Class

42 Next Class What biological process is altered in this cancer? Are NEW pathways altered in this cancer? Are there clinically relevant tumor subtypes?

43 Pathway analysis (a.k.a. geneset enrichment) Side-note: Limitations Geneset annotation bias: can only discover what is already known Non-model organisms: no high-quality genesets available Post-transcriptional regulation is neglected Tissue-specific variations of pathways are not annotated e.g. NF-κB regulates metabolism, not inflammation, in adipocytes Size bias: stats are influenced by the size of the pathway Many pathways/receptors converge to few regulators e.g. Tens of innate immune receptors activate four TFs: NF-kB, AP-1, IRF3/7, NFAT

44 Pathway & Network Analysis Overview Find biological processes underlying a phenotype Databases Literature Network Analysis Network Information Expert knowledge Experimental Data

45 Do it Yourself! R Knowledge Check Quiz This will be marked but not graded (i.e. will not factor into your course grade)