Pathway Analysis: Adding Functional Context to High-Throughput Results


1 Pathway Analysis: Adding Functional Context to High-Throughput Results. Stephen D. Turner, Ph.D., Bioinformatics Core Director

2 Outline: Bioinformatics & the Bioinformatics Core. Service Highlight: Pathway Analysis. IPA demo.

3 Bioinformatics Origins: Rooted in sequence analysis. Driven by the need to: - Collect - Annotate - Analyze

4 What is bioinformatics? (Diagram modified

5 What is bioinformatics? "There is a tremendous amount of information regarding evolutionary history and biochemical function implicit in each sequence and the number of known sequences is growing explosively. We feel it is important to collect this significant information, correlate it into a unified whole and interpret it." M. Dayhoff, February 27, 1967

6 UVA Bioinformatics Core: A centralized resource for providing expert and timely bioinformatics consulting and data analysis. Main goals: help you publish and get funding. 1. Service 2. Training

7 This is the stuff we do in the bioinformatics core! Sequencing, sample prep, raw data, differential expression, gene identification, novel genes, discoveries, etc. Find out what this stuff is at

8 Services
- Gene expression: Microarray Analysis
- Gene expression: RNA-seq Analysis
- Pathway analysis
- DNA Variation (GWAS, NGS)
- DNA Binding / ChIP-Seq
- DNA Methylation
- Grant / Manuscript support
- Custom development

9 Services: Gene expression: Microarray Analysis
- Accession and analysis of publicly available data (e.g. GEO, ArrayExpress).
- Preprocessing: background subtraction, summarization, and quantile normalization using the RMA (Robust Multichip Average) expression measure described in Irizarry et al. Biostatistics 4:
- Quality assessment: Visualization of signal intensity distributions of each array using boxplots and density plots. MA plots to visualize signal intensity ratio against average intensity. Principal components analysis to visualize the overall data (dis)similarity between arrays.
- Analysis: Estimation of fold changes and standard errors using a linear model. Empirical Bayes smoothing of standard errors. Lists of top differentially expressed genes, fold changes, statistical significance, multiple testing correction.
- Visualization: Heatmaps and dendrograms. Volcano plots to visualize statistical significance by fold change.
- Biological context: Pathway/Functional Analysis.

10 Services: Gene expression: RNA-seq
- Pre-alignment quality assessment: per-base sequence quality; per-base sequence content; per-base GC content; search for overrepresented sequences (adapters, primers, etc.).
- Alignment to a reference genome: Homo sapiens, Mus musculus, Rattus norvegicus, Bos taurus, Canis familiaris, Gallus gallus, Drosophila melanogaster, Arabidopsis thaliana, Caenorhabditis elegans, Saccharomyces cerevisiae.
- Post-alignment quality assessment: flagging duplicate reads; estimation of library complexity; insert size distribution (for paired-end sequencing); analysis of coverage over transcript position.
- Transcript assembly.
- Differential expression testing: isoforms, genes, primary transcripts, coding sequence.
- Differential splicing analysis.
- Differential coding output.
- Differential promoter use.
- Visualization: assistance with visualization using IGV.

11 Services
DNA Variation: Genotyping
- Study design & power calculations for SNP genotype-phenotype association studies
- Data management and quality control
- PCA for population stratification control
- Imputation to a reference population (e.g. HapMap, 1000 Genomes)
- Analysis, interpretation, visualization
- Manuscript preparation
- Grant support (compliance with NIH data sharing policies, methodology for data management, design, analysis, and interpretation)
- Acquisition of publicly available data (dbGaP)
DNA Variation: Next-Gen Sequencing
- Alignment to a reference genome
- Calibration of quality scores and duplicate read removal
- Variant calling
- Variant annotation
- SNP effect prediction
- De novo assembly
- Any of the applicable analysis, interpretation, and visualization services described above for genotyping data.

12 Service Highlight: Pathway Analysis
- You've done your microarray/RNA-seq experiment. You have a list of genes. You want to put these into functional context.
- What biological processes are perturbed? What pathways are being dysregulated?
- Data reduction: hundreds or thousands of genes can be reduced to tens of pathways.
- Identifying active pathways = more explanatory power.
- Pathway analysis encompasses many, many techniques:
  1. 1st Generation: Overrepresentation Analysis (e.g. GO ORA)
  2. 2nd Generation: Functional Class Scoring (e.g. GSEA)
  3. 3rd Generation (in development): Pathway Topology (e.g. SPIA)
- bit.ly/pathway-analysis

13 Over-representation analysis (ORA)
Many variations on the same theme: statistically evaluates the fraction of genes in a particular pathway that show changes in expression.
Algorithm:
1. Create input list (e.g. significant at p<0.05).
2. For each gene set:
   a. Count number of input genes.
   b. Count number of background genes (e.g. all genes on platform).
3. Test each pathway for over-representation of input genes.
Gene set: typically a gene ontology (GO) term.
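The slide does not name the test used in step 3. One common choice is the upper tail of the hypergeometric distribution (equivalent to a one-sided Fisher's exact test). As a sketch, with symbols that are our own notation rather than the slide's: N background genes, K of them in the gene set, n input genes, and k input genes in the gene set.

```latex
p \;=\; P(X \ge k) \;=\; \sum_{i=k}^{\min(n,\,K)} \frac{\binom{K}{i}\,\binom{N-K}{\,n-i\,}}{\binom{N}{n}}
```

A small p means the gene set contains more input genes than would be expected from drawing n genes at random from the background.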

14 Gene Ontology
- Ontology = formal representation of a knowledge domain. Gene ontology = cell biology.
- GO is represented by a directed acyclic graph (DAG). Terms are nodes, relationships are edges.
- Parent terms are more general than their child terms.
- Unlike a simple tree, terms can have multiple parents.
Rhee, S. Y., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the gene ontology annotations. Nature Reviews Genetics, 9(7), doi: /nrg2363
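To make the DAG idea concrete, here is a minimal sketch of a term graph in which one term has two parents. The `PARENTS` table and the `ancestors` helper are illustrative assumptions: the labels are a simplified toy, not an extract of the real ontology.

```python
# Toy term graph: each term maps to its parent terms (hypothetical, simplified).
PARENTS = {
    "biological_process": [],
    "metabolic_process": ["biological_process"],
    "cellular_process": ["biological_process"],
    # Two parents: this is what makes the structure a DAG rather than a tree.
    "cellular_metabolic_process": ["metabolic_process", "cellular_process"],
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

result = ancestors("cellular_metabolic_process")
```

A gene annotated with a child term is implicitly annotated with every ancestor term too, so gene sets built from related terms share genes.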

15 GO ORA: Example
Algorithm:
1. Create input list (e.g. significant at p<0.05).
2. For each gene set:
   a. Count number of input genes.
   b. Count number of background genes (e.g. all genes on platform).
3. Test each pathway for over-representation of input genes.
Ex: GO Purine Ribonucleotide Biosynthetic Process
- 1% of input (significant) genes are annotated with this term.
- 1% of genes on the chip are annotated with this term.
- Not significantly over-represented.
Ex: GO V(D)J Recombination
- 20% of input (significant) genes are annotated with this term.
- 1% of genes on the chip are annotated with this term.
- Highly significantly over-represented!
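The two examples can be reproduced numerically under stated assumptions. The sketch below uses a hypergeometric upper-tail test, which the slide does not specify, and invented counts (10,000 genes on the chip, 200 input genes) chosen only to match the slide's percentages. The function name `ora_p_value` is hypothetical.

```python
from math import comb

def ora_p_value(n_background, n_in_set, n_input, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_background: genes on the platform
    n_in_set:     background genes annotated with the term
    n_input:      significant (input) genes
    n_overlap:    input genes annotated with the term
    """
    upper = min(n_in_set, n_input)
    tail = sum(
        comb(n_in_set, i) * comb(n_background - n_in_set, n_input - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / comb(n_background, n_input)

# Assumed counts: 1% of 10,000 chip genes carry the term (100 genes).
p_same = ora_p_value(10_000, 100, 200, 2)       # 1% of 200 input genes
p_enriched = ora_p_value(10_000, 100, 200, 40)  # 20% of 200 input genes
```

With these assumed counts, `p_same` is far above 0.05 and `p_enriched` is vanishingly small, matching the slide's qualitative conclusions.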

16 GO ORA: Example

17 GO ORA: Limitations
- Some categories are so general they're meaningless (e.g. "cellular process").
- ORA uses genes above a cutoff and discards everything else.
- ORA only uses the number of genes, and ignores their measured changes.
- Two assumptions are violated:
  - Genes are independent (NOT! Coexpression, interaction, etc.).
  - Pathways are independent (violated by definition by the DAG).

18 Functional Class Scoring
Theory: while large changes in individual genes can have significant effects on pathways, weaker but coordinated changes in sets of functionally related genes can also have significant effects.
General Algorithm:
1. Compute gene-level statistic (e.g. fold change, Student's t).
2. Aggregate gene-level statistics for all genes in pathway into a single pathway-level statistic.
3. Assess significance with permutation.
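The three steps can be sketched as follows. This is one possible instance, not the method of any particular tool: it assumes the gene-level statistics are already computed (step 1), uses the mean as the pathway-level statistic (step 2), and permutes by drawing random gene sets of the same size (step 3). All names and the toy data are hypothetical.

```python
import random
from statistics import mean

def pathway_score(gene_stats, pathway):
    """Step 2: aggregate gene-level statistics into one pathway-level value."""
    return mean(gene_stats[g] for g in pathway)

def fcs_p_value(gene_stats, pathway, n_perm=2000, seed=0):
    """Step 3: compare the observed score with scores of random gene sets."""
    rng = random.Random(seed)
    observed = abs(pathway_score(gene_stats, pathway))
    genes = list(gene_stats)
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(genes, len(pathway))
        if abs(pathway_score(gene_stats, random_set)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Toy data: 10 genes with a coordinated shift, 90 genes with small noise.
gene_stats = {
    f"g{i}": 1.0 if i < 10 else (0.1 if i % 2 == 0 else -0.1)
    for i in range(100)
}
shifted = [f"g{i}" for i in range(10)]
flat = [f"g{i}" for i in range(10, 20)]
p_shifted = fcs_p_value(gene_stats, shifted)
p_flat = fcs_p_value(gene_stats, flat)
```

Unlike ORA, no cutoff is applied: every gene's statistic contributes, so a set of modest but consistent changes can still score well.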

19 Gene Set Enrichment Analysis
1. Calculate an Enrichment Score
   a) Rank genes by their expression difference.
   b) For each gene set*:
      i. Compute cumulative sum over ranked genes:
         1. Increase sum when gene is in set, decrease otherwise.
         2. Magnitude of increment depends on gene-phenotype correlation.
      ii. Record the maximum deviation from zero as the Enrichment Score (ES).
2. Assess significance
   a) Permute phenotype (or gene labels) multiple times.
   b) Compute ES for each permutation (empirical null).
   c) Compare ES for actual data to distribution of ES from permuted data.
   d) Normalize ES by accounting for gene set size (NES).
   e) Control multiple testing by calculating FDR for each NES.
* Gene sets come from MSigDB http:// MSigDB is a collection of annotated gene sets for use with GSEA software: positional, curated, computationally predicted, GO. Curated: KEGG, Reactome, STKE, etc.
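Step 1 can be sketched as a weighted running sum. This is a minimal illustration in the spirit of the slide, not the GSEA software itself: the function name, the `weight` exponent, and the toy genes are assumptions, and the permutation and normalization of step 2 are left out.

```python
def enrichment_score(ranked_genes, correlations, gene_set, weight=1.0):
    """Running-sum enrichment score over a ranked gene list.

    ranked_genes: gene names, ordered by expression difference
    correlations: gene-phenotype correlation for each ranked gene
    gene_set:     set of gene names to score
    """
    in_set = [g in gene_set for g in ranked_genes]
    # Hits step up in proportion to |correlation| ** weight.
    hit_weights = [
        abs(r) ** weight if hit else 0.0
        for r, hit in zip(correlations, in_set)
    ]
    total_hit = sum(hit_weights)
    n_miss = len(ranked_genes) - sum(in_set)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running, best = 0.0, 0.0
    for w, hit in zip(hit_weights, in_set):
        # Misses step down by a constant amount.
        running += w / total_hit if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best

genes = ["A", "B", "C", "D", "E", "F"]
corr = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top = enrichment_score(genes, corr, {"A", "B"})
es_bottom = enrichment_score(genes, corr, {"E", "F"})
```

A set concentrated at the top of the ranking gives a positive ES and one concentrated at the bottom gives a negative ES. Step 2 would repeat the calculation on permuted labels to build the null distribution.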

20 GSEA: Example

21 FCS/GSEA: Limitations
- Violate the same assumptions as GO ORA:
  - Genes are independent
  - Pathways are independent
- Only consider number/magnitude of genes, and ignore other information in databases:
  - Directionality of the interaction
  - Nature of the interaction (activation, inhibition, etc.)
  - Where the interaction occurs (nucleus, cytoplasm, etc.)

22 Pathway Topology: SPIA
- Utilizes directionality, function, and topology.
- Computes two orthogonal p-values:
  - pNDE: Number of Differentially Expressed genes (i.e. like ORA).
  - pPERT: degree of perturbation.
- pG is the overall p-value (pNDE and pPERT combined).
- pG FDR is the overall FDR-corrected p-value.

23 Pathway Topology: SPIA
TCR Signaling Pathway Results: pNDE: 6.5e-9, pPERT: 0.29, pG FDR: 1.2e-6
Conclusion: many differentially expressed genes, but pathway may not be badly perturbed.
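The slides do not state how the two p-values are combined into pG. To our understanding, the published SPIA method uses the product c = pNDE × pPERT and reports pG = c − c·ln(c); treat that formula as an assumption here. The helper name below is hypothetical.

```python
from math import log

def spia_combined_p(p_nde, p_pert):
    """Combine two independent p-values as pG = c - c*ln(c), c = p_nde * p_pert.

    Assumed combination rule; the slides do not give the formula.
    """
    c = p_nde * p_pert
    if c <= 0.0:
        return 0.0  # limit of c - c*ln(c) as c approaches zero
    return c - c * log(c)

# Values reported for the TCR signaling pathway on the slide.
p_g = spia_combined_p(6.5e-9, 0.29)
```

The unadjusted `p_g` comes out on the order of 1e-8. The slide's pG FDR of 1.2e-6 cannot be recomputed from one pathway, because the FDR correction depends on all pathways tested.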

24 Pathway Topology / SPIA: Limitations
- With SPIA, you still need an arbitrary cutoff, e.g. top 500, or p<0.05, etc.
- True topology depends on the type of cell, due to cell-specific gene expression profiles.
- Tissue-specific topology is rarely available and fragmented in databases, even if it's fully understood.
- Other general limitations of pathway analysis follow.

25 Pathway Analysis: General Limitations
- Low-resolution knowledge bases. E.g. RNA-seq studies have found >90% of the transcriptome is alternatively spliced. Different transcripts can have different or opposing functions.
- Incomplete/inaccurate annotations. Oct 2007: 95% of GO annotations were inferred electronically (i.e. not manually curated).
- Missing condition- and cell-specific information.
- Methodological challenge: lack of benchmarks.

26 Pathway Analysis: Conclusions
- Pathway analysis gives you more biological insight than staring at lists of genes.
- Pathway analysis is complex, and has many limitations.
- Pathway analysis is still more of an exploratory procedure than a pure statistical endpoint.
- The best conclusions are made by viewing enrichment analysis results through the lens of the investigator's expert biological knowledge.

27 IPA Demo
Background: Microarray data from childhood exacerbated asthma compared to normal state.
Questions: Do the data support involvement of immune/inflammatory responses and viral infection in the acute asthma attack?
Tasks:
- View Canonical Pathways that contain significant numbers of genes from this dataset.
- Overlay a Function/Disease state that shows how key signaling pathways for fighting off respiratory infections overlap with asthmatic inflammation.
- Overlay Biomarkers that identify genes in the infection signaling pathway that are also used as diagnosis and efficacy indicators for asthma treatments.
- Search the Ingenuity Knowledge Base for literature references that support your findings.
- Investigate a weird finding.

28 Thank you. Web: E-mail: Blog: Twitter: twitter.com/genetics_blog