11.5. Basic Training for Gene Expression Analysis. 李彥樑 (Jack Lee) 威健生技 Welgene Biotech. Welgene Biotech. Co. Ltd.


1 11.5 Basic Training for Gene Expression Analysis 李彥樑 (Jack Lee) 威健生技 Welgene Biotech.

2 Why Bioinformatics Analysis? From this To this!

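3 Agilent GeneSpring: The most widely used array analysis software. Supports Agilent, Affymetrix, Illumina, and home-made arrays, as well as ABI Q-PCR data. Graphical analysis results that are easy to interpret from a biological point of view. Built-in common statistical filtering tools and clustering algorithms. Built-in gene function analysis and pathway search (optional). Easy to link to third-party pathway analysis software (IPA or MetaCore).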

4 GeneSpring GX Links Array Data to Biology: RNA expression data; find significantly expressed genes; BioPAX, NLP, MeSH.

5 Agilent GeneSpring 11.5: Integrative Platform for Multi-Omics Analysis. Genomic (CNV, SNP); Transcriptomic (mRNA, miRNA, Exon Splicing); Proteomics (Protein Expression, optional); Metabolomics (Metabolite & small molecule, optional); NGS Data Analysis (coming in 2012).

6 DNA Applications in GeneSpring 11.5

7 RNA Applications in GeneSpring 11.5: Transcriptomic analysis
Analysis of Agilent Exon Microarray data.
Probe- or gene-level expression analysis on all major microarray platforms, including Agilent, Affymetrix, and Illumina.
microRNA analysis and identification of gene targets using integrated TargetScan information.
Exon splicing analysis using t-tests or multivariate splicing ANOVA, and filtering for transcripts on splicing index.
Real-time PCR data analysis.
NCBI Gene Expression Omnibus Importer tool for expression datasets.

8 Useful statistics and algorithms: K-means, SOM, Hierarchical clustering, Class prediction, Gene Set Enrichment Analysis (GSEA), Gene Set Analysis (GSA), Principal Component Analysis (PCA), ANOVA, T-test, Repeated Measures.
Gene Ontology Analysis: lets the user visualize and query the GO Tree, view the GO terms at any level as a pie chart, and compute enrichment scores for GO terms based on a set of selected entities.
Pathway analysis (optional): import and view pathways in the BioPAX format; build relevant biological networks with a natural language processing (NLP) algorithm; networks can also be created from user-specified MeSH terms.
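To make the idea of a GO enrichment score concrete, here is a minimal sketch. The slides do not say which statistic GeneSpring uses, so the hypergeometric upper-tail test below is an assumption chosen because it is a common choice for this kind of over-representation question. The function name and all counts are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(total_genes, term_genes, selected, selected_in_term):
    """Probability of seeing at least `selected_in_term` genes annotated to a
    GO term when `selected` genes are drawn at random from `total_genes`,
    of which `term_genes` carry the annotation (assumed test, not confirmed
    as GeneSpring's own method)."""
    upper = min(term_genes, selected)
    favourable = sum(
        comb(term_genes, k) * comb(total_genes - term_genes, selected - k)
        for k in range(selected_in_term, upper + 1)
    )
    return favourable / comb(total_genes, selected)

# Hypothetical example: 10 genes on the array, 5 annotated to the term,
# 5 genes selected, all 5 of them annotated.
p = hypergeom_enrichment_p(10, 5, 5, 5)
print(p)
```

A small p-value suggests the selected entity list contains more genes from that GO term than chance alone would explain.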

9 Proteomics and Metabolomics Analysis in GeneSpring 11.5: Mass Spectrometry Module (MPP)
Supports various data formats: Agilent LC/MS TOF, QTOF, and QQQ; AMDIS GC/MS; mzXML; tab-delimited text.
Statistical tests to identify significant peptides or metabolites.
Quickly and easily discover differences between sample groups.
Plot changing patterns of compound abundances over time.
Develop useful multivariate models for class prediction.
LC/MS Personal Compound Database (METLIN, pesticides, forensics).
GC/MS libraries (NIST and Fiehn library).
Empirical formula calculation using Agilent's Molecular Formula Generator (MFG) algorithm.

10

11

12 NGS Data Analysis in GeneSpring 11.9 RNA-seq DNA-seq

13 Total Solution for Genomics

14 Agilent for Omics Research Proteomics NGS HPLC Systems Microarray HPLC Chip/MS Microarray Scanner Genomics Bioinformatics Metabolomics Bioanalyzer GC/MS Genomic Workgroup MassHunter Workstation Mass Profiler Pro GeneSpring GX

15 Interface of GeneSpring 11.5: Create entity list from selection; Venn Diagram; Import entity list from file. 1. Experiment (array data) 2. Interpretation, gene list, analysis result 3. Display area 4. Analysis tool 5. View / function buttons 6. Figure legend

16 Step-by-step GE Workflow: 1. Experiment Setup 2. Quality Control 3. Analysis (data filtering and clustering) 4. Class Prediction (optional) 5. Result Interpretation

17 Find Genes with Similar Expression Patterns (Clustering) Hierarchical Clustering K-means or SOM

18 Gene Ontology Analysis

19 Pathway Analysis
Organisms: Human, Mouse, Rat, Drosophila, C. elegans, Yeast, Arabidopsis, E. coli.
Relations (for each organism): Binding, Expression, Metabolism, Promoter Binding, Protein modification, Regulation, Transport.

20 Data Analysis 101

21 Workflow of GE Data Analysis: Array Data -> Input Data (including normalization steps) -> Exclude Bad Data -> Find Differentially Expressed Genes -> Clustering -> GO and Pathway Analysis -> Output Results -> Result

22 Two Types of GE Array Data
Two color: Exp. RNA and Ctrl. RNA (Red/Green).
One color: Exp. RNA and Ctrl. RNA (Green/Green).
Both give the fold change of gene expression.

23 GE Array Normalization Settings
Summarization Algorithm (Chip-level Normalization):
  One-color data (ex: Agilent One-color, Affymetrix): Quantile normalization or Constant/Percentile normalization
  Two-color data (ex: Agilent Two-color): LOWESS normalization
Baseline Transformation (Gene-level Normalization):
  One-color data: to median/mean of all samples, or to median/mean of control samples
  Two-color data: optional; to median of all samples

24 Effect of Chip-level Normalization (one-color array)

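25 Effect of Baseline Transformation
Raw signal values (table: Gene A and Gene B across Ctrl, T1, T2, T3, T4).
Divide each gene's signal value by that gene's median across all samples (table: Gene A and Gene B across Ctrl, T1, T2, T3, T4).
Define Exp vs. Ctrl pairs to compute the Exp./Ctrl ratio (table: Gene A and Gene B across Ctrl, T1, T2, T3, T4).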
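The two transformations on this slide can be sketched as follows. The numeric tables did not survive in the transcript, so the signal values below are hypothetical, and the function names are illustrative rather than anything GeneSpring exposes.

```python
from statistics import median

def baseline_to_median(signals):
    """Divide each gene's signals by that gene's median across all samples."""
    return {
        gene: [value / median(values) for value in values]
        for gene, values in signals.items()
    }

def ratio_to_control(signals, control_index=0):
    """Divide each gene's signals by its value in the control sample."""
    return {
        gene: [value / values[control_index] for value in values]
        for gene, values in signals.items()
    }

# Hypothetical raw signals for samples Ctrl, T1, T2, T3, T4.
raw = {
    "Gene A": [100.0, 200.0, 400.0, 800.0, 1600.0],
    "Gene B": [5000.0, 5000.0, 10000.0, 2500.0, 5000.0],
}

print(baseline_to_median(raw)["Gene A"])  # relative to the gene's own median
print(ratio_to_control(raw)["Gene A"])    # Exp./Ctrl ratio
```

After either transformation, a weakly expressed gene and a strongly expressed gene sit on the same relative scale, which is what makes their profiles comparable.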

26 Data Filtering: Data with the following characteristics need to be excluded.
Bad quality: the sample does not pass QC; the probe's flags mark it as unusable; low intensity.
Genes that are not stably and significantly differentially expressed: the difference is less than 1.5- or 2-fold; the expression level is not statistically stable.
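As a rough illustration of these exclusion rules, the sketch below applies a flag filter, an intensity filter, and a fold-change filter to one probe. The function name, the raw-signal floor of 50, and the data layout are all hypothetical; the statistical-stability criterion is left out because it needs a separate test.

```python
import math

def keep_probe(flags, raw_signals, log2_ratio, min_raw=50.0, min_fold=2.0):
    """Return True if a probe survives the flag, intensity, and fold filters."""
    # Flag filter: every sample must report the probe as "Detected".
    if any(flag != "Detected" for flag in flags):
        return False
    # Intensity filter: at least one sample must reach the raw-signal floor
    # (min_raw is an illustrative threshold, not a GeneSpring default).
    if max(raw_signals) < min_raw:
        return False
    # Fold filter: the absolute log2 ratio must reach log2(min_fold).
    return abs(log2_ratio) >= math.log2(min_fold)

print(keep_probe(["Detected", "Detected"], [500.0, 1200.0], 1.3))
print(keep_probe(["Detected", "Compromised"], [500.0, 1200.0], 1.3))
```

Setting `min_fold=1.5` reproduces the looser of the two thresholds mentioned on the slide.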

27 Demo

28 User Interface of GeneSpring: Create entity list from selection; Venn Diagram; Import entity list from file. 1. Experiment (array data) 2. Interpretation, gene list, analysis result 3. Display area 4. Analysis tool 5. Graph / function buttons 6. Figure legend

29 Vocabulary in GeneSpring
Project: primary workspace that contains a collection of experiments.
Sample: data from a microarray run for a single biological source.
Experiment: collection of samples that are analyzed as a set.
Parameter: variable in an experiment (ex. Time).
Condition: a specific instance of a parameter for one or more samples that represent a common biological state (ex. Time 14h).
Interpretation: samples that are grouped together based on conditions.
Entity (a.k.a. Gene): a discrete feature measured by microarray analysis, such as a probe or probeset.
Technology: a file package containing information on the array design as well as biological information (annotation) for all the entities on the array.

30 Navigator Hierarchy One project can contain multiple experiments and technologies. Within an experiment, there is an Analysis folder containing all data objects created for the experiment. Data objects (lists, trees, classifications) within an experiment are saved under the input Entity List used for analysis. The idea of this storage system is to visually represent the workflow used to generate the results.

31 Help in GeneSpring 11.5

32 Preparation

33 Preparation Before Using GeneSpring: Technology (array information) + Data (raw or processed) -> GeneSpring 11.5 -> Analysis

34 Supported GE Technologies in GS 11.5: Agilent (One-color / Two-color) Expression Array; Agilent (One-color / Two-color) Exon Array; Agilent miRNA Expression; Agilent Custom Array; Affymetrix Expression Array; Affymetrix SNP Array; Illumina Expression; Illumina SNP Array; ABI 7900 real-time PCR.

35 Prepare Array Data Files. Agilent: Feature Extraction TXT file (FE output). Affymetrix: CEL (recommended) / CHP files. Illumina: Beadstation TXT files for GeneSpring. Others: GPR files from GenePix.

36 Import Commercial Technology: For Agilent, Affymetrix, and Illumina systems, select the technology (genome) and download it with the Update button (downloading one technology may take 5-10 min.).

37 Demo Dataset: Agilent Human 22k GE array, 1-color data, biological triplicate design (n=3). US._Untreated.txt: HeLa cell. US._Treated.txt: HeLa cell treated with Compound X.
Filename | Treatment
US _ _Untreated.txt | Untreated
US _ _Untreated.txt | Untreated
US _ _Untreated.txt | Untreated
US _ _Treated.txt | Treated
US _ _Treated.txt | Treated
US _ _Treated.txt | Treated

38 Step-by-step Analysis: Data Import (Untreated, Treated) -> Set Summarization (Normalization) -> Set Grouping & Interpretation -> Data QC & Filtering -> Find Differentially Expressed Genes (T-test, Fold change) -> Clustering -> GO Analysis / Pathway Analysis for DE genes

39 Data Import & Basic Setting

40 Create New Project

41 Create New Experiment (Agilent Single Color)
Experiment type: Agilent One color / Two color Array; Affymetrix Expression / Exon Expression Array; Illumina One Color Array; Generic One color / Two color Array (create from GPR or TDT).
Workflow type: Advanced Analysis (recommended); Guided Workflow (for a beginner's trial).

42 Choose Array Data Files: The file type depends on the technology you use.

43 Agilent Flag Mapping
Detected: signal is stably present. Not Detected: signal is stably absent. Compromised: signal is unreliable.
Agilent flag values: Flags are attributes that denote the quality of the entities. These flags are generally specific to the technology or the array type used. For Agilent microarrays, GeneSpring now utilizes Agilent's naming convention of Detected, Compromised, and Not Detected when importing experiments from Agilent's Feature Extraction (FE) software. This replaces the previous naming convention used in GeneSpring for Agilent microarray products. Specifically, the Agilent naming convention of Detected replaces the previous Present (P) call, Compromised replaces the Absent call (A), and Not Detected replaces the Marginal call (M).
The resultant flag value of any probe is decided by the following logic. For each probe with multiple flags, the order of importance is Compromised > Not Detected > Detected. If there is even one 'Compromised', the resultant flag is 'Compromised'. If there is no 'Compromised' but both 'Not Detected' and 'Detected' are present, 'Not Detected' is assigned. Only if all flags are 'Detected' is the resultant flag 'Detected'. If there are equal numbers of 'Not Detected' and 'Detected', 'Not Detected' gets preference. At the end of this exercise, each probe is assigned one flag.
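The precedence rule described above (Compromised > Not Detected > Detected) can be sketched in a few lines. The function and constant names are hypothetical; only the three flag labels and their order come from the slide.

```python
# Higher number means higher importance when flags disagree.
FLAG_PRIORITY = {"Detected": 0, "Not Detected": 1, "Compromised": 2}

def resolve_flag(flags):
    """Collapse a probe's multiple flags into the single most important one."""
    if not flags:
        raise ValueError("a probe needs at least one flag")
    return max(flags, key=FLAG_PRIORITY.__getitem__)

print(resolve_flag(["Detected", "Detected", "Detected"]))
print(resolve_flag(["Detected", "Not Detected", "Detected"]))
print(resolve_flag(["Detected", "Not Detected", "Compromised"]))
```

Because the rule is a strict priority order, the number of times each flag appears does not matter, which is why a tie between 'Not Detected' and 'Detected' still resolves to 'Not Detected'.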

44 Set Normalization Algorithm (Chip-Level)

45 Set Normalization Algorithm (Gene-Level)

46 Data Import Completes: A box plot is shown once the data are imported.

47 Experiment Grouping (see the Help Doc): More than one parameter can be added if needed.

48 Create Interpretations Uncheck this option!!

49 More than One Interpretation Can Be Created Averaged Un-averaged

50 Data Analysis

51 Quality Control on Samples Check for hybridization and internal control to exclude bad samples Exclude outliers

52 Filter Probesets by Flags: Check the settings every time you filter genes. Detected only.

53 Filter Probeset by Expression (Signal Intensity) Number of remaining probes Log2 Ratio Raw signal

54 Statistical Analysis (t-test) p-value FDR (corrected p-value)

55 Fold Change: Switching button. Fold Change = Avg. Condition1 / Avg. Condition2
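A small sketch of the fold-change formula, assuming the condition values are log2-normalized signals and that averaging is done in log2 space before converting back to a linear ratio. The function name and the sample values are hypothetical.

```python
from statistics import mean

def fold_change(log2_condition1, log2_condition2):
    """Linear fold change of condition 1 over condition 2 from log2 values.

    Averaging happens in log2 space (an assumption of this sketch), so the
    result is the ratio of the geometric means of the two conditions.
    """
    log2_ratio = mean(log2_condition1) - mean(log2_condition2)
    return 2.0 ** log2_ratio

treated = [3.0, 3.0, 3.0]    # hypothetical log2 signals, triplicate
untreated = [1.0, 1.0, 1.0]

print(fold_change(treated, untreated))   # Treated / Untreated
print(fold_change(untreated, treated))   # after switching the two conditions
```

Switching the two conditions simply inverts the ratio, so a 4-fold increase in one direction reads as 0.25 in the other.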

56 Hierarchical Clustering (Heatmap) Cluster on: entities and conditions Distance metric: Euclidean Linkage rule: Centroid

57 GO Analysis

58 Pathway Analysis

59 GSEA/GSA Analysis: Without pre-filtering by fold change, finds functional gene sets whose expression shifts as a whole (currently used for human data). GSEA uses the rank information of the gene list without applying a threshold. The introduction of the Gene Set Enrichment Analysis PNAS paper discusses the limitations of the threshold-based approach and how GSEA addresses them.

60 Significant enrichment No significant enrichment

61 Data Import (Untreated, Treated) -> Set Summarization (Normalization) -> Set Grouping & Interpretation -> Data QC & Filtering -> Find Differentially Expressed Genes (T-test, Fold change) -> Clustering -> GO Analysis / Pathway Analysis for DE genes

62 Cytogenetic sets (C1). This catalog includes 24 sets, one for each of the 24 human chromosomes, and 295 sets corresponding to cytogenetic bands. These sets are helpful in identifying effects related to chromosomal deletions or amplifications, dosage compensation, epigenetic silencing, and other regional effects.
Functional sets (C2). This catalog includes 472 sets containing genes whose products are involved in specific metabolic and signaling pathways, as reported in eight publicly available, manually curated databases, and 50 sets containing genes coregulated in response to genetic and chemical perturbations, as reported in various experimental papers.
Regulatory-motif sets (C3). This catalog is based on our recent work reporting 57 commonly conserved regulatory motifs in the promoter regions of human genes (11) and makes it possible to link changes in a microarray experiment to a conserved, putative cis-regulatory element.
Neighborhood sets (C4). These sets are defined by expression neighborhoods centered on cancer-related genes.
This database provides an initial collection of gene sets for use with GSEA and illustrates the types of gene sets that can be defined, including those based on prior knowledge or derived computationally.

63

64 Export Result

65 Export Gene List: Save the list as a txt file, then edit and organize it in Excel. Note: values labeled "normalized" in the exported data are log2-transformed. If baseline transformation was applied to the array data, the normalized values are log2 ratios.

66 Export Graph
Copy View: copy the figure and paste it into Word, PowerPoint, or Photoshop.
Export to Image: save the figure as an image file. Select the file format. W:H = 4:3 or W:H = 2.5:4 is recommended. Uncheck to specify the aspect ratio. Uncheck to output the full size of the image. >= 150 dpi is recommended.
Export to HTML: save the figure as an HTML file.

67 On-line Supporting Resources lenteseminar&theaction=poprecord&ecflag=true&recordid= agilenteseminar&theaction=poprecord&ecflag=true&recordid=

68 Lunch Break: Please come back before 1:30 PM.

69 Chip-level Normalization Algorithms
One-color Data Normalization Options
Quantile Algorithms: Quantile normalization (Agilent); RMA / GC-RMA / PLIER / LiWong (Affy).
Constant/Percentile Algorithms: 75th Percentile Shift / Scale (to median, to mean, to N%; Agilent or Affy); MAS5 (normalize to mean and scale mean to 500; Affy); Normalize to control genes / Normalize to External Value (Agilent or Affy).
Two-color Data Normalization: LOWESS (Agilent or Custom).
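Two of the one-color options listed above can be illustrated with a short sketch. It assumes the input values are already log2-transformed and that a percentile shift means subtracting each array's chosen percentile; neither detail is stated on the slide, and the function names are hypothetical. Ties in quantile normalization are not treated specially here.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between closest ranks (pct in 0..100)."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100.0
    low = int(position)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (position - low)

def percentile_shift(log2_sample, pct=75.0):
    """Subtract the array's own percentile so every array shares that anchor."""
    anchor = percentile(log2_sample, pct)
    return [value - anchor for value in log2_sample]

def quantile_normalize(samples):
    """Give every array the same distribution: the mean of the sorted arrays."""
    n = len(samples[0])
    sorted_samples = [sorted(sample) for sample in samples]
    rank_means = [
        sum(s[rank] for s in sorted_samples) / len(samples) for rank in range(n)
    ]
    normalized = []
    for sample in samples:
        order = sorted(range(n), key=lambda i: sample[i])
        row = [0.0] * n
        for rank, index in enumerate(order):
            row[index] = rank_means[rank]
        normalized.append(row)
    return normalized

print(percentile_shift([1.0, 2.0, 3.0, 4.0, 5.0]))
print(quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
```

The percentile shift only moves each array up or down, while quantile normalization also forces the shape of every array's distribution to match.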

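70 Signal Transformation in GeneSpring (diagram: Normalization (Chip-level, Gene-level) -> Data Analysis -> Data Export; signals and ratios are log2 transformed). Use the log2 ratio for addition, subtraction, or averaging. Log2 Ratio (Normalized Signal).

71 Interpretation: The interpretation is the basis for filtering and analysis. Data comparisons and statistical tests all rely on the interpretation settings, so before setting up an interpretation, identify the variables in the experiment and the goals of the analysis, then create a separate interpretation for each analysis need. A poor interpretation steers the results away from the main research question, makes it hard to answer the question of interest, and can even give wrong answers.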

72 Background of Case Study: Congestive heart failure (CHF) is a degenerative condition in which the heart no longer functions effectively as a pump. The most common cause of CHF is damage to the heart muscle from insufficient oxygen. Idiopathic cardiomyopathy results in a weakened heart due to an unknown cause. Ischemic cardiomyopathy is caused by a lack of oxygen to the heart due to coronary artery disease.

73 Experimental Goal To identify the molecular mechanisms underlying congestive heart failure, gene expression profiles were compared between male and female patients with idiopathic, ischemic or non-failing heart conditions.

74 Experimental Setup in GeneSpring
File Name | CHF Etiology | Gender
PAD_4 | Idiopathic | Female
PAD_7 | Idiopathic | Female
PAD_9 | Idiopathic | Male
PAD_10 | Idiopathic | Male
PA-N_249 | Non-failing | Female
PA-N_300 | Non-failing | Male
PA-N_322 | Non-failing | Male
PA-N_326 | Non-failing | Female
PAS_3 | Ischemic | Female
PAS_6 | Ischemic | Female
PAS_7 | Ischemic | Male
PAS_8 | Ischemic | Male
Gender Interpretation (6 samples per condition): Condition 1: Female; Condition 2: Male.
CHF Etiology Interpretation (4 samples per condition): Condition 1: Idiopathic; Condition 2: Ischemic; Condition 3: Non-failing.
Gender/CHF Etiology Interpretation (2 samples per condition): Condition 1: Female/Idiopathic; Condition 2: Male/Idiopathic; Condition 3: Female/Ischemic; Condition 4: Male/Ischemic; Condition 5: Female/Non-failing; Condition 6: Male/Non-failing.
The selected Interpretation determines how the samples are displayed in the various views and which comparisons are made in analyses such as statistics.
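The way one sample table yields three different interpretations can be shown with a small grouping sketch. The sample names and parameter values are taken from the slide; the function name and data layout are hypothetical.

```python
from collections import defaultdict

# (file name, CHF etiology, gender) as listed in the experimental setup.
SAMPLES = [
    ("PAD_4", "Idiopathic", "Female"), ("PAD_7", "Idiopathic", "Female"),
    ("PAD_9", "Idiopathic", "Male"), ("PAD_10", "Idiopathic", "Male"),
    ("PA-N_249", "Non-failing", "Female"), ("PA-N_300", "Non-failing", "Male"),
    ("PA-N_322", "Non-failing", "Male"), ("PA-N_326", "Non-failing", "Female"),
    ("PAS_3", "Ischemic", "Female"), ("PAS_6", "Ischemic", "Female"),
    ("PAS_7", "Ischemic", "Male"), ("PAS_8", "Ischemic", "Male"),
]

def build_interpretation(samples, parameters):
    """Group sample names into conditions defined by the chosen parameters."""
    conditions = defaultdict(list)
    for name, etiology, gender in samples:
        values = {"CHF Etiology": etiology, "Gender": gender}
        key = "/".join(values[parameter] for parameter in parameters)
        conditions[key].append(name)
    return dict(conditions)

for parameters in (["Gender"], ["CHF Etiology"], ["Gender", "CHF Etiology"]):
    interpretation = build_interpretation(SAMPLES, parameters)
    sizes = {condition: len(names) for condition, names in interpretation.items()}
    print(parameters, sizes)
```

The same twelve samples give 2 conditions of 6, 3 conditions of 4, or 6 conditions of 2, so the choice of interpretation directly sets the number of replicates available to each statistical comparison.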

75 Statistical Tests
One-way Tests: compare conditions defined by a single parameter.
  T-Test: Time 0 hr, Time 24 hr.
  ANOVA: Time 0 hr, Time 24 hr, Time 24 hr.
N-way Tests: compare conditions defined by 2 or more parameters.
  2-Way: Time (0 hr, 24 hr) x Treatment (Control, Drug A).
  3-Way: Time (0 hr, 24 hr) x Treatment (Control, Drug A) x Genotype (WT, KO).

76 One-way Tests Comparing Two Conditions
Parametric Tests: T-test unpaired; T-test paired; T-test unpaired unequal variance.
Non-parametric Tests: Mann-Whitney unpaired (Wilcoxon Rank-Sum test); Mann-Whitney paired.
One-way Tests Comparing More Than Two Conditions
Parametric Tests: ANOVA; ANOVA unequal variance (Welch ANOVA); Repeated measures.
Non-parametric Tests: Kruskal-Wallis; Friedman.

77 P-value Calculation Methods
Asymptotic Method: Assumes that the expression values for a gene within each population are normally distributed and that variances are equal between populations. It therefore assumes that the test metrics (t-ratio, F-ratio) follow their theoretical distributions. If you do not want to make these assumptions, use the permutation method to compute the p-value.
Permutation Method: Does not assume an underlying distribution. Permute the samples and build the distribution of the test metric for the probe. The p-value is the fraction of permutations in which the computed test metric is larger than the actual test metric for that gene.
Un-paired or Paired
Un-paired Design: Data were collected from different populations. Ex. Patient vs. Normal.
Paired Design: Data were collected from the same population. Ex. Before treatment vs. After treatment.
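The permutation method can be sketched for a single gene and two unpaired groups. This is an illustration under stated assumptions, not GeneSpring's implementation: the test metric is the absolute difference of group means rather than a t-ratio, every possible regrouping is enumerated instead of sampled, and arrangements that tie with the observed metric are counted (so the observed arrangement itself is included).

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group1, group2):
    """Exact permutation p-value for the absolute difference of group means."""
    pooled = list(group1) + list(group2)
    size1 = len(group1)
    observed = abs(mean(group1) - mean(group2))
    hits = 0
    total = 0
    # Enumerate every way of relabeling size1 of the pooled values as group 1.
    for chosen in combinations(range(len(pooled)), size1):
        chosen_set = set(chosen)
        first = [pooled[i] for i in chosen_set]
        second = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        total += 1
        # Small tolerance so floating-point noise does not drop exact ties.
        if abs(mean(first) - mean(second)) >= observed - 1e-12:
            hits += 1
    return hits / total

# Hypothetical log2 signals for one gene, triplicates per condition.
print(permutation_p_value([1.0, 2.0, 3.0], [7.0, 8.0, 9.0]))
```

With three replicates per group there are only 20 possible regroupings, so the smallest attainable two-sided p-value is 2/20; this is one reason permutation tests need more replicates to reach small p-values.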

78 To correct or not to correct?
16,000 genes: number of input genes. 2,000 genes: number of genes after statistical filtering.
P-value < 0.05: the number of genes in the resulting list that may be false is 0.05 x 16,000 = 800.
Corrected P-value (FDR) < 0.05: the number of genes in the resulting list that may be false is 0.05 x 2,000 = 100.
Multiple Testing Correction (MTC), ordered from more false negatives to more false positives: Bonferroni (FWER); Bonferroni-Holm (FWER); Benjamini-Hochberg (FDR); No Correction.
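The arithmetic on this slide and two of the listed corrections can be sketched as follows. The functions use the standard textbook definitions of the Bonferroni and Benjamini-Hochberg adjustments; the function names and the example p-values are hypothetical.

```python
def expected_false_positives(threshold, count):
    """Expected number of false calls: threshold times the relevant gene count."""
    return threshold * count

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted

print(expected_false_positives(0.05, 16000))  # uncorrected p-value, all input genes
print(expected_false_positives(0.05, 2000))   # FDR, genes passing the filter
print(bonferroni([0.01, 0.04, 0.03, 0.005]))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Bonferroni inflates every p-value by the full number of tests, while Benjamini-Hochberg scales each one by its rank, which is why it keeps more genes at the cost of tolerating a controlled share of false positives.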

79 Clustering Algorithms
K-means or SOM: mostly used to find groups of genes that share a consistent trend of change. Hierarchical: used to examine the similarity between samples and to compare how genes change.
Hierarchical: Hierarchical clustering is one of the simplest and most widely used clustering techniques for the analysis of gene expression data. The method follows an agglomerative approach, where the most similar expression profiles are joined together to form a group. These are further joined in a tree structure until all data form a single group. The dendrogram is the most intuitive view of the results of this clustering method.
K-Means: This is one of the fastest and most efficient clustering techniques available, if there is some advance knowledge about the number of clusters in the data. Entities are partitioned into a fixed number (k) of clusters such that entities/conditions within a cluster are similar, while those across clusters are dissimilar.
Self Organizing Maps (SOM): SOM clustering is similar to K-means clustering in that it is based on a divisive approach where the input entities/conditions are partitioned into a fixed, user-defined number of clusters. Besides clusters, SOM produces additional information about the affinity or similarity between the clusters themselves by arranging them on a 2D rectangular or hexagonal grid.

80 Distance Measures
Euclidean: Standard L2-norm distance between two entities (the square root of the sum of squared differences).
Squared Euclidean: Square of the Euclidean distance measure. This accentuates the distance between entities. Entities that are close are brought closer, and those that are dissimilar move further apart.
Manhattan: Also known as the L1-norm. The sum of the absolute values of the differences in each dimension is used to measure the distance between entities.
Chebyshev: This measure, also known as the L-Infinity-norm, uses the absolute value of the maximum difference in any dimension.
Differential: The distance between two entities is estimated by calculating the difference in slopes between the expression profiles of the two entities and computing the Euclidean norm of the resulting vector. This is a useful measure in time series analysis, where changes in the expression values over time are of interest, rather than absolute values at different times.
Pearson Absolute: This measure is the absolute value of the Pearson Correlation Coefficient between two entities. Highly related entities give values of this measure close to 1, while unrelated entities give values close to 0.
Pearson Centered: This measure is the 1-centered variation of the Pearson Correlation Coefficient. Positively correlated entities give values of this measure close to 1, negatively correlated ones give values close to 0, and unrelated entities give values close to 0.5.
Pearson Uncentered: This measure is similar to the Pearson Correlation Coefficient except that the entities are not mean-centered. In effect, this measure treats the two entities as vectors and gives the cosine of the angle between the two vectors. Highly correlated entities give values close to 1, negatively correlated entities give values close to -1, and unrelated entities give values close to 0.
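Several of the measures described above are short enough to write out directly. The sketch follows the verbal definitions on the slide; reading "Pearson Centered" as (1 + r) / 2 is an assumption made so that the values 1, 0.5, and 0 line up with the description. Function names and the example profiles are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length profiles."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = sum((x - mean_a) ** 2 for x in a)
    spread_b = sum((y - mean_b) ** 2 for y in b)
    return covariance / math.sqrt(spread_a * spread_b)

def pearson_absolute(a, b):
    return abs(pearson(a, b))

def pearson_centered(a, b):
    # Assumed mapping of r from [-1, 1] onto [0, 1].
    return (1.0 + pearson(a, b)) / 2.0

def pearson_uncentered(a, b):
    """Cosine of the angle between the two profiles (no mean-centering)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

gene_1 = [1.0, 2.0, 3.0]   # hypothetical expression profiles
gene_2 = [2.0, 4.0, 6.0]
gene_3 = [3.0, 2.0, 1.0]

print(euclidean(gene_1, gene_2), manhattan(gene_1, gene_2), chebyshev(gene_1, gene_2))
print(pearson_centered(gene_1, gene_2), pearson_centered(gene_1, gene_3))
```

Note that gene_1 and gene_2 are far apart by Euclidean distance yet perfectly correlated, which is why correlation-based measures are often preferred when the shape of the profile matters more than its magnitude.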

81 Linkage Rules
Single Linkage: Distance between two clusters is the minimum distance between the members of the two clusters.
Complete Linkage: Distance between two clusters is the greatest distance between the members of the two clusters.
Average Linkage: Distance between two clusters is the average of the pairwise distances between entities in the two clusters.
Centroid Linkage: Distance between two clusters is the distance between their respective centroids. This is the default linkage rule.
Ward's Method: This method is based on the ANOVA approach. It computes the sum of squared errors around the mean for each cluster. Then, two clusters are joined so as to minimize the increase in error.
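The first four rules differ only in how they summarize the distances between members of two clusters, as this sketch shows. It assumes Euclidean distance between entities; the function names and the two example clusters are hypothetical.

```python
import math
from itertools import product

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise(cluster1, cluster2):
    """All distances between a member of cluster1 and a member of cluster2."""
    return [euclidean(p, q) for p, q in product(cluster1, cluster2)]

def single_linkage(cluster1, cluster2):
    return min(pairwise(cluster1, cluster2))

def complete_linkage(cluster1, cluster2):
    return max(pairwise(cluster1, cluster2))

def average_linkage(cluster1, cluster2):
    distances = pairwise(cluster1, cluster2)
    return sum(distances) / len(distances)

def centroid(cluster):
    """Coordinate-wise mean of the cluster members."""
    return [sum(coords) / len(cluster) for coords in zip(*cluster)]

def centroid_linkage(cluster1, cluster2):
    return euclidean(centroid(cluster1), centroid(cluster2))

cluster_a = [(0.0, 0.0), (0.0, 2.0)]   # hypothetical two-condition profiles
cluster_b = [(3.0, 0.0), (3.0, 2.0)]

for rule in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(rule.__name__, rule(cluster_a, cluster_b))
```

Single linkage tends to chain clusters together through their nearest members, while complete linkage favors compact clusters; average and centroid linkage sit between the two.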

82 Overlay Expression Information on the Pathway: Right-click and select properties. Choose interpretation overlap and the appropriate interpretation.

83 Legend: Nodes (molecular and biological entities); Relations (biological relations between entities); Edges (regulatory effects of one node on another). Node-Legend, Edges-Legend, Relation-Legend.

84 Merge Entity List with Venn Diagram: Hold the Ctrl key and click the mouse button to select multiple gene groups. Press the Create Entity List button to create a gene list.

85 Add p-value and fold change value to specific entity list by Venn Diagram P-value Specific List Fold Change

86 Create New Gene-level Experiment: In the new experiment, the signal values are averaged over the probes that map to the same gene (Entrez ID).
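Averaging probes per Entrez ID can be sketched as below. The probe names, the signal values, and the choice to drop probes without an Entrez ID are all assumptions of this illustration; the slide only states that values are averaged over the probes of a gene.

```python
from collections import defaultdict
from statistics import mean

def collapse_to_gene_level(probe_values, probe_to_entrez):
    """Average per-sample signals over all probes sharing an Entrez ID."""
    grouped = defaultdict(list)
    for probe, values in probe_values.items():
        entrez_id = probe_to_entrez.get(probe)
        if entrez_id is None:
            continue  # assumption: probes without an Entrez ID are left out
        grouped[entrez_id].append(values)
    # zip(*rows) walks the samples column by column across the gene's probes.
    return {
        entrez_id: [mean(column) for column in zip(*rows)]
        for entrez_id, rows in grouped.items()
    }

# Hypothetical probes measured in two samples.
probe_values = {
    "probe_1": [1.0, 3.0],
    "probe_2": [3.0, 5.0],
    "probe_3": [10.0, 20.0],
    "probe_4": [7.0, 7.0],
}
probe_to_entrez = {"probe_1": "7157", "probe_2": "7157", "probe_3": "1956"}

print(collapse_to_gene_level(probe_values, probe_to_entrez))
```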

87 Thank You for Using GeneSpring!