Analysis pipe-line

Arron Mills
6 years ago
1 Analysis pipe-line. Platform specific devices: Sample Preparation, Array Fabrication, Hybridization, Scanning + Image Analysis. Bioconductor: Quality control, Normalization, Filtering, Statistical analysis, Annotation, Biological Knowledge extraction.

2 onechannelgui This is a graphical interface to Bioconductor libraries devoted to the analysis of data derived from single channel platforms. affylmgui is a graphical interface to the limma library, which allows differential expression detection by means of linear model analysis. onechannelgui is an extension of affylmgui capabilities.

3 onechannelgui
3' IVT / gene arrays: Primary analysis (probe level QC, probe set summary and normalization), secondary analysis (replicates QC, filtering, statistical analysis, classification) and data mining (GO enrichment).
Exon arrays: Secondary analysis (replicates QC, filtering, statistical analysis, classification, basic Splice Index inspection) using Expression Console as source of primary data.
Large data sets (i.e. probe set expression in tab delimited format): Secondary analysis (replicates QC, filtering, statistical analysis, classification) using Expression Console/GEO/ArrayExpress data as source of primary data.

4 Starting R and onechannelgui. Setting the virtual RAM at 2GB:
C:\..\R\R-2.3.0\bin\Rgui.exe --max-mem-size=2048m

5 Double click on R to start.

6 Click on Package to load Bioconductor packages.

7 Click on Load package to select the onechannelgui package. Click on OK to load the onechannelgui package.

8 Click on Yes to start the affylmgui interface. Wait a few seconds! Click on Yes to start the onechannelgui interface.

9 Standard affylmgui menu. Overlaying onechannelgui to affylmgui will change the default affylmgui menu to the onechannelgui menu for 3' IVT Affymetrix arrays. onechannelgui menu for 3' IVT arrays.

10 Summary of loaded data: none is available since no CEL files have been loaded.

11 Click on File to start a new project. Click on New to start a new project. Select 3' IVT arrays. Select as working dir the folder containing the .cel files.

12 Select the targets file, then press OK to continue. The targets file is a tab delimited text file containing the description of the experiment. It is made of three columns:
Name: the name you want to assign to each array.
FileName: the name of the corresponding .cel file.
Target: the experimental condition associated with the array (e.g. mock, treated, etc.). At least two conditions should be present.
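The structure of the targets file can be checked before loading it. The following is a minimal sketch in Python (not part of onechannelgui); the function name `check_targets` and the example array and file names are hypothetical, and only the three column names and the two-condition rule come from the slide.

```python
import csv
import io

REQUIRED_COLUMNS = ["Name", "FileName", "Target"]

def check_targets(text):
    """Parse a tab delimited targets table and check the rules stated on the slide."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    if not rows:
        raise ValueError("targets file describes no arrays")
    missing = [c for c in REQUIRED_COLUMNS if c not in rows[0]]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    # At least two experimental conditions must be present.
    conditions = {row["Target"] for row in rows}
    if len(conditions) < 2:
        raise ValueError("at least two conditions are required")
    return rows

# Hypothetical example content; array and file names are made up.
EXAMPLE = (
    "Name\tFileName\tTarget\n"
    "m1\tm1.CEL\tmock\n"
    "m2\tm2.CEL\tmock\n"
    "t1\tt1.CEL\ttreated\n"
    "t2\tt2.CEL\ttreated\n"
)

rows = check_targets(EXAMPLE)
print(len(rows), sorted({r["Target"] for r in rows}))
```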

13 Widget to create a target for Affy arrays

14 Widget to create a target for Affy arrays

15 Widget to create a target for Affy arrays

16 Widget to create a target for Affy arrays. Skip it.

17 Define the name of your analysis. Press OK to continue... Now the arrays will be loaded in a specific R object called environment. Raw data are now loaded and are ready for normalization.

18 Analysis pipe-line Quality control Normalization Filtering Biological Knowledge extraction Statistical analysis Annotation

19 The next steps are a few simple basic quality controls. Click on the Quality Control menu.

20 You can now evaluate: intensity histogram for one array at a time.

21 You can now evaluate: intensity density plot for one array at a time.

22 You can now evaluate: all array intensities as box plots.

23 It is possible that the cRNA concentration in sample se2 was overestimated and a low cRNA amount was loaded on the array. As a result, a lot of signals are below the value log2(100) = 6.64.

24 Some other basic controls can be done after the calculation of the probe set intensity summary, using a special Bioconductor library, affyplm. Fit the model (BE PATIENT!!!). The end of the fitting procedure is signalled by a message. Then the NUSE/RLE function is automatically called.

25 affyplm QC library. affyplm provides a number of useful tools based on probe-level modelling procedures. The affyplm package allows quality control of arrays.

26 What is a Probe Level Model? A Probe Level Model (PLM) is a model that is fit to probe-intensity data. affyplm fits a model with probe level and chip level parameters on a probe set by probe set basis. In quality control chip level parameters are a factor variable with a level for each array.

27 What is a PLMset? The main function for fitting a PLM is the function fitplm. This function will fit a linear model with an effect estimated for each chip and an effect for each probe. fitplm implements iteratively re-weighted least squares M-estimation regression. The fitted model is stored in a PLMset object containing chip level parameter estimates and the corresponding standard errors.

28 Default fitted model: $\log_2(PM_{kij}) = \beta_{kj} + \alpha_{ki} + \varepsilon_{kij}$, where $\beta_{kj}$ is the log2 probe set expression value on array j for probeset k and $\alpha_{ki}$ are probe effects. To make the model identifiable the constraint $\sum_{i=1}^{I} \alpha_{ki} = 0$ is used. For this default model, the parameter estimates given are probe set expression values.

29 Relative Log Expression (RLE) RLE values are computed for each probe set by comparing the expression value on each array against the median expression value for that probeset across all arrays. Assuming that most genes are not changing in expression across arrays, ideally most of these RLE values will be near 0. Boxplots of these values, for each array, provide a quality assessment tool. RLE plots: estimate the expression $\hat{\theta}_{gi}$ for each gene g on each array i, then compute the median value across arrays for each gene.

30 Relative Log Expression (RLE)
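The RLE computation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the affyplm implementation; the probe set names and expression values are made up.

```python
from statistics import median

def rle(expr):
    """expr: {probe set: [log2 expression per array]} -> RLE values, same shape."""
    # Subtract from each value the median of its probe set across arrays.
    return {ps: [v - median(vals) for v in vals] for ps, vals in expr.items()}

def rle_by_array(expr):
    """Regroup RLE values per array, which is what each boxplot summarizes."""
    rl = rle(expr)
    n_arrays = len(next(iter(rl.values())))
    return [[rl[ps][j] for ps in rl] for j in range(n_arrays)]

# Hypothetical data: three probe sets on four arrays; the last array runs high.
expr = {
    "ps1": [8.0, 8.1, 7.9, 9.0],
    "ps2": [5.0, 5.2, 4.8, 6.1],
    "ps3": [10.0, 9.9, 10.1, 11.2],
}
medians = [median(col) for col in rle_by_array(expr)]
print(medians)
```

An array whose RLE median sits far from 0, like the fourth one here, is the kind of array the boxplots are meant to flag.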

31 Normalized Unscaled Standard Errors (NUSE) The standard error measures the amount of error made when fitting y for every x value. Normalized Unscaled Standard Errors (NUSE) can also be used for assessing quality. The standard error estimates obtained for each gene on each array from fitplm are taken and standardized across arrays so that the median standard error for that gene is 1 across all arrays. This process accounts for differences in variability between genes. An array where there are elevated SE relative to the other arrays is typically of lower quality. Boxplots of these values, separated by array, can be used to compare arrays.

32 $\mathrm{NUSE}(\hat{\theta}_{gi}) = \dfrac{SE(\hat{\theta}_{gi})}{\mathrm{med}_i\, SE(\hat{\theta}_{gi})}$
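As a small illustration of the NUSE standardization, the sketch below divides each standard error by the median standard error of the same gene across arrays. It is written in Python with made-up values and is not the affyplm code.

```python
from statistics import median

def nuse(se):
    """se: {gene: [standard error per array]} -> SE scaled so each gene has median 1."""
    return {g: [v / median(vals) for v in vals] for g, vals in se.items()}

# Hypothetical standard errors for two genes on three arrays.
se = {"g1": [0.10, 0.10, 0.20], "g2": [0.30, 0.30, 0.60]}
scaled = nuse(se)
print(scaled)
```

Although the two genes have different raw standard errors, both show the third array at about twice the median, which is the between-gene comparability the normalization is meant to give.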

35 Since the fitplm object can be very big, it is a good idea to delete it after quality control. (Panels: before Delete PLM, after Delete PLM.)

36 Analysis pipe-line Quality control Normalization Filtering Biological Knowledge extraction Statistical analysis Annotation

37 Analysis steps: affylmgui. Calculating probe set summaries: RMA, GCRMA, PLIER. Normalization: quantile method.

38 Brief summary about probe set intensity calculation RMA methodology (Irizarry et al., 2003) performs background correction, normalization, and summarization in a modular way. RMA does not take into account non-specific probe hybridization in probe set background calculation. GCRMA is a version of RMA with a background correction component that makes use of probe sequence information (Wu et al., 2004). The PLIER (Probe Logarithmic Intensity Error) method produces an improved signal by accounting for experimentally observed patterns in probe behavior and handling error appropriately at low and high signal values. Methods such as PLIER+16 and GCRMA, which use model-based background correction, maintain relatively good accuracy without losing much precision.

39 Why Normalization? To remove systematic biases, which include: sample preparation, variability in hybridization, spatial effects, scanner settings, experimenter bias. Extracted from D. Hyle presentation.

40 What Normalization Is & What It Isn't. Methods and algorithms. Applied after some image analysis. Applied before subsequent data analysis. Allows comparison of experiments. Not a cure for poor data.

41 Quantile normalization Extracted from Irizarry presentation at Bioconductor Course (Brixen IT, 2005)

42 Extracted from Irizarry presentation at Bioconductor Course (Brixen IT, 2005)

43 The next step is normalization and calculation of the probe set summary. Click on the probe set menu and select the probe set summary and normalization option.

44 Normalization and intensity calculation come together. Three normalization/intensity calculation options are available:
RMA + quantile normalization
GCRMA + quantile normalization
PLM + quantile normalization
At any time it is possible to check the structure of the normalized data set.
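All three options rely on quantile normalization, which forces every array to share the same intensity distribution. The sketch below shows the idea in Python on toy data; it is not the Bioconductor implementation, and ties are simply broken by position, which is an assumption of this sketch.

```python
def quantile_normalize(columns):
    """columns: one list of intensities per array, all of equal length."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)]
    result = []
    for col in columns:
        # Rank the values of this array, then replace each by the reference quantile.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        result.append(out)
    return result

# Hypothetical intensities for three probe sets on two arrays.
normalized = quantile_normalize([[5, 2, 3], [4, 1, 6]])
print(normalized)
```

After normalization both arrays contain exactly the same set of values; only the assignment of values to probe sets differs.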

45 Replicates quality control To evaluate the quality of sample replicates we will use a dimensionality reduction technique called principal component analysis (PCA).

46 Principal component analysis Principal component analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible. Each succeeding component accounts for as much of the remaining variability as possible. The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.
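To make the idea concrete, the following Python sketch finds only the first principal component by power iteration and projects the arrays onto it. It is an illustration under simplifying assumptions (fixed starting direction, fixed number of iterations, made-up data) and is not the routine used by onechannelgui.

```python
import math

def first_principal_component(samples, iterations=200):
    """samples: one row of probe set summaries per array. Returns (axis, scores)."""
    n, p = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in samples]
    # Arbitrary asymmetric starting direction (an assumption of this sketch).
    axis = [1.0 / (j + 1) for j in range(p)]
    for _ in range(iterations):
        # One power iteration step: multiply the axis by X^T X.
        scores = [sum(r[j] * axis[j] for j in range(p)) for r in centered]
        new = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in new))
        if norm == 0:
            break
        axis = [v / norm for v in new]
    scores = [sum(r[j] * axis[j] for j in range(p)) for r in centered]
    return axis, scores

# Hypothetical data: four arrays, three probe sets; the last array is shifted.
samples = [
    [8.0, 5.0, 10.0],
    [8.1, 5.2, 9.9],
    [7.9, 4.8, 10.1],
    [9.0, 6.1, 11.2],
]
axis, scores = first_principal_component(samples)
outlier = max(range(len(scores)), key=lambda i: abs(scores[i]))
print(outlier)
```

Replicates of good quality should have similar scores; an array that lies far from the others along the first component is a candidate outlier.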

47 To perform sample replicates QC we use principal component analysis (PCA). This check is performed on probe set summaries! t4 is clearly an outlier!

48 Analysis pipe-line Quality control Normalization Filtering Biological Knowledge extraction Statistical analysis Annotation

49 Filtering Filtering affects the false discovery rate. The researcher is interested in keeping the number of tests/genes as low as possible while keeping the interesting genes in the selected subset. If the truly differentially expressed genes are overrepresented among those selected in the filtering step, the FDR associated with a certain threshold of the test statistic will be lowered due to the filtering. Extracted from: Heydebreck et al. Bioconductor Project Working Papers 2004

50 Filtering can be performed at various levels.
Annotation features: specific gene features (i.e. GO term, presence of transcriptional regulatory elements in promoters, etc.).
Signal features: % of intensities greater than a user defined value; interquartile range (IQR) greater than a defined value.

51 Specific gene feature In transcriptional studies focusing on genes characterized by a specific feature (i.e. transcription factor elements in promoters), the best filtering approach is selecting only those genes linked to that feature. For example: Identification of genes modulated by estradiol:ER or IGF1 by direct binding to Estrogen-Responsive Elements (ERE): HGU133plus2: probe sets Entrez Genes HGU133plus2 with ERE in putative promoter regions: 6764 probe sets, 3058 Entrez Genes

52 Specific gene feature Data derived from a specifically devoted annotation data set can be used for functional filtering. The Ingenuity Pathways Knowledge Base is the world's largest curated database of biological networks created from millions of individually modeled relationships between proteins, genes, complexes, cells, tissues, drugs and diseases. The Ingenuity Pathways Analysis software (IPA) identifies relations between genes. The relations that can be grasped are: regulates, regulated by, binds.

53 Start an Ingenuity session at:

54 Specific classes of proteins can be searched and exported

56 A key word can also be used to perform a wide search

57 After selecting the Functions & diseases of interest, genes should be visualized as gene details before being exported to a file to be used for filtering expression data. Export the results in a table as previously.

58 The Entrez Gene IDs present in this file can be used to extract a specific subset of genes. To filter using a list of EG, extract from the IPA table only the Entrez Genes of interest and save them in a text file without header.

60 Non specific filtering This technique has as its premise the removal of genes that are deemed to be not expressed or unchanged according to some specific criterion that is under the control of the user. The aim of non specific filtering is to remove genes that, e.g. due to their low overall intensity or variability, are unlikely to carry information about the phenotypes under investigation. Extracted from: Heydebreck et al. Bioconductor Project Working Papers 2004
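The two signal-based criteria (fraction of arrays above an intensity threshold, and IQR above a threshold) can be sketched as follows. This Python example is an illustration only: the thresholds, probe set names and values are made up, and applying both criteria together is a choice of this sketch rather than something the interface imposes.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range of the log2 intensities of one probe set."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

def passes_intensity(values, threshold, min_fraction):
    """True if at least min_fraction of the arrays exceed the intensity threshold."""
    above = sum(1 for v in values if v > threshold)
    return above / len(values) >= min_fraction

def nonspecific_filter(expr, threshold, min_fraction, min_iqr):
    """Keep probe sets that are both expressed and variable across arrays."""
    return {
        ps: vals
        for ps, vals in expr.items()
        if passes_intensity(vals, threshold, min_fraction) and iqr(vals) >= min_iqr
    }

# Hypothetical log2 intensities on four arrays.
expr = {
    "low": [5.0, 5.1, 5.2, 5.0],
    "flat": [9.0, 9.0, 9.1, 9.0],
    "changing": [8.0, 8.2, 10.1, 10.3],
}
kept = nonspecific_filter(expr, threshold=6.64, min_fraction=0.5, min_iqr=0.25)
print(list(kept))
```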

63 In this example only those genes will be selected that have, in at least 50% of the arrays, an intensity

65 QC and filtering for exon data At the time onechannelgui was set up, Bioconductor tools for handling raw data from Affymetrix exon arrays were not available. For this reason onechannelgui uses the libraries and primary analysis outputs from Affymetrix Expression Console. Exon raw data quality control is done using the Expression Console. Sample QC and filtering are performed in onechannelgui.

66 Exon arrays on onechannelgui Gene level and exon level data from Expression Console are loaded into onechannelgui. The user needs to specify where the Expression Console library files are located each time a new exon data set is loaded.

67 When loading an exon array data set, it is necessary to indicate the organism and which kind of exon data are going to be loaded (core, extended, full). It is also necessary to specify the location of the Expression Console libraries.

68 Subsequently three files have to be loaded: the targets file, which has the same structure previously described, and the tab delimited files containing GENE and EXON level data exported from the Expression Console.

69 A new Menu is then available for exon data

70 Exon arrays QC on onechannelgui

71 The brain (b) replicates are very poor. The quality is particularly bad for exon data. However, we have to consider that these data are derived from tissues coming from different post-mortem donors.

72 Exon arrays filtering Since the knowledge on exon data is still relatively limited, we have little empirical information about the background threshold. Exon/intron housekeeping gene information available in exon data might be a possible approach to define it. Different color lines indicate the possible thresholds to be selected. The intensity density plots for introns are shown in black and those for exons in red.

73 The IQR filter works as described for 3' IVT arrays. However, any filter done at gene level will also affect the corresponding exon data. (Panels: starting condition, after filtering.)

74 Intensity filter is instead based on the threshold previously selected on the basis of exon/intron HK expression signals. In this example we are keeping only the genes where all samples have a signal greater than the pre-defined BG.

75 Splice Index The Splicing Index captures the basic metric for the analysis of alternative splicing. It is a measure of how much exon specific expression (with gene induction factored out) differs between two samples. Defining a function-oriented data set for splice index calculation.
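One common way to write this measure is to normalize each exon signal by the signal of its gene and then take the log2 ratio of the normalized values between the two samples. The slide only gives the verbal description, so the exact formula below is an assumption of this Python sketch, and the signals are made up.

```python
import math

def splice_index(exon_a, gene_a, exon_b, gene_b):
    """Linear-scale signals in; log2 ratio of gene-normalized exon levels out."""
    normalized_a = exon_a / gene_a  # exon level with gene expression factored out
    normalized_b = exon_b / gene_b
    return math.log2(normalized_a / normalized_b)

# Hypothetical case 1: the gene is unchanged but the exon signal halves in sample A.
si_skipped = splice_index(exon_a=200, gene_a=400, exon_b=400, gene_b=400)
# Hypothetical case 2: the exon changes only because the whole gene is induced.
si_induced = splice_index(exon_a=200, gene_a=400, exon_b=400, gene_b=800)
print(si_skipped, si_induced)
```

A value near 0 means the exon follows its gene; a value far from 0 suggests that the exon is included at different rates in the two samples.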

76 Use a set of function-oriented EGs to select probe set IDs.

83 Use the selected probe set IDs for Filtering using a list of probe sets.

85 ATTENTION: this is only a very rough descriptive instrument! Much work needs to be done on exon analysis! Splice Index inspection is performed by modelling the splice index exon profiles for two experimental conditions. Results are saved in a pdf file in your working dir.

86 The subset of splice indexes to be inspected is defined using two filters:

87 Example of one gene output.

88 This plot gives some advice about the scattering levels of the Splice Indexes over the gene under analysis. Model of splice indexes over the two experimental conditions. Red dashed lines indicate the confidence interval of the model.

89 Plots of the significance p-value of the alternative splicing versus the average Splice Index values. In this example only one exon seems to be differentially spliced.

90 Significance p-value of the alternative splicing versus the average Splice Index values. In this example only one exon seems to be differentially spliced. Filtering conditions are shown over the plot of intensity values versus exon number.

91 Analysis pipe-line Quality control Normalization Filtering Biological Knowledge extraction Statistical analysis Annotation. This step is the same for 3' IVT arrays and exon arrays gene level analyses.

92 Fold change filtering The intensity change between experimental groups (i.e. control versus treated) is known as fold change. Frequently an arbitrary threshold $\log_2(\mathrm{Trtd}/\mathrm{Ctrl}) = 1$ is used to define a significant differential expression.

93 Fold change filtering There are no rules to define the correct fold change (fc) threshold for differential expression. fc > 1 is an arbitrary threshold. Fc threshold estimation is dependent on the % of fc fluctuations due to experimental reasons. Fc threshold estimation can be better appreciated in time/concentration course experiments. Biologically speaking, many small variations all together can be functionally important (i.e. fc = 0.5 for all chr 21 genes induces Down syndrome).

94 Statistical analysis Intensity changes between experimental groups (i.e. control versus treated) are known as: Fold change. Ranking genes based on fold change alone implicitly assigns equal variance to every gene. Fold change alone is not sufficient to indicate the significance of the expression changes. Fold change has to be supported by statistical information.
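A toy example shows why fold change needs statistical support. The Python sketch below computes the log2 fold change and a pooled-variance t statistic for two made-up genes that have the same fold change but very different replicate variability; the function names and numbers are illustrative only.

```python
import math
from statistics import mean, variance

def log2_fold_change(treated, control):
    """Inputs are log2 intensities, so the difference of means is the log2 fold change."""
    return mean(treated) - mean(control)

def pooled_t(treated, control):
    """Two-sample t statistic with pooled variance."""
    n1, n2 = len(treated), len(control)
    pooled = ((n1 - 1) * variance(treated) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return log2_fold_change(treated, control) / math.sqrt(pooled * (1 / n1 + 1 / n2))

# Hypothetical log2 intensities (treated, control) for two genes.
stable = ([9.0, 9.1, 8.9], [8.0, 8.1, 7.9])
noisy = ([10.5, 9.0, 7.5], [9.5, 8.0, 6.5])

for name, (treated, control) in {"stable": stable, "noisy": noisy}.items():
    print(name, round(log2_fold_change(treated, control), 2), round(pooled_t(treated, control), 2))
```

Both genes pass a log2 fold change threshold of 1, yet only the first one has a large t statistic; ranking by fold change alone would treat them as equally convincing.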

95 Multiple testing errors When performing multiple statistical tests, two types of errors can occur: Type I error (false positive) and Type II error (false negative). Reduction of type I errors increases the number of type II errors. It is important to identify an approach that reduces false positives with the minimum loss of information (false negatives).

96 Statistical analysis The sensitivity of statistical tests is affected by the number of available replicates. Replicates can be technical or biological. Biological replicates better summarize the variability of samples belonging to a common group. The minimum number of replicates is an important issue!

97 How important are replicates? Yang YH and Speed T, 2002

98 Sample size Microarray experiments are often performed with a small number of biological replicates, resulting in low statistical power for detecting differentially expressed genes and concomitant high false positive rates. The issue of how many replicates are required in a typical experimental system needs to be addressed. Of particular interest is the difference in required sample sizes for similar experiments in inbred vs. outbred populations (e.g. mouse and rat vs. human).

99 Assessing sample sizes in microarray experiments Assessment of sample sizes for microarray data is a tricky exercise. The reason for performing such an analysis is to get a general feeling for the ability of our experimental data to robustly detect differential expression. The method implemented in onechannelgui is that proposed by Warnes & Liu and implemented in the Bioconductor library ssize.

100 Assessing sample sizes in microarray experiments The key component of Warnes' method is the generation of a cumulative plot of the proportion of genes achieving a desired power as a function of sample size, based on simple gene-by-gene calculations. Its real utility is as a visual tool for helping users to understand the trade off between sample size and statistical power.

101 Assumptions A microarray experiment is set up to compare gene expressions between one treatment group and one control group. Microarray data has been normalized and transformed so that the data for each gene is sufficiently close to a normal distribution that a standard 2-sample pooled-variance t-test will reliably detect differentially expressed genes.

102 The tested hypothesis for each gene is: $H_0: \mu_t = \mu_c$ versus $H_1: \mu_t \neq \mu_c$, where $\mu_t$ and $\mu_c$ are means of gene expressions for treatment and control group respectively. The analysis is done using the common variance described in: Wei et al. BMC Genomics. 2004, 5:87

103 Sample size estimation The required sample size of an experiment depends on:
the variance component (σ),
the desired detectable fold change (δ),
the power to detect this change (1-β, the likelihood of detecting the change or the true positive rate),
a chosen type I error rate (α = false positive).
IMPORTANT: This implementation of ssize functions uses BH type I error correction instead of Bonferroni, which is the default in ssize functions. β = type II error rate, i.e. false negative.
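How these four quantities interact can be seen with a textbook normal approximation for a two-group comparison. The Python sketch below is not the ssize method: it ignores the t distribution and any multiple testing correction, and the function name and example values are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def samples_per_group(sigma, delta, alpha=0.05, power=0.8):
    """Approximate n per group to detect a mean difference delta (same scale as sigma)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided type I error rate
    z_beta = z.inv_cdf(power)           # power = 1 - beta
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# Hypothetical genes: the same detectable difference, two levels of variability.
n_low_variance = samples_per_group(sigma=0.5, delta=1.0)
n_high_variance = samples_per_group(sigma=1.0, delta=1.0)
print(n_low_variance, n_high_variance)
```

Doubling the standard deviation roughly quadruples the number of replicates needed per group, which is why highly variable genes dominate the cumulative power plots.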

104 This is not log2(fc)

105 To detect 95% of the differentially expressed genes, characterized by a power of 0.8, a sample size, FOR EACH GROUP, greater than 20 is needed.

106 To detect 97% of the differentially expressed genes, characterized by a power of 0.8, a fold change greater than 10 (log 2 (10)=3.32) is needed.

107 Assessing sample sizes in microarray experiments The R package, sizepower, is used to calculate sample size and power in the planning stage of a microarray study. It helps the user to determine how many samples are needed to achieve a specified power for a test of whether a gene is differentially expressed or, in reverse, to determine the power of a given sample size.

110 Comments about experimental design If the biological material is not a limiting factor, THINK WIDE: Experiments should be designed with many replicates (>3). Time course experiments should be designed with many time points (>4). Investigate part of the experiment by microarrays and use the rest for further validations.

111 Statistical validation Statistical validation can be performed using parametric and non-parametric tests. Parametric tests: the populations under analysis are assumed to be normally distributed. Non-parametric tests: there is no assumption on the distribution of the samples. Non-parametric tests are less sensitive than parametric tests.

112 Selecting differentially expressed genes Statistical validation method I Statistical validation method II Differential expression linked to a specific biological event. Statistical validation method III

113 Selecting differentially expressed genes Each method grasps some true signals but not all. Each method catches some false signals. The trick is to find the best condition to maximize true signals while minimizing fakes.

114 SAM: Significance Analysis of Microarrays

115 SAM analysis can be performed in Bioconductor using the siggenes library. Two class or multi class analysis is selected automatically according to the structure of the Target information. The delta table shows the user the number of differentially expressed genes at a given FDR.

116 The user selects a delta value and checks the behaviour of the differentially expressed genes.

117 The user selects a delta value and checks the behaviour of the differentially expressed genes.

118 Subsequently the user performs a log2(fold change) filter to produce a table of differentially expressed genes.

119 Subsequently the user performs a log2(fold change) filter to produce a table of differentially expressed genes.

120 The table can be saved in a tab delimited file

121 Relative difference in gene expression; raw p-value; fold change; standard deviation; significance measurement derived from raw p-value.

122 Limma Linear model analysis of microarrays

123 BH correction BH is the most used method for the correction of type I errors in microarray analysis. However, it has some limitations due to its initial hypotheses: the gene expressions are independent from each other, and the raw distribution of p-values should be uniform in the non significant range.
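The BH (Benjamini-Hochberg) adjustment itself is a short computation on the sorted raw p-values. The Python sketch below follows the standard step-up procedure; the input p-values are made up and the function name is an assumption of this example.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four probe sets.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = bh_adjust(raw)
print([round(p, 3) for p in adjusted])
```

Each adjusted value is the smallest FDR at which that probe set would be called differentially expressed, so selecting adjusted p-values below 0.05 controls the FDR at 5% under the hypotheses listed above.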

125 The application of BH correction to these p-values will not produce any differentially expressed gene!

126 Let's identify differentially expressed probe sets by linear modelling. To use linear models, the targets description and raw data will be reorganized on the basis of the number of factors under analysis by Compute Linear Model Fit.

127 The next step is the definition of the contrasts, which represent the differential expression pairs to be considered. If more than two conditions are available, more contrasts can be evaluated.

128 The contrast parameterization is saved with a specific name. REMEMBER: contrasts represent the different experimental groups (e.g. Treated, Control). Making Treated - Control means that the log2(expression) of control samples is subtracted from that of treated samples. The result is the log2(fold change).

129 Before evaluating differential expression, the raw p-value distribution is checked.

131 If BH correction can be applied to correct type I errors, we can move to the selection of the subset of differentially expressed genes.

133 These results can be saved in a new toptable containing only the probe sets shown in red on the plots.

134 TopTable structure: AffyID, Gene Symbol, Gene Description, Log2 FC, Average intensity, T statistics, P-values, Log-odds statistics.

135 Differential expression probe set lists generated by affylmgui or SAM can be compared using Venn diagrams. A max of three files can be compared. Attention: each file is made of a single column of probe set IDs without header. Comparison can be performed at probe set or EG level.

136 The various list subsets will be saved in your working directory.
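The subsets behind a three-way Venn diagram are plain set operations on the ID lists. The Python sketch below illustrates them; the probe set IDs and the region labels are made up, and it does not reproduce the files written by the interface.

```python
def venn3(a, b, c):
    """Return the seven regions of a three-set Venn diagram as sets of IDs."""
    a, b, c = set(a), set(b), set(c)
    return {
        "A only": a - b - c,
        "B only": b - a - c,
        "C only": c - a - b,
        "A and B only": (a & b) - c,
        "A and C only": (a & c) - b,
        "B and C only": (b & c) - a,
        "A and B and C": a & b & c,
    }

# Hypothetical probe set ID lists, e.g. read from three single-column files.
list_a = ["ps1", "ps2", "ps3", "ps4"]
list_b = ["ps2", "ps3", "ps5"]
list_c = ["ps3", "ps4", "ps6"]

regions = venn3(list_a, list_b, list_c)
for name, ids in regions.items():
    print(name, sorted(ids))
```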

137 Making a Template A for Ingenuity Pathways Analysis

138 If BH correction can be applied to correct type I errors, we can move to the selection of the subset of differentially expressed genes.

140 To create a template A you can use a function implemented in affylmgui.

142 The P value for subsetting is used to discriminate the differentially expressed probe sets from the other probe sets that are used for Ingenuity functional class enrichment.

143 Time Course experiments masigpro is an R package for the analysis of single and multiseries time course microarray experiments. masigpro follows a two-step regression strategy to find genes with significant temporal expression changes and significant differences between experimental groups.

144 Time course experimental design: We denote as experimental groups the experimental factors (dummy variables) for which temporal profiles are defined (e.g. Treatment A, Tissue1, etc.). Conditions are each experimental group vs. time combination (e.g. Treatment A at Time 0). Conditions may or may not have replicates. Variables are the regression variables defined by the masigpro approach for the experiment regression model. masigpro defines dummy variables to model differences between experimental groups. Dummy variables, Time and their interactions are the variables of the regression model.

145 Time Course design for masigpro IMPORTANT: each treatment at each time has its corresponding untreated control! All this information should be collapsed in the Target column of the targets file using _ to combine data. This can be done using the function JOIN in Excel.

146 Time Course design for masigpro

147 Time Course design for masigpro The targets file for masigpro has a peculiar structure: Each row of the column named Target describes the array on the basis of the experimental design. Each element describing the time course experiment is separated from the others by an underscore. The first three elements of the row are fixed and represent Time, Replicate, Control, all the other elements refer to various experimental conditions. In this case we have a 8, h time course, in triplicates with two different treatments: cond1 and cond2
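The layout of one Target entry can be illustrated by splitting it on the underscore. In the Python sketch below only the field order (Time, Replicate, Control, then one element per condition) comes from the slide; the 0/1 coding of Control and of the conditions, the example string and the function name are assumptions.

```python
def parse_target(target, condition_names):
    """Split one underscore-separated Target entry into named fields."""
    parts = target.split("_")
    if len(parts) != 3 + len(condition_names):
        raise ValueError("unexpected number of elements in: " + target)
    time, replicate, control = parts[:3]
    row = {"Time": float(time), "Replicate": int(replicate), "Control": int(control)}
    # Remaining elements: one flag per experimental condition (assumed 0/1 coding).
    row.update({name: int(v) for name, v in zip(condition_names, parts[3:])})
    return row

# Hypothetical entry: time 8, replicate group 2, not a control, treated with cond1.
row = parse_target("8_2_0_1_0", ["cond1", "cond2"])
print(row)
```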

148 The Target column is reformatted to be used by masigpro using the command

149 Large data set The onechannelgui interface has some limits (RAM) in loading/handling large sets of .cel files. This is especially true for a large time course experiment like our example. To overcome this problem, probe set average expression intensities are calculated by Expression Console.

150

151

152 When loading a tab delimited file, the Bioconductor annotation library is not automatically defined. Annotation library information can be attached using:

153 Do not forget! The multiple testing problem is also present in masigpro analysis. Therefore, before running masigpro, remember to apply some filtering based on functional information or on the distribution of the samples.

154

155 Once the experimental design for masigpro is ready, it is possible to run the analysis. When masigpro is running, check what is going on in the main R window!

156 Some parameters need to be set Q: The first step is to compute a regression fit for each gene. The p-values associated with the F-statistic of the model are computed and are subsequently used to select significant genes. masigpro corrects these p-values for multiple comparisons by applying false discovery rate (FDR) procedures. The level of FDR control is given by the function parameter Q.
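How an FDR cut-off such as Q selects genes can be sketched with the Benjamini-Hochberg procedure, one common FDR method. This is a minimal illustration, assuming Benjamini-Hochberg is the procedure in use. It is not the masigpro code, and the function names and p-values below are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def select_significant(pvalues, q=0.05):
    """Indices of the genes whose adjusted p-value is at most q."""
    adjusted = benjamini_hochberg(pvalues)
    return [i for i, p in enumerate(adjusted) if p <= q]


# Hypothetical per-gene p-values from the first regression step
gene_pvalues = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(gene_pvalues))
print(select_significant(gene_pvalues, q=0.03))
```

Lowering q makes the selection stricter, so fewer genes reach the second regression step.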

157 Some parameters need to be set Alpha: masigpro applies a variable selection procedure to find the significant variables for each gene. This will ultimately be used to find the profile differences between experimental groups. At each regression step the p-value of each variable is computed, and variables enter or leave the model when this p-value is lower or higher than the given cut-off value alpha.

158 Some parameters need to be set R-squared: The following step is to generate lists of significant genes according to the way we want to see the results. As a filter, masigpro uses the R-squared of the regression model.
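As a reminder of what the R-squared filter measures, the sketch below computes R-squared from observed and fitted expression values and keeps the genes above a threshold. It is an illustration only: the gene names, the values and the 0.7 threshold are hypothetical, and the code is not the masigpro implementation.

```python
def r_squared(observed, fitted):
    """Fraction of the variance of the observed values explained by the fit."""
    mean_observed = sum(observed) / len(observed)
    ss_total = sum((y - mean_observed) ** 2 for y in observed)
    ss_residual = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    if ss_total == 0:
        # A flat profile has no variance to explain
        return 0.0
    return 1.0 - ss_residual / ss_total


def filter_by_r_squared(genes, threshold):
    """Keep the names of the genes whose model fit reaches the threshold."""
    return [
        name
        for name, (observed, fitted) in genes.items()
        if r_squared(observed, fitted) >= threshold
    ]


# Hypothetical genes: (observed values, values fitted by the model)
genes = {
    "geneA": ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),  # perfect fit
    "geneB": ([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]),  # fit no better than the mean
}
print(filter_by_r_squared(genes, threshold=0.7))
```

A higher threshold keeps only the genes whose temporal profile is well described by the regression model.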

159 Information on the computation is available in the main R window.
Step 1: The procedure first fits this global model by the least-squares technique to identify differentially expressed genes, and selects significant genes by applying false discovery rate control procedures.
Step 2: Stepwise regression is then applied as a variable selection strategy to study differences between experimental groups and to find statistically significant different profiles.

160 When the computation is finished a message pops up. The coefficients obtained in the second regression model will be useful to cluster together significant genes with similar expression patterns and to visualize the results.

161 Results can be visualized as Venn diagrams or by plotting the curves in a PDF file. K-means clustering is not yet implemented.

162 Results can be visualized by plotting the curves in a PDF file. The plots refer only to the subset of genes specific to each treatment condition.

163

164 Analysis pipe-line: Quality control, Normalization, Filtering, Statistical analysis, Annotation, Biological Knowledge extraction

165 Gene Ontology

166 Ontologies An ontology is a specification of a conceptualization: a hierarchical mapping of concepts within a given frame of reference. An ontology is a restricted structured vocabulary of terms that represent domain knowledge. An ontology specifies a vocabulary that can be used to exchange queries and assertions. A commitment to the use of the ontology is an agreement to use the shared vocabulary in a consistent way.

167 The Gene Ontology The goal of the Gene Ontology (GO) Consortium is to produce a controlled vocabulary that can be applied to all organisms, even as knowledge of gene and protein roles in cells is accumulating and changing. For genes and gene products, the Gene Ontology Consortium (GO) is an initiative designed to address the problem of defining a common set of terms and descriptions for basic biological functions. GO provides a restricted vocabulary as well as clear indications of the relationships between terms.

168 The Gene Ontology The Gene Ontology (GO) consortium produces three independent ontologies for gene products:
molecular function (MF 7220): the biochemical activity or action of the gene product;
biological process (BP 9529): a biological objective to which the gene product contributes;
cellular component (CC 1536): a component of a cell that is part of some larger object or structure.

169 The Graph Structure of GO The GO ontologies are structured as directed acyclic graphs (DAGs) that represent a network in which each term may be a child of one or more parents. "GO node" is used interchangeably with "GO term". Child terms are more specific than their parents: the term "transmembrane receptor protein tyrosine kinase" is a child of both "transmembrane receptor" and "protein tyrosine kinase".

170 The Graph Structure of GO The relationship between a child and a parent can be characterized by two relations: "is a" and "has a" (part of). "mitotic chromosome" is a child of "chromosome" and the relationship is an "is a" relation. "telomere" is a child of "chromosome" with the "has a" (part of) relation.
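The parent/child structure described above can be sketched as a small directed acyclic graph in Python. The graph contains only the example terms mentioned in these slides. The dictionary layout, the relation labels `is_a` and `part_of`, and the function name are assumptions made for illustration, not the data model of any GO library.

```python
# Each child term maps to a list of (parent term, relation) pairs.
# Only the examples given in the slides are included.
GO_PARENTS = {
    "transmembrane receptor protein tyrosine kinase": [
        ("transmembrane receptor", "is_a"),
        ("protein tyrosine kinase", "is_a"),
    ],
    "mitotic chromosome": [("chromosome", "is_a")],
    "telomere": [("chromosome", "part_of")],
}


def ancestors(term, parents=GO_PARENTS):
    """Collect every term reachable by following parent links upward."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent, _relation in parents.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(ancestors("transmembrane receptor protein tyrosine kinase"))
print(ancestors("telomere"))
```

Because a term may have more than one parent, the traversal keeps a set of visited terms so that a shared ancestor is reported only once.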

171 GO structure Graph of GO relationships for the term: transcription factor (GO: ). The top node is marked in the figure.

172 Induced GO graph for a set of differentially expressed genes. The induced GO graph is colored according to the unadjusted hypergeometric p-value 0.01; the top node is marked in the figure. GO can be used to link differentially expressed genes to specific functional classes.

173 Hypergeometric Distribution

              a        c      | a+c
              b        d      | b+d
            ------------------
             a+b      c+d     | n

The probability of any particular matrix occurring by random selection, given no association between the two variables, is given by the hypergeometric rule, where n = a + b + c + d:

$$
p = \frac{\dfrac{(a+c)!}{a!\,c!}\;\dfrac{(b+d)!}{b!\,d!}}{\dfrac{n!}{(a+b)!\,(c+d)!}}
  = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{n!\,a!\,b!\,c!\,d!}
$$

174 Assigning Significance to the Findings The Hypergeometric Test permits us to determine if there are non-random associations between the two variables: differential expression membership and membership to a particular Gene Ontology term.

                      in subset    out of subset
    in GO term            8              2
    out of GO term        4             26

    p = 0.0002 (2x2 contingency matrix)
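The hypergeometric rule and the 2x2 example above can be checked with a short Python sketch that uses only the standard library. It follows the a, b, c, d layout of the previous slide (first row a, c; second row b, d). The function names are hypothetical, and treating the reported p as a one-sided tail (tables at least as enriched as the observed one) is an assumption.

```python
from math import comb


def table_probability(a, b, c, d):
    """Probability of one specific 2x2 table (rows: a c / b d)
    under the hypergeometric rule, with all margins fixed."""
    n = a + b + c + d
    return comb(a + c, a) * comb(b + d, b) / comb(n, a + b)


def enrichment_p_value(a, b, c, d):
    """One-sided p-value: sum the probabilities of the observed table
    and of all tables with a larger top-left cell, margins fixed."""
    row1 = a + c  # genes in the GO term
    col1 = a + b  # genes in the differentially expressed subset
    total = 0.0
    for k in range(a, min(row1, col1) + 1):
        total += table_probability(k, col1 - k, row1 - k, d - a + k)
    return total


# Table from the slide: 8 and 2 in the GO term, 4 and 26 outside it
p = enrichment_p_value(a=8, b=4, c=2, d=26)
print(f"{p:.4f}")  # prints 0.0002
```

With these counts, 8 of the 12 genes in the subset fall in a GO term that contains only 10 of the 40 genes, which is why the probability of seeing this by chance is so small.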

175 GOstats package To perform an analysis using the Hypergeometric-based test, one needs to define a gene universe and a list of selected genes from the universe. To identify the set of expressed genes from a microarray experiment, R. Gentleman (GOstats developer) proposed that a non-specific filter be applied and that the genes that pass the filter be used to form the universe for any subsequent functional analyses.

176 Bioconductor provides a library called GOstats, which allows the calculation of enriched GO classes within a set of differentially expressed probe sets. Select the threshold of significance and the GO class of interest. Select the list of affy ids representing the differentially expressed probe sets. REMEMBER: the file should contain only the affy ids!

177 If the names of the GO classes are too tiny in the plot, save it as a PDF and view it with Acrobat Reader, zooming in on the figure.

178 The reason for this representation is the selection of the GO terms that contain smaller subsets.

179 Columns of the results table: GO identifier; significance; N. of genes in the differentially expressed set; N. of genes belonging to the GO term in the universe; description of the GO term.

180 To know more about the parents of a specific GO term you can use the plotgo function.

181 It is possible to identify the affy ids associated with a specific GO term.

182

183 Classification

184 Classification The task of diagnosing cancer on the basis of microarray data has been termed class prediction in the literature. The task is to classify and predict the diagnostic category of a sample on the basis of its gene expression profile.

185 Large data sets can be loaded as tab delimited files. To load them you need: 1) a tab delimited file with array names in the first row and probe set ids in the first column; 2) a target file containing the clinical information. The usual Target column of the target file should have these characteristics.

186 This file can be generated by joining the columns of the clinical parameters with an underscore _. Join function in Excel.

187

188

189 Reorganize clinical information Load a large data set as a tab delimited file. Save in a file the description of the clinical parameters collapsed in the Target column of the targets file.

190 Reorganize clinical information

191 Run PAMR analysis

192

193

194 If the selected probe sets are fewer than 50

195

196 Yes

197 A nice separation between ER positive and ER negative samples can also be achieved on the test set.
