Basic aspects of Microarray Data Analysis


Hospital Universitari Vall d'Hebron, Institut de Recerca (VHIR). Institut d'Investigació Sanitària de l'Instituto de Salud Carlos III (ISCIII). Basic aspects of Microarray Data Analysis. Expression Data Analysis Course. Ricardo Gonzalo Sanz, ricardo.gonzalo@vhir.org. 13/11/13

1 Introduction. 2 Software Installation. OneChannel GUI. 3 Quality control. 4 Normalization. 5 Filtering. 6 Statistical inference of differential expression. 7 Clustering. 8 Annotation. 9 Biological interpretation. Extracted from Raffaele Calogero's course slides.

1 Introduction. Before beginning the analysis: any analysis of microarray data is useless if there is no clear biological question to investigate, or if the biological experiments are not carefully designed (EXPERIMENTAL DESIGN) to minimize sources of error such as human intervention, reagent lots, equipment, etc.

1 Introduction. Experimental Design: Experiments should be designed with several replicates (>3). Time-course experiments should be designed with many time points (>4). Analyze part of the experiment by microarray and keep the rest for further validation. Discuss the experiment with the statistician/bioinformatician involved in the data analysis.

1 Introduction. Experimental Design: Experiments involving various samples and conditions need to be carefully designed to avoid unwanted effects. (Figure: two alternative allocations of samples C1, C2, T1 and T2 to processing days Day 1 and Day 2.)

1 Introduction. To pool or not to pool? The basic assumption underlying sample pooling is biological averaging: the expression from a pooled sample averages out the expression from the individual contributing samples. Bioinformatics 2004, 20:3318

1 Introduction. To pool or not to pool? It is impossible to associate the gene expression of a pooled sample with the phenotypic information of each individual, which makes certain statistical inferences or predictions for individuals unfeasible. Conclusions: researchers have to be cautious when designing a pooled experiment. Pooling of samples is recommended when there is not enough RNA from each individual sample to run an array.

1 Introduction. Analysis workflow: Biological question; Experimental design; Microarray experiment; Image analysis; QC (FAILED or PASS); Normalization; Analysis (Estimation, Testing, Clustering, Discrimination); Biological verification and interpretation.

2 Software Installation. OneChannel GUI. oneChannelGUI is a graphical interface to Bioconductor libraries devoted to the analysis of data from single-channel platforms. It can analyze 3' IVT, Exon and Gene arrays, and it can also analyze RNA-seq data.

2 Software Installation. OneChannel GUI. Open R. On the USB stick you will find a folder called onechannelgui; copy it to a known location on your computer. Open the script inside the copied onechannelgui folder, place the cursor on the first line, and press Control+R to run it line by line. Then wait: the installation takes a long time.


3 Quality control. Was the experiment a success? Microarray experiments generate huge quantities of data, and it is hard to decide whether everything is all right just by looking at the numbers. The standard statistical approach is to use plots to check quality: they show all the data together, highlight structure, and may help to detect problems (unusual patterns).

3 Quality control. Diagnostics plots for microarrays: Microarray data are usually considered at two levels. 1. Low level: data coming directly from the scanner. 2. High level: data processed from the low-level data, i.e. expression values (normalized or not) or a fitted probe-level model (PLM). Some plots are specific to certain array types or to one level, so any such classification may be misleading.

3 Quality control. Diagnostics plots for microarrays: Low level: layout image; degradation plots (only for 3' IVT arrays); histogram/density plots; PCA; boxplot. High level: MA plots; model-based plots (NUSE, RLE, ...); PCA; boxplot.

3 Quality control. Diagnostics plots for microarrays. Low level. Layout image.

3 Quality control. Diagnostics plots for microarrays. Low level. RNA degradation plot.

3 Quality control. Diagnostics plots for microarrays. Low level. Histogram/density plot.

3 Quality control. Diagnostics plots for microarrays. Low level. Boxplot.

3 Quality control. Diagnostics plots for microarrays. Low level. PCA. Principal component analysis involves a mathematical procedure that transforms a number of correlated variables into a (smaller) number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible. Each succeeding component accounts for as much of the remaining variability as possible. The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.
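The idea of the first principal component can be sketched in a few lines. This is a minimal illustration in plain Python, assuming each row is one array (sample) and each column one variable; it uses power iteration on the covariance matrix, which is a convenient stand-in and not necessarily what any Bioconductor package does internally. The function name and the toy data are hypothetical.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, explained_fraction) of PC1.

    rows: list of samples, each a list of p numeric variables.
    Assumes the data are not constant (non-zero total variance).
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [float(j + 1) for j in range(p)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    return v, eigenvalue / total_variance

# Toy data: the second variable is roughly twice the first.
samples = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8]]
direction, explained = first_principal_component(samples)
print(direction, explained)
```

Because the two toy variables are strongly correlated, almost all of the variability falls on the first component, which is the situation the slide describes.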


3 Quality control. Diagnostics plots for microarrays. High level. RLE (Relative Log Expression). RLE values are computed for each probe set by comparing its expression value on each array against its median expression value across all arrays. Assuming that most genes do not change in expression across arrays, most of these RLE values should ideally be near 0. Boxplots of these values, one per array, provide a quality assessment tool.
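The RLE computation can be sketched as follows. This assumes the expression values are already on the log2 scale, so that "comparing against the median" is a subtraction; the function names and the toy values are hypothetical.

```python
from statistics import median

def rle(expr):
    """expr: dict probe_set -> list of log2 expression values, one per array."""
    return {ps: [v - median(vals) for v in vals] for ps, vals in expr.items()}

def rle_median_per_array(expr):
    """Median RLE of each array; values far from 0 flag a suspicious array."""
    values = rle(expr)
    n_arrays = len(next(iter(values.values())))
    return [median(v[a] for v in values.values()) for a in range(n_arrays)]

# Toy data: three probe sets on three arrays; the third array reads high.
expr = {
    "ps1": [8.0, 8.1, 9.0],
    "ps2": [5.0, 5.2, 6.1],
    "ps3": [10.0, 9.9, 11.0],
}
print(rle_median_per_array(expr))
```

In a real data set the per-array RLE values would be drawn as boxplots rather than reduced to a median, but the median already shows which array is shifted.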


3 Quality control. Diagnostics plots for microarrays. High level. NUSE (Normalized Unscaled Standard Error). NUSE values can also be used for assessing quality. The standard error estimates obtained for each gene on each array from fitPLM are standardized across arrays so that the median standard error for each gene is 1 across all arrays. This accounts for differences in variability between genes. An array with elevated standard errors relative to the other arrays is typically of lower quality. Boxplots of these values, separated by array, can be used to compare arrays.
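The standardization step can be sketched like this, assuming the per-gene standard errors have already been obtained from a probe-level model fit. The input dictionary and its values are made up for illustration.

```python
from statistics import median

def nuse(se):
    """se: dict gene -> list of standard errors, one per array.

    Each gene's standard errors are divided by that gene's median,
    so the median NUSE of every gene is 1.
    """
    return {g: [s / median(vals) for s in vals] for g, vals in se.items()}

# Toy data: the third array has inflated standard errors for both genes.
se = {"g1": [0.1, 0.1, 0.3], "g2": [0.2, 0.2, 0.5]}
scores = nuse(se)
per_array_median = [median(v[a] for v in scores.values()) for a in range(3)]
print(per_array_median)
```

An array whose NUSE values sit clearly above 1, like the third one here, is the kind of array the boxplots are meant to expose.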


3 Quality control. Diagnostics plots for microarrays. High level. MA plots. MA plots allow pairwise comparison of the log-intensity of each array with a reference array and the identification of intensity-dependent biases. The Y axis shows the log-ratio of the intensity of one array to the reference (median) array, called 'M', while the X axis shows the average log-intensity of both arrays, called 'A'. Normalization is expected to correct intensity-dependent biases: plotting these graphs before and after normalization allows checking the efficiency of the correction. Most probe levels are not expected to differ much, so we expect an MA plot centered on the line M = 0 from low to high intensities.
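The M and A values behind such a plot can be computed as below. The sketch assumes all intensities are already log2-transformed and builds the reference as the probe-wise median across arrays; function names and numbers are illustrative only.

```python
from statistics import median

def median_reference(arrays):
    """Probe-wise median across arrays (each array is a list of log2 values)."""
    return [median(column) for column in zip(*arrays)]

def ma_values(array, reference):
    """M = log-ratio (difference of logs), A = average log-intensity."""
    m = [x - r for x, r in zip(array, reference)]
    a = [(x + r) / 2 for x, r in zip(array, reference)]
    return m, a

arrays = [[8, 10, 12], [9, 10, 11], [10, 13, 12]]
reference = median_reference(arrays)
m, a = ma_values(arrays[0], reference)
print(reference, m, a)
```

Plotting m against a for each array, before and after normalization, gives the diagnostic described on the slide.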



4 Normalization. Why normalization? To remove systematic biases: sample preparation, variability in hybridization, scanner settings, experimenter bias. To achieve a measurement scale that has the same origin for all spots, uses the same unit for all arrays, and is linearly related to the amount of RNA. Why not normalization? To cure poor data.

4 Normalization. General steps: background correction (correcting the scale origin for spots); normalization (standardizing the scale unit, i.e. rescaling); probe-level intensity calculation; summarization of the information from several spots into a single measure for each gene.

4 Normalization. There are different methods. The RMA methodology (Irizarry et al., 2003) performs background correction, normalization, and summarization in a modular way. RMA does not take non-specific probe hybridization into account in the probe set background calculation. GCRMA is a version of RMA with a background correction component that makes use of probe sequence information (Wu et al., 2004). The PLIER (Probe Logarithmic Intensity Error) method produces an improved signal by accounting for experimentally observed patterns in probe behavior and by handling error appropriately at low and high signal values.
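To make the normalization step concrete, here is a minimal sketch of quantile normalization, the between-array normalization commonly associated with RMA. It is a simplified illustration: ties are broken by position rather than averaged, and no background correction or summarization is attempted. The function name is hypothetical.

```python
def quantile_normalize(arrays):
    """arrays: list of equal-length lists of intensities, one list per array.

    Every array is forced to share the same distribution: the mean of the
    k-th smallest values across arrays replaces each array's k-th smallest.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized

print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
```

After this step each array contains exactly the same set of values, only assigned to different probes, which is what makes the arrays directly comparable.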


5 Filtering. In a microarray experiment only a few hundred or a few thousand genes change their expression because of the different conditions. Genes that do not change introduce noise, so it is better to exclude them before the statistical analysis. The researcher is interested in keeping the number of tests/genes as low as possible while keeping the interesting genes in the selected subset. If the truly differentially expressed genes are overrepresented among those selected in the filtering step, the FDR associated with a certain threshold of the test statistic will be lowered by the filtering.

5 Filtering. There are different types of filtering. Annotation features (specific): specific gene features (e.g. GO term, presence of transcriptional regulatory elements in promoters, etc.). Signal features (non-specific): percentage of intensities greater than a user-defined value; interquartile range (IQR) greater than a defined value.

5 Filtering. Annotation filtering. In transcriptional studies focusing on genes characterized by a specific feature (e.g. transcription factor elements in promoters), the best filtering approach is to select only those genes linked to that feature. For example, identification of genes modulated by estradiol:ER or IGF1 by direct binding to Estrogen-Responsive Elements (ERE): HGU133plus2: 54675 probe sets, 19951 Entrez Genes. HGU133plus2 with ERE in putative promoter regions: 6764 probe sets, 3058 Entrez Genes.

5 Filtering. Annotation filtering. How? Data derived from a dedicated annotation data set can be used for functional filtering. The Ingenuity Pathways Knowledge Base is the world's largest curated database of biological networks, created from millions of individually modeled relationships. The Ingenuity Pathways Analysis software (IPA) identifies relations between genes.

5 Filtering. Signal filtering. This technique is based on removing genes that are deemed not expressed or unchanged according to a specific criterion under the control of the user. The aim of non-specific filtering is to remove genes that, e.g. because of their low overall intensity or variability, are unlikely to carry information about the phenotypes under investigation.
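A non-specific filter based on the interquartile range can be sketched as follows. The threshold and the toy profiles are assumptions chosen for illustration, and the quartiles come from Python's statistics.quantiles with its default method, which may differ slightly from the quartile definition used by other software.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range of one probe set across arrays."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

def filter_by_iqr(expr, min_iqr):
    """Keep only the probe sets whose IQR across arrays exceeds min_iqr."""
    return {ps: v for ps, v in expr.items() if iqr(v) > min_iqr}

expr = {
    "flat": [7.0, 7.1, 7.0, 7.1, 7.0, 7.1],      # barely varies
    "variable": [5.0, 9.0, 5.2, 8.8, 5.1, 9.1],  # changes between arrays
}
kept = filter_by_iqr(expr, min_iqr=0.5)
print(list(kept))
```

Note that the filter never looks at the sample labels, which is what makes it non-specific.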

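5 Filtering. Signal filtering. Example (figure): before filtering, 22300 probe sets, 42/42 spike-ins retained, enrichment 100%; after filtering, 5553 probe sets, 42/42 spike-ins retained, enrichment 401%.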
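One plausible reading of the enrichment figure is the ratio between the fraction of spike-ins after filtering and the fraction before filtering. The sketch below works under that assumption; the function name is hypothetical and the slide does not state the formula explicitly.

```python
def spike_in_enrichment(spikes_kept, total_kept, spikes_all, total_all):
    """Ratio of the spike-in fraction after filtering to the fraction before."""
    return (spikes_kept / total_kept) / (spikes_all / total_all)

ratio = spike_in_enrichment(42, 5553, 42, 22300)
print(f"enrichment: {ratio * 100:.0f}%")
```

With all 42 spike-ins retained and roughly a quarter of the probe sets kept, the spike-ins become about four times more frequent in the filtered set, consistent with the roughly 400% shown on the slide (the small difference suggests 22300 is a rounded count).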


6 Statistical inference of differential expression. Class comparison problem: identify genes whose expression is significantly associated with different conditions, such as treatment or cell type (qualitative variables) or dose or time (quantitative variables). Estimate effects/differences between groups, usually using log-ratios, i.e. the difference on the log scale: log(x) - log(y) = log(x/y).
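The log-ratio identity is easy to check numerically. A small sketch, assuming base 2 (the usual choice for expression data) and made-up intensities:

```python
import math

def log2_fold_change(x, y):
    """Difference on the log2 scale, equal to log2(x / y)."""
    return math.log2(x) - math.log2(y)

treated, control = 800.0, 200.0
print(log2_fold_change(treated, control), math.log2(treated / control))
```

A log2 fold change of 2 corresponds to a fourfold increase, and a value of -2 to a fourfold decrease, so up- and down-regulation are treated symmetrically.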

6 Statistical inference of differential expression. But what is a significant change? It depends on the variability within groups, which may differ from gene to gene. Fold change alone is not sufficient to indicate the significance of expression changes; it has to be supported by statistical information. To assess the statistical significance of differences, conduct a statistical test for each gene.

6 Statistical inference of differential expression. Which situations can we find? Indirect comparisons: two groups, two samples, unpaired. E.g. 10 individuals: 5 with diabetes, 5 healthy; one sample from each individual. Typically: two-sample t-test or similar. Direct comparisons: two groups, two samples, paired. E.g. 6 individuals with brain stroke; two samples from each, one from a healthy region (region 1) and one from an affected region (region 2). Typically: one-sample t-test on the individual differences between conditions (also called paired t-test) or similar.
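The difference between the two situations shows up directly in the test statistic. The sketch below computes only the t statistics (no p-values) in plain Python; for the unpaired case it uses the Welch form, which is one common variant of the two-sample t-test and is chosen here only for illustration. The data are invented.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Unpaired two-sample t statistic (Welch form, unequal variances)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se

def paired_t(x, y):
    """Paired t statistic: a one-sample t on the within-individual differences."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / math.sqrt(variance(d) / len(d))

# Toy log2 expression of one gene in three individuals, two regions each.
affected = [5.1, 6.0, 7.2]
healthy = [4.1, 5.0, 6.1]
print(welch_t(affected, healthy), paired_t(affected, healthy))
```

The individuals differ a lot from each other but the within-individual difference is very stable, so the paired statistic is far larger than the unpaired one for the same numbers.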

6 Statistical inference of differential expression. Some issues in gene selection. Gene expression values have peculiarities that have to be dealt with. Some are related to small sample sizes: variance instability (very low variances produce very high t statistics) and non-normality of the data. Others are related to the large number of variables: multiple testing. The standard t-test is not strictly appropriate here; it is better to use a moderated t-test.

6 Statistical inference of differential expression. To know whether a gene is differentially expressed, we need to assign a p-value to each contrast: genes with p-values below a prescribed level may be regarded as significant. But what happens when you repeat the same test thousands of times? Consider more than one test at once (assuming independent tests). Two tests, each at the 5% level: the probability of getting at least one false positive is 1 - 0.95 * 0.95 = 0.0975. Three tests: 1 - 0.95^3 = 0.1426. n tests: 1 - 0.95^n, which converges towards 1 as n increases. MULTIPLE TESTING PROBLEM.
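The growth of the false-positive probability can be tabulated with a one-line function. The sketch assumes independent tests, as in the figures above; the function name is hypothetical.

```python
def familywise_error(alpha, n_tests):
    """Probability of at least one false positive among n independent tests."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 2, 3, 10, 100, 1000):
    print(n, round(familywise_error(0.05, n), 4))
```

With the tens of thousands of probe sets on a typical array, at least one false positive at the 5% level is practically certain.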

6 Statistical inference of differential expression. MULTIPLE TESTING PROBLEM. The type I error (false positives) must be controlled. FDR: controls the expected proportion of false positives among the genes called significant. If you can tolerate more false positives, you will get fewer false negatives. No information is lost.
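One widely used way to control the FDR is the Benjamini-Hochberg procedure. The slides do not name a specific procedure, so the choice here is an assumption made for illustration; the sketch returns adjusted p-values in the original gene order.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Genes whose adjusted p-value falls below the chosen FDR level (for example 0.05) are then called significant.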

6 Statistical inference of differential expression. After the statistical analysis, a nice (or not) top table is obtained, with columns such as: AffyID, Gene Symbol, Gene Description, Log2 FC, Average intensity, t statistic, P-value, Log-odds statistic.

6 Statistical inference of differential expression. Visualization of the statistical inference: Venn diagrams and volcano plots.

7 Clustering. Types: Supervised clustering tries to find the best partition for data that belong to a known set of classes. Unsupervised clustering tries to define the number and the size of the classes into which the transcription profiles can be fitted.

7 Clustering. Distances: The ability to calculate a distance (or similarity, its inverse) between two expression vectors is fundamental to clustering algorithms. Distance between vectors is the basis on which decisions are made when grouping similar patterns of expression. There are different types of distance: Euclidean distance, Manhattan distance, Mahalanobis distance. Different distances can produce different sample groupings.
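That the choice of distance can change the grouping is easy to see on a toy example. The sketch below implements the Euclidean and Manhattan distances only (Mahalanobis needs a covariance estimate and is left out); the vectors are invented.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

query = [0.0, 0.0]
profile_a = [3.0, 3.0]
profile_b = [0.0, 5.0]

for name, dist in (("euclidean", euclidean), ("manhattan", manhattan)):
    nearest = min(("a", profile_a), ("b", profile_b),
                  key=lambda item: dist(query, item[1]))
    print(name, "-> nearest profile:", nearest[0])
```

Under the Euclidean distance profile a is closer to the query, while under the Manhattan distance profile b is closer, so a clustering built on one or the other could group the samples differently.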

7 Clustering. Hierarchical Clustering (HCL). HCL is an agglomerative or divisive clustering method. The iterative process continues until all groups are connected in a hierarchical tree. Samples that are more similar to each other are placed closer together in the tree.
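The agglomerative variant can be sketched in a few lines. This is a deliberately naive illustration, assuming Euclidean distance and average linkage (both are choices made here, not stated on the slide); it records the merges rather than drawing the tree, and the sample names are hypothetical.

```python
import math
from itertools import combinations

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def agglomerate(profiles):
    """profiles: dict sample name -> expression vector.

    Repeatedly merges the two closest clusters until one remains.
    Returns the list of merges as (cluster_1, cluster_2, distance).
    """
    def linkage(c1, c2):
        # Average linkage: mean pairwise distance between cluster members.
        total = sum(euclidean(profiles[a], profiles[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((c1, c2, linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

profiles = {"s1": [1.0, 1.0], "s2": [1.2, 0.9], "s3": [5.0, 5.0], "s4": [5.3, 4.8]}
for c1, c2, d in agglomerate(profiles):
    print(c1, "+", c2, "at distance", round(d, 3))
```

The merge distances give the heights of the branches in the tree: the most similar samples join first, and the two well-separated groups join last.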

7 Clustering. Heatmaps. They allow quick visualization of the expression patterns that may exist among samples.

8 Annotation. Relation between probe sets and genes: An important issue in microarray data analysis is the specific association of probe identifiers with annotated genome transcripts. A critical point in annotation is the way in which the association between probes and genes is produced. For Affymetrix arrays, NetAffx (from the Affymetrix web page) is usually used.

9 Biological interpretation. The goal of the Gene Ontology (GO) Consortium is to produce a controlled vocabulary that can be applied to all organisms, even as knowledge of gene and protein roles in cells is accumulating and changing. http://www.geneontology.org/ For genes and gene products, GO is an initiative designed to address the problem of defining a common set of terms and descriptions for basic biological functions. GO provides a restricted vocabulary as well as clear indications of the relationships between terms.

9 Biological interpretation. GENE ONTOLOGY. The Gene Ontology (GO) Consortium produces three independent ontologies for gene products: molecular function, defined as the biochemical activity or action of the gene product (MF 7220); biological process, interpreted as a biological objective to which the gene product contributes (BP 9529); cellular component, a component of a cell that is part of some larger object or structure (CC 1536).

9 Biological interpretation. The Graph Structure of GO. The GO ontologies are structured as directed acyclic graphs (DAGs) that represent a network in which each term may be a child of one or more parents. "GO node" is interchangeable with "GO term". Child terms are more specific than their parents: the term "transmembrane receptor protein tyrosine kinase" is a child of both "transmembrane receptor" and "protein tyrosine kinase".
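Because a term can have several parents, collecting all the more general terms above it means walking a graph, not a simple chain. The sketch below uses the three terms from the example; the edges up to "molecular function" are a toy simplification added for illustration and do not reproduce the real GO graph.

```python
def ancestors(term, parents):
    """All terms reachable by following child -> parent edges in a DAG."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# child -> list of parents (toy graph; upper edges are illustrative only)
parents = {
    "transmembrane receptor protein tyrosine kinase": [
        "transmembrane receptor",
        "protein tyrosine kinase",
    ],
    "transmembrane receptor": ["molecular function"],
    "protein tyrosine kinase": ["molecular function"],
}
print(sorted(ancestors("transmembrane receptor protein tyrosine kinase", parents)))
```

A gene annotated with the child term is implicitly annotated with every ancestor, and the shared root is reached through both parents but counted only once.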

9 Biological interpretation. GO structure Graph of GO relationships for the term: transcription factor (GO:0003700)