Analysis of Microarray Data

Similar documents
Analysis of Microarray Data

Outline. Analysis of Microarray Data. Most important design question. General experimental issues

Bioinformatics for Biologists

Introduction to gene expression microarray data analysis

Normalization. Getting the numbers comparable. DNA Microarray Bioinformatics - #27612

Expression summarization

Analysis of Microarray Data

Analysis of Microarray Data

Image Analysis. Based on Information from Terry Speed s Group, UC Berkeley. Lecture 3 Pre-Processing of Affymetrix Arrays. Affymetrix Terminology

Introduction to Bioinformatics! Giri Narasimhan. ECS 254; Phone: x3748

Microarray Data Analysis. Normalization

Background and Normalization:

Humboldt Universität zu Berlin. Grundlagen der Bioinformatik SS Microarrays. Lecture

Background Correction and Normalization. Lecture 3 Computational and Statistical Aspects of Microarray Analysis June 21, 2005 Bressanone, Italy

2007/04/21.

A Distribution Free Summarization Method for Affymetrix GeneChip Arrays

Bioinformatics for Biologists

Preprocessing Methods for Two-Color Microarray Data

Microarray Data Analysis Workshop. Preprocessing and normalization A trailer show of the rest of the microarray world.

Probe-Level Data Normalisation: RMA and GC-RMA Sam Robson Images courtesy of Neil Ward, European Application Engineer, Agilent Technologies.

The essentials of microarray data analysis

Pre processing and quality control of microarray data

Affymetrix GeneChip Arrays. Lecture 3 (continued) Computational and Statistical Aspects of Microarray Analysis June 21, 2005 Bressanone, Italy

Preprocessing Affymetrix GeneChip Data. Affymetrix GeneChip Design. Terminology TGTGATGGTGGGGAATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT

Integrative Genomics 1a. Introduction

Technical Note. GeneChip 3 IVT PLUS Reagent Kit vs. GeneChip 3 IVT Express Reagent Kit Comparison. Introduction:

A Microarray Analysis Teaching Module. for Hamilton College. July 2008 Megan Cole Post-doctoral Associate Whitehead Institute, MIT

Measuring gene expression (Microarrays) Ulf Leser

DNA Microarray Data Oligonucleotide Arrays

Introduction to Microarray Technique, Data Analysis, Databases Maryam Abedi PhD student of Medical Genetics

6. GENE EXPRESSION ANALYSIS MICROARRAYS

Introduction to Bioinformatics and Gene Expression Technology

Standard Data Analysis Report Agilent Gene Expression Service

Exploration, Normalization, Summaries, and Software for Affymetrix Probe Level Data

Measuring and Understanding Gene Expression

Introduction to Genome Wide Association Studies 2014 Sydney Brenner Institute for Molecular Bioscience/Wits Bioinformatics Shaun Aron

Identification of biological themes in microarray data from a mouse heart development time series using GeneSifter

Deakin Research Online

AFFYMETRIX c Technology and Preprocessing Methods

Gene Expression Data Analysis

Exercise on Microarray data analysis

Rafael A Irizarry, Department of Biostatistics JHU

CS-E5870 High-Throughput Bioinformatics Microarray data analysis

3. Microarrays: experimental design, statistical analysis and gene clustering

A Parallel Approach to Microarray Preprocessing and Analysis

PLM Extensions. B. M. Bolstad. October 30, 2013

Exploration and Analysis of DNA Microarray Data

affy: Built-in Processing Methods

A note on oligonucleotide expression values not being normally distributed

Microarray analysis challenges.

From CEL files to lists of interesting genes. Rafael A. Irizarry Department of Biostatistics Johns Hopkins University

Gene expression analysis: Introduction to microarrays

Lecture 2: March 8, 2007

Pre-processing DNA Microarray Data

Microarray Informatics

Bioinformatics for Biologists

From hybridization theory to microarray data analysis: performance evaluation

Mixed effects model for assessing RNA degradation in Affymetrix GeneChip experiments

FEATURE-LEVEL EXPLORATION OF THE CHOE ET AL. AFFYMETRIX GENECHIP CONTROL DATASET

ALLEN Human Brain Atlas

Gene expression analysis. Biosciences 741: Genomics Fall, 2013 Week 5. Gene expression analysis

SIMS2003. Instructors:Rus Yukhananov, Alex Loguinov BWH, Harvard Medical School. Introduction to Microarray Technology.

Microarray Informatics

Analysis of a Proposed Universal Fingerprint Microarray

Outline. Array platform considerations: Comparison between the technologies available in microarrays

Parameter Estimation for the Exponential-Normal Convolution Model

New Statistical Algorithms for Monitoring Gene Expression on GeneChip Probe Arrays

10.1 The Central Dogma of Biology and gene expression

Computational Biology I

Preprocessing of Microarray data I: Normalization and Missing Values. For example, a list of possible sources in spotted arrays:

EECS730: Introduction to Bioinformatics

ADVANCED STATISTICAL METHODS FOR GENE EXPRESSION DATA

Microarray pipeline & Pre-processing

Introduction to Bioinformatics and Gene Expression Technologies

Introduction to Bioinformatics and Gene Expression Technologies

Measuring gene expression

Gene Signal Estimates from Exon Arrays

Bioinformatics. Microarrays: designing chips, clustering methods. Fran Lewitter, Ph.D. Head, Biocomputing Whitehead Institute

Pre-processing DNA Microarray Data

RNA Degradation and NUSE Plots. Austin Bowles STAT 5570/6570 April 22, 2011

Oligonucleotide microarray data are not normally distributed

Introduction to Microarray Analysis

Introduction to Genome Wide Association Studies 2015 Sydney Brenner Institute for Molecular Bioscience Shaun Aron

Designing a Complex-Omics Experiments. Xiangqin Cui. Section on Statistical Genetics Department of Biostatistics University of Alabama at Birmingham

Gene expression: Microarray data analysis. Copyright notice. Outline: microarray data analysis. Schedule

Microarray probe expression measures, data normalization and statistical validation

CAP BIOINFORMATICS Su-Shing Chen CISE. 10/5/2005 Su-Shing Chen, CISE 1

Determining Method of Action in Drug Discovery Using Affymetrix Microarray Data

Lecture #1. Introduction to microarray technology

DNA Chip Technology Benedikt Brors Dept. Intelligent Bioinformatics Systems German Cancer Research Center

The Affymetrix platform for gene expression analysis Affymetrix recommended QA procedures The RMA model for probe intensity data Application of the

A normalization method based on variance and median adjustment for massive mrna polyadenylation data

DETERMINING SIGNIFICANT FOLD DIFFERENCES IN GENE EXPRESSION ANALYSIS

Analyzing DNA Microarray Data Using Bioconductor

Exam 1 from a Past Semester

Iterated Conditional Modes for Cross-Hybridization Compensation in DNA Microarray Data

Rough Draft Not For Distribution. Quality Control Metrics for Array Comparative Genomic Hybridization Data

Some Principles for the Design and Analysis of Experiments using Gene Expression Arrays and Other High-Throughput Assay Methods

What we ll do today. Types of stem cells. Do engineered ips and ES cells have. What genes are special in stem cells?

Do engineered ips and ES cells have similar molecular signatures?

Transcription:

Analysis of Microarray Data Lecture 1: Experimental Design and Data Normalization George Bell, Ph.D. Senior Bioinformatics Scientist Bioinformatics and Research Computing Whitehead Institute

Outline Introduction to microarrays Experimental design Data normalization Other data transformation Exercises 2

Expression microarrays: Underlying assumption and concepts Measuring relative changes in levels of specific mrnas provide information about what s going on in the cells from which the mrna came. Samples provide info about Genes A gene expression profile is a molecular phenotype of a cell in a specific state 3

Experimental design: Most important question Why are you doing this experiment? (Be as specific as possible.) To learn something interesting about my cells is usually not the best answer. 4

Common partial experimental objectives Comparison: Discovery: Prediction: identify differentially expressed genes identify clusters of genes or samples use a gene expression profile to label a cell sample 5

General experimental issues What is the best source of mrna? Reduce variables as much as possible Avoid confounding by randomizing remaining variables Collect comprehensive information about all potential variables Make no more assumptions than necessary Does a factor influence your measurements? Collect the data and find out with ANOVA. 6

Comparisons Virtually all array analysis depends on a comparison between samples (on 3+ chips) Expression is usually described in relative terms What comparison(s) do you plan to make? Research in progress: How can one measure absolute (molar) expression levels? Spike-in controls? 7

Replication cells RNA *Biological replicates: use different cell cultures prepared in parallel Technical replicates: use one cell culture, first processed and then split just before hybridization Sample replicates: use one cell culture, first split and then processed * most informative 8

How many replicates? Replication is needed to have confidence about your results. To determine the optimal number using statistics, How large an effect do you want to identify? How confident do you want to be of your conclusions? How variable is gene expression in your system? Perform a test of statistical power (such as power.t.test in R) Most common practical answer: More than you ve planned If microarray analysis is followed by further confirmation, a high error rate may be tolerated (and may be more efficient) 9

Designs for 2-color arrays Given two replicates of samples A and B, Reference design A1-R A2-R B1-R B2-R Balanced block design A1-B1 B2-A2 Loop design A1-B1 B1-A2 A2-B2 B2-A1 10

What design to use? Best design depends on objective(s) of experiment What comparisons are most important? Some guidelines: Balanced block is most efficient for 2-way comparison Reference design is often best when making lots of different comparisons Loop design is not very robust 11

Spike-in controls How can you confirm that your experiment and analysis was done correctly? Control mrna added before hybridization (or RNA extraction) can help with quality control Some chip manufacturers recommend a control mix of exogenous mrna External RNA Control Consortium (ERCC): determining optimal control mix to evaluate "reproducibility, sensitivity, and robustness in gene expression analysis" 12

Image analysis Map region of the chip to a probe and convert its pixels into foreground and background intensities for the spot This is a crucial step in the analysis pipeline but will not be covered in this course What instruments and algorithms are recommended by the chip manufacturer? 13

Why normalize data? The experimental goal is to identify biological variation (expression changes between samples) Technical variation can hide the real data Unavoidable systematic bias should be recognized and corrected the process referred to as normalization Normalization is necessary to effectively make comparisons between chips and sometimes within a single chip 14

Normalization assumptions and approaches Some genes exhibit constant mrna levels: Housekeeping genes The level of some mrnas are known: Spike-in controls The total of all mrna remains constant: Global median and mean; Lowess The distribution of expression levels is constant quantile 15

Within-array normalization 2-color arrays may need normalization between red and green channels These methods are similar to between-array methods What should be normalized? Red intensities vs green intensities? Global mean/median Log ratio vs average intensity? Linear regression or loess Within-array may be followed by between-array methods 16

Normalization by global mean (total intensity) Procedure: Multiply/divide all expression values for one color (or chip if one-color) by a factor calculated to produce a constant mean (or total intensity) for every color. Example with 2 one-color arrays with a total intensity target of 50,000: Chip Sample gene expr (raw) Total expr on chip (raw) Norm. factor (tot des / tot obs ) Sample gene expr (norm) A 2.0 100,000 50,000 / 100,000 = 0.5 2.0 x 0.5 = 1.000 B 2.2 125,000 50,000 / 125,000 = 0.4 2.2 x 0.4 = 0.88 Similar scheme can be used with a subset of genes such as with spike-in controls or housekeeping genes 17

Global median normalization Procedure: Transform all expression values to produce a constant median More robust than using the mean 18

Variance normalization Different chips may have the same median or mean but still very different standard deviations If we assume the chips should have common standard deviations, they may be transformed in that manner 19

Quantile normalization Different chips may have the same standard deviation but different distributions If we assume the chips should have common distributions, they may be transformed in that manner 20

Lowess normalization Some 2-color arrays exhibit a systematic intensity-dependent bias As a result, the normalization factor needs to change with spot intensity Lowess (locally weighted scatterplot smoothing) uses local regression to address this Quackenbush, 2002 21

Local normalization Sometimes global within-array normalization may not correct all systematic unwanted variation Examples: print tip differences, degradation in chip regions, thumbprints Local normalization adjusts intensities according to chip geography It s best to avoid technologies that require these excessive transformations 22

Normalization - summary Normalization removes technical variation and improves power of comparisons The assumption(s) you make determine the normalization technique to use Always look at all the data before and after normalization Spike-in controls can help show which method may be best 23

Handling low-level values What is the background intensity of the chip? What expression values are just noise? Filtering / flagging low values Settings floors and ceilings Effects on fold changes and determination of differential expression 24

Affymetrix preprocessing Some oligo chip designs (like Affymetrix) represent each gene ( probeset ) with a set of oligos ( probes ). Affymetrix software (MAS) uses a special algorithm to convert measurements for a set of probes into one probeset value. Other algorithms (RMA, GCRMA, MBEI) have been developed by people who want to improve this calculation. These other algorithms appear to increase precision but decrease dynamic range. 25

Why use logarithms? Produce similar scales for fold changes in both the up and down directions Since log (a*b) = log(a) + log(b) Multiplicative effects are converted to additive effects, which simplifies statistical analysis Produce data with a more normal distribution variability that s not correlated with intensity 26

Summary Why are you doing a microarray experiment? What design will best help address your goal(s)? Normalize based on the biology and technology of the experiment Other transformations: preprocessing, dealing with low level values; logarithms Does your analysis pipeline make sense biologically and statistically? 27

References Dov Stekel. Microarray Bioinformatics. Cambridge, 2003. Churchill, GA. Fundamentals of experimental design for cdna microarrays. Nature Genetics Supp. 32:490-495, 2002. Quackenbush J. Microarray data normalization and transformation. Nature Genetics Supp. 32:496-501, 2002. Smyth GK et al. Statistical issues in cdna microarray data analysis. Methods Mol Biol. 224:111-36, 2003. Affymetrix. Statistical Algorithms Description Document. http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf Irizarry RA et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4(2):249-64, 2003. [RMA] Li C and Wong WH. Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol. 2(8), 2001 [MBEI] Wu Z and Irizarry RA. Stochastic models inspired by hybridization theory for short oligonucleotide arrays. Proceedings of RECOMB 04. [GCRMA] 28

Microarray tools Bioconductor (R statistics package) http://www.bioconductor.org/ BaRC analysis tools: http://iona.wi.mit.edu/bio/tools/bioc_tools.html Excel TIGR MultiExperiment Viewer (MeV) Many commercial and open source packages 29

Exercise 1 Excel syntax A2 A2:A100 =B5 =$B$5 =data!b4 =[otherfile.xls]data!b4 Cell reference Series of cells Formula Absolute link ( $ ) Reference other sheet Reference other file 30

Exercise 1: Excel functions MEDIAN SUM AVERAGE TRIMMEAN LOG IF TTEST VLOOKUP 31

R-Project web site 32

Introduction to R # Read a data file dat = read.delim( Data1.txt ) dim(dat) # Get dimension of matrix colnames(dat) # Get names of columns # Print rows 1-5, columns 2-4 dat[1:5, 2:4] # or use column and row names mean(dat[, my.col.1 ]) # get the mean of a column # Combine data by columns all.data = cbind(data1, data2) # Print a tab-delimited text file write.table(all.data, myfile.txt, sep= \t, quote=f) q() # quit [or use pull-down menu] 33

Exercise 1 To do Goal: Discovery of human developmentallyregulated genes Fetal vs adult; liver vs brain; assayed with Affymetrix chips Normalize data - 8 chips (replicates) Normalization by trimmed means k = (expression signal / chip trimmed mean) * 100 Calculate ratios Reduce data (replicates) Use AVERAGE function Ratio of fetal tissue/adult tissue Calculate log 2 of expression values and ratios 34