Background Correction and Normalization. Lecture 3 Computational and Statistical Aspects of Microarray Analysis June 21, 2005 Bressanone, Italy

Similar documents
Affymetrix GeneChip Arrays. Lecture 3 (continued) Computational and Statistical Aspects of Microarray Analysis June 21, 2005 Bressanone, Italy

Introduction to gene expression microarray data analysis

Preprocessing Affymetrix GeneChip Data. Affymetrix GeneChip Design. Terminology TGTGATGGTGGGGAATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT

Outline. Analysis of Microarray Data. Most important design question. General experimental issues

DNA Microarray Data Oligonucleotide Arrays

Analysis of Microarray Data

Bioinformatics III Structural Bioinformatics and Genome Analysis. PART II: Genome Analysis. Chapter 7. DNA Microarrays

Probe-Level Data Normalisation: RMA and GC-RMA Sam Robson Images courtesy of Neil Ward, European Application Engineer, Agilent Technologies.

Exploration, Normalization, Summaries, and Software for Affymetrix Probe Level Data

Exploration and Analysis of DNA Microarray Data

Humboldt Universität zu Berlin. Grundlagen der Bioinformatik SS Microarrays. Lecture

A Distribution Free Summarization Method for Affymetrix GeneChip Arrays

6. GENE EXPRESSION ANALYSIS MICROARRAYS

ADVANCED STATISTICAL METHODS FOR GENE EXPRESSION DATA

Introduction to Bioinformatics. Fabian Hoti 6.10.

Mixed effects model for assessing RNA degradation in Affymetrix GeneChip experiments

STATC 141 Spring 2005, April 5 th Lecture notes on Affymetrix arrays. Materials are from

Technical Review. Real time PCR

Probe-Level Analysis of Affymetrix GeneChip Microarray Data

FEATURE-LEVEL EXPLORATION OF THE CHOE ET AL. AFFYMETRIX GENECHIP CONTROL DATASET

DNA Arrays Affymetrix GeneChip System

New Statistical Algorithms for Monitoring Gene Expression on GeneChip Probe Arrays

affy: Built-in Processing Methods

FACTORS CONTRIBUTING TO VARIABILITY IN DNA MICROARRAY RESULTS: THE ABRF MICROARRAY RESEARCH GROUP 2002 STUDY

Comparison of Affymetrix GeneChip Expression Measures

Gene Expression Technology

Exploration, normalization, and summaries of high density oligonucleotide array probe level data

Soybean Microarrays. An Introduction. By Steve Clough. November Common Microarray platforms

What does PLIER really do?

Microarray Technique. Some background. M. Nath

SIMS2003. Instructors:Rus Yukhananov, Alex Loguinov BWH, Harvard Medical School. Introduction to Microarray Technology.

Gene Signal Estimates from Exon Arrays

Please purchase PDFcamp Printer on to remove this watermark. DNA microarray

SPH 247 Statistical Analysis of Laboratory Data

Gene Expression Data Analysis (I)

Predicting Microarray Signals by Physical Modeling. Josh Deutsch. University of California. Santa Cruz

Detection and Restoration of Hybridization Problems in Affymetrix GeneChip Data by Parametric Scanning

Comparison of Normalization Methods in Microarray Analysis

RNA Degradation and NUSE Plots. Austin Bowles STAT 5570/6570 April 22, 2011

Lecture #1. Introduction to microarray technology

Introduction to Bioinformatics and Gene Expression Technology

Analysis of Microarray Data

Agilent RNA Spike-In Kit

Probe-Level Data Analysis of Affymetrix GeneChip Expression Data using Open-source Software Ben Bolstad

New Stringent Two-Color Gene Expression Workflow Enables More Accurate and Reproducible Microarray Data

Recent technology allow production of microarrays composed of 70-mers (essentially a hybrid of the two techniques)

Methods of Biomaterials Testing Lesson 3-5. Biochemical Methods - Molecular Biology -

Outline. Array platform considerations: Comparison between the technologies available in microarrays

Quantitative Real time PCR. Only for teaching purposes - not for reproduction or sale

Microarrays: since we use probes we obviously must know the sequences we are looking at!

Guidelines for setting up microrna profiling experiments v2.0

QUANTITATIVE RT-PCR PROTOCOL (SYBR Green I) (Last Revised: April, 2007)

3.1.4 DNA Microarray Technology

Quality Measures for CytoChip Microarrays

Canadian Bioinforma2cs Workshops

EECS730: Introduction to Bioinformatics

Parameter Estimation for the Exponential-Normal Convolution Model

Identifying Candidate Informative Genes for Biomarker Prediction of Liver Cancer

STATISTICAL TECHNIQUES. Data Analysis and Modelling

Quantitative Real Time PCR USING SYBR GREEN

Introduction to Bioinformatics and Gene Expression Technologies

Intro to Microarray Analysis. Courtesy of Professor Dan Nettleton Iowa State University (with some edits)

Applications and Uses. (adapted from Roche RealTime PCR Application Manual)

What you still might want to know about microarrays. Brixen 2011 Wolfgang Huber EMBL

Description of Logit-t: Detecting Differentially Expressed Genes Using Probe-Level Data

Introduction to BioMEMS & Medical Microdevices DNA Microarrays and Lab-on-a-Chip Methods

Roche Molecular Biochemicals Technical Note No. LC 10/2000

A learned comparative expression measure for Affymetrix GeneChip DNA microarrays

Technical Note. Performance Review of the GeneChip AutoLoader for the Affymetrix GeneChip Scanner Introduction

DNA Microarrays Introduction Part 2. Todd Lowe BME/BIO 210 April 11, 2007

Agilent s Mx3000P and Mx3005P

Quality Control Assessment in Genotyping Console

Bioinformatics and Genomics: A New SP Frontier?

BABELOMICS: Microarray Data Analysis

Designing Complex Omics Experiments

DNA Microarray Technology

Real-Time PCR Workshop Gene Expression. Applications Absolute and Relative Quantitation

Oligonucleotide microarray data are not normally distributed

Quantitation of mrna Using Real-Time Reverse Transcription PCR (RT-PCR)

User Bulletin. ABI PRISM 7000 Sequence Detection System DRAFT. SUBJECT: Software Upgrade from Version 1.0 to 1.1. Version 1.

DNA Microarray Experiments: Biological and Technological Aspects

Using Low Input of Poly (A) + RNA and Total RNA for Oligonucleotide Microarrays Application

Gene Expression Profiling and Validation Using Agilent SurePrint G3 Gene Expression Arrays

Class Information. Introduction to Genome Biology and Microarray Technology. Biostatistics Rafael A. Irizarry. Lecture 1

University of Groningen

Introduction to ChIP Seq data analyses. Acknowledgement: slides taken from Dr. H

Introduction To Real-Time Quantitative PCR (qpcr)

An overview of image-processing methods for Affymetrix GeneChips

For research use only. Not for use in diagnostic procedures. AFFYMETRIX UK Ltd., AFFYMETRIX, INC.

Introduction to microarrays. Overview The analysis process Limitations Extensions (NGS)

Gasoline Consumption Analysis

Module1TheBasicsofRealTimePCR Monday, March 19, 2007

Expression Array System

Comparison of 3 labeling and amplification protocols: for total RNA labeling, amplification, and Affymetrix GeneChip expression (last updated 9/26/05)

High-Resolution Oligonucleotide- Based acgh Analysis of Single Cells in Under 24 Hours

User Bulletin #2. ABI PRISM 7700 Sequence Detection System. Introduction. Contents. December 11, 1997

Supporting Online Material for

Data Mining For Genomics

Comparative analysis of microarray normalization procedures: effects on reverse engineering gene networks

+? Mean +? No change -? Mean -? No Change. *? Mean *? Std *? Transformations & Data Cleaning. Transformations

Transcription:

Background Correction and Normalization Lecture 3 Computational and Statistical Aspects of Microarray Analysis June 21, 2005 Bressanone, Italy

Feature Level Data

Outline Affymetrix GeneChip arrays Two color platforms

Affymetrix GeneChip Design 5 3 Reference sequence TGTGATGGTGCATGATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT GTACTACCCAGTCTTCCGGAGGCTA Perfectmatch GTACTACCCAGTGTTCCGGAGGCTA Mismatch NSB & SB NSB

Before Hybridization Sample 1 Sample 2 Array 1 Array 2

More Realistic Sample 1 Sample 2 Array 1 Array 2

Non-specific Hybridization Array 1 Array 2

Affymetrix GeneChip Design 5 3 Reference sequence TGTGATGGTGCATGATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT GTACTACCCAGTCTTCCGGAGGCTA Perfectmatch (PM) GTACTACCCAGTGTTCCGGAGGCTA Mismatch (MM) NSB & SB NSB

GeneChip Feature Level Data MM features used to measure optical noise and nonspecific binding directly More than 10,000 probesets Each probeset represented by 11-20 feature Note 1: Position of features are haphazardly distributed about the array. Note 2: There are between 20-100 chip types So we have PM gij, MM gij (g is gene, i is array and j is feature) A default summary is the avg of the PM-MM

Two color platforms Common to have just one feature per gene Typically, longer molecules are used so nonspecific binding not so much of a worry Optical noise still a concern After spots are identified, a measure of local background is obtained from area around spot

Local background ---- GenePix ---- QuantArray ---- ScanAnalyze

Two color feature level data Red and Green foreground and and background obtained from each feature We have Rf gij, Gf gij, Rb gij, Gb gij (g is gene, i is array and j is replicate) A default summary statistic is the log-ratio: (Rf-Rb) / (Gf - Gb)

Affymetrix Spike In Experiment

Spike-in Experiment Throughout we will be using Data from Affymetrix s spike-in experiment Replicate RNA was hybridized to various arrays Some probesets were spiked in at different concentrations across the different arrays This gives us a way to assess precision and accuracy Done for HGU95 and HGU133 chips

Spikein Experiment (HG-U95) Probeset A r r a y 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 A 0 0.25 0.5 1 2 4 8 16 32 64 128 0 512 1024 256 32 B 0.25 0.5 1 2 4 8 16 32 64 128 256 0.25 1024 0 512 64 C 0.5 1 2 4 8 16 32 64 128 256 512 0.5 0 0.25 1024 128 D 1 2 4 8 16 32 64 128 256 512 1024 1 0.25 0.5 0 256 E 2 4 8 16 32 64 128 256 512 1024 0 2 0.5 1 0.25 512 F 4 8 16 32 64 128 256 512 1024 0 0.25 4 1 2 0.5 1024 G 8 16 32 64 128 256 512 1024 0 0.25 0.5 8 2 4 1 0 H 16 32 64 128 256 512 1024 0 0.25 0.5 1 16 4 8 2 0.25 I 32 64 128 256 512 1024 0 0.25 0.5 1 2 32 8 16 4 0.5 J 64 128 256 512 1024 0 0.25 0.5 1 2 4 64 16 32 8 1 K 128 256 512 1024 0 0.25 0.5 1 2 4 8 128 32 64 16 2 L 256 512 1024 0 0.25 0.5 1 2 4 8 16 256 64 128 32 4 M 512 1024 0 0.25 0.5 1 2 4 8 16 32 512 128 256 64 8 N 512 1024 0 0.25 0.5 1 2 4 8 16 32 512 128 256 64 8 O 512 1024 0 0.25 0.5 1 2 4 8 16 32 512 128 256 64 8 P 512 1024 0 0.25 0.5 1 2 4 8 16 32 512 128 256 64 8 Q 1024 0 0.25 0.5 1 2 4 8 16 32 64 1024 256 512 128 16 R 1024 0 0.25 0.5 1 2 4 8 16 32 64 1024 256 512 128 16 S 1024 0 0.25 0.5 1 2 4 8 16 32 64 1024 256 512 128 16 T 1024 0 0.25 0.5 1 2 4 8 16 32 64 1024 256 512 128 16

Spikein Experiment (HG-U133) A similar experiment was repeated for a newer chip The 1024 picomolar concentration was not used. 1/8 was used instead. No groups of 12 Note: More spike-ins to come!

Introduction to Preprocessing

Composition of Feature Intensities Specific binding Non-specific binding and Optical noise Systematic variations Stochastic noise We will focus on Affymetrix GeneChip first

Three main steps Background adjustment of correction Normalization Summarization

Why is it difficult? These are from replicate RNA. Left are empirical density estimates for 5 GeneChip arrays. Right is MA plot for one two-color array

Approaches Stepwise: The most common approach is to solve these issues one by one and produce one measurement of expression (absolute or relative) for each gene Typically the order is: correct for background, normalize, then summarize. Integrated: Statistical models that quantify the need for normalization as well as the amount of expression are used

Background Noise

Distribution of Optical and NSB Noise Note: Material in this lecture are for Affymetrix Arrays

Effect on Data Concentration of 0 pm Concentration of 0.5 pm 100 100 Concentration of 1.0 pm 100

Spike-In Data Specific Binding: Linear with concentration Background Additive Y=B+aC

Effect of Background B + ac 1 B + ac 2 C 1 C 2 log(b + ac 1 ) log(b + ac 2 ) log(c 1 ) log(c 2 ) log(b + ac) log(b) When a or C is small

Why background correct

Different NSB for different features Therobe Effect

Direct Measurement Strategy Affymetrix s tries to measure nonspecific binding directly PM-MM is supposed to give us a good estimate of specific binding But it is not so simple

Sometimes MM larger then PM

Deterministic Model PM = O + N + S MM = O + N PM MM = S For reasons explained later, we want to take log of S

Deterministic model is wrong Do MM measure nonspecific binding? Look at Yeast DNA hybridized to Human Chip Look at PM, MM logscale scatter-plot R 2 is only 0.5

Stochastic Model (Additive background/multiplicative error) PM = O PM + N PM + S, MM = O MM + N MM log (N PM ), log (N MM ) ~ Bivariate Normal (ρ 0.7) S = exp ( s + a + ε ) s is the quantity of interest (log scale expression) E[ PM MM ] = S, but Var[ log( PM MM ) ] ~ 1/S 2 (can be very large) Note: Technically, var[ log(pm-mm) ] is not defined. We need to make sure we avoid taking logs of negative. Affymetrix s current default does this.

Simulation We create some feature level data for two replicate arrays Then compute Y=log(PM-kMM) for each array We make an MA using the Ys for each array We make a observed concentration versue known concentration plot We do this for various values of k. The following movie shows k moving from 0 to 1.

Does this happen in practice? MAS 5.0 RMA

RMA: The Basic Idea Observed: PM Of interest: S PM=B+S Pose a statistical model and use it to predict S from the observed PM

The Basic Idea PM=B+S A mathematically convenient, useful model B ~ Normal (µ,σ) S ~ Exponential (λ) S ˆ = E[ S PM No MM Borrowing strength across probes ]

Normalization

Normalization Normalization is needed to ensure that differences in intensities are indeed due to differential expression, and not some printing, hybridization, or scanning artifact. Normalization is necessary before any analysis which involves within or between slides comparisons of intensities, e.g., clustering, testing. Normalization is different in spotted/two-color and high-density-oligonucleotides (Affy) technologies

Outline Why do we need to normalize Types of normalization Affy cdna Case study Discussion

Technical Replicates Different scanners where used! These is Affymetrix data

Empirical Densities for Replicates Compliments of Ben Bolstad

Self-self hybridization log 2 R vs. log 2 G

Self-self hybridization M vs. A Two-color platform

Self-self hybridization M vs. A Robust local regression within sectors (print-tip-groups) of intensity log-ratio M on average log-intensity A.

Intensity log-ratio, M Boxplots by print-tip-group

What can we do? Throw away the data and start again? Maybe. Statistics offers hope: Use control genes to adjust Assume most genes are not differentially expressed Assume distribution of expression are the same

Simplest Idea Assume all arrays have the same median log expression or relative log expression Subtract median from each array In two-color platforms, we typically correct the Ms. Median correction forces the median log ratio to be 0 Note: We assume there are as many over-expressed as underexpressed genes) For Affymetrix arrays we usually add a constant that takes us back to the original range. It is common to use the median of the medians Typically, we subtract in the log-scale

How does it work Notice subtracting in the original scale wont work well

How does it work The medians match but there are some discrepancies

What are the consequences These are two technical replicates. Only the red are differentially expressed

There appears to be a non-linear dependence on intensity Proposed solutions Force distributions (not just medians) to be the same: Amaratunga and Cabrera (2001) Bolstad et al. (2003) Use curve estimators such as splines to adjust for the effect: Li and Wong (2001) Note: they also use a rank invariant set Colantuoni et al (2002) Dudoit et al (2002) Use adjustments based on additive/multiplicative model: Rocke and Durbin (2003) Huber et al (2002) Cui et al (2003)

Quantile normalization All these non-linear methods perform similarly Quantiles is my favorite because its fast Basic idea: order value in each array take average across probes Substitute probe intensity with average Put in original order

Example of Example of quantile quantile normalization normalization 9 3 3 8 5 3 8 6 4 14 4 5 4 4 2 14 6 5 9 5 4 8 4 3 8 4 3 4 3 2 8 8 8 6 6 6 5 5 5 5 5 5 3 3 3 6 3 5 5 6 5 5 8 6 8 5 8 3 5 3 Original Ordered Averaged Re-ordered

How does it work

How does it work Does it wash away real differential expression?

Before and These are two technical replicates. Only the red are differentially expressed

After quantile normalization We say more later..

Two-color platforms Identify and remove the effects of systematic variation in the measured fluorescence intensities, other than differential expression, for example different labelling efficiencies of the dyes; different amounts of Cy3- and Cy5-labelled mrna; different scanning parameters; print-tip, spatial, or plate effects, etc.

Normalization Much more complicated than Affy The need for normalization can be seen most clearly in self-self hybridizations, where the same mrna sample is labeled with the Cy3 and Cy5 dyes. The imbalance in the red and green intensities is usually not constant across the spots within and between arrays, and can vary according to overall spot intensity, location, plate origin, etc. These factors should be considered in the normalization.

MA-plot by print-tip-group Intensity log ratio, M The smooth curves were created using loess. Splines work just as well Average log intensity, A

Example of Normalization log 2 R/G log 2 R/G L(intensity, sector, ) Constant normalization: L is constant Adaptive normalization: L depends on a number of predictor variables, such as spot intensity A, sector, plate origin. Intensity-dependent normalization. Intensity and sector-dependent normalization. 2D spatial normalization. Other variables: time of printing, plate, etc. Composite normalization. Weighted average of several normalization functions.

2D images of L values Global median normalization Global loess normalization Within-print-tipgroup loess normalization 2D spatial normalization

2D images of normalized M-L Global median normalization Within-print-tipgroup loess normalization Global loess normalization 2D spatial normalization

Boxplots of normalized M-L Global median normalization Global loess normalization Within-print-tipgroup loess normalization 2D spatial normalization

MA-plots of normalized M-L Global median normalization Global loess normalization Within-print-tipgroup loess normalization 2D spatial normalization

Comparison to other methods MSP Rank invariant Housekeeping Tubulin, GAPDH

What if we know most genes are up and down regulated? Then it will be hard to normalize Let us look at an example

Dilution Experiment crna hybridized to human chip (HGU95) in range of proportions and dilutions Dilution series begins at 1.25 µg crna per GeneChip array, and rises through 2.5, 5.0, 7.5, 10.0, to 20.0 µg per array. 5 replicate chips were used at each dilution A few control genes were spiked in at same concentration across all arrays. Notice we can only make usual assumption within groups of 5

Raw Data

If you can t t normalized your conclusions will likely be wrong!

Normalize with control genes Looks almost as bad!

We can not median normalize We wash out real biological effects

Or quantile normalize

But we can quantile normalize and then use controls

Look at the improvement

Look at the improvement

Method Med slope Log SD Std % change Raw 0.39 0.35 27% Controls 0.52 0.25 19% Median -0.01 0.17 12% Quantiles 0.00 0.14 10% Quantiles/ctrls 0.55 0.14 10%

Why the non-linear dependence? The additive background noise plus multiplicative error models predicts these non-linear behavior. The multiplicative part explains the difference in over-all mean log-intensity Different background noise mean levels causes non-linear MA plots

Add a different constant

Add noise with different mean

Conclusions I still haven t seen microarry data that doesn t need to be normalized Non-linear methods are better than linear ones If you can t assume most genes are not differentially expressed or that probe intensities have roughly the same distribution you need many control genes (hundreds maybe thousands) Control genes must cover the dynamic range