Preprocessing Affymetrix GeneChip Data. Affymetrix GeneChip Design. Terminology TGTGATGGTGGGGAATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT

Similar documents
Affymetrix GeneChip Arrays. Lecture 3 (continued) Computational and Statistical Aspects of Microarray Analysis June 21, 2005 Bressanone, Italy

From CEL files to lists of interesting genes. Rafael A. Irizarry Department of Biostatistics Johns Hopkins University

DNA Microarray Data Oligonucleotide Arrays

Exploration, Normalization, Summaries, and Software for Affymetrix Probe Level Data

Expression summarization

Pre-processing DNA Microarray Data

Probe-Level Data Normalisation: RMA and GC-RMA Sam Robson Images courtesy of Neil Ward, European Application Engineer, Agilent Technologies.

Background and Normalization:

Pre-processing DNA Microarray Data

Normalization. Getting the numbers comparable. DNA Microarray Bioinformatics - #27612

Introduction to Bioinformatics! Giri Narasimhan. ECS 254; Phone: x3748

Image Analysis. Based on Information from Terry Speed s Group, UC Berkeley. Lecture 3 Pre-Processing of Affymetrix Arrays. Affymetrix Terminology

Background Correction and Normalization. Lecture 3 Computational and Statistical Aspects of Microarray Analysis June 21, 2005 Bressanone, Italy

A Distribution Free Summarization Method for Affymetrix GeneChip Arrays

Rafael A Irizarry, Department of Biostatistics JHU

Introduction to gene expression microarray data analysis

The essentials of microarray data analysis

Microarray Data Analysis Workshop. Preprocessing and normalization A trailer show of the rest of the microarray world.

AFFYMETRIX c Technology and Preprocessing Methods

STATC 141 Spring 2005, April 5 th Lecture notes on Affymetrix arrays. Materials are from

The Affymetrix platform for gene expression analysis Affymetrix recommended QA procedures The RMA model for probe intensity data Application of the

affy: Built-in Processing Methods

Biology 644: Bioinformatics

Parameter Estimation for the Exponential-Normal Convolution Model

Bioinformatics III Structural Bioinformatics and Genome Analysis. PART II: Genome Analysis. Chapter 7. DNA Microarrays

2007/04/21.

Probe-Level Analysis of Affymetrix GeneChip Microarray Data

From hybridization theory to microarray data analysis: performance evaluation

Exploration, normalization, and summaries of high density oligonucleotide array probe level data

Outline. Analysis of Microarray Data. Most important design question. General experimental issues

Microarray probe expression measures, data normalization and statistical validation

Mixed effects model for assessing RNA degradation in Affymetrix GeneChip experiments

Microarray Data Analysis. Normalization

Analysis of Microarray Data

FEATURE-LEVEL EXPLORATION OF THE CHOE ET AL. AFFYMETRIX GENECHIP CONTROL DATASET

Probe-Level Data Analysis of Affymetrix GeneChip Expression Data using Open-source Software Ben Bolstad

CS-E5870 High-Throughput Bioinformatics Microarray data analysis

Identification of spatial biases in Affymetrix oligonucleotide microarrays

Comparison of Normalization Methods in Microarray Analysis

PLM Extensions. B. M. Bolstad. October 30, 2013

Exam 1 from a Past Semester

Measuring gene expression (Microarrays) Ulf Leser

Analysis of Microarray Data

SPH 247 Statistical Analysis of Laboratory Data

GS Analysis of Microarray Data

Ning Tang ALL RIGHTS RESERVED

Preprocessing Methods for Two-Color Microarray Data

What does PLIER really do?

Bioinformatics Advance Access published February 10, A New Summarization Method for Affymetrix Probe Level Data

Description of Logit-t: Detecting Differentially Expressed Genes Using Probe-Level Data

Lecture 2: March 8, 2007

Normalizing Affy microarray data

A Statistical Framework for the Analysis of Microarray Probe-Level Data

Stochastic Models Inspired by Hybridization Theory for Short Oligonucleotide Arrays

Humboldt Universität zu Berlin. Grundlagen der Bioinformatik SS Microarrays. Lecture

Mixture modeling for genome-wide localization of transcription factors

New Statistical Algorithms for Monitoring Gene Expression on GeneChip Probe Arrays

ADVANCED STATISTICAL METHODS FOR GENE EXPRESSION DATA

Comparison of Affymetrix GeneChip Expression Measures

An Interactive Power Analysis Tool for Microarray Hypothesis Testing and Generation

Microarray Informatics

Analyzing DNA Microarray Data Using Bioconductor

Identification of biological themes in microarray data from a mouse heart development time series using GeneSifter

A Model Based Background Adjustment for. Oligonucleotide Expression Arrays

An Exploration of Affymetrix Probe-Set Intensities in Spike-In Experiments

Comparison of Microarray Pre-Processing Methods

Detection and Restoration of Hybridization Problems in Affymetrix GeneChip Data by Parametric Scanning

A Parallel Approach to Microarray Preprocessing and Analysis

Introduction to Bioinformatics and Gene Expression Technology

For research use only. Not for use in diagnostic procedures. AFFYMETRIX UK Ltd., AFFYMETRIX, INC.

A learned comparative expression measure for Affymetrix GeneChip DNA microarrays

Gene Signal Estimates from Exon Arrays

Array Quality Metrics. Audrey Kauffmann

Computational Biology Lecture #11: OMICS: Transcriptomics & Proteomics. Probes & ProbeSets in Affymetrix Chips

Generating quality metrics reports for microarray data sets. Audrey Kauffmann

What you still might want to know about microarrays. Brixen 2011 Wolfgang Huber EMBL

Exploration and Analysis of DNA Microarray Data

Comparative analysis of microarray normalization procedures: effects on reverse engineering gene networks

Improvements to the RMA Algorithm for Gene Expression Microarray Background Correction

Introduction to Genome Wide Association Studies 2014 Sydney Brenner Institute for Molecular Bioscience/Wits Bioinformatics Shaun Aron

On the adequate use of microarray data

A REVIEW OF GENE EXPRESSION ANALYSIS ON MICROARRAY DATASETS OF BREAST CELLS USING R LANGUAGE

Analysis of a Proposed Universal Fingerprint Microarray

Introduction to biology and measurement of gene expression

New Spiked-In Probe Sets for the Affymetrix HGU-133A Latin Square Experiment

Exploration and Analysis of DNA Microarray Data

Bioinformatics for Biologists

Package sscore. R topics documented: October 4, Version Date

Joint Estimation of Calibration and Expression for High-Density Oligonucleotide Arrays

Pre- Processing Methodologies for Microarray Gene Data for Cancer Detection

A note on oligonucleotide expression values not being normally distributed

1. Introduction Gene regulation Genomics and genome analyses

DNA Microarray Technology

Clones. Glass Slide PCR. Purification. Array Printing. Post-process. Hybridize. Scan. Array Fabrication. Sample Preparation/Hybridization.

Microarrays The technology

Intro to Microarray Analysis. Courtesy of Professor Dan Nettleton Iowa State University (with some edits)

Computational Biology I

Predicting Microarray Signals by Physical Modeling. Josh Deutsch. University of California. Santa Cruz

Technical Note. GeneChip 3 IVT PLUS Reagent Kit vs. GeneChip 3 IVT Express Reagent Kit Comparison. Introduction:

Integrative Genomics 1a. Introduction

Transcription:

Preprocessing Affymetrix GeneChip Data Credit for some of today s materials: Ben Bolstad, Leslie Cope, Laurent Gautier, Terry Speed and Zhijin Wu Affymetrix GeneChip Design 5 3 Reference sequence TGTGATGGTGGGGAATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT CCCTTACCCAGTCTTCCGGAGGCTA Perfectmatch CCCTTACCCAGTGTTCCGGAGGCTA Mismatch NSB & SB NSB Terminology Each gene or portion of a gene is represented by 11 to 20 oligonucleotides of 25 base-pairs. Reporter/Feautre/Probe: an oligonucleotide of 25 base-pairs, i.e., a 25-mer. Perfect match (PM): A 25-mer complementary to a reference sequence of interest (e.g., part of a gene). Mismatch (MM): same as PM but with a single homomeric base change for the middle (13 th ) base (transversion purine <-> pyrimidine, G <->C, A <->T). Probe-pair: a (PM,MM) pair. Probe-pair set: a collection of probe-pairs (11 to 20) related to a common gene or fraction of a gene. Affy ID: an identifier for a probe-pair set. The purpose of the MM probe design is to measure nonspecific binding and background noise. 1

Affymetrix files Main software from Affymetrix company MicroArray Suite - MAS, now version 5. DAT file: Image file, ~10^7 pixels, ~50 MB. CEL file: Cell intensity file, probe level PM and MM values. CDF file: Chip Description File. Describes which probes go in which probe sets and the location of probe-pair sets (genes, gene fragments, ESTs). Expression Measures 10-20K genes represented by 11-20 pairs of probe intensities (PM & MM) Obtain expression measure for each gene on each array by summarizing these pairs We already discussed background adjustment and normalization. We assume this has been done. There are many methods Data and notation PM ijg, MM ijg = Intensity for perfect match and mismatch probe in cell j for gene g in chip i. i = 1,, n -- from one to hundreds of chips; j = 1,, J -- usually 11 or 20 probe pairs; g = 1,, G -- between 8,000 and 20,000 probe sets. Task: summarize for each probe set the probe level data, i.e., PM and MM pairs, into a single expression measure. Expression measures may then be compared within or between chips for detecting differential expression. 2

MAS 4.0 GeneChip MAS 4.0 software used AvDiff up until 2001 AvDiff = 1 " % j $" (PM j # MM j ) where A is a set of suitable pairs, e.g., pairs with d j = PM j -MM j within 3 SDs of the average of d (2),, d (J-1) Obvious problems: Negative values No log scale Why use log? Original scale Log scale Li and Wong s s observations There is a large probe effect There are outliers that are only noticed when looking across arrays Non- linear normalization needed (discussed in previous lecture) PS vol. 98. no. 1, 31-36 3

Probe effect Probe effect makes correlation deceiving Correlation for absolute expression of replicates looks great! But Probe effect makes correlation deceiving It is better to look at relative expression because probe effect is somewhat cancelled out. Later we will see that we can take advantage of probe effect to find outlier probes. 4

Li & Wong Li & Wong (2001) fit a model for each probe set, i.e., gene PM ij " MM ij = # i $ j + % ij, % ij & N(0,' 2 ) where θ i : model based expression index (MBEI), φ j : probe sensitivity index. Maximun likelihood estimate of MBEI is used as expression measure for the gene in chip i. Non-linear normalization used Ad-hoc procedure used to remove outliers Need at least 10 or 20 chips There is one more reason why PM-MM is undesirable Especially for large PM 5

We see bimodality log 2 PM-log 2 MM (log 2 PM+log 2 MM)/2 Two more problems with MM MM detect signal MM cost $$$ MAS 5.0 Current version, MAS 5.0, uses Signal signal = Tukey Biweight{log(PM j " MM j * )} Notice now log is used But what about negative PM-MM? MM* is a new version of MM that is never larger than PM. If MM < PM, MM* = MM. If MM >= PM, SB = Tukey Biweight (log(pm)-log(mm)) (log-ratio). log(mm*) = log(pm)-log(max(sb, +ve)). Tukey Biweight: B(x) = (1 (x/c)^2)^2 if x <c, 0 ow. 6

Can this be improved? We will discuss P/M/A calls later Rank of Spikeins (out of 12626) 141 250 364 368 480 586 686 838 945 1153 1567 Original MBEI RMA Robust regression method to estimate expression measure and SE from PM * (background adjusted normalized PM) Use quantile normalization Assume additive model log 2 (PM ij * ) = a i + b j + " ij Estimate RMA = a i for chip i using robust method, such as median polish (fit iteratively, successively removing row and column medians, and accumulating the terms, until the process stabilizes). Works with n=2 or more chips This is a robust multi-array analysis (RMA) 7

Can this be improved? Rank of Spikeins (out of 12626) 141 250 364 368 480 586 686 838 945 1153 1567 RMA Irizarry et al. (2003) R 31:e15 Rank of Spikeins (out of 12626) 1 2 3 4 7 11 15 21 35 122 1182 230 450 1380 11700 Detection 8

Detection The detection problem: Given the probe-level data, which mr transcripts are present in the sample? Biologists are mostly interested in expression levels, and so detection has received less attention To date only Affymetrix has tackled this, with Rank-based tests Implemented in MAS5.0 MAS Rank-based Detection The test used in MAS 5.0 compares the following two hypotheses H 0 : median (PM j - MM j )/(PM j + MM j ) = τ ; H 1 : median (PM j - MM j )/(PM j + MM j ) > τ. Significance levels: 0 < α 1 < α 2 < 0.5. If p is the p-value for the (rank) test, MAS 5.0 calls a transcript absent: if p > α 2, marginal: if α 1 p α 2, and present: if p < α 1. Typically tests are carried out with τ = 0.15, α 1 =.04 and α 2 =.06. Expression Detection MAS 5.0 9

Remember uncertainty Some data analysts remove probesets called absent from further analysis This creates false negatives: HG95 Present Absent HGU133 Present Absent P 82% 0% P 77% 0% M 1% 0% M 3% 0% A 17% 100% A 20% 100% From spike-in experiments Consistency across reps Consistency across reps 10

Current work We need better estimates of means and variances of bivariate normal background noise Use observed MM intensities along with sequence information We also have a solution that does not use the MM 11