SPH 247 Statistical Analysis of Laboratory Data

SPH 247 Statistical Analysis of Laboratory Data April 14, 2015 SPH 247 Statistical Analysis of Laboratory Data 1

Basic Design of Expression Arrays
For each gene that is a target for the array, we have a known DNA sequence.
mRNA is reverse transcribed to DNA, and if a complementary sequence is on the chip, the DNA will be more likely to stick.
The DNA is labeled with a dye that fluoresces and generates a signal that is monotonic in the amount in the sample.

[Figure: double-stranded DNA sequence with the labels Exon, Intron, and Probe Sequence]
TAAATCGATACGCATTAGTTCGACCTATCGAAGACCCAACACGGATTCGATACGTTAATATGACTACCTGCGCAACCCTAACGTCCATGTATCTAATACG
ATTTAGCTATGCGTAATCAAGCTGGATAGCTTCTGGGTTGTGCCTAAGCTATGCAATTATACTGATGGACGCGTTGGGATTGCAGGTACATAGATTATGC
cDNA arrays use variable-length probes derived from expressed sequence tags (ESTs).
Spotted and almost always used with two-color methods.
Can be used in species with an unsequenced genome.
Long oligo arrays use 60-70mers.
Agilent two-color arrays.
Illumina Bead Arrays.
Usually use computationally derived probes but can use probes from sequenced ESTs.

Affymetrix GeneChips use multiple 25-mers.
For each gene, one or more sets of 8-20 distinct probes.
May overlap.
May cover more than one exon.
Affymetrix chips also use mismatch (MM) probes that have the same sequence as perfect match (PM) probes except for the middle base, which is changed to inhibit binding.
This is supposed to act as a control, but the MM probe often binds to another mRNA species instead, so many analysts do not use them.
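The PM/MM relationship described above can be illustrated with a small base R sketch. It assumes the usual convention that the "changed" middle base is the 13th base of the 25-mer, replaced by its complement; the probe sequence is hypothetical (the first 25 bases of the sequence shown on the earlier slide), and make_mm is an illustrative helper, not a function from any package.

```r
# Build the mismatch (MM) partner of a 25-mer perfect match (PM) probe
# by replacing the middle (13th) base with its complement.
# Assumption: "changed" means complemented, which is the usual convention.
make_mm <- function(pm) {
  stopifnot(nchar(pm) == 25)
  complement <- c(A = "T", T = "A", C = "G", G = "C")
  bases <- strsplit(pm, "")[[1]]
  bases[13] <- complement[[bases[13]]]
  paste(bases, collapse = "")
}

# Hypothetical PM probe, used only for illustration.
pm <- "TAAATCGATACGCATTAGTTCGACC"
mm <- make_mm(pm)

# The two probes differ at exactly one position.
n_diff <- sum(strsplit(pm, "")[[1]] != strsplit(mm, "")[[1]])
print(rbind(PM = pm, MM = mm))
print(n_diff)
```

Because the two sequences differ by a single base, it is plausible that an MM probe still hybridizes to some transcript, which is one way to read the remark that MM probes work badly as controls.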

Illumina Bead Arrays
Beads are coated with many copies of a 50-mer gene-specific probe and a 29-mer address sequence.
Multiple beads per probe; the number is random, but around 20.
Each Ref-8 chip contains 8 arrays with ~25,000 targets, plus controls.
Each WG-6 chip contains 6 arrays with ~50,000 targets, plus controls.
Each HT-12 chip contains 12 arrays with ~50,000 targets, plus controls.

Probe Design
A good probe sequence should match the chosen gene, or an exon from that gene, and should not match any other gene in the genome.
Melting temperature depends on the GC content and should be similar for all probes on an array, since the hybridization must be conducted at a single temperature.
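As a rough illustration of the GC-content criterion, the sketch below computes the GC fraction of two 25-mers in base R. The two sequences are hypothetical probes cut from the example strands shown earlier, and gc_content is an illustrative helper; real probe design also has to check uniqueness against the genome, which is not attempted here.

```r
# Fraction of G and C bases in a probe sequence.
gc_content <- function(probe) {
  bases <- strsplit(toupper(probe), "")[[1]]
  mean(bases %in% c("G", "C"))
}

# Hypothetical 25-mer probes, for illustration only.
probes <- c(p1 = "TAAATCGATACGCATTAGTTCGACC",
            p2 = "GGGTTGTGCCTAAGCTATGCAATTA")

gc <- sapply(probes, gc_content)
print(gc)

# Probes with very different GC content would be expected to have
# different melting temperatures, so a designer might flag large gaps.
print(diff(range(gc)))
```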

The affinity of a given piece of DNA for the probe sequence can depend on many things, including secondary and tertiary structure as well as GC content.
This means that the relationship between the concentration of the RNA species in the original sample and the brightness of the spot on the array can be very different for different probes for the same gene.
Thus only comparisons of intensity within the same probe across arrays make sense.
A higher signal for one gene than another on the same array does not mean that the copy number is higher.

Affymetrix GeneChips
For each probe set, there are 8-20 perfect match (PM) probes, which may or may not overlap and which target the same gene.
There are also mismatch (MM) probes, which are supposed to serve as a control but do so rather badly.
Most of us ignore the MM probes.

Expression Indices
A key issue with Affymetrix chips is how to summarize the multiple data values on a chip for each probe set (aka gene).
A large number of methods have been suggested.
Generally, the worst ones are those from Affy, by a long way; "worse" here means less able to detect real differences.
Summarizing Illumina beads is simpler, but there are still issues.

Usable Methods
Li and Wong's dChip and follow-on work is demonstrably better than MAS 4.0 and MAS 5.0, but not as good as RMA and GLA.
The RMA method of Irizarry et al. is available in Bioconductor.
The GLA method (Durbin, Rocke, Zhou) is also available in Bioconductor/CRAN as part of the LMGene R package.

Bioconductor Documentation
> library(affy)
Loading required package: Biobase
Loading required package: tools
Welcome to Bioconductor
Vignettes contain introductory material. To view, type 'openVignette()'. To cite Bioconductor, see 'citation("Biobase")' and for packages 'citation(pkgname)'.
Loading required package: affyio
Loading required package: preprocessCore

Bioconductor Documentation
> openVignette()
Please select a vignette:
1: affy - 1. Primer
2: affy - 2. Built-in Processing Methods
3: affy - 3. Custom Processing Methods
4: affy - 4. Import Methods
5: affy - 5. Automatic downloading of CDF packages
6: Biobase - An introduction to Biobase and ExpressionSets
7: Biobase - Bioconductor Overview
8: Biobase - esApply Introduction
9: Biobase - Notes for eSet developers
10: Biobase - Notes for writing introductory 'how to' documents
11: Biobase - quick views of eSet instances
Selection:

Reading Affy Data into R
The CEL files contain the data from an array.
We will look at data from an older type of array, the U95A, which contains 12,625 probe sets and 409,600 probes.
The CDF file contains information relating probe pair sets to locations on the array. These are built into the affy package for standard array types.

Example Data Set
Data from Robert Rice's lab on twelve keratinocyte cell lines, at six different stages.
Affymetrix HG U95A GeneChips.
For each gene, we will run a one-way ANOVA with two observations per cell.
For this illustration, we will use RMA.

Files for the Analysis
The .CDF file has the U95A chip definition (which probe is where on the chip). It is built in to the affy package.
The .CEL files contain the raw data after pixel-level analysis, one number for each spot.
The files are called LN0A.CEL, LN0B.CEL, ..., LN5B.CEL and are on the web site.
409,600 probe values in 12,625 probe sets.

The ReadAffy function
The ReadAffy() function reads all of the CEL files in the current working directory into an object of class AffyBatch, which, like ExpressionSet, extends the Biobase eSet class.
ReadAffy(widget=TRUE) does so in a GUI that allows entry of other characteristics of the dataset.
You can also specify filenames, phenotype or experimental data, and MIAME information.

> rrdata <- ReadAffy()
> class(rrdata)
[1] "AffyBatch"
attr(,"package")
[1] "affy"
> dim(exprs(rrdata))
[1] 409600 12
> colnames(exprs(rrdata))
[1] "LN0A.CEL" "LN0B.CEL" "LN1A.CEL" "LN1B.CEL" "LN2A.CEL" "LN2B.CEL"
[7] "LN3A.CEL" "LN3B.CEL" "LN4A.CEL" "LN4B.CEL" "LN5A.CEL" "LN5B.CEL"
> length(probeNames(rrdata))
[1] 201800
> length(unique(probeNames(rrdata)))
[1] 12625
> length((featureNames(rrdata)))
[1] 12625
> featureNames(rrdata)[1:5]
[1] "100_g_at" "1000_at" "1001_at" "1002_f_at" "1003_s_at"

The ExpressionSet class
An object of class ExpressionSet has several slots, the most important of which is an assayData object containing one or more matrices.
The best way to extract parts of this is to use the appropriate methods.
exprs() extracts an expression matrix.
featureNames() extracts the names of the probe sets.

Expression Indices
The 409,600 rows of the expression matrix in the AffyBatch object each correspond to a probe (25-mer).
Ordinarily, to use this we need to combine the probe-level data for each probe set into a single expression number.
Conceptually, this has several steps.

Steps in Expression Index Construction
Background correction is the process of adjusting the signals so that the zero point is similar on all parts of all arrays.
We like to manage this so that zero signal after background correction corresponds approximately to zero amount of the mRNA species that is the target of the probe set.

Data transformation is the process of changing the scale of the data so that it is more comparable from high to low.
Common transformations are the logarithm and the generalized logarithm.
Normalization is the process of adjusting for systematic differences from one array to another.
Normalization may be done before or after transformation, and before or after probe set summarization.
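To make the two transformations concrete, here is a small base R sketch comparing the ordinary logarithm with one common parameterization of the generalized logarithm, glog(y) = log(y + sqrt(y^2 + lambda)). The parameterization, the value of lambda, and the example intensities are all assumptions for illustration; in practice lambda would be estimated from the data.

```r
# One common form of the generalized logarithm (assumed parameterization).
glog <- function(y, lambda) log(y + sqrt(y^2 + lambda))

# Hypothetical background-corrected intensities, including a negative value
# and a zero, where the ordinary log is undefined or infinite.
y <- c(-50, 0, 50, 500, 5000)
lambda <- 100^2  # hypothetical value, chosen only for this illustration

comparison <- cbind(
  y    = y,
  log  = suppressWarnings(log(y)),  # NaN for y < 0, -Inf for y == 0
  glog = glog(y, lambda)
)
print(round(comparison, 3))
```

For large intensities the glog differs from log(y) by almost exactly log(2), so the two behave alike at the high end, while the glog stays finite and monotone near and below zero.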

One may use only the perfect match (PM) probes, or may subtract or otherwise use the mismatch (MM) probes.
There are many ways to summarize 20 PM probes and 20 MM probes on 10 arrays (a total of 400 numbers) into 10 expression index numbers.

Probe intensities for LASP1 in a radiation dose-response experiment
Probe (columns: dose)    0      1     10    100   Mean
200618_at1             360    216    158    198  233.0
200618_at2             313    402    106    103  231.0
200618_at3             130    182     79     91  120.5
200618_at4             351    370    195    136  263.0
200618_at5             164    130     98    107  124.8
200618_at6             223    219    164    196  200.5
200618_at7             437    529    195    158  329.8
200618_at8             509    554    274    128  366.3
200618_at9             522    720    285    198  431.3
200618_at10            668    715    247    260  472.5
200618_at11            306    286    144    159  223.8
Expression Index     362.1  393.0  176.8  157.6

Log probe intensities for LASP1 in a radiation dose-response experiment
Probe (columns: dose)    0      1     10    100   Mean
200618_at1            2.56   2.33   2.20   2.30   2.35
200618_at2            2.50   2.60   2.03   2.01   2.28
200618_at3            2.11   2.26   1.90   1.96   2.06
200618_at4            2.55   2.57   2.29   2.13   2.38
200618_at5            2.21   2.11   1.99   2.03   2.09
200618_at6            2.35   2.34   2.21   2.29   2.30
200618_at7            2.64   2.72   2.29   2.20   2.46
200618_at8            2.71   2.74   2.44   2.11   2.50
200618_at9            2.72   2.86   2.45   2.30   2.58
200618_at10           2.82   2.85   2.39   2.41   2.62
200618_at11           2.49   2.46   2.16   2.20   2.33
Expression Index      2.51   2.53   2.21   2.18
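The "Expression Index" rows in the two LASP1 tables are consistent with a simple average over the 11 probes, on the raw scale and on the log10 scale respectively. The base R sketch below recomputes them from the tabulated intensities; the variable names are illustrative.

```r
# Probe intensities for probe set 200618_at (rows = probes, columns = doses),
# copied from the table above.
x <- matrix(c(
  360, 216, 158, 198,
  313, 402, 106, 103,
  130, 182,  79,  91,
  351, 370, 195, 136,
  164, 130,  98, 107,
  223, 219, 164, 196,
  437, 529, 195, 158,
  509, 554, 274, 128,
  522, 720, 285, 198,
  668, 715, 247, 260,
  306, 286, 144, 159
), ncol = 4, byrow = TRUE,
dimnames = list(paste0("200618_at", 1:11), c("0", "1", "10", "100")))

# Average over probes within each array: raw scale and log10 scale.
raw_index <- colMeans(x)
log_index <- colMeans(log10(x))

print(round(raw_index, 1))
print(round(log_index, 2))

# Back-transforming the log-scale index gives a geometric mean, which is
# less dominated by the brightest probes than the raw-scale average.
print(round(10^log_index, 1))
```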

The RMA Method
Background correction that does not make 0 signal correspond to 0 amount.
Quantile normalization.
Log2 transform.
Median polish summary of PM probes.
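The last three RMA steps can be sketched in base R on a toy example. This is not the rma() implementation: background correction is skipped, the quantile normalization is a simplified version that breaks ties by order of appearance, and it is applied to a single probe set (the LASP1 intensities from the earlier table) rather than to whole arrays as RMA does. The helper names are illustrative.

```r
# PM intensities for one probe set (rows = probes, columns = arrays).
x <- matrix(c(
  360, 216, 158, 198,  313, 402, 106, 103,  130, 182,  79,  91,
  351, 370, 195, 136,  164, 130,  98, 107,  223, 219, 164, 196,
  437, 529, 195, 158,  509, 554, 274, 128,  522, 720, 285, 198,
  668, 715, 247, 260,  306, 286, 144, 159
), ncol = 4, byrow = TRUE,
dimnames = list(paste0("200618_at", 1:11), c("0", "1", "10", "100")))

# Simplified quantile normalization: give every column the same set of
# values, namely the across-column means of the sorted columns.
quantile_normalize <- function(m) {
  ranks <- apply(m, 2, rank, ties.method = "first")
  reference <- rowMeans(apply(m, 2, sort))
  out <- apply(ranks, 2, function(r) reference[r])
  dimnames(out) <- dimnames(m)
  out
}

# Log2 transform, then median polish; the index for each array is the
# overall effect plus that array's column effect.
rma_like_index <- function(m) {
  logged <- log2(quantile_normalize(m))
  fit <- medpolish(logged, trace.iter = FALSE)
  fit$overall + fit$col
}

qn <- quantile_normalize(x)
index <- rma_like_index(x)
print(round(index, 2))
```

Median polish fits an additive probe-plus-array model using medians, so a single aberrant probe has less influence on the array effects than it would in a plain average.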

> eset <- rma(rrdata)
trying URL 'http://bioconductor.org/packages/2.1/
Content type 'application/zip' length 1352776 bytes (1.3 Mb)
opened URL
downloaded 1.3 Mb
package 'hgu95av2cdf' successfully unpacked and MD5 sums checked
The downloaded packages are in
C:\Documents and Settings\dmrocke\Local Settings
updating HTML package descriptions
Background correcting
Normalizing
Calculating Expression
> class(eset)
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
> dim(exprs(eset))
[1] 12625 12
> featureNames(eset)[1:5]
[1] "100_g_at" "1000_at" "1001_at" "1002_f_at" "1003_s_at"

> exprs(eset)[1:5,]
          LN0A.CEL LN0B.CEL LN1A.CEL LN1B.CEL LN2A.CEL LN2B.CEL LN3A.CEL
100_g_at  9.195937 9.388350 9.443115 9.012228 9.311773 9.386037 9.386089
1000_at   8.229724 7.790238 7.733320 7.864438 7.620704 7.930373 7.502759
1001_at   5.066185 5.057729 4.940588 4.839563 4.808808 5.195664 4.952883
1002_f_at 5.409422 5.472210 5.419907 5.343012 5.266068 5.442173 5.190440
1003_s_at 7.262739 7.323087 7.355976 7.221642 7.023408 7.165052 7.011527
          LN3B.CEL LN4A.CEL LN4B.CEL LN5A.CEL LN5B.CEL
100_g_at  9.394606 9.602404 9.711533 9.826789 9.645565
1000_at   7.463158 7.644588 7.497006 7.618449 7.710110
1001_at   4.871329 4.875907 4.853802 4.752610 4.834317
1002_f_at 5.200380 5.436028 5.310046 5.300938 5.427841
1003_s_at 7.185894 7.235551 7.292139 7.218818 7.253799
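The per-gene one-way ANOVA planned for this data set can be sketched in base R using only the five probe sets printed above. The stage factor assumes that the file names encode stage (0 to 5) and replicate (A, B), which is an inference from the naming pattern rather than something stated in the source; the variable names are illustrative.

```r
arrays <- paste0("LN", rep(0:5, each = 2), c("A", "B"), ".CEL")
probe_sets <- c("100_g_at", "1000_at", "1001_at", "1002_f_at", "1003_s_at")

# RMA expression values copied from the output above (one row per probe set).
x <- matrix(c(
  9.195937, 9.388350, 9.443115, 9.012228, 9.311773, 9.386037,
  9.386089, 9.394606, 9.602404, 9.711533, 9.826789, 9.645565,
  8.229724, 7.790238, 7.733320, 7.864438, 7.620704, 7.930373,
  7.502759, 7.463158, 7.644588, 7.497006, 7.618449, 7.710110,
  5.066185, 5.057729, 4.940588, 4.839563, 4.808808, 5.195664,
  4.952883, 4.871329, 4.875907, 4.853802, 4.752610, 4.834317,
  5.409422, 5.472210, 5.419907, 5.343012, 5.266068, 5.442173,
  5.190440, 5.200380, 5.436028, 5.310046, 5.300938, 5.427841,
  7.262739, 7.323087, 7.355976, 7.221642, 7.023408, 7.165052,
  7.011527, 7.185894, 7.235551, 7.292139, 7.218818, 7.253799
), nrow = 5, byrow = TRUE, dimnames = list(probe_sets, arrays))

# Assumed design: six stages with two replicate arrays each.
stage <- factor(rep(0:5, each = 2))

# One-way ANOVA per probe set; keep the p-value for the stage effect.
pvals <- apply(x, 1, function(y) anova(lm(y ~ stage))[["Pr(>F)"]][1])
print(signif(pvals, 3))
```

With 12,625 probe sets the same apply() call would run over the full exprs(eset) matrix, and the resulting p-values would need a multiple-testing adjustment before being interpreted.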

> summary(exprs(eset))
    LN0A.CEL         LN0B.CEL         LN1A.CEL         LN1B.CEL
 Min.   : 2.713   Min.   : 2.585   Min.   : 2.611   Min.   : 2.636
 1st Qu.: 4.478   1st Qu.: 4.449   1st Qu.: 4.458   1st Qu.: 4.477
 Median : 6.080   Median : 6.072   Median : 6.070   Median : 6.078
 Mean   : 6.120   Mean   : 6.124   Mean   : 6.120   Mean   : 6.128
 3rd Qu.: 7.443   3rd Qu.: 7.473   3rd Qu.: 7.467   3rd Qu.: 7.467
 Max.   :12.042   Max.   :12.146   Max.   :12.122   Max.   :11.889
    LN2A.CEL         LN2B.CEL         LN3A.CEL         LN3B.CEL
 Min.   : 2.598   Min.   : 2.717   Min.   : 2.633   Min.   : 2.622
 1st Qu.: 4.444   1st Qu.: 4.469   1st Qu.: 4.425   1st Qu.: 4.428
 Median : 6.008   Median : 6.058   Median : 6.017   Median : 6.028
 Mean   : 6.109   Mean   : 6.125   Mean   : 6.116   Mean   : 6.117
 3rd Qu.: 7.426   3rd Qu.: 7.422   3rd Qu.: 7.444   3rd Qu.: 7.459
 Max.   :13.135   Max.   :13.110   Max.   :13.106   Max.   :13.138
    LN4A.CEL         LN4B.CEL         LN5A.CEL         LN5B.CEL
 Min.   : 2.742   Min.   : 2.634   Min.   : 2.615   Min.   : 2.590
 1st Qu.: 4.468   1st Qu.: 4.433   1st Qu.: 4.448   1st Qu.: 4.487
 Median : 6.074   Median : 6.050   Median : 6.053   Median : 6.068
 Mean   : 6.122   Mean   : 6.120   Mean   : 6.121   Mean   : 6.123
 3rd Qu.: 7.460   3rd Qu.: 7.478   3rd Qu.: 7.477   3rd Qu.: 7.457
 Max.   :12.033   Max.   :12.162   Max.   :11.925   Max.   :11.952

Probe Sets not Genes
It is unavoidable to refer to a probe set as measuring a gene, but it can nevertheless be deceptive.
The annotation of a probe set may be based on homology with a gene of possibly known function in a different organism.
Only relatively few probe sets correspond to genes with known function and known structure in the organism being studied.