Exploration and Analysis of DNA Microarray Data

1 Exploration and Analysis of DNA Microarray Data Dhammika Amaratunga Senior Research Fellow in Nonclinical Biostatistics Johnson & Johnson Pharmaceutical Research & Development Javier Cabrera Associate Professor in Statistics Rutgers University A short course offered at the Joint Statistical Meetings, San Francisco, August

2 Preface We are in the midst of a genomics revolution! DNA sequence information, together with the technology to use this information, is bringing profound changes to the biological sciences and related disciplines (and everyday life!).

3 Agenda
8 AM - START
1. A brief introduction
2. Microarray experiments: procedure
3. Preprocessing microarray data
4. Finding differentially expressed genes
10 AM - SHORT BREAK
5. Clustering genes and/or samples
6. Class prediction
7. A few concluding remarks + Q&A
12 NOON - END

4 The central dogma of molecular biology An organism's genetic information is encoded in DNA as sequences of bases: A, T, G, C. A gene is a segment of DNA which codes for a specific protein. A gene is expressed via the process: DNA → (transcription) → mRNA → (translation) → protein.

5 Gene expression An organism's genome is the complete set of genes in each of its cells. Every cell of a given organism has a copy of the exact same genome, but not all of its cells express the same genes: different genes are expressed under different conditions. Measure the levels of the various mRNAs in a cell in a specific state → gene expression profile of the cell in that state → cell state.

6 DNA microarrays One of the most promising tools for obtaining gene expression data is the DNA microarray. A DNA microarray is a tiny glass slide on which genes (purified single-stranded cDNA sequences in solution) have been robotically printed in an (approximately) rectangular array; each spot on the array corresponds to a single gene.

7 Experimental procedure Prepare DNA microarray. Prepare labeled test sample: cellular contents → (isolate & purify) → mRNA → (reverse transcription) → cDNA → (add fluorescent dye) → sample.

8 Experimental procedure (continued) Flood the microarray with the labeled sample. Whenever there is cDNA in the array complementary to mRNA in the sample, the two will hybridize. Hybridization is allowed to take place, then the array is washed and dried. The array is scanned with a laser microscope.

9 Scanned image [Figure]

10 Interpreting the scanned image High-intensity spot → the DNA at that spot corresponds to some RNA in the sample. Low-intensity spot → no RNA in the sample corresponds to the DNA at that spot. Intensity ~ RNA abundance. For any gene, one can compare intensities across different samples (but shouldn't compare intensities for different genes within the same sample).

11 Comparing two scanned images [Figure: panels Control and Treatment]

12 Paradigm [expressed] gene → mRNA; amount of mRNA in cell ~ transcription level of gene. Change in state of cell → change in gene expression pattern within cell → change in mRNA abundance in cell → change in spot intensity pattern.

13 Uses of DNA microarrays DNA microarrays can be used to compare gene expression patterns, multiple genes at a time: Which genes are expressed in which cells, and under what conditions? Which genes are expressed differently in diseased cells compared to normal cells? Which genes are expressed differently when a patient is administered a drug?

14 Technology differences Pin spotting or inkjet printing or photolithography; 1-sample or multi-sample; almost-complete sequences (cDNA) or subsequences (oligonucleotides).

15 Processing steps Raw Image → Spotted Image → Statistical Analysis [Figure; plot labels: Probability Interv, Genes]

16 Raw array [Figure with regions marked (A), (B), (C); annotations: speckles, Light Background, Shape]

17 Processing the raw image Gridding: where are the spots? Segmentation: which pixels correspond to the spot (signal) and which to background? Measurement: what is the intensity at each spot? Spot intensity = average pixel intensity within the spot. Background intensity = average pixel intensity immediately around the spot.
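The measurement step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the spot is taken to be a circle whose center and radius are already known from gridding and segmentation, the background is a ring of width `bg_width` around it, and the names `measure_spot`, `spot_radius`, and `bg_width` and the toy image are hypothetical rather than taken from the course.

```python
from statistics import mean


def measure_spot(image, center_row, center_col, spot_radius, bg_width):
    """Return (spot_intensity, background_intensity) for one spot.

    Assumption: the spot is a circle and the background is the ring of
    pixels immediately around it (both are illustrative choices).
    """
    spot_pixels, bg_pixels = [], []
    outer = spot_radius + bg_width
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            dist2 = (r - center_row) ** 2 + (c - center_col) ** 2
            if dist2 <= spot_radius ** 2:
                spot_pixels.append(value)   # pixel inside the spot
            elif dist2 <= outer ** 2:
                bg_pixels.append(value)     # pixel in the surrounding ring
    return mean(spot_pixels), mean(bg_pixels)


# Toy 7x7 image: dim background (10) with a bright spot (200) at the center.
image = [[10] * 7 for _ in range(7)]
for r, c in [(3, 3), (2, 3), (4, 3), (3, 2), (3, 4)]:
    image[r][c] = 200

spot, background = measure_spot(image, 3, 3, spot_radius=1, bg_width=2)
```

Real image-analysis software would also have to handle irregular spot shapes and contaminated background pixels, which this sketch ignores.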

18 Processing steps
1. Convert scanned image to spotted image
2. Check quality of spotted image
3. Adjust for background
4. Transform data
5. Normalize data
6. Check quality of normalized data
7. Analyze data
8. Interpret and report findings

19 Image plot of a good array [Figure: panels Signal and Background]

20 Image plot of a defective array [Figure: panels Signal and Background]

21 Quality assessments Flag spots of dubious quality [SD, CV, spot regularity, spot circularity, etc.]. Flag arrays (or areas of arrays) of dubious quality [due to scratches, washing problems, etc.]. When there are replicates, flag any observation that can be considered an outlier compared to its siblings [this should be done after normalization].

22 Signal The signal at a particular spot is taken to be Signal = Spot, or Signal = Spot - Background, or Signal = Spot - SmoothedBackground. [Figure: LOG SIGNAL and LOG BACKGROUND]

23 Thresholding Sometimes the signal may be thresholded if low intensity values are considered unreliable: Signal = max(Signal, Threshold), or Signal is kept when S >= T and set to Missing when S < T (or whenever S is considered unreliable), where S is the signal and T the threshold.
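As a rough sketch, background adjustment and the two thresholding rules might be coded as follows. The function name `spot_signal` and its `floor` switch are hypothetical, and `None` stands in for a missing value.

```python
def spot_signal(spot, background, threshold=None, floor=True):
    """Background-adjusted signal with optional thresholding.

    floor=True  -> Signal = max(Signal, threshold)
    floor=False -> Signal is set to missing (None) when below threshold
    """
    signal = spot - background
    if threshold is None or signal >= threshold:
        return signal
    return threshold if floor else None


strong = spot_signal(500, 100, threshold=50)               # well above threshold
floored = spot_signal(120, 100, threshold=50)              # raised to threshold
missing = spot_signal(120, 100, threshold=50, floor=False) # treated as missing
```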

24 Transformation The signal data are often highly skewed and are often transformed for analysis. Logs: Signal = log(Signal). Started logs: Signal = log(Signal + λ). [Figure: panels INTENSITY and LOG(INTENSITY)]
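A small sketch of the two transformations. Base-2 logarithms are an assumption here (the slide does not specify a base), and the name `started_log2` is hypothetical. With λ = 0 the function reduces to the plain log.

```python
import math


def started_log2(signal, lam=0.0):
    """Started log: log2(signal + lam); lam = 0 gives the ordinary log.

    Returns None when signal + lam is not positive, since the log is
    undefined there.
    """
    shifted = signal + lam
    if shifted <= 0:
        return None
    return math.log2(shifted)


plain = started_log2(8)             # ordinary log of a positive signal
started = started_log2(0, lam=16)   # zero signal is still usable with lam > 0
undefined = started_log2(0)         # plain log of zero is undefined
```

The started log is useful when background adjustment produces zero or slightly negative signals, for which the plain log cannot be computed.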

25 Removal of extraneous systematic effects [Figure: panels "Day effect" and "Day effect removed via linear model + normalized"]

26 Normalization The signals, even on identical microarrays, often tend to be on different scales (due to the quality and quantity of RNA, labeling efficiency, laser setting, etc.). These scales need to be normalized prior to further analysis, so that the arrays are more directly comparable.

27 Two arrays

28 Two arrays [Figure: LOG SIGNAL INTENSITY by ARRAY; LOG(C2) vs LOG(C1)] ρ (Concordance) = 0.90, ρ (Spearman) = 0.97

29 Normalization To normalize arrays C(1), ..., C(n): (1) Equate the 75th percentiles and calculate the median pseudoarray M. (2) Fit a continuous monotone increasing function to the percentiles of C(i) vs the percentiles of M (take special care of the tails). (3) Back-predict to obtain the normalized values of C(i).
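The procedure can be sketched roughly as follows. Two simplifications are assumed and should be read as such: the data are taken to be on the log scale, so the 75th percentiles are equated by shifting; and instead of fitting a smooth monotone curve, each value is mapped directly to the matching percentile of M (exact quantile matching, which is monotone but gives the tails no special care). The names `percentile` and `normalize` are hypothetical.

```python
from statistics import median


def percentile(values, p):
    """Percentile (p in [0, 100]) by linear interpolation."""
    s = sorted(values)
    k = (len(s) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)


def normalize(arrays):
    """Normalize log-scale arrays (one list of gene values per array)."""
    # Step 1: shift each array so that all 75th percentiles coincide.
    p75 = [percentile(a, 75) for a in arrays]
    target = median(p75)
    shifted = [[x - q + target for x in a] for a, q in zip(arrays, p75)]
    # Median pseudoarray M: the gene-by-gene median across arrays.
    pseudo_sorted = sorted(median(gene) for gene in zip(*shifted))
    # Steps 2-3 (simplified): send the k-th smallest value of each array
    # to the k-th smallest value of M.
    normalized = []
    for a in shifted:
        order = sorted(range(len(a)), key=a.__getitem__)
        out = [0.0] * len(a)
        for rank, idx in enumerate(order):
            out[idx] = pseudo_sorted[rank]
        normalized.append(out)
    return normalized


arrays = [[1, 2, 3, 4], [2, 4, 6, 8], [1.5, 3, 4.5, 6]]
normalized = normalize(arrays)
```

In this toy example the three arrays rank the genes identically, so they all end up with the same normalized values.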

30 Two arrays (after normalization)

31 Two arrays (after normalization) [Figure: LOG SIGNAL INTENSITY by ARRAY; LOG(C2) vs LOG(C1)] ρ (Concordance) = 0.98, ρ (Spearman) = 0.97
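The sketch below illustrates why normalization can raise the concordance while leaving the Spearman correlation unchanged. It assumes that "concordance" means Lin's concordance correlation coefficient, which penalizes differences in location and scale, and that there are no tied values; the function names and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean


def concordance(x, y):
    """Lin's concordance correlation coefficient (assumed definition)."""
    mx, my = mean(x), mean(y)
    n = len(x)
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (var_x + var_y + (mx - my) ** 2)


def ranks(values):
    """Ranks 0..n-1, assuming no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    return cov / sqrt(sum((a - mx) ** 2 for a in rx)
                      * sum((b - my) ** 2 for b in ry))


c1 = [1.0, 2.0, 3.0, 4.0, 5.0]
c2 = [3.0, 4.0, 5.0, 6.0, 7.0]            # same pattern, shifted scale
before = concordance(c1, c2)               # lowered by the shift
after = concordance(c1, [v - 2.0 for v in c2])  # shift removed
rank_corr = spearman(c1, c2)               # unaffected by the shift
```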

32 Normalization issues Normalization is based on a function fitted to a gene set composed mostly of constantly expressed genes; how should this set be selected? [all genes / housekeeping genes / spiking controls / rank-invariant genes] Stagewise normalization when there are several levels of effects. Spatial effects.

33 Summarization Summarize across technical replicates (means / medians / robust location estimates). In Affymetrix GeneChips, summarize across the perfect matches (PM) and mismatches (MM) that comprise the observations for a gene (ave(PM - MM), ave(PM), ave(PM - mode(MM))).
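A minimal sketch of two of these summaries, ave(PM - MM) and ave(PM), plus a median across technical replicates. The function names and probe values are hypothetical, and production GeneChip software uses more elaborate summaries than these.

```python
from statistics import mean, median


def summarize_replicates(values, robust=True):
    """Summarize technical replicates by the median (robust) or the mean."""
    return median(values) if robust else mean(values)


def summarize_probe_set(pm, mm=None):
    """ave(PM - MM) when mismatches are supplied, otherwise ave(PM)."""
    if mm is None:
        return mean(pm)
    return mean([p - m for p, m in zip(pm, mm)])


pm = [120, 150, 130, 160]   # perfect-match intensities for one gene
mm = [20, 30, 40, 50]       # corresponding mismatch intensities

ave_pm_mm = summarize_probe_set(pm, mm)
ave_pm = summarize_probe_set(pm)
replicate_summary = summarize_replicates([1, 2, 100])  # median resists the outlier
```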

34 A comparative experiment Data: Gene expression profiles for genes in 6 mice (= 3 Control + 3 Test). Question: Which genes are differentially expressed in C vs T? Alternatively, one could ask whether the differential gene expression profiles discriminate between C and T.

35 Preprocessed data [Table: one row per gene]

36 Simple approaches Nonstatistical: Seek genes that exhibit a specified fold increase in mean intensity (e.g., 2-fold). Statistical: Seek genes that exhibit a statistically significant difference across the 2 groups (via, e.g., a t test, Welch's test, Wilcoxon test, robust t test, or permutation test, or perhaps a model-based test, depending on the situation).
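Both approaches can be sketched for a single gene. The example below computes the fold change of the group means and an exact two-sided permutation test on the difference in means; the permutation test is chosen only because it needs no distribution tables, and the function names and intensities are hypothetical.

```python
from itertools import combinations
from statistics import mean


def fold_change(control, test):
    """Ratio of mean intensities, Test over Control (raw scale)."""
    return mean(test) / mean(control)


def permutation_p_value(control, test):
    """Exact two-sided permutation p-value for the difference in means."""
    pooled = control + test
    observed = abs(mean(test) - mean(control))
    count = total = 0
    # Enumerate every way of relabeling len(control) samples as Control.
    for idx in combinations(range(len(pooled)), len(control)):
        c = [pooled[i] for i in idx]
        t = [pooled[i] for i in range(len(pooled)) if i not in idx]
        total += 1
        if abs(mean(t) - mean(c)) >= observed - 1e-12:
            count += 1
    return count / total


control = [100, 110, 90]
test = [210, 190, 200]
fold = fold_change(control, test)
p_value = permutation_p_value(control, test)
```

With 3 samples per group there are only 20 relabelings, so even this perfectly separated gene cannot reach a two-sided permutation p-value below 0.1.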

37 Analysis results Top 10 genes (sorted by p-value) [Table columns: Gene, Fold, p-value, p(bonf), q-value]

38 Issue 1: Effect of small sample size Often the sample size per group is small. Variances (and therefore inferences) could be unreliable. Standard t statistic: $t = (\bar{X}_T - \bar{X}_C) / s_P$, where $s_P$ is the pooled standard error; or SAM statistic: $t_{SAM} = (\bar{X}_T - \bar{X}_C) / (s_P + s_0)$.
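A sketch of the two statistics for one gene, assuming the pooled standard error is the usual equal-variance two-sample one. The constant `s0` is simply passed in; how it is chosen is not covered here, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_std_err(control, test):
    """Pooled standard error of the difference in group means."""
    n_c, n_t = len(control), len(test)
    pooled_var = ((n_c - 1) * variance(control)
                  + (n_t - 1) * variance(test)) / (n_c + n_t - 2)
    return sqrt(pooled_var * (1 / n_c + 1 / n_t))


def t_statistic(control, test, s0=0.0):
    """Ordinary pooled t when s0 = 0; SAM-style statistic when s0 > 0."""
    return (mean(test) - mean(control)) / (pooled_std_err(control, test) + s0)


control = [1.0, 2.0, 3.0]
test = [4.0, 5.0, 6.0]
t_plain = t_statistic(control, test)
t_sam = t_statistic(control, test, s0=0.5)
```

Adding a positive constant to the denominator damps statistics that are large only because a gene's estimated variance happens to be very small.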

39 Issue 2: Multiplicity Rank the genes, or select a subset of genes, according to their (individual) significance for differential gene expression (test statistics or p-values). Associate a number (FDR or q-value) with each gene (or with the selected subset) that tells us how confident we should be that including it in our list of potentially differentially expressing genes does not substantially increase the rate of false findings.

40 The positive False Discovery Rate pFDR = AVE(#FalsePositives / #Positives). To calculate, either: (a) the decision rule says reject if T > c, giving h rejections; permute m times, giving on average h_0 rejections; then pFDR = h_0 / h (refine); or (b) use a recursive formula. The q-value is the minimum pFDR incurred when calling a test significant.
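The permutation calculation can be sketched as follows: count the genes rejected in the actual data (h), count the average number rejected when the group labels are permuted (h0), and take the ratio h0/h. This is an unrefined, conservative sketch under stated assumptions: the test statistic is the absolute difference in group means, all relabelings are enumerated rather than sampled m times, and the function names and toy data are hypothetical.

```python
from itertools import combinations
from statistics import mean


def stat(values, test_idx):
    """Absolute difference in means between Test and Control samples."""
    test = [values[i] for i in test_idx]
    ctrl = [v for i, v in enumerate(values) if i not in test_idx]
    return abs(mean(test) - mean(ctrl))


def estimate_fdr(genes, test_idx, cutoff):
    """Estimate the FDR of the rule 'reject if T > cutoff' as h0 / h."""
    n_samples = len(next(iter(genes.values())))
    # h: number of genes rejected with the actual labels.
    h = sum(stat(v, test_idx) > cutoff for v in genes.values())
    if h == 0:
        return None  # no positives, so the ratio is undefined
    # h0: average number of genes rejected over all relabelings.
    relabelings = list(combinations(range(n_samples), len(test_idx)))
    h0 = mean([sum(stat(v, idx) > cutoff for v in genes.values())
               for idx in relabelings])
    return min(1.0, h0 / h)


# Toy data: 3 genes, 6 samples (indices 0-2 Control, 3-5 Test).
genes = {
    "g1": [1, 1, 1, 5, 5, 5],
    "g2": [2, 2, 2, 2, 2, 2],
    "g3": [1, 2, 3, 1, 2, 3],
}
fdr = estimate_fdr(genes, test_idx=(3, 4, 5), cutoff=3)
```

Here one gene is rejected in the actual data, and on average 0.1 genes are rejected per relabeling, so the estimate is 0.1.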

41 Wrap up Reference: D. Amaratunga and J. Cabrera (2003), Exploration and Analysis of DNA Microarray and Protein Array Data, John Wiley. Webpage