3. Microarrays: experimental design, statistical analysis and gene clustering


1 WUEMED Drought Course. 3. Microarrays: experimental design, statistical analysis and gene clustering. John Bennett, International Rice Research Institute, Los Baños, Philippines

2 3.1: Experimental overview

3 Photolithography for oligonucleotide synthesis

4 Overview of the two microarray platforms and TaqMan Gene Expression Assay based real-time PCR
Columns: Applied Biosystems | Agilent | TaqMan Gene Expression Assays
Platforms: Human Genome Survey Microarray | Human Whole Genome Arrays | TaqMan Expression
Technology: Hybridization | Comparative hybridization | 5' nuclease chemistry & real-time PCR
Probe (bases): 60mer | 60mer | TaqMan primer & probes
Substrate: Nylon | Glass slide | -
Deposition: Contact spotting | In-situ ink-jet printing | -
Detection: One-color chemilumin. | Two-color Cy3/Cy5 fluorescence | One-color FAM fluorescence
Software: 1700 Chemilumin. Microarray Analyzer Software v 1.1 | Feature Extraction A | SDS 2.1
Total probes: 33,096 probes | 44,000 probes | 1375 selected targets
Wang et al. (2006). Large scale real-time PCR validation on gene expression measurements from two commercial long-oligonucleotide microarrays. BMC Genomics 7: 59.

5 Management of the cDNA clone library with the Biomek 2000 liquid transfer system. PCR and library replication both use the Biomek.

6 Slide printing using the GeneTAC printer. Image labels: microtiter plates; glass slides. PCR products from >9000 genes; 3000 spots per slide.

7 UV cross-linking after printing

8 Reverse transcription and labeling with fluorescent dyes. Image labels: smears showing labeled reverse transcripts; unincorporated dyes.

9 Automated and manual hybridization chambers. Image labels: GeneTAC Hyb station; manual hyb chamber in water bath.

10 Slide scanning with ScanArray. Arrays shown: 22K chips from Agilent; 59K oligo array from BGI, Beijing; 10K rice panicle cDNA library printed at

11 Images captured by scanner

12 Quantification: gridding

13 3.2: Experimental design

14 Steps in microarray analysis
Step: sources of error
Biological experiment: plant growth and stress conditions
Sample collection: tissue variation
RNA extraction: RNA quality
RNA labeling: efficiency of labeling (esp. Cy3 vs Cy5)
Array printing: reproducibility, pin effects
Hybridization: non-uniformity, background, cross-hybridization
Scanning: variable scanner performance
Data acquisition: inaccurate gridding
Data analysis: inconsistent background subtraction
Data interpretation: faulty annotation

15 Types of replication
- Technical replication (on the same RNA sample, to gauge the effects of different arrays, hybridization conditions, etc.)
- Dye swap (to gauge the effect of using different dyes)
- Biological replication (different plants from the same treatment in the same experiment)
- Experimental replication (same experimental design but conducted at different times)
log-transformed gene expression signal = log(y) = µ + A + D + V + G + (AG) + (VG) + ε   (1)
where:
µ is the average expression signal
A is the array effect
D is the dye effect
V is the sample variety effect
G is the gene effect
(AG) is the combination of array and gene
(VG) is the combination of variety and gene
ε is independent noise.
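As a worked illustration of equation (1), the sketch below shows why a dye swap lets the dye effect D be separated from the variety terms V and (VG). All effect sizes and the sample names "tolerant" and "susceptible" are hypothetical values chosen for the arithmetic, and the noise term ε is set to zero; this is an assumption-laden sketch, not an analysis method taken from the slides.

```python
# Hypothetical effect sizes on the log scale (illustrative only, not from the slides).
MU = 8.0                                   # average expression signal
ARRAY = {1: 0.2, 2: -0.2}                  # A: array effect for slides 1 and 2
DYE = {"Cy3": -0.3, "Cy5": 0.3}            # D: dye effect
VARIETY = {"tolerant": 0.1, "susceptible": -0.1}   # V: variety effect
GENE = 1.5                                 # G: effect of the single gene considered
AG = {1: 0.05, 2: -0.05}                   # (AG): array-by-gene (spot) effect
VG = {"tolerant": 0.4, "susceptible": -0.4}        # (VG): variety-by-gene effect


def log_signal(array, dye, variety):
    """Equation (1) with the noise term set to zero."""
    return (MU + ARRAY[array] + DYE[dye] + VARIETY[variety]
            + GENE + AG[array] + VG[variety])


def log_ratio(array, cy5_sample, cy3_sample):
    """Within-slide log ratio (Cy5 minus Cy3); mu, A, G and (AG) cancel."""
    return log_signal(array, "Cy5", cy5_sample) - log_signal(array, "Cy3", cy3_sample)


m1 = log_ratio(1, "tolerant", "susceptible")   # forward labeling
m2 = log_ratio(2, "susceptible", "tolerant")   # dye swap

variety_difference = (m1 - m2) / 2   # dye term cancels, leaving V + (VG) difference
dye_bias = (m1 + m2) / 2             # variety terms cancel, leaving the dye difference

print(round(variety_difference, 6), round(dye_bias, 6))
```

Within a slide the array and spot terms cancel in the ratio, and across the swapped pair the dye term cancels in the half-difference, which is the motivation for including dye swaps in the design.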

16 Sources of error
Random factors contributing to technical variance include:
- variation among replicate spots within a slide hybridization (corr > 95%)
- variation among replicate spots between slides (corr ~60-80%)
- variation introduced by scratches, dust or local hybridization effects
- variation introduced by subtraction of background from spot signal intensities
- variation introduced by tissue sampling
- variation introduced by RNA extraction
Systematic sources of variation include:
- different dyes (corr <60-80%); include dye swaps
- multiple print tips (print group effects); local data normalization
In contrast to the early days of microarray studies, most journals will no longer accept manuscripts without adequate sampling.

17 Reference and balanced designs. Replication requires more resources, so an appropriate experimental design is needed to increase the efficiency of resource use and optimize statistical power. Reference and balanced designs are the two basic types. In reference designs, all experimental samples are labelled with one dye and each is co-hybridized with a common reference sample that is labelled with a second dye. In balanced designs such as loops, experimental samples are labelled with both dyes and hybridized to each other. For the same number of slides, twice the number of experimental samples can be included in a balanced design compared to a reference design, leading to improved precision and increased statistical power. Furthermore, error due to technical variability is highest for reference designs.
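The slide-count comparison between the two designs can be sketched in a few lines. The sample names, the label "REF" and the helper functions below are hypothetical; the loop layout shown is one simple balanced design, assumed here for illustration.

```python
def reference_design(samples, reference="REF"):
    """One slide per sample: (Cy5 sample, Cy3 reference)."""
    return [(sample, reference) for sample in samples]


def loop_design(samples):
    """One slide per sample: each sample is paired with the next one in a loop."""
    n = len(samples)
    return [(samples[i], samples[(i + 1) % n]) for i in range(n)]


def experimental_measurements(slides, reference="REF"):
    """Count channels occupied by experimental (non-reference) samples."""
    return sum(1 for slide in slides for sample in slide if sample != reference)


samples = ["A", "B", "C", "D"]          # hypothetical treatments
ref_slides = reference_design(samples)
loop_slides = loop_design(samples)

print(len(ref_slides), experimental_measurements(ref_slides))
print(len(loop_slides), experimental_measurements(loop_slides))
```

With four slides in each case, the reference design spends half of its channels on the reference sample, whereas the loop measures every experimental sample twice, once with each dye.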

18 Simple two-treatment design. Two treatments × two replicates × two dyes × dye swap, two scan λs = 4 slides

19 Design with and without reference samples. Churchill GA. Fundamentals of experimental design for cDNA microarrays. Nature Genetics 32:

20 3.3: Statistical analysis

21 TM4 from The Institute for Genomic Research (TIGR)
The TM4 suite of tools consists of four major applications:
1. Microarray Data Manager (MADAM)
2. TIGR_Spotfinder
3. Microarray Data Analysis System (MIDAS)
4. Multiexperiment Viewer (MeV)
Plus
5. A (MIAME*)-compliant MySQL database
Freely available at
Saeed et al. (2003). TM4: a free, open-source system for microarray data management and analysis. Biotechniques 34:
*Minimum Information About a Microarray Experiment

22 Data normalization and filtering via MIDAS
After the spot intensity values are measured in TIGR Spotfinder, they must be normalized to help compensate for variability between slides and fluorescent dyes, as well as other systematic sources of error, by appropriately adjusting the measured array intensities. Data filtering can reduce the dataset by removing poor or questionable data. TIGR's MIDAS, a Java application, provides an interface to design analysis protocols combining one or more normalization and filtering steps. MIDAS reads .tav files generated by TIGR Spotfinder or retrieved from the database via MADAM. Normalization modules include locally weighted linear regression (lowess) and total intensity normalization. These can be linked with filters, including low-intensity cutoff, intensity-dependent Z-score cutoffs, and replicate consistency trimming, creating a highly customizable method for preparing expression data for subsequent comparison and analysis. When the normalization and filtering steps are complete, MIDAS outputs the data in .tav format.
Global versus local normalization. Most normalization algorithms, including lowess, can be applied either globally (to the entire data set) or locally (to some physical subset of the data). For spotted arrays, local normalization is often applied to each group of array elements deposited by a single spotting pen (sometimes referred to as a 'pen group' or 'subgrid').
Quackenbush J. Microarray data normalization and transformation. Nature Genetics 32:
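A minimal sketch of two of the steps named above, total intensity normalization followed by a low-intensity cutoff, is given here. It is not MIDAS code: the function names, the cutoff value and the spot intensities are hypothetical, and the sketch assumes the green channel is rescaled so that both channels have the same summed intensity.

```python
import math


def total_intensity_normalize(red, green):
    """Rescale the green channel so both channels have equal total intensity."""
    factor = sum(red) / sum(green)
    return [g * factor for g in green], factor


def filtered_log2_ratios(red, green, cutoff):
    """log2(R/G) per spot; spots below the cutoff in either channel become None."""
    ratios = []
    for r, g in zip(red, green):
        if r < cutoff or g < cutoff:
            ratios.append(None)          # low-intensity spot removed by the filter
        else:
            ratios.append(math.log2(r / g))
    return ratios


# Hypothetical spot intensities: the red channel is uniformly twice as bright.
red = [1000.0, 800.0, 200.0, 2000.0]
green = [500.0, 400.0, 100.0, 1000.0]

green_norm, factor = total_intensity_normalize(red, green)
ratios = filtered_log2_ratios(red, green_norm, cutoff=250.0)
print(factor, ratios)
```

In this toy case the whole red/green difference is a global brightness difference, so every retained spot has a log ratio of zero after normalization. The same functions could be applied per pen group to mimic local normalization.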

23 TIGR Spotfinder for image analysis. TIGR Spotfinder was designed for the rapid, reproducible, and computer-aided analysis of microarray images and the quantification of gene expression. It reads paired 16-bit TIFF image files generated by most microarray scanners. Automatic and manual grid adjustments help to ensure that each rectangular grid cell is centered on a spot. Spot intensities are calculated as an integral of non-saturated pixels. Local background is subtracted from each intensity value. These calculated intensities, along with each spot's position on the array, spot area, background values, and quality control flags, are written to a TIGR ArrayViewer (.tav) file, a Microsoft Excel workbook, or the database. In noisy areas of the slide, the user may manually identify or discard spots. Quality-control views allow the user to assess systematic biases in the data.
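The intensity calculation described above can be sketched as follows. This is not Spotfinder's implementation: the function name is hypothetical, saturation is assumed to mean the 16-bit maximum of 65535, and local background is assumed to be the median of the surrounding pixels multiplied by the number of pixels integrated.

```python
import statistics

SATURATED = 65535  # assumed saturation level for a 16-bit TIFF pixel


def spot_intensity(spot_pixels, background_pixels):
    """Integrate non-saturated spot pixels and subtract local background."""
    usable = [p for p in spot_pixels if p < SATURATED]
    background_per_pixel = statistics.median(background_pixels)
    integral = sum(usable)
    # Background is removed over the same area that was integrated.
    return integral - background_per_pixel * len(usable)


# Hypothetical pixel values for one grid cell.
spot = [1000, 1200, 65535, 1100]
background = [100, 110, 90]

print(spot_intensity(spot, background))
```

One saturated pixel is excluded, so the integral covers three pixels (3300) and the background removed is three times the median background (300).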

24 Proposed models for statistical analysis of microarray expression data
- ANOVA of log-transformed gene expression signal (Kerr et al., 2000)
- mixture models for gene effect (Lee et al., 2000)
- multiplicative model (not logarithm-transformed) (Yang et al., 2001; Sasik et al., 2002)
- ratio-distribution model (Chen et al., 1997, 2002)
- binary model (Shmulevich and Zhang, 2002)
- rank-based models not sensitive to noise distributions (Ben-Dor et al., 2000)
- replicates using mixed models (Wernisch et al., 2003)
- quantitative noise analysis (Tu et al., 2002; Fathallah-Shaykh et al., 2002)
- design of reverse dye microarrays (Dobbin et al., 2003)
Pan (2002) compared different microarray statistical analysis methods:
- log-linear ANOVA mixed model (Pan et al., 2001; based on Tusher et al., 2001)
- two-sample t-test (Devore & Peck, 1997)
- regression (Thomas et al., 2001)
Pan W. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 18:
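Of the methods listed, the two-sample t-test is the simplest to illustrate. The sketch below computes the statistic for one gene from replicate log ratios in two conditions. It uses Welch's unequal-variance form, which is an assumption made here rather than the variant specified on the slide, and the replicate values are hypothetical.

```python
import math
import statistics


def welch_t(a, b):
    """Welch two-sample t statistic and approximate degrees of freedom."""
    va = statistics.variance(a) / len(a)   # squared standard error, group a
    vb = statistics.variance(b) / len(b)   # squared standard error, group b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical replicate log2 expression values for one gene.
control = [1.0, 2.0, 3.0]
stressed = [4.0, 5.0, 6.0]

t_stat, dof = welch_t(control, stressed)
print(round(t_stat, 3), round(dof, 1))
```

In a real experiment the statistic would be computed for every gene and the resulting p values adjusted for multiple testing, which this sketch does not attempt.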

25 Quantification: spot quality

26 Spot scanning for control of spot quality. For 16 positions across each spot, determine the intensity and calculate a p value from a t-test. Spots with low p values are acceptable. High p values could result from poor printing, damage, poor hybridization or poor gridding.

27 LOWESS normalized data in GPR format

28 Distribution plot view

29 Data loaded into TMeV (TIGR) for statistical analysis

30 Comparison of biological replicates

31 Lowess: locally weighted linear regression. Plot axes: log10(R/G) versus log10(R*G). Quackenbush J. Microarray data normalization and transformation. Nature Genetics 32:
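The idea behind lowess normalization can be sketched in plain Python: fit a locally weighted straight line to log10(R/G) as a function of log10(R*G), then subtract the fitted trend from each spot. This is a simplified version (tricube weights, a single pass, no robustness iterations) and not the MIDAS implementation; the function names and the `frac` parameter are assumptions.

```python
import math


def lowess_fit(x, y, frac=0.5):
    """Locally weighted linear regression evaluated at every x (tricube weights)."""
    n = len(x)
    k = max(2, math.ceil(frac * n))          # number of neighbours per local fit
    fitted = []
    for x0 in x:
        # k nearest neighbours of x0, with distances measured along x.
        nearest = sorted((abs(xi - x0), xi, yi) for xi, yi in zip(x, y))[:k]
        h = nearest[-1][0] or 1.0            # bandwidth; guard against zero
        sw = swu = swy = swuu = swuy = 0.0
        for d, xi, yi in nearest:
            w = (1.0 - (d / h) ** 3) ** 3    # tricube weight
            u = xi - x0                      # centre x so the intercept is the fit at x0
            sw += w
            swu += w * u
            swy += w * yi
            swuu += w * u * u
            swuy += w * u * yi
        denom = sw * swuu - swu * swu
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)          # degenerate case: weighted mean
        else:
            fitted.append((swuu * swy - swu * swuy) / denom)
    return fitted


def lowess_normalize(red, green, frac=0.5):
    """Return log10(R/G) with the intensity-dependent trend removed."""
    a = [math.log10(r * g) for r, g in zip(red, green)]
    m = [math.log10(r / g) for r, g in zip(red, green)]
    trend = lowess_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]


# Hypothetical spots with a constant twofold dye bias in the red channel.
green = [100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0]
red = [2.0 * g for g in green]
corrected = lowess_normalize(red, green)
print([round(c, 6) for c in corrected])
```

Applied to each pen group separately rather than to the whole slide, the same function would give a local rather than a global normalization.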

32 Replicated determination of ratio of two treatments. What was wrong here? Plot axes: log2(A/B)_2 versus log2(A/B)_1. Quackenbush J. Microarray data normalization and transformation. Nature Genetics 32:

33 Intensity-dependent Z scores for identifying differential expression. Plot axes: log10(R/G) versus log10(R*G), with points grouped as Z<1, 1<Z<2 and Z>2. Quackenbush J. Microarray data normalization and transformation. Nature Genetics 32:
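One way to compute an intensity-dependent Z score is to compare each spot's log ratio with the mean and standard deviation of its neighbours in a sliding window along the intensity axis. The sketch below follows that idea; the window size, the function name and the data are hypothetical, and the exact windowing used by MIDAS may differ.

```python
import statistics


def intensity_dependent_z(log_intensity, log_ratio, window=11):
    """Z score of each log ratio relative to spots of similar intensity."""
    n = len(log_ratio)
    order = sorted(range(n), key=lambda i: log_intensity[i])
    z = [0.0] * n
    for rank, i in enumerate(order):
        # Window of `window` spots centred on this spot where possible.
        lo = max(0, rank - window // 2)
        hi = min(n, lo + window)
        lo = max(0, hi - window)
        local = [log_ratio[order[j]] for j in range(lo, hi)]
        mean = statistics.mean(local)
        sd = statistics.stdev(local)
        z[i] = (log_ratio[i] - mean) / sd if sd > 0 else 0.0
    return z


# Hypothetical data: 15 spots with small scatter and one strongly changed gene.
log_a = [float(i) for i in range(15)]
log_m = [0.05 if i % 2 == 0 else -0.05 for i in range(15)]
log_m[7] = 1.0

z_scores = intensity_dependent_z(log_a, log_m)
flagged = [i for i, z in enumerate(z_scores) if abs(z) > 2]
print(flagged)
```

Because the window includes the spot being tested, a single outlier among w spots cannot score above (w - 1) / sqrt(w), so the window must contain at least seven spots before a |Z| > 2 cutoff can flag anything.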

34 TM4 utilities: SlideMap and ExpressConverter
SlideMap: SlideMap.pm is a Perl module used for conversion of spots to wells and wells to spots. It is useful when the array is custom-printed from PCR products presented to the arrayer in microtiter plates. SlideMap currently supports several commercial arrayers and 'generic' arrayers.
ExpressConverter: ExpressConverter is a file transformation tool that reads microarray data files in a variety of file formats and generates .mev or .tav files as output, so that the data can be uploaded to the database with MADAM and analyzed with MIDAS and MeV. The supported formats include GenePix, ImaGene, ScanArray, ArrayVision and Agilent files. Affymetrix data files cannot be converted with ExpressConverter, but can be loaded directly into MeV.

35 Data analysis via TIGR MeV. Normalized and filtered expression files are analyzed from .tav files using TIGR MeV, which generates informative and interrelated displays of expression and annotation data from single or multiple experiments. Analysis modules currently implemented in MeV include:
- hierarchical clustering (8)
- k-means clustering (18)
- self-organizing maps (15)
- principal components analysis (17)
- cluster affinity search technique (3)
- self-organizing trees (13)
- template matching
- between-groups tests (including t-tests)
- bootstrapping and jackknifing, which resample the dataset to generate consensus clusters.
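To make one of the listed modules concrete, here is a minimal k-means sketch on gene expression profiles. It is not MeV code: it uses squared Euclidean distance and takes the first k profiles as starting centroids so that the result is deterministic, both of which are simplifying assumptions, and the profiles are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(profiles, k, iterations=20):
    """Assign each profile to one of k clusters; returns (labels, centroids)."""
    centroids = [list(p) for p in profiles[:k]]   # deterministic start (assumption)
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every profile.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in profiles]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical log ratios for five genes at three time points.
profiles = [
    [0.0, 1.0, 2.0],    # induced
    [5.0, 4.0, 3.0],    # repressed from a high level
    [0.1, 1.1, 2.1],
    [5.2, 4.1, 3.0],
    [-0.1, 0.9, 1.9],
]

labels, centroids = kmeans(profiles, k=2)
print(labels)
```

Real analyses usually start from random centroids and repeat the run several times, since k-means can converge to different partitions depending on the start.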

36 3.4: Gene clustering

37 Comparison of clustering methods. At first a mainly visual analysis was used for clustering of genes into similar groups (e.g., DeRisi et al., 1997). Subsequently, simple sorting of expression ratios and some form of correlation distance were used to identify genes (Spellman et al., 1998; Eisen et al., 1998). Datta & Datta (2003) compared six different clustering methods:
(i) Hierarchical clustering with correlation (e.g., UPGMA) (Eisen et al., 1998)
(ii) Clustering by K-means
(iii) Diana (divisive clustering)
(iv) Fanny (fuzzy logic)
(v) Model-based clustering
(vi) Hierarchical clustering with partial least squares
They used microarray data of Chu et al. (1998) for yeast sporulation: 6118 genes, seven time points during the onset of sporulation (0-12 h).
Datta S, Datta S. Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 19:
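The correlation distance that underlies method (i) can be sketched as one minus the Pearson correlation between two expression profiles. The example below computes all pairwise distances and picks the closest pair, which would be the first merge in an agglomerative (bottom-up) clustering. Gene names and values are hypothetical, and linkage rules such as UPGMA are not implemented here.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient between two profiles of equal length."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (sd_a * sd_b)


def correlation_distance(a, b):
    """0 for perfectly correlated profiles, 2 for perfectly anti-correlated ones."""
    return 1.0 - pearson(a, b)


# Hypothetical time-course profiles (log ratios at four time points).
profiles = {
    "geneA": [0.1, 0.8, 1.5, 2.0],
    "geneB": [0.2, 1.6, 3.0, 4.0],   # same shape as geneA, larger amplitude
    "geneC": [2.0, 1.4, 0.7, 0.1],   # opposite trend
}

distances = {
    (g1, g2): correlation_distance(profiles[g1], profiles[g2])
    for g1, g2 in combinations(profiles, 2)
}
closest_pair = min(distances, key=distances.get)
print(closest_pair)
```

Correlation distance groups genes by the shape of their response rather than its magnitude, which is why geneA and geneB are treated as near-identical despite the twofold difference in amplitude.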

38 Use of microarray data to cluster cancer types. Chung et al. (2002). Molecular portraits and the family tree of cancer. Nature Genetics 32:

39 Cluster analysis requires a suitable co-variable. Examples:
- Time (e.g., duration of treatment)
- Genotypes
- Stress level (e.g., salt concentration, temperature, water status)
- Any other suitable independent variable, or dummy independent variable, or co-variable
- Certain suitable combinations (e.g., temperature and water status)

40 Using FTSW as the co-variable for cluster analysis. Plot axes: NTR (normalized transpiration rate) versus FTSW (fraction of transpirable soil water). Each point on the curve represents a stage in stress development and can be related to changes in other physiological and molecular factors (such as photosynthesis and transcript levels).

41 Data management and analysis for gene expression arrays Ermolaeva et al. (1998). Data management and analysis for gene expression arrays. Nature Genetics 20: