SIMS2003. Instructors: Rus Yukhananov, Alex Loguinov, BWH, Harvard Medical School. Introduction to Microarray Technology.

Lecture 1
I. EXPERIMENTAL DETAILS
II. ARRAY CONSTRUCTION
III. IMAGE ANALYSIS

Lecture 2
IV. DATA ANALYSIS
V. DATA INTEGRATION

Microarrays are about gene expression. Why bother?

Gene expression: All information about a living being is coded in DNA as a set of genes. Each gene contains structural information about the protein sequence and regulatory information about protein expression. The intermediate step between gene and protein is mRNA. A microarray measures the concentration of mRNA.

By measuring RNA we learn the expression profile of a cell, which defines the cell's properties and functions.
Problem: RNA levels and protein levels are not always directly correlated. Without mRNA there is no protein, but the relation between the two is neither simple nor universal.
Functional genomics fills the gap between gene expression and organism function.
The meaning of life is hidden in gene expression values, but it is not easy to get it out.

DNA -> mRNA -> PROTEIN
DNA array: universal, easy to measure, scalable.
Protein array: each protein is unique, not easy to measure, not scalable.

Figure: Microarray publications per year, 1995-2002 (PubMed analysis using the keyword "microarray"; series: total, review). Labeled value: 1493.

Complementary hybridization: the basis of RNA measurement
A--T G--C T--A C--G
Northern blot / Southern blot: put RNA (Northern) or DNA (Southern) on a support (membrane, glass) and measure the concentration of an unknown sample.
RNA protection assay (figure: NMDAR1 gel; lane labels: bl, 2, 5, 20, 50, B, L).
RT-PCR: quantitative measurement by amplification.
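The pairing rule on this slide can be sketched in a few lines. This is only an illustration of perfect-match complementarity; the function names `reverse_complement` and `hybridizes` are made up for this example, and real hybridization also tolerates mismatches and depends on temperature and salt, which the sketch ignores.

```python
# Watson-Crick pairing from the slide: A--T, G--C, T--A, C--G.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Return the strand that pairs with `seq`, read in the opposite direction."""
    return "".join(PAIR[base] for base in reversed(seq.upper()))

def hybridizes(probe, target):
    """True only if probe and target are perfectly complementary (assumption:
    no mismatches allowed, which is stricter than a real experiment)."""
    return reverse_complement(probe) == target.upper()

example = reverse_complement("ATGC")
```

Here `example` is the strand that a probe with sequence ATGC would bind.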

The dot blot is the closest relative of the microarray
Dot blot: total RNA is the target; the probe is a labeled fragment used to measure the concentration of the unknown sample.
Reverse northern blot: the probe becomes the target.
Microarray, differences: spot size 1-3 mm vs. 0.05-0.1 mm; target and probe are switched.
Result: revolution
1. Miniaturization
2. Multiple probes
3. Sensitivity
4. Excitement!!!

Selection of cDNA
- Selection of sequences that represent the gene of interest.
- Finding sequences, usually in the EST database.
- Problems: sequencing errors, alternative splicing, chimeric sequences, contamination.
Diagram labels: selected DNA target, DNA sequencing, arraying.

Array Fabrication
cDNA clones or oligonucleotides (probes)
Printing the microarray: 0.1 nl/spot
Apply target to slide: 25 x 75 mm

Source: Affymetrix website

I. ARRAY CONSTRUCTION
WHOLE GENOME ARRAY: E. coli, yeast, HIV
PATHWAY ARRAY: apoptotic, toxicology array
DISCOVERY ARRAY: known and novel genes

DISCOVERY ARRAY gene selection:
- specific tissue library
- subtractive library
- genes of interest

DrugAbuse Array (ver. 2)
Brain cDNA library: ~1,100 genes
Subtractive library: ~900 genes (following chronic opioid administration)
Ver. 3: Ver. 2 + NIA15K set + kidney cDNA library
NIA15K set (~15,000 genes): mouse embryonic libraries
Embryonic kidney libraries (~2,200 genes)

DrugAbuse array about 20,000 spots

DrugAbuse Array: gene class distribution
Regulatory 44%
Metabolic 35%
Structural 21%

DrugAbuse Array: regulatory gene distribution
Cell division & differentiation 30%
Intracellular messengers 20%
Membrane associated 15%
Transcription factors 45%

EXPERIMENTAL PROTOCOL
A. Synthesis of cDNA: mRNA (AAAAAA-) with a TTTTTTT- primer; synthesis of the second DNA strand
B. Labeling cDNA: single channel or multiple channel (Cy3, Cy5; wavelength axis: 350 nm, 480 nm, 570 nm, 680 nm)
C. Hybridization
D. Scanning

Control and treatment samples: RNA extraction, then reverse transcription and labeling (red dye Cy5, green dye Cy3), then hybridization.

II. Image Analysis
Scanning: excitation by laser 1 and laser 2, then emission.
Presentation: overlay images.
Image analysis.

Scanner Details
Components: laser, PMT, A/D converter.
Signal chain: dye, photons, electrons, signal.
Processes: excitation, amplification, filtering.

The computer (digital) image is a two-dimensional array of pixel intensities z = f(x, y), where x, y is the pixel location.
An 8-bit image has 256 (2^8) levels of intensity.
A 16-bit image has 65,536 (2^16) gray levels of intensity, with values from 0 to 65,535 (2^16 - 1).
An image can be considered a realization of a stochastic process, that is, a sample from the whole class of possible images. Each pixel has a probability distribution f(z). This representation of the image is used in image inference: reconstructing the characteristics of the true, unobservable image X from the observed image Z using various statistical methods (MMS, ML, Bayesian estimation, etc.).
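The bit-depth arithmetic can be checked with a small sketch. The helper names (`gray_levels`, `max_intensity`, `digitize`) are hypothetical, and the clipping behavior of `digitize` is an assumption about how an A/D converter saturates, not a description of any particular scanner.

```python
def gray_levels(bits):
    """Number of distinct intensity levels an image of this bit depth can store."""
    return 2 ** bits

def max_intensity(bits):
    """Largest storable pixel value; anything brighter saturates at this value."""
    return 2 ** bits - 1

def digitize(reading, bits):
    """Round a reading and clip it to the storable range (assumed behavior)."""
    return max(0, min(int(round(reading)), max_intensity(bits)))

levels_8 = gray_levels(8)
top_16 = max_intensity(16)
```

A spot brighter than `max_intensity(16)` would be recorded as saturated, which is one reason scanning parameters need to be optimized before quantification.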

Image structure

Steps in Image Processing
I. Gridding: locate spots.
II. Segmentation: classify pixels as either signal or background.
III. Measurement: for each spot of the array, calculate signal intensity (mean, median, mode), background, and quality measures.
Assumption: signal intensity ~ mRNA level
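The three steps can be sketched on a toy image. Everything here is a simplifying assumption made for illustration: the grid is taken to be perfectly regular with square cells, segmentation is a fixed intensity threshold, and the function names are invented. Real image analysis software uses more elaborate gridding and segmentation.

```python
from statistics import mean, median

def grid_cell(image, row, col, size):
    """Gridding (simplified): cut out the square cell holding spot (row, col)."""
    r0, c0 = row * size, col * size
    return [line[c0:c0 + size] for line in image[r0:r0 + size]]

def segment(cell, threshold):
    """Segmentation: split the pixels of one cell into signal and background."""
    pixels = [z for line in cell for z in line]
    signal = [z for z in pixels if z > threshold]
    background = [z for z in pixels if z <= threshold]
    return signal, background

def measure(signal, background):
    """Measurement: summarize one spot."""
    return {
        "signal_mean": mean(signal) if signal else 0.0,
        "signal_median": median(signal) if signal else 0.0,
        "background_median": median(background) if background else 0.0,
        "signal_pixels": len(signal),
    }

# Toy 4 x 8 image: two 4 x 4 cells, each with a bright 2 x 2 spot.
image = [
    [10, 12, 11, 10, 9, 10, 11, 10],
    [11, 200, 210, 12, 10, 90, 95, 11],
    [10, 205, 195, 11, 11, 100, 85, 10],
    [12, 10, 11, 10, 10, 11, 10, 12],
]
spots = []
for col in range(2):
    cell = grid_cell(image, 0, col, 4)
    spots.append(measure(*segment(cell, threshold=50)))
```

The threshold of 50 is arbitrary and chosen only because it separates the toy spots from the toy background.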

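Local Background
Mixed density distribution: $F(z) = p\,f_s(z) + (1-p)\,f_b(z)$, where $f_s$ is the density of signal pixel intensities, $f_b$ is the density of background pixel intensities, and $p$ is the fraction of signal pixels.
Figure labels: local background; signal + background.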
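One way to make the mixture concrete is to fit it to simulated pixels. The sketch below assumes that both components are Gaussian and fits them by expectation-maximization (EM); the slide does not specify either choice, and the simulated intensities and the function name `fit_mixture` are invented for the example.

```python
import math
import random

def normal_pdf(z, mu, sd):
    return math.exp(-0.5 * ((z - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def fit_mixture(pixels, iterations=50):
    """Fit p*f_s(z) + (1-p)*f_b(z) with Gaussian f_s, f_b (assumption) by EM."""
    lo, hi = min(pixels), max(pixels)
    mu_b, mu_s = lo, hi                      # crude starting values
    sd_b = sd_s = max((hi - lo) / 4, 1e-6)
    p = 0.5
    n = len(pixels)
    for _ in range(iterations):
        # E-step: probability that each pixel belongs to the signal component
        resp = []
        for z in pixels:
            s = p * normal_pdf(z, mu_s, sd_s)
            b = (1 - p) * normal_pdf(z, mu_b, sd_b)
            resp.append(s / (s + b) if s + b > 0 else 0.5)
        n_s = sum(resp)
        n_b = n - n_s
        if n_s < 1e-9 or n_b < 1e-9:
            break
        # M-step: weighted means, standard deviations and mixing fraction
        mu_s = sum(r * z for r, z in zip(resp, pixels)) / n_s
        mu_b = sum((1 - r) * z for r, z in zip(resp, pixels)) / n_b
        var_s = sum(r * (z - mu_s) ** 2 for r, z in zip(resp, pixels)) / n_s
        var_b = sum((1 - r) * (z - mu_b) ** 2 for r, z in zip(resp, pixels)) / n_b
        sd_s = max(math.sqrt(var_s), 1e-6)
        sd_b = max(math.sqrt(var_b), 1e-6)
        p = n_s / n
    return p, (mu_s, sd_s), (mu_b, sd_b)

# Simulated spot neighborhood: 300 background pixels and 100 signal pixels.
random.seed(0)
pixels = [random.gauss(100, 20) for _ in range(300)]
pixels += [random.gauss(1000, 100) for _ in range(100)]
p_hat, (mu_s_hat, sd_s_hat), (mu_b_hat, sd_b_hat) = fit_mixture(pixels)
```

With components this well separated the estimated fraction `p_hat` should land close to the simulated 0.25; on real spots with weak signal the two densities overlap and the fit is much less clear-cut.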

Steps in Image Processing
3. Data Extraction
Spot intensities:
- mean (pixel intensities)
- median (pixel intensities)
- M-estimators for location of the pixel intensity distribution
Background values:
- local
- constant (global)
- none
Quality information:
- pixel distribution (background, signal)
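The three location estimates can be compared on a spot with one outlying pixel. The slide does not say which M-estimator is meant; the sketch below uses a Huber estimator with the conventional tuning constant 1.345 as one common choice, and the pixel values and the `background_value` helper are made up for illustration.

```python
from statistics import mean, median

def huber_location(pixels, c=1.345, iterations=50, tol=1e-6):
    """Huber M-estimate of location, computed by iteratively reweighted means."""
    mu = median(pixels)
    # Robust scale: median absolute deviation, rescaled to match a normal SD.
    mad = median([abs(z - mu) for z in pixels])
    scale = 1.4826 * mad or 1.0
    for _ in range(iterations):
        # Pixels within c*scale get full weight; farther pixels are downweighted.
        weights = [1.0 if abs(z - mu) <= c * scale else c * scale / abs(z - mu)
                   for z in pixels]
        new_mu = sum(w * z for w, z in zip(weights, pixels)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

def background_value(mode, local_pixels=(), global_value=0.0):
    """Background options from the slide: local, constant (global), or none."""
    if mode == "local":
        return median(local_pixels)
    if mode == "global":
        return global_value
    if mode == "none":
        return 0.0
    raise ValueError("mode must be 'local', 'global' or 'none'")

# Eight ordinary pixels and one outlier (for example, a dust speck).
spot = [100, 102, 98, 101, 99, 103, 97, 100, 2000]
spot_mean = mean(spot)
spot_median = median(spot)
spot_huber = huber_location(spot)
```

The mean is dragged far above the bulk of the pixels by the single outlier, while the median and the Huber estimate stay near 100.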

Quality Measurements
Spot:
- signal / background ratio
- variation in pixel intensities
- shape (circularity)
Array:
- correlation between spot intensities
- percentage of bad spots
- distribution of spot signal area
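A few of these measures are simple enough to sketch. The definitions below are common conventions rather than the ones used in the lecture: variation is taken as the coefficient of variation, circularity as 4*pi*area / perimeter^2, and a "bad" spot as one whose signal/background ratio falls below a hypothetical cutoff of 2.

```python
import math
from statistics import mean, pstdev

def signal_to_background(signal_pixels, background_pixels):
    """Ratio of mean signal intensity to mean background intensity."""
    return mean(signal_pixels) / mean(background_pixels)

def coefficient_of_variation(pixels):
    """Standard deviation of the pixel intensities relative to their mean."""
    return pstdev(pixels) / mean(pixels)

def circularity(area, perimeter):
    """4*pi*A / P**2: 1.0 for a perfect circle, smaller for irregular shapes."""
    return 4 * math.pi * area / perimeter ** 2

def percent_bad_spots(ratios, min_ratio=2.0):
    """Share of spots whose signal/background ratio is below min_ratio
    (min_ratio is an illustrative cutoff, not a recommended value)."""
    bad = sum(1 for r in ratios if r < min_ratio)
    return 100.0 * bad / len(ratios)

unit_circle = circularity(math.pi, 2 * math.pi)   # area and perimeter for r = 1
bad_share = percent_bad_spots([1.0, 3.0, 5.0, 1.5])
```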

Figure: pixel distribution for spot C3_5. Histogram of number of observations against upper boundaries (x <= boundary), with the expected normal curve.

Figure: pixel distribution for spot B3_6. Histogram of number of observations (0 to 50) against upper boundaries (x <= boundary, -100 to 1600), with the expected normal curve.

Figure: pixel distribution for spot C3_7. Histogram of number of observations against upper boundaries (x <= boundary), with the expected normal curve.

IIB. Image Transformation
BACKGROUND CORRECTION: subtract local background from each pixel.
BACKGROUND SUBTRACTION: subtract local background from the mean or median.
POWER FAMILY OF TRANSFORMATIONS:
$X_t = X^p$ for $p \neq 0$; $X_t = \ln X$ for $p = 0$
The transformation is monotonic and differentiable, and can adjust nonlinearity and asymmetry.
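The power family is short enough to write out directly. This is a minimal sketch; the function name `power_transform` and the sample intensities are invented, and the requirement that intensities be positive is an assumption needed for the logarithm and fractional powers to be defined.

```python
import math

def power_transform(x, p):
    """X_t = X**p for p != 0 and ln(X) for p == 0; x must be positive."""
    if x <= 0:
        raise ValueError("intensity must be positive")
    return math.log(x) if p == 0 else x ** p

intensities = [16, 81, 256, 10000]
half = [power_transform(x, 0.5) for x in intensities]
quarter = [power_transform(x, 0.25) for x in intensities]
logged = [power_transform(x, 0) for x in intensities]
```

Smaller values of p compress the bright end of the intensity range more strongly, with the logarithm as the limiting case at p = 0, and for p > 0 the ordering of the intensities is preserved.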

Figure: pixel distribution and background correction. Panels: C3-4, C3-4-bgr, C3-3, C3-3-bgr.

Figure panels: background subtracted; power 3/4; power 1/2; power 1/4; logarithmic; logarithmic + background subtraction.

Figure: power family of transformation and background correction. Panels: normal; BGR correction; logarithmic; power 0.25; power 0.5; power 0.75.

Figure panels: no background correction; background correction; background subtraction; power 0.25.

Data diagnostic: non-transformed image
Control: label with Cy3-dUTP
Stress: label with Cy5-dUTP

Data diagnostic: background subtraction
Control: label with Cy3-dUTP
Stress: label with Cy5-dUTP

Data diagnostic: power 1/2 image transformation
Control: label with Cy3-dUTP
Stress: label with Cy5-dUTP

Data diagnostic: background (mean + 2s) correction
Control: label with Cy3-dUTP
Stress: label with Cy5-dUTP
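One plausible reading of "background (mean + 2s) correction" is sketched below: the background level is taken as the mean plus k standard deviations of the local background pixels and is subtracted from every spot pixel. The slides do not spell out the procedure, so the function names, the use of the population standard deviation, and the clipping of negative values to zero are all assumptions.

```python
from statistics import mean, pstdev

def background_level(background_pixels, k=2.0):
    """Background level taken as mean + k * SD of the local background pixels."""
    return mean(background_pixels) + k * pstdev(background_pixels)

def correct_pixels(spot_pixels, level):
    """Subtract the level from every pixel; pixels below the level become 0
    (the clipping is an assumption of this sketch)."""
    return [max(z - level, 0.0) for z in spot_pixels]

level = background_level([8, 12], k=2.0)      # mean 10, SD 2
corrected = correct_pixels([5, 14, 114], level)
```

A larger k removes more of the noisy low-intensity pixels at the cost of discarding weak true signal.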

Image Analysis Algorithm:
- optimize scanning parameters
- spot quantification
- background determination
- background subtraction or correction (µ + 1.5s to 2s)
DATA ANALYSIS
- apply regression analysis
- detect differentially expressed genes
- pattern recognition and functional analysis

Normalization: Why
We need to remove systematic error.
- Within-slide normalization
- Between-slide normalization

Based on data from P.Brown et

Normalization: How
Question: what kind of normalization should be applied?
- No normalization
- Global normalization with nonlinearity (lowess) correction
- Normalization by regression (log-transformed, ratio-based normalization)
- Using non-changed genes
- Housekeeping genes
- Block-plate (print-tip) normalization
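Two of the simpler options can be sketched with the standard library alone. The assumptions here are that global normalization means centering the log2(Cy5/Cy3) ratios on their median (optionally computed from a reference set such as housekeeping genes) and that regression normalization means fitting a straight line to log2(Cy5) against log2(Cy3) and keeping the residuals. Lowess and print-tip normalization are not implemented, and the intensities are invented.

```python
import math
from statistics import mean, median

def log_ratios(cy5, cy3):
    """log2(Cy5 / Cy3) for each spot."""
    return [math.log2(r / g) for r, g in zip(cy5, cy3)]

def global_normalize(m, reference=None):
    """Shift log ratios so the median over the reference spots is 0.
    reference: indices of spots assumed unchanged (e.g. housekeeping genes);
    by default all spots are used."""
    idx = range(len(m)) if reference is None else reference
    shift = median([m[i] for i in idx])
    return [v - shift for v in m]

def regression_normalize(cy5, cy3):
    """Fit log2(Cy5) = a + b*log2(Cy3) by least squares; return the residuals."""
    x = [math.log2(g) for g in cy3]
    y = [math.log2(r) for r in cy5]
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Toy slide: the Cy5 channel reads twice as bright on every spot (systematic error).
cy3 = [100, 200, 400, 800]
cy5 = [200, 400, 800, 1600]
raw = log_ratios(cy5, cy3)
centered = global_normalize(raw)
residuals = regression_normalize(cy5, cy3)
```

On this toy slide every raw log ratio is 1 even though no gene changed, and both methods bring the values back to 0. Print-tip normalization would apply the same kind of correction separately to the spots printed by each tip.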

Spinal cord injury

PKC knockout mouse

Next lecture: DATA ANALYSIS
STATISTICAL ANALYSIS: array as a list of genes
PATTERN RECOGNITION: expression profile classification
FUNCTIONAL ANALYSIS: content analysis, gene network construction