Computational Biology I

Computational Biology I: Microarray Data Acquisition, Gene Clustering, Practical

Microarray Data Acquisition H. Yang

From Sample to Target cDNA
1. Sample -> centrifugation (buffer) -> cell pellets
2. Lyse cells (TRIzol): releases mRNA, rRNA, tRNA, proteins, cell membrane, DNA, and others
3. Isolation of total RNA (isopropanol): mRNA, rRNA, tRNA
4. Purification of mRNA (GFX): mRNA (mRNA 1, mRNA 2, ...)
5. RT (random hexamer primer): cDNA (cDNA 1, cDNA 2, ...)
6. Labeling (fluorescent dye/radioactive material)
7. Hybridization on chip

Target cDNA Binds with Probe cDNA
Probe: spotted/synthesized cDNA/oligos (probe DNA is denatured; applied by spotting or synthesis)
Target: labeled cDNA generated from the sample
Hybridization of 20-50 µl on a chip for 12-20 h at 40-50 °C

Spots on a Microarray
Spot size: (2 cm × 2 cm) / 20,000 spots = 141 µm × 141 µm per spot
cDNA arrays/long oligo arrays: 2-3 replicate spots per gene; negative control spots for evaluation of cross-hybridization
Affymetrix chips: 1 mismatch for each perfect match; 11-20 spots with different sequences for each gene
[Figure: chip layout showing perfect match, mismatch, and background areas]
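The spot size quoted above can be checked with a short calculation. This is only a sketch: it assumes square spots that tile the 2 cm × 2 cm chip with no gaps between them, which is an idealization not stated on the slide.

```latex
\[
\sqrt{\frac{(2\,\mathrm{cm})^2}{20{,}000}}
= \sqrt{\frac{(20{,}000\,\mu\mathrm{m})^2}{20{,}000}}
= \sqrt{20{,}000\,\mu\mathrm{m}^2}
\approx 141\,\mu\mathrm{m}
\]
```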

What is the Corrected Intensity?
For a gene spot (perfect match): corrected intensity x = detected intensity - background intensity - non-specific binding
For a negative control spot (mismatch): non-specific binding = detected intensity - background intensity

What is the Corrected Intensity?
For genes: $x = \frac{1}{N_p}\sum I_p - \frac{1}{N_b}\sum I_b - I_n$
For negative controls: $I_n = \frac{1}{N_p}\sum I_p - \frac{1}{N_b}\sum I_b$
$I_p$: measured intensity of a pixel in the probe area; $N_p$: number of pixels in the probe area
$I_b$: measured intensity of a pixel in the background area; $N_b$: number of pixels in the background area
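The corrected-intensity definitions above can be sketched in a few lines of Python. The function names and all pixel values below are hypothetical and chosen only for illustration; in practice the pixel lists would come from the scanner image, and how the non-specific binding term is aggregated over several negative control spots is left open here.

```python
from statistics import mean

def spot_signal(probe_pixels, background_pixels):
    """Mean probe-pixel intensity minus mean background-pixel intensity."""
    return mean(probe_pixels) - mean(background_pixels)

def corrected_gene_intensity(probe_pixels, background_pixels, nonspecific):
    """Gene spot: background-subtracted signal minus non-specific binding I_n."""
    return spot_signal(probe_pixels, background_pixels) - nonspecific

# Hypothetical pixel readings, made up for illustration only.
control_probe = [130, 150, 140]
control_background = [100, 100, 100]
gene_probe = [900, 1100, 1000]
gene_background = [90, 110, 100]

# Negative control spot: I_n = 140 - 100
i_n = spot_signal(control_probe, control_background)
# Gene spot: x = 1000 - 100 - I_n
x = corrected_gene_intensity(gene_probe, gene_background, i_n)
print(i_n, x)
```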

Plastic/Nylon Membrane Microarray
For each of Medium 1 and Medium 2: RNA isolation -> mRNA purification -> reverse transcription -> radioactive labeling -> hybridization -> wash -> scanning
Result: intensities x and y

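Glass cDNA Microarray
Sample 1 (mutant): RNA isolation -> mRNA purification -> reverse transcription -> labeling with Cy5
Sample 2 (wildtype): RNA isolation -> mRNA purification -> reverse transcription -> labeling with Cy3
Both samples: hybridization -> wash -> scanning
Result: intensities x and y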

Presentation of Microarray Data
[Figure: scatter plot of y versus x on linear axes, 0 to 80,000]

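Better Presentation of Microarray Data
[Figure: scatter plot of y versus x on logarithmic axes, 100 to 100,000]
[Table: expression matrix with genes as rows (Gene 1, Gene 2, ..., Gene i, ..., Gene N) and arrays as columns (x, y, w, z); entries x_1 ... x_N, y_1 ... y_N, w_1 ... w_N, z_1 ... z_N]
[Figure: plot of the log ratio $M = \log(y/x)$, from -1.5 to 1.5, against the log mean intensity $A = \log\sqrt{xy}$, from 2 to 5]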

Gene Clustering H. Yang

Clustering Methods
Hierarchical clustering: pairwise comparison; cluster tree
Partitional clustering: self-organizing maps (SOM); several distinct clusters

Hierarchical Clustering
Comparison of two genes (gene groups) with increasing distance or decreasing similarity
[Figure: five genes (1-5) plotted in the x-y plane, and the corresponding cluster tree with axes labeled Distance and Similarity]

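Hierarchical Clustering: Pairwise Approach
Expression profile of gene i (ratio/log ratio): $r_i = (y_{i1}/x_i,\ y_{i2}/x_i,\ \ldots)$, with mean over its N elements $\bar{r}_i = \frac{1}{N}\sum_{n=1}^{N} r_{in}$
Distance: $d(r_i, r_j) = \lVert r_i - r_j \rVert$
Similarity: $s(r_i, r_j) = \frac{(r_i - \bar{r}_i)\cdot(r_j - \bar{r}_j)}{\sqrt{\lVert r_i - \bar{r}_i \rVert^2\,\lVert r_j - \bar{r}_j \rVert^2}}$
[Figure: ratio/log ratio of Gene i and Gene j plotted against array (sample) number, 0 to 4]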
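The two pairwise measures can be sketched as follows. This assumes the distance is the Euclidean norm of the difference of two ratio vectors and the similarity is the Pearson correlation of the two profiles, which is how the formulas on this slide are read here; the three example profiles are invented.

```python
import math

def euclidean_distance(r_i, r_j):
    """d(r_i, r_j): Euclidean norm of the difference of two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r_i, r_j)))

def pearson_similarity(r_i, r_j):
    """s(r_i, r_j): centered dot product divided by the product of norms."""
    mean_i = sum(r_i) / len(r_i)
    mean_j = sum(r_j) / len(r_j)
    dev_i = [a - mean_i for a in r_i]
    dev_j = [b - mean_j for b in r_j]
    numerator = sum(a * b for a, b in zip(dev_i, dev_j))
    denominator = math.sqrt(sum(a * a for a in dev_i) * sum(b * b for b in dev_j))
    return numerator / denominator

# Hypothetical expression-ratio profiles over four arrays.
gene_i = [1.0, 2.0, 3.0, 4.0]
gene_j = [2.0, 4.0, 6.0, 8.0]  # same trend as gene_i, twice the amplitude
gene_k = [4.0, 3.0, 2.0, 1.0]  # opposite trend

print(euclidean_distance(gene_i, gene_j))
print(pearson_similarity(gene_i, gene_j))
print(pearson_similarity(gene_i, gene_k))
```

In this example gene_i and gene_j are far apart in distance but perfectly similar in shape, which illustrates why the choice between the two measures matters for clustering.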

Hierarchical Clustering
Complete linkage, single linkage, average linkage
[Figure: five points (1-5) illustrating the three linkage types]
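The linkage types differ in how the distance between two groups of genes is derived from the pairwise distances. A minimal sketch, using the standard definitions (single: closest pair, complete: farthest pair, average: mean over all pairs) and two made-up clusters of points; the function name is hypothetical.

```python
import math
from itertools import product

def linkage_distance(cluster_a, cluster_b, method):
    """Distance between two clusters under a given linkage rule."""
    pair_distances = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if method == "single":      # closest pair of members
        return min(pair_distances)
    if method == "complete":    # farthest pair of members
        return max(pair_distances)
    if method == "average":     # mean over all pairs of members
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage method: {method}")

# Hypothetical clusters of two points each, for illustration only.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

for method in ("single", "complete", "average"):
    print(method, linkage_distance(cluster_a, cluster_b, method))
```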

Hierarchical Clustering Using the Expression Ratio of 20% to 5% O2
Expression at 20% O2 compared to 5% O2 is: down-regulated, not altered, or up-regulated
[Figure: clustered expression map; labels: Exp 1; Day 6, 15, 20; 1 to 6]

Partitional Clustering
Self-organizing maps are employed
[Figure: genes and clusters when an expression vector has three elements]
[Figure: genes and clusters in the x-y plane when an expression vector has only two elements]

Self-Organizing Maps (SOM)
Iterative training
[Figure: genes and clusters in the x-y plane]

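Cluster Determination Using SOM
Iterative approach: $f_{i+1}(h) = f_i(h) + \tau_{ij}\,(x_j - f_i(h))$
$f_i(h)$, $f_{i+1}(h)$: positions of cluster h at two consecutive steps; $\tau_{ij}$: learning rate; $x_j$: expression of gene j
h = 1, ..., M (number of clusters); j = 1, ..., N (number of genes)
Learning rate: $\tau_{ij}$ is a function of $\alpha_i$ and of distances $d(\cdot,\cdot)$ [expression garbled in the transcription]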
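A single step of the iterative update can be sketched as below. The learning rate is passed in as a plain number because its exact form is not legible on the slide; the constant value 0.5, the function name, and the two-element vectors are assumptions made only for this illustration.

```python
def som_update(position, expression, tau):
    """One step: f_{i+1}(h) = f_i(h) + tau * (x_j - f_i(h))."""
    return [f + tau * (x - f) for f, x in zip(position, expression)]

# Hypothetical two-element expression vector and cluster position.
start = [0.0, 0.0]
gene = [2.0, 4.0]

step1 = som_update(start, gene, tau=0.5)  # moves halfway toward the gene
step2 = som_update(step1, gene, tau=0.5)  # moves halfway again
print(step1, step2)
```

With a learning rate between 0 and 1, each step pulls the cluster position part of the way toward the gene's expression vector, so repeated presentation of genes gradually moves the clusters into dense regions of the data.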

Other Clustering Methods
K-means
KNN (K-nearest neighbors)
Principal Component Analysis
Neural network
Fuzzy clustering

Practical H. Yang

Two Sets of Microarray Data
1. Five T-cell culture samples on 5 separate microarrays
   Nylon membrane array with radioactive labeling
   Data already corrected and normalized
   3000 spots on chip: 1250 genes in duplicate spots, 11 housekeeping genes, and >400 negative controls
2. One microarray with 2 samples from a C. acetobutylicum fermentation
   cDNA glass microarray with labeling dyes Cy3 and Cy5
   Raw data, without subtraction of background and non-specific binding
   4000 spots on chip: 1200 genes in >triplicate spots and 120 spots as negative controls

1st Microarray Data Set
Given: normalized ratios, with 631 genes left
To perform:
Hierarchical clustering
SOM clustering with 6 clusters

Hierarchical Clustering with the 1st Microarray Data Set
Cluster (Brown's Lab at Stanford University)
Website: http://rana.lbl.gov/eisensoftware.htm
Cluster: Load File -> Hierarchical Clustering -> Average Linkage Clustering
TreeView: Load File

SOM Clustering with the 1st Microarray Data Set
Cluster2 (Whitehead Institute, Center for Genome Research)
Website: http://www-genome.wi.mit.edu/cancer/software/genecluster2/gc_license.html
Cluster2: File -> Open; Data analysis -> Find Classes -> SOM rows: 3, SOM cols: 2 -> Run; Data View -> View Clusters -> Compute -> View

2nd Microarray Data Set
Given: raw data
To perform:
Correction of measured intensities by subtracting background and non-specific binding
Prefiltering
Normalization (global log mean)
Identification of differentially expressed genes (with 2.5-fold change)
Plot of original data and normalized data in an A-M plot

Processing of 2nd Microarray Data Set: Correction
Correction of gene spot intensities and negative control intensities by subtracting the corresponding background (non-specific binding) intensities
1. Corrected* gene spot intensity = gene spot intensity - background intensity
2. Corrected negative control intensity = negative control intensity - background intensity
* Needs to be further corrected by removal of non-specific binding

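Processing of 2nd Microarray Data Set: Prefiltering
1. Calculate the mean and standard deviation (Std) of the 116 negative control intensities
2. Further corrected gene spot intensity = corrected gene spot intensity - corrected negative control mean intensity
3. Final corrected gene intensity, with x the further corrected intensity from step 2: $x_{\mathrm{final}} = \begin{cases} x & \text{for } x \ge 2\,\mathrm{Std} \\ 2\,\mathrm{Std} & \text{for } x < 2\,\mathrm{Std} \end{cases}$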
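The prefiltering steps above can be sketched as follows. The inputs are assumed to be background-subtracted intensities already; the function name and the numbers are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption, since the slide does not say which is meant.

```python
from statistics import mean, stdev

def prefilter(gene_corrected, control_corrected):
    """Subtract the negative-control mean, then floor values at 2 * Std."""
    control_mean = mean(control_corrected)
    floor = 2 * stdev(control_corrected)  # sample Std: an assumption
    further_corrected = [g - control_mean for g in gene_corrected]
    # x stays x when x >= 2 Std, otherwise it is replaced by 2 Std
    return [max(x, floor) for x in further_corrected]

# Hypothetical background-subtracted intensities.
control_corrected = [8, 10, 12]   # mean 10, sample Std 2
gene_corrected = [510, 12, 60]

final = prefilter(gene_corrected, control_corrected)
print(final)
```

Flooring weak signals at 2 Std keeps every intensity positive, so that ratios and logarithms taken in later steps remain well defined.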

Processing of 2nd Microarray Data Set: Normalization
1. Use the global log mean: calculate the mean intensities of the Cy3 and Cy5 dyes
2. Use the mean intensity ratio (Cy3/Cy5) to correct the Cy5 intensities
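One possible reading of this normalization is sketched below: the mean of the log intensities is computed for each dye, and the Cy5 channel is rescaled by the resulting Cy3/Cy5 ratio so that both channels end up with the same global log mean. This interpretation, the function name, and the example values are assumptions; the slide does not spell out the arithmetic.

```python
import math

def normalize_cy5(cy3, cy5):
    """Scale Cy5 so its mean log intensity matches that of Cy3.

    Assumes all intensities are positive (e.g. after prefiltering).
    """
    log_mean_cy3 = sum(math.log(v) for v in cy3) / len(cy3)
    log_mean_cy5 = sum(math.log(v) for v in cy5) / len(cy5)
    # Ratio Cy3/Cy5 of the global (geometric) means
    factor = math.exp(log_mean_cy3 - log_mean_cy5)
    return [v * factor for v in cy5], factor

# Hypothetical corrected intensities: Cy5 is uniformly twice as bright.
cy3 = [100.0, 400.0]
cy5 = [200.0, 800.0]

cy5_normalized, factor = normalize_cy5(cy3, cy5)
print(factor, cy5_normalized)
```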

Processing of 2nd Microarray Data Set: Identification of Differentially Expressed Genes
1. Calculate the logarithmic intensity ratio (Cy5/Cy3)
2. Identify genes with a 2.5-fold change
3. Identify up-regulated genes
4. Identify down-regulated genes
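The identification steps can be sketched with a symmetric threshold on the log ratio. Here "up" simply means a higher normalized signal in the Cy5 channel than in the Cy3 channel; the base-2 logarithm, the function name, and the example intensities are assumptions for illustration.

```python
import math

def classify(cy3, cy5, fold=2.5):
    """Label a gene by its Cy5/Cy3 log ratio against a fold-change cutoff."""
    log_ratio = math.log2(cy5 / cy3)
    threshold = math.log2(fold)
    if log_ratio >= threshold:
        return "up"
    if log_ratio <= -threshold:
        return "down"
    return "unchanged"

# Hypothetical normalized intensities (Cy3, Cy5) for three genes.
examples = {"gene_a": (100.0, 300.0), "gene_b": (300.0, 100.0), "gene_c": (100.0, 200.0)}
labels = {name: classify(cy3, cy5) for name, (cy3, cy5) in examples.items()}
print(labels)
```

Working with the log ratio makes the cutoff symmetric: a 2.5-fold increase and a 2.5-fold decrease lie at the same distance from zero.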

Processing of 2nd Microarray Data Set: Plot Data in log(x)-log(y) and M-A Diagrams
1. x = Cy3 and y = Cy5
2. Plot log(x) vs log(y)
3. Plot M = log(y/x) vs A = log(xy)/2
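The coordinates for the M-A diagram can be computed as in the sketch below, which assumes base-10 logarithms (the slide does not state the base) and uses invented intensities. Plotting the resulting points is left to whatever plotting tool is at hand.

```python
import math

def ma_values(x, y):
    """Return (M, A) with M = log(y/x) and A = log(x*y)/2, base 10."""
    m = math.log10(y / x)
    a = math.log10(x * y) / 2
    return m, a

# Hypothetical spot: x = Cy3 intensity, y = Cy5 intensity.
m, a = ma_values(100.0, 10000.0)
print(m, a)

# A spot with equal intensities in both channels lies on the M = 0 line.
m_equal, a_equal = ma_values(1000.0, 1000.0)
print(m_equal, a_equal)
```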