Computational Biology I

1 Computational Biology I: microarray data acquisition, gene clustering, practical

2 Microarray Data Acquisition H. Yang

3 From Sample to Target cDNA
Sample → centrifugation (buffer) → cell pellets → lyse cells (TRIzol), releasing mRNA, rRNA, tRNA, proteins, cell membrane, DNA, and others → isolation of total RNA (isopropanol): mRNA, rRNA, tRNA → purification of mRNA (GFX): mRNA (mRNA 1, mRNA 2, …) → RT (random hexamer primer): cDNA (cDNA 1, cDNA 2, …) → labeling (fluorescent dye/radioactive material) → hybridization on chip

4 Target cDNA Binds with Probe cDNA
Probe: spotted/synthesized cDNA/oligos (probe DNA: denature; spotting or synthesis)
Target: labeled cDNA generated from sample
Hybridization of … µl on a chip for … h at … °C

5 Spots on a Microarray
Spot size: (2 cm × 2 cm) / 20,000 spots = 141 µm × 141 µm per spot
cDNA arrays/long oligo arrays: 2-3 replicate spots per gene; negative control spots for evaluation of cross-hybridization
Affymetrix chips: 1 mismatch for each perfect match; spots with different sequences for each gene
Figure labels: perfect match, mismatch, background
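
The spot-size arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `spot_side_um` is made up for illustration, and it assumes the chip area is divided evenly into square cells, one per spot.

```python
import math

def spot_side_um(chip_side_cm: float, n_spots: int) -> float:
    """Side length (in µm) of the square cell available to each spot."""
    chip_side_um = chip_side_cm * 1e4            # 1 cm = 10,000 µm
    area_per_spot = chip_side_um ** 2 / n_spots  # µm^2 per spot
    return math.sqrt(area_per_spot)

# 2 cm x 2 cm chip carrying 20,000 spots
print(round(spot_side_um(2.0, 20_000)))  # 141
```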

6 What is the Corrected Intensity?
For a gene spot (perfect match): corrected intensity x = detected intensity − background intensity − non-specific binding
For a negative control spot (mismatch): non-specific binding = detected intensity − background intensity

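7 What is the Corrected Intensity?
For genes: $x = \frac{1}{N_p}\sum I_p - \frac{1}{N_b}\sum I_b - I_n$
For negative controls: $I_n = \frac{1}{N_p}\sum I_p - \frac{1}{N_b}\sum I_b$
$I_p$: measured intensity of a pixel in the probe area; $N_p$: number of pixels in the probe area
$I_b$: measured intensity of a pixel in the background area; $N_b$: number of pixels in the background area
$I_n$: non-specific binding, obtained from the negative control spots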
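
As a rough illustration of the pixel-averaging formulas, the sketch below computes the background-corrected intensity of a negative control spot and then the fully corrected intensity of a gene spot. The function names and the pixel values are invented for the example; how the non-specific binding of several control spots is combined is not specified on this slide.

```python
from statistics import mean

def background_corrected(probe_pixels, background_pixels):
    """Mean probe-area intensity minus mean background-area intensity."""
    return mean(probe_pixels) - mean(background_pixels)

def corrected_gene_intensity(probe_pixels, background_pixels, nonspecific):
    """Gene spot: subtract the background and then the non-specific binding."""
    return background_corrected(probe_pixels, background_pixels) - nonspecific

# Made-up pixel intensities, for illustration only
i_n = background_corrected([130, 120, 125], [100, 100, 100])   # negative control
x = corrected_gene_intensity([520, 480, 500], [100, 110, 90], i_n)
print(i_n, x)
```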

8 Plastic/Nylon Membrane Microarray
Medium 1, Medium 2: RNA isolation → mRNA purification → reverse transcription → radioactive labeling → hybridization → wash → scanning → x, y

9 Glass cDNA Microarray
Sample 1 (mutant, labeling with Cy5) and Sample 2 (wildtype, labeling with Cy3): RNA isolation → mRNA purification → reverse transcription → labeling → hybridization → wash → scanning → x, y

10 Presentation of Microarray Data: scatter plot of y against x

11 Better Presentation of Microarray Data
Data table (rows: genes; columns: arrays):
          x     y     w     z
Gene 1    x1    y1    w1    z1
Gene 2    x2    y2    w2    z2
Gene i    xi    yi    wi    zi
Gene N    xN    yN    wN    zN
M-A plot: log ratio M = log(y/x) against log mean intensity A = log √(xy)
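
A minimal sketch of the M and A transformation for one gene measured on two arrays (or two channels) x and y. The slides do not state the base of the logarithm; base 2 is assumed here, and the helper name `ma_values` is hypothetical.

```python
import math

def ma_values(x: float, y: float, base: float = 2.0):
    """Return (M, A) with M = log(y/x) and A = log(sqrt(x*y)) = log(x*y)/2."""
    if x <= 0 or y <= 0:
        raise ValueError("intensities must be positive to take logarithms")
    m = math.log(y / x, base)
    a = 0.5 * math.log(x * y, base)
    return m, a

# A gene with four-fold higher intensity in y than in x
m, a = ma_values(100.0, 400.0)
print(m, a)
```

Genes with equal intensity on both arrays fall on the line M = 0, which is what makes deviations easier to read than in the plain y-against-x scatter plot.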

12 Gene Clustering H. Yang

13 Clustering Methods
Hierarchical clustering: pairwise comparison; cluster tree
Partitional clustering: self-organizing maps (SOM); several distinguishing clusters

14 Hierarchical Clustering
Comparison of two genes (gene groups) with increasing distance or decreasing similarity
Figure labels: x, y, distance, similarity

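15 Hierarchical Clustering
Pairwise approach on the ratio/log ratio vectors of gene i and gene j
Distance: $d(r_i, r_j) = \lVert r_i - r_j \rVert$
Similarity: $s(r_i, r_j) = \frac{(r_i - \bar{r}_i)\cdot(r_j - \bar{r}_j)}{\sqrt{\lVert r_i - \bar{r}_i \rVert^2\,\lVert r_j - \bar{r}_j \rVert^2}}$
Expression vector over the array samples: $r_i = (y_{i1}/x_i,\ y_{i2}/x_i,\ \ldots)$, with mean $\bar{r}_i = \frac{1}{N}\sum_{n=1}^{N} r_{in}$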
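
The two pairwise measures can be sketched as follows. The formulas on this slide are partly garbled, so this assumes the distance is the Euclidean norm of the difference and the similarity is the Pearson correlation coefficient; the function names are illustrative.

```python
import math

def euclidean_distance(ri, rj):
    """Euclidean norm of the difference of two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ri, rj)))

def pearson_similarity(ri, rj):
    """Centered dot product divided by the product of the centered norms."""
    mi = sum(ri) / len(ri)
    mj = sum(rj) / len(rj)
    num = sum((a - mi) * (b - mj) for a, b in zip(ri, rj))
    den = math.sqrt(sum((a - mi) ** 2 for a in ri) *
                    sum((b - mj) ** 2 for b in rj))
    if den == 0:
        raise ValueError("similarity is undefined for a constant vector")
    return num / den

# Two genes whose ratios rise together: far apart in distance, fully similar
gene_i = [1.0, 2.0, 3.0]
gene_j = [2.0, 4.0, 6.0]
print(euclidean_distance(gene_i, gene_j), pearson_similarity(gene_i, gene_j))
```

The example shows why the choice matters: the similarity ignores the scale of the profile, the distance does not.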

16 Hierarchical Clustering: complete linkage, single linkage, average linkage
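
The three linkage rules differ only in how the pairwise distances between the members of two clusters are summarized. A small sketch, assuming Euclidean distance between members and the usual definitions (complete = largest, single = smallest, average = mean pairwise distance; single linkage is the one listed as "simple" on the slide). The function name is hypothetical.

```python
import math
from itertools import product

def linkage_distance(cluster_a, cluster_b, method="average"):
    """Distance between two clusters of points under a given linkage rule."""
    pairwise = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
for method in ("single", "complete", "average"):
    print(method, linkage_distance(a, b, method))
```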

17 Hierarchical Clustering Using the Expression Ratio of 20% to 5% O2
Legend: expression at 20% compared to 5% O2 is down-regulated, not altered, or up-regulated
Figure labels: Exp 1, Day

18 Partitional Clustering
Self-organizing maps are employed
Figure panels: an expression vector with three elements; with only two elements (axes x, y; legend: gene, cluster)

19 Self-Organizing Maps (SOM): iterative training (figure axes x, y; legend: gene, cluster)

20 Cluster Determination using SOM
Iterative approach: $f_{i+1}(h) = f_i(h) + \tau_{ij}\,(x_j - f_i(h))$
$f_i(h)$, $f_{i+1}(h)$: positions of cluster h at two consecutive steps; $\tau_{ij}$: learning rate; $x_j$: expression of gene j
h = 1, …, M (number of clusters); j = 1, …, n (number of genes)
Learning rate: $\tau_{ij} = \alpha_i\,(\ldots)$, where the remaining factor depends on distances d involving cluster h and gene j
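
The iterative update can be sketched in a few lines. The exact learning-rate formula is not legible on this slide, so `learning_rate` below is a stand-in chosen for illustration: a linearly decaying factor alpha_i damped by a Gaussian of the grid distance to the best-matching cluster. The 1-D arrangement of clusters and all function names are likewise assumptions.

```python
import math

def som_update(f_h, x_j, tau):
    """One step of f_{i+1}(h) = f_i(h) + tau * (x_j - f_i(h))."""
    return [f + tau * (x - f) for f, x in zip(f_h, x_j)]

def learning_rate(alpha_i, grid_distance, sigma=1.0):
    """Hypothetical tau: alpha_i damped by the grid distance to the winner."""
    return alpha_i * math.exp(-(grid_distance ** 2) / (2 * sigma ** 2))

def train_som_1d(nodes, genes, n_iter=50, alpha0=0.5):
    """Train clusters arranged on a line; returns the final positions."""
    nodes = [list(n) for n in nodes]
    for i in range(n_iter):
        alpha_i = alpha0 * (1 - i / n_iter)      # decays linearly with step i
        for x_j in genes:
            # the winner is the cluster closest to the expression vector of gene j
            winner = min(range(len(nodes)),
                         key=lambda h: math.dist(nodes[h], x_j))
            for h in range(len(nodes)):
                tau = learning_rate(alpha_i, abs(h - winner))
                nodes[h] = som_update(nodes[h], x_j, tau)
    return nodes

genes = [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0), (9.0, 9.0)]
print(train_som_1d([(4.0, 4.0), (6.0, 6.0)], genes))
```

Each cluster position is pulled towards the presented gene, the winner most strongly, and the pull shrinks as training proceeds.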

21 Other Clustering Methods: K-means; KNN (K-nearest neighbors); Principal Component Analysis; neural network; fuzzy clustering

22 Practical H. Yang

23 Two Sets of Microarray Data
1. Five T-cell culture samples on 5 separate microarrays
   Nylon membrane array with radioactive labeling
   Data already corrected and normalized
   3000 spots on chip with 1250 genes in duplicate spots and 11 housekeeping genes plus >400 negative controls
2. One microarray with 2 samples from C. acetobutylicum fermentation
   cDNA glass microarray with labeling dyes Cy3 & Cy5
   Raw data without subtracting background and nonspecific binding
   4000 spots on chip with 1200 genes in >triplicate spots and 120 spots as negative controls

24 1st Microarray Data Set
Given: normalized ratios with 631 genes left
To perform: hierarchical clustering; SOM clustering with 6 clusters

25 Hierarchical Clustering with 1st Microarray Data Set
Cluster (Brown's Lab at Stanford University)
Website: Cluster, TreeView
Cluster: Load File → Hierarchical Clustering → Average Linkage Clustering
TreeView: Load File

26 SOM Clustering with 1st Microarray Data Set
Cluster2 (Whitehead Institute, Center for Genome Research)
Website: Cluster2
File → Open
Data analysis → Find Classes → SOM rows: 3, SOM cols: 2 → Run
Data View → View Clusters → Compute View

27 2nd Microarray Data Set
Given: raw data
To perform:
   Correction of measured intensities by subtracting background and non-specific binding
   Prefiltering
   Normalization (global log mean)
   Identification of differentially expressed genes (with 2.5-fold change)
   Plot original data and normalized data in A-M plot

28 Processing of 2nd Microarray Data Set
Correction of gene spot intensities and negative control intensities by subtracting corresponding background (non-specific binding) intensities
1. Corrected* gene spot intensity = gene spot intensity − background intensity
2. Corrected negative control intensity = negative control intensity − background intensity
* Needs to be further corrected by removal of nonspecific binding

29 Processing of 2nd Microarray Data Set
Prefiltering
1. Calculate mean and standard deviation (Std) of 116 negative control intensities
2. Further corrected gene spot intensity = corrected gene spot intensity − corrected negative control mean intensity
3. Final corrected gene intensity: $x = \begin{cases} x & \text{for } x \ge 2\,\mathrm{Std} \\ 2\,\mathrm{Std} & \text{for } x < 2\,\mathrm{Std} \end{cases}$
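
The three prefiltering steps could look like this in Python. This is a sketch rather than the course's reference solution: the function name is made up, the inputs are assumed to be already background-corrected, and the sample standard deviation is used because the slide does not say which estimator is meant.

```python
from statistics import mean, stdev

def prefilter(corrected_genes, corrected_controls):
    """Remove non-specific binding, then raise small values to the 2*Std floor."""
    ctrl_mean = mean(corrected_controls)
    ctrl_std = stdev(corrected_controls)   # assumption: sample standard deviation
    floor = 2 * ctrl_std
    result = []
    for g in corrected_genes:
        x = g - ctrl_mean                  # step 2: subtract control mean
        result.append(x if x >= floor else floor)   # step 3: apply the floor
    return result

# Made-up values: controls have mean 10 and Std 2, so the floor is 4
print(prefilter([50, 13, 9], [8, 10, 12]))
```

Flooring at 2 Std keeps weak spots from producing extreme or undefined log ratios later on.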

30 Processing of 2nd Microarray Data Set
Normalization
1. Use the global log mean: calculate the mean intensities of the Cy3 and Cy5 dyes
2. Use the mean intensity ratio (Cy3/Cy5) to correct the Cy5 intensities
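
One possible reading of the global log mean normalization is sketched below: the mean of each dye is taken on the log scale (a geometric mean), and the Cy3/Cy5 ratio of those means is applied to every Cy5 intensity as a scale factor. Both the interpretation and the function names are assumptions, not something the slide spells out.

```python
import math

def log_mean(values):
    """Mean taken on the log scale, returned on the original scale."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def normalize_cy5(cy3, cy5):
    """Scale Cy5 so that its global log mean matches that of Cy3."""
    factor = log_mean(cy3) / log_mean(cy5)
    return [v * factor for v in cy5], factor

# Made-up intensities in which Cy5 is uniformly twice as bright as Cy3
cy5_norm, factor = normalize_cy5([100.0, 400.0], [200.0, 800.0])
print(factor, cy5_norm)
```

After this correction the average log ratio over all genes is zero, which is the assumption behind a global normalization.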

31 Processing of 2nd Microarray Data Set
Identification of differentially expressed genes
1. Logarithmic intensity ratio (Cy5/Cy3)
2. Identify genes with 2.5-fold change
3. Identify up-regulated genes
4. Identify down-regulated genes
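
The four steps reduce to a threshold on the log ratio. In this sketch, base-2 logarithms are assumed, "up" is taken to mean a higher Cy5 than Cy3 signal, and the function name `classify` is hypothetical.

```python
import math

def classify(cy3: float, cy5: float, fold: float = 2.5) -> str:
    """Label a gene by its log ratio log2(Cy5/Cy3) against a fold threshold."""
    m = math.log2(cy5 / cy3)
    threshold = math.log2(fold)
    if m >= threshold:
        return "up"
    if m <= -threshold:
        return "down"
    return "unchanged"

# Made-up normalized intensities (Cy3, Cy5) for three genes
for cy3, cy5 in [(100.0, 300.0), (300.0, 100.0), (100.0, 150.0)]:
    print(cy3, cy5, classify(cy3, cy5))
```

Working on the log scale makes the cut-off symmetric: a 2.5-fold increase and a 2.5-fold decrease lie at the same distance from zero.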

32 Processing of 2nd Microarray Data Set
Plot data in log(x)-log(y) and M-A diagrams
1. x = Cy3 & y = Cy5
2. Plot log(x) vs log(y)
3. Plot M = log(y/x) vs A = log(xy)/2