BIOINF/BENG/BIMM/CHEM/CSE 184: Computational Molecular Biology. Lecture 2: Microarray analysis

BIOINF/BENG/BIMM/CHEM/CSE 184: Computational Molecular Biology Lecture 2: Microarray analysis

Genome wide measurement of gene transcription using DNA microarray Bruce Alberts, et al., Molecular Biology of the Cell, 4 th Edition

What is a microarray

A 2D array of DNA sequences from thousands of genes Each spot has many copies of same gene Allow mrnas from a sample to hybridize Measure number of hybridizations per spot

How to make a microarray Method 1: Printed Slides (Stanford) Use PCR to amplify a 1Kb portion of each gene Apply each sample on glass slide Method 2: DNA Chips (Affymetrix) Grow oligonucleotides (25bp) on glass Several words per gene (choose unique words) If we know the gene sequences, Can sample all genes in one experiment!

Sample Data

Visualization Tools

Goal of Microarray Experiments Measure level of gene expression across many different conditions: Expression Matrix M: {genes} {conditions}: M ij = gene i in condition j Deduce gene function Genes with similar function are expressed under similar conditions Deduce gene regulatory networks parts and connections- system level description of biology

Analysis of Microarray Data Clustering (unsupervised learning) Use the primary data to group together measurements, with no information from other sources. Classification (supervised learning) Use known groups of interest (from other sources) to learn the features associated with these groups in the primary data, and create rules for associating the data with the groups of interest.

Analysis of Microarray Data Clustering Idea: Groups of genes that share similar function have similar expression patterns Hierarchical clustering k-means / fuzzy k-means Bayesian approaches Principal Component Analysis

Analysis of Microarray Data Classification Idea: A cell can be in one of several states (Diseased vs. Healthy, Cancer X vs. Cancer Y vs. Normal) Can we train an algorithm to use the gene expression patterns to determine which state a cell is in? Support Vector Machines Decision Trees Neural Networks K-Nearest Neighbors

Why cluster? Place genes with similar expression profiles into clusters. What is the gene s function? Place experiments / samples with similar expression profiles into clusters. What is the expression profile of a particular disease or phenotype?

Eisen et al. (1998) PNAS 95:14863. Cholesterol biosynthesis Cell cycle Immediate-early response Signaling and angiogenesis Wound healing and tissue remodeling

Hierarchical clustering Technically, Eisen uses average-link agglomerative hierarchical clustering. Similarity between genes is measured using a modified Pearson correlation coefficient. Simple implementation is easy; fast implementation is hard.

Expression matrix experiments genes A B C D

Pearson correlation ( ) = n Y Y n X X n Y X X Y Y X r i i i i i i i i i i i i i i i 2 2 2 2,

Correlation matrix experiments genes genes A B C D genes

Find maximal correlation experiments genes genes A B C D genes 0.756

Merge and calculate correlations experiments genes genes A B CD genes A B C D As a side effect, the algorithm produces a dendrogram.

Hierarchical clustering Given: m n-element expression vectors Compute an m by m similarity matrix Iterate Find the maximal off-diagonal value Join the corresponding genes Average together their profiles The order in which the genes are joined determines the topology of a tree.

Similarity measures Pearson correlation Euclidean distance Others?

Clustering variants Agglomerative: start with all examples separate, then merge. Divisive: start with one big cluster, then split.

Linkage variants The similarity between two clusters is Single-link: The maximum of the pairwise similarities of the elements. Complete link: Same, but use minimum. Average link: The average of the pairwise similarities. Centroid link: The similarity of the centroids of the clusters.

Post hoc analysis Select clusters. Select ordering of genes for visualization. Determine cluster labels. Determine significance of clusters.

Eisen et al. (1998) PNAS 95:14863. Cholesterol biosynthesis Cell cycle Immediate-early response Signaling and angiogenesis Wound healing and tissue remodeling

K-Means Clustering Algorithm Randomly initialize k cluster means Iterate: Assign each genes to the nearest cluster mean Recompute cluster means Stop when clustering converges

K-Means Algorithm Randomly Initialize Cluster Means

K-Means Algorithm Assign data points to nearest clusters

K-Means Algorithm Recalculate Cluster Means

K-Means Algorithm Repeat

K-Means Algorithm Repeat until convergence

3 2 1 4 5

Randomly choose gene 3 and 5 as initial centers Assign genes to clusters: r 13 < r 15 => gene 1 in cluster A New center: ([22 21] + [19 20] + [18 22])/3 = [19.7 21] ([1 3] + [4 2])/2 = [2.5 2.5] Repeat Cluster B 3 2 1 Cluster A 4 5

Yeast Cell Cycle date clustered by K-means (k=10)

Yeast Cell Cycle date clustered by hierarchical clustering

K-Means Clustering Algorithm Really fast How to select k? random or prior knowledge How to choose initial cluster centers? Randomly pick a gene then find the most dissimilar ones as other centers

Readings: 1. Eisen MB, Spellman PT, Brown PO, Botstein D Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A. 1998 Dec 8;95(25):14863-8. 2. Kim TH, Ren B. Genome-Wide Analysis of Protein-DNA Interactions. Annu Rev Genomics Hum Genet. 2006;7:81-102.