10.1 The Central Dogma of Biology and gene expression


Grundlagen der Bioinformatik, SS 09, D. Huson (this part by K. Nieselt), July 6, 2009

10 Microarrays (script by K. Nieselt)

There are many articles and books on this topic. These lectures are based on Kay Nieselt's course on Microarray Bioinformatics. The following books are recommended reading:

- T. Speed, Statistical Analysis of Gene Expression Microarray Data, Chapman & Hall.
- D. Stekel, Microarray Bioinformatics, Cambridge University Press.
- Jones and Pevzner, An Introduction to Bioinformatics Algorithms.

10.1 The Central Dogma of Biology and gene expression

The expression of genetic information in a DNA molecule takes place in two steps:

1. transcription: DNA → mRNA
2. translation: mRNA → protein

Gene expression is a highly complex and precisely regulated process that allows the cell to react dynamically to a changing environment as well as to its own changing needs. This mechanism acts both as an on/off switch that controls which genes are expressed in a cell and as a volume control that increases or decreases the level of expression of particular genes as necessary.

What is a microarray?

[Figures: two photographs of microarrays]

Microarrays are devices that allow one to measure the expression of many thousands of genes in parallel. They usually consist of a microscope slide onto which DNA molecules have been chemically bonded. mRNA is extracted from a biological sample and labelled. It then hybridizes to the DNA on the array via Watson-Crick duplex formation.

Microarray technology is used to understand fundamental aspects of growth and development as well as potential genetic causes of diseases.

A microarray experiment is a typical example of a high-throughput experiment. It consists of three steps:

1. Material production: array design and array production
2. Data generation: preparation of tissue, mRNA isolation, cDNA labeling, hybridisation, scanning
3. Information retrieval: image analysis, data normalisation, advanced analyses

The goal of a microarray experiment is to compare the expression of two or more cell types. Examples are:

- Analysis of tissue-specific gene expression
- Comparison of gene expression in healthy and tumor tissue
- Influence of environmental changes on expression
- Dependence of gene expression on cell cycle state

In addition, applications range from sequencing to so-called ChIP-on-Chip procedures.

Definition: A microarray is a tool for analyzing gene expression that consists of a small membrane or glass slide containing samples of many genes arranged in a regular pattern. The following synonyms for microarrays are also used: chip, biochip, DNA-array, gene array.

Definition: The probe is the nucleic acid molecule on the chip whose identity is known. The target is the free nucleic acid molecule in the solution that is to be identified. [1]

Types of microarrays

Microarrays are distinguished either by the type of probe present on the chip:

- cDNA
- oligos

or by the way the chips are produced:

- spotting
- in situ synthesis

Production of spotted microarrays

On spotted arrays, the probes are attached to the slide in three main steps:

1. Generation of DNA probes (cDNA or oligos)
2. Printing of the probes onto the glass slide
3. Fixation of the probes

[1] According to the MIAME convention (Minimum Information About a Microarray Experiment), the DNA on the array is called the reporter and the DNA in the solution is the hybridisation extract.

Production of in situ synthesis microarrays

For these microarrays, whose most prominent representative is the GeneChip by Affymetrix, the DNA of a gene (or EST) is not deposited on the array. Instead, oligonucleotides are synthesized directly on the chip. This is done with a photolithographic process that is very similar to conventional semiconductor chip production. Three different technologies are currently in use:

1. photo deprotection with masks: Affymetrix
2. photo deprotection without masks: Nimblegen
3. chemical deprotection: Agilent

[Figure: production of an Affymetrix array. Labels: light, mask, photo-chemically removable group, glass substrate; the masking step is repeated many times]

10.4 DNA-microarray experiments

Independent of the type of chip, each DNA-microarray experiment consists of the following steps:

1. Preparation of the array
2. Extraction of tissues

3. Isolation of mRNA from the tissue(s)
4. Generation of cDNA/cRNA from mRNA
5. Generation of the hybridisation solution that contains the fluorescently labeled cDNA/cRNA (each target uses a different label)
6. Incubation of the hybridisation solution with the array
7. Scanning
8. Image analysis
9. Advanced analysis of the data

Scanning

Laser technology is used to detect the bound cDNA/cRNA: when excited by the laser, the labeled molecules emit photons. The more target DNA is bound, the higher the fluorescence signal. If a gene is highly expressed, many RNA molecules will stick to the probe, and the probe location will shine brightly when the laser hits it. If a gene is expressed at a lower level, less RNA will stick to the probe, and that probe location will be much dimmer by comparison when it is hit by the laser.

One-color versus dual-color

Spotted arrays allow comparative microarray experiments, in which the expression of two targets is measured simultaneously, while in situ produced arrays yield absolute measurements. Experiments on spotted arrays are therefore also called dual-channel or dual-color experiments, and experiments on in situ arrays are called one-channel or one-color experiments.

From raw to primary data

Generally, three steps are necessary for the image analysis:

1. Addressing: assign the location of each spot center. The coordinates of each spot are assigned based on the gridding process. The algorithms for this step need to be robust and reproducible.
2. Segmentation: classify each pixel as a foreground (signal) pixel or a background (noise) pixel.
3. Information extraction: compute the numerical values. For each spot on the array (and for each label, if more than one is used) compute:
   (a) mean signal intensity,
   (b) mean background intensity,
   (c) quality value.

Each of the two labels has a typical excitation wavelength. These should, of course, differ from the emission wavelengths. A scan is produced for each label (channel). The measured intensities of the two channels are then overlaid (compared) for each spot and a pseudocolored image is produced. Usually red, green, yellow and black are used. This color choice reflects the choice of labels, i.e. Cy3 (green) and Cy5 (red). If both channels have the same intensity in a spot, the spot is colored yellow. If the intensity in the green channel is higher, the spot is colored green, otherwise red. Black spots indicate missing intensity.

[Figure: example of a pseudocolored two-channel scan]

Expression values of two-channel arrays

Although one assumes that only light from cDNA hybridized to its complementary probe is detected, light from other sources is detected as well. These sources include molecules that are bound to the wrong spot or unspecifically to the glass, and light reflected from dust etc. The signal from these sources is called the background signal of a scan. All in all, the raw product of the scan is the set of pixel intensities.

Let $F_{X,j}$ denote the set of foreground pixels in channel $X$ ($X = R$ for red, $X = G$ for green) of the $j$th probe (spot, gene). Similarly, let $B_{X,j}$ denote the set of background pixels in channel $X$ of the $j$th probe. Let $r_i$ and $g_i$ be the intensity of pixel $i$ in the red and green channel, respectively. Furthermore, let $R_j^f$ and $G_j^f$ be the mean foreground signal of the $j$th spot in the red and green channel, respectively, and let $R_j^b$ and $G_j^b$ be the corresponding mean background signals. These are computed as

$$R_j^f = \frac{1}{|F_{R,j}|} \sum_{i \in F_{R,j}} r_i, \qquad G_j^f = \frac{1}{|F_{G,j}|} \sum_{i \in F_{G,j}} g_i$$

and

$$R_j^b = \frac{1}{|B_{R,j}|} \sum_{i \in B_{R,j}} r_i, \qquad G_j^b = \frac{1}{|B_{G,j}|} \sum_{i \in B_{G,j}} g_i$$

For the final expression value of a spot, the background signals are subtracted from the foreground signals:

$$R_j = R_j^f - R_j^b, \qquad G_j = G_j^f - G_j^b$$

Care must be taken if $R_j^b > R_j^f$ and/or $G_j^b > G_j^f$, because the corrected value is then negative and its logarithm is undefined. In this case, most image analysis programs flag the spot. Finally, both expression values are combined into a ratio or log ratio (commonly base 2):

$$e(j) = \log_2 \frac{R_j}{G_j}$$

Thus $e(j)$ is the log ratio expression value of the $j$th spot.

Images of in situ microarrays

The image resulting from the scan of an in situ array differs substantially from that of a spotted array. Here the spots are not circular but square, which makes the image analysis much easier.

[Figure: example scan of an in situ array]
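The background correction and log ratio for a single two-channel spot, as defined above, can be sketched in Python as follows. This is a minimal illustration, not the procedure of any particular image analysis program: the function name `log_ratio`, the pixel values, and the choice to return `None` for a flagged spot are assumptions made for this example.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty list of pixel intensities."""
    return sum(values) / len(values)


def log_ratio(fg_red, bg_red, fg_green, bg_green):
    """Return e(j) = log2(R_j / G_j) for one spot.

    Each argument is the list of pixel intensities of the foreground or
    background region in the red or green channel. If a background-corrected
    signal is not positive, the spot is flagged by returning None
    (an assumption of this sketch; real programs use their own flags).
    """
    r = mean(fg_red) - mean(bg_red)        # R_j = R_j^f - R_j^b
    g = mean(fg_green) - mean(bg_green)    # G_j = G_j^f - G_j^b
    if r <= 0 or g <= 0:
        return None
    return math.log2(r / g)


# Hypothetical pixel intensities for one spot.
print(log_ratio([900, 1100], [100, 100], [300, 400], [120, 130]))  # prints 2.0

# Background exceeds foreground in the red channel: the spot is flagged.
print(log_ratio([90, 110], [150, 150], [300, 400], [120, 130]))    # prints None
```

In the first call the corrected signals are 900 (red) and 225 (green), so the ratio is 4 and the log ratio is 2.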

Expression values of one-channel arrays

The expression values for arrays with just one channel are computed similarly to those of two-channel experiments. Here we define $e(j)$ to be either the absolute expression intensity or its $\log_2$ value.

The expression matrix

Now that we have defined the expression value of a gene in a single array experiment, we turn to assembling the values of several array experiments into a common matrix.

Definition: The expression matrix of a microarray experiment consisting of $p$ arrays, where each array has $n$ genes, is an $n \times p$ matrix in which cell $(i, j)$ contains the expression value of the $i$th gene on the $j$th hybridized array.

Let us denote the expression profile of the $i$th gene $g_i$ by $e(g_i)$, and the expression value of the $i$th gene in the $j$th experiment by $e(g_{ij})$. Then the mean expression of $g_i$ is

$$\bar{e}(g_i) = \frac{1}{p} \sum_{j=1}^{p} e(g_{ij}).$$

Similarity and dissimilarity of expression data

In the following we look at distance measures for computing the (dis)similarity of expression profiles. The computed (dis)similarity values then serve as the input of clustering algorithms.

Another aim of using microarrays on a genome-wide level is to identify groups of genes or samples with similar expression profiles. From the biological point of view, comparing gene profiles is different from comparing sample profiles. From the mathematical point of view, it is essentially the same.

Metrics and semi-metrics for expression data

Assume that we have an expression matrix with $n$ genes and $p$ arrays. Similarity between two profiles is often measured in terms of the distance between two vectors in a high-dimensional space (of dimension $n$ or $p$). The most frequently used distance is the Euclidean distance:

$$d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$$

A semi-metric measure is the Pearson correlation coefficient:

$$\rho(x, y) = \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{p} (y_i - \bar{y})^2}}$$

We have $\rho(x, y) \in [-1, 1]$, where $\rho(x, y) = 1$ implies perfect similarity and $\rho(x, y) = 0$ indicates no linear correlation.

[Figure: examples of Pearson correlation coefficients]

The Pearson correlation coefficient is a similarity measure, so it needs to be transformed into a distance:

$$d_\rho(x, y) = 1 - \rho(x, y)$$

Note that the Euclidean distance is not scale invariant: two profiles with the same shape (i.e. a large Pearson correlation) but different magnitude have a large Euclidean distance and thus appear to be dissimilar. In addition, the Euclidean distance cannot detect negative correlations. On the other hand, if the magnitude of change is of importance, then it is the appropriate distance measure.
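The difference between the two measures can be checked on small examples. The following Python sketch implements the Euclidean distance and the Pearson-based distance $1 - \rho$ from the formulas above; the function names and the three profiles are made up for illustration, and constant profiles (zero variance) are not handled.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation coefficient of two profiles.

    Assumes both profiles have non-zero variance; a constant profile
    would cause a division by zero in this sketch.
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def pearson_distance(x, y):
    """Distance derived from the Pearson correlation: 1 - rho(x, y)."""
    return 1.0 - pearson(x, y)


# Hypothetical profiles: y has the same shape as x at ten times the
# magnitude, z is x in reverse order (perfectly anti-correlated).
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
z = [4.0, 3.0, 2.0, 1.0]

print(euclidean(x, y), pearson_distance(x, y))  # large Euclidean, Pearson distance near 0
print(euclidean(x, z), pearson_distance(x, z))  # small Euclidean, Pearson distance near 2
```

The profiles `x` and `y` are far apart in Euclidean terms although their shape is identical, while `x` and `z` are close in Euclidean terms although they are anti-correlated.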

Clustering - Introduction

In gene expression analysis, clustering methods are often applied to analyse expression profiles. Generally we distinguish

- Unsupervised clustering
- Supervised clustering = classification

While a classification analysis assigns objects to predefined groups (classes), a cluster analysis computes the groups of objects itself (the objects here are either genes or samples).

Unsupervised clustering

- helps to identify genes that might be involved in the same functional process in the cell
- helps to identify and annotate unknown genes
- helps, for example, to identify subtypes of cancer

We distinguish two general types of cluster methods:

- partitioning methods
- hierarchical methods

Cluster analysis needs two ingredients:

- Distance measure
- Cluster algorithm: groups objects based on their distance, with the goal of achieving small distances within clusters and large distances between clusters

k-means clustering

The goal of k-means clustering is to find a partition $C$ of the set $X$ into $k$ (pre-chosen) clusters such that a given measure of homogeneity is maximised. High homogeneity implies that elements in the same cluster are very similar. k-means clustering belongs to the so-called partitioning clustering methods: the input set of elements is partitioned into disjoint clusters such that each element belongs to exactly one cluster.

Algorithm (k-means)

1. Choose $k$.
2. Randomly choose $k$ centers $\mu_1, \dots, \mu_k$, which serve as the mean values of the clusters.
3. For each gene $x_i$, compute the nearest cluster center: $C(i) = \arg\min_{1 \le l \le k} d(x_i, \mu_l)^2$
4. Compute the new mean of each cluster $C_l = \{x_j : C(j) = l\}$: $\mu_l = \frac{1}{|C_l|} \sum_{x_j \in C_l} x_j$
5. Repeat steps 3-4 until the algorithm converges.
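The algorithm above can be sketched in plain Python as follows. This is a minimal illustration under several assumptions that the notes do not specify: the initial centers are drawn at random from the data points, the distance is the Euclidean distance, a cluster that becomes empty keeps its previous center, and convergence means that the assignment no longer changes. The names `kmeans` and `total_within_ss` and the toy profiles are hypothetical.

```python
import random


def sq_dist(x, y):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans(profiles, k, max_iter=100, seed=0):
    """Cluster profiles into k groups; returns (assignment, centers)."""
    rng = random.Random(seed)
    # Step 2: choose k random data points as initial centers (assumption).
    centers = [list(p) for p in rng.sample(profiles, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign each profile to its nearest center.
        new_assignment = [
            min(range(k), key=lambda l: sq_dist(x, centers[l]))
            for x in profiles
        ]
        # Step 5: stop when the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: recompute each center as the mean of its members.
        for l in range(k):
            members = [x for x, c in zip(profiles, assignment) if c == l]
            if members:  # an empty cluster keeps its old center
                centers[l] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


def total_within_ss(profiles, assignment, centers):
    """Sum of squared distances of all profiles to their cluster center."""
    return sum(sq_dist(x, centers[c]) for x, c in zip(profiles, assignment))


# Hypothetical toy data: two well-separated groups of three profiles each.
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centers = kmeans(data, k=2)
print(labels, total_within_ss(data, labels, centers))
```

Calling `kmeans` for several values of `k` and comparing the values returned by `total_within_ss` gives a simple way to judge how many clusters the data support. Because the result depends on the random initial centers, repeated runs with different seeds may give different partitions on less clearly separated data.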

[Figure: example of a k-means clustering]

The k-means method minimizes the total intravariance sum

$$\sum_{l=1}^{k} \sum_{C(i)=l} d(x_i, \mu_l)^2,$$

i.e. the sum of the squared distances between each gene expression profile and its respective cluster center.

An important parameter of the method is the choice of $k$, the number of clusters. One way to optimize this choice is to run the algorithm several times with different values of $k$, compute the total intravariance sum each time, and plot the result.

An application of k-means clustering

A k-means clustering was conducted with the data of the so-called Spellman microarray experiment (a yeast cell cycle experiment) [2]. In that experiment, a yeast whole-genome expression experiment was conducted in order to test the hypothesis that genes might be regulated in a periodic manner coincident with the cell cycle. For the clustering shown here, only the cell cycle genes (about 800) of all yeast genes were used.

[Figure: different k-means clusterings computed using Mayday [3] for k = 4, 6, 8]

[2] Spellman et al., Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Mol Biol Cell 9.
[3] Microarray analysis software developed in the group of Dr. Nieselt.

Hierarchical clustering

The result of hierarchical clustering is a set of nested clusters, which can be visualized by means of a tree or dendrogram. This is similar to tree building from distances, e.g. UPGMA, and so we will skip this topic.

Visualisation of gene expression data

An important aspect of microarray data analysis is visualization. Visualization tools are primarily used to gain biologically important insights into the data. There are a number of approaches to the problem of visualizing microarray data, ranging from viewing the raw image data and viewing the profiles of genes across experiments to using one of the many scatter plot variants. This section gives a short overview of common visualisation methods.

Box plot

The box plot visualizes a one-dimensional distribution. It is based on five numbers that summarize the distribution: minimum, first quartile, median, third quartile and maximum.

[Figure: construction of a box plot]

Box plots can be used to compare distributions.

[Figure: box plots of a normal, a skewed and a uniform distribution]

The box plot is especially useful for the comparison of replicated array experiments.
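The five numbers behind a box plot can be computed with the Python standard library, as in the following minimal sketch. Several conventions exist for computing quartiles; the `inclusive` method of `statistics.quantiles` used here is one choice made for this example, not one prescribed by the notes, and the function name and sample values are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    # quantiles(..., n=4) returns the three cut points between the quarters.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


# Hypothetical log2 intensities from one array.
intensities = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(five_number_summary(intensities))  # prints (1, 3.0, 5.0, 7.0, 9)
```

Computing this summary for each replicate array and comparing the results side by side corresponds to comparing their box plots.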

Scatterplot

In a scatterplot, one distribution is plotted against another. Let $\log(X)$ and $\log(Y)$ denote the log-values of the distributions $X$ and $Y$. Then one plots $\log(Y)$ against $\log(X)$. A typical application for dual-channel microarray data is to plot the intensity values ($\log_2$) of the green channel against those of the red channel.

MA-plot

In an MA-plot, rather than plotting $Y$ against $X$ or $\log(Y)$ against $\log(X)$, one plots

$$M = \log(Y/X) = \log(Y) - \log(X)$$

against

$$A = \frac{\log(X) + \log(Y)}{2}.$$

For the two channels,

$$M = \log_2(R/G) = \log_2 R - \log_2 G$$

is thus plotted against

$$A = \frac{\log_2 R + \log_2 G}{2}.$$

The MA-plot is just the original scatter plot rotated by 45° clockwise with subsequent scaling. It is especially useful for the detection of intensity-dependent effects in the log-ratio. Note that the A axis generally covers the range from 0 to 16, while the M (y-) axis is centered around 0 (zero, for equal ratio).

[Figure: example of an MA-plot]

The above example shows the differences in incorporation of the label: here the molecules in the green channel have higher intensities than their respective ones in the red channel.

Heatmap

One of the most popular tools for microarray data visualization is the heatmap (Eisen, 1998). Heatmaps:

- are also known as intensity plots or matrix plots
- represent the data in the form of a table: typically genes are in the rows, experiments in the columns
- fill each cell of the matrix with a color representing the logarithmic expression ratio
- use 3 colors, typically green, black and red

[Figure: example of a heatmap]
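The coloring rule of a heatmap can be illustrated with a short Python sketch that maps a log ratio to a color on a green, black, red scale. The details are assumptions made for this example rather than the behavior of any specific tool: positive log ratios (red channel higher) are drawn in red and negative ones in green, the brightness grows linearly with the absolute value, and values beyond the hypothetical saturation limit `max_abs` are clipped.

```python
def heatmap_color(log_ratio, max_abs=2.0):
    """Map a log ratio to an (R, G, B) tuple with components in 0..255.

    0 maps to black, positive values to shades of red, negative values
    to shades of green. Values outside [-max_abs, max_abs] are clipped.
    """
    clipped = max(-max_abs, min(max_abs, log_ratio))
    level = round(255 * abs(clipped) / max_abs)
    if clipped > 0:
        return (level, 0, 0)
    if clipped < 0:
        return (0, level, 0)
    return (0, 0, 0)


# Hypothetical expression matrix: rows are genes, columns are experiments.
matrix = [
    [2.0, 0.0, -2.0],
    [3.5, -0.4, 0.0],
]
colors = [[heatmap_color(value) for value in row] for row in matrix]
for row in colors:
    print(row)
```

The saturation limit controls the contrast: a small `max_abs` makes moderate changes clearly visible but no longer distinguishes strong from very strong changes.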

Profile plots

Profile plots show the expression profile of genes across the experiments.

[Figure: example of a profile plot]

Visualisation of clusters

Both profile plots and heatmaps are especially useful for visualisation after clustering. One plots either all profiles of each cluster or only the profile of the cluster representative.

[Figure: example showing the result of a hierarchical clustering of the cell cycle experiment together with the associated heatmap]

Summary

- Microarrays are used to measure expression levels in cells.
- Clustering is used to detect common patterns of expression of genes.
- Visualization of expression data is an important tool.
- A main topic that we did not cover is how to normalize signals.
- New sequencing technologies are poised to replace microarrays in many applications.