Analysis of microarray data


BNF078 Fall 2006
Analysis of microarray data
Markus Ringnér
Computational Biology and Biological Physics, Department of Theoretical Physics, Lund University

Contents

Preface
1 Microarray technology
  1.1 Probes and slides
  1.2 Hybridization and scanning
  1.3 Analysis
2 Microarray data
  2.1 Data representation
  2.2 Normalization
  2.3 Quality filtering
  2.4 Distances
3 Dimensional reduction and clustering
  3.1 Principal component analysis
  3.2 K-means clustering
  3.3 Hierarchical clustering
  3.4 Self-organizing maps
  3.5 Clustering - so what?
4 Supervised learning
  4.1 Discriminatory gene statistics
  4.2 Training and validation
  4.3 Nearest neighbor
  4.4 Machine learning approaches
5 Integrating expression data with other data
  5.1 Different types of array data
  5.2 Functional annotation data
  5.3 Genome sequence data

Preface

The course BNF078 Expression analysis consists of two parts: Analysis of microarray data and Proteomics. The course is dominated by compulsory exercises. These lectures cover the first part of the course and serve as an elementary introduction to microarray data analysis and the computer exercises in this part. There are four such computer exercises:

1. Cluster analysis (0.5 day)
2. Setting up and analyzing an experiment in BASE (1.5 days)
3. Supervised classification (2 days)
4. Finding regulatory elements in microarray data (2 days)

By the end of this first part of the course, students should have an understanding of:

- DNA microarray technology and what microarrays can be used for.
- Normalization and quality control of microarray data.
- Unsupervised and supervised analysis methods.
- Analyzing microarray data together with other data.
- The work-flow of microarray experimentation and using databases to store microarray data.
- How to implement an analysis method as a computer program.

The second part of the course (proteomics) will be given by Peter James, will be covered by separate lecture notes, and contains additional exercises. To pass the course, all exercises in both parts and a final exam have to be passed according to the following rules: If reports are required for an exercise, they should be handed in within two weeks after the exercise. If not all exercise reports are handed in within two weeks, the exam will be marked as failed. All exercises have to be passed before a re-exam can be taken. Re-exams will typically be oral exams. If not all parts of the course are completed within one year after the course was given, all results are void and the course has to be completely retaken.

1 Microarray technology

A microarray is an experimental technique capable of generating genome-wide data. Microarrays are extensively used in research (Fig. 1). The first and most common application of microarrays is to measure genome-wide gene expression patterns: for each gene the mRNA abundance is measured (Fig. 2). The focus here will be on gene expression, although we will mention other applications.

Figure 1: Explosion of microarray and proteomics papers. The number of papers in PubMed published each year that matched the search terms microarray* and proteomic* are shown separately.

1.1 Probes and slides

To measure genome-wide mRNA abundances one needs a probe for each gene. To design such probes, the base pairing property of nucleic acids, such as DNA, can be utilized. The pairing is specific: in DNA, adenine (A) pairs with thymine (T) and guanine (G) with cytosine (C). Base pairing leads to the formation of a DNA double helix from two complementary single strands. Hence a probe can be any nucleic acid that is complementary to a specific gene. Such probes will join to gene sequences, forming a double-stranded molecule. This process of joining two complementary sequences of nucleic acid is called hybridization. Probes can be generated in many ways, including by making complementary DNA (cDNA) clones of genes or expressed sequence tags, by synthesizing DNA oligomers (typically bases), or by making clones of genomic DNA sequences.

A microarray slide is a solid surface onto which very many different probes are immobilized in spatially separated spots (Fig. 3). The slides can for example be ordinary microscope glass slides.

Figure 2: Gene expression. Genes are DNA sequences that are transcribed into RNA. A gene can occur in multiple copies in a cell (DNA copy number). If a gene is protein coding, it is transcribed into messenger RNA (mRNA), which in turn is translated into protein.

There are two ways to get probes onto slides. Probes can either be spotted onto slides, or be synthesized in situ on slides. The first approach can be accommodated using array robots in a typical biology lab, while the latter can not. Instead such slides are available from a growing number of companies that use different methodologies to generate them.

1.2 Hybridization and scanning

Armed with an array of probes attached to a microarray slide, the next step is to use the slide to measure the mRNA abundances in a sample. This step involves two aspects. First, the probes should be able to bind to their respective mRNA targets in the sample; the probes should capture the targets. Second, a system is needed for detecting how much mRNA was bound to each probe. This entire procedure is outlined in Fig. 4. RNA is extracted from a sample. The RNA is reversely transcribed into cDNA. In the reverse transcription, nucleotides that have fluorescent molecules attached, or have modifications that enable later attachment of fluorescent molecules, are used. This cDNA is referred to as labeled extract or labeled target. The labeled target is a mixture of DNA molecules with different sequences that should have two properties. First, all sequences are labeled with fluorochromes to enable later detection. Second, the abundance of each sequence should reflect the abundance of the corresponding mRNA in the sample.

Next, the labeled target is eluted over the slide in a procedure that enables probes to bind to their complementary sequences in the target. This procedure is called hybridization: the target is hybridized to the probes. Before detecting the amount bound to each probe, the slide is washed to remove all unbound material. The amount of target bound to each probe on the microarray slide is measured by scanning the slide with a laser scanner.

Figure 3: Microarray slide. A microarray is a solid surface onto which probes are immobilized and spatially separated in spots (left). In this image there is µm between spot boundaries and 130 µm between spot centers, allowing up to 55,000 spots on an ordinary microscope glass slide. Each spot consists of very many probes, all ideally being identical DNA sequences (right).

The scanner has a laser that emits light at a wavelength absorbed by the fluorescent molecules used to label the target. This absorption triggers the fluorescent molecules to emit light at other wavelengths. The scanner has a photo-multiplier tube to detect this fluorescent light. The slide is scanned by moving the laser across the slide, and at each position the intensity of the detected light depends on the amount of labeled target bound to the position. Scanning results in an image of the slide showing spots with varying intensity. The intensity of each spot reflects the expression level in the sample of the gene probed by the sequence in the spot.

Each probe can be viewed as a device measuring the gene expression level for a specific gene. Measuring devices have to be calibrated to provide absolute measurements. To avoid having to calibrate all spots on an array, one often resorts to making relative measurements by using a reference sample that is labeled with a different fluorochrome and co-hybridized together with the sample. In this way the expression level of each gene in the sample can be measured relative to its level in the reference. This procedure is referred to as two-color microarrays. Commonly, the fluorescent molecules Cy3 and Cy5 are used with two-color microarrays. The microarray slides are scanned at two different wavelengths, one for each fluorochrome, and the two scanned (gray-scale) images are merged into one pseudo-colored image, typically with red and green used for the two different intensities.

Figure 4: Using microarrays. In a sample, the expression levels are measured for all genes probed by the arrays. Each expression level is typically given as the expression ratio between two samples: often a query sample (sample 1) and a reference (or control) sample (sample 2).

The image of the scanned microarray slide is analyzed by methods that identify the spots, match them to information about the probes printed on the array, and for each spot calculate a number of statistics, including average red and green intensities. This procedure is referred to as image analysis. The major objective of the image analysis is to quantify the expression level in the sample for each gene probed by the microarray. The expression level is typically reported as an expression ratio. The expression ratio is the expression level for a particular gene in the sample divided by its expression level in the reference sample.

1.3 Analysis

A typical microarray data set contains many experiments measuring patterns of gene expression across, for example, different phenotypes, cellular responses to various treatments, or different conditions. Such data sets can be of two types:

Static: Each experiment (hybridization to one array slide) corresponds to e.g. a tissue sample. Clustering and supervised learning techniques are used to classify data and find relations between gene expressions and categories.

Dynamic: Each experiment corresponds to a time-point for e.g. a cell line. Clustering techniques are used to find common behavior among genes as a function of time. Such data can probe causal structures.

Regardless of measurement techniques one typically has the following steps:

1. Image analysis: extract gene expression levels from scanned microarrays.
2. Pre-processing: clean up the data and remove noise.
3. Map out underlying structures in the data.
4. Link results to databases, regulatory motifs, ontologies, pathways, functionality, etc.

In what follows we will give an overview of microarray data analysis. We will focus on higher-level analysis and not discuss image analysis. We will focus on spotted DNA microarrays, but the procedures are generic. For a review that provides an overview of the cDNA array technology see D. Duggan et al., Nat Genet vol. 21, suppl (1999) (freely available online). For a review of analysis methods see J. Quackenbush, Nat Genet Reviews (2001).

2 Microarray data

2.1 Data representation

We represent each experiment by a high-dimensional column vector x_{·,k} and a set of experiments by the following matrix:

$$
\left( x_{\cdot,1} \;\; x_{\cdot,2} \;\; \cdots \;\; x_{\cdot,k} \;\; \cdots \;\; x_{\cdot,M} \right)
=
\begin{pmatrix}
x_{1,\cdot} \\ x_{2,\cdot} \\ \vdots \\ x_{i,\cdot} \\ \vdots \\ x_{N,\cdot}
\end{pmatrix}
=
\begin{pmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,k} & \cdots & x_{1,M} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,k} & \cdots & x_{2,M} \\
\vdots  &         &        & \vdots  &        & \vdots  \\
x_{i,1} & x_{i,2} & \cdots & x_{i,k} & \cdots & x_{i,M} \\
\vdots  &         &        & \vdots  &        & \vdots  \\
x_{N,1} & x_{N,2} & \cdots & x_{N,k} & \cdots & x_{N,M}
\end{pmatrix}
$$

where x_{i,k} is the logarithm of the expression ratio for gene i and sample k. The expression level for gene i across the experiments is represented by the high-dimensional row vector x_{i,·}.

The advantage of using the logarithm of the expression ratio is that up- and down-regulation are treated symmetrically. Genes that are up-regulated by a factor of 2 have an expression ratio of 2, whereas those down-regulated by the same factor have an expression ratio of 0.5. Down-regulated genes have expression ratios between 0 and 1, whereas up-regulated genes have ratios between 1 and infinity. Using logarithms, a gene up-regulated by a factor of 2 has a log2(ratio) of 1, whereas a gene down-regulated by a factor of 2 has a log2(ratio) of -1, and an unchanged gene (with a ratio of 1) has a log2(ratio) of 0.

Typically N is 10,000-50,000 and, importantly, M << N. Indices i and j will be used for genes and k and l for experiments. In this picture each of the M experiments is a data point in N-dimensional gene space. One could also look at it the other way around: each gene is a point in M-dimensional sample space. The latter will be the case when studying time course experiments. (Note: Quackenbush defines genes in "expression space" and experiments in "experiment space".)
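As a small illustration of this representation, the following Python sketch builds a log2-ratio matrix from simulated red and green intensities (numpy is assumed; the array sizes and variable names are illustrative and not taken from the text):

```python
import numpy as np

# Simulated spot intensities for N genes (rows) and M two-color
# hybridizations (columns); in practice these come from image analysis.
rng = np.random.default_rng(0)
red = rng.lognormal(mean=8, sigma=1, size=(1000, 6))    # sample channel
green = rng.lognormal(mean=8, sigma=1, size=(1000, 6))  # reference channel

# Log2 expression ratios: up- and down-regulation become symmetric
# around zero (ratio 2 -> +1, ratio 0.5 -> -1, unchanged -> 0).
X = np.log2(red / green)

print(X.shape)   # (N genes, M experiments)
print(X[:, 0])   # column x_{.,1}: one experiment as a point in gene space
print(X[0, :])   # row x_{1,.}: one gene across the M experiments
```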

2.2 Normalization

The matrix above only contains ratios of red versus green intensities and not the intensities themselves. However, before the intensities can be used to calculate ratios they have to be normalized to adjust for differences in the efficiencies of the fluorescent dyes and in the quantities of initial RNA for the two samples used in the hybridization. There are many widely used methods for normalization. It is important to remember that each method is based on assumptions that may not be valid for any given data set. Two common methods are:

Total intensity normalization. Here one assumes that the total amount of initial mRNA should be the same for the sample of interest and the reference sample. Furthermore, one assumes that most of the genes are unchanged in expression in the sample as compared to the reference. Hence, the average log ratio is expected to be zero. Under these assumptions, a normalization factor can be calculated and used to re-scale the log ratio for each gene in the array.

LOWESS normalization. Here one goes one step further and assumes that a significant fraction of the genes should have log ratios close to zero for any intensity interval. If one plots the log ratio versus the logarithm of the product of the two intensities for all genes in an array, one would then expect the points to follow a straight line with slope zero. The normalization is then carried out by fitting a line to these data using a regression technique called LOWESS (LOcally WEighted Scatterplot Smoothing) regression, and adjusting the intensities by using this fitted line so that the calculated slope becomes zero and the line corresponds to log ratio zero (Fig. 5).

Figure 5: LOWESS normalization. Instead of plotting the log of the red intensity (int1) versus the log of the green intensity (int2) for all the spots on an array, one often rotates the entire plot 45 degrees and plots a so-called M-A plot (A). After the rotation the vertical axis corresponds to the log ratio, $M = \log_2(\text{ratio})$, and the horizontal axis to the average of the log intensities, $A = (\log_{10}(\text{int1}) + \log_{10}(\text{int2}))/2 = \log_{10}\sqrt{\text{int1}\cdot\text{int2}}$. In LOWESS normalization, one fits a line to this plot (B) and uses the fitted line to normalize the data such that the line has slope zero and corresponds to a typical log ratio of zero (C).
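The following is a minimal sketch of LOWESS normalization of a single two-color array, assuming the statsmodels package is available; for simplicity both M and A use log2 (the figure caption uses log10 for A), and all intensities and the smoothing fraction are illustrative:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def lowess_normalize(red, green, frac=0.3):
    """Remove the intensity-dependent trend in the M-A plot of one array."""
    M = np.log2(red) - np.log2(green)          # log ratio
    A = 0.5 * (np.log2(red) + np.log2(green))  # average log intensity
    # Fit a smooth trend M(A) with locally weighted regression.
    trend = lowess(M, A, frac=frac, return_sorted=False)
    # Subtract the trend so the typical log ratio is zero at every intensity.
    return M - trend

# Simulated intensities with a multiplicative dye bias.
rng = np.random.default_rng(1)
green = rng.lognormal(mean=8, sigma=1, size=5000)
red = 1.5 * green * rng.lognormal(mean=0, sigma=0.1, size=5000)

M_norm = lowess_normalize(red, green)
print(np.median(M_norm))  # close to zero after normalization
```

In this sketch, total intensity normalization would correspond to subtracting a single constant (for example the mean of M) instead of the intensity-dependent fitted trend.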

2.3 Quality filtering

In addition to red and green intensities, the typical array image analysis software will provide a lot of statistics for each spot found on the array. These spot statistics can be used to filter out a spot and replace its value in the data representation matrix with a missing value, or to calculate a quality weight associated with each expression value. Quality weights can be used in the further analysis, such that data with high quality will have a larger impact on the results than data of lower quality.

One can for example compare the intensity in a spot (foreground) with the intensity surrounding the spot (background) in terms of a signal-to-noise ratio (SNR):

$$\mathrm{SNR}(c) = \frac{I_{\mathrm{fg}}(c) - I_{\mathrm{bg}}(c)}{\sigma_{\mathrm{bg}}(c)} \qquad (1)$$

where c denotes red or green, I average intensity, σ standard deviation of intensity, fg foreground, and bg background. An example of a simple quality filter is to replace all spots for which SNR(red) or SNR(green) is smaller than three with a missing value, and filter out all genes for which there are more than 20% missing values across the experiments.

In some analysis methods it may be difficult to handle missing values, but requiring each gene to have no missing values will lead to the elimination of very many genes. One way to circumvent this problem is to use impute methods that replace missing values with some calculated value. Simple examples of impute methods include replacing each missing value x_{i,k} with 0, or with the average value of x_{i,·} across experiments.

2.4 Distances

The matrix we use to represent the data can be converted into a distance matrix. This conversion is useful because one can address whether two experiments are similar, or whether two genes behave concordantly in expression across experiments. Also, as we will see, many analysis methods use distance matrices as their starting point. The idea is to convert the data to one number for each pair of experiments (or genes) by using a distance between pairs. We need to define distances (introduce a metric). In gene space, which is common in static applications (how close are two experiments?), Euclidean distances appear natural, but are by no means the only choice. The Euclidean distance between two experiments k and l is:

$$d_{k,l} = \sqrt{(x_{1,k} - x_{1,l})^2 + \ldots + (x_{i,k} - x_{i,l})^2 + \ldots + (x_{N,k} - x_{N,l})^2} \qquad (2)$$

Correspondingly, in sample space the Euclidean distance between two genes i and j is:

$$d_{i,j} = \sqrt{(x_{i,1} - x_{j,1})^2 + \ldots + (x_{i,k} - x_{j,k})^2 + \ldots + (x_{i,M} - x_{j,M})^2} \qquad (3)$$

Euclidean distances might sometimes be misleading. Consider the genes i and j in a time course experiment with the behavior shown in Fig. 6. The Euclidean distance d_{i,j} is large. However, we may think in the biological context that the two genes should be close because they behave in a related way and thus may be involved in the same pathway. If so, the Euclidean distance is misleadingly large. A different distance metric for which the two genes would be close is the correlation-based distance metric. To calculate this distance we must first compute the Pearson correlation C_{i,j} between genes i and j. The averages of genes i and j are denoted $\bar{x}_i$ and $\bar{x}_j$ respectively. One has

$$C_{i,j} = \frac{\sum_{k=1}^{M} x_{i,k}\, x_{j,k} - \left(\sum_{k=1}^{M} x_{i,k}\right)\left(\sum_{k=1}^{M} x_{j,k}\right)/M}{\sqrt{\left(\sum_{k=1}^{M} x_{i,k}^2 - \left(\sum_{k=1}^{M} x_{i,k}\right)^2/M\right)\left(\sum_{k=1}^{M} x_{j,k}^2 - \left(\sum_{k=1}^{M} x_{j,k}\right)^2/M\right)}} = \frac{\langle x_i x_j \rangle - \langle x_i \rangle \langle x_j \rangle}{\sqrt{(\langle x_i^2 \rangle - \langle x_i \rangle^2)(\langle x_j^2 \rangle - \langle x_j \rangle^2)}} \qquad (4)$$

where M is the number of experiments. C_{i,j} is a number between -1 and 1 that in the case of complete correlation between the two genes is unity, for complete uncorrelation is zero, and for complete anti-correlation is -1. To treat correlation and anti-correlation identically, we disregard the sign of C_{i,j} and consider the absolute value |C_{i,j}|. We define the correlation-based distance between the two curves as

$$d_{i,j} = 1 - |C_{i,j}| \qquad (5)$$

In our example above, one gets C_{i,j} = -1 (verify this) and hence d_{i,j} = 0: the two genes are considered close, as desired. In summary, the two genes would be considered far apart using the Euclidean distance and close using the correlation-based distance. There is no given distance measure to use and one has to think about distances for each application. See for a numerical example of a calculation of correlation.

Figure 6: Time courses for two anti-correlated genes. For the two genes the gene expression levels are shown for 5 experiments. Each experiment corresponds to a different time. The time can, for example, be time since cells were synchronized to start the cell cycle.
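As a small numerical illustration of the two distance measures, here is a Python sketch with two made-up, perfectly anti-correlated time courses in the spirit of Fig. 6:

```python
import numpy as np

def euclidean_distance(x, y):
    # Eq. (3): square root of the summed squared differences.
    return np.sqrt(np.sum((x - y) ** 2))

def correlation_distance(x, y):
    # Pearson correlation as in Eq. (4), with the sign ignored as in Eq. (5).
    c = np.corrcoef(x, y)[0, 1]
    return 1.0 - abs(c)

# Two hypothetical, perfectly anti-correlated time courses (cf. Fig. 6).
gene_i = np.array([2.0, 1.0, 0.0, -1.0, -2.0])
gene_j = -gene_i

print(euclidean_distance(gene_i, gene_j))    # about 6.3: far apart
print(correlation_distance(gene_i, gene_j))  # 0.0: the genes are "close"
```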

3 Dimensional reduction and clustering

Armed with distance measures we proceed to dimensional reduction and clustering algorithms. Here the agenda is to turn the large number of numbers we have in gene expression matrices or distance matrices into something more summarizable: either to visualize data points (in gene or experiment space) or to group them together into clusters. The methods described in this section are examples of unsupervised methods: only the microarray data is used in the analysis.

3.1 Principal component analysis

Suppose we want to display each experiment as a point in gene space. We can imagine each of the M experiments as a point in the N-dimensional gene space. However, if we want to get a visual overview, we have to reduce the number of dimensions from N down to, say, two or three. Principal component analysis (PCA) is one method developed for this task. It is likely that the M experiments will not be spread out hyper-spherically in the N-dimensional space; rather there will be one direction along which the experiments are more spread out. This is the axis of the first principal component, and it is the direction that captures the maximum amount of variation in the data. It is likely that this axis will not coincide with one of the gene axes; it will be a sum of gene axes. Next one can look for the direction orthogonal to the first principal component that captures the maximum of the remaining variation in the data; this is the second principal component, and so on. In PCA the coordinate system of gene space is rotated such that the variance is maximized along a few axes.

For a 2-dimensional example see Fig. 7. The experiments are of two types: stars and circles. Star experiments have high expression of gene 1 and gene 2, whereas circle experiments have low expression of gene 1 and gene 2. In Fig. 7 the experiments essentially lie along a line and the two-dimensional data can be reduced to one dimension (the first principal component) without losing a lot of variance: each experiment is described by one value (a combination of the expression levels for genes 1 and 2: G1) instead of two values (the expression levels for genes 1 and 2). We also note that the information needed to discriminate between the two types of samples (stars and circles) is kept in the first principal component. Star experiments have high values of G1 and circle experiments have low values of G1.

For the general case when the number of genes is much larger than two, one can, for example, keep the two first principal components so that the dimension of the data is reduced from N to 2. One can now plot all the experiments in these two dimensions. Whether such a plot will be informative or not depends on whether the variation in the data is dominated by an interpretable effect. In Fig. 8, a PCA analysis of microarray experiments for 63 samples of four kinds of small round blue cell tumors of childhood (SRBCT) is shown.

Figure 7: Rotation of a 2-dimensional gene space to maximize the variance along a new axis. There are 10 experiments of two types: circles and stars. The experiments are plotted in gene space based on their expression levels for two genes: gene 1 and gene 2. PCA rotates gene space to find the direction along which the experiments have the largest variance in expression and this is given by the first principal component: G1. The second principal component is denoted by G2.

Figure 8: Projection of 63 SRBCT samples onto the 3 first principal components. The samples belong to four diagnostic categories, NB (circles), RMS (filled circles), BL (pluses) and EWS (diamonds). There is a tendency of separation between the categories. Of note, the first principal component essentially separates tumor samples (on the right) from cell lines (on the left). From Ringnér et al. in A Practical Approach to Microarray Data Analysis (2002).
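A minimal sketch of such a PCA projection of the experiments using scikit-learn; the data here are random and purely illustrative, and with real data X would be the N x M log-ratio matrix of section 2.1:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 30))   # hypothetical genes x experiments matrix

# The experiments are the data points, so PCA is applied to the M x N
# transpose; each experiment is then described by a few components.
pca = PCA(n_components=3)
projected = pca.fit_transform(X.T)    # shape (M experiments, 3 components)

print(projected.shape)
print(pca.explained_variance_ratio_)  # fraction of variance per component
```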

3.2 K-means clustering

In K-means clustering one needs to preset the number of clusters (K) that experiments or genes should be assigned to. As an example consider Fig. 9, where genes measured in two experiments x_{i,k} (k = 1, 2) are shown. Denote the cluster centers y_m, where m = 1, ..., K. One algorithm for K-means clustering is as follows:

1. Initialize the K cluster centers y_m close to the center of gravity of all x_{i,·}.
2. Pick one x_{i,·} at random.
3. Move the y_m closest to x_{i,·} towards x_{i,·}. More precisely, move the winning cluster center according to:

$$\Delta y_m = \eta\, d_{m,i} \qquad (6)$$

where Δy_m is the change in y_m, d_{m,i} is the distance according to some metric between y_m and x_{i,·}, and η is a step size typically between 0 and 1. Large distances imply large moves and small distances imply small moves.

4. Redo 2 and 3 until convergence, that is, until all cluster centers make moves smaller than some cutoff.

The procedure is schematically illustrated in Fig. 9. In this way every x_{i,·} gets assigned to the cluster y_m that is closest. One can also have fuzzy cluster assignments, where each x_{i,·} in principle belongs to all clusters with different weights computed from the distances: so-called fuzzy K-means. A potential drawback with this method is that one has to know the number of clusters (K) in advance. To some extent this will be remedied in the next subsection, which deals with hierarchical clustering.

Figure 9: A K-means clustering example. K = 3 is illustrated for 17 genes with expression levels measured in two experiments.
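Below is a sketch of the online K-means procedure above in Python, under one common reading of Eq. (6): the winning center is moved a fraction η of the way towards the chosen data point, so that large distances give large moves. The data and all parameter values are illustrative.

```python
import numpy as np

def online_kmeans(X, K=3, eta=0.1, n_steps=5000, seed=0):
    """Online K-means as sketched in section 3.2."""
    rng = np.random.default_rng(seed)
    # 1. Initialize all K centers near the center of gravity of the data.
    centers = X.mean(axis=0) + 0.01 * rng.normal(size=(K, X.shape[1]))
    for _ in range(n_steps):                    # step 4 simplified: fixed number of updates
        x = X[rng.integers(len(X))]             # 2. pick one point at random
        m = np.argmin(np.linalg.norm(centers - x, axis=1))  # winning center
        centers[m] += eta * (x - centers[m])    # 3. move winner towards x
    # Assign every point to its closest center.
    labels = np.argmin(
        np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
    return centers, labels

# 17 hypothetical genes measured in two experiments (cf. Fig. 9).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(6, 2))
               for c in ([0, 0], [3, 0], [0, 3])])[:17]
centers, labels = online_kmeans(X, K=3)
print(centers)
print(labels)
```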

3.3 Hierarchical clustering

There are many variants of hierarchical clustering. We first describe one that closely relates to K-means clustering. Briefly, one proceeds as follows:

1. Perform K-means clustering with K = 2, i.e. divide the data set into two halves.
2. Within each cluster, separately redo 1.
3. Redo the above until the number of clusters is equal to the number of data points.
4. Based upon the sub-cluster assignments, reconstruct a tree, a dendrogram, that has each data point as a leaf.

The above procedure is a top-down procedure, often called divisive. Alternatively, one might start at the bottom, form a pair of the data points with the shortest distance between them, merge the pair into a common representative, and iterate until only one cluster representing all the original data points remains. This bottom-up procedure is called agglomerative. In an agglomerative approach there are many ways to replace a pair of experiments with a common representative. In array analysis it is common to use average linkage, which means each experiment (or gene) gets a distance to a merged pair that is the average of its distances to the two experiments (or genes) that make up the pair. It is important to be aware that different choices may lead to different dendrograms.

In Fig. 10 a dendrogram is shown, where static SRBCT samples are clustered in gene space. Looking from the top down, we see that first the BL samples are separated from the other samples, next the other samples are split into RMS samples versus NB and EWS samples, and so on until each sample is separate. Looking from the bottom up, we see that the two closest samples, EWS-T13 and Test-21 (an EWS), are merged first, and so on until all samples are merged into one root. We note that with hierarchical clustering the number of clusters is not predefined. Instead the clustering results in a tree that can be turned into a set of clusters by cutting the tree, and the number of clusters depends on at which height the tree is cut.

3.4 Self-organizing maps

Self-organizing maps (SOMs) are yet another approach that starts off from K-means clustering. In this case one defines a topology among the clusters, e.g. upon a 2-dimensional grid. This is done such that adjacent grid points represent feature similarities. See Fig. 11 for a 2-dimensional example. The idea is to define which clusters are neighbors and keep this information throughout the clustering, in such a way that neighboring clusters will end up containing data points that are more similar across clusters than non-neighboring clusters. This concept is implemented such that when updating Eq. (6) in the K-means algorithm

not only the winner is updated but also its neighbors on the grid [in the 2-dimensional case there are 4 of them (except for boundary effects)]. Typically the neighbors of the closest cluster are updated with a smaller step-size (η) than the closest cluster. In this way, the clusters are related to one another, which is not the case for K-means clustering.

Figure 10: Hierarchical clustering of SRBCT samples. The four SRBCT types analyzed are Ewing's sarcoma (EWS), rhabdomyosarcoma (RMS), Burkitt's lymphoma (BL), and neuroblastoma (NB). In addition to 83 SRBCT samples there are five additional non-SRBCT samples. From Khan et al. Nature Medicine (2001).

Figure 11: Schematic picture of a self-organizing network on a 2 × 3 grid with 2-dimensional data. The initial geometry of the nodes in the 3 × 2 rectangular grid is indicated by solid lines connecting the nodes. Hypothetical trajectories of the nodes as they migrate to fit the data during successive iterations of the SOM algorithm are shown. Data points are represented by black dots, the six nodes of the SOM by large circles, and trajectories by arrows. From Tamayo et al. PNAS (1999).

This method is particularly useful for time course studies. In Fig. 12 a 5 × 6 SOM is shown for the yeast cell cycle.

Figure 12: A 5 × 6 SOM of the yeast cell cycle. (a) The 828 genes that passed a variation filter were grouped into 30 clusters. Each cluster is represented by the centroid (average pattern) for the genes in the cluster. The expression level of each gene was normalized to have mean = 0 and SD = 1 across time points. Expression levels are shown on the y-axis and time points on the x-axis. Error bars indicate the SD of average expression. n indicates the number of genes within each cluster. Note that multiple clusters exhibit periodic behavior and that adjacent clusters have similar behavior. From Tamayo et al. PNAS (1999).

Here the yeast genes are compressed into 30 clusters; the curves represent averages and the error bars the standard deviations. As can be seen, each of the 30 clusters is behavior-wise connected with its neighbors; other than that it is just standard clustering.
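Returning to section 3.3, here is a minimal sketch of the agglomerative, average-linkage clustering described there, using scipy; the data are random placeholders, and with real data the rows could be samples in gene space or genes in sample space:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 50))   # 20 hypothetical samples in 50-gene space

# Bottom-up clustering with average linkage; the metric could instead be
# 'correlation' to mimic the correlation-based distance of section 2.4.
Z = linkage(X, method='average', metric='euclidean')

# The tree is turned into clusters by cutting it; here we ask for 4 clusters.
labels = fcluster(Z, t=4, criterion='maxclust')
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree with each
# sample as a leaf (plotting requires matplotlib).
```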

3.5 Clustering - so what?

Clustering and PCA provide visual insights into data. More importantly:

1. What biology is learned from cluster exercises on static data [typically in gene space]? For example:
   - Class discovery, for example, disease categories among microarray experiments for samples from different tumors.
   - Investigate which genes are responsible for the classification.
2. What biology is learned from cluster exercises on time course data [typically in experiment space]? For example:
   - Which genes behave similarly across conditions?
   - Do genes that behave similarly have a common upstream motif? If so, one may find regulatory motifs in their regulatory regions.

4 Supervised learning

So far we have discussed analysis approaches that utilize only the gene expression data. Such approaches are called unsupervised methods. However, one often has more information about the experiments than the gene expression data. If so, it may be of value to use this information to guide the analysis, and this approach is called supervised analysis. For example, in static measurements of tumors one may already know a classification of the tumors in advance (e.g. into the two classes poor versus good survival). Such pre-defined classes can be used together with the gene expression data in supervised learning techniques. The goals here are two-fold:

- Based on historical data, develop a classifier that handles test data (supervised classification).
- Determine the genes that are responsible for the classification.

4.1 Discriminatory gene statistics

The aim is to find genes that, based on their expression levels, discriminate between known classes of experiments. Typically, one defines a score that quantifies whether a gene is a good discriminator. The score should reflect whether a gene is differentially expressed between classes of experiments or not. Here we will describe an example of such a score: the signal-to-noise statistic (Golub score). Based upon averages ($\bar{x}_i$) and standard deviations (σ_i) of a gene's expression levels in classes 1 and 2, compute for each gene i the score w_i (see Fig. 13)

$$w_i = \frac{\bar{x}_i(1) - \bar{x}_i(2)}{\sigma_i(1) + \sigma_i(2)} \qquad (7)$$

where gene i's average expression level, $\bar{x}_i(m)$, in class m is calculated by summing over all N_m experiments in class m,

$$\bar{x}_i(m) = \frac{1}{N_m} \sum_{k \in m} x_{i,k} \qquad (8)$$

and the standard deviation is

$$\sigma_i(m) = \sqrt{\frac{1}{N_m} \sum_{k \in m} \left(x_{i,k} - \bar{x}_i(m)\right)^2} \qquad (9)$$

After all w_i are calculated one can generate a ranked list of genes by sorting them according to decreasing values of the score. For any choice of classes there will always be a top-ranked gene. How do we know that the top-ranked genes for the classes of interest are significant? How many top-ranked genes, if any, should be included in a list of important discriminatory genes? Here we will describe using random permutation tests to address this question. We start with describing a permutation test to assess the significance of each gene.
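A minimal numpy sketch of Eqs. (7)-(9) and of the label-permutation test described in the steps below; the data, class sizes and number of permutations are all illustrative:

```python
import numpy as np

def golub_scores(X, labels):
    """Signed signal-to-noise statistic of Eq. (7) for every gene (row of X);
    labels holds one class label (1 or 2) per experiment (column of X)."""
    c1, c2 = X[:, labels == 1], X[:, labels == 2]
    return (c1.mean(axis=1) - c2.mean(axis=1)) / (c1.std(axis=1) + c2.std(axis=1))

def permutation_pvalues(X, labels, n_perm=1000, seed=0):
    """P(g): fraction of random labelings with a score >= the true score."""
    rng = np.random.default_rng(seed)
    true = golub_scores(X, labels)
    count = np.zeros(len(true))
    for _ in range(n_perm):
        count += golub_scores(X, rng.permutation(labels)) >= true
    return (count + 1) / (n_perm + 1)   # avoids P = 0 with finite permutations

# Hypothetical data: 1000 genes, 10 + 10 experiments in two classes,
# with the first 20 genes made higher in class 1.
rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 20))
labels = np.array([1] * 10 + [2] * 10)
X[:20, :10] += 2.0

p = permutation_pvalues(X, labels)
print(np.argsort(p)[:10])   # top-ranked genes, mostly among the first 20
```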

1. Permute the experiment labels (class 1 and class 2).
2. Calculate a weight for each gene for the random labels.
3. Redo items 1 and 2 many times.
4. For each gene (g), calculate P(g), the probability that the gene got a weight equal to or larger than the weight it got for the true labeling.

Figure 13: Illustration of the Golub score. The expression level for gene i is shown for 20 experiments belonging to two classes. The numerator and denominator in the Golub score are indicated with arrows. A large difference between the average expression in the two classes results in a large numerator and a large score. Smaller standard deviations within the classes result in a smaller denominator and a larger score.

The Golub scores for one gene, g, for the true and random labelings, respectively, are shown in Fig. 14A. From these one can compute the P-value for the gene: P(g). There are many other statistics that can be used to quantify whether a gene is significantly differentially expressed between classes of experiments, for example the t-test and the Mann-Whitney test. For many statistics it is possible to analytically calculate a P-value without having to resort to permutation tests. However, for such analytical P-values to be valid, assumptions regarding the expression levels typically have to be fulfilled. Often they are not fulfilled for microarray data. Moreover, there are many cases in the analysis of genomic data when one has to resort to permutation tests. Because permutation tests are widely used, we have decided to use them here in our discussion of assessing the significance of discriminatory genes.

However, even if the statistical significance of each gene is assessed, there are very many genes. We are performing very many P-value calculations, and the probability to get one P-value that is smaller than some small threshold purely by random chance may be large. Even if the chance

of winning on a lottery ticket is small, you are likely to be the winner if you buy most of the tickets. A standard approach to correct the significance when multiple tests are performed is the Bonferroni correction. For the Bonferroni correction, you multiply the P-values with the number of genes analyzed: $P_{\mathrm{Bonferroni}} = \min(P \cdot N, 1)$. Among all the genes for which P_Bonferroni is below a chosen threshold, the chance of any false positive is at most that threshold. However, in gene expression analysis we are not interested in rejecting a set of 100 genes thought to be of interest, even if there is a large probability that one of the genes is irrelevant: the Bonferroni correction is too strong for our purposes.

Therefore, false discovery rates (FDR) are often used in gene expression analysis. Instead of controlling the chance of any false positives (as the Bonferroni method does), the FDR controls the expected proportion of false positives. The FDR for a given P-value is $\mathrm{FDR} = N \cdot P / N_P$, where $N_P$ is the number of genes with P-values smaller than P. Hence, if we find 100 genes with P-values smaller than a cutoff P such that N·P = 10, then the FDR is 10%. Now we can decide on how many top-ranked genes to include in our discriminatory gene list by requiring the FDR to be relatively low.

Figure 14: Random permutation tests to calculate the significance of discriminatory genes. (A) For a single gene, the score for the true labeling (vertical line) and the distribution of scores for random labelings are shown. The gene has a Golub score of almost 1.5 for the true labeling, whereas the gene gets a larger score less than once per thousand permutations (P < 0.001). (B) Distributions of scores of all genes for true and random labelings. From the random distribution we expect less than five genes on average for a random labeling to have a score larger than unity. For the classes of interest we find more than 250 genes with a score larger than unity. Therefore, if we select the 250 genes with scores larger than unity, the estimated FDR in this gene list is less than 2.5%.

An advantage of calculating a P-value for each gene separately is that it allows each gene to have a different distribution of expression values. However, the number of experiments used in the permutation analysis is typically small, which may result in a

poor estimation of the distributions. Therefore it may be worthwhile to perform a global permutation test as shown in Fig. 14B.

Note that incorporating a selection of discriminatory genes into an analysis procedure will result in a supervised procedure. For example, an unsupervised method such as hierarchical clustering can be a component in both unsupervised and supervised analysis procedures. Using hierarchical clustering to cluster experiments based on all genes measured is an unsupervised procedure: only the microarray data is utilized. Using hierarchical clustering to cluster experiments based on the top-10 genes found to separate two known classes of experiments using the Golub score is a supervised procedure: a known classification is used to rank the genes and the top genes are used to drive the clustering. Importantly, clustering will result in different clusters depending on which genes are used in the analysis.

4.2 Training and validation

If you have a set of experiments belonging to classes 1 and 2, can you build a classifier that classifies these experiments into the two classes? Since there are many more genes than experiments (N >> M), this will most likely be easy. First, go through all the genes and find one with higher expression in all samples belonging to class 2 as compared to class 1. Second, find a threshold expression level for this gene such that if the expression level is above the threshold the experiment belongs to class 2. See Fig. 13, where an experiment belongs to class 2 if the expression level of the gene is above 1.5 and to class 1 otherwise. Now you have a classification rule that classifies your experiments correctly. The reason this approach will typically produce good results is that the number of genes is so large that it is likely that there will be a good discriminatory gene by chance. The problem with this method is that you have selected a gene that fits your data, and it is unlikely that you have found a general rule that will produce good results on new experiments not used for constructing your classification rule. You are over-fitting! To build a general method and avoid over-fitting, one has to consider two things:

- The classifier should use fewer estimated parameters than the number of experiments used to construct the classification rule.
- Validate the classifier by testing it on an independent data set that was not used to construct the classification rule.

4.3 Nearest neighbor

A simple form of a classifier is the k nearest neighbor classifier. It works as follows: for each experiment, find the k nearest neighbors according to some distance metric and classify the experiment according to the majority class among the k neighbors. Let us consider an example with three experiments. For each experiment we have measured the expression levels for two genes. The first experiment is known to belong to class 1 and

the second to class 2. The class of the third experiment is not known. We want to predict the class of the third experiment using a nearest neighbor classifier with k = 1 and Euclidean distance. The gene expression data matrix is:

$$
\left( x_{\cdot,1} \;\; x_{\cdot,2} \;\; x_{\cdot,3} \right)
=
\begin{pmatrix} x_{1,\cdot} \\ x_{2,\cdot} \end{pmatrix}
=
\begin{pmatrix} 1 & -1 & -2 \\ -3 & 1 & 1 \end{pmatrix}
$$

To find the nearest neighbor of experiment 3 we have to calculate its distance to the other two experiments. The Euclidean distance between experiments 1 and 3 is

$$d_{1,3} = \sqrt{(1 - (-2))^2 + (-3 - 1)^2} = 5$$

and between experiments 2 and 3

$$d_{2,3} = \sqrt{(-1 - (-2))^2 + (1 - 1)^2} = 1$$

We conclude that experiment 2 is the nearest neighbor of experiment 3. Because experiment 2 belongs to class 2, we predict that experiment 3 also belongs to class 2.

The parameters to optimize in the classification rule include k, the choice of distance metric, and perhaps which genes to include in the distance calculation. In this optimization process it is beneficial to split the data into three sets: training, validation, and test. First classify the experiments in the validation set based on their k nearest neighbors in the training set and optimize the classification performance by tuning the parameters. Select the optimal values for the parameters and use this classifier to predict the classes of the experiments in the test set. Since the validation set was used to optimize the classifier, it is not an independent set. Hence the need for a test set to evaluate the general performance of the classifier.
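The worked example above can be reproduced with a few lines of numpy (k = 1 and Euclidean distance, as in the text; a sketch, not a general classifier):

```python
import numpy as np

# Columns are the three experiments, rows are the two genes.
X = np.array([[1.0, -1.0, -2.0],    # gene 1
              [-3.0, 1.0, 1.0]])    # gene 2
known_classes = {0: 1, 1: 2}        # experiment 1 is class 1, experiment 2 is class 2

query = X[:, 2]                     # experiment 3, class unknown
dists = np.linalg.norm(X[:, :2] - query[:, None], axis=0)
print(dists)                        # [5. 1.], as computed above

nearest = int(np.argmin(dists))     # k = 1: the single closest experiment
print("predicted class:", known_classes[nearest])   # class 2
```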

4.4 Machine learning approaches

The analysis above assumes that the classification depends upon the genes one-by-one. How about collective dependencies? A number of machine learning approaches have been used to classify gene expression profiles in ways that can incorporate collective effects of genes. In particular, support vector machines (SVMs) and artificial neural networks (ANNs) have been used. Here, we employ multilayer perceptrons (MLPs) to give an example of how to model the data to account for more general dependencies among genes. MLPs belong to the family of ANNs (see Mattias' notes in BIM083). For schematic 2-dimensional examples of linear and non-linear networks see Fig. 15. MLPs provide a way of modeling a mapping from input to output. In the case of gene expression data, the input is the gene expression levels for an experiment and the output is the class the experiment belongs to. In MLPs this mapping is modelled in terms of a mathematical function that contains a lot of adjustable parameters.

Figure 15: Examples of linear and nonlinear separations of 2-dimensional spaces. Experiments belonging to two classes (blue and red colored, respectively) are shown in 2-dimensional gene space. If the classes can be separated by a line, the classification rule can be learned with a simple perceptron; no hidden layer is needed (left). More general classification rules, required when the classes are not linearly separable, can be learned using perceptrons with an additional hidden layer (right).

Figure 16: MLP + PCA architecture. O(10000) gene expression values are reduced by PCA to O(10) principal components, which serve as inputs to an MLP with a hidden layer; the links represent the weights (parameters) and the output encodes the class.

The values of these parameters are determined using training data. In the networks in Fig. 15, each link represents one such parameter. To avoid over-fitting when calibrating such models (that is, when determining parameter values based on training data), the number of parameters (links in the network) should not exceed the number of experiments. With O(10000) genes (number of inputs) per experiment, we are in trouble. Hence, one needs a way to reduce the number of genes to a smaller number prior to calibrating the models. Here it is convenient to use principal component analysis. The architecture of such an integrated PCA + MLP approach is shown in Fig. 16. Once the MLP has been calibrated, it can be used to classify independent test experiments.

If an MLP is a good classifier it may be interesting to investigate which genes were important for the classification performance.

We can use the fact that the MLP is a mathematical function from inputs to output. By computing the derivative of the output (o) with respect to x_{i,·}, a sensitivity S_i can be defined:

$$S_i = \sum_{\mathrm{samples}\,(k)} \frac{\partial o}{\partial x_{i,\cdot}(k)} \qquad (10)$$

A large S_i is obtained for a gene i for which changes in the expression level result in changes in the output: an important gene. Based upon S the genes can be ranked. The above procedure can be generalized to multi-class instances (multiple outputs).

Figure 17: MLP prediction of ER status. For each experiment the average ANN output (with standard deviation as error bar) is plotted. ER+ experiments are blue and ER− experiments are yellow. The horizontal lines separate training experiments from test experiments. The vertical lines separate experiments classified as ER+ from experiments classified as ER− by the MLPs. The MLPs were calibrated using the top 100 ranked genes (left) and the top ranked genes (right), respectively.

Example: Estrogen Receptor Status. Estrogen is an important regulator in the development and progression of breast cancer and regulates gene expression via the estrogen receptor (ER). Microarray images of node-negative sporadic breast tumors were investigated with respect to ER status (+ or −) using microarray data for 23 ER+ and 24 ER− tissues and 11 blind test samples. 8 PCA components and a committee of 200 MLP models were used, sensitivities were computed, and genes were ranked. The ER status of the blind test samples was predicted with 100% accuracy using all genes on the arrays. For results confined to the top-100 genes and the top ranked genes, see Fig. 17. We note that even if we exclude the information in the top 300 genes, the ER status can still be predicted with some accuracy. We conclude that ER status influences the expression of very many genes in breast cancer.
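Below is a rough sketch of the PCA + MLP architecture of Fig. 16 using scikit-learn; the data are simulated, the sample counts loosely mimic the ER example (47 tumors, 11 held out), and this single small network merely stands in for the committee of 200 MLPs and the sensitivity analysis used in the text:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Simulated data: 47 samples x 5000 genes with a binary ER-like label,
# where a subset of genes carries the class signal.
rng = np.random.default_rng(5)
X = rng.normal(size=(47, 5000))
y = rng.integers(0, 2, size=47)
X[y == 1, :50] += 1.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=11, random_state=0)   # hold out a test set, as in 4.2

# PCA reduces O(10000) inputs to a handful of components so that the MLP
# has far fewer parameters than there are training experiments.
model = make_pipeline(
    PCA(n_components=8),
    MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000, random_state=0))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```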

5 Integrating expression data with other data

5.1 Different types of array data

In addition to hybridizing the total mRNA contents of cells in a sample to investigate expression profiles, one can use microarrays to hybridize other sequences. For example, both the DNA sequences to which a given transcription factor binds (ChIP-on-chip) and the total DNA contents of cells can be hybridized onto arrays (comparative genomic hybridization; CGH). Why would one do this?

In ChIP-on-chip the idea is to have arrays where each spot contains the regulatory sequence of a gene. Next, the DNA sequences onto which a specific transcription factor is bound are extracted from a sample and hybridized onto the arrays. In this way, a genome-wide profile of which genes are potential targets of the transcription factor is obtained.

In the case of CGH, would we not expect a ratio of 1, corresponding to having the same number of copies of each gene in every cell? Perhaps this is often the case, but not for tumor cells and not when comparing closely related genomes, for which gene duplications may be an important evolutionary mechanism. In these cases, array-based CGH is a valuable tool for investigations. In solid tumors, there is uncontrolled growth and many of the control mechanisms that ensure that DNA is copied properly are broken down. This results in regions of DNA that are deleted or amplified into multiple copies. In Fig. 18, the copy number ratios obtained from array-based CGH using a breast cancer cell line are shown for all clones on a microarray sorted in genomic order (along chromosomes), and genomic regions of copy number changes are visible.

As an example of an analysis approach that integrates different types of array data, let us investigate the impact of copy number changes on expression levels. We consider a set of breast cancer cell lines for which both gene expression and copy number profiles were obtained using the same microarrays. In the lower part of Fig. 18, the expression levels are utilized to color the clones in the copy number ratio plot. We note that large copy numbers tend to correlate with high expression: amplification of genes often leads to high expression. To investigate this in more detail we can calculate the fraction of genes with high copy numbers that also are highly expressed, and vice versa (Fig. 19). In particular, we find that more than 40% of the genes with copy number ratios above 2.5 are among the most highly expressed genes in these samples.

Finally, let us describe an approach to discover genes having a significant correlation between copy number and expression level in these samples. First we use the copy number data for a gene to divide the experiments into two classes: one class containing the experiments where the gene is amplified (has a copy number ratio larger than some cutoff) and one class containing the remaining experiments. For these copy number classes, we calculate the Golub score for the gene based on its expression levels. A random permutation test is performed for this single gene to see how often a larger Golub score can be obtained than for the copy number classes: a P-value is calculated (see Fig. 14B). Note that each gene is analyzed individually and that the partition of the experiments into two classes may be different for every gene. In Fig. 20, the expression levels and copy numbers across 14 breast cancer cell lines are shown for 50 genes with P < 0.05.
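A compact sketch of this per-gene integration of copy number and expression, using the signal-to-noise statistic and a permutation P-value; the amplification cutoff of 1.5 follows the threshold quoted in the caption of Fig. 19, while the data and the number of permutations are illustrative:

```python
import numpy as np

def snr(expr, amplified):
    """Signal-to-noise statistic between amplified and non-amplified samples."""
    a, b = expr[amplified], expr[~amplified]
    return (a.mean() - b.mean()) / (a.std() + b.std())

def copy_number_pvalue(expr, cn_ratio, cutoff=1.5, n_perm=1000, seed=0):
    """Permutation P-value for one gene, with the two classes defined by
    that gene's own copy-number ratios (amplified vs. not amplified)."""
    rng = np.random.default_rng(seed)
    amplified = cn_ratio > cutoff
    if amplified.all() or not amplified.any():   # need samples in both classes
        return 1.0
    true = snr(expr, amplified)
    count = sum(snr(expr, rng.permutation(amplified)) >= true
                for _ in range(n_perm))
    return (count + 1) / (n_perm + 1)

# One hypothetical gene measured in 14 cell lines, amplified in 4 of them,
# with expression that tracks the copy number.
rng = np.random.default_rng(6)
cn = np.concatenate([rng.lognormal(0.0, 0.2, 10), rng.lognormal(1.0, 0.2, 4)])
expr = np.log2(cn) + rng.normal(0.0, 0.2, 14)
print(copy_number_pvalue(expr, cn))   # small P: copy number and expression agree
```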

Figure 18: Genome-wide copy number and expression analysis in the MCF-7 breast cancer cell line. The copy number ratios were plotted as a function of the position of the cDNA clones along the human genome. Top figure: individual data points are connected with a line, and a moving median of 10 adjacent clones is shown. Red horizontal line, the copy number ratio of 1.0. Bottom figure: individual data points are labeled by color coding according to cDNA expression ratios. The bright red dots indicate the upper 2%, and dark red dots the next 5%, of the expression ratios (over-expressed genes); bright green dots indicate the lowest 2%, and dark green dots the next 5%, of the expression ratios (under-expressed genes); the rest of the observations are shown with black crosses. The chromosome numbers are shown at the bottom of the figure, and chromosome boundaries are indicated with a dashed line. Adapted from Hyman et al. Cancer Research (2002).

5.2 Functional annotation data

Once a list of genes of interest is generated (for example, significant discriminatory genes separating two classes of experiments), a natural follow-up is to investigate the functions of the genes. One approach is to associate genes with terms in the Gene Ontology (GO). The GO is three structured, controlled vocabularies (ontologies) that describe how gene products behave in a cellular context and in a species-independent manner. The three vocabularies of defined terms to describe gene product attributes are biological process, cellular component and molecular function. Controlled vocabulary means that each part of the GO contains a set of well-defined terms. Examples of terms in the GO vocabulary are the biological process metabolism, the cellular component cytoplasm, and the molecular function DNA binding. Structured vocabulary means that the relationships between terms are defined (as a directed acyclic graph). For example, both retinol metabolism and metabolism are terms in the biological process vocabulary but they are related: retinol metabolism is a metabolism. This relationship can be seen in Fig. 21.

Figure 19: Impact of gene copy number on global gene expression levels. A, percentage of over- and under-expressed genes according to copy number ratios. (Threshold values used for over- and under-expression were > 2.184, the global upper 7% of the cDNA ratios, and < , the global lower 7% of the expression ratios.) B, percentage of amplified and deleted genes according to expression ratios. Threshold values for amplification and deletion were > 1.5 and < 0.7. From Hyman et al. Cancer Research (2002).

In addition to providing the ontologies, the GO provides annotations of gene products: gene products are associated with terms in the GO. To investigate whether K top-ranked genes are significantly associated with a GO term of interest, one can generate a 2x2 contingency table containing the numbers of genes as follows:

                         Associated with GO term   Not associated with GO term
  Top-ranked genes       n1 = n                    n2 = K - n
  Not top-ranked genes   n3 = G - n                n4 = N - G - K + n

where the total number of genes in the analysis is N and there are G genes associated with the GO term analyzed. Given this table one can calculate the odds-ratio

$$\mathrm{OR} = \frac{n_1/n_2}{n_3/n_4} \qquad (11)$$

An odds-ratio that is larger (smaller) than unity indicates that the GO term is over- (under-) represented among the top-ranked genes. Random permutation tests can be used to assess the significance of this indication by, for example, randomly selecting K genes many times and calculating the odds-ratio for every selection to see how often one gets an odds-ratio as extreme as the original one. One tool that can perform this analysis is GOMiner.

Consider the following example. We have an array with 10,000 genes and are investigating tumor samples. Our analysis found 100 genes that were significant discriminators between two classes of experiments. Suppose that 90 of these 100 genes were associated with the GO term invasive growth. With 90% of our interesting genes annotated as involved in invasive growth, it is tempting to conclude that one of our classes of tumors is more aggressive than the other. However, whether 90% is a surprisingly large or an expected number depends on the total number of genes on the arrays associated with invasive growth. If this number is very large, for example 9,000, then OR = (90/10)/(8910/990) = 1. In this case getting 90% invasive growth genes among our discriminators is expected. However, if the number is smaller, for example 1,080, then OR = (90/10)/(990/8910) = 81. In both cases, 90% of our discriminatory genes are associated with invasive growth, but only in the second case would we conclude that this is important for our two tumor classes.

5.3 Genome sequence data

For follow-up analysis of gene lists it may be of interest to, for each gene, extract its position in the genomic sequence of the organism studied (complete sequences of genomes are now available for a large number of organisms). A simple example of such analysis is to investigate whether discriminatory genes are significantly associated with a particular genomic region, in the same way as described for GO terms in the previous section. This analysis may be of most interest when analyzing array-based CGH data: to find classes of experiments significantly associated with regions of gains and losses.

In many analysis methods, genes are discovered that share the same expression pattern across a set of experiments. One explanation for co-expression of genes may be that the genes are regulated in a similar way and share a binding site for a transcription factor in their upstream genomic region. One can extract all sequences upstream of genes and see if there is an over-representation of a regulatory element (a pattern in sequence to which a transcription factor binds) among the genes of interest as compared to a negative control group. See Fig. 22 for an example of such analysis, where six regulatory elements are investigated for association with five different cell-cycle regulated clusters of genes in yeast. There are many difficulties in this kind of analysis, including that the extracted upstream sequence may not contain the promoter region (transcription start sites are not always known) and that regulatory elements are very degenerate (not fully conserved sequence motifs).
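A minimal sketch of the odds-ratio calculation of Eq. (11), using the invasive-growth numbers from the example in section 5.2; scipy's Fisher exact test is shown as a standard analytical alternative to the permutation test suggested in the text:

```python
from scipy.stats import fisher_exact

# Example from section 5.2: N = 10,000 genes on the array, K = 100 top-ranked
# genes, n = 90 of them annotated with the GO term, G = 1,080 annotated in total.
N, K, G, n = 10_000, 100, 1_080, 90
table = [[n, K - n],                # top-ranked: with / without the GO term
         [G - n, N - G - K + n]]    # not top-ranked: with / without the GO term

odds_ratio = (table[0][0] / table[0][1]) / (table[1][0] / table[1][1])
print(odds_ratio)                   # 81.0, as in the example above

# Fisher's exact test gives a P-value for over-representation directly.
_, p_value = fisher_exact(table, alternative='greater')
print(p_value)
```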

Figure 20: List of 50 genes with a statistically significant correlation, P-value < 0.05 (denoted by Alpha in the figure), between gene copy number and gene expression (from permutation tests using the signal-to-noise statistic and defining two classes (gene amplified in a sample or not amplified) separately for each gene). The name, chromosomal location, and P-value for each gene are indicated. The genes have been ordered according to their position in the genome. The color maps on the right illustrate the copy number and expression ratio patterns in the 14 cell lines. The key to the color code is shown at the bottom of the graph. Gray squares, missing values. From Hyman et al. Cancer Research (2002).