Estimating Cell Cycle Phase Distribution of Yeast from Time Series Gene Expression Data

Similar documents
THE development of microarray technology has supplied

CHAPTER 20 DNA TECHNOLOGY AND GENOMICS. Section A: DNA Cloning

Paper DP04 Statistical Analysis of Gene Expression Micro Arrays John Schwarz, University of Louisville, Louisville, KY

2/23/16. Protein-Protein Interactions. Protein Interactions. Protein-Protein Interactions: The Interactome

Stefano Monti. Workshop Format

Chapter 3. DNA Replication & The Cell Cycle

COMPUTATIONAL ANALYSIS OF MICROARRAY DATA

A Propagation-based Algorithm for Inferring Gene-Disease Associations

Feature Selection of Gene Expression Data for Cancer Classification: A Review

Engineering Genetic Circuits

EECS730: Introduction to Bioinformatics

Exploring Similarities of Conserved Domains/Motifs

Introduction to Molecular Biology

Analysis of Microarray Data

Examination Assignments

ALGORITHMS IN BIO INFORMATICS. Chapman & Hall/CRC Mathematical and Computational Biology Series A PRACTICAL INTRODUCTION. CRC Press WING-KIN SUNG

Functional Genomics Overview RORY STARK PRINCIPAL BIOINFORMATICS ANALYST CRUK CAMBRIDGE INSTITUTE 18 SEPTEMBER 2017

O C. 5 th C. 3 rd C. the national health museum

Identifying Candidate Informative Genes for Biomarker Prediction of Liver Cancer

Methods of Biomaterials Testing Lesson 3-5. Biochemical Methods - Molecular Biology -

Study On Clustering Techniques And Application To Microarray Gene Expression Bioinformatics Data

Introduction to Bioinformatics and Gene Expression Technologies

Recent technology allow production of microarrays composed of 70-mers (essentially a hybrid of the two techniques)

7 Gene Isolation and Analysis of Multiple

Modeling of Protein Production Process by Finite Automata (FA)

Lesson Overview DNA Replication

Microarray Gene Expression Analysis at CNIO

Microarray Technique. Some background. M. Nath

Genetics and Heredity. Mr. Gagnon

Metabolic Networks. Ulf Leser and Michael Weidlich

Neural Networks and Applications in Bioinformatics. Yuzhen Ye School of Informatics and Computing, Indiana University

The Cell Theory: A Brief History

Introduction to Microarray Data Analysis and Gene Networks. Alvis Brazma European Bioinformatics Institute

GENETICS. I. Review of DNA/RNA A. Basic Structure DNA 3 parts that make up a nucleotide chains wrap around each other to form a

Introduction to Bioinformatics. Fabian Hoti 6.10.

Classification and Learning Using Genetic Algorithms

Recombinant DNA Technology. The Role of Recombinant DNA Technology in Biotechnology. yeast. Biotechnology. Recombinant DNA technology.

Analysis of Microarray Data

Section 10. Junaid Malek, M.D.

AGRO/ANSC/BIO/GENE/HORT 305 Fall, 2016 Overview of Genetics Lecture outline (Chpt 1, Genetics by Brooker) #1

Genetics Lecture 21 Recombinant DNA

3.1.4 DNA Microarray Technology

V 1 Introduction! Fri, Oct 24, 2014! Bioinformatics 3 Volkhard Helms!

Sequence Analysis Lab Protocol

Péter Antal Ádám Arany Bence Bolgár András Gézsi Gergely Hajós Gábor Hullám Péter Marx András Millinghoffer László Poppe Péter Sárközy BIOINFORMATICS

ANALYSISING OF DNA MICROARRAY DATA USING PRINCIPLE COMPONENT ANALYSIS (PCA)

Biotechnology. Chapter 20. Biology Eighth Edition Neil Campbell and Jane Reece. PowerPoint Lecture Presentations for

A PROTEIN INTERACTION NETWORK OF THE MALARIA PARASITE PLASMODIUM FALCIPARUM

Lecture 1. Bioinformatics 2. About me... The class (2009) Course Outcomes. What do I think you know?

Zool 3200: Cell Biology Exam 2 2/20/15

Bioinformatics 2. Lecture 1

6.2 Chromatin is divided into euchromatin and heterochromatin

1999 Nature America Inc. Systematic determination of genetic network architecture

Molecular Cell Biology - Problem Drill 11: Recombinant DNA

Exploration and Analysis of DNA Microarray Data

Discovering Distinct Patterns in Gene Expression Profiles

AP Biology Gene Expression/Biotechnology REVIEW

8/21/2014. From Gene to Protein

Cells and Tissues. Overview CELLS

Introduction to BioMEMS & Medical Microdevices DNA Microarrays and Lab-on-a-Chip Methods

What is Bioinformatics? Bioinformatics is the application of computational techniques to the discovery of knowledge from biological databases.

BIOTECHNOLOGY. Course Syllabus. Section A: Engineering Mathematics. Subject Code: BT. Course Structure. Engineering Mathematics. General Biotechnology

Copyright. Daifeng Wang

Humboldt Universität zu Berlin. Grundlagen der Bioinformatik SS Microarrays. Lecture

Data Mining and Applications in Genomics

Outline. Array platform considerations: Comparison between the technologies available in microarrays

Lees J.A., Vehkala M. et al., 2016 In Review

Fundamentals of Bioinformatics: computation, biology, computational biology

Lecture 1 Introduction to Micorarrays and Concepts of Molecular Biology

Genome Sequence Assembly

If Dna Has The Instructions For Building Proteins Why Is Mrna Needed

Convolutional Kitchen Sinks for Transcription Factor Binding Site Prediction

Chromosomes. Chromosomes. Genes. Strands of DNA that contain all of the genes an organism needs to survive and reproduce

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

REGULATION OF GENE EXPRESSION

Network System Inference

Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Supplementary Material

Protein Synthesis. OpenStax College

Biology 252 Nucleic Acid Methods

Biomedical Big Data and Precision Medicine

Cell Division. Use Target Reading Skills. This section explains how cells grow and divide.

CHAPTERS , 17: Eukaryotic Genetics

Quality Measures for CytoChip Microarrays

Enhanced Biclustering on Expression Data

CELLULAR PROCESSES; REPRODUCTION. Unit 5

DNA RNA PROTEIN. Professor Andrea Garrison Biology 11 Illustrations 2010 Pearson Education, Inc. unless otherwise noted

DNA Microarrays Introduction Part 2. Todd Lowe BME/BIO 210 April 11, 2007

Pre-Lab: Molecular Biology

Today. Last time. Lecture 5: Discrimination (cont) Jane Fridlyand. Oct 13, 2005

Introduction to gene expression microarray data analysis

DNA is the genetic material. DNA structure. Chapter 7: DNA Replication, Transcription & Translation; Mutations & Ames test

Technical University of Denmark

DNA Structure & the Genome. Bio160 General Biology

DNA Microarrays and Computational Analysis of DNA Microarray. Data in Cancer Research

Lecture 20: Drosophila melanogaster

DNA Microarray Technology

Nima Hejazi. Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi. nimahejazi.org github/nhejazi

Outline. Analysis of Microarray Data. Most important design question. General experimental issues

Machine learning in neuroscience

Biology 201 (Genetics) Exam #3 120 points 20 November Read the question carefully before answering. Think before you write.

Transcription:

2011 International Conference on Information and Electronics Engineering IPCSIT vol.6 (2011) (2011) IACSIT Press, Singapore Estimating Cell Cycle Phase Distribution of Yeast from Time Series Gene Expression Data Jhoirene Clemente 1, Julie Ann Salido 1 and Raissa Relator 2 1 Algorithms and Complexity Laboratory, Department of Computer Science; 2 Institute of Mathematics, University of the Philippines Diliman Abstract. Two standard methods for estimating the cell cycle phase distribution of a yeast population are budding index analysis and one that uses fluorescence-activated cell sorter (FACS). However, both of these methods will require wet lab procedures. In this research, we propose a method that estimates the cell cycle phase distribution from the time series gene expression data. Keywords: Cell cycle phase distribution, FACS, budding index analysis, time series gene expression data, spectral clustering, kernel k-means clustering. 1. Introduction The development of microarray technology has supplied a large amount of data to the field of Bioinformatics. This technique is a key technology that facilitates genome-wide analysis of gene expression levels for gene function discovery and biomedical applications. This data in its raw form is difficult to understand, thus there is a need to apply data mining techniques to aid in analysis. Furthermore, analysis without biological significance is useless. So to assert the results made by in silico procedures, information about the samples are needed. This information is usually obtained by performing specific laboratory techniques. Gene expression data is highly dependent on the state of the sample. The state may be the current cell cycle phase, phenotypic trait, or the tissue where the samples were taken. A single sample may have different gene expressions through time, thus this leads to the analysis of time series gene expression data. Examples of time series data analysis are the studies made by [1, 7] where their purpose was to identify sets of genes that are periodically expressed at specific phases of a cell cycle in yeast. It was also necessary for both studies to specify the cell cycle phase at each timepoint. To identify the said information, the method used in [1] considered the size of the buds, the cellular position of the nucleus, and standardization of more than 20 transcripts whose mrna fluctuations have been previously reported. Identification of the cell cycle phase distribution of yeast cells is also presented in [6] where two methods were proposed: one based on FACS and the other using budding index analysis, both of which require wet lab procedures. Details of these two methods will be discussed in Section 2.3. Gene expression analysis results are highly dependent on basic information about samples, and not all available time series gene expression data include these information. In this paper we intend to develop a method of approximating the cell cycle phase distribution based on the time series gene expression data. 2. Definitions and Basic Notations 2.1. Gene Expression Data Genes are the basic hereditary unit of living organisms. These are encoded in the chromosomes of an individual and dictate the biological processes which are carried out by proteins in a cell. Protein synthesis is dependent on the gene expression of an organism and gene expressions are measured using DNA microarrays. A microarray is a tiny square array with thousands of probes, whose expression values are measured through their luminosity. Each probe corresponds to a specific gene of interest. A microarray chip 105

contains the expression levels of approximately 20,000 gene transcripts [3]. The amount of gene expressed dictates how much proteins are synthesized and therefore responsible for the biochemical interactions taking place inside the cell. Data mining techniques for gene expression data has long been studied in the literature. However, interpretations are still based from biological significance and are dependent on biological information of the sample. 2.2. Yeast Cell Cycle The eukaryotic cell cycle is an ordered and periodic set of events that includes growth and synthesis of biochemical substances essential for living. The stages of the cycle are ordered as follows: pre-synthetic gap (G1), Synthesis (S), post-synthetic gap (G2), and Mitosis (M). During the G1 phase, the cell grows and takes in nutrients in preparation for the synthesis stage, and produces enzymes needed for DNA replication. The synthesis, also called the DNA synthesis phase, is marked by DNA replication, transcription and protein synthesis. The amount of DNA is effectively doubled after the synthesis phase. At the second gap phase, the cell grows and takes in nutrients as preparation for mitosis. Enzymes necessary for the production of microtubules are synthesized, along with other proteins needed for the cell division. The cell division stage is further characterized by 4 events: prophase, metaphase, anaphase and telophase. After the last phase, two daughter cells are formed. 2.3. Estimation of Cell Cycle Phase Distribution The first of the two computational methods for estimating the cell cycle phase distribution of a budding yeast cell population presented in [6] uses a device called fluorescent-activated cell sorter (FACS). This method is based on the fact that the cell cycle phase is dependent on the amount of the DNA present in the cell. For instance, if the amount of DNA during the G1 phase is n, then the DNA amount during the G2 is 2n, while the amount of DNA during S is somewhere in between the amounts present during G1 and G2. The main purpose of the FACS device is to measure the DNA content in a cell [6]. Hence, this method produces an age distribution of the cell population as an output. Based from a histogram, the cell cycle phase is identified by manually marking the range for each phase. Thus, the variability of the DNA amount in the S phase makes it difficult to find a good estimate. The second distribution estimation method, budding index analysis, applies an automated image analysis method. The goal is to detect cells that are focused relatively well, and to completely ignore cells that are poor in focus. This method determines the total number of cells and the size of the bud of each cell. There are 3 classifications of the cells: cells without buds, cells with small buds, and cells with large buds. These classes correspond to the cell cycles G1, S and G2/M respectively. Based on the image analysis, there should emerge from the estimation distribution a similar alignment in the cell cycle phase of a cell. 2.4. Data Set The Reduced Yeast Cell Cycle (RYCC) data set used in this research is from [8]. It is a data matrix with 384 rows and 17 columns. Each row represents a gene with 17 dimensions where each dimension corresponds to a point in the time series. It contains 384 genes that are grouped based on the five phases of the cell cycle: G1/M, G1, S, G2, and M. It is shown in [1] that the data set exhibits periodicity and some relational patterns with respect to the cell cycle. The microarray samples, collected at 17 timepoints taken in ten-minute intervals, cover nearly two full cell cycles (170 min). 2.5. Clustering Methods In this paper, we make use of two relatively new approaches in data clustering: kernel and spectral methods. For data which are not linearly separable, using these methods should allow us to cluster data using classifiers in the form of hypersurfaces. We give an overview of the two clustering algorithms used. For a deeper understanding, readers are referred to [4, 5]. Clustering methods that use spherical or elliptical metric to group data may not work well when clusters are non-convex. Spectral clustering was introduced to address this problem. The main idea in spectral clustering is to construct similarity graphs that represent the local neighborhood relationships between 106

observations. To start with, we consider an n n similarity matrix whose entries s ij is the similarity between observation i and observation j. The data is then represented in an undirected similarity graph G = V, E, where the n vertices represent the observations and two vertices v i and v j are connected by an edge if s ij 0 or is greater than some predefined threshold. These edges are weighted by the s ij s. The goal is to partition the graph such that the edges between different groups have low weights, i.e. data points have high dissimilarity, and edges within a group have high weights, i.e., data points have high similarity. As an alternative, we can also consider a fully connected similarity graph whose edges are weighted by w ij = s ij and construct the adjacency matrix W = {w ij }. The degree g i of vertex i is given by the sum of the weights of the edges connected to it, that is g i = Σ j w ij. If G is the diagonal matrix whose entries are g i, the graph Laplacian is given by L = G - W Spectral clustering method finds the respective eigenvectors corresponding to the m smallest eigenvalues of L to form a matrix Z n m. Then the rows of Z n m are clustered, in this case using kernel k-means, resulting in a clustering of the original data points. A kernel is a function K such that for all data points x 1, x 2 in an input space X, K(x 1, x 2 ) = < (x 1 ), (x 2 ) >, where is a mapping from X to an inner product space F called a feature space. Hence, kernel methods are a class of algorithms that perform by mapping the input data into a high dimensional feature space. This is done using the so-called kernel trick ' which is primarily based on Mercer's theorem. According to the theorem, any continuous, symmetric, positive semidefinite kernel function K(x,y) can be expressed as a dot product in a high dimensional space. Therefore, any algorithm that employs the inner product operation can be applied with the kernel trick. Kernel-based algorithms have come a long way since their introduction. Aside from the fact that kernel functions have provided algorithms a bridge between linearity and nonlinearity, their performance have been proven comparable to, if not better than, existing algorithms in various areas where they have been applied. Thus, recent trends involve kernelizing algorithms, that is these algorithms have been extended to use kernels by applying the kernel trick. As of present, many kernel methods for clustering have already been developed, most of them are kernel versions of known clustering algorithms, such as k-means, SOM, and neural gas. However, we will only consider using kernel k-means in clustering our data. Given a set of unlabeled data and the number of groups we wish to partition the data, k-means aims to find the clusters and their centers by moving the cluster centers iteratively until the total variance within a cluster is a minimum. Thus, the algorithm iterates the following steps until convergence: 1. for each center, find the set of data points closer to it than any other center; 2. identify the new center of each cluster by computing for the means of the data points it contains. Applying the kernel trick in this algorithm results to kernel k-means. For a set of observed data, kernel k- means involves projecting this data set in some feature space F by means of a nonlinear map. Then, the algorithm for the usual k-means is employed. This means that clustering is done in the feature space F. As a result, the data mapped in F are now separated by hyperplanes, which when mapped back to the original input space form nonlinear partitionings. 3. Methodology 3.1. Data Preprocessing The RYCC data set from [8] were initially derived from [1], where a set of 416 genes in yeast were identified to be dependent on the cell cycle. Further filtering of the data was done in [8] by eliminating genes that are associated to more than one phase of the cell cycle and genes that have negative gene expression values, thus resulting to a 384 17 (genes sample) data matrix. In addition to these, we disregarded the set of genes identified to be outliers in [2]. 3.2. Clustering of Samples 107

We divide the data into their respective groups as indicated in [1] and clustering using the two algorithms discussed in the previous section is performed for each group. Thus, five clusters per group were obtained, each cluster corresponding to a one of the cell cycle phases. However, indexing of the clusters of a group may not be the same for all groups. To address this problem, center means of each resulting cluster of each group are computed and sorted, thus obtaining a cluster index that can be applied to all groups and allowing us to obtain visualizations of the clustering results. It is important to note that a clustering method such as the k-means is dependent on initialization, which in this case involves finding the 5 initial centers of each cluster using heuristics. Therefore, to approximate the true clustering of each group, several trials are made and the final visualizations are obtained by getting the average of the center means of the most expressed cluster at a specific time interval for each group. These values are then compared to the average of the center means of each cluster over all trials. The expressed group at a certain timepoint is assigned to a cluster if the distance between their center means is the minimum. 4. Analysis and Results We graph the time domain data for each group and align the cell cycle phase distribution from FACS and budding index analysis. Based from the alignment, the following behaviours were observed. Groups M/G1, G1 and M have expression levels that peak in their corresponding phases identified by the reference cell cycle distribution. Since the S phase is characterized by synthesis of a lot of proteins needed by the cell, we can expect a peak of the gene expression level that doesn't conform to the corresponding cell cycle phase. The results of the two clustering methods used are shown in Figure 1 and Figure 2, along with the estimated cell cycle phase distribution and the reference distribution based from using FACS and budding index analysis. The estimation of the cell cycle phase distributions are obtained by computing for the correlation of the center means for each pair of adjacent timepoints. We assigned a similar cell cycle phase to two adjacent timepoints if their cluster means are significantly correlated. Since the alpha factor-based method used to synchronize the sample population starts with M/G1 as the initial phase, assignment of the cell cycle phases are done in a similar manner. However, different confidence levels may lead to different distributions. To address this problem, we tested a range of confidence levels (70% - 95%), and computed an index that will measure the goodness of the estimates. We used the Hamming distance or the edit distance to measure the consistency of the estimates with respect to the reference distribution. As much as possible we want our edit distance to be minimum. A summary of the edit distances obtained for each confidence level is given in Table 1. Figures 1 and 2 are visualizations of the clustering results and best estimates obtained using the two algorithms. Table 1: Edit distances obtained for each confidence level Fig 1: Spectral clustering result with an estimate cell cycle phase distribution and the reference distribution based from FACS and budding index analysis 108

Fig 2: Kernel k-means clustering result with an estimate cell cycle phase distribution and the reference distribution based from FACS and budding index analysis 5. Conclusion Given the time series gene expression data of a synchronized population of yeast, we obtained an estimate of the cell cycle phase distribution that approximates the result of FACS and budding index analysis by up to 82.35%. The method employs kernel k-means to cluster the set of samples of each group and a correlation confidence level of 80% is used for the cell cycle phase assignment. Based from the computed edit distances, kernel k-means has a better approximation in all levels of confidence compared to the spectral clustering algorithm. 6. Recommendations For further studies, we recommend using our method in a different time series gene expression data with smaller time interval. We also wish to extend the study to asynchronous population and prokaryotic data. 7. Acknowledgements The authors would like to thank Ms. Jasmine Malinao for her insights and advise, Erlo Robert Oquendo for the statistical background, Dr. Anactleto Argayosa and Angelo Dela Tonga for preliminary biological interpretations. Ms. Clemente would like to thank ERDT for funding her Master s degree in UP Diliman. Ms. Salido would like to thank Aklan State University for funding her studies in UP Diliman. 8. References [1] Cho, R.J., et al. (1998). A Genome-wide Analysis of the Mitotic Cell Cycle. Molecular Cell 2 65-73. [2] Clemente, J. and Salido, J.A. (2010). Non-Metric Multidimensional Scaling and Vector Fusion Visualization of Time Series Gene Expression Data for Gene Function Analysis. Proc. of the Phil. National Conference on Information Technology Educators. [3] Domany, E. (2003). Cluster Analysis of Gene Expression Data. Journal of Statistical Physics 110 Nos. 3-6 1117-1139. [4] Filippone, M., et al. (2008). A survey of kernel and spectral methods for clustering. Pattern Recognition 41 176-190. [5] Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag Series in Statistics, New York. [6] Niemisto, A., et al. (2007). Computational Methods for Estimation of Cell Cycle Phase Distributions of Yeast Cells. EURASIP Journal of Bioinformatics and System Biology. [7] Spellman, P., et al. (1998). Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Molecular Biology of the Cell 9 3273-3297. [8] Yeung, K.Y. (2001). Cluster Analysis of Gene Expression Data. Dissertation, Department of Computer Science and Engineering, University of Washington. 109