ELLI HEIKKINEN ALGORITHM FOR IDENTIFYING GENETIC MODIFICATIONS FOR OPTIMIZATION OF MICROBIAL PRODUCTION STRAINS. Bachelor of Science Thesis


ELLI HEIKKINEN
ALGORITHM FOR IDENTIFYING GENETIC MODIFICATIONS FOR OPTIMIZATION OF MICROBIAL PRODUCTION STRAINS
Bachelor of Science Thesis

Examiner: Associate Professor Heikki Huttunen
Supervisor: Research Fellow Tommi Aho
Submitted for examination on

ABSTRACT

TAMPERE UNIVERSITY OF TECHNOLOGY
Degree Programme in Biotechnology
HEIKKINEN, ELLI: Algorithm for identifying genetic modifications for optimization of microbial production strains
Bachelor of Science Thesis, 23 pages
May 2012
Major: Computational Systems Biology
Examiner: Associate Professor Heikki Huttunen
Supervisor: Research Fellow Tommi Aho
Keywords: metabolic network, metabolic engineering, algorithm, in silico, strain design, optimization

Recently, metabolic engineering techniques have evolved from traditional experimental selection and random mutagenesis to more advanced recombinant DNA technologies. This has encouraged researchers to engineer microorganisms into production systems. These so-called cell factories can be exploited to efficiently produce valuable chemicals, such as drugs and fuels, from sustainable sources. However, finding an adequate set of modifications that enhance the production of a desired metabolite is not a trivial task owing to the complexity of cell metabolism. Therefore, to understand the characteristics and capabilities of cell metabolism, several genome-scale models of host organisms have been reconstructed. Additionally, computational methods have been developed for analyzing these genome-scale models and for performing more systematic strain design in silico. Most of the currently available algorithms are limited in the sense that they identify only gene deletions or reaction activation/inhibition, and not gene or reaction additions.

In this thesis, a new algorithmic approach for in silico strain design is presented. Instead of identifying only gene deletions, the algorithm identifies both gene deletions and non-native reaction additions that increase the production rate of the objective metabolite. A description of the construction of the algorithm parameters is also provided. First, a set of non-native candidate reactions is collected from the Kyoto Encyclopedia of Genes and Genomes (KEGG). Second, a method to extract the central metabolism from a full metabolic model is presented and demonstrated. Finally, a selection of various small metabolites, called currency metabolites, is collected.

The algorithm is demonstrated by designing modifications to the metabolism of Acinetobacter baylyi ADP1 with the aim of increasing the acetate production rate. The algorithm found a total of 27 non-native pathways to be added to the model for enhanced acetate production. The results show that the algorithm has potential as a tool in computational strain design.

PREFACE

This thesis is about my work in the Computational Systems Biology Research Group in the Department of Signal Processing. I thank the head of the CSB group, Olli Yli-Harja, for the opportunity to work in his group. I am especially thankful to my supervisor Tommi Aho for his indispensable guidance in my work and with this thesis. I am grateful to Antti Larjo and Ville Santala for their useful comments and advice. The Academy of Finland is acknowledged for funding my work. I want to thank the examiner of this thesis, Heikki Huttunen, for his guidance and comments during the writing process. Lastly, I want to thank Antti for his comments and for proofreading this work, and most importantly, for his continuous support.

Elli Heikkinen
Tampere

CONTENTS

1 Introduction
2 Theoretical background
    Cell, metabolism and genome
    Metabolic engineering
    Metabolic models and model reconstruction
    Model analysis
3 Algorithm
    Algorithm procedure
    Reduction of central metabolism
    Curation of candidate reactions
    Identification of non-native pathways
    Confirming the functionality of the possible reaction additions and identifying favorable gene knockouts
4 Results
5 Discussion
    Validity of the results
    Future directions
6 Conclusion
References

ABBREVIATIONS AND NOTATION

ATP: Adenosine-5'-triphosphate. A small high-energy coenzyme, the primary chemical energy transporter in intracellular metabolism.
BFS: Breadth-first search. A graph search algorithm that finds the shortest path between two nodes.
EMILiO: Enhancing Metabolism with Iterative Linear Optimization. An algorithm for production strain design that identifies a subset of reactions with the potential to improve growth-coupled biochemical production and simultaneously predicts, quantitatively, the optimal flux rates that maximize production.
FBA: Flux balance analysis. A mathematical analysis method that finds an optimal flux distribution within the allowable solution space.
GDLS: Genetic Design through Local Search. An algorithm for identifying genetic design strategies, based on a local search with multiple search paths.
glpk: GNU Linear Programming Kit. A package for solving large-scale linear programming and other related problems.
GSC: Giant Strong Component. A subset of the metabolites and reactions of a metabolic network that makes up the functionally most important part of the network.
in silico: Refers to something performed on a computer or via computer analysis.
KEGG: Kyoto Encyclopedia of Genes and Genomes. A bioinformatics resource consisting of databases for genomic, chemical, network and other types of biological information.
knockout: A gene deletion from a cell. Knocking out a gene disables the cell's capability to produce the product of that gene. Thus, reactions in which the gene product took part might not occur.
LP: Linear programming, also known as linear optimization. A mathematical problem, or a method for solving a problem, of maximizing or minimizing a linear function subject to a set of constraints.
MATLAB: Software and programming language for technical computing developed by MathWorks. Abbreviation of "matrix laboratory".
microorganism: A microscopic organism, for example a bacterium, a fungus, or an archaeon.
objective flux: The production rate of a desired metabolite in flux balance analysis simulations.
SBML Toolbox: Systems Biology Markup Language Toolbox. A toolbox that, among other things, facilitates importing and exporting models represented in the Systems Biology Markup Language (SBML) into and out of the MATLAB environment. SBML is an XML-based format for the representation of computational models of biochemical networks.
strain: In biology, a genetic variant or subtype of a microorganism such as a bacterium or a virus.

a: A vector of the lower bounds of the reaction fluxes. Its jth element is the lower bound of reaction j.
b: A vector of the upper bounds of the reaction fluxes. Its jth element is the upper bound of reaction j.
f: An objective vector. Its jth element is the weight of reaction j in the objective.
k_max: The maximum number of reactions in a pathway searched for by the algorithm.
S: The stoichiometric matrix, of size m x n, of the reaction network. The ijth element of the stoichiometric matrix is the coefficient of metabolite i in reaction j.
v: A vector of reaction fluxes. Its jth element is the flux through reaction j.
m: The number of metabolites in the reaction network.
n: The number of reactions in the reaction network.

1 INTRODUCTION

Metabolic engineering is defined as the targeted improvement of the metabolic pathways of microorganisms to enhance their capability to produce a desired substance [1]. Microbial production systems, so-called cell factories, have received great attention in recent years. This is because microorganisms offer sustainable sources of diverse compounds such as fuels, valuable chemicals, and drugs. Large-scale production using fermentation techniques has also become more advanced. Now that people are becoming more and more concerned about energy and environmental issues, obtaining fuels efficiently from renewable sources is a potential alternative to fossil fuels. Moreover, the possibility of better economic efficiency has encouraged researchers to deploy metabolic engineering for industrial production [2]. Several compounds, such as insulin, are already being produced in microorganisms [3], and great effort is being put into developing several others [4, 5].

Developing such production strains is not a trivial task because of the complexity of cell metabolism. Therefore, traditional methods such as experimental selection and random mutagenesis are being replaced by computational methods. The aim is to automate targeted strain development and make it more systematic. Furthermore, advances in molecular biology and recombinant DNA technology provide the ability to construct the gene content and expression levels of microbial production strains in a controlled fashion [6].

Computational models and simulations play an important role in understanding the characteristics and the production capabilities of the cell under given genotypic and environmental conditions [6]. The greatest advantages of computational analyses are economic efficiency and faster experiments. In silico analyses, that is, analyses performed on a computer, are fully controllable, cheaper, and easier to repeat than culturing and other analyses in a wet laboratory.
Still, computational analyses alone are not sufficient and need to be confirmed by experimental analyses.

Alongside the progress in microbial techniques, a number of genome-scale metabolic models are being constructed for simulating microorganisms under different genetic and environmental conditions. Constructing models is becoming faster and more accurate owing to developments in model construction procedures and a growing amount of annotated sequence information [7]. Reconstructed and validated models are already available for more than 30 microorganisms [7], such as Escherichia coli [8], Saccharomyces cerevisiae [9], and Acinetobacter baylyi [10]. Furthermore, various computational methods to simulate, analyze and predict metabolic phenotypes from these models have been developed [11]. Both the models themselves and the methods for analyzing the metabolism are becoming more and more accurate, and have been able to predict the consequences of genetic and metabolic modifications [12].

In addition to the complete genome-scale metabolic models, individual reactions and annotated genes are collected in databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [13], the Metabolic Pathway Database (MPW) [14], and several others. These reactions make up a growing library of the biotransformations for which there is evidence of existence in different species.

To summarize, the required information on the production hosts and the modifications is now available. There are models that can provide accurate abstractions of microbial metabolism, and at the same time, the biotransformation databases offer information on both the non-native and the native functionalities that may need to be added to or removed from the model in order to attain the desired production capability. It follows that the problem now lies in how to systematically select, from thousands of alternatives, the modifications that improve the product yield while maintaining the viability of the cell.

Over the last decade, several computational algorithms have been developed to overcome this problem, for instance OptKnock [15], Genetic Design through Local Search (GDLS) [16], OptStrain [17], and Enhancing Metabolism with Iterative Linear Optimization (EMILiO) [18]. In addition to strain design, these tools also provide methods to visualize, modify, and analyze models. These include visualizing the solution space of a flux balance analysis (FBA), adding and removing reactions and genes, robustness analysis, and many others [11]. These algorithms and their suggestions for strain design have been validated and used successfully as the basis of experimental genetic modifications in several studies [19].
Each of these algorithms has different purposes and simulation conditions, which makes the selection of an appropriate algorithm for each experiment very important. Most of them provide strain design only through the identification of gene knockouts or reaction activation/inhibition, and not through gene or reaction additions [15, 20]. In addition, the analysis of cellular physiology is facilitated by adding constraints, such as regulation, thermodynamics and molecular crowding. Such constraints make in silico simulations more accurate, because several mechanisms occur concurrently in the cell [6]. Therefore, approaches that integrate several algorithms and constraints are required to describe complex metabolic systems that are affected by multifaceted interactions of various mechanisms.

In this thesis, a new algorithmic approach for in silico strain design is described. It combines the GDLS method with a search for feasible reaction additions from a set of non-native candidate reactions. In chapter 2, a short overview of the biology and the tools behind the thesis topic is presented. The MATLAB implementation and the details of the algorithm are described in chapter 3. Then, we demonstrate the functionality of the algorithm in a case study of designing an acetate production strain in the genome-scale model of A. baylyi ADP1 [10]. The results are described in chapter 4 and discussed in chapter 5. Ideas for future development of the algorithm are also presented in chapter 5.

Recently, the author of this thesis has written a paper on the same topic [21]. It is a shorter and narrower presentation of this thesis, targeted at readers with more expertise in computational systems biology. Otherwise the content is essentially the same. The paper was accepted for presentation at the 9th Workshop on Computational Systems Biology 2012.

2 THEORETICAL BACKGROUND

Even though this thesis is fully computational, it is necessary to be familiar with the basic concepts of the underlying biology in order to understand what is done and why. As most engineers have little knowledge of biology, the following sections briefly introduce the basic biological concepts and techniques essential to this thesis. Lastly, metabolic models and analyses are reviewed in more detail, because they play a central role in this work.

2.1 Cell, metabolism and genome

All living beings are made of cells, which are small, membrane-enclosed units containing cell organelles and a solution of chemicals, the cytoplasm. Cells can be fundamentally divided into prokaryotes and eukaryotes. Prokaryotes are simpler and lack some cell organelles that eukaryotes have, such as the nucleus and mitochondria. More complex organisms, such as plants and animals, consist of countless different types of cells, but the majority of living organisms are single cells, such as bacteria. [22]

One definition of living matter is that it has its own metabolism and an ability to reproduce. It is owing to metabolism that cells can grow, reproduce, and convert energy from one form to another. Hence viruses, for example, which share a few components with cells, such as RNA and enzymes, but have no metabolism and cannot reproduce on their own, are not considered living things. The metabolism of a cell consists of numerous biochemical reactions that together form a complex network. Some of the most essential pathways in all living organisms are glycolysis, the citric acid cycle (TCA), and the pentose phosphate pathway. They are part of the pathways responsible for the energy production of a cell. The glycolysis pathway is illustrated in Figure 2.1. [22]

It is to be noted that most of the reactions in cell metabolism do not occur at the prevailing temperature of the cell without enzymes.
Enzymes are proteins that catalyze biochemical reactions. They usually act in series, forming reaction sequences. In addition to acting as reaction catalysts, proteins dominate cell behavior by serving as structural supports, molecular motors, chemical transport channels, and so on. Proteins are polymer chains of 20 different amino acids that are the same in all cells. These amino acids are linked in different sequences and folded into different three-dimensional shapes that give each protein its special functionality. [22]

Figure 2.1. An example of a metabolic pathway: glycolysis. Glycolysis is a central metabolic pathway in which sugar molecules are converted to chemical energy.

Cells can synthesize the proteins they need using genetic instructions, genes, that are stored in DNA molecules. DNA is a double-stranded polymer chain that is composed of varying sequences of four monomers, called nucleotides. The nucleotides are composed of a deoxyribose sugar, one phosphate group, and one of four bases: adenine (A), cytosine (C), thymine (T) or guanine (G). The complete set of DNA molecules of an organism is called a genome. The genome contains genes, which are the segments of DNA that contain the information for the synthesis of functional biomolecules. The process in which DNA is transcribed into RNA and further translated into a protein is known as a whole as gene expression. The process is part of the central dogma of molecular biology, which is illustrated in Figure 2.2. Not all genes are expressed continuously in the cell, but only when their gene products are needed. Together with the environment, the gene products, proteins, determine the characteristics and behavior of the organism. [22]

Figure 2.2. The fundamental principle of the flow of genetic information in all cells is termed the central dogma of molecular biology.

Genetic information is replicated and apportioned between two descendants at each cell division. This process, called replication, is highly accurate. In addition, a cell has special mechanisms for repairing DNA when it is damaged. Despite that, permanent changes, called mutations, sometimes occur. Often mutations are fatal, but occasionally they are beneficial for the organism. For example, a certain mutation can make bacteria resistant to antibiotics that used to kill them. Mutations are the source of evolution, variety in species, and smaller variations between individuals of the same species. In addition to random mutations, it is possible to modify a genome using DNA engineering techniques, for example by removing a gene. This technique is called a gene knockout. [22]

2.2 Metabolic engineering

In this section, the concept of metabolic engineering and its significance in the development of cell factories are discussed. Metabolic engineering is the targeted and intentional modification of metabolic pathways using modern DNA techniques in order to improve cellular properties. The goal may be to increase process productivity, as in the industrial production of amino acids, or to extend metabolic capability by adding non-native activities for chemical production or degradation. Another goal may be to inhibit the formation of some unwanted metabolite that can, for instance, inhibit growth at higher concentrations. [1]

Several DNA techniques can be used to achieve the desired goal. These include, for example, the elimination of a competitive pathway or a toxic by-product by removing or inhibiting the gene that encodes the enzyme catalyzing the reaction. Another method is the overexpression of genes by gene amplification, or the expression of a heterologous gene, i.e. a gene from another organism encoding a corresponding enzyme activity, to enhance the production of existing products. Various other approaches have also been used. Often, a combination of different kinds of modification strategies is needed to obtain the desired phenotype. Additionally, manipulation of the central metabolism may be necessary to generate the precursors, cofactors and, most importantly, the energy required to sustain the modified pathway and the viability of the cell. [1]

Advances in molecular biology and genetic engineering empower researchers with an increasing ability to develop any desired cellular modification. Accordingly, the global focus shifts from modification techniques to target identification.
The cellular phenotype reflects the net effect of intracellular conditions, for instance gene expression levels, nutrient availability, cellular stress, and so on. Thus, local metabolic changes or individual gene modifications most likely will not result in a drastic outcome. To identify beneficial modifications, detailed knowledge of the enzyme kinetics, the metabolic network and gene regulation is required. [1, 23]

2.3 Metabolic models and model reconstruction

Metabolic models are mathematical representations of the metabolism of the cell. Metabolism is typically modeled as a set of metabolites existing within the cell and the reactions connecting them. Reactions are usually annotated with the gene or genes whose products catalyze them. These metabolic maps can be presented graphically, but for convenience they are presented as a stoichiometric matrix composed of reaction coefficients. In the stoichiometric matrix, each row represents one metabolite and each column one reaction. The reactants of a reaction are annotated with negative coefficients and its products with positive coefficients. An example of a fictional reaction network and its stoichiometric matrix is illustrated in Figure 2.3. The rate at which a reaction converts a set of metabolites into another is called the metabolic flux of that reaction. It typically has units of material (moles or mass) per time per amount of cells, for example millimoles per gram of cell dry weight per hour (mmol/gDW/h). [24]

Figure 2.3. Example of a metabolic network and its stoichiometric matrix. The MATLAB SimBiology tool was used to draw the network.

Models usually contain several constraints, such as the reversibility of reactions, the energy requirement for cell maintenance, lower and upper bounds on each reaction flux and each metabolite concentration, pH, and temperature. Hence the term constraint-based model. [25]

The recent availability of the genome sequences of microorganisms has greatly assisted the construction of cellular models by enabling systematic approaches to gene target identification. The constructed models are mainly stoichiometric models, in which only the reaction fluxes, or reaction rates, with their gene annotations are determined. However, it is to be noted that the cell is also governed by several other factors, including gene expression, signaling pathways, and reaction kinetics, perhaps even more strongly than by plain reaction stoichiometry. A feasible way of incorporating these constraints is still missing from the models, in spite of intensive studies. Currently, the lack of information on regulatory mechanisms and molecular interaction kinetics hampers the construction of more accurate models.
Furthermore, obtaining the kinetic characteristics of reactions in enough detail to set up physiologically meaningful rate equations requires laborious and time-consuming experiments. Various approaches to this problem have been proposed, ranging from brute force to compromise solutions. [24, 26]

Genome-wide metabolic network reconstructions have been generated for over 10 years. A protocol with 96 steps for the reconstruction of high-quality genome-scale metabolic models has even been formulated by Thiele and Palsson (2010) [7]. Some parts of it can be done automatically using software such as Pathway Tools [27] or metaSHARK [28], but these do not fully replace manual curation. As a result, metabolic models can be reconstructed systematically from the annotated genome sequence and the analysis of their global structure. The development process is usually an iterative cycle in which experimental and computational analyses take turns and help to further develop and validate each other, as illustrated in Figure 2.4 [29]. In practice, a model can be simply a MATLAB structure with several fields containing information about the metabolites, reactions, genes, reaction flux bounds, and objective reaction(s).

Figure 2.4. The cycle of the iterative metabolic model reconstruction process. The reconstruction begins with the annotated genome of the organism. Together with information from biochemical databases and literature, a draft reconstruction is generated. Flux balance analysis (FBA) is often used for simulating the reconstructed network, and predictions for various conditions can be made. These predictions can then be experimentally validated. The experimental data are used to improve and refine the original model. Modified from [30].

2.4 Model analysis

In this section, an overview of the algorithms used in this study is presented. Genome-wide models allow researchers to see the cell as a whole. However, because of the complexity of metabolic systems, computational methods are needed to analyze the models in order to get new information out of them. Since the genome-scale model of A. baylyi ADP1 used in this study is a stoichiometric model, the discussion here is confined to constraint-based flux analysis, which is commonly used for analyzing stoichiometric models. This static approach provides information about specific states of an organism under certain genotypic and environmental conditions. There are other methods for analyzing other kinds of models, such as kinetic models, but owing to their large number of parameters they are applicable only to models considerably smaller than genome-wide ones. Several algorithms have been developed to help identify target genes to be modified. [6]

One of the most widely used methods is flux balance analysis (FBA). FBA calculates the steady-state flows of metabolites through a metabolic network. At steady state, each metabolite is produced and consumed at an equal rate. From the solution space, also called the flux cone, it is possible to infer, for example, the theoretical maximum growth rate of an organism and the theoretical maximum production rate of a desired metabolite. [31]

An FBA model consists of three components. First, there is the stoichiometric matrix S of size m x n, constructed from the m metabolites and n reactions of the model. The ijth element of the stoichiometric matrix is the coefficient of metabolite i in reaction j. The fluxes through all of the reactions in the network are represented by a vector v of length n, whose jth element is the flux through reaction j. Second, the fluxes in v are constrained by a lower bound vector a and an upper bound vector b, i.e. the jth elements of a and b are the lower bound and the upper bound of the jth reaction, respectively. Finally, the objective to be maximized or minimized is formed by multiplying the fluxes by an objective vector f, whose jth element is the weight of reaction j in the objective. Hence, the mathematical representation of the optimization problem is [31]:

$$\max_{v}\; f^{T} v \quad \text{subject to} \quad S v = 0, \quad a \le v \le b \qquad (2.1)$$

This kind of optimization problem can be solved with linear programming (LP) solvers, which are tools for finding the optimal value of a linear objective function under linear constraints. In this work, we have used the GNU Linear Programming Kit (glpk) solver [32]. Usually in large-scale metabolic models there are more reactions than metabolites (n > m), i.e. there are more unknown variables than equations in the system Sv = 0. Therefore, this system of equations has no unique solution. The set of solutions is called the solution space, and its graphical representation is called the flux cone. An illustration of a flux cone can be seen in Figure 2.5. [31]

Figure 2.5. The allowable solution space of the flux balance analysis, which is defined by the constraints imposed by the stoichiometric matrix and the lower and upper bounds of the reaction rates. The solution space is also called the flux cone. FBA can determine the optimal flux distribution that maximizes the objective flux. In the flux cone, the optimal solution lies on the edge of the allowable solution space. [31]
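The constraints of problem (2.1) can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the five-reaction network, the bounds and the candidate flux vectors are hypothetical and are not taken from the thesis, and the code checks feasibility and evaluates the objective rather than solving the LP (the thesis uses the glpk solver for that). It shows how S is built from reaction coefficients and that, because n > m, several different flux vectors satisfy Sv = 0 with different objective values.

```python
# Hypothetical toy network (not from the thesis):
#   R1: -> A       (uptake)
#   R2: A -> B
#   R3: A -> C
#   R4: B ->       (export of B, used as the objective)
#   R5: C ->       (export of C)
metabolites = ["A", "B", "C"]
reactions = {
    "R1": {"A": 1},
    "R2": {"A": -1, "B": 1},
    "R3": {"A": -1, "C": 1},
    "R4": {"B": -1},
    "R5": {"C": -1},
}


def stoichiometric_matrix(metabolites, reactions):
    """Return S as m rows (metabolites) by n columns (reactions).

    Reactants get negative coefficients, products positive ones.
    """
    return [[coeffs.get(met, 0) for coeffs in reactions.values()]
            for met in metabolites]


def is_feasible(S, v, a, b, tol=1e-9):
    """Check the constraints of (2.1): S v = 0 and a <= v <= b."""
    steady = all(abs(sum(s * x for s, x in zip(row, v))) <= tol for row in S)
    bounded = all(lo - tol <= x <= hi + tol for lo, x, hi in zip(a, v, b))
    return steady and bounded


def objective(f, v):
    """Evaluate the objective f^T v."""
    return sum(w * x for w, x in zip(f, v))


S = stoichiometric_matrix(metabolites, reactions)
a = [0, 0, 0, 0, 0]        # lower bounds
b = [10, 10, 10, 10, 10]   # upper bounds
f = [0, 0, 0, 1, 0]        # objective: flux through R4

v1 = [10, 4, 6, 4, 6]      # feasible, objective 4
v2 = [10, 10, 0, 10, 0]    # feasible, objective 10
v3 = [10, 10, 10, 10, 10]  # within bounds, but A is consumed faster than produced

for v in (v1, v2, v3):
    print(v, is_feasible(S, v, a, b), objective(f, v))
```

Both v1 and v2 lie in the solution space, and an LP solver would pick a vector such as v2 that maximizes the objective; v3 is rejected because it violates the steady-state condition for metabolite A.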

MATLAB implementations of FBA and other similar algorithms, such as minimization of metabolic adjustment (MOMA) [33] and OptKnock [15], are provided by the Constraint-based reconstruction and analysis (COBRA) Toolbox [11]. In addition to the algorithms already mentioned, the COBRA Toolbox provides computational predictions of the effects of gene deletions, robustness analyses, and methods to modify and refine models [11]. The majority of the model handling in this work was done using the COBRA Toolbox.

Another MATLAB-based method used in this thesis is Genetic Design through Local Search (GDLS) [16]. GDLS is a gene knockout design tool that is capable of handling large models in a reasonable computing time. It first reduces the model to an equivalent FBA model with fewer genes, reactions, and metabolites. Then, using the reduced model, it starts from an initial knockout selection and iteratively searches the neighborhood for the best genetic manipulation strategies that couple growth and the production of the desired metabolite. The iteration stops when no further improvement can be obtained with the maximum number of allowed gene knockouts. [16]
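The local search idea described above can be illustrated with a minimal Python sketch. This is not the GDLS implementation: it follows a single search path instead of several, it performs no model reduction, and the function toy_score with its gene names is a hypothetical stand-in for the FBA-based evaluation of a knockout set. It only shows the loop of starting from an initial knockout selection, examining its neighborhood, and stopping when no neighbor within the knockout limit improves the result.

```python
def local_search(genes, score, max_knockouts, start=frozenset()):
    """Greedy local search over knockout sets.

    A neighbor differs from the current set by exactly one gene
    (one knockout added or removed). The search stops when no neighbor
    within the size limit scores higher than the current set.
    """
    current = frozenset(start)
    current_score = score(current)
    while True:
        neighbors = [current ^ {g} for g in genes]  # toggle one gene
        neighbors = [s for s in neighbors if len(s) <= max_knockouts]
        best = max(neighbors, key=score, default=None)
        if best is None or score(best) <= current_score:
            return current, current_score
        current, current_score = best, score(best)


def toy_score(knockouts):
    """Hypothetical production score of a knockout set (assumption)."""
    gains = {"g1": 2.0, "g2": 1.0, "g3": -1.0}
    total = sum(gains[g] for g in knockouts)
    if {"g1", "g2"} <= knockouts:
        total += 0.5  # the two knockouts reinforce each other
    return total


result = local_search(["g1", "g2", "g3"], toy_score, max_knockouts=2)
print(result)
```

With at most two knockouts the search ends at the set {g1, g2} with a score of 3.5. A greedy search of this kind can stop at a local optimum, which is one reason a method may follow multiple search paths.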

3 ALGORITHM

In this chapter, an overview of the new strain design algorithm is given first. After that, a few of its most essential steps and the curation of its parameters are described in more detail.

The algorithm was implemented in MATLAB. In addition to strain design, it provides predictions of the theoretical maximum and minimum rates of growth and of the objective reactions. A few databases and ready-made software packages running in the MATLAB environment were utilized. First, the Systems Biology Markup Language (SBML) Toolbox was used to import the metabolic model from its SBML representation into a MATLAB structure [34]. Second, the model handling and the maximal cellular growth rate simulations were done using FBA as implemented in the COBRA Toolbox [11]. Third, we employed GDLS to identify favorable gene knockouts that couple cellular growth and the production of the objective metabolite [16]. Last, a collection of 6626 individual candidate reactions was compiled from the KEGG reaction database [13].

3.1 Algorithm procedure

The main idea of the algorithm is to iteratively search for non-native reaction pathways and gene deletions that improve product formation and growth rate and couple them together. The goal is to enhance the production capabilities of the strain step by step by making favorable modifications while maintaining the viability of the organism. The outline of one iteration cycle is presented here. The initial parameters are the metabolic model, its central metabolism, the candidate reactions, a list of currency metabolites, and the objective metabolite. The parameter preparation procedures and the most important steps of the algorithm are explored in more detail in the following sections. One iteration of the algorithm proceeds as follows:

Step 1. Search for non-native pathways in the set of candidate reactions using the breadth-first search (BFS) algorithm. The pathway search starts from and ends at metabolites of the central metabolism. The pathways were constrained to consist of at most k_max reactions (usually k_max = 1, ..., 3). KEGG identifiers were used to match the metabolites of the non-native pathways to those of the model. The so-called currency metabolites are ignored in the pathway search in order to find biologically meaningful pathways. In each iteration, a different starting metabolite is used for the search.

Step 2. Screen out pathways that are non-functional in the host. This can be the case, for example, because dead-end metabolites in the pathway prevent flux through a reaction at steady state, or because the pathway is stoichiometrically unbalanced. Pathway functionality is inspected by adding the pathway to the full model, not only to the central metabolism, and simulating the maximal material flux through the pathway using FBA.

Step 3. From the functional non-native pathways found, find a set of pathways that, acting together with gene knockouts identified by GDLS, couples the production of the desired metabolite with biomass formation. This is done by adding each pathway to the full model and running it through GDLS. GDLS predicts the maximal growth and product formation rates and the gene knockouts needed to produce that result.

Step 4. Select the best solution identified by GDLS and compare it to the current solution. The fundamental requirement for the predicted solution is that the predicted growth rate is at least 10% of that of the wild type. If GDLS has also succeeded in improving and coupling the biomass and objective fluxes, in other words, if the flux solution space has shifted so that maximal growth and objective flux are coupled and improved in comparison to the current solution, then the suggested modifications are implemented in the full model. Otherwise, no modifications are made to the model.

Step 5. Take another starting metabolite for the non-native pathway search and start again from Step 1 until all central metabolism metabolites have been gone through.

As a result, the algorithm returns the modified full model, the latest simulation results for the biomass and product formation rates, and details of the performed modifications. In addition, statistics on the ratio of non-functional pathways to all found paths and on all favorable modifications are provided.
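The control flow of Steps 1 to 5 can be summarized in a short sketch. The Python below is only an illustration under stated assumptions, not the MATLAB implementation of the thesis: the names `run_iterations`, `search`, `is_functional` and `evaluate` are hypothetical stand-ins for the BFS pathway search, the FBA functionality test and a GDLS run, respectively. Unlike the real algorithm, the sketch does not modify a model between iterations; it only tracks which pathways would be accepted and the best product flux so far.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Pathway = Tuple[str, ...]  # a pathway as a tuple of reaction identifiers


def run_iterations(
    start_metabolites: Iterable[str],
    search: Callable[[str], List[Pathway]],
    is_functional: Callable[[Pathway], bool],
    evaluate: Callable[[Pathway], Optional[float]],
    current_product_flux: float,
):
    """Walk through Steps 1-5 once for every starting metabolite.

    `evaluate` is a hypothetical stand-in for a GDLS run on the model plus
    one pathway. It is assumed to return the product flux at maximal growth,
    or None when the growth requirement or the coupling fails.
    """
    accepted = []           # pathways that would be implemented
    found = functional = 0  # counters for the reported statistics
    for start in start_metabolites:                      # Step 5: next start
        paths = search(start)                            # Step 1: BFS search
        found += len(paths)
        usable = [p for p in paths if is_functional(p)]  # Step 2: screening
        functional += len(usable)
        scored = [(evaluate(p), p) for p in usable]      # Step 3: GDLS runs
        scored = [(s, p) for s, p in scored if s is not None]
        if not scored:
            continue                                     # nothing acceptable
        best_flux, best_path = max(scored, key=lambda sp: sp[0])  # Step 4
        if best_flux > current_product_flux:
            accepted.append(best_path)
            current_product_flux = best_flux
    non_functional_ratio = (found - functional) / found if found else 0.0
    return accepted, current_product_flux, non_functional_ratio
```

Passing the three stages in as callables keeps the sketch independent of any particular solver or toolbox; in practice each of them would wrap a call into the modelling environment.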
3.2 Reduction of central metabolism

The non-native pathway search could also be performed starting from and ending at any metabolite in the full model. However, to reduce the search space and to ensure a sufficient effect of the reaction additions, the additions are targeted to the central metabolism. If the reaction additions took place in the periphery of the metabolism, their effect might be negligible and they might have no connection to the synthesis of the desired metabolite.

The central metabolism of a metabolic network is conventionally considered to be structured similarly to the bow-tie structure of the World Wide Web described by Broder et al. [35]. It is the most complex, core part of the metabolites and reactions in a metabolic network, also called the giant strong component (GSC). The central metabolism usually includes the majority of the most essential metabolic pathways and metabolites. These include, for example, the glycolysis pathway (see Figure 2.1), the pentose phosphate pathway, and the pyrimidine synthesis pathway. These pathways produce the energy and the basic building blocks essential for cell maintenance and growth. Usually, the central metabolism is more tightly connected than the whole network on average, i.e. its metabolites are connected to each other through fewer reactions than metabolites in the network on average. The central metabolism often consists of less than one-third of the nodes of the whole network [36].

In this work, the central metabolism is defined as the set of the most vital reactions, those that are always active regardless of the growth medium used in the simulations. Maximal growth was simulated using 72 different growth media, and the reactions that were active, i.e. exhibited a material flux at steady state, in all simulations were considered to belong to the central metabolism. Other reactions and unused metabolites were left out. The reduced central metabolism then consisted of 389 reactions and 438 metabolites, while the full model had 996 reactions and 828 metabolites. By comparison, a central metabolism of A. baylyi ADP1 curated by Hiekkanen using another approach consisted of only 206 reactions and 187 metabolites [37].

3.3 Curation of candidate reactions

The set of individual reactions, the candidate biotransformations to be introduced into the strain, was compiled from the KEGG reaction database [13]. Because FBA requires a definite structure of the stoichiometric matrix, some reactions were excluded from the candidate reaction list. For instance, polymerization reactions were not included. Thus, reactions that had unspecified coefficients or polymer metabolites with an unspecified number of units were screened out. An example of a reaction that was left out of the candidate reactions for having unspecified reaction coefficients is shown below.
n UDP-N-acetyl-D-glucosamine + n UDP-glucuronate <=> Hyaluronate + 2n UDP

A similar curation of reactions was done, for example, by Pharkya, Burgard and Maranas in 2004 for the OptStrain algorithm [17].

3.4 Identification of non-native pathways

Breadth-first search (BFS) is a basic graph search algorithm that begins at the root node and examines all its neighboring nodes. Then, for each of those neighboring nodes, it examines their unexamined neighbors, and so on, until it finds the goal node. BFS finds a shortest path, in terms of the number of edges, between two nodes and is therefore useful as a search algorithm in our approach.

Metabolic networks can be represented as directed graphs as follows: the metabolites are the nodes of the graph and the reactions between them are the edges. Suitable non-native pathways are series of reactions in which all overall reactants and products of the pathway are metabolites of the central metabolism, as in Figure 3.1. This ensures that reaction additions take place in the central metabolism and not in the periphery of the metabolism. Furthermore, it eliminates dead-end reaction paths, in which metabolites are only consumed or only produced. Such reactions mainly serve as transport reactions that move metabolites between the cell and the extracellular space; otherwise they hinder the formation of a steady state favorable for production when using FBA. Still, this condition does not remove the possibility of dead-end metabolites in non-terminal reactions of the path.

Figure 3.1. An illustration of a reaction pathway starting from and ending in the central metabolism of the host organism. The blue circle around the host represents the metabolites in the central metabolism.

Connections through various small molecules, also called currency metabolites, such as ATP, NADH and H2O, are ignored in the BFS. These small molecules usually act as carriers for transferring electrons or certain functional groups. Consequently, they take part in so many reactions that most metabolites can be connected through them. However, these connections do not represent real, feasible product formation pathways [36]. It should be noted that currency metabolites were excluded straightforwardly by compound, not by reaction. Thus, reactions in which currency metabolites are the primary metabolites might also be excluded. For example, glutamate and 2-oxoglutarate act as currency metabolites in several amino group transfer reactions, but in the L-glutamate deaminating reaction they are the primary metabolites. The reaction is shown below.

L-glutamate + NADP+ + H2O <=> 2-oxoglutarate + NH3 + NADPH + H+

On the other hand, in this context such essential reactions can be assumed to be already present in the organism and thus are not needed as candidate reactions.
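The pathway search described in this section can be illustrated with a small breadth-first search over a metabolite graph. The Python sketch below is a simplification under stated assumptions and not the thesis implementation: the function name `find_pathways` and the data layout (a dictionary mapping a reaction identifier to its substrate and product sets) are hypothetical, every reaction is treated as reversible, and a pathway is followed along a single chain of metabolites without checking that all overall reactants and products lie in the central metabolism.

```python
from collections import deque


def find_pathways(reactions, central, currency, start, k_max=3):
    """Return reaction chains of at most k_max reactions that lead from
    `start` to another central metabolite through non-central metabolites.

    reactions: dict mapping reaction id -> (substrates, products)
    central:   set of central metabolism metabolites
    currency:  set of currency metabolites to ignore as connections
    """
    # Build metabolite -> [(reaction id, neighbor)] without currency metabolites.
    neighbors = {}
    for rid, (left, right) in reactions.items():
        for a in sorted(set(left) - currency):
            for b in sorted(set(right) - currency):
                neighbors.setdefault(a, []).append((rid, b))
                neighbors.setdefault(b, []).append((rid, a))  # reversible

    results = []
    queue = deque([(start, ())])  # (current metabolite, reactions used so far)
    while queue:
        metabolite, path = queue.popleft()
        for rid, nxt in neighbors.get(metabolite, []):
            if rid in path:
                continue  # do not reuse a reaction within one pathway
            new_path = path + (rid,)
            if nxt in central:
                if nxt != start:
                    results.append(new_path)  # pathway re-enters central metabolism
            elif len(new_path) < k_max:
                queue.append((nxt, new_path))  # extend through a non-central metabolite
    return results
```

Because the queue is processed level by level, shorter pathways are reported before longer ones, and the bound `k_max` guarantees termination even when the candidate reactions contain cycles.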

3.5 Confirming the functionality of the possible reaction additions and identifying favorable gene knockouts

All found paths must be examined further to eliminate those with dead-end metabolites in non-terminal reactions (see Section 3.4 for an explanation of dead-end metabolites). At the same time, the overall functionality of the pathways is confirmed. The methods provided by the COBRA Toolbox were used for this inspection. A found path is added to the model and the flux through one of the path reactions is maximized using FBA. If the reaction carries even a minimal flux, the pathway is accepted as functional. Otherwise, the pathway is discarded from further analysis. Approximately 20% of the found paths were functional in the simulations.

Finally, GDLS is used to identify pathways and gene deletions that improve the current solution and couple growth and product formation. GDLS is run for each functional pathway found. The resulting growth rate is required to be over 10% of the wild-type growth rate. Then, if the coupling is successful, i.e. the production flux of the desired metabolite at the maximal biomass formation rate has increased from the previous maximal solution, the solution (reaction additions and/or gene deletions) giving the greatest production flux in GDLS is chosen and the modifications are implemented in the full model. Otherwise, no modifications are made. This model modification step concludes an iteration, and the algorithm starts again with a different starting metabolite for the reaction pathway search.

As a result, a model with an enhanced production rate of the desired metabolite and maintained biomass formation is obtained, together with a list of the modifications needed to achieve it.
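The acceptance rule applied to the GDLS results can be written down compactly. The following Python sketch is an illustration only: the `Solution` record and the function `choose_modification` are hypothetical names, the numbers a caller passes in would come from GDLS, and treating a growth rate of exactly 10% of the wild type as acceptable is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Solution:
    label: str                    # e.g. the pathway and knockouts it contains
    growth: float                 # predicted maximal growth rate
    product_at_max_growth: float  # product flux at maximal growth


def choose_modification(
    candidates: List[Solution],
    wild_type_growth: float,
    current_product: float,
    min_growth_fraction: float = 0.10,
) -> Optional[Solution]:
    """Pick the candidate with the greatest product flux among those that
    keep enough growth and improve on the current solution, else None."""
    acceptable = [
        c for c in candidates
        if c.growth >= min_growth_fraction * wild_type_growth  # viability
        and c.product_at_max_growth > current_product          # improvement
    ]
    if not acceptable:
        return None  # no modification is made in this iteration
    return max(acceptable, key=lambda c: c.product_at_max_growth)
```

Returning `None` mirrors the case in which no modifications are made and the algorithm simply proceeds to the next starting metabolite.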

4 RESULTS

The algorithm was demonstrated by designing modifications to the metabolism of Acinetobacter baylyi ADP1 with the aim of increased acetate production. Acetate was chosen because there are previous studies and experimental data on acetate production strains [16, 38]. Additionally, acetate is produced in the chemical industry in large amounts, and competitive biological production routes are being sought to replace petrochemical synthesis routes.

Acetate is the anion of acetic acid, which has the chemical formula CH3COOH. In households, it can be found in vinegar, of which it is the main component apart from water, and in other food products as a food additive (code E260). In industry, it is commonly used as a reagent, for instance in the production of plastics, dyes and cosmetics. Acetate is produced mainly from oil. Still, bacterial fermentation remains important because many laws require that the acetate used in foods be of biological origin [39].

With the modifications suggested by the algorithm, a 2-fold increase in the maximal production rate was obtained according to FBA analyses. In other words, the maximum obtainable flux increased from 500 to 1000 mmol/gdw/hr (millimoles per gram dry weight per hour). Additionally, there was a roughly 2.5-fold improvement in the maximal obtainable biomass formation rate, which increased from 12.3 to 30.7 1/hr. The uptake rate of the carbon source, succinate, was constrained between 0 and 1000 mmol/gdw/hr.

The results are illustrated by the production envelopes of the wild-type and the mutant model in Figure 4.1. In a production envelope, the multidimensional flux cone is projected onto two of its dimensions, namely the biomass and objective metabolite fluxes. The production envelope shows the coupling of the two fluxes and their possible flux distributions. If there is a flux of the desired product at the point of maximum biomass formation in the production envelope, the fluxes are coupled.
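The coupling criterion read from a production envelope can be expressed as a short check. The Python sketch below is illustrative only: the function name `is_coupled` and the representation of an envelope as a list of (biomass rate, minimum product rate, maximum product rate) points are assumptions, and so is the choice to require a positive minimum product rate at maximal biomass, which is the stricter reading of "there is a flux of the desired product". The numbers in the usage lines are made up and are not results of the thesis.

```python
def is_coupled(envelope, tol=1e-9):
    """Return True if product formation is forced at maximal biomass.

    envelope: list of (biomass_rate, min_product_rate, max_product_rate)
    tol:      fluxes at or below this value are treated as zero
    """
    # Locate the envelope point with the highest biomass formation rate.
    _, min_product, _ = max(envelope, key=lambda point: point[0])
    # Coupled if even the minimum product rate there is above zero.
    return min_product > tol


# Hypothetical envelopes: product can be zero at maximal growth (uncoupled)
# versus product is unavoidable at maximal growth (coupled).
uncoupled = [(0.0, 0.0, 10.0), (0.5, 0.0, 6.0), (1.0, 0.0, 0.0)]
coupled = [(0.0, 0.0, 12.0), (0.5, 2.0, 9.0), (1.0, 5.0, 5.0)]
```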
The optimization strategy found by the algorithm contained no gene knockouts, only reaction additions. In total, 27 reaction pathways, each consisting of 1 to 3 reactions, were added to obtain the solution. The additions included pathways in, for example, amino acid, pyrimidine, and coenzyme A (CoA) metabolism.

Figure 4.1. Acetate production envelopes of both the wild-type and mutant model of A. baylyi ADP1. The acetate production maximum and minimum rates are presented as a function of biomass formation rate. In the wild type model, the two fluxes are coupled. The curves were produced using a COBRA Toolbox [11] function in MATLAB.

5 DISCUSSION

5.1 Validity of the results

The added reaction pathways included reactions in the biosynthesis of many biomolecules essential for the growth, development and reproduction of microorganisms. For example, pathways in amino acid, pyrimidine, and coenzyme A metabolism were included. These are all important processes for the cell. Amino acids are needed as the building blocks of proteins, pyrimidines are components of the nucleotides in DNA and RNA, and CoA acts as a substrate in several processes, such as the citric acid cycle and the synthesis of fatty acids.

It was quite surprising that no gene deletions were suggested by the algorithm. A comparison with previous studies could not be made, since no previous reports on acetate production using reaction additions were found. Also, the number of suggested reaction additions is quite high, because the number of modifications that can be made experimentally is limited. In theory, all genes except the essential ones could be knocked out, but in practice additions and deletions in the cell are technically difficult to perform. Additionally, only a limited number of antibiotic resistance genes, which are needed to select the successfully modified cells, are available. Therefore, at the moment, a maximum of about 10 modifications can be made to a bacterium with reasonable effort.

The validity of the metabolic engineering strategies suggested by the algorithm depends on the validity of the metabolic model used and on the correctness of the reactions in the non-native pathways. First, the metabolic models analyzed by the algorithm lack, for example, kinetic and regulatory constraints, which may result in flux distributions that are infeasible in practice. There have been attempts to incorporate these factors into the models and to develop production system design tools for them [40].
That would make the models better representations of the organisms and allow more varied metabolic engineering strategies to be identified. It would also reduce the solution space, which makes the results easier to analyze [41].

Second, the validity of the results could be improved by refining the reactions in the non-native pathways. For example, all reactions in KEGG are marked as reversible, and that assumption was also used in this work. However, a careful inspection of the reactions would probably show that some of them are irreversible in practice. This refined information would help the algorithm to produce more accurate results. Furthermore, elementally unbalanced reactions in reaction databases have been reported, for example, by Pharkya et al. (2004) [17], especially with respect to hydrogen atoms. This can lead to false flux distribution results.

5.2 Future directions

The computing time of the algorithm depends on the number of paths found in the breadth-first search (BFS) and on the size of the central metabolism. In this work, using a central metabolism of 438 metabolites, the total runtime varied from 8 to 40 hours, depending also on the objective metabolite. The majority of the time is used by GDLS and its SCIP solver [42]. The computations were done using 3.00 GHz Intel Core processors.

The algorithm could be developed further to achieve even more successful coupling of target product and biomass formation together with good computational efficiency. This could be done by increasing the number of iterations and analyses on the model. For example, with a method to identify and screen out reaction pathways analogous to those already analyzed, the number of reactions to analyze could be decreased. Computational efficiency could also be improved with more efficient LP and MILP solvers.

In the future, more detailed models and computational methods for analyzing them will become available. The algorithm could be developed to utilize them for even more accurate predictions. Next, tests with various microorganisms and products are to be performed and their results validated experimentally. These experimental results might help to develop the algorithm. In addition, a more detailed analysis of the performance of the algorithm is to be conducted.

5.3 Conclusion

In this thesis, an algorithm for designing production strains in silico was developed and demonstrated with acetate production in A. baylyi ADP1. Such algorithms have been of great interest recently. In contrast to many earlier strain design algorithms that focus only on identifying gene deletions, this algorithm also searches for non-native pathway additions.
The acetate production demonstration showed that the approach is capable of production strain design, although further testing and experimental validation remain to be conducted. Furthermore, the algorithm has potential for future development.

REFERENCES

[1] Y.-T. Yand, G. N-Bennet, and K.-Y. San, Genetic and metabolic engineering, Electronic Journal of Biotechnology,
[2] M. Gavrilescu and Y. Chisti, Biotechnology - a sustainable alternative for chemical industry, Biotechnology Advances, vol. 23, pp , Nov
[3] G. Chotani, T. Dodge, A. Hsu, M. Kumar, R. LaDuca, D. Trimbur, W. Weyler, and K. Sanford, The commercial production of chemicals using pathway engineering, Biochimica et Biophysica Acta - Protein Structure and Molecular Enzymology, vol. 1543, no. 2, SI, pp ,
[4] E. M. Green, Fermentative production of butanol - the industrial perspective, Current Opinion in Biotechnology, vol. 22, pp , Jun
[5] Z.-J. Zhao, C. Zou, Y.-X. Zhu, J. Dai, S. Chen, D. Wu, J. Wu, and J. Chen, Development of L-tryptophan production strains by defined genetic modification in Escherichia coli, Journal of Industrial Microbiology & Biotechnology, vol. 38, no. 12, pp ,
[6] J. M. Park, T. Y. Kim, and S. Y. Lee, Constraints-based genome-scale metabolic simulation for systems metabolic engineering, Biotechnology Advances, vol. 27, no. 6, pp ,
[7] I. Thiele and B. O. Palsson, A protocol for generating a high-quality genome-scale metabolic reconstruction, Nature Protocols, vol. 5, no. 1, pp ,
[8] J. Edwards and B. Palsson, The Escherichia coli MG1655 in silico metabolic genotype: Its definition, characteristics, and capabilities, Proceedings of the National Academy of Sciences of the United States of America, vol. 97, pp , May
[9] A. R. Zomorrodi and C. D. Maranas, Improving the iMM904 S. cerevisiae metabolic model using essentiality and synthetic lethality data, BMC Systems Biology, vol. 4,
[10] M. Durot, F. Le Fevre, V. de Berardinis, A. Kreimeyer, D. Vallenet, C. Combe, S. Smidtas, M. Salanoubat, J. Weissenbach, and V. Schachter, Iterative reconstruction of a global metabolic model of Acinetobacter baylyi ADP1 using high-throughput growth phenotype and gene essentiality data, BMC Systems Biology, vol. 2, Oct