Regulatory Dynamics in Engineered Gene Networks

The Physico-chemical Foundation of Transcriptional Regulation with Applications to Systems Biology

Mads Kærn
Boston University
Center for BioDynamics, Center for Advanced Biotechnology, Department of Biomedical Engineering

Disclaimer. This document contains material that has been reproduced from various sources without permission from the copyright owners. As a result, the document may only be distributed to participants of the 4th International Systems Biology Conference, Washington University, St. Louis. All other material is © Mads Kærn. The document is intended for educational purposes only. Should any copyrights have been infringed, please contact the author and the material will be removed immediately.

Contents

1 The Biology of Gene Expression
  The Genetic Code
  Genes and Gene Expression
  Transcription and Translation
  Prokaryotic Cells
  Eukaryotic Cells
  Regulation of Gene Expression
  The Lactose Operon of E. coli
  The Genetic Switch in Bacteriophage λ
  The Galactose Regulon in S. cerevisiae

Engineered Gene Networks
  Some Tools of the Trade
  Cutting and Pasting DNA
  Plasmid Vectors
  Extracting DNA Sequences
  Engineering Regulatory Modules
  Genetic Switches in E. coli
  Genetic Switches in S. cerevisiae
  Mammalian Switches
  Engineering Regulatory Circuits
  How Transcriptional Regulation Works

Modeling Small Gene Networks
  Biochemical Reaction Kinetics
  Elementary Reactions
  Law of Mass Action
  Generalized Mass Action
  Chemical Equilibrium
  The Michaelis-Menten Reaction
  Hill-type Kinetics
  Modeling Gene Expression
  Modeling cis-regulatory Systems
  Repressor-Operator Binding
  Alternative Reaction Paths
  Cooperative Binding of Dimers
  Synergism in RNA Polymerase Binding
  DNA looping
  Models of Gene Regulatory Systems
  The Lactose Operon in E. coli
  The Galactose Regulon in S. cerevisiae
  Models of Engineered Gene Networks
  Concluding Remarks

Foreword

The future success of Systems Biology requires the establishment of general principles and the development of methodologies that can be used to link the behavior of individual molecules to system characteristics and functions. To achieve this goal, we need to study systems that have been characterized in minute detail and are sufficiently small to be manageable. The central theme of this tutorial is the use of engineered gene networks to deduce principles that govern gene transcription and to develop reasonably accurate system-level models from qualitative molecular-level information.

The tutorial consists of three parts: (1) The Biology of Gene Expression, (2) Genetic Network Engineering and (3) Modeling Small Gene Networks. No previous knowledge of molecular biology is assumed. The purpose of Part (1) is to provide a brief introduction to the fundamental biology of gene expression and a discussion of the current theories of gene regulation in bacteria and in yeast. Part (2) provides a basic introduction to some experimental techniques. Its main emphasis, however, is a discussion of how genetically engineered systems have provided support for the theories of transcriptional regulation introduced in Part (1) and how they are used to investigate system-level characteristics and function. Part (3) discusses the physico-chemical basis of gene regulatory systems and provides a detailed and rigorous methodology that can be used to convert qualitative molecular-level models into quantitative system-level descriptions. Particular emphasis will be given to the limits and dangers of quantitative modeling, of which any researcher in Systems Biology should be aware.

Discussions and comments from Michael Driscoll and Michael Thompson have been very valuable during the writing of these notes. The reader is kindly reminded that the notes serve only as a brief introduction to a very large subject area. They include material that I believe is of most relevance to readers who are involved in the mathematical and computational aspects of Systems Biology and are looking for a brief summary of important aspects of molecular biology and physical chemistry. I have attempted to present this material in a way that is accessible to a non-specialist audience. Despite my best efforts, this has undoubtedly resulted in descriptions that are in many respects oversimplified. My sincere apologies go to the authors of the books and articles that for one reason or another did not make the Suggested Readings lists. Please report errors and mistakes to mkaern@bu.edu. Suggestions that can be used to improve the quality of future versions are most welcome.

Tutorial Part 1

The Biology of Gene Expression

Biological control systems are extraordinarily sophisticated, and regulation takes place on many different levels simultaneously. Novel and surprising details are constantly revealed as our experimental methods continue to improve and new technologies are invented. As a result, it is difficult to organize a comprehensive presentation of general aspects without getting mired in details that may not appear to be important, but probably are. Since most engineered cellular control systems currently involve manipulation of the information stored in the cell's DNA, this tutorial will focus on regulatory processes at the level of gene transcription. In this section, I will briefly summarize some basic concepts from molecular biology and then discuss some of the general principles involved in the regulation of gene transcription. This discussion will be augmented by a walk-through of some of the best-studied natural gene regulatory systems.

1.1 The Genetic Code

Most regulatory processes that take place within cells involve proteins whose structure and function are determined by information stored in the cell's DNA. Genetic engineering and the engineering of gene networks involve the manipulation of this information and of the conditions under which it is used to synthesize proteins. The DNA molecule encodes information in the four nucleotides containing the bases adenine (A), guanine
(G), cytosine (C) and thymine (T). The molecular structure of the nucleotides is illustrated in Fig. 1.1A. In the figure, carbon atoms are indicated as solid circles, lines indicate covalent bonds between atoms and sticks indicate a covalent bond that ends in a hydrogen. RNA and DNA differ in the identity of the group bound to the carbon at position 2′ in the sugar ring (marked by an X in Fig. 1.1A). RNA has a hydroxyl group bound at this position while DNA has a hydrogen atom.

Polynucleotide chains are formed by individual nucleotides being linked to each other through phosphodiester bonds. This bond is formed between the phosphate group bound to the carbon at position 5′ of one nucleotide and the oxygen bound to the carbon at position 3′ of the next, and it establishes the 5′→3′ directionality of the polymer chain. Under normal conditions, DNA is in a double-stranded form that consists of the 5′→3′ strand and its complement, in which the direction of the DNA backbone is reversed (Fig. 1.1B). Bases on opposite strands are paired with each other through hydrogen bonds such that A pairs with T and C pairs with G. The double-stranded DNA forms a helical structure (Fig. 1.1C).

Figure 1.1: (A) The molecular structure of ribonucleic acid (RNA) and deoxyribonucleic acid (DNA). RNA and DNA have a hydroxyl group and a hydrogen atom at position X, respectively. In DNA, the base bound to the carbon at position 1′ is adenine (A), guanine (G), cytosine (C) or thymine (T). In RNA, thymine is replaced by uracil (U). (B) Double-stranded DNA. Hydrogen bonds (broken lines) are formed between the bases A and T or G and C and link together two complementary single-stranded DNA molecules. (C) The helical structure of double-stranded DNA.

The synthesis of a protein based on the DNA-encoded amino acid sequence requires at least two steps. First, the genomic information must be transcribed from the DNA sequence into a messenger RNA molecule (mRNA). This is done by an RNA polymerase, which, in analogy to DNA polymerase, catalyzes the formation of phosphodiester bonds between individual nucleotides (Fig. 1.1B). The structure of RNA molecules is similar to that of DNA molecules with the exception that the backbone contains ribose rather than deoxyribose and the base thymine is replaced by the base uracil (U). Furthermore, the mRNA is usually single-stranded. After transcription, the message contained in the mRNA must be translated into a protein. This is done by the ribosome, which is a molecular machine made of both
RNA and protein. The process of translation involves two additional types of RNA molecules, ribosomal RNA (rRNA) and transfer RNA (tRNA). The rRNA molecules are components of the ribosome. The tRNAs provide the specificity that enables the insertion of the correct amino acid into the protein that is being synthesized.

Proteins consist of a chain in which individual amino acid residues are linked to each other through peptide bonds. The general structure of the amino acids is illustrated in Fig. 1.2A. In analogy with DNA and RNA, they contain a common element that enables the formation of a polymer chain. The identity and the properties of the individual amino acids are determined by the side chain. There are 20 naturally occurring amino acids. In the polymer chain that forms the backbone of proteins, the individual amino acids are linked to each other through peptide bonds formed between the carboxyl group of one amino acid and the amino group of another (Fig. 1.2B). This creates a chain that has a free amino group at one end, the N-terminal (NH3+), and a free carboxyl group at the other end, the C-terminal (COO−).

Figure 1.2: (A) The molecular structure of amino acids. The identity of the amino acid is determined by its side chain. (B) Peptide bond formed between the amino and the carboxyl groups of two amino acids. (C) The correspondence between the DNA sequence, the mRNA sequence and the sequence of the first eight amino acids of the LacR repressor protein.

The DNA molecule stores the information required to synthesize proteins as a string of codons. A codon consists of three nucleotides, each selected from the four available bases (A, T, G or C), which are read from the DNA molecule in the 5′ to 3′ direction. In Fig. 1.1B, the codon encoded on the left strand is AGT while the codon encoded on the right strand is ACT. Of the 64 possible codons, 61 encode one of the 20 amino acids (Table 1.1). The genetic code is thus redundant, and different codons may identify the same amino acid. The last three codons (TAA, TAG and TGA) are stop codons. They define the end of the protein-encoding region of the DNA. In addition, the order of the amino acids in the polypeptide chain is determined by the sequence in which the codons appear in the DNA sequence. In most cases, there is a linear relationship between the DNA sequence and the amino acid sequence of the protein that the sequence encodes. This is illustrated in Fig. 1.2C, which shows the first 24 base pairs of the gene that encodes the LacR repressor protein, the corresponding mRNA sequence and the sequence of the first 8 amino acids in the LacR repressor
polypeptide chain. The N- and C-terminal regions are encoded by the codons at the 5′ and the 3′ ends of the protein-encoding DNA sequence, respectively. Once translation is completed and the full-length DNA-encoded polypeptide has been formed, the function of many proteins requires the completion of additional steps. This may involve, for example, covalent modification, such as phosphorylation, acetylation or glycosylation (i.e., the addition of a phosphate, an acetyl or a glycosyl group), the incorporation of the protein into multi-protein complexes or the transportation of the protein to its appropriate cellular location, for instance, in the cell membrane.

1st (5′)  T              C          A              G           3rd (3′)
A         Isoleucine     Threonine  Lysine         Arginine    A
A         Isoleucine     Threonine  Asparagine     Serine      T
A         Isoleucine     Threonine  Asparagine     Serine      C
A         Methionine     Threonine  Lysine         Arginine    G
T         Leucine        Serine     STOP           STOP        A
T         Phenylalanine  Serine     Tyrosine       Cysteine    T
T         Phenylalanine  Serine     Tyrosine       Cysteine    C
T         Leucine        Serine     STOP           Tryptophan  G
C         Leucine        Proline    Glutamine      Arginine    A
C         Leucine        Proline    Histidine      Arginine    T
C         Leucine        Proline    Histidine      Arginine    C
C         Leucine        Proline    Glutamine      Arginine    G
G         Valine         Alanine    Glutamic acid  Glycine     A
G         Valine         Alanine    Aspartic acid  Glycine     T
G         Valine         Alanine    Aspartic acid  Glycine     C
G         Valine         Alanine    Glutamic acid  Glycine     G

Table 1.1: The correspondence between the sequence of bases in the codons and the amino acids. The first base (5′) is given in the leftmost column, the second base in the column headings (T, C, A, G) and the third base (3′) in the rightmost column. The codons TAA, TAG and TGA signal termination of translation.

1.2 Genes and Gene Expression

The term gene is usually used to refer to a DNA sequence that is transcribed into mRNA and subsequently translated into a protein. However, there are important exceptions to this rule. For example, DNA sequences that encode molecules like rRNA and tRNA are genes even though the RNA molecule is never translated into a protein. Genes are usually carried on the cell's chromosomes. Each chromosome carries at least one origin of replication. These regions determine the location where the DNA polymerase initiates the duplication of the genetic material. The location of a specific gene on the chromosome is called the gene's locus. Haploid cells carry a single copy of each
chromosome and the locus thus uniquely determines the location of the gene. Diploid cells have homologous chromosome pairs. Two different forms of the same gene are known as alleles.

The chromosomes are organized very differently in prokaryotic cells, which lack a cell nucleus, and in eukaryotic cells. In bacteria (prokaryotes), such as Escherichia coli, all of the genes are located on a single, circular chromosome while the genes in eukaryotic cells are located on several linear chromosomes. There are 16 chromosomes in yeast. In addition, eukaryotic DNA is complexed with nuclear proteins and compacted into a structure called chromatin. Central to this structure is the wrapping of approximately 200 base pairs of DNA around protein complexes known as nucleosomes (Fig. 1.3A). The organization of chromatin and of the nucleosomes can be used as an instrument to regulate which genes are accessible for transcription by RNA polymerase (discussed in section 1.3.2). The primary constituents of the nucleosomes are the four histone proteins H2A, H2B, H3 and H4, which combine to form a histone tetramer (Fig. 1.3B). A nucleosome consists of two histone tetramers. Each histone subunit has a protruding N-terminal tail that serves important regulatory functions. There, covalent modifications, such as acetylation, can greatly influence the accessibility of the DNA. The nucleosomes are, together with other nuclear proteins, arranged into chromatin fibers. Examples of potential spatial arrangements of the nucleosomes are shown in Fig. 1.3C.

Figure 1.3: (A) Schematic illustration of DNA wrapped around a nucleosome. (B) The primary component of the nucleosomes consists of the four histone proteins H2A, H2B, H3 and H4. The nucleosomes can be remodeled and rearranged spatially by covalent modification of the protruding histone tails. (C) Illustration of potential organizations of nucleosomes in spatial structures (reproduced without permission from Bednar et al., © National Academy of Sciences).

In addition to the chromosomes, genes can be carried on plasmids. Plasmids are in many ways similar to the bacterial chromosome. They are circular pieces of DNA that typically replicate independently of the duplication of the chromosomal DNA prior to cell division. As a result, plasmids are often present in multiple copies within each cell, and the plasmid copy number usually changes as cells progress through the cell division cycle. The average plasmid copy number per cell depends on the type of the origin
of replication that it carries. Some plasmids are stringently controlled and are present only in a single copy while others are loosely regulated and present in 60 or more copies per cell. Plasmids are used widely in genetic engineering (Tutorial Part 2).

In addition to the sequences that encode genes, the DNA contains regions that are involved in the regulation of gene transcription. The RNA polymerase reads the genetic code in the 5′ to 3′ direction, and the location where it initially binds to the DNA is upstream of the gene, i.e., farther in the 5′ direction (Fig. 1.4). The region where the RNA polymerase initially contacts the DNA is called the promoter of the gene whose expression it facilitates. The expression of a gene may occur from more than one promoter, i.e., the region upstream of the gene may contain distinct binding sites for the RNA polymerase. The first nucleotide that is transcribed is usually labeled +1, and nucleotides are counted relative to this transcription start site in the 5′ to 3′ direction of the DNA. The nucleotides in the gene-encoding region are thus labeled with positive numbers while nucleotides within the promoter region are labeled with negative numbers. In bacteria, the promoter region is about 60 base pairs in length and spans roughly 40 base pairs upstream and roughly 20 base pairs downstream of the +1 site. In yeast, the promoter region spans roughly 200 base pairs. Generally speaking, no two promoters are identical.

Figure 1.4: Typical organization of a gene containing the information required for the synthesis of a protein. The promoter is the region where the RNA polymerase initially binds. The terminator is the region where the RNA polymerase is released from the DNA. The DNA also contains regions that, when transcribed into mRNA, control translation initiation (5′ UTR) and termination (3′ UTR).

Statistical analysis has, however, shown that there are regions that are highly conserved among different promoters. In bacteria, one of these regions is located at position -10 and has the consensus sequence TATAAT. This region is called the TATA-box and is in many cases essential for the proper alignment of the RNA polymerase holoenzyme with respect to the gene-encoding sequence. Mutations of the TATA-box sequence, i.e., the substitution of one nucleotide with another, can greatly affect the rate at which the DNA is transcribed into mRNA. A sequence that is similar to the TATA-box is also important for the transcription of many eukaryotic genes. In addition to the TATA-box, the promoter region often contains sites where transcription factor proteins can bind and directly or indirectly affect the rate of transcription. In bacteria, transcription factor binding sites are often referred to as operators.
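
The promoter numbering and the -10 consensus described above can be made concrete with a small sketch. The Python code below slides a window along a sequence, counts mismatches against the consensus TATAAT, and converts a string index into a coordinate relative to the +1 site. The function names, the example sequence and the convention that there is no position 0 (the nucleotide before +1 is labeled -1) are assumptions made for illustration; the mismatch count is only a crude stand-in for promoter strength, not a measure used in these notes.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def best_consensus_match(seq, consensus="TATAAT"):
    """Return (start_index, mismatches) of the window closest to the consensus."""
    seq = seq.upper()
    k = len(consensus)
    if len(seq) < k:
        raise ValueError("sequence shorter than consensus")
    scores = [(hamming(seq[i:i + k], consensus), i)
              for i in range(len(seq) - k + 1)]
    mismatches, start = min(scores)  # ties are resolved toward the smallest index
    return start, mismatches


def promoter_coordinate(index, tss_index):
    """Map a 0-based string index to a coordinate relative to the +1 site.

    Assumed convention: the transcription start site is +1 and the
    nucleotide immediately upstream is -1 (there is no position 0).
    """
    offset = index - tss_index
    return offset + 1 if offset >= 0 else offset


# Hypothetical sequence, invented for illustration only.
upstream = "GCGCGCGG" + "TATAAT" + "GCGCGC"  # 20 nucleotides upstream of +1
seq = upstream + "ACGGCCGC"                   # index 20 is the +1 nucleotide
TSS_INDEX = len(upstream)

start, mismatches = best_consensus_match(seq)
# The hexamer starts at coordinate -12 (it spans -12 to -7) with 0 mismatches.
print(promoter_coordinate(start, TSS_INDEX), mismatches)
```

A point mutation in the hexamer shows up as a nonzero mismatch count: for the hypothetical fragment `"GGTATGATGG"`, `best_consensus_match` returns `(2, 1)`.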

However, such regulatory elements may also be located far from the promoter region or even within the gene-encoding region of the DNA. In eukaryotes, it is quite common to find enhancer sequences that affect transcription from a promoter located very far away in the DNA sequence. This action-at-a-distance can arise from the rearrangement of chromatin structure and/or from the close spatial proximity of transcription factors bound to the enhancer sequence due to bending and looping of the DNA. Transcription factor binding sites are referred to as cis-regulatory elements while the transcription factor proteins that bind to them are referred to as trans-regulatory elements.

In addition to the promoter and cis-regulatory elements, there are sequences within the DNA that determine the termination of transcription (a terminator sequence) and, for protein-encoding genes, sequences that determine the region of the mRNA that is to be translated into protein (Fig. 1.4). The codon that indicates the location where translation is to start, the translation start codon, is often ATG. The DNA sequence located between the start site of transcription and the start of translation is referred to as the 5′ untranslated region (5′ UTR). UTRs can greatly influence the efficiency of gene expression, for example by determining how well the ribosomes can bind to the mRNA and initiate translation. The translation stop codon, which indicates the location where translation is terminated, is either TAA, TAG or TGA. The sequence of the DNA between the stop codon and the site where transcription is terminated can also have an effect on the efficiency of gene expression. This region is referred to as the 3′ UTR.

1.3 Transcription and Translation

As with the definition of a gene, the meaning of gene expression is not always clearly defined. Some use the term gene expression to refer to the biological manifestation in terms of an alteration in phenotype, that is, an observable change in the characteristics of the cell. The gene that is responsible for a specific cellular trait can be said to be expressed when the phenotype is observed and not expressed otherwise. In other words, gene expression can be viewed as a binary on/off process. Others use gene expression to refer to the process that starts when the transcription of the DNA that encodes the gene is initiated and ends when a biologically functional molecule is formed, regardless of whether this is accompanied by a detectable change in the cell's phenotype. In this view, gene expression can be graded and quantified based on measurements of the activity of the end product of the gene expression process. Since many proteins require some post-translational modification to be fully functional, e.g., the attachment of a phosphate group or the incorporation of the protein into a larger complex, it can be argued that such events are part of the process in which the genomic information is expressed. Generally speaking, however, there will be a positive correlation between the rate at which a gene is transcribed and the abundance (and hence the activity) of the end product of the gene expression process. Typically, if a gene's mRNA is abundant within a cell, there will be a high level of the corresponding protein product. Transcription is usually a prerequisite for gene expression and the
control of transcription is one of the most important regulatory instruments available to the cells. In prokaryotes as well as eukaryotes, the transcription of a gene into a corresponding mRNA occurs in three general steps: transcription initiation, elongation of the mRNA and termination of transcription. Gene expression can be regulated at all of these levels. Regulation of gene expression at the level of transcription initiation is, however, the most common.

1.3.1 Prokaryotic Cells

The general steps involved in the transcription of prokaryotic genes are illustrated in Fig. 1.5. The RNA polymerase core enzyme, which is a multi-component complex consisting of two α, one β and one β′ subunit, must first bind to the DNA at an appropriate position relative to the gene that is to be transcribed. Molecules known as sigma factors facilitate the appropriate positioning of the core enzyme. The sigma factor combines with the core enzyme to form the RNA polymerase holoenzyme (Fig. 1.5A). The sigma factor provides specificity to the RNA polymerase holoenzyme and ensures that transcription occurs only from promoters. In addition, sigma factors serve as global regulators of gene expression and are used to direct transcription on a genome-wide scale. For example, by recognizing specific cis-regulatory elements, the sigma factor σ54 can direct transcription of a set of genes that are not transcribed when the RNA polymerase holoenzyme contains the sigma factor σ70. The discussion below addresses transcription from promoters by the holoenzyme containing the most common sigma factor, σ70.

The binding of the holoenzyme to a promoter is usually considered to be reversible, with many association and dissociation reactions taking place before transcription is initiated. The binding affinity depends on the promoter sequence and is, as mentioned, particularly sensitive to variation in the bases of the TATA-box. The binding of the RNA polymerase holoenzyme to the promoter leads to the formation of a closed complex in which the DNA remains in its double-stranded duplex form (Fig. 1.5B). The initiation of transcription involves a transition from the closed complex to an open complex. In the open complex, the helical structure of the DNA is disrupted to expose a single-stranded region of the DNA near the transcription initiation site (Fig. 1.5C). The transition from closed to open complex is usually considered irreversible, and transcription will usually occur once the open complex has been established.

Once the open complex is formed, synthesis of the mRNA molecule begins with the formation of a phosphodiester bond between the first two ribonucleotides that are base-paired to the DNA template. The result is the formation of a ternary complex consisting of the holoenzyme, DNA and RNA (Fig. 1.5D). Further ribonucleotides are then added to the RNA chain, up to a length of 9 bases. During this stage, a transition back to the open complex (Fig. 1.5C) can occur by the release of the short nucleotide chain from the ternary complex. This is known as abortive initiation. In order to synthesize RNA chains longer than 9-10 bases, the RNA polymerase must move along the DNA in the 5′ to 3′ direction. This requires that the sigma factor be
released from the holoenzyme and that an elongation complex, composed of the RNA polymerase core enzyme, DNA and RNA, be formed (Fig. 1.5E). This complex can move along the DNA and synthesize full-length mRNA. After transcription is initiated, it usually takes some time (1-2 seconds) before the core enzyme clears the promoter region and another holoenzyme can bind.

Figure 1.5: Prokaryotic transcription initiation. (A, B) The binding of the RNA polymerase holoenzyme to the promoter is facilitated by the sigma factor. (C) The DNA is opened to expose a region of single-stranded DNA. (D) The single-stranded DNA is used to synthesize short RNAs. (E) The release of the sigma factor allows the RNA polymerase to move down the gene and produce a full-length mRNA. The mRNA is translated into protein as soon as it emerges from the elongating complex.

Translation of the mRNA is facilitated by the ribosomes and, in bacteria, occurs simultaneously with the elongation of the mRNA transcript (Fig. 1.5E). The translation start site is located downstream of the transcriptional start site, and the sequence between them defines the 5′ UTR. The 5′ UTR contains the site where the ribosomes initially bind to the mRNA, the ribosome-binding site (RBS), and is the part of the RNA molecule that first emerges during transcription elongation. A ribosome binds to the RBS as soon as it emerges from the elongating ternary complex and immediately starts to synthesize the polypeptide chain encoded in the mRNA. The efficiency of translation can vary substantially depending on the RBS sequence and/or the distance between the transcription and the translation start sites.

As mentioned above, the RNA polymerase holoenzyme mediates transcription from a different set of promoters when it contains the sigma factor σ54 rather than σ70. It
also initiates transcription in a manner that is different from the scenario illustrated in Fig. 1.5. The holoenzyme is able to bind to the promoter region to form the closed complex. However, the transition to the open complex does not happen spontaneously unless the σ54 subunit is in the correct conformation. The required modification of the holoenzyme can be provided by DNA-binding proteins that have ATPase activity. These activators bind to the DNA upstream of the closed complex and, subsequently, make contact with the σ54 subunit through a DNA loop. ATPases are able to couple the energy released by the hydrolysis of the energy-storage molecule ATP to a specific process, in this case a conformational change in the N-terminal region of the σ54 subunit. Transcription mediated by the σ54-holoenzyme is thus more complex than that mediated by the σ70-holoenzyme. In fact, the requirements of a transcriptional activator acting at a distance and of an activator-mediated conformational change prior to open complex formation make σ54-mediated transcription appear as a hybrid of prokaryotic and eukaryotic mechanisms of transcription initiation (see below). In addition, transcription mediated by σ54 may involve a process known as transcription reinitiation (discussed in section 1.3.2).

1.3.2 Eukaryotic Cells

Eukaryotes have three different RNA polymerases. RNA polymerases I and III transcribe genes that encode rRNA and tRNA (and other small RNAs), respectively. RNA polymerase II, which consists of 12 subunits, transcribes protein-encoding genes (class II genes). The expression of eukaryotic genes is significantly more complex than that of prokaryotic genes and involves a large number of proteins, many with functions that are not fully understood. The transcriptional machinery in yeast can involve as many as 50 different proteins in addition to the core polymerase. These include general transcription factors (TFs), TATA-box binding protein (TBP) and associated factors (TAFs), the so-called Mediator complex, nucleosome remodelers, histone acetylases (HATs), histone deacetylases (HDACs) and others. The Mediator is a complex that is believed to be one of the components that interact with DNA-bound transcriptional activators, i.e., it mediates the activation signal. Most of these components are complexes that consist of multiple protein subunits (Table 1.2).

Complex          Subunits    Complex             Subunits
TFIIB            1           RNA polymerase II   12
TFIIA, TFIIE     2           TFIID               12
TFIIF            3           Mediator            26
TFIIH            9           SAGA                15

Table 1.2: The number of subunits for some common complexes in the transcriptional machinery. Spt-Ada-Gcn5 acetyltransferase is abbreviated as SAGA. The 12 subunits of TFIID include TBP (one subunit) and TAFs (11 subunits). Numbers are from Ptashne and Gann.

17 1.3 Transcription and Translation 17 It has proven difficult to determine the different steps involved in the initiation of transcription in eukaryotes. Despite our somewhat blurred understanding of the details, general parts of the picture are clear. A number of protein complexes, the general transcription factors (TFs), bind to the DNA and form a scaffold to which the polymerase holoenzyme can bind. This group includes TFIIA and TFIID. The TFIID complex consists of the TATA binding protein (TBP) and TBP associated factors (TAFs) and binds to TATA-box like sequences found about 30 base pairs upstream of the transcription start site of many genes. TATA-box mediated expression is common for class II genes. Transcription from promoters that do not contain a TATA-box is usually mediated by a so-called initiator sequence located at the transcription start site. Some promoters contains a TATA-box as well as an initiator site. The general picture of the recruitment and proper positioning of the RNA polymerase II holoenzyme to a TATA-box-containing promoter is illustrated in Fig The illustrated scenario is based on knowledge obtained from the major late (ML) promoter of adenovirus, and is believed to capture the basic logic of TATA-box mediated transcription in eukaryotes. In the first step, the TFIIA complex and the TBP-containing TFIID and complexes binds to the DNA. The TFIID complex and the TBP protein contained in this complex associates with the DNA near the TATA-box (Fig. 1.6B). The binding of the TBP/TFIID is generally considered to be the rate-limiting step during transcription initiation and often requires the presence of additional factors in the vicinity of the promoter. This is discussed further in section 4.3. The TFIIB complex is then recruited (Fig. 1.6C) to form a scaffold that can bind the RNA polymerase II holoenzyme, including the Mediator complex, and its partner TFIIF (Fig. 1.6D). Then TFIIE and TFIIH are added to form the closed complex (Fig. 
1.6E). It is not entirely clear which components act as individual complexes and which are part of the RNA polymerase II holoenzyme. In some cases, most of the factors may be recruited simultaneously together with the polymerase, corresponding to a direct transition from the pre-initiation scaffold (Fig. 1.6B) to the closed complex (Fig. 1.6E).

The TFIIH complex has helicase activity and can unwind the DNA. It also has kinase activity and can add phosphate groups to the C-terminal region of the largest subunit of RNA polymerase II. This phosphorylation is likely to be critical for the initiation of transcription and appears to trigger open complex formation, the start of transcriptional elongation and RNA synthesis (Fig. 1.6F). The transition from transcriptional initiation to elongation involves the release of TFIIE and TFIIH. TFIIF remains bound to the RNA polymerase as it clears the promoter and moves down the gene.

Interestingly, TFIIA and TFIID may remain bound to the promoter after the polymerase has cleared (Fig. 1.6F). These complexes can act as a scaffold for the binding of the transcriptional apparatus and may allow for repeated rounds of transcription. Since the presence of the scaffold circumvents the rate-limiting binding of the TBP-containing TFIID complex, transcription can take place at an increased rate. Transcription reinitiation has also been reported for genes that are transcribed by RNA polymerase I and III. σ54-mediated transcription in E. coli may also involve reinitiation, as the sigma factor seems to remain attached to the promoter when the polymerase core enzyme moves down the gene.

Figure 1.6: Steps in eukaryotic transcription initiation. (A, B, C) The first factors to bind are TFIIA, TFIID and TFIIB. The TFIID complex contains the TATA-box binding protein (TBP). (D) The binding of the TFIIB complex facilitates the recruitment of the RNA polymerase (RNAP) holoenzyme and its partner TFIIF. This is followed by the binding of TFIIE and TFIIH to form the pre-initiation complex. (E) TFIIH can trigger open complex formation and the initiation of transcription. (F) The elongating complex contains RNAP and TFIIF. TFIIA and TFIID may remain on the promoter after transcription initiation. It is possible that the RNA polymerase holoenzyme contains most of the complexes such that some of the steps after the binding of TFIID are circumvented.

Transcription from TATA-less promoters that contain an initiator site appears to be very similar to the scenario illustrated in Fig. 1.6. While the TBP component of the TFIID complex, for obvious reasons, is unable to assist in its binding to the DNA, the TBP protein is still required for transcription. It is believed that the recruitment of the TFIID complex is due to the subunits TAF250 and TAF150, which together can recognize the initiator sequence. In addition, TFIID may be recruited to the DNA by factors that bind to the initiator site, and the RNA polymerase itself has some affinity for the initiator sequence.

The histones and the nucleosomes have important implications for the transcription of eukaryotic genes. The RNA polymerase is approximately the same size as a nucleosome, and the latter must be displaced during transcription elongation. In addition, tight organization of the nucleosomes and the chromatin fibers may prevent DNA-binding proteins from binding to the DNA. The strength of this barrier, which depends on how strongly the DNA and the nucleosome interact, might be modified by enzymes that have histone acetylase (HAT) or deacetylase (HDAC) activity and by enzymes that are able to alter the spatial organization of the nucleosomes. In addition, certain proteins that are part of the transcriptional apparatus are able to bind to acetylated histones with a relatively high affinity, and it is possible that modification of the nucleosomes may assist in the recruitment of at least some of the components required for transcription initiation.

Once transcription is complete, the mRNA must be modified and transported from the nucleus to the cytosol, where it can be translated into a protein. There are generally three different types of mRNA processing.
The first type of modification occurs shortly after the RNA emerges from the elongating complex and leads to the addition of a methylated guanine nucleotide in reverse linkage, i.e., 5′-5′ rather than the usual 5′-3′, to the protruding end of the RNA. This 5′ methylated cap assists in the further processing of the RNA molecule and in translation. Eukaryotic genes often contain a mixture of non-coding regions, the introns, and coding regions, the exons, and the non-coding regions must be eliminated to form a protein-encoding mRNA. This is done by a process called RNA splicing, in which the introns are cut out and the protein-encoding exons of the RNA are put back together. Splicing of the RNA occurs while the nascent RNA is being transcribed. The last modification of the RNA is the addition of a chain of about 200 adenine nucleotides (the poly-A tail) to its 3′ end. This yields the mature mRNA that is subsequently transported to the cytosol, where it is translated by ribosomes.

1.4 Regulation of Gene Expression

The means employed to regulate gene expression are remarkable and many. The most obvious method of control, and the one that can be most readily manipulated, is the modulation of the frequency of transcription initiation. The next sections will discuss

how this method of gene expression control is utilized in three well-studied systems: the lactose operon in E. coli, the λ CI repressor in bacteriophage λ and the galactose utilization network in Saccharomyces cerevisiae.

Figure 1.7: (A) The genes lacZYA of the lactose operon share the same promoter, Plac, which is repressed by the repressor encoded by the adjacently located lacI gene. (B) Regulatory elements of the Plac promoter. The LacR repressor can bind to the three lacO operators O1, O2 and O3. The CAP protein can bind to the CAP operator. (C) Activation of transcription by CAP. (D) Repression involves DNA looping facilitated by LacR repressor tetramers bound to different operator sites.

The Lactose Operon of E. coli

The lactose operon in E. coli consists of three genes, lacZ, lacY and lacA, whose transcription is initiated from a single promoter region, Plac (Fig. 1.7A). The rate of transcription of the lacZYA genes is regulated by the LacR repressor protein and by a protein called CRP (cyclic AMP receptor protein) or CAP (catabolite activator protein).

CAP can act as a transcriptional activator. It binds as a dimer to an operator site centered at position -61 relative to the transcription start site (Fig. 1.7B). It affects the process of transcription initiation by interacting directly with the α-subunit of the RNA polymerase holoenzyme. It has been observed that the presence of CAP increases the amount of the open complex some 13-fold, but that its presence does not change the rate of the transition between the closed and the open complex. This indicates that CAP may act at the first step in transcription initiation (Fig. 1.5B) by increasing the rate at which the holoenzyme binds to the promoter and/or by decreasing the rate at which the holoenzyme dissociates from the promoter.

The LacR repressor protein is, as the name implies, an inhibitor of transcription of the genes in the lactose operon.
It is expressed constitutively, i.e., at a constant rate, from the Pi promoter of the lacI gene, which is located adjacent to the lactose operon (Fig. 1.7A). The LacR protein binds as a tetramer to three lacO operators, O1, O2 and O3, centered at positions +11, +410 and -82, respectively (Fig. 1.7B). The operators have nearly palindromic sequences and are composed of two half-sites that each make contact with one LacR monomer in the tetrameric repressor complex. It is believed that the binding of the LacR repressor to O1 prevents the binding of the RNA polymerase holoenzyme to the promoter through steric hindrance; the repressor tetramer may simply act as a space-excluding barrier for the incoming holoenzyme.

Elimination of the auxiliary lacO operators O2 and O3 does not abolish the inhibitory function of LacR, but reduces its effect. While elimination of either O2 or O3 causes a 3-fold reduction in repression, eliminating both causes a 70-fold reduction. Thus, the auxiliary operators appear to serve largely redundant roles in the inhibition of transcription by the LacR protein: either one can substitute for the other, and repression is severely weakened only when both are missing. The efficient repression observed in the presence of two or three of the operators is believed to be due to looping of the DNA. The binding of the repressor tetramer to a single operator involves only two of its four subunits, which leaves two subunits capable of binding a second operator site provided that the DNA is twisted into a loop structure (Fig. 1.7D). These loop structures may act as barriers that limit the accessibility of the promoter region to the polymerase and/or as a roadblock to its movement along the DNA.

Figure 1.8: Feedback regulation of the lactose operon. Allolactose inhibits the activity of the repressor and relieves its effect on the transcription of the lactose operon genes. This causes upregulation of lacZ and lacY, which, in turn, causes an increased rate of allolactose production and lactose uptake, respectively.

The above discussion of the regulation of the lactose operon addresses the interaction between cis- and trans-regulatory elements in the promoter region. In addition to this, the activity of the trans-factors, i.e., CAP and the LacR repressor, is extensively regulated.
First of all, the activity of CAP depends on the presence of cAMP. The concentration of cAMP in turn depends on the presence of glucose: it is low when glucose is abundant. As a result, the transcription of the genes in the lactose operon is negatively correlated with the concentration of glucose in the growth medium. CAP affects the transcription of a large number of genes and is a central player in the global gene regulatory system known as catabolite repression. This system ensures that the cell does not wastefully express the genes required for metabolizing other sugars when the energy-rich glucose is available.

The activity of the lactose operon is modulated via a feedback loop involving the proteins LacR, LacZ and LacY (Fig. 1.8). The genes lacZ and lacY encode the enzyme β-galactosidase and the membrane-bound lactose permease, respectively. While the lactose permease enables the transport of extracellular lactose into the cell, the β-galactosidase converts intracellular lactose into glucose and galactose. It also converts some of the lactose into allolactose. Allolactose in turn binds to the LacR tetramer and causes a conformational change, or allosteric transition, to a state that has a significantly reduced affinity for the operator sites. As a result, the presence of small amounts of allolactose, the inducer of LacR, causes an up-regulation of the expression of the lacZYA genes in the lactose operon. This causes an increased rate of lactose uptake (by LacY) and of conversion of lactose into allolactose (by LacZ), which, in turn, lowers the activity of LacR even further.

The lactose operon is thus regulated through a positive feedback loop and catabolite repression. This enables an energy-efficient switch. The lacZYA genes are expressed at low (basal) levels when glucose is present and are only activated when needed, i.e., when glucose is absent and lactose is present. Many other operons are regulated in a manner that resembles that of the lactose operon, and it is a textbook example of a simple gene regulatory circuit.

The Genetic Switch in Bacteriophage λ

The λ CI repressor isolated from the bacteriophage λ of E. coli is a remarkable example of a transcription factor protein. Depending on the operator to which the protein binds, and depending on which promoter is considered, the λ CI repressor can act both as a transcriptional repressor and as a transcriptional activator. The λ CI repressor is part of the regulatory circuitry that enables the λ phage to change its lifestyle from a dormant (lysogenic) state, where it co-exists with its bacterial host, to an active (lytic) state, where the phage rapidly replicates, bursts the host cell and releases a massive number of offspring into the environment. This switch is part of the survival strategy of the virus.
In the lysogenic state, the phage DNA is stably integrated into the chromosome of the host cell and is replicated every time the cell divides. However, the virus is able to detect if the life of the host cell is threatened, for instance following DNA damage, and can switch to the lytic state, where the host cell is abandoned and the released phage particles go in search of suitable hosts to infect. This switch from lysogenic to lytic growth is controlled at the level of gene expression, particularly at the two promoters PR and PRM depicted in Fig. 1.9.

The PR and PRM promoters regulate the expression of the genes cro and cI, which encode the Cro transcriptional regulator and the λ CI repressor protein, respectively. The two promoters share a common regulatory region, called the right operator (OR), and direct transcription in divergent directions along the DNA. Transcription of the cI gene is required for maintenance of the lysogenic growth state. In fact, the RM in PRM refers to repressor maintenance. The transition from the lysogenic to the lytic state is caused by the repression of cI expression and activation of cro expression. The phage DNA also contains a second promoter, the left promoter PL, with a left operator (OL) that is able to bind the CI protein.

Figure 1.9: (A) The divergent PR and PRM promoters regulate the expression of cro and cI. The shared regulatory OR region contains the three binding sites, OR1, OR2 and OR3, to which the CI and Cro proteins can bind. (B) Repression of cI transcription in the lytic state by the binding of the Cro dimer to OR3. (C) Repression of cro transcription by the binding of CI dimers to OR1 and OR2. The transcription of cI is maintained by a positive feedback loop in the lysogenic state.

The OR region contains three adjacent sites, OR1, OR2 and OR3, each of which can bind both Cro and CI. The binding affinities of these sites are such that, at increasing concentrations, CI initially binds to OR1, then OR2 and finally OR3. The binding affinities for Cro are such that it will first bind to OR3 and then to OR1 and OR2. It is the occupancy of these sites that determines which of the two promoters is active (Figs. 1.9B and 1.9C).

The binding of the dimeric Cro protein to OR3 (Fig. 1.9B) causes the repression of cI transcription, but does not alter the transcription of the cro gene. Autorepression by Cro only occurs when it is present in sufficiently high concentrations to occupy OR1 and/or OR2.

The binding of the CI protein, which is also a dimer, to OR1 prevents the binding of the RNA polymerase to the PR promoter, thus preventing the expression of the cro gene (Fig. 1.9C). In addition, once the CI protein is bound to OR1, the affinity of the OR2 site for CI is increased approximately 10-fold. This is due to a direct interaction between the two CI dimers located next to each other with the appropriate relative orientation. Binding of CI to OR2 also prevents the RNA polymerase from binding to the PR promoter and thus causes a further repression of transcription of cro from PR. Interestingly, two CI dimers bound to OR1 and OR2 may interact with two CI dimers bound to the OL region within the PL promoter, located several thousand base pairs away, and may facilitate efficient repression of PR and PL at intermediate concentrations of CI.
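The effect of the cooperative interaction between CI dimers at OR1 and OR2 can be illustrated with a simple equilibrium calculation. The following is only a sketch of a hypothetical two-site model: the function name, the dissociation constants k1 and k2 and the concentrations are made-up values in arbitrary units, chosen for illustration. Only the roughly 10-fold cooperativity factor (omega) is taken from the text. OR3, Cro and the OL region are ignored.

```python
def site_occupancy(c, k1=1.0, k2=10.0, omega=10.0):
    """Equilibrium occupancy of OR1 and OR2 by CI dimers (illustrative model).

    c      -- free CI dimer concentration (arbitrary units, hypothetical)
    k1, k2 -- dissociation constants of OR1 and OR2 (hypothetical values)
    omega  -- cooperativity factor; about 10 per the text, 1 = no interaction
    """
    # Statistical weights of the four possible promoter states.
    w_empty = 1.0
    w_or1 = c / k1                         # CI bound to OR1 only
    w_or2 = c / k2                         # CI bound to OR2 only
    w_both = omega * (c / k1) * (c / k2)   # both sites bound, dimers interacting
    z = w_empty + w_or1 + w_or2 + w_both   # partition function
    p_or1 = (w_or1 + w_both) / z
    p_or2 = (w_or2 + w_both) / z
    return p_or1, p_or2


for c in (0.1, 1.0, 10.0):
    _, coop = site_occupancy(c)
    _, indep = site_occupancy(c, omega=1.0)
    print(f"[CI2] = {c:5.1f}   OR2 occupancy: "
          f"cooperative = {coop:.2f}, independent = {indep:.2f}")
```

With omega set to 1 the two sites fill independently, so OR2 is occupied only at concentrations near its own dissociation constant. With omega near 10, OR2 fills at markedly lower CI concentrations once OR1 is occupied, which is the qualitative behavior described above.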
Remarkably, when bound to the OR2 site, the CI protein can interact with the σ-factor in the RNA polymerase holoenzyme and increase the rate of transcription from PRM. This CI-induced activation of cI transcription facilitates an increase in the concentration of the CI protein. Hence, a positive feedback loop ensures that sufficiently high levels of the CI protein are present to maintain the phage in the lysogenic state. This feedback structure causes the state to be remarkably stable, with fewer than 1 in 10 million lysogenic cells spontaneously switching to the lytic state per generation.

When present in high concentrations, the CI protein will bind to the low-affinity OR3 site and repress the transcription of its own gene. Binding of CI to OR3 causes the transcription from PRM to decrease. Efficient repression at high concentrations of CI is believed to be due to the formation of a CI tetramer between a CI dimer bound to OR3 and a CI dimer bound to the OL region, in addition to the CI tetramer linking OR1 and OR2 to OL. This negative feedback loop could ensure that the rate of transcription of the cI gene is never too high to respond to endogenous signals produced by the host cell.

The transition from the lysogenic to the lytic state occurs in response to damage to the DNA of the host cell. This switch is very robust, with an efficiency close to 100%. A central regulatory protein in the DNA-damage response system of E. coli, the SOS response system, is the RecA protein. This protein has protease activity and is able to cleave certain proteins in the presence of single-stranded DNA. The activation of the RecA co-protease is a key component of the SOS response system and is essential to the survival of the cell. The λ phage ingeniously exploits the SOS response to its own advantage, but with dire consequences for the host cell. The CI dimer contains a site that is recognized by RecA, and activation of RecA causes cleavage and inactivation of the CI dimers. This causes a decrease in the concentration of the CI dimers, and the OR1 and OR2 sites become vacant. This in turn diminishes the expression of cI from PRM and increases the expression of cro from PR.
Cro then binds to the sites within OR, which further represses cI expression and ensures a robust switching from the lysogenic (cI on) state to the lytic (cro on) state.

The Galactose Regulon in S. cerevisiae

The galactose utilization pathway in the yeast Saccharomyces cerevisiae is one of the best-studied eukaryotic gene regulatory circuits. Detailed investigations of one of the key regulatory proteins in this system, the dimeric transcriptional activator Gal4, have revealed many details of how transcription is regulated in eukaryotes. The Gal4 protein acts as a transcriptional activator for a large number of genes, including many of the genes required for the cell to metabolize galactose. Genes whose expression is regulated by the same set of transcriptional regulators are often said to belong to the same regulon.

The sequence upstream of the gal1 gene contains two regulatory regions: an upstream activating sequence (UASG) that contains four binding sites for the Gal4 protein, and a binding site for the transcriptional repressor Mig1 (Fig. 1.10A). The gal1 gene is subject to glucose (catabolite) repression. The Gal4 protein is only active when galactose is present, and the Mig1 protein is only active when glucose is present. The repression is stronger than the activation, and the expression of the genes in the galactose regulon diminishes when cells are grown in media containing both galactose and glucose.

Figure 1.10: (A) Regulatory elements affecting transcription of the gal1 gene. The region upstream of the gene contains a TATA-box, a binding site for the repressor protein Mig1 and an upstream activating sequence (UASG). The UASG contains four binding sites for the dimeric form of the Gal4 transcriptional activator. (B) Speculative spatial arrangement of SAGA, Mediator, TBP and RNA polymerase recruited directly or indirectly by Gal4. (C) Time-course of recruitment of components of the transcriptional apparatus following galactose induction (based on the experimental study by Bryant and Ptashne).

The Gal4 protein is active when cells are grown in the presence of galactose and in the absence of glucose. In a manner similar to CAP and λ CI, the Gal4 protein is modular, with an activation region that operates independently of a separate DNA-binding region. These domains can be detected by elimination of their corresponding DNA sequence in the gal4 gene. Removal of the DNA-binding domain (BD) gives a protein that fails to bind to the UASG. On the other hand, removal of the sequence that encodes the activating region, or activation domain (AD), gives a protein that can bind to the UASG, but fails to activate transcription. This property of Gal4 forms the basis of the two-hybrid technology for detection of protein-protein interactions (see section 2.2.2).

While a great deal is known about the Gal4 protein, its mode of action is far from clear. The binding of the Gal4 protein to the UASG appears not to depend on factors that modify the nucleosomes, such as the Gcn5 component of the histone acetylation complex Spt-Ada-Gcn5 acetyltransferase (SAGA). This is supported by the observation that a weakening of the Gal4 binding sites causes expression from the gal1 promoter to depend on the presence of Gcn5.
The interaction between the Gal4 protein and the UASG region is probably stronger than the interaction between the histones and the UASG region. However, even though Gal4 can bind to the DNA in the absence of SAGA, the latter is still required for transcription from the gal1 promoter.

Experiments done in vitro, i.e., in solutions outside the cellular environment, have shown that the Gal4 protein may interact with a number of components in the transcriptional machinery. These include RNA polymerase II, the transcription factors TFIIE, TFIIH and TBP, various components of the Mediator and components of SAGA. A recent experimental study by Bryant and Ptashne investigated the temporal order in which Gal4 recruits components of the transcriptional apparatus in vivo, i.e., in living cells. The experiments indicate that the SAGA and Mediator complexes are the direct targets of the Gal4 protein and that they are recruited to the promoter region independently of each other. As illustrated in Fig. 1.10C, the SAGA complex is recruited first, then the Mediator complex. RNA polymerase II, TBP, TFIIE, TFIIH and TFIIF are recruited last, and the temporal resolution of the experiments (0.5-1 minute) is too coarse to determine the order in which these components are recruited to the promoter. Other experiments have demonstrated that the binding of TBP is required for the binding of the polymerase holoenzyme. The Bryant-Ptashne experiment indicates that the polymerase (and the other factors required for transcription) is recruited very rapidly once SAGA, Mediator and TBP are bound to the DNA.

Figure 1.11: (A) Repression of Gal4 activation by the Gal80 protein in the absence of galactose. Gal80 binds to Gal4 near its activation domain and prevents the recruitment of SAGA and Mediator. The regulatory protein Gal3 is unable to bind Gal80. (B) Gal3 binds to Gal80 in the presence of galactose. Gal4 can now recruit SAGA and Mediator, causing the initiation of the cascade (Fig. 1.6) leading to transcription.

In addition to the UASG, the region of sequence that regulates gal1 expression contains a cis-regulatory element to which the transcriptional repressor protein Mig1 can bind. The Mig1 protein is a key regulatory factor in glucose repression, and its activity depends on the concentration of glucose in the growth medium.
As for the Gal4 protein, the Mig1 protein exerts its effect by recruiting complexes to the promoter region. The complex that is recruited to the promoter by the Mig1 protein consists of the components Ssn6 and Tup1. Artificial constructs in which either one of these components is fused to a DNA-binding domain (section 2.2.2) indicate that the Mig1 protein is not required for repression and simply acts to bring the Tup1 component into the appropriate position on the promoter. Tup1 may then recruit complexes that remove acetyl groups from the histones (histone deacetylases, or HDACs) to make the promoter region less accessible to the transcription apparatus. It has also been suggested that Tup1 interacts directly with components of the transcriptional apparatus and somehow interferes either with the assembly of a transcription pre-initiation complex or with transcription initiation.

Figure 1.12: Components of the positive feedback regulation of Gal4 activity. In the presence of galactose, Gal4 activity is increased through Gal3-Gal80 (Fig. 1.11), which increases the rate of Gal2-mediated galactose uptake and the concentration of Gal3. Gal80 expression is also up-regulated by Gal4. This negative feedback is relatively weak.

In addition to the cis-regulation exerted at the UASG and the Mig1 binding sites, the transcription of the gal1 gene is strongly influenced by interactions between trans-regulatory factors. When cells are grown in a medium that contains a non-repressive sugar, such as raffinose, the Gal4 protein occupies the four binding sites within the UASG of the gal1 promoter. However, in the absence of galactose, the Gal4 protein is unable to recruit components of the transcriptional apparatus to the vicinity of the promoter region. This repression is due to the protein encoded by the gal80 gene, which binds to the Gal4 protein at a site that partly overlaps with the activating domain. The repression is very efficient, and there is virtually no expression from the gal1 promoter when the Gal80 protein inhibits the Gal4 protein (Fig. 1.11A). This inhibition is released in the presence of galactose through a third regulatory protein encoded by the gal3 gene. The Gal3 protein, which becomes activated when galactose binds to it, binds to the Gal80 protein and causes either the dissociation of Gal80 from Gal4 or the movement of Gal80 away from the activating region of the Gal4 protein (Fig. 1.11B).

In a manner that is similar to the regulation of the lactose operon, the genes in the galactose regulon are organized in a circuit containing a positive feedback loop (Fig. 1.12). The expression of the galactose permease, which is encoded by the gal2 gene, and of the Gal3 protein are both activated by the Gal4 protein.
The removal of the repression of the Gal4 protein will therefore cause an increased rate of galactose uptake, an increased activity of the Gal3 protein and, subsequently, a further increase in the activity of Gal4. Interestingly, the Gal4 protein also increases the expression of the gal80 gene, though to a lesser extent than that of gal2 and gal3. The regulatory function of this negative feedback is unknown. One possibility is that it enables a rapid shutdown of the circuit when glucose is present in the growth medium and robust switching to glucose repression.

The regulation of the transcription of the yeast genes that are required to metabolize galactose is in many ways similar to that of the genes required to metabolize lactose in E. coli. In both cases, the expression of the relevant genes is stimulated when the alternative carbon source is present and suppressed if the preferred carbon source, glucose, is present.

However, the activating and inhibiting signals are mediated in different ways in the two systems. In the bacterial system, the alternative energy source lactose enhances transcription from Plac by suppression of the repressor LacR, and the presence of glucose is mediated through the suppression of the activator CAP. In yeast, the alternative energy source galactose enhances transcription from the gal1 promoter by stimulation of the activator Gal4, and the presence of glucose is mediated through the stimulation of the repressor Mig1. These differences may indicate a fundamental shift of gene regulatory mechanisms from a default "on" state in bacteria to a default "off" state in higher organisms.
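The contrast between the two systems can be made concrete with a small Boolean sketch. This is a deliberate idealization and not a model from the text: the function names are hypothetical, each regulator is treated as either fully active or fully inactive, and basal expression, feedback and the intermediate players (allolactose, cAMP, Gal3, Gal80) are ignored.

```python
from itertools import product


def lac_operon(lactose: bool, glucose: bool) -> dict:
    """E. coli: each sugar acts by switching a regulator OFF."""
    lacr_active = not lactose   # lactose (via allolactose) inactivates LacR
    cap_active = not glucose    # glucose suppresses the activator CAP
    return {
        "activator_active": cap_active,
        "repressor_active": lacr_active,
        "expressed": cap_active and not lacr_active,
    }


def gal1_promoter(galactose: bool, glucose: bool) -> dict:
    """S. cerevisiae: each sugar acts by switching a regulator ON."""
    gal4_active = galactose     # galactose stimulates the activator Gal4
    mig1_active = glucose       # glucose stimulates the repressor Mig1
    return {
        "activator_active": gal4_active,
        "repressor_active": mig1_active,
        "expressed": gal4_active and not mig1_active,  # repression dominates
    }


print("alt. sugar  glucose  lac expressed  gal1 expressed")
for alt_sugar, glucose in product((False, True), repeat=2):
    lac = lac_operon(alt_sugar, glucose)
    gal = gal1_promoter(alt_sugar, glucose)
    print(f"{alt_sugar!s:10}  {glucose!s:7}  "
          f"{lac['expressed']!s:13}  {gal['expressed']!s}")
```

In this idealization both circuits compute the same input-output relation (expression only when the alternative sugar is present and glucose is absent), but the regulators sit in opposite states: with no sugar present, both LacR and CAP are active in the bacterial circuit, whereas both Gal4 and Mig1 are inactive in the yeast circuit.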

Suggested Further Reading

Textbooks:
Alberts B. et al. The Molecular Biology of the Cell. Garland Science. New York, New York (2002).
Latchman D. Gene Regulation: A Eukaryotic Perspective. Stanley Thornes. Cheltenham, United Kingdom (1998).
Lewin B. Genes VII. Oxford University Press. Oxford, United Kingdom (2000).
Müller-Hill B. The lac Operon. de Gruyter. Berlin, Germany (1996).
Ptashne M. & Gann A. Genes & Signals. Cold Spring Harbor Laboratory Press. Cold Spring Harbor, New York (2002).
White R. J. Gene Transcription: Mechanisms and Control. Blackwell Science. Oxford, United Kingdom (2001).

Articles:
Bednar J. et al. Nucleosomes, linker DNA, and linker histone form a unique structural motif that directs the higher-order folding and compaction of chromatin. Proc. Natl. Acad. Sci. U. S. A. 95, (1998).
Bryant G. O. & Ptashne M. Independent recruitment in vivo by Gal4 of two complexes required for transcription. Mol. Cell 11, (2003).
Hochschild A. The λ switch: cI closes the gap in autoregulation. Curr. Biol. 12, R87-9 (2002).
Dieci G. & Sentenac A. Detours and shortcuts to transcription reinitiation. Trends Biochem. Sci. 28, (2003).
Orphanides G. & Reinberg D. A unified theory of gene expression. Cell 108, (2002).
Zhang X. et al. Mechanochemical ATPases and transcriptional activation. Mol. Microbiol. 45, (2002).


Tutorial Part 2
Engineered Gene Networks

Natural gene networks can be described as circuits of interconnected functional modules, each consisting of specialized interactions between proteins, DNA, RNA, and small molecules. The simplest element of a gene regulatory network consists of a promoter, the gene(s) expressed from that promoter, and the regulatory proteins (and their cognate DNA binding sites) that affect the expression of that gene. While there are several different ways by which regulatory proteins and small molecules can modulate gene expression, the regulation of the frequency of gene transcription is the most prevalent control instrument employed in natural gene circuits. The previous section discussed how the frequency of transcription is regulated in a number of natural systems: expression from the Plac promoter is modulated by the trans-acting proteins CAP and LacR, the PR/PRM promoters by CI and Cro, and numerous promoters in the galactose regulon by Gal4 and Mig1.

The frequency of transcription is also the parameter that can most easily be manipulated in the laboratory. The nucleotide sequences contained in cis-regulatory elements, in the promoter region(s) and in the untranslated regions of the mRNA transcript can be altered with relative ease to control the level of gene expression by altering the binding affinities of transcription factors and the various components of the transcriptional and translational apparatus. In addition, cis-regulatory elements that make transcription controllable by one set of transcription factors can be substituted with other cis-regulatory elements. This ability to mix and match cis- and trans-regulatory elements in living cells has allowed for the construction of artificial gene circuits with customizable properties and characteristics.

This part of the tutorial is intended to provide a brief introduction to the practical aspects of genetic network engineering. Aside from the obvious biotechnological and biomedical implications of this research, there are numerous applications to Systems Biology. Simple engineered expression systems provide a framework for the deduction and validation of the basic principles of transcriptional regulation and allow for the testing of the methodologies, discussed in Part 3, that are used to link the properties of molecules to systems-level functionality.

2.1 Some Tools of the Trade

The construction of artificial gene circuits is based on relatively recent advances in molecular biology technologies and the availability of DNA sequence data for the cis-regulatory elements and their corresponding transcription factor proteins. Some common tools from genetic engineering include restriction enzymes that are used to cleave DNA molecules at specific locations, DNA ligases that are used to glue DNA fragments together, and vectors that are used in vivo to express genes of interest or to amplify DNA sequences contained in the vector. Another frequently used technology is the polymerase chain reaction (PCR), which is used to amplify DNA sequences in vitro.

Cutting and Pasting DNA

One of the most basic and powerful molecular biology technologies is the ability to cut DNA at specific locations with restriction endonucleases and to paste DNA fragments back together with DNA ligase. Restriction endonucleases are enzymes that bind specific DNA sequences, typically 4 to 6 base pairs long, and cleave the DNA within or near the recognition sequence. Figure 2.1A illustrates how two commonly used restriction enzymes, EcoRI and EcoRV, both isolated from E. coli, cleave DNA sequences that contain the hexamers GAATTC and GATATC, respectively. Treatment of a purified DNA sample with EcoRI gives DNA fragments with complementary overhangs.
One fragment has a TTAA overhang on the 3'-5' strand, the other an AATT overhang on the 5'-3' strand. Treatment with EcoRV gives fragments that have blunt ends, as the enzyme cleaves the GATATC hexamer in the middle. Note that EcoRI and EcoRV cleave DNA sequences that differ only in the two central base pairs of the hexamer. In the sequence that is cleaved by EcoRI the central base pairs are AT. If the central base pairs were TA instead of AT, the sequence would be cleaved not by EcoRI but by EcoRV. There are currently hundreds of commercially available restriction enzymes that can cleave DNA sequences with a very high specificity and produce a variety of different overhangs.

DNA fragments obtained from a restriction enzyme digest can be separated using gel electrophoresis. DNA is a negatively charged molecule and fragments of different sizes migrate at different velocities through a gel when an external electrical field is applied. To separate DNA fragments, the sample is loaded onto a gel together with a dye and an electrical field is applied, typically for 30 to 90 minutes depending on the current, the density of the gel, and the size of the fragments that are being separated. The gel is then stained with a dye, ethidium bromide, that fluoresces brightly under ultraviolet light when bound to double-stranded DNA. The appropriate DNA fragments can then be excised from the gel with a razor blade and purified.

Once the desired DNA fragments have been obtained, they can be glued back together in a ligation reaction (Fig. 2.1B). In this reaction, a DNA ligase enzyme reestablishes the phosphodiester bond in the DNA backbone that was initially cleaved by the restriction enzyme.

Figure 2.1: (A) Cleavage of double-stranded DNA by the restriction enzymes EcoRI and EcoRV. EcoRI recognizes the sequence GAATTC and cleaves the phosphodiester bond between G and A to produce two DNA fragments with unpaired nucleotides (overhangs). EcoRV recognizes the sequence GATATC and cleaves the DNA in the middle of this sequence to produce blunt ends with no overhangs. (B) DNA ligase can be used to reestablish the phosphodiester bond and join fragments that have been cleaved by restriction enzymes.

Plasmid Vectors

Genetic engineering generally uses cloning vectors to carry out manipulations on DNA sequences and expression vectors to control the in vivo expression of genes of interest. Most engineered gene regulatory networks are constructed on vectors. Cloning and expression vectors are plasmids and are typically used to express a gene of interest at high levels in vivo, for instance during the manufacturing of enzymes, or to amplify a DNA sequence of interest. Since the plasmid is replicated each time the cell divides, large quantities of vector DNA (or protein) can be isolated from a cell population that has been allowed to grow to a high density.
In other words, once a DNA sequence has been successfully cloned into a plasmid vector, essentially unlimited quantities of the DNA can be obtained by isolation of the plasmid DNA from cell extracts.

In addition to the sequence of interest, vectors carry at least one origin of replication and a selective marker. As mentioned in section 1.2, the origin of replication typically allows for amplification of the plasmid in a rapidly growing host cell, such as E. coli, while the selective marker ensures that the host cell can only grow if it contains one or more copies of the plasmid. Selective markers in E. coli are typically genes that confer resistance to antibiotics, such as tetracycline, ampicillin and kanamycin. When the bacterium is grown in the presence of the antibiotic, only the cells that carry the appropriate resistance gene will be able to survive. The method of selection in yeast is typically an auxotrophic marker. The marker is a gene that has been deleted from the genome of the host cell, but is required for cell growth when certain nutrients, for instance specific amino acids, are absent from the growth medium.

Vectors can be inserted into host cells through a process called transformation. Transformation is typically carried out with cells that are made competent by treatment with various chemicals. These cells can accept foreign DNA when a mixture of DNA (typically 50 to 500 ng) and cells is subjected to a brief heat shock. The basic theory behind the transformation procedure is poorly understood. It is a very inefficient process and requires billions of cells to produce on the order of 10 to 1000 cells that can grow on a plate containing an appropriate mixture of nutrients and selective agents.

When the sequence that allows for plasmid replication is absent, the only way for the host cell to survive is to integrate the vector DNA into its chromosome. This can be done by a process known as homologous recombination. One method that is used to integrate DNA sequences into the chromosome of S. cerevisiae is to cut the vector with a restriction enzyme in a region of the vector where the DNA sequence is identical to a sequence within one of the yeast chromosomes.
This linear DNA molecule will replace the corresponding sequence within the chromosome when the cells are allowed to grow after the transformation. A similar technique can also be used to integrate DNA sequences into the E. coli chromosome, but the process is particularly efficient in S. cerevisiae.

A wide variety of cloning and expression vectors are commercially available. An example of a frequently used cloning vector is the pZ vector system developed by Lutz and Bujard for expression in E. coli (Fig. 2.2A). This vector system was constructed in such a way that each plasmid contains three modular regions that are flanked by unique restriction sites for the endonucleases AatII, XhoI, SacI and XbaI. Region (I) can be used to insert arbitrary sequences. Region (II) contains the origin of replication and region (III) contains the selective marker. The pZ system comes with different origins of replication (ColE1, p15A and pSC101) and different resistance markers (Fig. 2.2B). While the ColE1 origin allows a high number of plasmid copies in each cell (around 60), the replacement of the ColE1 sequence with the sequence for the p15A origin or a modified pSC101 origin lowers the plasmid copy number per cell to about 25 and 3-4, respectively.

The modular structure of the pZ system allows for the rapid exchange of different components. Any one of the three modules can be replaced with another in a matter of three to four days. First, cells that contain the vector(s) are grown for hours and the DNA is isolated and treated with restriction enzymes, and the desired vector and insert fragments are purified. These fragments are then ligated together and transformed into competent cells. After the transformation, the cell mixture is spread out on plates containing selective antibiotics. After a day or two, cells that are viable will have formed colonies that can be used to inoculate a batch culture. After additional hours of growth, vector DNA can be isolated and analyzed, for instance, by treating the vector DNA with restriction enzymes followed by gel electrophoresis to confirm that it contains fragments of the correct size.

Figure 2.2: (A) The pZ expression vectors contain three modular regions that carry sequences for, respectively, (I) the promoter/gene of interest, (II) an origin of replication and (III) a resistance marker. Region (I) contains a promoter (P/O) and a sequence that encodes a ribosome binding site (RBS). A gene of interest can be inserted between the restriction sites KpnI and XbaI. T1 and t0 refer to sequences that terminate transcription. (B) Examples of pZ expression vectors with different origins of replication and resistance markers. The engineered regulatory units are discussed later in this part. Reproduced from Lutz and Bujard without permission. © Oxford University Press.

Due to the relatively fast growth of E. coli, it is often desirable to perform the basic DNA sequence manipulations on a vector that can be propagated in E. coli and only insert the vector into the desired cell type once the correct vector has been obtained. Vectors that are used for such purposes typically contain two origins of replication and two selective markers in order to provide means of propagation and selection in the two different cell types. A shuttle vector system of this type that is frequently used in S. cerevisiae is the so-called pRS vectors developed by Sikorski and Hieter (Fig. 2.3).
The multipurpose pRS vectors contain a ColE1 high copy number bacterial origin of replication and a gene that confers ampicillin resistance to an E. coli host. The pRS vectors also carry an auxotrophic marker for selection in yeast (the his3 gene in Fig. 2.3B) and may contain a sequence (ARS/CEN) that allows the plasmid to replicate autonomously. Other features of the pRS system include an origin of replication from the f1 filamentous phage, a multiple cloning sequence (MCS) containing various restriction sites, and the sequence that encodes a variant of the LacZ protein. When a sequence that encodes a protein is inserted into the MCS using the appropriate restriction enzymes, it becomes fused to the sequence of the lacZ gene. If there is no transcriptional or translational stop signal between the two sequences, the result is a hybrid protein that may have all the properties of the protein of interest and the LacZ protein. Recall that lacZ encodes a β-galactosidase that in E. coli converts lactose into allolactose, the inducer of LacR. It can also cleave various artificial galactopyranosides to give a brightly colored reaction product. This can be used to quantify the expression of the hybrid protein in an enzymatic assay. A cell extract is mixed with an appropriate artificial galactopyranoside and the activity of LacZ can be measured by quantifying the absorbance of the sample at an appropriate wavelength.

Figure 2.3: (A) The multipurpose pRS vector contains a bacterial origin of replication (ColE1) and resistance marker (ampicillin), and an auxotrophic marker (HIS3). The ARS/CEN sequence allows the replication of the plasmid in yeast. If this sequence is absent, the yeast cell can only grow if the plasmid is integrated into the yeast chromosome. In addition, the pRS vector contains an origin of replication from the f1 phage, promoters for T3 and T7 phage polymerases, a multiple cloning sequence (MCS) and the lacZ gene. (B) The pESC vector is very similar to the members of the pRS vector family. It contains the bidirectional Gal1 and Gal10 promoters allowing for expression of two genes inserted into the multiple cloning sites MCS1 and MCS2. Modified without permission. © Stratagene Cloning Systems.

The pESC system shown in Fig. 2.3 is closely related to the pRS vectors, but with some noticeable differences. It contains the 2µ origin of replication that allows a very high number of plasmids per yeast cell.
It also contains the divergent Gal1/Gal10 promoters. This allows the simultaneous galactose-induced expression of two genes.

Extracting DNA Sequences

There are a number of methods that can be used to isolate specific DNA sequences. The most direct methods are purification of chromosomal or plasmid DNA from a cell extract and direct chemical synthesis. Custom-made single-stranded DNA, or oligonucleotides, can be obtained from commercial sources for a reasonable price when the sequence contains 100 nucleotides or less. Purification of DNA from plasmid or genomic DNA is usually the preferred option for longer fragments. The quantities obtained by isolation of genomic DNA are, however, quite low and the sequence of interest is embedded in the chromosomal DNA. It is usually desired to increase the yield of a specific region of the DNA by performing a polymerase chain reaction (PCR).

Figure 2.4: Illustration of one cycle in the polymerase chain reaction. Separation of the double-stranded DNA is followed by primer annealing, DNA polymerase binding and DNA synthesis.

The theory behind PCR is simple. In order to replicate the chromosomal DNA, the DNA polymerase synthesizes double-stranded DNA by adding single nucleotides through base-pairing to a single-stranded region of the parental DNA molecule. In a PCR reaction, this process is repeated multiple times in vitro and the quantity of DNA is doubled in each step. The PCR reaction relies on polymerases that can synthesize DNA at elevated temperature and on the requirement that DNA synthesis has to start from a region where the DNA is double-stranded. The region of the DNA that needs to be amplified is specified by two oligonucleotides, or primers, that are complementary to the sequences that flank the region of interest. One primer is designed to bind to the 3'-5' strand upstream of the region of interest while the other is designed to bind to the 5'-3' strand downstream of the region of interest (see Fig. 2.4).

The amplification of the region of interest is done by repeated cycles of DNA melting, primer annealing and primer extension. In the first step (Fig. 2.4A), the double-stranded DNA is separated into single-stranded chains at high temperature (typically 94 °C). The sample is then cooled to allow the single-stranded DNA to form a stable complex with the primers (Fig. 2.4B).
DNA synthesis is then initiated by increasing the temperature to a value where the polymerase works most efficiently (typically 72 °C). Once the polymerase has completed the synthesis of the region of interest, the sample is heated to melt the newly synthesized double-stranded DNA and the process is repeated.

Figure 2.5: Examples of sequence modifications by PCR. (A) A sequence of interest can be augmented with restriction sites appropriate for cloning by using primers that, in addition to the DNA binding sequence, contain a sequence recognized by a restriction enzyme (RS1). (B) Two-step replacement of a DNA sequence by PCR. The sequence to be inserted is carried on primer overhangs.

PCR is used in many different applications. For example, PCR is used to obtain DNA sequences that are flanked by restriction sites compatible with those in a desired cloning vector. This is done by using primers with overhangs that contain the recognition sequence for the restriction enzyme (Fig. 2.5A). Once the PCR amplification is complete, the PCR product is purified, cut with the appropriate restriction enzymes and ligated into the vector fragment that has been treated with the same restriction enzymes.

In addition to these and countless other applications, a similar method can also be used to introduce an arbitrary sequence into an existing sequence at an arbitrary location. One way this can be done is to perform two PCR reactions with a total of four primers (Fig. 2.5B). Two of the primers, A and B, are chosen to coincide with flanking regions in the parental DNA that contain appropriate restriction sites. The two remaining primers, C and D, are designed to bind to the parental DNA at positions flanking the region that needs to be replaced. These primers have overhangs that contain the sequence to be inserted and the sequence for a common restriction site. After the two PCR products AC and DB have been amplified, they are cut with the common restriction enzyme and ligated together. The ligated ACDB fragment can then be PCR amplified using the primers A and B and subsequently inserted into the vector using the restriction sites contained in the primers A and B.
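The use of primers with restriction-site overhangs, and the doubling of DNA in each cycle, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the template sequence, the annealing length of 8 nucleotides and the choice of EcoRI (GAATTC) and XbaI (TCTAGA) sites are hypothetical, the doubling assumes ideal efficiency in every cycle, and practical primer design (melting temperature, extra bases 5' of the site) is ignored.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (5'->3')."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def design_primers(template, start, end, anneal_len, fwd_site, rev_site):
    """Primers for template[start:end] with restriction sites on the 5' overhangs."""
    region = template.upper()[start:end]
    forward = fwd_site + region[:anneal_len]
    reverse = rev_site + reverse_complement(region[-anneal_len:])
    return forward, reverse


def pcr_product(template, start, end, fwd_site, rev_site):
    """Top strand of the product: the region flanked by the two added sites."""
    region = template.upper()[start:end]
    return fwd_site + region + reverse_complement(rev_site)


def copies_after(cycles, initial=1):
    """Idealized amplification: the number of copies doubles in each cycle."""
    return initial * 2 ** cycles


# Hypothetical template and sites, chosen for illustration only.
template = "AAAACCCCGGGGTTTTACGTACGTAAGGCCTT"
forward, reverse = design_primers(template, 4, 28, 8, "GAATTC", "TCTAGA")
product = pcr_product(template, 4, 28, "GAATTC", "TCTAGA")
print(forward, reverse, product, copies_after(30))
```

Because the reverse primer binds the opposite strand, its sequence appears in the product as a reverse complement; both sites used here are palindromic, so they read the same on either strand.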

2.2 Engineering Regulatory Modules

Genetic Switches in E. coli

Many genetic switches constructed to control gene expression are based on an architecture that mimics those found in natural bacterial operons (see section 1.4.1). Two commonly used switches were constructed by Lutz and Bujard (using the pZ vector system) by replacing the binding sites for the λ CI repressor with tetO and lacO operators in a modified P_L promoter. The TetR repressor/tetO-operator module is another example of a natural system that has found broad use in genetic engineering (see the section on genetic switches in S. cerevisiae and section 2.2.3). It is derived from a system that confers bacterial resistance to tetracycline-based antibiotics. Two genes, tetR and tetA, modulate tetracycline resistance in a manner similar to the functioning of the genes in the lactose operon. Specifically, in the absence of tetracycline (Tc), or analogues such as the non-toxic anhydrotetracycline (ATc) and doxycycline (Dox), the Tet repressor (TetR) binds to tet operator sites within the promoter controlling the expression of tetA. The TetA protein is an antiporter that is located in the cell membrane and exports tetracyclines from the cell. Binding of the antibiotic to the TetR repressor decreases its affinity for the tetO binding site, causing up-regulation of tetA expression and subsequent removal of tetracycline from the cell. The interactions between TetR, tetO and inducer have been extensively studied and these components are, together with the components of the LacR system, some of the best characterized systems in molecular biology.

The P_L-based, TetR and LacR repressible hybrid promoters, P_LtetO-1 and P_LlacO-1 (Fig. 2.6A), were obtained by direct chemical synthesis followed by insertion into the pZ vector using the XhoI and the AatII restriction sites. Expression from these promoters can be modulated over a broad range by tuning the amounts of the inducers in the growth medium (Fig. 2.6B).
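To make the idea of tuning expression with the inducer concentration concrete, the sketch below evaluates a Hill-type dose-response curve. The Hill form itself is a modeling assumption introduced here for illustration, and the parameter values (basal and maximal expression, half-maximal inducer concentration, Hill coefficient) are hypothetical; they are not fitted to the data in Fig. 2.6B.

```python
def induced_expression(inducer, basal, maximum, k_half, hill_n):
    """Hill-type dose-response: expression rises from basal to maximum
    as the inducer concentration increases (all units arbitrary)."""
    if inducer < 0:
        raise ValueError("inducer concentration must be non-negative")
    fraction = inducer ** hill_n / (k_half ** hill_n + inducer ** hill_n)
    return basal + (maximum - basal) * fraction


# Hypothetical parameters, chosen only to show a broad regulatory range.
params = dict(basal=1.0, maximum=1000.0, k_half=10.0, hill_n=2.0)

for dose in (0.0, 1.0, 10.0, 100.0, 1000.0):
    print(dose, round(induced_expression(dose, **params), 1))
```

In this picture the regulatory range of a promoter is the ratio of maximal to basal expression, and the inducer concentration selects a point between the two extremes.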
A third hybrid promoter in the pZ vector system, designated P_lac/ara-1, was constructed from a variant of the natural P_lac promoter by insertion of two additional lacO operators (Os and O1 in Fig. 2.6A), and by replacement of the CAP/cAMP binding site with sequences that are recognized by a transcriptional activator encoded by the araC gene. The AraC protein binds to its recognition sequence when the sugar arabinose is present in the growth medium and activates transcription by facilitating the recruitment of the RNA polymerase holoenzyme to the promoter.

The hybrid promoters P_LtetO-1 and P_LlacO-1 (as well as many other engineered expression systems not mentioned here) demonstrate that negative transcriptional regulation can be achieved by a relatively simple mechanism. When an appropriate DNA sequence is inserted within or near the promoter sequence, the binding of a repressor protein to this sequence can attenuate expression from an otherwise active promoter. This can be as simple as a competition in which the repressor and the polymerase cannot occupy the same space at the same time, i.e., steric hindrance. Specialized interactions, such as cooperative binding and DNA looping, are not required. They may, however, increase the efficiency of the switch (see section 3.3). Similarly, the binding site for the CAP/cAMP transcriptional activator near the P_lac promoter can be replaced with the binding site for the AraC transcriptional activator. This demonstrates that transcriptional activation does not require particularly sophisticated mechanisms to work. The binding of a protein that can interact with the RNA polymerase near the promoter appears to be sufficient to enable a more efficient binding of the polymerase holoenzyme to the promoter.

Figure 2.6: (A) Promoters constructed by Lutz and Bujard by the replacement of λ CI binding sites in the P_L promoter with tetO operators (P_LtetO-1) and lacO operators (P_LlacO-1). The P_lac/ara-1 promoter was constructed by replacing the CAP binding site with a binding site for the AraC transcriptional activator. (B) Modulation of expression from the engineered promoters by addition of the inducers ATc (P_LtetO-1), IPTG (P_LlacO-1), and arabinose and IPTG (P_lac/ara-1). Modified without permission from Lutz and Bujard. © Oxford University Press.

Genetic Switches in S. cerevisiae

As mentioned in section 1.4.3, regulation of eukaryotic gene expression contrasts with that of most prokaryotic genes, as eukaryotic genes are generally in a silenced state. It is common for eukaryotic genes to require some mechanism of activation before they can be expressed. This has been exploited in a relatively large number of engineered expression systems and has led to the development of technologies such as yeast two-hybrid for the detection of protein-protein interactions.

The theory behind the two-hybrid technology (Fig. 2.7) relies on the ability of Gal4 to activate expression from promoters that contain the Gal4 upstream activating sequence (UAS_G). However, rather than having the DNA binding domain (BD) and the activation domain (AD) located on one protein (Gal4), the DNA sequences that encode

these two domains are fused to sequences of two different proteins. As a result, the cell will express two hybrid proteins: one, the bait, containing the Gal4 BD, and the other, the prey, containing the AD. Since the bait contains the BD, it will associate with the UAS_G, but will not activate transcription since this hybrid protein lacks the ability to recruit the transcriptional apparatus to the promoter. However, if the bait is able to interact with the prey, the association between the two hybrid proteins brings the AD fused to the prey to the vicinity of the promoter, where it activates transcription by facilitating the assembly of the transcriptional apparatus. In other words, transcription will only occur when the two proteins interact, and the level of expression will be correlated with the strength of the protein-protein interaction.

Figure 2.7: Basic yeast two-hybrid system. The sequences for the activation domain (AD) and the DNA binding domain (BD) from the gal4 gene are fused to two different protein-encoding sequences to give two different hybrid proteins, the bait and the prey. The bait binds to the DNA at the Gal4 UAS. If the bait and the prey can bind to each other, the AD fused to the prey can recruit the transcriptional apparatus to the promoter and the expression of a reporter gene can be detected.

The success of two-hybrid expression systems demonstrates an important principle in transcriptional regulation. The sequences of the activation and binding domains from Gal4 correspond to only about 10% of the sequence of the native gal4 gene. In other words, there is nothing particularly special about the full-length gal4 gene that enables its protein product to act as a transcriptional activator. The activation of transcription seems only to require that the activation domain is recruited to the vicinity of the promoter.
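The recruitment principle behind the two-hybrid system can be written as a small piece of logic. The Python sketch below is an idealized, all-or-none illustration; the class and function names are hypothetical, and graded effects such as the correlation between expression level and interaction strength are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hybrid:
    name: str
    has_binding_domain: bool     # Gal4 BD: tethers the protein to the UAS
    has_activation_domain: bool  # Gal4 AD: recruits the transcriptional apparatus


def reporter_transcribed(bait: Hybrid, prey: Hybrid, interact: bool) -> bool:
    """True if an activation domain is brought to the vicinity of the promoter."""
    if not bait.has_binding_domain:
        return False  # nothing is tethered at the UAS
    if bait.has_activation_domain:
        return True   # e.g. intact Gal4, which carries both domains
    # Otherwise the AD must arrive through the bait-prey interaction.
    return interact and prey.has_activation_domain


bait = Hybrid("bait (protein X fused to BD)", True, False)
prey = Hybrid("prey (protein Y fused to AD)", False, True)

for interact in (False, True):
    print("X binds Y:", interact, "-> reporter on:",
          reporter_transcribed(bait, prey, interact))
```

The reporter therefore behaves like a logical AND of DNA binding by the bait and bait-prey association, which is why its expression can be used to detect protein-protein interactions.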
A remarkable example of this is a light-switchable expression system developed by Shimizu-Sato et al. This engineered two-hybrid system exploits the ability of certain plant photoreceptors (phytochromes, Phy) to change reversibly from one form, Pr, to another, Pfr, in response to light signals, and the ability of a second protein, PIF3, to associate only with the Phy(Pfr) form of the photoreceptor. Absorption of red light by Phy(Pr) causes the protein to be converted into the Phy(Pfr) form, and Phy(Pfr) can be converted back to the Phy(Pr) form when it absorbs far-red light. Hence, the strength of the interaction between Phy and PIF3 depends on the light signal. Based on the light-dependent binding of PIF3 to Phy, Shimizu-Sato et al. constructed the light-switchable yeast two-hybrid expression system illustrated in Fig. 2.8A

by fusing the Gal4 BD to the Phy protein and the AD to the PIF3 protein. The Phy-BD hybrid protein binds to a UAS sequence and the interaction between the Pfr form of Phy-BD and PIF3-AD is sufficient to activate the transcription of either a histidine (HIS) auxotrophic marker or the lacZ reporter gene. Fig. 2.8B shows an example of light-controlled expression of the auxotrophic marker. Cells were spread onto plates lacking histidine. One plate was grown in darkness while the other was grown in red light. Colonies were only observed on the latter, demonstrating that the exposure to red light enables the expression of the auxotrophic marker.

Figure 2.8: (A) Light-switchable two-hybrid system. The activation domain carried on the PIF3-GAD hybrid protein is only recruited to the promoter region when the Phy-GBD hybrid protein is activated by red light (the Pfr conformation). (B) Red light activates the expression of a histidine auxotrophic marker and colonies can form on histidine selective plates when exposed to red light.

The ability to regulate transcription in yeast is not associated with any specific properties of the activation and DNA binding domains of Gal4. For example, a yeast two-hybrid system that involves entirely prokaryotic components (the DNA binding domain from the LexA protein, its operator, and the activation domain from the B42 protein) is available commercially. Another example of a prokaryotic repressor/operator system that has been exploited to control gene expression in yeast, and in higher eukaryotes (section 2.2.3), is the TetR/tetO system discussed above. Two TetR-based one-hybrid yeast expression systems developed by Herrero and co-workers are illustrated in Fig. 2.9. These systems were originally constructed by Gossen et al. and are today used widely to regulate gene expression in higher eukaryotes (see section 2.2.3). In Fig. 2.9A, the TetR protein is fused to the activation domain of the protein VP16 from the Herpes simplex virus. In the absence of TetR inducers, the TetR-VP16 hybrid protein, or tetracycline controlled transactivator (tTA), can bind to tetO operator sequences inserted at positions near a cytomegalovirus promoter (P_CMV) in such a way that the VP16 activation domain can interact with, and recruit, the transcriptional apparatus. The strength of the interaction between the TetR binding domain and the tetO binding site is reduced by the addition of tetracycline. Increased concentrations of the inducer decrease the rate of transcription as the probability that the VP16 AD is in the vicinity of the promoter is decreased.

Figure 2.9: (A) The tetracycline controlled TetR-VP16 transactivator (tTA) system developed by Gossen and Bujard and adapted to yeast by Herrero et al. (B) Tetracycline controlled transcriptional silencing (tTS) system based on a TetR-Ssn6 hybrid protein.

The second TetR-based yeast expression system (Fig. 2.9B) makes use of a hybrid protein composed of TetR and Ssn6 or Tup1. In contrast to the yeast one- and two-hybrid systems, the TetR-Ssn6/Tup1 system relies on promoter silencing rather than promoter activation. Recall that Ssn6 and Tup1 are components of the machinery that is recruited to the Gal1 promoter by Mig1 to downregulate expression of the gal1 gene in the presence of glucose. In the TetR-Ssn6 hybrid system, the interaction between TetR and its tetO DNA binding site serves the same role in the regulation of transcription as Mig1; it recruits transcriptional regulators that interact with the nucleosomes and/or the RNA polymerase holoenzyme. Similarly to the TetR-VP16 hybrid protein, the TetR-Ssn6/Tup1 hybrid protein, or tetracycline controlled transcriptional silencer (tTS), has a reduced affinity for the tetO binding sites in the presence of inducer. In other words, the rate of transcription is increased when inducer is added to the growth medium.

It would appear that transcriptional regulation in eukaryotes is significantly more complicated than the corresponding process in prokaryotes. Regulation in prokaryotes is in many cases a one-step process; a transcriptional activator such as CAP or AraC binds near the promoter and increases the rate of transcription by interacting with the RNA polymerase holoenzyme. Transcriptional repressors may work simply by reducing the accessibility of the promoter through steric hindrance. In all the examples above, the regulation is indirect and mimics that of many genes in the galactose regulon

(section 1.4.3). However, it is possible to engineer eukaryotic promoters where expression is attenuated in a manner that resembles the simple regulation found in prokaryotes. Such a system was constructed by Blake et al. by inserting tandem tetO operators downstream of the TATA box in the Gal1 promoter (Fig. 2.10A). Transcription from this promoter is activated by Gal4 when galactose is present in the growth medium and the expression level can be attenuated by the addition of ATc. Figure 2.10B shows the expression of a yeast-enhanced variant of the green fluorescent protein (yEGFP) when the concentration of galactose is varied at full induction with ATc (500 ng/ml) and when the concentration of ATc is varied at full induction with galactose (2% w/v). Although the mechanism of repression is relatively simple compared to the switches that are based on hybrid proteins, the TetR switch works remarkably well. There is a low basal level of transcription in the absence of ATc and the dynamic range matches that of the natural Gal1 promoter. In the next section it will be discussed how the LacR repressor can, in a similar way, be used to regulate transcription in mammalian cells.

Figure 2.10: (A) Transcriptional repression of the yeast Gal1 promoter by steric hindrance by the TetR repressor. In the absence of ATc, the TetR repressor binds to the DNA and prevents polymerase binding to the promoter. (B) Correlation between expression level and induction with galactose at 500 ng/ml ATc and with ATc at 2% galactose. The engineered TetR switch works nearly as well as the natural system.

Mammalian Switches

An engineered expression system that is frequently used to regulate transcription in eukaryotic cells is the TetON/TetOFF system originally developed by Gossen, Bujard and co-workers.
The TetOFF system for mammalian transcription regulation works in the same way as the system that was adapted to regulate transcription in yeast; a fusion protein composed of the bacterial TetR repressor and the VP16 activation domain, the tetracycline controlled transactivator tTA, is capable of activating transcription from a promoter that contains multiple tetO binding sites in the absence of tetracycline. The TetON system uses a mutant variant of the hybrid protein, the reverse tetracycline controlled transactivator (rtTA), in which four altered amino acids in the TetR component cause the interaction with the tetO binding sites to require the presence of inducer. Adaptations of the rtTA system are also available for yeast. The yeast TetR-Ssn6 hybrid protein in Fig. 2.9 is an analogue of a tetracycline controlled transcriptional silencer (tTS) first developed for mammalian cells. This hybrid protein contains a fusion of TetR and the transcriptional silencing domain (SD) from the Kid-1 protein and can be used to attenuate transcription in the absence of tetracycline. Improved switching properties can be achieved by having tTS and rtTA present at the same time (Fig. 2.11B).

Figure 2.11: (A) The reverse tetracycline controlled transactivator (rtTA), composed of a mutated TetR protein and the VP16 transcriptional activator, binds to tetO operators in the presence of inducer. (B) Dual system in which a tetracycline controlled transcriptional silencer (tTS) prevents transcription in the absence of inducer and rtTA activates transcription in the presence of inducer.

In the same manner that the TetR repressor can be used directly to control the expression from yeast promoters, the LacR repressor can be used to control gene expression in higher eukaryotes by the insertion of lacO operator sequences at appropriate locations near a promoter. There are many examples of this reported in the literature. These switches operate in a manner that is similar to the regulation of the lactose operon in E. coli and the P_LlacO-1 system; the lacI gene is expressed at a constant rate, giving rise to a high level of tetrameric LacR repressor proteins that can interfere with transcription when bound to lacO operators within or near a promoter. A particularly remarkable example is the LacR/IPTG mediated control of pigmentation in the mouse reported by Cronin et al. In this system (Fig. 2.12A), three lacO DNA binding sites were inserted into the promoter region of the gene containing the coding (cDNA) sequence for the enzyme tyrosinase. Tyrosinase catalyzes the first step in melanin biosynthesis and its deletion or downregulation gives rise to an albino phenotype. Figure 2.12B shows an example of the phenotypic alterations that can be controlled by this simple gene regulatory system. The mouse embryo and nursing pup are fed IPTG (through the mother's milk) and the expression of tyrosinase causes the infant to be pigmented. When the administration of IPTG to the infant is discontinued, LacR represses the expression of the enzyme, causing the adult mouse to display an albino phenotype.

Figure 2.12: Regulation of transcription with LacR/IPTG in the mouse. (A) Three lacO sequences were inserted within a promoter that controls the expression of the sequence that encodes the enzyme tyrosinase, which is involved in pigmentation. Transcription is low when LacR is bound to the operators. (B) The effect of IPTG on the expression of tyrosinase and on pigmentation. Removal of IPTG causes downregulation of expression and the pigmented infant becomes an adult albino. Modified without permission from Cronin et al. © Cold Spring Harbor Laboratory Press.

2.3 Engineering Regulatory Circuits

The regulatory systems described in the previous section are composed of a single regulated promoter element, but are in fact small networks; the regulated promoter receives inputs from a biochemical reaction network composed of transcription factors and their inducers. The function of these networks is relatively simple. The inducers modulate transcription by up- or down-regulating the activity of the transcription factor proteins.
Because of this simplicity, it seems appropriate to classify these systems as input/output modules or regulatory motifs rather than networks. Input/output modules and regulatory motifs can be used as the foundation of more elaborate circuits that allow for more sophisticated control of gene expression. Several such systems have been developed, but only two, the bacterial toggle switch and the ring oscillator, will be discussed here. They serve as a demonstration that complex behavior and functionality can be achieved by combining simple elements.

Figure 2.13: (A) The toggle switch plasmid. R1 and P1 are either TetR and P_LtetO-1 (pIKE toggle) or λ CI and P_Ls1con (pTAK toggle). RBS1, rbsE and rbsB refer to different ribosome binding sites, and T1T2 to transcriptional terminators. The reporter is the GFPmut3 protein. (B) Bistability in the pTAK117 toggle. pTAK102 is an IPTG-inducible switch obtained by deletion of the cI gene. (C) Population distributions obtained by flow cytometry at different levels of induction in (B). Modified from Gardner et al. without permission. © Nature Publishing Group and Annual Reviews.

The toggle switch was constructed by Gardner et al. by combining two repressible promoter motifs in such a way that expression from one promoter prevents expression from the other (Fig. 2.13). It was implemented on high copy number plasmids (ColE1 origins of replication) in two versions (Fig. 2.13A), using either the LacR and TetR repressors (designated pIKE) or the LacR and λ CI repressors (designated pTAK). The tetR gene or the cI gene was expressed from a LacR-repressible promoter (P_trc-2), and the lacI gene was expressed from the P_LtetO-1 promoter (pIKE toggle) or from a modified P_L promoter (P_Ls1con) that is repressed by λ CI (pTAK toggle). The cI gene used in the pTAK system was a mutant version termed cI857, which produces a λ repressor protein that is inactivated at elevated temperatures. Transient pulses of IPTG or ATc (pIKE system), or of IPTG or high temperature (pTAK system), cause robust switching between states.
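The mutual-repression architecture described above can be illustrated with a minimal numerical sketch. It assumes a generic dimensionless two-repressor model in which each repressor is synthesized at a rate that decreases with the level of the other and is lost at unit rate. The function name, the synthesis rate `alpha`, the cooperativity exponent `n` and the Euler step size are illustrative assumptions, not values taken from the constructs discussed here.

```python
def simulate_toggle(u0, v0, alpha=10.0, n=2.0, dt=0.01, steps=5000):
    """Forward-Euler integration of a hypothetical mutual-repression model:
        du/dt = alpha / (1 + v**n) - u
        dv/dt = alpha / (1 + u**n) - v
    u and v are dimensionless levels of the two repressors (assumed model).
    """
    u, v = u0, v0
    for _ in range(steps):
        du = alpha / (1.0 + v**n) - u   # synthesis of u is repressed by v
        dv = alpha / (1.0 + u**n) - v   # synthesis of v is repressed by u
        u += dt * du
        v += dt * dv
    return u, v

# Two different starting points settle into two different stable states.
state_a = simulate_toggle(u0=5.0, v0=1.0)   # ends with u high and v low
state_b = simulate_toggle(u0=1.0, v0=5.0)   # ends with v high and u low
print(state_a, state_b)
```

With these assumed parameters the symmetric state (u = v) is unstable, so whichever repressor starts ahead ends up dominating. This is the bistability that a transient pulse of inducer exploits to flip the switch.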
Figure 2.13B illustrates induction with IPTG of one variant of the toggle, pTAK117. The experiment was started in the LacR (low fluorescence) and λ CI (high fluorescence) states by growing the cells at elevated temperature and in the presence of IPTG, respectively. These states are stable when IPTG is removed or the temperature is lowered. When IPTG is added to cells that express LacR at high levels (Fig. 2.13B), the fluorescence increases sharply at a critical point corresponding to a saddle-node bifurcation. This sharp transition contrasts with the smooth induction curve obtained from an IPTG-inducible switch (pTAK102) in which the λ CI component is eliminated. Figure 2.13C shows population distributions obtained by flow cytometric measurements of single-cell fluorescence just below, at and just above the critical IPTG concentration.

Figure 2.14: (A) The pZ vectors used by Elowitz and Leibler to construct the ring oscillator. (B) Example of oscillations in fluorescence in a single cell measured by microscopy. Modified without permission. © Nature Publishing Group.

The ring oscillator was constructed by Elowitz and Leibler by insertion of three repressible motifs on a pZ expression vector. In this system, the TetR repressor is expressed from the P_LlacO-1 promoter, the Lac repressor from P_R and the λ repressor from the P_LtetO-1 promoter, thus constituting a closed ring with negative feedback to the previous module, as illustrated in Fig. 2.14A. To obtain a shorter oscillation period, the oscillator was constructed using variants of the repressor proteins (denoted "lite"). These proteins are tagged with an amino acid sequence that is recognized by a protein-degradation pathway and are constructed by fusing the sequence for the repressor proteins with the DNA sequence that encodes the recognition tag. The state of the network was monitored by co-transformation of the low copy number pZ vector carrying the oscillator (SC101 origin, ampicillin resistance) and a high copy number pZ vector (ColE1 origin, kanamycin resistance) on which the P_LtetO-1 promoter regulates the expression of a gene, gfp-aav, that encodes a short-lived variant of the GFP protein. In agreement with model predictions (see section 3.5), cells that carry this engineered network are capable of sinusoidal oscillation with a period of approximately 2.5 hours (Fig. 2.14B). The ring oscillator construct behaves somewhat erratically, however: cells oscillate without phase coherence, and oscillations with a period of 160 ± 40 minutes are observed in only 40% of the cells. The reasons for this are not well understood, but the lack of phase coherence could in part be due to fluctuations in the low number of molecules per cell. A more robust relaxation oscillator has been constructed by Atkinson et al.
by augmenting a chromosomally integrated natural positive feedback system with a negative LacR feedback. This system shows damped, but coherent, oscillations.

2.4 How Transcriptional Regulation Works

It should be clear from the previous sections that the regulation of transcription is in many cases based on relatively simple interactions between transcription factor proteins and their corresponding DNA binding sites. In prokaryotes, transcriptional repressors may affect transcription simply by being present near or within the promoter, where they compete with the RNA polymerase for promoter access. As demonstrated by the engineered prokaryotic and eukaryotic promoters in which repressor binding sites are appropriately inserted, more elaborate mechanisms, such as long-range interactions mediated through DNA looping, are not strictly required, but may enhance the performance of the switch. Prokaryotic transcriptional activators like CAP and AraC appear to increase the probability that the RNA polymerase binds to the promoter by interacting directly with components of the holoenzyme. Since the context of transcriptional activation can be changed simply by substituting activator binding sites, it seems that there is nothing sophisticated about the mechanism by which these activators work. Similarly, the success of yeast one- and two-hybrid systems testifies to the relative simplicity of eukaryotic transcriptional regulation. All that is required for activation domains or silencing domains to work is the appropriate positioning of these elements in the vicinity of a promoter. This positioning is in turn determined by the interactions between protein domains or between DNA and protein DNA-binding domains. It would thus appear that simple binding reactions are at the core of many transcriptional regulatory mechanisms. In the next part, we will discuss how the laws of chemistry that govern such interactions can be used to transform qualitative models of molecular interactions into quantitative, systems-level models.

Suggested Further Reading

Textbooks:
Burke D. et al. Methods in yeast genetics. Cold Spring Harbor Laboratory Press (2000).
Nicholl D. S. T. An introduction to genetic engineering. Cambridge University Press (1994).
Sambrook & Russell. Molecular cloning. A laboratory manual. Cold Spring Harbor Laboratory Press, 3rd Edition (2001).

Articles:
Belli G., Gari E., Piedrafita L., Aldea M. & Herrero E. An activator/repressor dual system allows tight tetracycline-regulated gene expression in budding yeast. Nucleic Acids Res. 26, (1998).
Blake W. J., Kærn M., Cantor C. R. & Collins J. J. Noise in eukaryotic gene expression. Nature 422, (2003).
Chien C. T., Bartel P. L., Sternglanz R. & Fields S. The two-hybrid system: a method to identify and clone genes for proteins that interact with a protein of interest. Proc. Natl. Acad. Sci. U.S.A. 88, (1991).
Cronin C. A., Gluba W. & Scrable H. The lac operator-repressor system is functional in the mouse. Genes Dev. 15, (2001).
Elowitz M. B. & Leibler S. A synthetic oscillatory network of transcriptional regulators. Nature 403, (2000).
Gardner T. S., Cantor C. R. & Collins J. J. Construction of a genetic toggle switch in Escherichia coli. Nature 403, (2000).
Gari E., Piedrafita L., Aldea M. & Herrero E. A set of vectors with a tetracycline-regulatable promoter system for modulated gene expression in Saccharomyces cerevisiae. Yeast 13, (1997).
Gossen M., Freundlieb S., Bender G., Muller G., Hillen W. & Bujard H. Transcriptional activation by tetracyclines in mammalian cells. Science 268, (1995).
Gossen M. & Bujard H. Tight control of gene expression in mammalian cells by tetracycline-responsive promoters. Proc. Natl. Acad. Sci. U.S.A. 89, (1992).
Lutz R. & Bujard H. Independent and tight regulation of transcriptional units in Escherichia coli via the LacR/O, the TetR/O and AraC/I1-I2 regulatory elements. Nucleic Acids Res. 15, (1997).
Shimizu-Sato S., Huq E., Tepperman J. M. & Quail P. H. A light-switchable gene promoter system. Nat. Biotechnol. 10, (2002).
Sikorski R. S. & Hieter P. A system of shuttle vectors and yeast host strains designed for efficient manipulation of DNA in Saccharomyces cerevisiae. Genetics 122, (1989).

Tutorial Part 3

Modeling Small Gene Networks

From the examples discussed in the previous parts of the tutorial it should be clear that the interaction between cis- and trans-regulatory elements is of utmost importance in determining when and how strongly a particular gene is expressed. Since this mode of regulation is one of the primary control mechanisms available to the cell, it is essential to understand the fundamental principles that relate the frequency of transcription to cis-regulatory dynamics. In this part of the tutorial, we discuss how quantitative models of gene expression can be obtained by combining principles of chemical reaction kinetics with qualitative knowledge of the molecular mechanisms underlying transcriptional regulation.

Biological systems are generally described on one of three levels. At the level of single molecules, the attractive and repulsive forces between individual atoms are modeled explicitly and the changes in their relative positions are simulated on very short time scales, typically on the order of femto- to nanoseconds ($10^{-15}$ to $10^{-9}$ s). At the level of individual cells, the time-averaged properties of molecules are used to model individual reaction events at the microscopic level. This is typically done in terms of stochastic birth-death processes in which molecules of a specific type are created or destroyed at random. In chemical systems, such descriptions are usually appropriate for modeling processes on the order of nano- to milliseconds ($10^{-9}$ to $10^{-3}$ s). At the highest level of description, macroscopic behavior is modeled using deterministic equations. This is often the most appropriate level of description of chemical systems on the order of milli- to kiloseconds ($10^{-3}$ to $10^{3}$ s) and beyond.
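To make the intermediate, stochastic level of description concrete, the following is a minimal sketch of a birth-death process simulated one reaction event at a time, in the style of a Gillespie-type stochastic simulation. The function name, the rate values and the random seed are hypothetical choices for illustration only. Molecules are assumed to be created at a constant rate and destroyed at a rate proportional to their number.

```python
import random

def simulate_birth_death(k_birth, k_death, t_end, n0=0, seed=1):
    """Event-by-event simulation of a birth-death process (illustrative sketch).

    Birth occurs at constant rate k_birth; each molecule is destroyed at rate
    k_death. Returns the final count and the time-averaged count over [0, t_end].
    """
    rng = random.Random(seed)
    t, n = 0.0, n0
    time_weighted_sum = 0.0
    while True:
        birth_rate = k_birth
        death_rate = k_death * n
        total_rate = birth_rate + death_rate
        wait = rng.expovariate(total_rate)       # waiting time to the next event
        if t + wait >= t_end:
            time_weighted_sum += n * (t_end - t)
            break
        time_weighted_sum += n * wait
        t += wait
        # Pick which event happens, with probability proportional to its rate.
        if rng.random() * total_rate < birth_rate:
            n += 1
        else:
            n -= 1
    return n, time_weighted_sum / t_end

final_n, mean_n = simulate_birth_death(k_birth=10.0, k_death=1.0, t_end=2000.0)
print(final_n, mean_n)   # the time average should lie close to k_birth / k_death
```

The individual trajectory fluctuates at random, while the long-time average approaches the value k_birth / k_death that a deterministic rate equation would predict for the same rates.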

Both stochastic and deterministic modeling serve as useful tools for the analysis of cellular system behavior. The choice of one over the other depends on factors such as the number of molecules involved, the time scale of the process of interest, and the degree of spatial mixing on that time scale. A deterministic model will typically not be appropriate if the system contains only a small number of molecules of a particular type, as is the case for most living cells. There, the probabilistic nature of individual reaction events and the deviations from the average may significantly alter or even dominate the system's behavior. As a result, the dynamics of a single cell is most appropriately captured using microscopic, stochastic models that describe the temporal evolution of biochemical networks of interacting molecules.

This, however, does not mean that macroscopic, deterministic models are irrelevant for the modeling of gene regulatory networks and other cellular systems. If the number of molecules in a single cell is described by a probability distribution with average $\langle n \rangle$ and variance $\sigma_n^2$, the central limit theorem tells us that the average over a population of $N$ cells will have a mean of $\langle n \rangle$ molecules per cell and a variance given by $\sigma_n^2/N$. If the population is large enough, the population variance becomes negligibly small and the dynamics of the population average will reflect the behavior of the majority of cells in the population. As a result, a macroscopic model will in many cases be an adequate description of the most probable behavior of an average cell and of the average behavior of a single cell over a long time period. There are, however, a number of situations in which fluctuations cannot be ignored and that need to be kept in mind.
These include, but are not limited to, (1) noise-induced transitions in which basin boundaries are crossed as a result of random fluctuations, (2) noise-induced shifts of critical points and (3) noise-induced bifurcations in which new attractors emerge solely as a consequence of the fluctuations. Unfortunately, time limitations prohibit a detailed discussion of these fascinating topics.

The modeling of chemical and biochemical reactions at the microscopic and the macroscopic levels usually involves a description of a process of interest in terms of elementary reactions. Elementary reactions describe individual reaction events at the molecular level, and it is not unusual for hundreds or even thousands of different molecules to be involved in a given process. To facilitate analysis and interpretation, the dimensionality of large-scale models can be reduced by identifying and eliminating reactions of marginal importance and by applying systematic simplification schemes, such as the quasi-steady state approximation. It is often the case that large networks of interacting chemical species can be broken down into smaller sets of subsystems that can be described by response functions, or transfer functions in engineering terms, that reflect how a subsystem changes its outputs as its inputs are varied. This part of the tutorial will focus on the estimation of response functions by applying the laws of chemistry to qualitative molecular-level descriptions.
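The population-averaging argument made earlier in this part, that the variance of the average over N cells falls as the single-cell variance divided by N, can be checked numerically. The sketch below is only an illustration under stated assumptions: single-cell molecule counts are drawn independently from a uniform distribution on the integers 0 to 20 (an arbitrary choice, not a biological model), and the function name, population count and seed are hypothetical.

```python
import random
import statistics

def variance_of_population_average(n_cells, n_populations=2000, seed=0):
    """Estimate the variance of the per-cell average across many populations.

    Each cell's molecule count is drawn uniformly from 0..20 (assumed
    distribution, chosen only because its variance is known exactly).
    """
    rng = random.Random(seed)
    averages = []
    for _ in range(n_populations):
        counts = [rng.randint(0, 20) for _ in range(n_cells)]
        averages.append(sum(counts) / n_cells)
    return statistics.pvariance(averages)

# Variance of a discrete uniform distribution on 21 consecutive integers.
single_cell_variance = (21**2 - 1) / 12.0

for n_cells in (1, 10, 100):
    observed = variance_of_population_average(n_cells)
    predicted = single_cell_variance / n_cells
    print(n_cells, round(observed, 3), round(predicted, 3))
```

The observed and predicted columns should agree to within sampling error, which is why a deterministic description of the population average becomes more reliable as the population grows.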

Figure 3.1: Binding of the repressor (LacR) to the operator O1 occurs in three steps: (1) dimerization of LacR monomers, (2) dimerization of LacR dimers, and (3) binding of the LacR tetramer to O1 to form the LacR-operator complex.

3.1 Biochemical Reaction Kinetics

The first step in the formulation of a quantitative model of gene regulation is to construct a qualitative diagram that shows the molecular interactions that are known to occur in the system of interest. In the best-case scenario, all of the individual steps are known and understood in detail at the molecular level. For example, consider the binding of the LacR tetramer to the main operator O1 located downstream of the P_lac promoter (see section 1.4.1). This reaction scheme can be described by the qualitative model illustrated in Fig. 3.1. The model contains three reversible reaction steps (numbered 1, 2 and 3) for a total of six chemical reactions: (1) two LacR monomers combine to form a LacR dimer, and a LacR dimer falls apart to form two LacR monomers; (2) two LacR dimers combine to form a LacR tetramer, and a LacR tetramer falls apart to form two LacR dimers; (3) a LacR tetramer binds to the lacO operator sequence to form a LacR-operator complex, and the LacR-operator complex falls apart to form a free LacR tetramer and the unoccupied O1 operator.

Chemical kinetics provides a theoretical foundation that can be used to transform a qualitative cartoon model of molecular interactions into a quantitative description. In the context of gene regulatory systems, a particularly important application is to estimate how changes in one or more input signals, e.g., the abundance of LacR monomers, change an output signal, e.g., the average fraction of lacO operators that are occupied.
Deriving the response function associated with a given biochemical input/output system usually involves a series of hierarchical steps: (1) the qualitative model is broken down into independent elementary reaction steps (or their equivalent), (2) the rate of each reaction is estimated by applying the law of mass action, (3) the individual reactions are assumed to have reached a quasi-steady state and (4) constraints such as mass conservation are then used to calculate the output of the system. These steps are ubiquitous in the modeling of biochemical reaction systems and are employed at both the macroscopic and the microscopic levels of description.
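As a rough illustration of the four steps above, applied to the LacR scheme of Fig. 3.1, the sketch below computes the fraction of occupied operators from the free monomer concentration. It rests on several assumptions that are not stated in the text: each reversible step is taken to be at equilibrium with an association constant (K1, K2, K3, each the ratio of the forward to the backward rate constant of that step), the free monomer concentration is treated as the input, and the amount of repressor bound to the operator is neglected. All names and values are hypothetical.

```python
def operator_occupancy(monomer, K1, K2, K3):
    """Fraction of O1 operators occupied by the LacR tetramer (illustrative sketch).

    monomer    : free LacR monomer concentration (input signal)
    K1, K2, K3 : assumed association constants of steps 1, 2 and 3
    """
    dimer = K1 * monomer**2            # step 1 at equilibrium: [D] = K1 [M]^2
    tetramer = K2 * dimer**2           # step 2 at equilibrium: [T] = K2 [D]^2
    bound_over_free = K3 * tetramer    # step 3: [O1.T] / [O1] = K3 [T]
    # Operator conservation: [O1] + [O1.T] = total operator concentration.
    return bound_over_free / (1.0 + bound_over_free)

# With all constants set to 1 (arbitrary units), occupancy is x^4 / (1 + x^4).
for x in (0.5, 1.0, 2.0):
    print(x, operator_occupancy(x, K1=1.0, K2=1.0, K3=1.0))
```

Under these assumptions the response is sigmoidal in the monomer concentration, with an effective exponent of four that comes from the two successive dimerization steps.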

Elementary Reactions

A general description of the reaction between the reactants A, B, ... and their conversion into products P, Q, ... is given by:

\[ aA + bB + \ldots \xrightarrow{k_f} pP + qQ + \ldots \tag{3.1} \]

where the coefficients a, b, p and q are called the stoichiometric coefficients and the parameter $k_f$ is called the rate constant. The stoichiometric coefficients relate the number of reactant molecules consumed to the number of product molecules generated in a single reaction. The stoichiometric coefficients are usually chosen such that the total mass is conserved and the number of atoms contained in the reactants is the same as the total number of atoms contained in the products. For example, the six reactions involved in the binding of LacR to O1 in Fig. 3.1 can be described by the reaction equations:

\[
\begin{aligned}
2\,\mathrm{LacR} &\xrightarrow{k_{1a}} (\mathrm{LacR})_2, & (\mathrm{LacR})_2 &\xrightarrow{k_{1b}} 2\,\mathrm{LacR},\\
2\,(\mathrm{LacR})_2 &\xrightarrow{k_{2a}} (\mathrm{LacR})_4, & (\mathrm{LacR})_4 &\xrightarrow{k_{2b}} 2\,(\mathrm{LacR})_2,\\
(\mathrm{LacR})_4 + \mathrm{O1} &\xrightarrow{k_{3a}} \{\mathrm{O1}(\mathrm{LacR})_4\}, & \{\mathrm{O1}(\mathrm{LacR})_4\} &\xrightarrow{k_{3b}} (\mathrm{LacR})_4 + \mathrm{O1}.
\end{aligned}
\tag{3.2}
\]

The general reaction in Eq. 3.1 is said to be of order a and b with respect to the reactants A and B, respectively. The overall reaction order n is the sum of the reaction orders for all the reactants and equals n = a + b when A and B are the only reactants. In the reaction scheme in Eq. 3.2, the dimerization reaction (reaction 1a), the tetramerization reaction (reaction 2a) and the binding of the LacR tetramer to O1 (reaction 3a) are second order reactions. The dissociation reactions of the LacR dimer (reaction 1b), the LacR tetramer (reaction 2b) and the repressor-operator complex (reaction 3b) are first order reactions because they involve only a single reactant. The reactant molecules perform a thermally driven random walk in the intracellular environment and must collide with each other to have any chance of being converted into the reaction products.
Equation 3.1 describes an elementary reaction if it corresponds to a reaction that takes place as a result of a single molecular encounter in which a total of n reactant molecules come together at the same instant in time. The probability that n different molecules in an nth order elementary reaction will find each other at the same instant becomes vanishingly small as n increases. A high overall reaction order is therefore a good indicator that the process is not an elementary reaction, but an overall reaction that involves a sequence of elementary reactions and intermediates.

Law of Mass Action

For an elementary reaction, the frequency of encounters between reactants generally depends on the number of reactant molecules per unit volume and on their average velocity. The probability that an encounter will occur is thus proportional to the concentration of the reactant molecules. The law of mass action reflects this basic principle and states that the rate, $v_f$, of the general reaction in Eq. 3.1 is given by:

\[ v_f = k_f [A]^a [B]^b \tag{3.3} \]

where square brackets indicate concentrations. The differential equations that describe how the concentrations change in time from some arbitrary initial concentrations are given by:

\[ \frac{d[A]}{dt} = -a\,k_f [A]^a [B]^b, \qquad \frac{d[P]}{dt} = +p\,k_f [A]^a [B]^b, \qquad \text{etc.} \tag{3.4} \]

In chemistry, concentrations are usually reported in molar, symbolized by M, with one molar corresponding to one mole (a quantity of molecules) per liter. The reaction rate is usually reported in terms of concentration change per time unit, e.g., M/s. The units of the rate constant $k_f$ are therefore $\mathrm{M}^{1-n}/\mathrm{s}$, where n is the reaction order. The concentrations of chemical species within living cells are typically in the range of 0.1 nM to 1 µM ($10^{-10}$ M to $10^{-6}$ M).

While an encounter between the reactant molecules is a prerequisite for their conversion into products, not all of the encounters will result in the completion of the reaction. Only a fraction of the molecular encounters will occur with the required relative orientation of the reactants, and only a fraction of these will be able to complete the reaction. These effects are incorporated into a proportionality factor, the rate constant. In the classical model of chemical reaction kinetics due to Arrhenius, the value of the rate constant is given by:

\[ k = A \exp\left(-\frac{E_a}{RT}\right), \tag{3.5} \]

where $E_a$ is the activation energy, T is the absolute temperature (in Kelvin), R is the gas constant, R = 8.314 joule mol$^{-1}$ Kelvin$^{-1}$ (or 1.987 calorie mol$^{-1}$ Kelvin$^{-1}$), and A is a constant. The Arrhenius model can be conceptualized as motion along a reaction path ξ where each point along the path is associated with a different energy (Fig. 3.2A).
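The mass action rate in Eq. 3.3 and the Arrhenius expression in Eq. 3.5 can be evaluated directly. The sketch below does so for the LacR dimerization reaction (reaction 1a) and for an arbitrary activation energy. The numerical values of the rate constant, the monomer concentration and the activation energy are hypothetical placeholders, chosen only to show the arithmetic and the units.

```python
import math

GAS_CONSTANT = 8.314  # J mol^-1 K^-1

def mass_action_rate(k, concentrations, orders):
    """Rate v = k * prod(c_i ** a_i) of an elementary reaction (Eq. 3.3)."""
    rate = k
    for conc, order in zip(concentrations, orders):
        rate *= conc ** order
    return rate

def arrhenius_rate_constant(prefactor, activation_energy, temperature):
    """Rate constant k = A * exp(-Ea / (R T)) (Eq. 3.5); Ea in J/mol, T in K."""
    return prefactor * math.exp(-activation_energy / (GAS_CONSTANT * temperature))

# Dimerization 2 LacR -> (LacR)2 is second order in the monomer.
k_1a = 1.0e7     # hypothetical rate constant, M^-1 s^-1
lacr = 1.0e-8    # hypothetical monomer concentration, M (10 nM)
v_dimerization = mass_action_rate(k_1a, [lacr], [2])   # units: M/s

# Lowering the barrier by R*T*ln(10) raises the rate constant tenfold.
T = 310.15                                             # 37 degrees Celsius in K
barrier_drop = GAS_CONSTANT * T * math.log(10.0)       # about 5.9 kJ/mol
ratio = (arrhenius_rate_constant(1.0, 50.0e3 - barrier_drop, T)
         / arrhenius_rate_constant(1.0, 50.0e3, T))
print(v_dimerization, barrier_drop, ratio)
```

The second calculation illustrates how sensitive the rate constant is to the height of the barrier: at physiological temperature, a change of roughly 6 kJ/mol in the activation energy corresponds to an order of magnitude in the rate constant.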
The correlation between energy and ξ defines an energy potential E(ξ), which, for simple reactions, has two minima and a single maximum. The minima are located at ξ = 0 and ξ = 1 and correspond to the energy of the reactants, $E_R$, and the energy of the products, $E_P$, respectively. The change in energy as ξ changes from zero to one is $\Delta E = E_P - E_R$. The maximum is located somewhere in between and corresponds to an energy barrier, $E^{\ddagger} = E_R + E_a$. When the reactants encounter each other, they must have sufficient (kinetic) energy to overcome the energy barrier in order for the reaction to be completed. The fraction of encounters that have the appropriate energy is given by a Boltzmann distribution and depends on the difference between the ground state energy $E_R$ and the energy barrier $E^{\ddagger}$. In addition to having sufficient excess energy, the molecules must encounter each other with the appropriate relative angles. This steric factor is taken into consideration by the pre-exponential factor A.

Figure 3.2: (A) Changes in internal energy E along the extent of reaction ξ. The energy of the reactants has to exceed $E^{\ddagger}$ in order to cross the activation energy barrier $E_a$. The internal energy E is replaced by the Gibbs free energy G in transition state theory. (B) A simple bimolecular substitution reaction between the reactants A and BC, illustrating the formation of an energy-rich intermediate ABC prior to the formation of the products AB and C.

Internal energy is in many cases not the most adequate measure of energy in biological processes. An alternative measure is the Gibbs free energy, G, which is defined by

\[ G = E + PV - TS = H - TS, \tag{3.6} \]

where E is the internal energy, P is the pressure, V is the volume, T is the absolute temperature, S is the entropy and H is the enthalpy, H = E + PV. The transition state theory of chemical kinetics is an extension of the Arrhenius theory in which the internal energy is replaced by the Gibbs free energy, i.e., the activation energy $E_a$ is replaced by the Gibbs free energy $\Delta G_a$ required to reach the transition state with maximal energy $G^{\ddagger}$, and the change in internal energy $\Delta E$ is replaced by the change in Gibbs free energy $\Delta G$. The values of $E_R$ and $E_P$ can be replaced by the total Gibbs free energies $G_R$ and $G_P$ (or equivalent measures relative to a standard state). The physical concept behind transition state theory is the same as that of the Arrhenius theory, and the two coincide when the pressure, the volume, the temperature and the entropy remain unchanged. In most biological systems, the temperature and the volume are unaffected by the completion of a given reaction. A simple illustrative example, a so-called SN2 reaction, is given in Fig. 3.2B.
Most chemical and biochemical reaction mechanisms are far more complicated than the single-step SN2 reaction and are associated with multi-dimensional potential surfaces with many peaks and valleys. The example nevertheless illuminates three general principles: (1) reactions require excess energy to rearrange molecular bonds and atoms, (2) the rate constant is correlated with the amount of excess energy that is required to reach the transition state and (3) the reaction rate can be modulated by imposing or by removing factors that exert a steric hindrance on the reacting molecules. Examples of (3) have already been discussed in Part I. The LacR repressor (and many other bacterial transcriptional repressors) may act to prevent transcription by preventing the RNA polymerase holoenzyme from interacting with the promoter. Similarly, histone modification and nucleosome remodeling may greatly alter the accessibility of the regulatory region of eukaryotic promoters.

Generalized Mass Action

The derivation of the reaction rate $v_f$ in Eq. 3.3 is based on an idealized model of how molecules interact in very dilute solution. It has been observed that reaction rates measured experimentally deviate from those predicted by applying the law of mass action. This could, of course, reflect a limited understanding of the molecular details of the reaction, but could also be due to the fundamental assumption that the reaction takes place at very low concentrations. It has been established experimentally that the effective concentration can, under numerous circumstances, be significantly different from the absolute concentration. This is particularly important for living cells because the interior of the cell contains a highly concentrated and inhomogeneous soup with few of the characteristics of very dilute aqueous solutions. This is one of the factors that makes it extremely dangerous to apply qualitative and quantitative measurements obtained from biochemical in vitro experiments to living systems. At best, a parameter value measured in vitro can be within an order of magnitude of its value in vivo. At worst, molecular interactions observed in vitro may not occur in vivo (or vice versa), and a quantitative model that is based on a qualitative model of the former may lead to a misinterpretation of how a regulatory mechanism operates within the living cell.
In many cases, the deviation from the idealized behavior can be accounted for by introducing the concept of activity. In essence, the activity of a chemical species is a measure of its effective concentration. In the simplest case, the activity $\tilde{a}_A$ of a chemical species A is given by $\tilde{a}_A = \gamma_A [A]$, where $\gamma_A$ is called the activity coefficient. Since the activity coefficients can, under this assumption, be absorbed into the rate constant, the rate equation in Eq. 3.3 remains valid. In the simplest nonlinear case, the activity and the absolute concentration are related by a power law, $\tilde{a}_A = \gamma_A [A]^{\zeta_A}$. Assuming that the activity coefficients do not change over the range of concentrations in question, the general rate of reaction can be written as:

\[ v_f = k_f\, \tilde{a}_A^{\,a}\, \tilde{a}_B^{\,b} = k_f\, \gamma_A^{a} \gamma_B^{b}\, [A]^{a\zeta_A} [B]^{b\zeta_B} = k_f'\, [A]^{\alpha} [B]^{\beta} \tag{3.7} \]

where the activity coefficients are incorporated into the effective rate constant $k_f' = k_f \gamma_A^{a} \gamma_B^{b}$ and the exponents α and β are given by $\alpha = a\zeta_A$ and $\beta = b\zeta_B$.

The rate equation in Eq. 3.7 has the form of a generalized mass action (GMA) description of the reaction. As the name implies, this representation is an extension of mass action kinetics in which the stoichiometric coefficients (positive integers) have been replaced by exponents that are positive reals. GMA can provide an adequate description of reaction rates under circumstances where mass action kinetics cannot be applied. This includes reactions that take place under conditions of restricted dimensionality, for instance, when molecules are bound to the DNA or embedded in a membrane, as well as in the complex intracellular environment of living cells. There is at present no theory that can, in general, predict the correlation between the exponents in the rate equation and the stoichiometric coefficients in a given environment. However, using the stoichiometric coefficients in the rate equations is usually a good place to start. An alternative and more mathematically tractable method for modeling genetic systems has been discussed by Hlavacek and Savageau.

Chemical Equilibrium

All chemical reactions are in principle reversible. If there is a finite probability that the transition state with energy $G^{\ddagger}$ can be reached from an initial state with energy $G_R$ (the state with pure reactants), there will also be a finite probability that the transition state can be reached from the initial state with energy $G_P$ (the state with pure products). Therefore, for any forward elementary reaction there will exist a backward reaction where the direction of the reaction arrow is reversed. For the general forward reaction in Eq. 3.1, the corresponding backward reaction is given by:

\[ pP + qQ + \ldots \xrightarrow{k_b} aA + bB + \ldots \tag{3.8} \]

From the law of mass action, the rate $v_b$ of this reaction is given by:

\[ v_b = k_b [P]^p [Q]^q. \tag{3.9} \]

There are, however, a number of cases where the backward reaction can be ignored without introducing a significant error and the forward reaction considered irreversible.
These include reactions where the decrease in Gibbs free energy is very large, enzyme-catalyzed reactions where only the forward rate constant is affected (see section 3.1.5), and reactions where the product is converted into something else immediately after it has been formed. An example of the latter is transcription, where the product of one reaction, i.e., the addition of one nucleotide to the RNA chain, is the reactant of a subsequent fast reaction, i.e., the addition of a second nucleotide to the RNA chain.

When a reaction cannot be considered irreversible, it will eventually settle into a state where there is no net change in the concentration of any of the components. In this equilibrium state, the forward and the backward reactions still take place, but their rates are equal to each other. In other words, in the equilibrium state (denoted with subscript eq) it is obtained that $v_f([A]_{eq}, [B]_{eq}, \ldots) = v_b([P]_{eq}, [Q]_{eq}, \ldots)$ such that:

\[ v_f - v_b = k_f [A]_{eq}^{a} [B]_{eq}^{b} - k_b [P]_{eq}^{p} [Q]_{eq}^{q} = 0 \tag{3.10} \]

It follows immediately from $v_f = v_b$ that:

$$k_f [A]^a_{\mathrm{eq}} [B]^b_{\mathrm{eq}} = k_b [P]^p_{\mathrm{eq}} [Q]^q_{\mathrm{eq}} \quad\Leftrightarrow\quad K \equiv \frac{k_f}{k_b} = \frac{[P]^p_{\mathrm{eq}} [Q]^q_{\mathrm{eq}}}{[A]^a_{\mathrm{eq}} [B]^b_{\mathrm{eq}}}. \tag{3.11}$$

Since $k_f$ and $k_b$ are constants, their ratio $K$ defines a relationship between the reactant and product concentrations that is independent of the initial concentrations. Regardless of the quantities in which the chemicals are initially mixed, the reaction will eventually settle into an equilibrium state where the relationship between product and reactant concentrations is uniquely defined by $K$.

Let us assume that reactants and products have been mixed together to achieve a certain set of initial concentrations. The direction in which a given reaction will progress can be determined by calculating the reaction quotient $Q$ defined by the relation

$$Q = \frac{[P]^p [Q]^q}{[A]^a [B]^b}. \tag{3.12}$$

When $Q < K$, the reaction will proceed in the direction that increases $Q$, meaning that more products will be formed. In the opposite case where $Q > K$, the reaction will proceed in the direction that decreases $Q$ and the products will be converted into reactants. Similarly, if the equilibrium state $Q = K$ is perturbed by addition of one of the products, the system response will be to reestablish the equilibrium state by decreasing the concentration of the products and increasing the concentration of the reactants. In general, the response of a system in equilibrium to an external perturbation will be to minimize the effect of that perturbation. This phenomenon is known as Le Chatelier's Principle or the Principle of Mass Action. When applied to the binding of LacR to its main operator (Fig. 3.1, step 3), the Principle of Mass Action has an intuitive and well-known consequence: the probability that the operator is occupied increases if the concentration of the LacR tetramer (or the LacR monomer) is increased. What is not widely appreciated, however, is that this response is a direct result of mass action kinetics.
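The comparison between $Q$ and $K$ can be illustrated with a short numerical sketch. The Python fragment below is only an illustration: the function names `reaction_quotient` and `direction`, the reaction A + B ⇌ P, and the rate constants are assumptions chosen for the example and do not come from the text.

```python
from math import prod


def reaction_quotient(reactants, products):
    """Reaction quotient Q (Eq. 3.12).

    Each argument is a list of (concentration, stoichiometric coefficient) pairs.
    """
    numerator = prod(conc ** coeff for conc, coeff in products)
    denominator = prod(conc ** coeff for conc, coeff in reactants)
    return numerator / denominator


def direction(q, k, rel_tol=1e-9):
    """Direction in which the reaction proceeds for a given Q and K."""
    if abs(q - k) <= rel_tol * k:
        return "equilibrium"
    return "forward" if q < k else "backward"


# Hypothetical reaction A + B <-> P with assumed rate constants.
k_f, k_b = 2.0, 0.5
K = k_f / k_b  # equilibrium constant, Eq. 3.11

# Equilibrium mixture: [A] = [B] = 1, [P] = 4 gives Q = K.
q_eq = reaction_quotient([(1.0, 1), (1.0, 1)], [(4.0, 1)])
# Perturb the equilibrium by adding product: Q > K, so the reaction runs backward.
q_perturbed = reaction_quotient([(1.0, 1), (1.0, 1)], [(8.0, 1)])

print(direction(q_eq, K), direction(q_perturbed, K))
```

Adding product to the equilibrium mixture raises $Q$ above $K$, so the sketch reports a backward reaction, which is the response described by Le Chatelier's Principle.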
The magnitude of the equilibrium constant depends on the difference in standard Gibbs free energy between the products and the reactants, $\Delta G^\circ$. The standard Gibbs free energy is a measure of Gibbs free energy relative to a standard state and can be calculated from tables obtained through extensive experimental measurements. The relationship between the equilibrium constant $K$ and the standard Gibbs free energy $\Delta G^\circ$ is given by the Gibbs-Helmholtz equation:

$$\Delta G^\circ = -RT \ln K. \tag{3.13}$$

The standard Gibbs free energy and the equilibrium constant are thus equivalent measures of the distribution of molecules at equilibrium and can be used interchangeably. At 37 °C, an equilibrium constant of 10 corresponds to $\Delta G^\circ = -5.9$ kJ/mol (-1.4 kcal/mol), while an equilibrium constant of 0.1 corresponds to a standard free energy of $\Delta G^\circ = +5.9$ kJ/mol.

When a system has been perturbed away from an equilibrium state, or has yet to reach it, the distance between the current state of the system ($Q$) and the equilibrium state ($K$) can be quantified by the change in free energy of the reaction, $\Delta G_r$. The reaction free energy is defined by:

$$\Delta G_r = \Delta G^\circ + RT \ln Q. \tag{3.14}$$

When the change in free energy is zero, $\Delta G_r = 0$, the system is in equilibrium and $Q = K$ (since then $RT \ln Q = -\Delta G^\circ = RT \ln K$). If $\Delta G_r < 0$, the value of the reaction quotient is lower than the equilibrium constant and the reaction will proceed in the forward direction in order to reach the equilibrium state. Conversely, the reaction will proceed in the backward direction when $\Delta G_r > 0$.

3.1.5 The Michaelis-Menten Reaction

Many biochemical reactions have very high activation energies and will not occur spontaneously at any measurable rate. In living cells, this problem is solved by the use of highly specialized enzymes that increase the rate constant by lowering the activation energy. The Michaelis-Menten reaction is a classic, molecular-level quantitative model of how some enzymes work. It describes the conversion of a substrate S into a product P in terms of three elementary reactions described by the reaction scheme:

$$E + S \underset{k_b}{\overset{k_f}{\rightleftharpoons}} ES \xrightarrow{k_c} P + E. \tag{3.15}$$

First, the enzyme E combines with the substrate to form an enzyme-substrate complex ES. When the enzyme does its job successfully, the enzyme-substrate complex decomposes into the product P and the free enzyme E is regenerated. In the case of an unsuccessful encounter, the enzyme-substrate complex simply dissociates into the original substrate and free enzyme.
Applying the law of mass action to the three elementary reactions gives four differential equations:

$$\begin{aligned}
\frac{d[S]}{dt} &= -k_f [E][S] + k_b [ES], \\
\frac{d[E]}{dt} &= -k_f [E][S] + (k_b + k_c)[ES], \\
\frac{d[ES]}{dt} &= k_f [E][S] - (k_b + k_c)[ES], \\
\frac{d[P]}{dt} &= k_c [ES].
\end{aligned} \tag{3.16}$$

Since the conversion of the enzyme-substrate complex is irreversible, the time-invariant nontrivial solution of Eq. 3.16 is referred to as a steady state. In contrast to a chemical equilibrium state, a steady state can only be maintained if there is a constant influx of fresh reaction substrates and a continuous removal of reaction products. It will for now be assumed that there is no influx of substrate and that there is no product when the reaction is started, $[P](t = 0) = 0$. It will further be assumed that the initial substrate concentration is given by $[S](t = 0) = S_0$ and that all of the enzyme is initially in the free form, i.e., $[E](t = 0) = E_0$, $[ES](t = 0) = 0$. In this case, the system will eventually reach the trivial time-invariant solution where $[S] = 0$ and $[P] = S_0$.

The first step in any modeling is to reduce the dimensionality of the problem considered. Inspection of Eq. 3.16 reveals two conservation relations:

$$\frac{d[E]}{dt} + \frac{d[ES]}{dt} = 0, \qquad \frac{d[S]}{dt} + \frac{d[ES]}{dt} + \frac{d[P]}{dt} = 0. \tag{3.17}$$

The conservation relations reflect the fact that the total enzyme concentration is constant ($[E] + [ES] = E_0$) and that a decrease in substrate concentration is coupled to corresponding increases in the concentrations of the enzyme-substrate complex and of the product ($[S] + [ES] + [P] = S_0$). The presence of the two conservation relations means that only two of the four differential equations in Eq. 3.16 are needed to fully describe how the concentrations of all of the species evolve as the reaction progresses. The reduced system is given by:

$$\begin{aligned}
\frac{d[S]}{dt} &= -k_f (E_0 - [ES])[S] + k_b [ES], \\
\frac{d[ES]}{dt} &= k_f (E_0 - [ES])[S] - (k_b + k_c)[ES],
\end{aligned} \tag{3.18}$$

with initial conditions

$$[S](t = 0) = S_0, \qquad [ES](t = 0) = 0. \tag{3.19}$$

The concentration of free enzyme can be substituted everywhere by $[E] = E_0 - [ES]$ and the product concentration can be calculated from $[P] = S_0 - [S] - [ES]$. The time-course of the Michaelis-Menten reaction is illustrated in Fig. 3.3A, which shows the concentrations $[S]/S_0$, $[ES]/E_0$ and $[P]/S_0$ plotted as functions of time.
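A time course of this kind can be reproduced by integrating the two-variable system for $[S]$ and $[ES]$ numerically. The following is a minimal sketch using a hand-written fourth-order Runge-Kutta step. The function names and all parameter values ($k_f = 10$, $k_b = k_c = 1$, $S_0 = 1$, $E_0 = 0.1$, arbitrary units) are assumptions made for illustration; they are not the values used to generate Fig. 3.3.

```python
def mm_rhs(s, es, kf, kb, kc, e0):
    """Right-hand side of the reduced Michaelis-Menten system for ([S], [ES])."""
    free_e = e0 - es  # conservation of enzyme: [E] = E0 - [ES]
    ds = -kf * free_e * s + kb * es
    des = kf * free_e * s - (kb + kc) * es
    return ds, des


def simulate_michaelis_menten(s0, e0, kf, kb, kc, t_end, dt):
    """Integrate with classical RK4; returns a list of (t, [S], [ES], [P]) tuples."""
    s, es = s0, 0.0  # initial conditions: all substrate free, no complex
    rows = [(0.0, s, es, 0.0)]
    n_steps = int(round(t_end / dt))
    for i in range(n_steps):
        a1, b1 = mm_rhs(s, es, kf, kb, kc, e0)
        a2, b2 = mm_rhs(s + 0.5 * dt * a1, es + 0.5 * dt * b1, kf, kb, kc, e0)
        a3, b3 = mm_rhs(s + 0.5 * dt * a2, es + 0.5 * dt * b2, kf, kb, kc, e0)
        a4, b4 = mm_rhs(s + dt * a3, es + dt * b3, kf, kb, kc, e0)
        s += dt * (a1 + 2 * a2 + 2 * a3 + a4) / 6.0
        es += dt * (b1 + 2 * b2 + 2 * b3 + b4) / 6.0
        # Product follows from conservation of substrate: [P] = S0 - [S] - [ES]
        rows.append(((i + 1) * dt, s, es, s0 - s - es))
    return rows


# Assumed, illustrative parameter values in arbitrary units.
trajectory = simulate_michaelis_menten(
    s0=1.0, e0=0.1, kf=10.0, kb=1.0, kc=1.0, t_end=100.0, dt=0.005
)
for t, s, es, p in trajectory[::4000]:
    print(f"t={t:6.1f}  [S]/S0={s:.4f}  [ES]/E0={es / 0.1:.4f}  [P]/S0={p:.4f}")
```

With these assumed values the complex builds up quickly, stays on a slowly varying plateau while substrate is abundant, and decays once the substrate is depleted, so that nearly all substrate ends up as product.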
Note that the substrate concentration changes very little initially ($t < 10$), that the enzyme-substrate concentration quickly reaches a fairly flat plateau and that $[ES]$ remains more or less constant for an extended time period (between $t \approx 0.1$ and 10). In this region, it can be assumed that $[S] \approx S_0$ and $d[ES]/dt \approx 0$. At later times, the substrate concentration decreases and the product accumulates. The time-course of the reaction can thus be separated into three distinct regions: a region where the concentration of the enzyme-substrate complex is rapidly increasing, a region where the enzyme-substrate complex concentration remains constant, and a region where the concentration of the enzyme-substrate complex decreases as the substrate is depleted. The second region corresponds to a situation where the enzyme-substrate complex can be assumed to be in a quasi-steady state defined by $d[ES]/dt = 0$. Once the enzyme-substrate complex has reached this state, the rate of product formation obeys the Michaelis-Menten rate equation:

$$v = \frac{v_{\max} [S]}{K_M + [S]}, \tag{3.20}$$

where $v_{\max}$ is the maximal rate of product formation and $K_M$ is called the Michaelis-Menten constant. These constants are defined by $v_{\max} = k_c E_0$ and $K_M = (k_b + k_c)/k_f$, respectively.

Figure 3.3: (A) Time course of a reaction catalyzed by a Michaelis-Menten type enzyme. (B) The dependence of the Michaelis-Menten rate equation on the substrate concentration. The inset shows the slow convergence to $v/v_{\max} = 1$.

The Michaelis-Menten rate equation can readily be derived using the following steps: (1) assume a quasi-steady state for the enzyme-substrate complex ($d[ES]/dt = 0$), (2) solve the resulting algebraic equation with respect to free enzyme ($k_f [E][S] = (k_b + k_c)[ES]$), (3) insert the resulting equation into the conservation relation for the total enzyme concentration ($E_0 = [E] + [ES]$) and (4) solve for the concentration of the enzyme-substrate complex in the quasi-steady state to obtain $[ES] = E_0 [S]/(K_M + [S])$. The rate of the reaction, i.e., the Michaelis-Menten rate equation, is then obtained from $v = k_c [ES]$. It can be shown rigorously that the quasi-steady state assumption introduces minimal error when $E_0 \ll K_M$.

The dependence of the rate of reaction on the substrate concentration is shown in Fig. 3.3B. The dependence of $v/v_{\max}$ on $[S]$ has the characteristic shape of a saturation curve. It describes the relative occupancy of the enzyme by the substrate, i.e., $[ES]/([E] + [ES])$, and is thus the response function associated with the input $[S]$ and the output $[ES]$. When data points can be fitted well to the response function for the Michaelis-Menten reaction, the parameter $K_M$ can be read directly from the plot of $f([S]) = v([S])/v_{\max}$ as the input signal that gives 50% response, i.e., the value $[S]_{0.5}$ where $f([S]) = 0.5$. Note that the Michaelis-Menten rate equation converges quite slowly as the substrate concentration is increased (Fig. 3.3B, inset). In fact, a Michaelis-Menten enzyme does not act as a very efficient switch. In order to change the response from 10% to 90% it is necessary to increase the input, i.e., the substrate concentration, 81-fold. The ratio of the input signals that produce 90% and 10% response is called the response coefficient and is denoted by $R_S$. The response coefficient for a Michaelis-Menten enzyme is $R_S = 81$ and the high value is primarily due to the low slope of the response function at $[S]_{0.5}$.

It is noted that the equilibrium constant for the binding of the substrate to the enzyme is given by $K_S = k_f/k_b$ and that the Michaelis-Menten constant is equal to $1/K_S$ when $k_c \ll k_b$. The assumption that $k_c \ll k_b$ together with $d[ES]/dt \approx 0$ is often referred to as the pre-equilibrium or the quasi-equilibrium approximation.

3.1.6 Hill-type Kinetics

It is often observed that response functions measured experimentally have slopes that exceed those predicted from the Michaelis-Menten reaction scheme. A canonical function that is frequently employed when the Michaelis-Menten rate equation fails is provided by the so-called Hill rate equation or Hill-type function. The most common form of this equation is given by:

$$h(x) = \frac{x^{n_H}}{K_H + x^{n_H}}, \tag{3.21}$$

where $n_H$ and $K_H$ are called the Hill coefficient and the Hill constant, respectively. Another frequently used form, which describes an inhibitory input signal, is given by:

$$\bar{h}(x) = 1 - h(x) = \frac{K_H}{K_H + x^{n_H}}. \tag{3.22}$$

The Hill constant is related to the input signal that gives 50% response ($x_{0.5}$) through the relationship $x_{0.5} = \sqrt[n_H]{K_H}$ and the Hill coefficient is related to the steepness of the response function. For $n_H = 1$, the Hill rate equation coincides with the Michaelis-Menten rate equation and the response coefficient is $R_S = 81$. For $n_H \gg 1$, the Hill rate equation approaches a Heaviside step function with a threshold at $x = x_{0.5}$ (see Fig. 3.4A). The response coefficient decreases dramatically as the Hill coefficient increases.
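The dependence of the response coefficient on the Hill coefficient can be checked with a few lines of Python. This is a sketch under the assumption that the response function has the form $h(x) = x^{n_H}/(K_H + x^{n_H})$; the function names are hypothetical. Inverting $h(x) = r$ gives $x = (K_H\, r/(1-r))^{1/n_H}$, so that the ratio of the inputs giving 90% and 10% response is $R_S = 81^{1/n_H}$, independent of $K_H$.

```python
def hill(x, n_h, k_h=1.0):
    """Hill-type response function h(x) = x^n / (K_H + x^n)."""
    return x ** n_h / (k_h + x ** n_h)


def input_for_response(level, n_h, k_h=1.0):
    """Input x that gives the response `level` (0 < level < 1), from inverting h(x)."""
    return (k_h * level / (1.0 - level)) ** (1.0 / n_h)


def response_coefficient(n_h, k_h=1.0):
    """Ratio of the inputs that give 90% and 10% response."""
    return input_for_response(0.9, n_h, k_h) / input_for_response(0.1, n_h, k_h)


for n_h in (1, 2, 4, 6):
    print(n_h, round(response_coefficient(n_h), 2))
# n_H = 1 gives about 81 (Michaelis-Menten), n_H = 2 gives about 9,
# and n_H = 6 gives about 2.08.
```

For $n_H = 2$ the ratio is 9, i.e., just under a tenfold increase of the input, and for $n_H = 6$ the input only needs to roughly double.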
When $n_H = 2$, less than a 10-fold increase is required to change the output from 10% to 90% ($R_S = 9$). When $n_H = 6$, the input signal only needs to double ($R_S \approx 2$).

Experimental data points, $r(y)$, obtained at different values of the input signal $y$, can often be fitted well to a Hill-type function to capture the essential behavior of the response function. The fitting procedure involves the construction of the so-called Hill plot in which the logarithm of the ratio $r(y)/(1 - r(y))$ is plotted as a function of $\log(y)$. It is here assumed that $r(y)$ is appropriately normalized. If this is not the case, the Hill plot is constructed from the logarithm of the ratio $r(y)/(r_{\max} - r(y))$ where $r_{\max}$ is the maximal value of $r(y)$ obtained for $y \gg y_{0.5}$. A linear fit to data points that are transformed in this way will give the Hill coefficient as the slope and the logarithm of the Hill constant as the negative intercept. This is illustrated in Fig. 3.4B where the experimental points $r(y)$ are generated by the function $h(x)$. Since $1 - h(x) = K_H/(K_H + x^{n_H})$, the plot will for an activating signal consist of lines with positive slopes defined by:

$$\log\left(\frac{h(x)}{1 - h(x)}\right) = \log\left(K_H^{-1} x^{n_H}\right) = -\log K_H + n_H \log x. \tag{3.23}$$

For an inhibitory signal, the Hill plot will consist of lines with negative slopes defined by:

$$\log\left(\frac{\bar{h}(x)}{1 - \bar{h}(x)}\right) = \log\left(\frac{K_H}{x^{n_H}}\right) = \log K_H - n_H \log x. \tag{3.24}$$

The Hill plot is usually constructed from measurements in the range of input values that give 10% to 90% response.

Figure 3.4: (A) Response functions generated by the Hill rate equation. (B) Hill plot of $\log\left[h(x)/(1 - h(x))\right]$ versus $\log(x)$ used to determine the Hill coefficient as the slope and the Hill constant from the intercept at $\log x = 0$.

The Hill-type response functions are generally used to model the relationships between an input and a response with a minimal number of unknown parameters and without the complexity of the underlying reaction kinetics. We shall see examples of this in later sections, where the correlation between the intracellular activity of transcription factors (the output) and the extracellular concentrations of their inducers (the input) is modeled in terms of Hill-type functions. However, in contrast to black-box models in which a response is fitted to a polynomial function with no clear molecular basis, models that are based on Hill-type functions can be considered as a type of gray-box phenomenological description that often, but not always, preserves some connection to the underlying reaction mechanisms.

The simplest molecular mechanism that gives rise to a Hill-type response function is a generalized mass action reaction scheme that resembles the Michaelis-Menten reaction:

$$E + n_H S \underset{k_b}{\overset{k_f}{\rightleftharpoons}} ES \xrightarrow{k_c} E + P, \tag{3.25}$$

where the stoichiometric coefficient of S is replaced by the real-valued exponent $n_H$. The response function in Eq. 3.21 can be obtained by applying the law of mass action to the reaction in Eq. 3.25 when it is assumed that $[E] + [ES] = E_0$ and that $d[ES]/dt \approx 0$ (see section 3.1.5), and is usually interpreted as the simultaneous binding of $n_H$ substrate molecules to the enzyme. However, when setting $k_c = 0$, the reaction scheme applies equally well to other types of reactions. For example, E, S and ES could also represent a DNA binding region, a transcription factor and a DNA-transcription factor complex, respectively, or a transcription factor, an inducer and a transcription factor-inducer complex, respectively.

3.2 Modeling Gene Expression

The expression of protein-encoding genes involves processes that by all measures are irreversible; the transcription of a gene into an mRNA and the translation of the mRNA into a protein are enzyme-catalyzed reactions that involve thousands of reactants and consume vast amounts of energy. In the simplest model of gene expression, mRNA is synthesized from nucleic acids and protein is synthesized from amino acids following the irreversible reaction scheme:

$$\text{nucleic acids} \xrightarrow{\kappa_M} \text{mRNA}, \qquad \text{amino acids} \xrightarrow{\kappa_P} \text{protein}, \tag{3.26}$$

where the rate constants $\kappa_M$ and $\kappa_P$ are pseudo-first order, i.e., have units of inverse time. Despite the complexity of these reactions at the molecular level, the process of gene expression can in many cases be described in terms of two ordinary differential equations that determine the temporal evolution of the number of mRNA molecules $n_M$ and of the number of protein molecules $n_P$:

$$\frac{dn_M}{dt} = \kappa_M n_D - k_{dM} n_M, \qquad \frac{dn_P}{dt} = \kappa_P n_M - k_{dP} n_P, \tag{3.27}$$

where $n_D$ is the average number of active promoters, and $k_{dM}$ and $k_{dP}$ are first-order decay constants associated with the half-life of the mRNA and the protein, respectively.
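The behavior of this two-equation model can be explored with a very simple numerical integration. The sketch below uses a forward Euler scheme and compares the result with the steady state obtained by setting both time derivatives to zero, $n_M^s = \kappa_M n_D / k_{dM}$ and $n_P^s = \kappa_P n_M^s / k_{dP}$. The function names and the parameter values are assumptions chosen for illustration; in particular the protein decay constant of 0.01 min$^{-1}$ is hypothetical.

```python
def steady_state(kappa_m, kappa_p, k_dm, k_dp, n_d=1.0):
    """Steady state of the model: both time derivatives set to zero."""
    n_m = kappa_m * n_d / k_dm
    n_p = kappa_p * n_m / k_dp
    return n_m, n_p


def simulate_expression(kappa_m, kappa_p, k_dm, k_dp, n_d=1.0, t_end=2000.0, dt=0.1):
    """Forward Euler integration starting from zero mRNA and zero protein."""
    n_m, n_p = 0.0, 0.0
    for _ in range(int(round(t_end / dt))):
        dn_m = kappa_m * n_d - k_dm * n_m  # synthesis minus decay of mRNA
        dn_p = kappa_p * n_m - k_dp * n_p  # translation minus decay of protein
        n_m += dt * dn_m
        n_p += dt * dn_p
    return n_m, n_p


# Assumed rates in 1/min; k_dp is a hypothetical value used only for this example.
params = dict(kappa_m=0.04, kappa_p=6.0, k_dm=0.03, k_dp=0.01)
print(simulate_expression(**params))
print(steady_state(**params))
```

With these values the mRNA level settles within a few mRNA lifetimes, while the protein level approaches its steady state on the much slower time scale set by $1/k_{dP}$.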
For gene regulatory systems in general, regulatory signals could alter the rate of transcription, the rate of translation as well as the half-lives of mRNA and protein.

Equation 3.27 is a coarse description of the immensely complicated processes involved in gene expression. However, experience has shown that it often is a simple and appropriate alternative to more complicated models that incorporate the molecular details of mRNA and protein synthesis and decay. Situations where it is not a suitable description include the modeling of processes that take place on a time scale that is comparable to the time scales of transcription and translation. Transcription in yeast occurs at a rate of 30 nucleotides/second, so it takes less than 1 minute for RNA polymerase II to transcribe a gene with 1400 nucleotides (the average length of genes in its target class). Translation occurs on approximately the same time scale. Depending on the process of interest, the time delay between the initiation of transcription or translation and the formation of the corresponding mRNA or protein product could have important implications. In this case, it may be sufficient to reformulate the ordinary differential equation in Eq. 3.27 as a delay differential equation.

While Eq. 3.27 is used frequently and with good results, there are additional approximations that need to be made when the model describes gene expression in cells that grow and divide. There are in this case two fundamental problems with Eq. 3.27: volume-dependent rates of reaction and partitioning of the cellular content between mother and daughter cells at cell division. The latter arises since a certain fraction of the content of the mother cell will be transferred to the daughter cell when it divides. The number of mRNA and protein molecules per cell will therefore oscillate with a period that is determined by the period of the cell division cycle and an amplitude that depends on the partition mechanism (see Fig. 3.5). The cellular concentrations of mRNA and protein will generally also oscillate.

The problem with volume-dependent reaction rates arises since the rate constants in Eq. 3.27 are associated with reactions that, at the very least, are second order. The probability that two (or more) molecules will interact depends not only on the number of molecules, but also on the volume in which the molecules are distributed. The rates of reactions that involve more than one molecule are therefore volume dependent.
Examples of reactions that have volume-dependent reaction rates include the binding of polymerases and transcription factors to the DNA, the binding of ribosomes to mRNA and the dimerization of proteins. A second order reaction between two molecules A and B has a rate of reaction $v_f = k_f [A][B]$ that is given in concentration units per time unit. This rate can be converted into a rate $v'_f$ that has units of number of molecules per time unit by multiplication with the cell volume $v(t)$:

$$v'_f = v_f\, v(t) = k_f [A][B]\, v(t) = k_f\, n_A n_B / v(t), \tag{3.28}$$

where $n_A$ and $n_B$ are the numbers of A and B molecules per cell. In certain situations, it may be that the concentration of one of the molecules, say B, remains constant. In this case, it is possible to define a pseudo-first order rate constant $k'_f = k_f c_B$, where $c_B = n_B/v(t)$, such that the reaction rate is $v'_f = k'_f n_A$. This assumption implies the presence of some intracellular feedback mechanism that ensures a constant number/volume ratio throughout the cell division cycle. This seems reasonable for housekeeping enzymes, such as RNA polymerases and ribosomes, but not for other types of reactions, such as protein dimerization or binding of transcriptional regulators to the DNA.

The problem of volume-dependent reaction rates can be circumvented by working with a model of cellular concentrations rather than the number of molecules per cell. The transformation of the number-based model in Eq. 3.27 to a concentration-based model is done by differentiation of the concentration $c(t) = n(t)/v(t)$. For example, the rate equation that describes the variation in protein numbers in Eq. 3.27 is in terms of the protein concentration $c_P(t) = n_P/v(t)$ given by:

$$\frac{dc_P(t)}{dt} = \frac{d}{dt}\left(\frac{n_P(t)}{v(t)}\right) = \frac{1}{v(t)}\frac{dn_P(t)}{dt} - \frac{c_P(t)}{v(t)}\frac{dv(t)}{dt} = c_M(t)\kappa_P - k_{dP}\, c_P(t) - \frac{c_P(t)}{v(t)}\frac{dv(t)}{dt}, \tag{3.29}$$

where $c_M(t)$ is the concentration of mRNA. While solving one problem (the volume-dependent rates of reaction), the transformation has introduced another. In order to solve Eq. 3.29, it is necessary to specify the explicit form of $v(t)$. How this can be done in general is uncertain. For example, E. coli cells are rod-shaped and double their length during a cell division cycle that ends when the cell is cleaved in the middle. On the other hand, the cell division of S. cerevisiae is highly asymmetric. A small bud is formed on the surface of the spherical mother cell and the bud grows in size until it eventually dissociates. The daughter cell continues to grow after cell division and it takes some time before it reaches maturity and begins to produce offspring of its own. In other words, the cell volume depends not only on the time since the last cell division but also on the overall age of the cell.

The only way to avoid describing how the volume changes as the cells grow and divide is to assume that the rate of cell volume increase $dv(t)/dt$ is proportional to the current volume $v(t)$ with some proportionality constant $k_g$. In mathematical terms it must be assumed that:

$$\frac{dv(t)}{dt} = k_g\, v(t), \qquad v(t) = v_0\, e^{k_g t}, \tag{3.30}$$

where $v_0$ is the initial volume of the cell.
If the cell divides every time its volume doubles, i.e., when $v(t) = 2v_0$, the proportionality factor $k_g$ is related to the period of the cell division cycle $T$ through $k_g = \ln 2/T$. In other words, to avoid modeling $v(t)$ explicitly, it is necessary to assume that the cell volume grows exponentially. With this assumption, the concentration-based model becomes a system of ordinary differential equations:

$$\frac{dc_M}{dt} = c_D \kappa_M - \gamma_M c_M, \qquad \frac{dc_P}{dt} = c_M \kappa_P - \gamma_P c_P, \tag{3.31}$$

where $c_D = n_D/v(t)$, and $\gamma_M = k_{dM} + k_g$ and $\gamma_P = k_{dP} + k_g$ are the first-order rate constants associated with the apparent, biological half-lives of mRNA and protein, respectively.

In summary, in order to model gene expression in cells that grow and divide, it is necessary to provide an explicit description of how the cell volume changes during the cell division cycle and how cellular content is partitioned at cell division. This is desirable to avoid. First of all, cell growth and division are complicated processes that are difficult to describe with a simple mathematical formula. Secondly, systems of differential equations with time-dependent parameters are significantly more difficult to analyze than systems of ordinary differential equations.

Figure 3.5: Levels of (A) mRNA and of (B) protein predicted by different models of gene expression. Black curves show the number of mRNA ($n_M$) and proteins ($n_P$) predicted by Eq. 3.27 with exponential volume growth, $v(t) = v_0 \exp(t \ln 2/T)$, and periodic equipartition of cellular content at regular intervals $T$ (cell division). Blue curves show the corresponding cellular concentrations $c_M = n_M/v(t)$ and $c_P = n_P/v(t)$. Green curves show the average number of molecules per cell predicted by Eq. 3.32. Red curves show the average cellular concentration predicted by Eq. 3.31. Parameter values: $v_0 = 1$ (arbitrary units), $n_D = 1$, $c_D = 0.5/\ln 2$, $T = 90$ minutes. Other parameter values (in min$^{-1}$): $\kappa_M = 0.04$, $\kappa_P = 6$, $k_{dM} = 0.03$, $k_{dP} =$ [...]

The assumption required to obtain a system of ordinary differential equations is unambiguous in a concentration-based model. It must be assumed that the cell volume grows exponentially. This assumption implies that the concentration is unaffected by cell division, i.e., that the mother and daughter cells inherit a fraction of the cellular content that is proportional to their respective size. It is not so clear what assumptions are required to convert a number-based model into a system of ordinary differential equations. The simplest way to obtain a model for the average number of molecules per cell is to assume that $dc(t)/dt \approx v(t)^{-1}\, dn(t)/dt$ and multiply both sides of Eq. 3.31 with $v(t)$ to obtain:

$$\frac{dn_M}{dt} = n_D \kappa_M - \gamma_M n_M, \qquad \frac{dn_P}{dt} = n_M \kappa_P - \gamma_P n_P. \tag{3.32}$$

The most direct interpretation of this model is that the loss of molecules at division is averaged over the cell division cycle and that the effect of volume changes on the rate constants is negligible.

The choice between a model of the average number of molecules per cell (Eq. 3.32) and a model of the average cellular concentrations (Eq. 3.31) can be made based on whether $n_D$ or $c_D$ remains constant. For genes that are carried on the cell's chromosome, it is reasonable to assume that $n_D$ is a constant (or changes in discrete steps). For genes that are carried on self-replicating plasmids, it is reasonable to assume that $c_D$ is constant. However, since both models describe time-averages, the concentration-based model can still be used when $n_D$ is constant if $c_D$ is set equal to the concentration averaged over one cell division cycle, $\bar{c}_D = \frac{1}{T}\int_0^T n_D/v(t)\, dt$. Figures 3.5A and 3.5B compare the two models of average gene expression with the predictions from the description in Eq. 3.27 where cell growth and division are modeled explicitly. It is assumed that the cell volume grows exponentially, $v(t) = v_0 \exp(t \ln 2/T)$, and that the cell divides into two identical halves with volume $v_0$ when $v(t) = 2v_0$. It is also assumed that $n_D = 1$ and that $c_D = \bar{c}_D = n_D/(2 \ln 2)$.

Table 3.1: Summary of experimental estimates of transcription rates, mRNA half-life, protein synthesis rates and the number of proteins synthesized per mRNA transcript. Columns: mRNAs/cell (Stanford), mRNAs/cell (MIT), mRNA half-life (min), transcription rate (min$^{-1}$), translation rate (min$^{-1}$), proteins synthesized per mRNA. Rows: # entries, total, average, min, max, median. Data is obtained from [...] and [...]translation/

Two large-scale experiments have provided estimates of transcription rates, the apparent half-lives of mRNA and translation rates for most of the yeast genes. These extensive data sets, which are summarized in Table 3.1, can be used to provide rough estimates of the magnitude of the parameters $\kappa_M$, $\kappa_P$ and $\gamma_M$. The median apparent mRNA half-life is 16 minutes (average of 19 minutes), corresponding to $\gamma_M \approx 0.04$ min$^{-1}$. The rate of transcription is probably in the range of $\kappa_M = 0.03$ to $0.1$ min$^{-1}$ for most genes. The median rate of translation per mRNA is $\kappa_P \approx 6$ min$^{-1}$, but can vary greatly. While no genome-wide estimates of apparent protein half-lives are currently available, the value of $\gamma_P$ is expected to be significantly lower than $\gamma_M$. For well-translated genes, it has been estimated that there are about 4000 protein molecules per mRNA.
This puts the value of $\gamma_P$ in the range of approximately $\gamma_P$ [...] min$^{-1}$. Combining the two datasets gives a predicted median number of proteins produced per mRNA, i.e., the ratio $b = \kappa_P/\gamma_M$, of 130 (average of 185).

3.3 Modeling cis-regulatory Systems

One of the most important means of controlling gene expression is the modulation of the rate of transcription by transcription factor proteins. Regardless of whether gene expression is modeled in terms of numbers or concentrations, the steady state number of mRNA ($n^s_M$) and protein ($n^s_P$) per cell is predicted to be given by:

$$n^s_M = c^s_M\, v(t) = f(\mathbf{x})\,\frac{\kappa_M}{\gamma_M}, \qquad n^s_P = c^s_P\, v(t) = n^s_M\,\frac{\kappa_P}{\gamma_P} = f(\mathbf{x})\,\frac{\kappa_M \kappa_P}{\gamma_M \gamma_P}, \tag{3.33}$$

where $\kappa_M$ is the maximal rate of transcription and $f(\mathbf{x})$ describes how the rate of transcription is modulated in response to the input signals $\mathbf{x} = (x_1, \dots, x_n)$. The response function $f(\mathbf{x})$, which varies between zero and unity, gives the time- or population-averaged relative occupancy of the promoter by a polymerase. In this section, we consider a number of examples of how quantitative response functions describing the promoter occupancy can be obtained from the qualitative knowledge of molecular interactions between cis- and trans-regulatory elements.

In what follows, the state of the cis-regulatory region will be denoted using a compact notation, $O_{ijk}$, where $i$, $j$ and $k$ indicate the occupancy of a specific binding site on the DNA. When a single binding site is considered, there will be two states, $O_0$ and $O_1$, where $O_0$ indicates that the binding site is unoccupied and $O_1$ indicates that the binding site is occupied. When two binding sites are considered, the state $O_{00}$ indicates that both sites are unoccupied, $O_{10}$ and $O_{01}$ indicate that either site $i$ or site $j$ is occupied, respectively, and $O_{11}$ indicates that both binding sites are occupied.

3.3.1 Repressor-Operator Binding

The binding of LacR to the lacO operator in Fig. 3.1 involves three equilibrium reactions. These reactions can be represented by the symbolic reaction equations:

$$2X \overset{K_1}{\rightleftharpoons} X_2, \qquad 2X_2 \overset{K_2}{\rightleftharpoons} X_4, \qquad X_4 + O_0 \overset{K_3}{\rightleftharpoons} O_1, \tag{3.34}$$

where $X$, $X_2$ and $X_4$ denote repressor monomers, dimers and tetramers, respectively, while $O_0$ and $O_1$ denote the unoccupied operator region and the tetramer-operator complex, respectively. The equilibrium constants are defined by:

$$K_1 = \frac{[X_2]}{[X]^2}, \qquad K_2 = \frac{[X_4]}{[X_2]^2}, \qquad K_3 = \frac{[O_1]}{[X_4][O_0]}. \tag{3.35}$$

Note that the subscript eq indicating the equilibrium concentration has been omitted for clarity. The remaining part of this section deals exclusively with steady states and it is implied that concentration brackets refer to equilibrium concentrations.

In some cases, it may not be necessary to explicitly include all of the possible intermediate reaction steps. For example, the overall reaction for the binding of LacR to the lacO operator can be represented by a single reversible reaction step where four LacR monomers associate simultaneously with the LacR binding site:

$$4X + O_0 \underset{k_{4b}}{\overset{k_{4f}}{\rightleftharpoons}} O_1, \qquad K_4 = \frac{[O_1]}{[X]^4 [O_0]}. \tag{3.36}$$

It is thus possible to define an equilibrium state based solely on the initial reactants and the final products without knowledge of the intermediate steps. When the detailed reaction mechanism is known, the equation for the overall reaction can be obtained by adding together the individual reactions. The following illustrates how the overall reaction equation is obtained for the reactions in Eq. 3.34:

$$\begin{aligned}
2X &\rightleftharpoons X_2, \\
2X &\rightleftharpoons X_2, \\
2X_2 &\rightleftharpoons X_4, \\
O_0 + X_4 &\rightleftharpoons O_1 \\
\hline
4X + \boxed{2X_2} + O_0 + \boxed{X_4} &\rightleftharpoons \boxed{2X_2} + \boxed{X_4} + O_1 \\
4X + O_0 &\rightleftharpoons O_1.
\end{aligned} \tag{3.37}$$

In this procedure, terms that appear on the same side of the reaction arrows are first summed to give a reaction equation where some of the terms may appear on both sides of the reaction arrows. These terms, which are indicated by boxes in Eq. 3.37, can be eliminated if they appear with the same stoichiometric coefficient on both sides, as they represent intermediate states that are neither reactants nor products. Note that the reaction $2X \rightleftharpoons X_2$ appears twice in Eq. 3.37 to ensure that the overall stoichiometry is correct (dimerization has to occur twice for each tetramer formed). The equilibrium constant for the overall reaction is given by the product of the equilibrium constants for the individual reactions:

$$K_4 = \frac{[O_1]}{[X]^4 [O_0]} = K_1^2 K_2 K_3. \tag{3.38}$$

The equilibrium constant $K_1$ is squared since this reaction occurs twice in the overall reaction. The correctness of the expression for $K_4$ can easily be validated by inspection.

The relative occupancy of the operator by the tetramer is given by the response function $f([X]) = [O_1]/[O_T]$ where $[O_T] = [O_0] + [O_1]$. It can easily be derived by setting $[O_0] = [O_T] - [O_1]$ in Eq. 3.38 as:

$$f([X]) = \frac{K_4 [X]^4}{1 + K_4 [X]^4} \quad\text{or}\quad f(x) = \frac{x^4}{1 + x^4}, \tag{3.39}$$

where the response function $f(x)$ is obtained by introducing the dimensionless concentration $x$ of repressor monomers as $x = \sqrt[4]{K_4}\,[X]$.
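The equivalence between the stepwise description and the single overall reaction can be verified numerically. The sketch below computes the operator occupancy in two ways: by following the three equilibria step by step, and by using the overall constant $K_4 = K_1^2 K_2 K_3$ directly. The function names and the numerical values of $K_1$, $K_2$ and $K_3$ are arbitrary assumptions for illustration, and `x` denotes the free monomer concentration $[X]$.

```python
import math


def occupancy_stepwise(x, k1, k2, k3):
    """Operator occupancy from the three equilibria (dimer, tetramer, binding)."""
    dimer = k1 * x ** 2        # [X2] = K1 [X]^2
    tetramer = k2 * dimer ** 2  # [X4] = K2 [X2]^2
    ratio = k3 * tetramer       # [O1]/[O0] = K3 [X4]
    return ratio / (1.0 + ratio)  # [O1] / ([O0] + [O1])


def occupancy_overall(x, k4):
    """Operator occupancy from the overall reaction 4X + O0 <-> O1."""
    return k4 * x ** 4 / (1.0 + k4 * x ** 4)


# Arbitrary illustrative equilibrium constants (not measured values).
K1, K2, K3 = 2.0, 3.0, 5.0
K4 = K1 ** 2 * K2 * K3  # overall equilibrium constant

for x in (0.1, 0.3, 1.0):
    print(x, occupancy_stepwise(x, K1, K2, K3), occupancy_overall(x, K4))

# Half-maximal occupancy occurs at [X] = K4^(-1/4), i.e. at dimensionless x = 1.
x_half = K4 ** -0.25
print(occupancy_overall(x_half, K4))
```

Both routes give the same occupancy for every monomer concentration, and the occupancy is one half when $K_4 [X]^4 = 1$.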
The function $f(x)$ in Eq. 3.39 is a Hill-type response function (see section 3.1.6) with a Hill coefficient equal to four and a Hill constant given by $K_H = K_4^{-1}$ (or $K_H = 1$ when the normalized input $x$ is used).

3.3.2 Alternative Reaction Paths

While the three reaction steps in Fig. 3.1, described by Eq. 3.34, are believed to capture the molecular-level details of the binding of LacR to O1, there are alternative

routes to the formation of the operator-tetramer complex.

Figure 3.6: Alternative reaction paths for the binding of a tetrameric complex to an operator region composed of two adjacent binding sites $O_A$ and $O_B$. Reaction path (i) (gray box) corresponds to the binding of the lac repressor (Fig. 3.1). In paths (ii) and (iii), the tetramer-operator complex is formed by the sequential binding of dimers.

The lacO operator comprises two half-sites that each make contact with one dimeric subunit of the tetramer. It is therefore possible that the binding of the tetramer occurs sequentially, one dimer at a time, rather than in a single step. The resulting alternative paths are illustrated in Fig. 3.6 together with the reaction path (Fig. 3.1) believed to best capture the actual binding of LacR to O1. This reaction path is labeled (i) in Fig. 3.6 and is emphasized by a gray box. In the scenario labeled (ii), a dimer first binds to the left binding site ($O_A$) and a second dimer then binds to the right binding site ($O_B$) to form the tetramer-operator complex. The order of dimer binding is reversed in the reaction path labeled (iii). The observant reader will notice that reaction path (ii) describes the binding of the λ CI repressor to the adjacent operators OR1 and OR2 to form a tetrameric complex between four λ CI monomers and the OR region of the $P_R$/$P_{RM}$ promoters (see section 1.4.2). λ CI will be discussed further in section 3.3.3. Moreover, in the modified Gal1 promoter (section 2.2.2) two TetR repressor dimers can bind to two adjacent and identical tetO operators. The dimeric TetR repressor proteins are not known to form tetramers, and their binding to the two operator sites probably follows reaction paths (ii) and (iii) with identical equilibrium constants for each step.
The modified Gal1 promoter is discussed further later in the text. When $O_{10}$ and $O_{01}$ denote configurations where a repressor dimer $X_2$ is bound to the operators $O_A$ and $O_B$, respectively, the additional reaction steps can be represented

by the symbolic reaction equations:

$$X_2 + O_{00} \overset{K_{2A}}{\rightleftharpoons} O_{10}, \quad X_2 + O_{10} \overset{K_{3A}}{\rightleftharpoons} O_{11}, \quad X_2 + O_{00} \overset{K_{2B}}{\rightleftharpoons} O_{01}, \quad X_2 + O_{01} \overset{K_{3B}}{\rightleftharpoons} O_{11}, \tag{3.40}$$

where $O_{11}$ is the state where both $O_A$ and $O_B$ are occupied. The equilibrium constants for the reactions in Eq. 3.40 are defined by:

$$K_{2A} = \frac{[O_{10}]}{[X_2][O_{00}]}, \quad K_{2B} = \frac{[O_{01}]}{[X_2][O_{00}]}, \quad K_{3A} = \frac{[O_{11}]}{[X_2][O_{10}]}, \quad K_{3B} = \frac{[O_{11}]}{[X_2][O_{01}]}. \tag{3.41}$$

Using the same method that was used to derive the overall equilibrium constant $K_4$ for reaction path (i) in Eq. 3.38, the overall equilibrium constants $K_{4A}$ and $K_{4B}$ for paths (ii) and (iii), respectively, can be obtained as:

$$K_{4A} = K_1^2 K_{2A} K_{3A}, \qquad K_{4B} = K_1^2 K_{2B} K_{3B}. \tag{3.42}$$

The three different reaction scenarios in Fig. 3.6 have the same overall reaction and the overall equilibrium constants must therefore be identical, i.e., $K_4 = K_{4A} = K_{4B}$. This is a consequence of a principle known as Independence of the Path and of the Gibbs-Helmholtz equation. Independence of the path means that the overall change in free energy is the same regardless of the reaction mechanisms involved in the conversion of the reactants into the products. It is based on the fact that the total energy of a molecule depends only on its present state, not on the details of its past history. The Gibbs-Helmholtz equation then tells us that if two processes have identical values of $\Delta G$, they will also have identical equilibrium constants (Eq. 3.13). This has important implications, as the parameters associated with two alternative reaction paths will not be independent. In the case of paths (ii) and (iii) in Fig. 3.6, the equilibrium constants are constrained by:

$$K_{2A} K_{3A} = K_{2B} K_{3B} = K_2 K_3. \tag{3.43}$$

This constraint simply reflects that the change in Gibbs free energy is the same regardless of the reaction path taken. Modelers that do not pay attention to such constraints are likely to make spurious predictions.
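The path-independence constraint can be turned into a simple consistency check on a parameter set. The sketch below is only an illustration: the function names and all numerical values are hypothetical, and it assumes the constants are defined as in the equations for paths (ii) and (iii), so that choosing three of the four dimer-binding constants fixes the fourth.

```python
import math


def k3b_from_constraint(k2a, k3a, k2b):
    """Solve K2A * K3A = K2B * K3B for K3B."""
    return k2a * k3a / k2b


def overall_constants(k1, k2a, k3a, k2b, k3b):
    """Overall constants K4A and K4B for paths (ii) and (iii)."""
    return k1 ** 2 * k2a * k3a, k1 ** 2 * k2b * k3b


def is_path_independent(k1, k2a, k3a, k2b, k3b, rel_tol=1e-9):
    """True when both sequential paths give the same overall equilibrium constant."""
    k4a, k4b = overall_constants(k1, k2a, k3a, k2b, k3b)
    return math.isclose(k4a, k4b, rel_tol=rel_tol)


# Hypothetical parameter values in arbitrary units.
k1, k2a, k3a, k2b = 2.0, 5.0, 40.0, 0.5
k3b = k3b_from_constraint(k2a, k3a, k2b)

print(k3b, is_path_independent(k1, k2a, k3a, k2b, k3b))
# Choosing all four constants freely generally violates the constraint:
print(is_path_independent(k1, k2a, k3a, k2b, 40.0))
```

A check of this kind is cheap to run before fitting or simulating a model with several alternative binding paths.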
Despite the fact that the three alternative reaction paths have the same overall equilibrium constant, there can be significant differences in their response functions. The response function associated with the binding of a tetramer was already derived in Eq. 3.39. The response function for the formation of a tetramer-operator complex becomes slightly more complicated when the alternative paths in Fig. 3.6 and more operator states are included. In this case, the response function is given by $g([X]) = [O_{11}]/[O_T]$ where $[O_T] = [O_{00}] + [O_{10}] + [O_{01}] + [O_{11}]$. The response function can be derived by using the equilibrium constants $K_{3A}$, $K_{3B}$ and $K_4$ to express the equilibrium concentrations of $O_{00}$, $O_{10}$ and $O_{01}$ as functions of $[O_{11}]$ and $[X]$:

$$[O_{00}] = \frac{[O_{11}]}{K_4[X]^4}, \qquad [O_{10}] = \frac{[O_{11}]}{K_1K_{3A}[X]^2}, \qquad [O_{01}] = \frac{[O_{11}]}{K_1K_{3B}[X]^2}. \tag{3.44}$$

The derivation of these relationships uses the definition of $K_1$ to express the concentration of dimers as a function of the concentration of monomers, i.e., $[X_2] = K_1[X]^2$. The equilibrium concentrations $[O_{00}]$, $[O_{10}]$ and $[O_{01}]$ are then inserted into the expression for the total operator concentration to give:

$$[O_T] = \frac{[O_{11}]}{K_4[X]^4} + \frac{[O_{11}]}{K_1K_{3A}[X]^2} + \frac{[O_{11}]}{K_1K_{3B}[X]^2} + [O_{11}]. \tag{3.45}$$

The response function $g([X])$ is then obtained by rearrangement as:

$$g([X]) = \frac{K_4[X]^4}{1 + K_1(K_{2A} + K_{2B})[X]^2 + K_4[X]^4}. \tag{3.46}$$

Assume that the equilibrium constants for the binding of a dimer to the $O_A$ and $O_B$ sites are identical and that the binding of a second dimer to one of the two sites is independent of the occupancy of the other. In this special case, the equilibrium constants $K_{2A}$, $K_{2B}$, $K_{3A}$ and $K_{3B}$ have identical values. When the equilibrium constant for the binding of a dimer to any one operator is denoted $K_{AB}$, it is obtained from Eq. 3.43 that $K_{AB} = \sqrt{K_2K_3}$ (since $K_{2A}K_{3A} = K_2K_3$). The definition of $K_4$ in Eq. 3.38 then implies that the term $K_1(K_{2A} + K_{2B})$ in Eq. 3.46 can be replaced by $2\sqrt{K_4}$ (since $2K_1K_{AB} = 2K_1\sqrt{K_2K_3}$). Accordingly, the response function is given by:

$$g_S(x) = \frac{x^4}{1 + 2x^2 + x^4}, \tag{3.47}$$

where $x = \sqrt[4]{K_4}\,[X]$. This response function is probably the most suitable description of the modified Gal1 promoter (section 2.2.2), where the TetR repressor dimers are believed to bind to the two tetO operators independently. The comparison of the response functions $f(x)$ and $g_S(x)$ in Fig. 3.7A shows that the response function $f(x)$ is steeper, reaches the 50% level at a lower value of $x$ and saturates faster than $g_S(x)$. Hence, a switch based on the binding of a tetramer (reaction path i) is more sensitive and robust compared to a switch based on the sequential binding of dimers (reaction paths ii and iii).

3.3.3 Cooperative Binding of Dimers

The Hill plots in Fig. 3.7B demonstrate that a switch that involves sequential and independent binding of dimers functions less efficiently, i.e., has a lower Hill coefficient,

compared to a switch where a tetramer is formed in solution rather than on the DNA.

Figure 3.7: (A) Response functions $r(y)$ for the occupancy of the O1 operator when the repressor can bind to the two operator half-sites as a tetramer ($r(y) = f(x)$) or sequentially as dimers ($r(y) = g_S(x)$). The value of $x_{0.5}$ is the concentration of repressor monomers that gives 50% occupancy. The curve $h(x)$ is the Hill curve approximation to $g_S(x)$. (B) Hill plots constructed from the curves in (A). The Hill coefficients for $r(y) = f(x)$ and $r(y) = g_S(x)$ are 4 and 2.3, respectively.

The Hill coefficient associated with the sequential binding of dimers can, however, be increased by manipulation of the equilibrium constants. The function $g([X])$ in Eq. 3.46 coincides with the function $f([X])$ in Eq. 3.39 when $K_{2A}$ and $K_{2B}$ become vanishingly small. The Hill coefficient can thus be increased from its value of 2.3 when all the binding constants are equal to $K_{AB}$ (the function $g_S(x)$) to a maximal value of 4 (the function $f(x)$) by decreasing the affinity of $O_A$ and $O_B$ for the binding of the first dimer. This in turn implies that the binding of a dimer to one site can no longer be independent of the occupancy of the other site. Since the products $K_{2A}K_{3A}$ and $K_{2B}K_{3B}$ must be constant (and equal to the product $K_2K_3$, Eq. 3.43), a decrease of $K_{2A}$ (or of $K_{2B}$) must be accompanied by a corresponding increase of $K_{3A}$ (or of $K_{3B}$). In other words, a more efficient switch can be obtained when a dimer that is bound to one site participates in the stabilization of the interaction between a second dimer and its binding site. Such synergism is observed frequently between cis and trans regulatory elements and typically leads to improved switching properties.
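The effect of weakening the first dimer-binding step can be explored numerically. In normalized units the sequential-binding response has the form g(x) = x^4 / (1 + c x^2 + x^4), where the coefficient c = K1(K2A + K2B)/sqrt(K4) is introduced here only as shorthand (it is not a symbol used in the text). The sketch below computes the local slope of the Hill plot at the half-saturation point, which is a slightly different measure from the linear fit used for the figure, so the numbers need not match the quoted Hill coefficients exactly.

```python
import math


def dimer_path_occupancy(x, c=2.0):
    """g(x) = x^4 / (1 + c x^2 + x^4); c = 2 gives g_S(x), c = 0 gives f(x)."""
    return x ** 4 / (1.0 + c * x ** 2 + x ** 4)


def half_saturation(c):
    """Solve x^4 = 1 + c x^2 for the positive root, i.e. g(x) = 0.5."""
    return math.sqrt((c + math.sqrt(c * c + 4.0)) / 2.0)


def hill_plot_slope(c, x, rel_step=1e-4):
    """Central-difference slope of ln(g / (1 - g)) against ln(x)."""
    def logit(value):
        g = dimer_path_occupancy(value, c)
        return math.log(g / (1.0 - g))

    x_lo, x_hi = x * (1.0 - rel_step), x * (1.0 + rel_step)
    return (logit(x_hi) - logit(x_lo)) / (math.log(x_hi) - math.log(x_lo))


for c in (2.0, 1.0, 0.1, 0.0):
    x_half = half_saturation(c)
    slope = hill_plot_slope(c, x_half)
    print(f"c = {c:3.1f}  x_0.5 = {x_half:.3f}  local Hill slope = {slope:.2f}")
```

With c = 2 (independent, identical sites) the local slope is about 2.3, and it approaches 4 as c tends to zero, in line with the argument that weaker binding of the first dimer gives a steeper switch.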
In terms of energetics, synergism implies that the decrease in free energy associated with the binding of two dimers to the regulatory region is greater than the sum of the decreases in free energy associated with the dimer-dimer, the dimer-$O_A$, and the dimer-$O_B$ interactions. This reflects that most of the free energy is released when the tetramer-operator complex is formed. The binding of λ CI dimers to the OR1 and OR2 sites in the operator region of the $P_R$/$P_{RM}$ promoters is one example of how synergy between cis and trans regulatory elements can improve the performance of a genetic switch. As mentioned above, the formation of a tetramer-operator complex in the OR region is believed to occur through

the reaction path labeled (ii) in Fig. 3.6.

Figure 3.8: (A) Response functions $g(x)$ (Eq. 3.48) obtained for different values of σ and γ. The function $g_S(x)$ is recovered when σ = γ = 1 (broken curve). The steepness increases and $x_{0.5}$ decreases as σ and γ increase. Parameter values: σ = 1, γ = 10 (red curve), σ = 10, γ = 1 (blue curve), σ = 10, γ = 10 (green curve). (B) Hill plots obtained for the different values of σ and γ. The Hill coefficients obtained by linear fitting are: $n_H$ = 2.3 (broken curve), $n_H$ = 2.8 (red curve), $n_H$ = 2.9 (blue curve) and $n_H$ = 3.3 (green curve).

Recall from section 1.4.2 that the first CI dimer binds preferentially to OR1 ($O_A$ in Fig. 3.6) and that the binding of a CI dimer to OR2 ($O_B$ in Fig. 3.6) is dependent on the presence of a CI dimer bound to OR1. These observations imply that $K_{2B} \ll K_{2A}$ and that $K_{3A} \gg K_{2A}$. In other words, a CI dimer bound to OR1 increases the equilibrium constant for the binding of a CI dimer to OR2 from a value that is significantly lower than $K_{2A}$ to a value that is significantly higher than $K_{2A}$ (roughly 10-fold higher). This makes for a more efficient switch that has a higher Hill coefficient than the switch where the dimers bind independently of each other. To quantify the increase in the Hill coefficient, assume that the binding of the first CI dimer to OR2 is associated with an equilibrium constant that is reduced by a factor of γ compared with that for OR1, i.e., $K_{2B} = \gamma^{-1}K_{2A}$. With this assumption, the term $K_1(K_{2A} + K_{2B})$ in Eq. 3.46 can be replaced by $(1 + \gamma^{-1})K_1K_{2A}$. In addition, assume that the binding constant for the association of a second CI dimer to OR2 is increased by a factor σ compared to the binding constant for the association of the first dimer to OR1, i.e., $K_{3A} = \sigma K_{2A}$. From $K_{2A}K_{3A} = \sigma K_{2A}^2$ and the constraint $K_{2A}K_{3A} = K_2K_3$, the term $(1 + \gamma^{-1})K_1K_{2A}$ can be replaced by $(1 + \gamma^{-1})\sqrt{\sigma^{-1}}\,K_1\sqrt{K_2K_3}$.
In the final step, the definition of $K_4$ is used to obtain the response function given by:

$$g(x) = \frac{x^4}{1 + (1 + \gamma^{-1})\sqrt{\sigma^{-1}}\,x^2 + x^4}, \tag{3.48}$$

where $x = \sqrt[4]{K_4}\,[X]$. Constructing the Hill plot using data points in the range $0.1 < g(x) < 0.9$ for different values of γ and σ produces curves that can be fitted reasonably well to straight lines (Fig. 3.8B). The slope of the line, i.e., the Hill coefficient, obtained

for σ = 10 and γ = 10 has a value of 3.3, which is a significant improvement over the value of 2.3 obtained when the dimers bind independently to OR1 and OR2. A detailed treatment of the interaction between λ CI and operator elements in the $P_R$ promoter is given by Isaacs et al. and references therein.

3.3.4 Synergism in RNA Polymerase Binding

The recruitment of the RNA polymerase to the promoter of the lactose operon by the CAP transcription factor (section 1.4.1) can be described by the reaction mechanism illustrated in Fig. 3.9.

Figure 3.9: The formation of a CAP-DNA-polymerase complex can occur through three reaction paths: (i) binding of the RNA polymerase holoenzyme to the promoter when a CAP dimer is bound to the DNA, (ii) binding of a CAP-polymerase complex, and (iii) binding of polymerase followed by binding of CAP.

In the reaction scheme, there are three alternative paths to the formation of a closed polymerase-promoter complex. In the reaction path labeled (i), a CAP dimer binds to its operator site and the RNA polymerase is subsequently recruited to form the CAP-polymerase-operator complex. The order of CAP and polymerase binding is reversed in the reaction path labeled (iii). In reaction path (ii), the CAP protein and the polymerase form a complex prior to the formation of the CAP-polymerase-operator complex. Note that the overall reaction scheme is very similar to the one illustrated in Fig. 3.6. The most significant difference is the substitution of one protein dimer with the polymerase holoenzyme. The three different reaction paths in Fig. 3.9 have the same overall reaction. When $A$ and $P$ are used to denote CAP monomers and polymerase holoenzymes, the overall reaction can be represented by a fourth-order elementary reaction:

$$2A + P + O_{00} \overset{K_4}{\rightleftharpoons} O_{11}, \qquad K_4 = \frac{[O_{11}]}{[A]^2[P][O_{00}]}, \tag{3.49}$$

where $O_{00}$ denotes the operator region with an unoccupied CAP site and an unoccupied promoter, and $O_{11}$ denotes the state where both sites are occupied. To derive the response function, it is only necessary to consider a subset of the reactions since the system is overdetermined. In addition to Eq. 3.49, the intermediate reaction steps from Fig. 3.9 that need consideration are given by:

$$2A \overset{K_1}{\rightleftharpoons} A_2, \quad A_2 + O_{00} \overset{K_2}{\rightleftharpoons} O_{10}, \quad P + A_2 \overset{K_{2A}}{\rightleftharpoons} A_2P, \quad O_{00} + P \overset{K_{2B}}{\rightleftharpoons} O_{01}, \tag{3.50}$$

where $O_{10}$ and $O_{01}$ indicate an occupied CAP site and an occupied promoter, respectively, and $A_2P$ is the CAP-polymerase complex. The equilibrium constants for these reactions are given by:

$$K_1 = \frac{[A_2]}{[A]^2}, \quad K_2 = \frac{[O_{10}]}{[A_2][O_{00}]}, \quad K_{2A} = \frac{[A_2P]}{[A_2][P]}, \quad K_{2B} = \frac{[O_{01}]}{[P][O_{00}]}. \tag{3.51}$$

Before proceeding with the derivation of the relative promoter occupancy, it is worth commenting on reaction path (ii), in which the activator forms a complex with the RNA polymerase holoenzyme prior to binding to the promoter. At equilibrium, the fraction of polymerases that are bound with activator, $[A_2P]/([P] + [A_2P])$, is given by:

$$\frac{[A_2P]}{[P] + [A_2P]} = \frac{K_{2A}[A_2]}{1 + K_{2A}[A_2]}. \tag{3.52}$$

If $K_{2A}$ has any appreciable value, a significant fraction of the polymerases will have a CAP dimer attached, and the CAP-polymerase complex could be viewed as a type of holoenzyme in which the CAP component provides specificity to promoters with an adjacent CAP binding site. Holoenzymes that contain an additional activator have, to the best of my knowledge, not been reported in the literature. As a result, the value of $K_{2A}$ will be assumed to be negligibly small, corresponding to a small negative (or perhaps a positive) value of $\Delta G$ for the interaction between CAP and the holoenzyme.
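To get a feel for how the fraction of activator-bound polymerase depends on the CAP-polymerase affinity, the relation above can be evaluated directly. The following is a minimal sketch with hypothetical parameter values in arbitrary units; the function names are illustrative only.

```python
def dimer_concentration(a_monomer, k1):
    """[A2] = K1 [A]^2, from the definition of K1."""
    return k1 * a_monomer ** 2


def fraction_polymerase_with_cap(a2, k2a):
    """[A2P] / ([P] + [A2P]) = K2A [A2] / (1 + K2A [A2])."""
    return k2a * a2 / (1.0 + k2a * a2)


a2 = dimer_concentration(a_monomer=0.5, k1=4.0)  # hypothetical values
for k2a in (0.0, 0.01, 1.0, 100.0):
    print(k2a, fraction_polymerase_with_cap(a2, k2a))
```

The fraction vanishes when the affinity is zero and approaches one when the product of affinity and dimer concentration is large, which is why a negligible affinity is needed for the free holoenzyme to dominate.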
The difficulties in pinpointing the exact constituents of the eukaryotic RNA polymerase II holoenzyme could be an indication that the staged assembly of the pre-initiation complex (see section 1.4.3) might involve a number of alternative reaction paths. When $K_{2A}$ is set equal to zero and the concentration of free RNA polymerase holoenzyme is assumed constant, $[P] = c_P$, the relative occupancy of the promoter can be obtained from $([O_{01}] + [O_{11}])/[O_T]$ with $[O_T] = [O_{00}] + [O_{10}] + [O_{01}] + [O_{11}]$, as:

$$h(a) = \frac{\sigma_B + a^2}{1 + \sigma_B + (1 + \gamma_B)a^2}, \tag{3.53}$$

where $a = \sqrt{K_4c_P}\,[A]$ is the normalized concentration of CAP monomers and the new parameters $\sigma_B$ and $\gamma_B$ are defined by:

$$\sigma_B = K_{2B}c_P, \qquad \gamma_B = \frac{K_1K_2}{K_4c_P} = \frac{1}{K_3c_P}. \tag{3.54}$$

Here $K_3 = K_4/(K_1K_2)$ is the equilibrium constant for the binding of the polymerase to the promoter when CAP is bound.

Figure 3.10: (A) Relative promoter occupancies predicted by Eq. 3.53 for different values of $\sigma_B$ and $\gamma_B$. Parameter values: $\sigma_B = 0$, $\gamma_B = 0$ (green curve), $\sigma_B = 0$, $\gamma_B = 0.5$ (blue curve), $\sigma_B = 0.5$, $\gamma_B = 0$ (red curve), $\sigma_B = 0.5$, $\gamma_B = 0.5$ (broken curve). (B) Hill plots of the different parameter values with $h = \sigma_B/(1 + \sigma_B)$. Only curves obtained for low values of $\sigma_B$ can be approximated well by a Hill-type function.

Different relative occupancies for different values of $\sigma_B$ and of $\gamma_B$ are shown in Fig. 3.10A. The most efficient switch is obtained in the limit where $\sigma_B$ and $\gamma_B$ are negligibly small. The Hill plot generated by $h(a)$ shows straight lines with a slope of two when $\sigma_B = 0$ (Fig. 3.10B). For $\sigma_B > 0$, a linear fit to the curves in the Hill plot gives lines with slopes that are less than two. Moreover, high values of $h(a)$ require that $\gamma_B$ is low (Fig. 3.10A). These observations have clear biological interpretations (see Fig. 3.9). If the value of $K_{2B}$ (and hence $\sigma_B$) is high, the polymerase will bind to the promoter in the absence of the activator, $h(a = 0) = \sigma_B/(1 + \sigma_B)$. This gives rise to leaky expression and a decreased Hill coefficient. If $K_3$ is low (corresponding to a high value of $\gamma_B$), the activator is ineffective and high occupancies cannot be achieved. Taken together with the previous argument that $K_{2A}$ is low, the most efficient switch is obtained under the following conditions: (1) the interaction between the activator and the polymerase must be weak, (2) the interaction between the polymerase and the promoter must be weak, and (3) the interaction between the polymerase and the promoter with CAP bound must be strong.
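The roles of the two parameters can be checked numerically. The sketch below assumes that the relative promoter occupancy has the form h(a) = (σ_B + a²) / (1 + σ_B + (1 + γ_B) a²), which reproduces the basal level h(0) = σ_B/(1 + σ_B) stated above; the saturation level 1/(1 + γ_B) is simply the large-a limit of the same expression. Function names and parameter values are illustrative.

```python
def promoter_occupancy(a, sigma_b, gamma_b):
    """Relative promoter occupancy h(a) for normalized activator concentration a."""
    return (sigma_b + a * a) / (1.0 + sigma_b + (1.0 + gamma_b) * a * a)


def basal_occupancy(sigma_b):
    """Leaky occupancy in the absence of activator, h(0)."""
    return sigma_b / (1.0 + sigma_b)


def saturation_occupancy(gamma_b):
    """Limit of h(a) for large a."""
    return 1.0 / (1.0 + gamma_b)


# Parameter pairs used for the curves in the figure.
for sigma_b, gamma_b in ((0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (0.5, 0.5)):
    print(sigma_b, gamma_b,
          basal_occupancy(sigma_b),
          saturation_occupancy(gamma_b),
          promoter_occupancy(1.0, sigma_b, gamma_b))
```

A nonzero σ_B raises the floor of the response (leaky expression), while a nonzero γ_B lowers its ceiling, so the dynamic range is largest when both are small.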
In short, these conditions state that a more efficient switch is obtained when the activator and the promoter operate synergistically.

3.3.5 DNA looping

Synergism implies cooperative interactions between multiple components and typically manifests as an increase of the Hill coefficient. The two previous sections demonstrated

this phenomenon for the binding of λ CI dimers to the OR region and in the CAP-dependent binding of the RNA polymerase, respectively. In section 1.4.1, it was discussed how the auxiliary operators O2 and O3 affect repression of the lactose operon by forming a DNA loop in which a repressor tetramer is bound to O1 and to any one of the auxiliary operators. One explanation for the increased efficiency of the LacR tetramer is that the looped DNA acts as a physical barrier that prevents the RNA polymerase from finding the promoter region. Another possible explanation is that the DNA looping, particularly the loop formed between O1 and O2, prevents the RNA polymerase from moving down the gene after it has successfully initiated transcription from the $P_{lac}$ promoter. However, the observed increase in repression may also be a consequence, at least in part, of a change in the equilibrium distribution of promoter occupancies due to the additional repressed states that are possible when one of the auxiliary operators can cooperate with the main operator, without necessarily acting in synergy. The derivation of the response function becomes quite messy when the system contains a large number of binding sites. There are $2^3 = 8$ possible states of the $P_{lac}$ promoter that do not involve the formation of a loop, and each of the three different loops (O1-O2, O1-O3 and O2-O3) can have the third operator either occupied or unoccupied by the repressor. This gives a total of 14 possible states. For simplicity, what follows will assume the presence of only one auxiliary operator site, O3. However, the method used to derive the response function in this simpler case can readily be extended to a system that contains an arbitrary number of binding sites. It is advisable to employ software capable of symbolic algebra when deriving the response functions of larger systems.
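The state count can be reproduced by brute-force enumeration, which is also a convenient starting point when setting up larger systems. The sketch below is an illustration only; it assumes that a loop is formed by one tetramer bridging exactly two operators and that every remaining operator is either free or occupied. The labels and the function name are hypothetical.

```python
from itertools import combinations, product

OPERATORS = ("O1", "O2", "O3")


def enumerate_states(operators=OPERATORS):
    """List (kind, occupancy) pairs for all unlooped and looped configurations."""
    states = []
    # Unlooped states: every operator is either free (0) or occupied (1).
    for occupancy in product((0, 1), repeat=len(operators)):
        states.append(("unlooped", dict(zip(operators, occupancy))))
    # Looped states: one tetramer bridges a pair of operators;
    # each remaining operator is free or occupied.
    for pair in combinations(operators, 2):
        rest = [op for op in operators if op not in pair]
        for occupancy in product((0, 1), repeat=len(rest)):
            states.append(("loop " + "-".join(pair), dict(zip(rest, occupancy))))
    return states


print(len(enumerate_states()))              # three operators
print(len(enumerate_states(("O1", "O3"))))  # main operator plus one auxiliary site
```

With one main and one auxiliary operator, the same enumeration gives four unlooped states and a single looped state.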
The occupancy of the O1 and O3 sites can be described using the notation $O_{ij}$, where $i$ is the number of repressor molecules bound to the main operator and $j$ is the number of repressor molecules bound to the auxiliary operator. In the absence of DNA looping, the distance between operator sites makes it reasonable to assume that the binding of the repressor to one site is independent of the occupancy of the other. The binding reactions can then be described by:

$$O_{00} + X_4 \overset{K_3}{\rightleftharpoons} O_{10}, \qquad O_{00} + X_4 \overset{\sigma K_3}{\rightleftharpoons} O_{01}, \tag{3.55}$$

$$O_{01} + X_4 \overset{K_3}{\rightleftharpoons} O_{11}, \qquad O_{10} + X_4 \overset{\sigma K_3}{\rightleftharpoons} O_{11}, \tag{3.56}$$

where the equilibrium constant for the binding of the repressor to the auxiliary operator is expressed relative to that for the main operator site. The auxiliary operator is typically weaker than the main operator, i.e., $\sigma < 1$. The equilibrium concentrations of the states $O_{01}$, $O_{10}$ and $O_{00}$ are accordingly given by:

$$[O_{01}] = \frac{[O_{11}]}{K_3[X_4]}, \qquad [O_{10}] = \frac{[O_{11}]}{\sigma K_3[X_4]}, \qquad [O_{00}] = \frac{[O_{10}]}{K_3[X_4]} = \frac{[O_{11}]}{\sigma K_3^2[X_4]^2}. \tag{3.57}$$
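The independent-binding scheme can be written compactly as a set of statistical weights relative to the empty state, with u = K3[X4] used here as shorthand. The sketch below is a minimal illustration with hypothetical names and values. The optional `k_loop` argument is an assumption that anticipates the looped state C treated next, weighted as K_L·u because the loop forms from the state with the main operator occupied; with the default of zero only the four unlooped states contribute.

```python
def state_weights(u, sigma, k_loop=0.0):
    """Statistical weights relative to the empty state O00, with u = K3 [X4]."""
    return {
        "O00": 1.0,
        "O10": u,                # repressor on the main operator
        "O01": sigma * u,        # repressor on the auxiliary operator
        "O11": sigma * u * u,    # both operators occupied
        "C": k_loop * u,         # looped state (assumed weight K_L * u)
    }


def main_operator_occupancy(u, sigma, k_loop=0.0):
    """Fraction of states in which the main operator is occupied."""
    w = state_weights(u, sigma, k_loop)
    return (w["O10"] + w["O11"] + w["C"]) / sum(w.values())


u = 2.0  # hypothetical value of K3 [X4]
for sigma in (0.01, 0.5, 1.0):
    print(sigma, main_operator_occupancy(u, sigma))  # same value for every sigma
print(main_operator_occupancy(u, 0.01, k_loop=10.0))
```

Without looping the occupancy of the main operator equals u/(1 + u) whatever the value of σ, as expected for independent sites; a nonzero loop weight raises the occupancy at the same repressor concentration.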

The formation of a DNA loop $C$ can occur through two reaction paths: a repressor molecule bound to the main operator can make contact with the auxiliary operator, or vice versa. The corresponding reactions are:

$$O_{10} \overset{K_L}{\rightleftharpoons} C, \qquad O_{01} \overset{K_L'}{\rightleftharpoons} C. \tag{3.58}$$

The equilibrium concentrations of $O_{10}$ and of $O_{01}$ in Eq. 3.57 can be used to derive two different expressions for the equilibrium concentration of $C$ in terms of $[O_{11}]$. One involves the equilibrium constant $K_L$, the other the equilibrium constant $K_L'$. These expressions are given by:

$$K_L = \frac{[C]}{[O_{10}]} \;\Rightarrow\; [C] = K_L[O_{10}] = \frac{K_L[O_{11}]}{\sigma K_3[X_4]}, \tag{3.59}$$

$$K_L' = \frac{[C]}{[O_{01}]} \;\Rightarrow\; [C] = K_L'[O_{01}] = \frac{K_L'[O_{11}]}{K_3[X_4]}. \tag{3.60}$$

Since the equilibrium concentration is independent of the chosen path, this implies that $K_L = \sigma K_L'$. It is not necessary to impose this constraint explicitly (since the system is overdetermined). The relative occupancy of the main operator can be obtained from the definitions of the equilibrium constants. The total concentration of DNA molecules that carry the regulatory region is given by:

$$[O_T] = [O_{00}] + [O_{10}] + [O_{01}] + [O_{11}] + [C], \tag{3.61}$$

which, based on the equilibrium concentrations, can be used to express $[O_{11}]$ as a function of $[X_4]$ and $[O_T]$:

$$[O_{11}] = \frac{\sigma K_3^2[X_4]^2\,[O_T]}{1 + (1 + \sigma + K_L)K_3[X_4] + \sigma K_3^2[X_4]^2}. \tag{3.62}$$

The equilibrium concentrations of $O_{10}$ and $C$ can now be used to obtain the total concentration $[R_T]$ of states in which the main operator is occupied and the promoter is repressed:

$$[R_T] = [O_{11}] + [O_{10}] + [C] = [O_{11}]\,\frac{\sigma K_3[X_4] + 1 + K_L}{\sigma K_3[X_4]} = [O_T]\,\frac{(1 + K_L)K_3[X_4] + \sigma K_3^2[X_4]^2}{1 + (1 + \sigma + K_L)K_3[X_4] + \sigma K_3^2[X_4]^2}. \tag{3.63}$$

The response function $f([X]) = [R_T]/[O_T]$ is then obtained from the equilibrium concentration of $X_4$ ($[X_4] = K_1^2K_2[X]^4$) and the definition of $K_4$ ($K_4 = K_1^2K_2K_3$) as:

$$f(x) = \frac{(1 + K_L + \sigma x^4)\,x^4}{(1 + \sigma x^4) + (1 + K_L + \sigma x^4)\,x^4}, \tag{3.64}$$

where $x$ is the normalized concentration of repressor monomers defined by $x = \sqrt[4]{K_4}\,[X]$.

Figure 3.11: (A) Response functions in the absence (green curve) and presence (blue and red curves) of DNA looping. Parameter values: $K_L = 0$ (green curve), $K_L = 10$, $\sigma_B = 0$ (blue curve), $K_L = 10$, $\sigma_B = 1$ (red curve). (B) Hill plots demonstrating that the Hill coefficients are not affected by DNA looping. The slopes of the straight lines for $K_L = 10$ (blue) and $K_L = 0$ (green) are identical and equal to 4. Increased affinity for the auxiliary operator, i.e., increased σ, decreases the Hill coefficient. The value of $n_H$ is 3.3 for $\sigma_B = 1$, corresponding to equal binding affinities of O1 and O2.

There are some interesting observations that can be made based on the response function in Eq. 3.64. First of all, the response function:

$$f(x) = \frac{x^4}{1 + x^4}, \tag{3.65}$$

associated with the binding to a single operator (Eq. 3.39), is recovered when the DNA loop is unable to form, i.e., when $K_L = 0$ (the terms $1 + \sigma x^4$ and $1 + K_L + \sigma x^4$ in Eq. 3.64 cancel each other when $K_L = 0$). This is due to the independent binding of the repressor to the two operator sites, i.e., it is due to the lack of cooperativity in the system. Figure 3.11A shows plots of $f(x)$ obtained for different values of σ and $K_L$. From these plots it appears as if the response becomes more nonlinear as $K_L$ increases. However, this is a deception introduced by a shift of the monomer concentration $x_{0.5}$ that gives 50% saturation to a lower value. This becomes apparent when Hill plots are constructed with the data points also used to draw the curves in Fig. 3.11A. As seen in Fig. 3.11B, the curves are linear in the Hill plot and have identical slopes but different intercepts. This can also be obtained directly from Eq. 3.64. When σ is small and the term $\sigma x^4$ is negligible, the response function can be rearranged into a form that coincides with the standard form of the Hill curve (Eq. 3.21):

$$f(x) \approx \frac{(1 + K_L)\,x^4}{1 + (1 + K_L)\,x^4} = \frac{x^4}{K_H + x^4}, \tag{3.66}$$

where $K_H = (1 + K_L)^{-1}$. In other words, the Hill constant (and the value of $x_{0.5}$) decreases as $K_L$ increases. The Hill coefficient is independent of $K_L$. The only significant deviation from the standard form of the Hill curve occurs when σ is relatively large and the binding affinity for the auxiliary operator is close to that of the main operator. In this case, however, the Hill coefficient is decreased (Fig. 3.11B). The discussion above has demonstrated that DNA looping does not increase the Hill coefficient, which is the usual measure of cooperativity. Rather, the formation of the DNA loop state $C$ causes a shift in the equilibrium distribution of the different repressor-operator states and results in a decreased value of the Hill constant. Because an increase in the concentration of $C$ necessarily is associated with a decrease in the concentration of all of the other states, the ability to form a DNA loop shifts the equilibrium distribution toward the states in which the main operator is occupied by a repressor. As a result, the higher the stability of the DNA loop complex, i.e., the higher the value of $K_L$, the lower is the repressor concentration that is required to achieve the same level of occupancy of the main operator. This has in the literature been explained in terms of a local increase in the concentration of the repressor. Indeed, Eq. 3.65 is recovered from Eq. 3.66 when introducing the rescaling $y = \sqrt[4]{1 + K_L}\,x$, where $y$ is the local or effective repressor concentration. The effective concentration is always greater than the actual concentration if the loop structure is able to form, i.e., when $K_L > 0$. An alternative treatment of DNA looping is given by Vilar & Leibler.

3.4 Models of Gene Regulatory Systems

3.4.1 The Lactose Operon in E. coli

The $P_{lac}$ promoter of the lactose operon in E. coli depends on two distinct input signals: the activity of the CAP transcriptional activator and the activity of the LacR transcriptional repressor. Both CAP and LacR are naturally present in E. coli and their activity is regulated by inducers.
CAP is activated in the presence of cAMP and LacR is inhibited by allolactose (or by the artificial inducer IPTG). High cAMP usually signals the absence of glucose, and the purpose of CAP is to activate genes that are required for the bacterium to utilize alternative sources of energy. On the other hand, the purpose of LacR is to prevent the expression of the lactose operon genes when lactose is not available. The $P_{lac}$ promoter thus has the computational logic of an AND NOT gate. Expression of the lactose operon genes is high in the presence of lactose AND NOT glucose. Since intracellular cAMP concentrations are inversely correlated with the concentration of glucose, the $P_{lac}$ promoter has computational logic corresponding to an AND gate in terms of the signals cAMP and IPTG. Expression is high when the concentrations of IPTG AND of cAMP are high. A recent study by Setty et al. investigated the AND gate operation of the $P_{lac}$ promoter at different extracellular concentrations of cAMP and IPTG. They measured transcription of the gfp gene encoding green fluorescent protein (GFP) from a $P_{lac}$ promoter and correlated the population-averaged fluorescence with a model of $P_{lac}$ cis-regulatory dynamics. The $P_{lac}$ promoter activity was measured using an engineered plasmid system. A


232 bp region that includes most of the cis-regulatory elements of the wild-type lactose operon and extends 130 bp into the lacZ gene was used.

Figure 3.12: (A) Experimental measurements of $P_{lac}$ activity under 96 different combinations of inducer concentrations. (B) Experimental data points presented as a smoothed surface. Modified from Setty et al. without permission © The National Academy of Sciences.

Recall (section 1.4.1) that the main lacO operator O1 is centered at position +9. It is therefore important that some part of the lacZ gene (and its 5′ UTR) is included. The 232 bp fragment, which lacks the auxiliary O2 operator, was fused to the gfpmut2 gene that encodes a variant green fluorescent protein. This artificial reporter system was transformed into E. coli using a low-copy plasmid carrying the SC101 origin of replication and the gene that endows kanamycin resistance to the host cell. The experiments were carried out using a 96-well microplate containing all possible combinations of 8 and 12 different concentrations of IPTG and cAMP. Microplate measurements are a convenient and rapid way to obtain large amounts of data, as they allow simultaneous detection of population-averaged expression in a variety of conditions and over an extended period of time. The measurement time is limited by the time it takes for cells to reach a critical density where they stop dividing at regular intervals and enter a stationary growth phase. Cell density is usually measured as the absorbance of light at 600 nm as it passes through a cell suspension 10 mm in depth and is reported in units of optical density (OD). In the Setty et al. experiment, GFP fluorescence at 535 nm was measured over two cell division cycles, i.e., the time it takes the cell density to increase by a factor of four, during mid-exponential growth. The promoter activity was determined by the change in GFP fluorescence normalized by the cell density, $(d[\mathrm{GFP}]/dt)/\mathrm{OD}_{600}$. Expression activity from the $P_{lac}$ promoter measured for 96 different combinations of inducer concentrations is shown in Fig. 3.12A. In Fig. 3.12B the experimental data are represented as a smoothed surface. This representation allows for the identification of four distinct regions where the promoter activity is roughly the same. The promoter activity is low at low concentrations of IPTG and cAMP (plateau I) and high when the concentrations of both inducers are high (plateau II). This is what is expected from

the AND operation of the $P_{lac}$ promoter. The presence of two additional plateaus (III and IV) demonstrates that the $P_{lac}$ promoter does not operate as a perfect AND gate. At high concentrations of IPTG and low concentrations of cAMP, the promoter activity reaches nearly 50% of its maximal value (plateau III). This considerable level of expression in the absence of cAMP indicates that the CAP transcriptional activator is not required for transcription of the lac operon genes, i.e., that the expression is leaky (see section 3.3.4). A significant increase in promoter activity (to about 20% of maximal) is also observed when the concentration of cAMP is high and the concentration of IPTG is low (plateau IV). There are several possible explanations for the presence of this plateau. Perhaps the presence of CAP increases the rate of transcription when the polymerase and the repressor are bound at the same time. It is also possible that CAP and LacR are mutually exclusive such that the occupancy of the O1 operator decreases at increasing concentrations of cAMP. A third possibility is that the polymerase and LacR are mutually exclusive. Since CAP-cAMP may increase the affinity of the polymerase for the promoter, an increased cAMP concentration could shift the equilibrium distribution toward the state where the CAP site and the promoter are occupied and O1 is unoccupied. The cis-regulatory dynamics of the $P_{lac}$-gfp fusion system involves four different control elements: the $P_{lac}$ promoter where the RNA polymerase binds, the O1 and O3 operators where LacR binds, and the CAP site where the activated CAP-cAMP binds. The third LacR operator O2 lies within the part of the lacZ gene that is replaced by the gfp gene. The cis-regulatory state can thus be described by a binary string of length four where each entry is one or zero depending on the occupancy of the corresponding site. This gives a total of $2^4 = 16$ configurations.
In addition to these states, the LacR tetramer can facilitate the formation of a DNA loop by simultaneously binding to O1 and O3. This gives rise to an additional four states, as CAP and the polymerase are in principle able to bind to their respective sites even when a DNA loop is formed. While the probability that CAP-cAMP and the polymerase actually bind to the DNA in its looped conformation may be very low, these additional states might be considered in a comprehensive analysis.

The number of configurations of the $P_{lac}$ regulatory region can be reduced when it is assumed that the binding of LacR to O3 has only a marginal effect on the occupancy of the promoter. The justification for such an approximation is that DNA looping can be accounted for by rescaling the effective LacR concentration (see section 3.3.5). Moreover, the O3 site is located at position -83 and might not interfere directly with the binding of the polymerase holoenzyme to the promoter. Interference with CAP is a possibility that cannot be ruled out. However, the O3 site has a binding affinity that is significantly lower than that of the O1 site, and O3 will probably only be occupied to a significant extent at very high concentrations of LacR. With these assumptions, the configuration of the cis-regulatory region can be described by a binary vector of length three. Figure 3.13 illustrates the 8 possible configurations of the cis-regulatory region and the transitions between them. The different states are symbolized by the variables $O_{ijk}$, where $i$, $j$ and $k$ are either zero or one and denote the occupancy of the CAP site, the O1 operator and the promoter, respectively.

Figure 3.13: Model of the cis-regulatory dynamics of the $P_{lac}$ promoter. The index of $O_{ijk}$ gives the occupancy of the CAP site ($i$), the O1 operator ($j$) and the promoter ($k$). Black arrows indicate reversible reactions. Grey arrows indicate the return to configuration $O_{ij0}$ after transcription initiation from configuration $O_{ij1}$.

Transcription of the gfp gene can be initiated from any one of the configurations where the promoter is occupied, i.e., from the states $O_{ij1}$ with $i = 0, 1$ and $j = 0, 1$. This gives rise to basal or leaky transcription in the absence of an activator or at saturating concentrations of a repressor. In an ideal switch, there would only be transcription from the configuration $O_{101}$ because the polymerase would be unable to occupy the promoter in the absence of the activator or when the repressor is bound. In other words, the configurations $O_{001}$, $O_{011}$ and $O_{111}$ would not exist. The observation of basal transcription indicates that (1) binding of the polymerase to the promoter can occur without the assistance of the activator and (2) the RNA polymerase can bind to the promoter and initiate transcription even when the repressor is bound to its operator.¹

The rate constant associated with transcription initiation from the $O_{ij1}$ configuration is denoted by $\alpha_{ij}\kappa_M$, where $\kappa_M$ is the maximal rate of transcription. This is done because the rate of transcription initiation might be affected by the presence of CAP-cAMP or LacR. CAP-cAMP could in principle affect the rate of open complex formation or the transition to an elongating complex, and LacR could in principle repress transcription by preventing transcription elongation and/or the formation of the open complex.
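The bookkeeping of the reduced three-site description can be illustrated with a short Python sketch. It enumerates the eight configurations $O_{ijk}$, marks those from which transcription can be initiated, and separates the single configuration of an ideal switch from the leaky ones. The function and variable names are illustrative choices, not part of the original model description.

```python
from itertools import product

# Index convention of O_ijk: i = CAP site, j = O1 operator, k = promoter
# (1 = occupied, 0 = free).
SITES = ("CAP", "O1", "promoter")

def enumerate_states():
    """Return all 2**3 occupancy tuples (i, j, k)."""
    return list(product((0, 1), repeat=len(SITES)))

def can_transcribe(state):
    """Transcription can only start when the promoter is occupied (k = 1)."""
    return state[2] == 1

def is_ideal_on_state(state):
    """An ideal AND gate transcribes only from O_101."""
    return state == (1, 0, 1)

states = enumerate_states()
transcribing = [s for s in states if can_transcribe(s)]
leaky = [s for s in transcribing if not is_ideal_on_state(s)]

for s in states:
    label = "O_" + "".join(str(bit) for bit in s)
    print(label, "transcribing" if can_transcribe(s) else "silent")
```

The list `leaky` contains exactly the configurations $O_{001}$, $O_{011}$ and $O_{111}$ that would be absent in an ideal switch.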
It is expected that the rate of transcription is maximal from the $O_{101}$ configuration and that this state is associated with the maximal rate $\kappa_M$, i.e., $\alpha_{10} = 1$.

¹ The model presented here is slightly different from that presented by Setty et al., which assumes that leaky transcription occurs from three configurations S, SC and SR, corresponding to $O_{000}$, $O_{100}$ and $O_{010}$ in the present representation, respectively. Moreover, the states $O_{011}$, $O_{101}$ and $O_{111}$ are not present in the Setty et al. model.

In order to calculate the relative occupancy of the different configurations in the quasi-steady state, it is only necessary to consider a total of seven reactions (since the system is overdetermined). Three of these reactions are equilibrium reactions while the remaining four are Michaelis-Menten reactions. These reactions are shown in Fig. 3.13, where A, R and P are used to denote CAP-cAMP, LacR and the RNA polymerase, respectively. The equilibrium constants $K_A$, $K_R$ and $K_{AR}$ for the reversible binding of A and R are defined by:

$$K_A = \frac{[O_{100}]}{[O_{000}][A]}, \qquad K_R = \frac{[O_{010}]}{[O_{000}][R]}, \qquad K_{AR} = \frac{[O_{110}]}{[O_{010}][A]} \equiv \gamma K_A, \tag{3.67}$$

where the subscript eq has been omitted for clarity and $\gamma$ is defined by $K_{AR}/K_A$. The model thus assumes (Occam's razor) that the binding of CAP dimers and LacR tetramers to their respective binding sites is appropriately described by second-order elementary reactions. The constants $K_{ij}$ associated with the Michaelis-Menten type reactions are defined by:

$$K_{ij} = \frac{k_{f,ij}}{k_{b,ij} + \alpha_{ij}\kappa_M} = \frac{[O_{ij1}]}{[O_{ij0}][P]} \equiv \sigma_{ij} K_{10}, \tag{3.68}$$

where $\sigma_{ij}$ is defined by $K_{ij}/K_{10}$.

The concentrations of the CAP-cAMP transcriptional activator A and the LacR transcriptional repressor R depend on the concentrations of the inducers cAMP and IPTG. This dependency is approximated by Hill-type functions:

$$A = \frac{[A]}{[A_T]} = \frac{[\mathrm{cAMP}]^n}{K_{cAMP}^n + [\mathrm{cAMP}]^n}, \qquad R = \frac{[R]}{[R_T]} = \frac{K_{IPTG}^m}{K_{IPTG}^m + [\mathrm{IPTG}]^m}, \tag{3.69}$$

where $[A_T]$ and $[R_T]$ are the total concentrations of CAP dimers and LacR tetramers, respectively, while $K_{cAMP}$ and $K_{IPTG}$ are the extracellular concentrations of cAMP and IPTG that give 50% activity of CAP ($A = 0.5$) and LacR ($R = 0.5$), respectively. This approximation ignores all the molecular details of how the intracellular concentrations of cAMP and IPTG are regulated. In this context, it is worth mentioning recent studies by Yildirim and Mackey and by Vilar et al., who investigated models of the feedback mechanism that governs the LacY-mediated import of lactose, its conversion by LacZ into allolactose and the subsequent up-regulation of lacZ and lacY expression through the suppression of LacR activity by allolactose.

In the quasi-stationary state, the relative occupancies of the different cis-regulatory configurations can be expressed in terms of the dimensionless input signals $A = [A]/[A_T]$ and $R = [R]/[R_T]$ by using the definitions of the equilibrium constants in Eqs. 3.67 and 3.68:

$$\begin{aligned}
[O_{100}] &= a[O]A, & [O_{010}] &= b[O]R, & [O_{110}] &= ab\gamma[O]AR, \\
[O_{001}] &= \sigma_{00}c[O], & [O_{101}] &= ca[O]A, & [O_{011}] &= \sigma_{01}cb[O]R, \\
[O_{111}] &= \sigma_{11}cab\gamma[O]AR,
\end{aligned} \tag{3.70}$$

Figure 3.14: Promoter activity predicted by the model (Eq. 3.72 with $\gamma = 0$) after a fit to the experimental data. Best-fit parameters are from Setty et al.

where $a = K_A[A_T]$, $b = K_R[R_T]$, $c = K_{10}[P]$ and $[O] = [O_{000}]$. Introducing the new parameters $\beta_{ij} = \alpha_{ij}\sigma_{ij}$, the rate of transcription (relative to the maximal rate $\kappa_M$) can be obtained as:

$$f(A, R) = \frac{\sum_{i,j}\alpha_{ij}[O_{ij1}]}{\sum_{i,j,k}[O_{ijk}]} = \frac{\beta_{00}c + caA + \beta_{01}cbR + \beta_{11}cab\gamma AR}{1 + \sigma_{00}c + (1 + c)aA + (1 + \sigma_{01}c)bR + (1 + \sigma_{11}c)ab\gamma AR}. \tag{3.71}$$

By consolidating the various biological parameters into seven parameters $V_1, \ldots, V_7$, the promoter activity can be expressed as:

$$f(A, R) = V_1\,\frac{1 + V_2 A + V_3 R + \gamma V_6 AR}{1 + V_4 A + V_5 R + \gamma V_7 AR}, \tag{3.72}$$

where $V_1 = \beta_{00}c/(1 + \sigma_{00}c)$, $V_2 = a/\beta_{00}$, $V_3 = \beta_{01}b/\beta_{00}$, $V_4 = (1 + c)a/(1 + \sigma_{00}c)$, $V_5 = (1 + \sigma_{01}c)b/(1 + \sigma_{00}c)$, $V_6 = \beta_{11}ab/\beta_{00}$ and $V_7 = (1 + \sigma_{11}c)ab/(1 + \sigma_{00}c)$. Despite the differences in the derivation, Eq. 3.72 coincides with that presented by Setty et al. when $\gamma = 0$. Non-zero values of $\gamma$ do not change the qualitative shape of the surface $f(A, R)$.

The model in Eq. 3.72 is fully capable of capturing the details of the corresponding experimental plot, including the plateaus observed in the absence of cAMP and at high concentration of IPTG. This is illustrated in Fig. 3.14, which shows the result of a fit to the experimental data using ten model parameters (with $\gamma = 0$). The performance of the AND gate, i.e., suppression of the expression plateaus observed in the absence of IPTG or cAMP, could be enhanced, for instance, by increasing the equilibrium constant $K_R$ for the repressor binding. This would increase the parameters $V_3$ and $V_5$ and shift the off-diagonal expression plateaus III and IV to higher concentrations of IPTG and cAMP, respectively. The complete elimination of the plateaus would require that the binding of the polymerase can only occur when CAP-cAMP is bound to its binding site, i.e., $\sigma_{00} = \beta_{00} = 0$, and that the repressor and the polymerase are mutually exclusive, i.e., $\sigma_{i1} = \beta_{i1} = 0$ for $i = 0, 1$.

3.4.2 The Galactose Regulon in S. cerevisiae

The expression of GFP from the modified, TetR-repressible Gal1 promoter discussed in section 2.2.2 can be modeled using the same set of equations that was used to model the $P_{lac}$ promoter in section 3.4.1. However, due to the low basal expression in the absence of galactose or ATc, it is not necessary to include the states $O_{001}$, $O_{011}$ and $O_{111}$. A minimal model of cis-regulation of the TetR-repressible Gal1 promoter thus incorporates five distinct configurations (Fig. 3.15).

Figure 3.15: Model of the cis-regulatory configurations and transcription initiation from the Gal1 promoter. ATc is assumed to alter the effective rate of TetR binding, $k'_{Rf} = k_{Rf}[R]$, while galactose is assumed to affect the effective rate of activation $k_{Af}$ and/or deactivation $k_{Ab}$.

It is noted that this model is more abstract than the model of the $P_{lac}$ promoter, where each cis-regulatory state corresponds to the occupancy of a single DNA binding site. The Gal1 promoter contains multiple binding sites, and the five different configurations correspond to the state where inactive Gal4 is bound to the UAS$_G$ ($O_{000}$), an intermediate state where activated Gal4 has recruited the transcriptional activators SAGA, Mediator and TBP/TFIID ($O_{100}$), the pre-initiation complex where the polymerase holoenzyme is bound ($O_{101}$), and the repressed states where TetR is bound in the presence ($O_{110}$) and absence ($O_{010}$) of SAGA, Mediator and TBP. The actual assembly of the pre-initiation complex is far more complicated than depicted in Fig. 3.15 (see section 1.3.2), and additional steps, for instance the independent recruitment of SAGA, Mediator and TBP/TFIID, could be incorporated in a more comprehensive analysis.
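How the promoter activity follows from the state occupancies of Eq. 3.70, and how removing the polymerase-bound states $O_{001}$, $O_{011}$ and $O_{111}$ turns the eight-state scheme into the five-state Gal1 scheme, can be sketched in Python. This is a minimal sketch under stated assumptions: the values $a = 2$ and $c = 4.5$ are the ones quoted for the Gal1 fit in this section, whereas `b`, `gamma` and the "leaky" $\sigma_{ij}$ and $\alpha_{ij}$ values are hypothetical numbers chosen only for illustration.

```python
# (i, j) = (activator site, repressor site) of the configurations O_ij0 / O_ij1.
PAIRS = [(0, 0), (1, 0), (0, 1), (1, 1)]

def state_weights(A, R, a, b, c, gamma, sigma):
    """Occupancies [O_ijk]/[O_000] following Eq. 3.70; sigma[(i, j)] = K_ij/K_10."""
    w = {
        (0, 0, 0): 1.0,
        (1, 0, 0): a * A,
        (0, 1, 0): b * R,
        (1, 1, 0): a * b * gamma * A * R,
    }
    # Polymerase-bound state O_ij1 = sigma_ij * c * O_ij0.
    for i, j in PAIRS:
        w[(i, j, 1)] = sigma[(i, j)] * c * w[(i, j, 0)]
    return w

def promoter_activity(A, R, a, b, c, gamma, sigma, alpha):
    """Initiation rate relative to kappa_M: sum(alpha_ij * O_ij1) / sum(O_ijk)."""
    w = state_weights(A, R, a, b, c, gamma, sigma)
    initiating = sum(alpha[p] * w[p + (1,)] for p in PAIRS)
    return initiating / sum(w.values())

# Five-state scheme: only O_101 exists among the polymerase-bound states.
FIVE_STATE = {(0, 0): 0.0, (1, 0): 1.0, (0, 1): 0.0, (1, 1): 0.0}
# Hypothetical leaky values for an eight-state scheme (illustration only).
LEAKY_SIGMA = {(0, 0): 0.1, (1, 0): 1.0, (0, 1): 0.05, (1, 1): 0.2}
LEAKY_ALPHA = {(0, 0): 0.5, (1, 0): 1.0, (0, 1): 0.3, (1, 1): 0.6}

params = dict(a=2.0, b=10.0, c=4.5, gamma=0.5)  # b and gamma are hypothetical

f_on = promoter_activity(1.0, 0.0, sigma=FIVE_STATE, alpha=FIVE_STATE, **params)
f_off = promoter_activity(0.0, 1.0, sigma=FIVE_STATE, alpha=FIVE_STATE, **params)
f_leak = promoter_activity(0.0, 1.0, sigma=LEAKY_SIGMA, alpha=LEAKY_ALPHA, **params)
print(f_on, f_off, f_leak)
```

With the five-state dictionaries the activity vanishes whenever the activator input is zero, while the leaky dictionaries give a non-zero basal activity even at full repression. Computing the activity from the state weights rather than from a closed-form expression keeps the sketch valid for any choice of $\sigma_{ij}$ and $\alpha_{ij}$.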

Since the transitions between the different cis-regulatory configurations for the modified Gal1 promoter follow the general reaction scheme in Fig. 3.13, the rate of transcription can be obtained directly from Eq. 3.71 by setting $\beta_{ij} = \sigma_{ij} = 0$ for all configurations other than $O_{101}$ (for which $\beta_{10} = \sigma_{10} = 1$):

$$f(A, R) = \frac{caA}{1 + a(1 + c)A + ab\gamma AR + bR}, \tag{3.73}$$

where $A$ and $R$ give the relative levels of activation and repression set by galactose and ATc, respectively. The input $A$ changes from zero (inactive) to one (full induction) as the extracellular concentration of galactose changes from 0 to 2% w/v, while $R$ changes from one (full repression) to zero (full induction) as the extracellular concentration of ATc changes from 0 to 500 ng/ml. Recall that the steady-state number of proteins per cell can be approximated by Eq. […] and is proportional to $f(A, R)$. Therefore, the relative GFP fluorescence signal measured from single cells by flow cytometry can be correlated directly with the normalized response $r(A, R)$ obtained by dividing Eq. 3.73 by the maximal value $f_{\max} = f(A = 1, R = 0) = ca/(1 + a + ac)$:

$$r(A, R) = \frac{(1 + a + ac)A}{1 + a(1 + c)A + \gamma abAR + bR}. \tag{3.74}$$

In order to fit the response function $r(A, R)$ to experimental data, it is necessary to specify how the relative activity levels $A$ and $R$ depend on the extracellular concentrations of galactose, $c_{gal}$, and ATc, $c_{ATc}$, respectively. The dependence between the activity of Gal4, denoted by $a_{gal4}$, and the extracellular concentration of galactose is expected to be quite complicated, as galactose import is highly regulated and its presence is signaled through a series of protein-protein interactions (between Gal3, Gal80 and Gal4). As a first approximation, it is assumed that the dependence can be captured by a Hill-type function:

$$A(c_{gal}) = \frac{a_{gal4}}{a_{gal4}^{\max}} = \frac{c_{gal}^n}{K_{gal} + c_{gal}^n}, \tag{3.75}$$

where $a_{gal4}^{\max}$ denotes maximal activity, $K_{gal}$ is the Hill constant and $n$ is the Hill coefficient.
It is further assumed that the activation step from $O_{0j0}$ to $O_{1j0}$ behaves as if it were a second-order binding reaction. This assumption is not critical and is invoked for simplicity. When the concentration of galactose is varied at saturating amounts of ATc, i.e., $R = 0$, Eq. 3.74 becomes:

$$r(A, R = 0) = \frac{(1 + a + ac)A}{1 + a(1 + c)A}, \tag{3.76}$$

which can be rewritten in the form of a standard Hill curve:

$$r(c_{gal}) = \frac{c_{gal}^n}{K_{H1} + c_{gal}^n}, \qquad K_{H1} = \frac{K_{gal}}{1 + a + ac}. \tag{3.77}$$
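The reduction from Eq. 3.76 to the Hill curve of Eq. 3.77 can be checked numerically. The sketch below composes the Hill-type input of Eq. 3.75 with the response function of Eq. 3.74 at $R = 0$ and compares the result with Eq. 3.77. It uses the fit values quoted in this section for galactose induction ($K_{gal} = 0.7$ (% w/v)$^2$, $a = 2$, $c = 4.5$, $n = 2$); `B_PAR` and `GAMMA` are placeholders, since both drop out when $R = 0$.

```python
import math

def gal4_input(c_gal, K_gal, n):
    """Relative Gal4 activity A(c_gal), Eq. 3.75."""
    x = c_gal ** n
    return x / (K_gal + x)

def response(A, R, a, b, c, gamma):
    """Normalized response r(A, R), Eq. 3.74."""
    top = (1 + a + a * c) * A
    bottom = 1 + a * (1 + c) * A + gamma * a * b * A * R + b * R
    return top / bottom

def hill(c_gal, K_H1, n):
    """Standard Hill curve, Eq. 3.77."""
    x = c_gal ** n
    return x / (K_H1 + x)

K_GAL, A_PAR, C_PAR, N = 0.7, 2.0, 4.5, 2.0   # fit values quoted in the text
B_PAR, GAMMA = 1.0, 0.0                        # placeholders, irrelevant at R = 0

K_H1 = K_GAL / (1 + A_PAR + A_PAR * C_PAR)     # Hill constant of Eq. 3.77
half_max = K_H1 ** (1 / N)                     # galactose level giving r = 0.5

for c_gal in (0.05, 0.24, 1.0, 2.0):           # % w/v
    full = response(gal4_input(c_gal, K_GAL, N), 0.0, A_PAR, B_PAR, C_PAR, GAMMA)
    assert math.isclose(full, hill(c_gal, K_H1, N))

print(round(K_H1, 4), round(half_max, 3))
```

The computed Hill constant is close to 0.06 (% w/v)$^2$ and the half-maximal response lies near 0.24% w/v galactose, in line with the values reported for the fit.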

Using the Hill plot method, the experimental data obtained in section 2.2.2 for galactose induction at 500 ng/ml ATc reveal that the Hill coefficient in Eq. 3.77 is 2.0 and the Hill constant $K_{H1}$ is equal to 0.06 (% w/v)$^2$. This corresponds to 50% transcriptional efficiency at 0.24% w/v galactose. Figure 3.16A shows the good agreement between the experimental data points and Eq. 3.77 when $K_{H1}$ is set to 0.06 by choosing $K_{gal} = 0.7$ (% w/v)$^2$, $a = 2$ and $c = 4.5$. There are, of course, many other combinations of the parameters that can give rise to this particular value of $K_{H1}$. Additional experiments are required to determine the individual parameters.

The binding of TetR dimers (R) to the tetO operators can be described as a single reaction step:

$$O_{i00} \underset{k_{Rb}}{\overset{k'_{Rf}}{\rightleftharpoons}} O_{i10}, \tag{3.78}$$

where the pseudo-first-order rate constant $k'_{Rf}$ is a function of the number $n_R$ of active repressor dimers per cell. In the framework of generalized mass action (GMA), the value of $k'_{Rf}$ is given by $k'_{Rf} = k_{Rf}\,n_R^{m_1}$, where $m_1$ is the GMA exponent for the TetR dimer binding reaction. Note that this model differs from that for the $P_{lac}$ system, where it was assumed that the binding of activators and repressors to the DNA obeys second-order reaction kinetics ($m_1 = 1$).

The presence of ATc causes a titration of active TetR dimers and the formation of an inert form (T) that has a significantly reduced affinity for the tetO operators. The equilibrium constant $K_{ATc}$ for this reaction is, in the framework of GMA, given by:

$$K_{ATc} = \frac{n_T}{n_R\,c_{ATc}^{m_2}}, \tag{3.79}$$

where $n_T$ is the number of inactive TetR repressors and $m_2$ is the Hill coefficient associated with the binding of ATc to the TetR dimers. The number of dimers in the active conformation is thus given by:

$$n_R = \frac{n_{tot}}{1 + K_{ATc}\,c_{ATc}^{m_2}}, \tag{3.80}$$

where $n_{tot}$ is the total number of TetR dimers per cell.
With these approximations, the pseudo-first-order rate constant $k'_{Rf}$ is given by:

$$k'_{Rf} = \frac{k_{Rf}\,n_{tot}^{m_1}}{\left(1 + K_{ATc}\,c_{ATc}^{m_2}\right)^{m_1}}, \tag{3.81}$$

and $R$ can be expressed as:

$$R(c_{ATc}) = \left(1 + K_{ATc}\,c_{ATc}^{m_2}\right)^{-m_1}, \tag{3.82}$$

when the dimensionless equilibrium constant $b$ is redefined as $b = k_{Rf}\,n_{tot}^{m_1}/k_{Rb}$.
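The chain from ATc titration to the repression input can be sketched in Python as follows. The exponents $m_1 = 4$ and $m_2 = 2$ are the values quoted in this section for the ATc fit; `K_ATC`, `N_TOT` and `K_RF` are hypothetical numbers chosen only so that the example runs, and the function names are illustrative.

```python
import math

def active_dimers(c_atc, n_tot, K_atc, m2):
    """Number of active TetR dimers n_R after titration by ATc (Eq. 3.80)."""
    return n_tot / (1 + K_atc * c_atc ** m2)

def binding_rate(c_atc, k_rf, n_tot, K_atc, m1, m2):
    """Pseudo-first-order rate constant k'_Rf = k_Rf * n_R**m1 (Eq. 3.81)."""
    return k_rf * active_dimers(c_atc, n_tot, K_atc, m2) ** m1

def repression_input(c_atc, K_atc, m1, m2):
    """Dimensionless repression input R(c_ATc) (Eq. 3.82)."""
    return (1 + K_atc * c_atc ** m2) ** (-m1)

M1, M2 = 4, 2          # exponents quoted in the text
K_ATC = 1.0e-3         # hypothetical, in (ng/ml)**-2
N_TOT = 100.0          # hypothetical number of TetR dimers per cell
K_RF = 1.0             # hypothetical rate constant

for c_atc in (0.0, 10.0, 34.0, 100.0, 500.0):   # ng/ml
    R = repression_input(c_atc, K_ATC, M1, M2)
    # R is the binding rate relative to its value without ATc.
    ratio = binding_rate(c_atc, K_RF, N_TOT, K_ATC, M1, M2) / binding_rate(
        0.0, K_RF, N_TOT, K_ATC, M1, M2
    )
    assert math.isclose(R, ratio)
    print(c_atc, R)
```

The input $R$ equals one in the absence of ATc and decreases monotonically toward zero as ATc is added, which matches the range assumed for $R$ in the response function.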

Figure 3.16: (A) Comparison of fitted induction curves (broken lines) with experimental data for induction of transcription from the modified Gal1 promoter with galactose (blue points) or ATc (red points). (B) Response function $r(A, R)$ predicted from the fit to experimental data for $A = 1$ (ATc induction) and $R = 0$ (galactose induction).

For full galactose induction, $A = 1$, the response function Eq. 3.74 is given by:

$$r(A = 1, R) = \frac{1 + a + ac}{1 + a + ac + b(1 + \gamma a)R}, \tag{3.83}$$

which, following insertion of Eq. 3.82 and rearrangement, becomes:

$$r(c_{ATc}) = \frac{\left(1 + K_{ATc}\,c_{ATc}^{m_2}\right)^{m_1}}{K_{H2} + \left(1 + K_{ATc}\,c_{ATc}^{m_2}\right)^{m_1}} \approx \frac{c_{ATc}^{m_1 m_2}}{K'_{H2} + c_{ATc}^{m_1 m_2}}, \tag{3.84}$$

where $K_{H2} = b(1 + \gamma a)/(1 + a + ac)$ and $K'_{H2} = K_{H2}/K_{ATc}^{m_1}$. The approximation that allows the response function $r(c_{ATc})$ to be written as a Hill-type function requires a high value of $K_{H2}$ to ensure that there is no significant response unless $K_{ATc}\,c_{ATc}^{m_2} \gg 1$. This in turn implies that $K_{ATc}$ must not be too small.

Using the Hill plot method with the experimental data for ATc induction at 2% galactose (section 2.2.2) gives a value of $m_1 m_2$ equal to 8.0 and a value of $K'_{H2}$ equal to […] (ng/ml)$^8$. This corresponds to 50% transcriptional efficiency when the system is induced with 34 ng/ml ATc. Figure 3.16A shows the good agreement between the experimental data points and the induction curve predicted by Eq. 3.84 when a value of $K'_{H2}$ = […] (ng/ml)$^8$ is obtained by setting $m_1 = 4$, $m_2 = 2$, $a = 2.0$, $b$ = […], $c = 4.5$, $\gamma$ = […] and $K_{ATc}$ = […]. Similar to the case of galactose induction (Eq. 3.76), there are many combinations of the parameters that can give the correct Hill coefficient and Hill constant for the induction with ATc. Additional experiments are required to obtain these values.

A critical test of the five-state model of the cis-regulatory region of Gal1 would be to compare different combinations of the input signals galactose ($A$) and ATc ($R$) with the surface $r(A, R)$ (Eq. 3.74) shown in Fig. 3.16B. This surface is predicted from data obtained at full induction with ATc ($R = 0$, Eq. 3.76) and full induction with galactose
