Codon usage and secondary structure of MS2 phage RNA. Michael Bulmer. Department of Statistics, 1 South Parks Road, Oxford OX1 3TG, UK

Nucleic Acids Research, Volume 17, Number 5, 1989

Received October 20, 1988; Revised and Accepted February 14, 1989

ABSTRACT

MS2 is an RNA bacteriophage (3569 bases). The secondary structure of the RNA has been determined, and is known to play an important role in regulating translation. Paired regions of the genome have a higher G+C content than unpaired regions. It has been suggested that this reflects selection for high G+C content to encourage pairing, but a re-analysis of the data together with computer simulation suggests that it is an automatic consequence in any RNA sequence of the way it folds up to minimise its free energy. It has also been suggested that the three registers in which pairing can occur in a coding region are used differentially to optimise the use of the redundancy of the genetic code, but re-analysis of the data shows only weak statistical support for this hypothesis.

INTRODUCTION

MS2 is an RNA bacteriophage. The complete genome of 3569 bases has been sequenced (1) and contains four genes (2): for the maturation protein (bases 130 to 1308), coat protein (bases 1335 to 1724), replicase (bases 1761 to 3395) and the lysis protein (bases 1678 to 1902). The lysis protein gene overlaps the 3' end of the coat protein gene and the 5' end of the replicase gene in a different reading frame.

The secondary structure of MS2 RNA has been determined by Fiers and his co-workers (1). It plays an important role in regulating translation (2,3), affecting both the relative amounts of the gene products produced and the timing of their production. It is therefore of interest to consider how selection has operated to maintain the optimal secondary structure.

Hasegawa et al. (4) observed that in base-paired regions of MS2 RNA there is a bias in the use of synonymous codons which favours C/G over U/A in the third codon position, while the opposite is true in unpaired regions. They interpreted this as the result of a selective constraint to stabilise the optimal secondary structure, but they could not exclude the alternative explanation that it is an automatic consequence of the way in which any RNA sequence folds up to minimise its free energy.

Fitch (5) observed that pairing in a paired section of RNA can occur in three registers (see Fig. 1) which have different potentialities for facilitating base pairing by utilising the degeneracy of the genetic code. He therefore suggested that selection for secondary structure would lead to differential use of these registers, and he presented evidence suggesting that this might be true for the MS2 coat protein gene, which was the only part of the genome sequenced at the time.

I shall here re-examine the data on MS2 RNA secondary structure in the light of these ideas to determine what the statistical properties of codon usage reveal about how selection operates on secondary structure.

Fig. 1. The three registers with anti-parallel base pairing of two coding sequences. In the 1-1 register the paired codon positions are 1.1, 2.3 or 3.2; in the 2-2 register they are 1.3, 2.2 or 3.1; in the 3-3 register they are 1.2, 2.1 or 3.3.

G+C CONTENT IN PAIRED AND UNPAIRED REGIONS

Hasegawa et al. (4) have used the secondary structure of MS2 RNA proposed in (1) to classify all the sites in the complete nucleotide sequence as paired or unpaired, and they present a table of codon usage in the coding region broken down by whether the base in the third position of the codon is paired or not. They find that there is a substantial bias in the third codon position towards the use of C/G over U/A in paired as compared with unpaired sites. After excluding the nondegenerate codons for methionine and tryptophan, the usage of C or G in the third position is 62% in paired sites compared with 34% in unpaired sites (Table 2 in (4)).

There are two explanations of this striking fact. The one preferred by Hasegawa et al. (4) is 'that the present day secondary structure of MS2 RNA was the best among other alternative structures and that evolution has proceeded so as to stabilise this structure through biasing the codon usage'. The other explanation, which Occam's razor leads me to prefer, is that any RNA sequence when it folds up in a stable secondary structure with minimum free energy will tend to accumulate C/G bases in paired sites since the creation of a C.G bond leads to a larger free energy loss than that of an A.U bond. These explanations are not mutually exclusive, but it seems unnecessary to invoke the first in addition to the second without additional evidence.

It is possible to test the theory that synonymous codon usage has been biased to determine secondary structure by repeating the calculation for non-synonymous first and second position sites. If this theory is true then non-synonymous paired sites should show a smaller excess of G+C over unpaired sites than synonymous sites. If the effect is only the automatic consequence of folding into the most stable secondary structure, there should be no difference between synonymous and non-synonymous sites. The results are shown in Table 1. (Throughout this paper the lysis protein gene and the overlapping parts of the coat protein and replicase genes have been excluded from the analysis.)

Table 1. Percent G+C in paired and unpaired sites at different positions of MS2 RNA (sample size in brackets)

| Position in codon | Paired sites | Unpaired sites |
|---|---|---|
| First, non-synonymous (1) | 64 (616) | 34 (302) |
| Second | 55 (668) | 29 (334) |
| Third, synonymous (2) | 63 (671) | 32 (295) |
| First, synonymous (3) | 62 (61) | 30 (23) |
| Non-coding regions | 70 (204) | 34 (97) |

(1) Excluding UUR, CUR, AGR, CGR. (2) Excluding AUG, GUG. (3) UUR, CUR, AGR, CGR only.

Table 2. Statistical analysis of data in Table 1

| Item | χ² | d.f. | Significance level |
|---|---|---|---|
| Pairing status | 258.2 | 1 | P ≪ 10⁻³ |
| Position | 21.0 | 4 | P = 10⁻³ |
| Interaction | 3.4 | 4 | Not significant |

Table 3. Percent paired sites and percent G+C in paired and unpaired sites in a random sequence and in coding regions of MS2 RNA (sample size in brackets)

| Sequence | Energy function | % paired sites | %G+C, paired sites | %G+C, unpaired sites |
|---|---|---|---|---|
| Random | Salser | 60 (3000) | 64 (1814) | 35 (1186) |
| Random | Tinoco | 62 (3000) | 62 (1848) | 38 (1152) |
| Random | Freier et al. | 64 (3000) | 59 (1932) | 41 (1068) |
| MS2 | | 68 (3006) | 61 (2044) | 32 (962) |

The difference in G+C content between paired and unpaired sites is least in non-synonymous positions (first and second rows) and greatest in non-coding regions (last row) with synonymous positions (third and fourth rows) intermediate. This effect is in the direction predicted by the theory of Hasegawa et al. (4), but it is very weak and is not statistically significant, as shown by the absence of a significant interaction term in the statistical analysis shown in Table 2. (This analysis was done with the statistical package GLIM using a logistic regression model.)

It is concluded that the excess of G+C in paired sites can be explained as an automatic consequence of RNA assuming its most stable structure. It is not necessary to postulate that synonymous codon usage in paired segments has been biased towards G+C to maintain a stable paired structure; if it does exist this effect must be rather weak.
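The tabulation behind Table 1 (percent G+C in paired versus unpaired sites) can be illustrated with a minimal Python sketch. It assumes the secondary structure is supplied in dot-bracket notation, which is a convenience of this example and not the representation used in the paper; the function name `gc_by_pairing` and the toy hairpin are hypothetical and are not MS2 data.

```python
def gc_by_pairing(seq, structure):
    """Return (%G+C in paired sites, %G+C in unpaired sites).

    `structure` is a dot-bracket string (assumed format): '(' and ')'
    mark paired sites, '.' marks unpaired sites.
    """
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")

    # counts[paired] = [number of G or C bases, total number of sites]
    counts = {True: [0, 0], False: [0, 0]}
    for base, mark in zip(seq.upper(), structure):
        paired = mark in "()"
        counts[paired][1] += 1
        if base in "GC":
            counts[paired][0] += 1

    def percent(gc, total):
        return 100.0 * gc / total if total else float("nan")

    return percent(*counts[True]), percent(*counts[False])


# Toy hairpin (hypothetical): a 3 bp stem closing a 5-base loop.
toy_seq = "GACAUGCUGUC"
toy_structure = "(((.....)))"
paired_gc, unpaired_gc = gc_by_pairing(toy_seq, toy_structure)
print(f"paired: {paired_gc:.1f}% G+C, unpaired: {unpaired_gc:.1f}% G+C")
```

Restricting the loop to particular codon positions (for example, third positions of degenerate codons only) would give the separate rows of Table 1; that filtering is omitted here for brevity.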
To test this hypothesis, I generated a 'random' sequence of 3000 bases by letting a random number generator choose 1000 codons with probabilities determined by their usage in MS2 phage. The computer program RNAFOLD of Zuker and Stiegler (6) was used to find the minimum energy folding of this sequence. Because of time limitations the sequence was divided into three subsequences of 1000 bases which were folded separately; additional computer runs (not shown here) in which the sequence was divided into shorter subsequences of 300 or 600 bases gave almost the same results.
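The sampling step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the codon usage table below is a small made-up placeholder, not the MS2 codon usage, and the function names are hypothetical. The folding step itself (RNAFOLD) is not reproduced.

```python
import random


def random_coding_sequence(codon_usage, n_codons, seed=None):
    """Draw `n_codons` codons independently, weighted by their usage counts."""
    rng = random.Random(seed)
    codons = list(codon_usage)
    weights = [codon_usage[c] for c in codons]
    return "".join(rng.choices(codons, weights=weights, k=n_codons))


def split_subsequences(seq, size):
    """Cut a sequence into consecutive pieces of at most `size` bases."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


# Placeholder usage counts (hypothetical, for illustration only).
toy_usage = {"GCU": 10, "GCC": 5, "AAA": 8, "UUC": 7, "CGU": 6}

sequence = random_coding_sequence(toy_usage, n_codons=1000, seed=1)
pieces = split_subsequences(sequence, 1000)
print(len(sequence), [len(p) for p in pieces])
```

Each piece would then be passed separately to a folding program; using a piece size that is a multiple of three keeps the codon boundaries intact within every subsequence.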

The current version of the program RNAFOLD allows three options for the energy function to be minimised, due to Salser (7), Tinoco (unpublished) and Freier et al. (8). Results with these three functions are shown in Table 3 for comparison with data for the coding regions of MS2 phage. The percentage of paired sites is slightly higher in MS2 phage than in the random sequences using any of the energy functions; this may reflect selection for pairing in a small part of the MS2 genome. Fitch (5) found 58% base pairing in a 'random' RNA sequence with each of the four bases equally frequent. The difference in G+C content between paired and unpaired sites is about the same in MS2 and the random sequence if the Salser energy function is used for folding the latter but is somewhat greater in MS2 than in the random sequence if either of the other two energy functions is used. It is concluded that little if any of the excess in G+C content in paired sites in MS2 is of selective origin.

USE OF THE THREE REGISTERS

Fitch (5) observed that pairing in a paired coding section of RNA can occur in three registers (Fig. 1). He suggested that selection for secondary structure would lead to selection against the 3-3 register since, by putting third position degenerate bases opposite each other, this register fails to use the degeneracy of the genetic code optimally to facilitate base pairing. He also suggested that there would be selection against the 2-2 in favour of the 1-1 register for similar reasons since the first position is sometimes degenerate whereas the second never is. For the MS2 coat protein gene (the only part of the MS2 genome sequenced at the time) he found that of the 129 base pairs in paired coding regions the numbers in the 1-1, 2-2 and 3-3 registers were 51, 39 and 39 respectively. Though in the predicted direction, this is not statistically significant.
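The statement that Fitch's counts do not differ significantly can be checked with a goodness-of-fit test for equal use of the three registers. The Python sketch below is an illustration and not necessarily the test Fitch used; the function names are hypothetical. It relies on the fact that for exactly 2 degrees of freedom the upper tail probability of the chi-square distribution is exp(-x/2).

```python
import math


def equal_frequency_chi_square(counts):
    """Chi-square statistic for the hypothesis that all categories are equally frequent."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def p_value_2df(statistic):
    """Upper tail probability of a chi-square variate with exactly 2 d.f."""
    return math.exp(-statistic / 2.0)


# Base pairs in the 1-1, 2-2 and 3-3 registers of the coat protein gene (5).
fitch_counts = [51, 39, 39]
statistic = equal_frequency_chi_square(fitch_counts)
print(f"chi-square = {statistic:.2f} with 2 d.f., P = {p_value_2df(statistic):.2f}")
```

With three categories the test has 2 d.f., so the statistic would need to exceed about 5.99 to reach the 5% level; the coat protein counts fall well short of that.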
Repeating Fitch's calculation for the whole MS2 genome (apart from the section with the overlapping lysis protein gene) gives corresponding frequencies of 338, 366 and 259. The excess of the 1-1 over the 2-2 register has been lost, but there seems to be a clear deficiency in the 3-3 register. A test for the equality of these three frequencies gives χ² = 19.2 with 2 d.f. (P < 10⁻³). Unfortunately this statistical test is invalid because the 963 observations are not independent of each other. For example, an uninterrupted paired section of 6 base pairs has 6 observations in the same register, which should only be counted once. To avoid this problem, I broke up the paired coding region into 258 uninterrupted segments, with a break in the pairing occurring between each segment. Counting each segment once gave frequencies of 89, 96 and 73 in the three registers; the statistical test is no longer significant (χ² = 3.23 with 2 d.f.).

Further information can be obtained from the distribution of segment length within each register. The mean segment lengths in the three registers were 3.80, 3.82 and 3.55 base pairs respectively; the residual error mean square was 5.17 with 255 d.f. An analysis of variance testing whether there is any difference between these means gives an F-ratio of 0.35 with 2 and 255 d.f., which is not significant. Thus the 3-3 register is used slightly less often than the other two registers and its unbroken segment length is slightly shorter, but neither of these differences is significant.

One might also expect that selection for secondary structure would lead to degenerate bases being used more than expected at paired sites, with non-degenerate bases tending to be found in bulges or loops. The sample sizes in Table 1 show that the frequency of being paired is 67% for non-degenerate bases (first two rows) compared with 70% for degenerate bases (third and fourth rows). This small difference is in the predicted direction, but is not quite significant (P = 0.057 for a one-tailed test).

DISCUSSION

The secondary structure of MS2 RNA is known to play an important role in regulating translation (2,3), and must therefore be subject to strong selective constraints. Three mechanisms have been suggested which might encourage pairing in regions in which it is advantageous: (a) use of G or C rather than A or U in the third position since binding strengths are greater; (b) use of the 1-1 register since it uses the degeneracy of the genetic code optimally (5); (c) use of degenerate (third position) bases at paired sites. But there is little evidence that any of these mechanisms is of more than marginal significance in determining the secondary structure of MS2 RNA. In particular, most if not all of the excess of G+C in paired sites (4) can be explained as an automatic consequence of RNA assuming its most stable structure rather than as a consequence of selection for secondary structure.

How can the importance of selection for secondary structure in MS2 RNA be reconciled with the weakness of evidence for mechanisms which might encourage it? First, selection for secondary structure may be confined to a small part of the genome, particularly the ribosome binding sites; much of the secondary structure of MS2 may be functionless. (But remember that secondary structure may play a role in transcription and virus assembly as well as in translation.) Second, the evolution of secondary structure at parts of the genome sensitive to selection may have been a complex and opportunistic process not following any obvious rules which can be interpreted post facto.

ACKNOWLEDGEMENTS

I thank Manolo Gouy for the program CRUSOE and Michael Zuker for the program RNAFOLD.

REFERENCES

1. Fiers, W. (1979) In Comprehensive Virology, Vol. 13, Chapter 3, Plenum Press, New York.
2. Van Duin, J. (1988) In Calendar, R. (ed), The Bacteriophages, Vol. 1, Chapter 4, Plenum Press, New York.
3. Lewin, B. (1977) Gene Expression, Vol. 3, Chapter 9, Wiley, New York.
4. Hasegawa, M., Yasunaga, T. and Miyata, T. (1979) Nucleic Acids Res., 7, 2073-2079.
5. Fitch, W.M. (1974) J. mol. Evol., 3, 279-291.
6. Zuker, M. and Stiegler, P. (1981) Nucleic Acids Res., 9, 133-148.
7. Salser, W. (1977) Cold Spring Harbor Symp. Quant. Biol., 42, 985-1002.
8. Freier, S.M., Kierzek, R., Jaeger, J.A., Sugimoto, N., Caruthers, M.H., Neilson, T. and Turner, D.H. (1986) Proc. Natl. Acad. Sci. USA, 83, 9373-9377.

This article, submitted on disc, has been automatically converted into this typeset format by the publisher.