Relationship between nucleotide sequence and 3D protein structure of six genes in Escherichia coli, by analysis of DNA sequence using a Markov model

Similar documents
Textbook Reading Guidelines

Bacteriophages get a foothold on their prey

Time allowed: 2 hours Answer ALL questions in Section A, ALL PARTS of the question in Section B and ONE question from Section C.

(Very) Basic Molecular Biology

Molecular Cell Biology - Problem Drill 01: Introduction to Molecular Cell Biology

Protein Synthesis ~Biology AP~

Protein Structure Function

VALLIAMMAI ENGINEERING COLLEGE

A Possible Mechanism of DNA to DNA Transcription in Eukaryotic Cells : Endonuclease Dependent Transcript Cutout

Chapter 17. From Gene to Protein. Slide 1. Slide 2. Slide 3. Gene Expression. Which of the following is the best example of gene expression? Why?

0.5% Sucrose. Supplemental Data. Chao et al. (2011). Plant Cell /tpc Col-0 tsc10a-2. Root length (mm)

Lecture 7 Motif Databases and Gene Finding

A Possible Mechanism of DNA to DNA Transcription in Eukaryotic Cells : Endonuclease Dependent Transcript Cutout

Molecular Basis of Inheritance

Lecture Series 10 The Genetics of Viruses and Prokaryotes

Clamping down on pathogenic bacteria how to shut down a key DNA polymerase complex

7.03 Final Exam Review 12/19/2006

Answers to Module 1. An obligate aerobe is an organism that has an absolute requirement of oxygen for growth.

CHAPTER 17 FROM GENE TO PROTEIN. Section C: The Synthesis of Protein

Bioinformatics & Protein Structural Analysis. Bioinformatics & Protein Structural Analysis. Learning Objective. Proteomics

Lecture Four. Molecular Approaches I: Nucleic Acids

Old EXAM 1 BIO409/509 NAME. Please number your answers and write them on the attached, lined paper.

Chapter 2. An Introduction to Genes and Genomes

Gene Prediction Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar

Simulation Study of the Reliability and Robustness of the Statistical Methods for Detecting Positive Selection at Single Amino Acid Sites

Pre-AP Biology DNA and Biotechnology Study Guide #1

CSE : Computational Issues in Molecular Biology. Lecture 19. Spring 2004

Bio11 Announcements. Ch 21: DNA Biology and Technology. DNA Functions. DNA and RNA Structure. How do DNA and RNA differ? What are genes?

Fig. A1 Characteristics of HepP and its homology to other proteins. Fig. A2 Amino acid sequence of the 78-kDa predicted protein encoded by zbdp

Microbial Genetics. Chapter 8

Genetic Adaptation II. Microbial Physiology Module 3

1. An alteration of genetic information is shown below. 5. Part of a molecule found in cells is represented below.

Teaching Principles of Enzyme Structure, Evolution, and Catalysis Using Bioinformatics

SUPPLEMENTARY INFORMATION

Genomic region (ENCODE) Gene definitions

Protein Synthesis

4/3/2013. DNA Synthesis Replication of Bacterial DNA Replication of Bacterial DNA

When times are good and when times are bad: Stringent response Stationary phase

Class XII Chapter 6 Molecular Basis of Inheritance Biology

(A) Extrachromosomal DNA (B) RNA found in bacterial cells (C) Is part of the bacterial chromosome (D) Is part of the eukaryote chromosome

Zoo-342 Molecular biology Lecture 2. DNA replication

BIO 101 : The genetic code and the central dogma

Recitation CHAPTER 9 DNA Technologies

Bioinformatics. Lecturer: Antinisca Di Marco Tutor: Francesco Gallo

Chapter 8 Lecture Outline. Transcription, Translation, and Bioinformatics

Gene Prediction Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar

NEW OUTCOMES IN MUTATION RATES ANALYSIS

DNA RNA PROTEIN. Professor Andrea Garrison Biology 11 Illustrations 2010 Pearson Education, Inc. unless otherwise noted

Classification and Learning Using Genetic Algorithms

A Hidden Markov Model for Identification of Helix-Turn-Helix Motifs

Module 2 overview SPRING BREAK

Name 10 Molecular Biology of the Gene Test Date Study Guide You must know: The structure of DNA. The major steps to replication.

Lesson 8. DNA: The Molecule of Heredity. Gene Expression and Regulation. Introduction to Life Processes - SCI 102 1

Recombinant DNA Technology

Bacteria Reproduce Asexually via BINARY FISSION

BIOL 1030 Introduction to Biology: Organismal Biology. Fall 2009 Sections B & D. Steve Thompson:

Bioinformation by Biomedical Informatics Publishing Group

Chapter 11. Gene Expression and Regulation. Lectures by Gregory Ahearn. University of North Florida. Copyright 2009 Pearson Education, Inc..

From DNA to Protein: Genotype to Phenotype

DNA Replication II Biochemistry 302. January 25, 2006

Chapter 8. Microbial Genetics. Lectures prepared by Christine L. Case. Copyright 2010 Pearson Education, Inc.

Q2 (1 point). How many carbon atoms does a glucose molecule contain?

Lecture for Wednesday. Dr. Prince BIOL 1408

Big Idea 3C Basic Review

The Flow of Genetic Information

The Beery Twins Story and Sepiapterin Reductase

Activity A: Build a DNA molecule

Bioinformatics of the Green Fluorescent Proteins

5. the transformation of the host cell. 2. reject the virus. 4. initiate an attack on the virus.

Chapter 14 Regulation of Transcription

Lecture 13. Motor Proteins I

Chapter 16 The Molecular Basis of Inheritance

Genome manipulation by homologous recombination in Drosophila Xiaolin Bi and Yikang S. Rong Date received (in revised form): 9th May 2003

Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Supplementary Material

Course Information. Introduction to Algorithms in Computational Biology Lecture 1. Relations to Some Other Courses

7.1 Techniques for Producing and Analyzing DNA. SBI4U Ms. Ho-Lau

Molecular Biology. IMBB 2017 RAB, Kigali - Rwanda May 02 13, Francesca Stomeo

By two mechanisms: Mutation Genetic Recombination

Nature Methods: doi: /nmeth Supplementary Figure 1. Construction of a sensitive TetR mediated auxotrophic off-switch.

MBioS 503: Section 1 Chromosome, Gene, Translation, & Transcription. Gene Organization. Genome. Objectives: Gene Organization

CHAPTER 20 DNA TECHNOLOGY AND GENOMICS. Section A: DNA Cloning

M1 - Biochemistry. Nucleic Acid Structure II/Transcription I

BIO 311C Spring Lecture 34 Friday 23 Apr.

BETA STRAND Prof. Alejandro Hochkoeppler Department of Pharmaceutical Sciences and Biotechnology University of Bologna

8/21/2014. From Gene to Protein

ON-CHIP AMPLIFICATION OF GENOMIC DNA WITH SHORT TANDEM REPEAT AND SINGLE NUCLEOTIDE POLYMORPHISM ANALYSIS

An introduction to genetics and molecular biology

BIOTECHNOLOGY. Sticky & blunt ends. Restriction endonucleases. Gene cloning an overview. DNA isolation & restriction

G4120: Introduction to Computational Biology

The study of the structure, function, and interaction of cellular proteins is called. A) bioinformatics B) haplotypics C) genomics D) proteomics

Analysis of the AcrB binding domain of the membrane fusion protein AcrA. Introduction

Mechanisms of Genetic Variation. Copyright McGraw-Hill Global Education Holdings, LLC. Permission required for reproduction or display.

Supplementary Note 1. Enzymatic properties of the purified Syn BVR

Time allowed: 2 hours Answer ALL questions in Section A, ALL PARTS of the question in Section B and ONE question from Section C.

CURRICULUM VITAE FREDERICK S. GIMBLE ASSOCIATE PROFESSOR OF BIOCHEMISTRY

B. Incorrect! Ligation is also a necessary step for cloning.

Introduction to Algorithms in Computational Biology Lecture 1

ONLINE BIOINFORMATICS RESOURCES

Applications of HMMs in Computational Biology. BMI/CS Colin Dewey

From DNA to Protein: Genotype to Phenotype

Transcription:

Relationship between nucleotide sequence and 3D protein structure of six genes in Escherichia coli, by analysis of DNA sequence using a Markov model Yuko Ohfuku 1,, 3*, Hideo Tanaka and Masami Uebayasi 3 1 National Institute of Evaluation and Technology -49-10 Nishihara Shibuyaku Tokyo, 151-0066, Japan Agricultural Sciences University of Tsukuba 1-1-1 Ten-nodai Tsukuba city Ibaraki, 305-857, Japan 3 Institute of Molecular and Cell Biology, AIST central 6, 1-1-1 Higashi Tsukuba Ibaraki, 305-8866, Japan *E-mail:ohfuku-yuko@meti.go.jp (Received June 3, 00; accepted July 9, 00; published online December 5, 00 ) Abstract We speculate that structures of genes might have been produced not only by the origination of the gene by itself, but also by the combination of several parts of structures from other species, through events, such as gene transfer, fusion of DNA sequences, and rearrangement of DNA sequences, to create new functions in evolution. To study the relationship between the 3D structure and DNA sequence of proteins comprised of multi-domains, six genes in Escherichia coli, including those coding for Flavodoxin (ferredoxin) reductase, Flavin oxidoreductase, Integrase/recombinase xerd, Endonuclease III, Heat shock protein (grpe), and Elongation Factor Tu (tufb), were chosen, since the 3D structures of these proteins had been determined by X-ray crystallography. The DNA sequences were analyzed by probability as calculated by using a species-specific inhomogeneous Markov model and divided into regions. The probability was calculated by the GeneMark program from the third-order matrix of Class III genes (Escherichia coli horizontally transferred genes) produced with the Markov model by Borodovsky M. et al.. The probability indicates the degree of similarity to the DNA sequence of the genes that were used for producing the matrix with the Markov model. Moreover, the probability indicates whether the origin of the divided DNA sequence would be different from that of the Class III genes. The divided regions, except for those of Flavin oxidoreductase, corresponded to the structural domains classified by CATH. Namely, the divided regions, suggested to be of different origins by analysis with the Markov model, showed correspondence to the domains and the gross content of the 3D structure. We can ascertain which DNA sequences would have been incorporated into the structure of a gene as parts of the structure in evolution. We propose a hypothesis that in evolution, the structure of genes might have been arisen by the incorporation of DNA sequences from other species. Key Words: DNA sequence analysis, Markov model, evolution, structural evolution Area of Interest: Bioinformatics and Bio Computing Copyright 00 Chem-Bio Informatics Society http://www.cbi.or.jp 119

1. Introduction Some functional genes are composed of domains that are structurally separated into parts [1][]. It is possible to consider that some of them might have been incorporated from other species as parts of the structure in the process of evolution. Therefore, we postulated that structures of genes might have been produced not only by the origination of the gene by itself, but as a composition of several parts of structures from other species, through such events as gene transfer, fusion of DNA sequences, and rearrangement of DNA sequences, etc., and in this way have created a new function in evolution. To demonstrate this hypothesis, we indicate how example DNA sequences of functional genes would be divided into structural domains. Scherer S. et al. reported that by analyzing patterns of probability of DNA sequences using the seventh-order Markov Models produced by DNA sequence fragments, the pattern of some region was found to be different from those of other regions [3]. They suggested the following three reasons to explain the difference of probability in the DNA sequence. One was that there might be a selection pressure on the chromosome. The second is that a lateral gene transfer might have occurred from another species. Last is that the DNA sequence might be incorporated from an organelle into the nuclear genome. They made no mention, however, about the relationship between the structure of the gene and the DNA sequence. Therefore, we thought that we could reveal the structure and function or evolution of a functional gene by analyzing its DNA sequence. Médigue C. et al. suggested that the codon usage of 708 genes in Escherichia coli could be classified into three classes (Classes I, II, and III) by Fractional Correspondence Analysis and the dynamic clustering method. Class I genes, with intermediate codon usage bias, maintain a low or intermediate level of expression, although some genes may occasionally be expressed at a very high level under environmentally rare conditions. Class II genes, having high codon usage bias, are highly expressed under exponential growth conditions. Genes from Class III, with low codon usage bias, included the following genes for fimbriae, flagellae and pili, integration host factors(hip and hima), genes controlling cell division (dicabc, structurally related to template phages ), several outer membrane or periplasmic protein genes and several catabolic operons (threonine degradation, β-glucoside degradation, fucose degradation), genes containing insertion sequences, and genes behaving as mutators when inactivated (muth, mutt, and mutd) and lambdoid phage lysogeny control protein. Also they suggested strongly that these Class III genes mostly comprised genes inherited by horizontal transfer [4][5]. Borodovsky M. et al. analyzed the DNA sequence of Escherichia coli by means of a Markov model and showed that only the results to be identified as the Class III genes in the genome DNA sequence, using matrix produced by Class III genes, were allowed with acceptable accuracy. Accordingly, to explore the DNA sequence from other species, where structural information had been published, we used the third-order matrix, produced with Class III genes by Borodovsky M. et al. (Escherichia coli horizontally transferred genes) for our analysis [5]. The probability indicates the degree of similarity to the DNA sequence of those genes that were used for producing the matrix with the Markov model. Moreover, the probability indicates that the origin of the divided DNA sequence would be different from that of the Class III genes. To reveal the relationship between the 3D structure and DNA sequence for proteins comprised of multi-domains, six genes in Escherichia coli, such as Flavodoxin (ferredoxin) reductase, Flavin oxidoreductase, Integrase/recombinase xerd, Endonuclease III, Heat shock protein (grpe), and Elongation Factor Tu (tufb), were chosen, because the 3D structures of these proteins had been determined by X-ray crystallography and the complete genomic DNA sequence of Escherichia coli 10

was determined. Since the probability calculated using a matrix of the genes produced with the Markov model indicates the degree of similarity to the DNA sequence of the genes, the DNA sequence of the genes could be analyzed by the probability. By comparison with the regions divided by the probability of the DNA sequence of the gene and the domain of the 3D structure, there was a correspondence between the divided regions of five genes excluding Flavin oxidoreductase in Escherichia coli and the domains classified by Class, Architecture, and Topology of CATH [6].. Materials and Methods.1 Materials DNA sequences and 3D structures of Flavodoxin (ferredoxin) reductase [7], Flavin oxidoreductase [8], Integrase/recombinase xerd [9], Endonuclease III [10], Heat shock protein (grpe) [11], and Elongation Factor Tu (tufb) [1] in Escherichia coli were downloaded from the Colibri Database [13] and Protein Data Bank [14], respectively.. Methods The probability of a given segment of the DNA sequence is calculated by the program GeneMark. The calculation is basically performed as follows: Each DNA sequence is broken up into "windows," typically comprising 96 bases. The probability that this window contains coding sequence (given a previously determined model of the coding sequence trained for a particular species) is calculated. The window is then moved over one "step," typically 1 bases, and the coding probability is calculated again. When the entire DNA sequence has been traversed in this manner, the average of the windows spanning the sequence is computed (The basis of the computation is Bayes Rule.) [15]. In our analysis, we calculated the probabilities by the GeneMark program with the third order matrix of Class III genes (Escherichia coli horizontally transferred genes) produced by Borodovsky M. et al. with 48 bases as the window size and three-base steps. The probabilities of DNA sequence were analyzed according the following definitions. Additionally, we compared our results with the domain classified by CATH..3 Definition of class demonstrating the calculated probability It is supposed that the greater the calculated probability score of the DNA sequence, the more similar is the codon usage of the genes, by using those matrices produced by Markov models. This indicates that there is more similarity between the analyzed sequence and the sequences of the genes used for producing the matrices as the score becomes closer to 1.0. When the score is closer to 0.0, there is less similarity between them. In contrast, when the score of the probability is close to 0.5, the degree of similarity among them could not be determined. Therefore, meaningful scores of the probability were considered to be greater than 0.6 or less than 0.3. Then we grouped the probability scores into six classes by steps of 0.1, as given in Table 1. 11

Table1. Classification of Probability of DNA sequence Class Score of probability of DNA sequence 1 less than or equal to 0.3 greater than 0.3 and less than 0.6 3 greater than or equal to 0.6 and less than 0.7 4 greater than or equal to 0.7 and less than 0.8 5 greater than or equal to 0.8 and less than 0.9 6 greater than or equal to 0.9.4 Definition of regions demonstrating classes The probability was divided into regions when classes changed. But when the probability class ranges from class 3 to class 6, the frequency of the same probability class should be more than five. Also when the frequency of class 1 is more than four continuously, then that position or region was the boundary between two regions..5 Definition of CLASS demonstrating classes of region When the frequency of the same continuous class is more than five, the CLASS is regarded as a class, by itself. Conversely, when the frequency of the same continuous class is less than six, or if no class occurs continuously, then the most frequent class should be taken as the CLASS. 3. Results 3.1 Analysis of probability of DNA sequence in Flavodoxin reductase Fig. 1. (a) represents the plot of probabilities versus the position of the DNA sequence of Flavodoxin reductase and the plot between classes versus the position of the DNA sequence, classified by probability. The upper boxes indicate the divided regions and the domains classified by CATH. Fig. 1 (b) shows the 3D structure corresponding to the gene (PDB code 1fdr). The DNA sequence coding for Flavodoxin reductase could be divided into seven regions, from I to VII. Fig.1 (a) indicates the correspondence between the domains and the divided regions. Regions from I to III belong to the FAD domain, where Flavin Adenine Dinucleotide (FAD) binds [7]. The class and architecture of the FAD domain are Mainly Beta and Barrel, respectively, according to the definition of CATH. Regions from IV to VII belonged to the NADP domain, where NADP/NADPH binds [7]. The class and architecture of the NADP domain are Alpha Beta and 3 Layer (aba) Sandwich, respectively. Each domain is classified based on Class, derived from the secondary structure, Architecture, which is derived from the gross orientation of the secondary structure, and Topology defined by CATH, and corresponding to the divided region. 1

(a) Regions divided by six classes of probability CATH domain region (Mainly Beta Barrel) (Alpha Beta 3 Layer aba Sandwich) FAD domain NADP/NADPH Domain I II III IV V VI VII 6 5 4 class 3 1-76 -31 14 59 104 149 194 39 84 39 374 419 464 509 554 599 644 689 734 779 1.0 0.8 Probability 0.6 0.4 0. 0.0-76 -31 14 59 104 149 194 39 84 39 374 419 464 509 554 599 644 689 734 779 13

(b) 3D structures of Flavodoxin reductase (PDB code 1fdr) Figure 1. Result of probability analysis of Flavodoxin reductase (a) shows the plot of probabilities versus the position of the DNA sequence of Flavodoxin reductase and the plot between classes versus the position of the DNA sequence, and classified by the probability calculated by the GeneMark program. The upper boxes indicate the divided regions and the domains classified by CATH. Light gray, dark green, cyan blue, and dark blue indicate CLASS 1+, 4, 5, and 6, respectively. (b) indicates the 3D structure of Flavodoxin reductase (PDB code 1fdr). Blue and red indicate the FAD domain and NADP/NADPH domain, respectively. Also blue and red indicate the two domains classified by CATH; one domain whose class is Mainly Beta and architecture is Barrel, and the other domain whose class is Alpha Beta and architecture is 3-Layer (aba) Sandwich. 3. Analysis of probability of DNA sequence for Flavin oxidoreductase The DNA sequence coding for Flavin oxidoreductase was analyzed by the same method as that used for Flavodoxin reductase. Fig. (a) indicates the plots of probability and class, as well as region and domain of Flavin oxidoreductase similarly as shown in Fig. 1 (a). Fig (b) shows the 3D structure corresponding to the gene (PDB code 1qfj). The DNA sequence coding for Flavin oxidoreductase could be divided into seven regions, from I to VII. Fig. (a) indicates the correspondence with the domain and the divided region. The regions from I to the most part of IV belonged to the domain at the N-terminus, the class and architecture of which, in accordance with the definition of CATH, are Mainly Beta and Barrel, respectively. The regions from a part of IV to VII belonged to the domain at the C-terminus, the class and architecture of which is Mainly Alpha Beta and 3-Layer (aba) Sandwich, respectively. In this scheme, however, region IV was located so that it bridged over both domains. Thus, this gene could not be completely divided into the regions corresponding to the structural domains. 14

(a) Regions divided by six classes of probability CATH domain region 6 Mainly Beta, Barrel Alpha Beta, 3-Layer (aba) Sandwich I II III IV V VI VII 5 4 class 3 1-50 -0 10 40 70 100 130 160 190 0 50 80 310 340 370 400 430 460 490 50 550 580 610 640 670 700 730 1.0 0.8 0.6 Probability 0.4 0. 0.0-50 100 50 400 550 700 15

(b) 3D structures of Flavin oxidoreductase (PDB code 1qfj) Figure. Result of analysis of probability of Flavin oxidoreductase (a) shows the plots of probabilities versus the position of the DNA sequence for Flavin oxidoreductase and the plot between classes versus the position of the DNA sequence, as classified by the probability calculated by the GeneMark program. The upper boxes indicate the divided regions and the domains classified by CATH. Light gray, dark green, cyan blue, and dark blue indicate CLASS 1-, 4, 5, and 6, respectively. (b) indicates the 3D structure of Flavin oxidoreductase (PDB code 1qfj). Blue and red indicate the two domains classified by CATH; one domain whose class is Mainly Beta and architecture is Barrel and the other whose class is Alpha Beta and architecture is 3-Layer (aba) Sandwich. 3.3 Analysis of probability for the DNA sequence of Integrase/recombinase xerd The DNA sequence coding for Integrase/recombinase xerd was analyzed by the same method as that used for Flavodoxin reductase. Fig. 3 (a) indicates the plots of probability and class, and region and domain of Integrase/recombinase xerd similarly as those of Fig. 1 (a). Fig 3 (b) shows the 3D structure corresponding to the gene (PDB code 1a0p). The DNA sequence for Integrase/recombinase xerd could be divided into seven regions, from I to VII. Fig. 3 (a) indicates the correspondence with the domain and the divided region. The regions from I to III belonged to the domain at the N-terminus, the class and architecture of which, in accordance with the definition of CATH, are Mainly Alpha and Orthogonal Bundle, respectively. The regions from IV to VII belonged to the domain at the C-terminus, the class and architecture of which, are Mainly Alpha and Orthogonal Bundle, respectively. Each domain classified based on the Class, derived from the secondary structure, Architecture, which is derived from the gross orientation of the secondary structure, and Topology defined by CATH, corresponded with the divided region. 16

(a) Regions divided by six classes of probability CATH domain region Mainly Alpha, Mainly Alpha, Orthogonal Bundle Orthogonal Bundle I II III IV V VI VII 6 5 4 class 3 1-50 -3 4 31 58 85 11 139 166 193 0 47 74 301 38 355 38 409 436 463 490 517 544 571 598 65 65 679 706 733 760 787 814 841 868 895 9 949 1.0 0.8 0.6 Probability 0.4 0. 0.0-50 100 50 400 550 700 850 17

(b) 3D structures of Integrase/recombinase xerd (PDB code 1a0p) Figure 3. Result of analysis of probability of Integrase/recombinase xerd (a) shows the plot of probabilities versus the position of the DNA sequence for Integrase/recombinase xerd and the plot between classes versus the position of the DNA sequence, classified by probability as calculated by the GeneMark program. The upper boxes indicate the divided regions and the domains classified by CATH. Light gray, dark green, cyan blue, and dark blue indicate CLASS 1+, 4, 5, and 6, respectively. (b) indicates the 3D structure of Integrase/recombinase xerd (PDB code 1a0p). Blue and red indicate the two domains classified by CATH; the class and architecture of both domains are Mainly Alpha and Orthogonal Bundle. 3.4 Analysis of probability of DNA sequence in Endonuclease III The DNA sequence coding for Endonuclease III was analyzed by the same method as that for Flavodoxin reductase. Fig. 4 (a) indicates the plots of probability and class, and region and domain of Endonuclease III as well as those of Fig. 1 (a). Fig 4 (b) shows the 3D structure corresponding to the gene (PDB code abk). The DNA sequence of Endonuclease III could be divided into eight regions, from I to VIII. Fig.4 (a) indicates the correspondence with the domain and the divided region. Regions I and from VI to VIII belonged to the domain at the N-terminus, class and architecture of which, according to the definition of CATH, are Mainly Alpha and Orthogonal Bundle, respectively. Regions from II to V belonged to the domain at the C-terminus, whose class and architecture are Mainly Alpha and Orthogonal Bundle, respectively. Each domain was classified based on Class, derived from the secondary structure, Architecture, which was derived from the gross orientation of the secondary structure and Topology as defined by CATH, corresponded with the divided region. 18

(a) Regions divided by six classes of probability CATH domain region 6 Mainly Alpha, Mainly Alpha, Mainly Alpha, Orthogonal Bundle Orthogonal Bundle Orthogonal Bundle I II III IV V VI VII VIII 5 4 class 3 1-50 5 100 175 50 35 400 475 550 65 1.0 0.8 0.6 Probability 0.4 0. 0.0-50 100 50 400 550 19

(b) 3D structures of Endonuclease III (PDB code abk) Figure 4. Result of analysis of probability of Endonuclease III (a) shows the plot of probabilities versus the position of the DNA sequence for Endonuclease III and the plot between classes versus the position of the DNA sequence, as classified by the probability calculated by the GeneMark program. The upper boxes indicate the divided regions and the domains classified by CATH. Light gray, dark green, cyan blue, and dark blue indicate CLASS 1-, 4, 5, and 6, respectively. (b) indicates the 3D structure of Endonuclease III (PDB code abk). Blue and red indicate the two domains classified by CATH, the class and architecture of both domains are Mainly Alpha and Orthogonal Bundle. 3.5 Analysis of probability of the DNA sequence in Heat shock protein (grpe) The DNA sequence of Heat shock protein (grpe) was analyzed by the same method as that for Flavodoxin reductase. Fig. 5 (a) indicates the plots of probability and class, and region and domain of Heat shock protein (grpe) similarly as those of Fig. 1 (a). Fig. 5 (b) shows the 3D structure coded for by the gene (PDB code 1dkg). The DNA sequence of Heat shock protein (grpe) could be divided into six regions, from I to VI. Fig. 5 (a) indicates the correspondence with the domain and the divided region. Regions I, II, and III belonged to the domain at the N-terminus, the class and architecture of which, in accordance with the definition of CATH, are Alpha Beta and Complex, respectively. The regions from IV to VI belonged to the domain at the C-terminus, the class and architecture of which, are Mainly Beta and Roll, respectively. Each domain was classified based on Class, derived from the secondary structure, Architecture, which was derived from the gross orientation of the secondary structure, and Topology as defined by CATH corresponded with the divided region. 130

(a) Regions divided by six classes of probability CATH domain region Alpha Beta Complex I II III Mainly Beta Roll IV V VI 6 5 4 class 3 1-50 -0 10 40 70 100 130 160 190 0 50 80 310 340 370 400 430 460 490 50 550 580 610 640 1.0 0.8 0.6 Probability 0.4 0. 0.0-50 10 70 130 190 50 310 370 430 490 550 610 131

(b) 3D structures of Heat shock protein (grpe) (PDB code 1dkg) Figure 5. Result of the analysis of probability of Heat shock protein (grpe) (a) shows the plot of probabilities versus the position of the DNA sequence for Heat shock protein (grpe) and the plot between classes versus the position of the DNA sequence, as classified by the probability calculated by the GeneMark program. The upper boxes indicate the divided regions and the domains classified by CATH. Dark green, cyan blue, and dark blue indicate CLASS 4, 5, and 6, respectively. (b) indicates the 3D structure of Heat shock protein (grpe) (PDB code 1dkg). Blue and red indicate the two domains classified by CATH. The class of one domain is Alpha Beta and the architecture is Complex, the class of the other domain is Mainly Beta and the architecture is Roll. 3.6 Analysis of probability of the DNA sequence in Elongation Factor Tu (tufb) The DNA sequence of Elongation Factor Tu (tufb) was analyzed by the same method as that used for Flavodoxin reductase. Fig. 6 (a) indicates the plots of probability and class, and region and domain of Elongation Factor Tu similarly as those of Fig. 1 (a). Fig. 6 (b) shows the 3D structure coded for by the gene (PDB code 1efu). The DNA sequence of Elongation Factor Tu (tufb) could be divided into eight regions, from I to VIII. Fig. 6 (a) indicates the correspondence with the domain and the divided region. The regions from I to IV belong to the domain at the N-terminus, the class and architecture of which, in accordance with the definition of CATH are Alpha Beta and 3-Layer (aba) Sandwich, respectively. Regions V and VI belong to the domain between the N-terminus and the C-terminus, the class and architecture of which are Mainly Beta and Barrel, respectively. Regions VII and VIII belong to the domain at the C-terminus, and have the class and architecture of Mainly Beta and Barrel, respectively. Each domain, which is classified on the basis of Class as derived from the secondary structure, Architecture, which is derived from the gross orientation of the secondary structure, and Topology defined by CATH, corresponded with the divided region. 13

(a) Regions divided by six classes of probability CATH domain region Alpha Beta Mainly Beta Mainly Beta 3-Layer (aba) Sandwich Barrel Barrel I II III IV V VI VII VIII 6 5 4 class 3 1-50 -17 16 49 8 115 148 181 14 47 80 313 346 379 41 445 478 511 544 577 610 643 676 709 74 775 808 841 874 907 940 973 1006 1039 107 1105 1138 1171 104 137 1.0 0.8 0.6 Probability 0.4 0. 0.0-50 100 50 400 550 700 850 1000 1150 133

(b) 3D structures of Elongation Factor Tu (pdb code 1efu) Figure 6. Result of analysis of the probability of Elongation Factor Tu (tufb) (a) shows the plot of probabilities versus the position of the DNA sequence of Elongation Factor Tu (tufb) and the plot between classes versus the position of the DNA sequence, classified by the probability calculated by the GeneMark program. The upper boxes indicate the divided regions and the domains classified by CATH. Light gray, greenish yellow, dark green, cyan blue, and dark blue indicate CLASS 1+, 3, 4, 5, and 6, respectively. (b) shows the 3D structure of Elongation Factor Tu (PDB code 1efu). Dark blue, pink, and red indicate the three domains classified by CATH, the class of the domain at the N-terminus is Alpha Beta and the architecture is 3-Layer (aba) Sandwich. For the remaining two domains, the class is Mainly Beta and the architecture is Barrel. 3.7 Comparison between the divided regions and the domain classified in accordance with the definition of CATH Table indicates the summary of the comparison between the divided regions and the domain classified in accordance with the definition of CATH. The divided regions of five genes, except for Flavin oxidoreductase in Escherichia coli by the probability calculated using Markov models, corresponded to the domains classified on the basis of Class, derived from the secondary structure, Architecture, derived from the gross orientation of the secondary structure, and Topology defined by CATH. 134

Table. Comparison between the divided regions and the domains classified in accordance with the definitions of CATH Gene (PDB code9 Flavodoxin reductase (1fdr) Number of domains CATH code of domains (class, architecture) Mainly beta, Barrel ( FAD domain) Alpha beta, 3-Layer(aba) Sandwich (NADP/NADPH domain) I-III Number of regions IV-VII Flavin oxidoreductase (1qfj) Mainly beta, Barrel Alpha beta, 3-Layer(aba) Sandwich I-IV IV-VII Integrase/recombinase xerd (1a0p) Mainly alpha, Orthogonal Bundle Mainly alpha, Orthogonal Bundle I-III IV-VII Endonuclease III (abk) Mainly alpha, Orthogonal Bundle Mainly alpha, Orthogonal Bundle I, VI-VIII II-V Heat shock protein (grpe)(1dkg) Alpha beta, Complex Mainly beta, Roll II, III IV-VI Alpha beta, 3-Layer (aba) Sandwich I-IV Elongation Factor Tu (tufb) (1efu) 3 Mainly beta, Barrel V, VI Mainly beta, Barrel VII, VIII 4. Discussion Six genes in Escherichia coli, Flavodoxin reductase, Flavin oxidoreductase, Integrase/recombinase xerd, Endonuclease III, Heat shock protein (grpe), and Elongation Factor Tu (tufb), were analyzed. Five of the six genes, with the exception of Flavin oxidoreductase, were demonstrated to be divided in correspondence with the domains as parts of the structure by using the probability of the DNA sequence. Especially, Flavodoxin reductase has two domains, the FAD domain and the NADP domain, which are related to function. The divided regions from I to III belonged to the FAD domain, where FAD is bound, and those from IV to VII belonged to the NADP domain, where NADP/NADPH is bound. These two domains exemplified the correspondence to the domains classified on the basis of Class, derived from the secondary 135

structure, Architecture, derived from the gross orientation of the secondary structure, and Topology defined by CATH. By our analysis, however, the domain classified by CATH could be further divided into several more regions. But only region VI of Flavin oxidoreductase was bridged between two domains. Looking into more details, we could include one helix into the former domain among the two domains classified by CATH. We also noted that certain genes exist, such as that for Flavin oxidoreductase, which were not incorporated from the DNA sequence as a part of structure. Although there are genes that are not divided into regions corresponding to structure, we were able to find the genes that are divided into regions corresponding to structure by this method. Accordingly, it can be said that the methodology of the Markov model is a useful tool for dividing the DNA sequence of a gene into regions corresponding to the coded protein structure. Although we will demonstrate the results of our analysis more clearly in next paper, we suggest that the DNA sequence of the gene could be divided into the regions, where gene transfer and recombination would have occurred. This is because the divided regions, which showed correspondence to the domains and the gross 3D protein structure, would be elucidated to have different origins by analysis of the probability with the Markov model. Therefore, it is clear that these five genes would be composed of the DNA sequences, which might have been collected from parts of domains of other species or by recombination within its genome. We concluded that these five genes would have been produced by means of such events as horizontal transfer, rearrangement, or recombination of its own DNA sequences or those from other species in the process of evolution. References [1] Y. Liu and D. Eisenberg, Protein. Sci., 11, 185-199 (00). [] M. Jaskolski, Acta. Biochim. Pol., 48, 807-87 (001). [3] S. Scherer et al., Proc. Natl. Acad. Sci., 91, 7134-7138 (1994). [4] C. Médigue et al., J. Mol. Biol.,, 851-856 (1991). [5] M. Borodovsky et al., Nucleic Acids. Res., 3, 3554-356 (1995). [6] C. A. Orengo et al., Structure, 5, 1093-1108 (1997). [7] M. Ingelman et al., J. Mol. Biol., 68, 147-157 (1997). [8] M. Ingelman et al., Biochemistry, 38, 7040-7049(1999). [9] H. S. Subramanya et al., EMBO J., 16, 5178-5187 (1997). [10] C. F. Kuo et al., Science, 58, 434-440 (199). [11] C. J. Harrison et al., Science, 76, 431-435 (1997). [1] T. Kawashima et al., Nature, 379, 511-518 (1996). [13] http://genolist.pasteur.fr/colibri/ [14] http://www.rcsb.org/pdb/ [15] http://opal.biology.gatech.edu/genemark/ 136