Determination of the complete genomic DNA sequence of Thermoplasma volcanium GSS 1


Proc. Japan Acad., 75, Ser. B (1999), No. 7, p. 213

By Tsuyoshi KAWASHIMA,*1) Yoshihiro YAMAMOTO,*2) Hironori ARAMAKI,*3) Tatsuo NUNOSHIBA,*4) Takeshi KAWAMOTO,*5) Kohji WATANABE,*6) Masaaki YAMAZAKI,*6) Keiichi KANEHORI,*7) Naoki AMANO,*1),*8) Yoshie OHYA,*1) Kozo MAKINO,*9) and Masashi SUZUKI*1),*10),†)

(Communicated by Setsuro EBASHI, M. J. A., Sept. 13, 1999)

*1) AIST-NIBHT CREST Centre of Structural Biology, 1-1 Higashi, Tsukuba, Japan.
*2) Department of Genetics, Hyogo College of Medicine, Nishinomiya, Japan.
*3) Department of Molecular Biology, Daiichi College of Pharmaceutical Science, 22-1 Tamagawa-cho, Minami-ku, Fukuoka, Japan.
*4) Department of Molecular and Cellular Biology, Biological Institute, Graduate School of Science, Tohoku University, Sendai, Japan.
*5) Department of Biochemistry, Hiroshima University, School of Dentistry, Kasumi, Minami-ku, Hiroshima, Japan.
*6) Bioscience Research Laboratory, Fujiya, 228 Soya, Hadano, Japan.
*7) DNA Analysis Department, Techno Research Laboratory, Hitachi Science Systems, Ltd., Higashi-Koigakubo, Kokubunji, Japan.
*8) Doctoral Program in Medical Sciences, University of Tsukuba, 1-1-1 Tennohdai, Tsukuba, Japan.
*9) Department of Molecular Microbiology, The Research Institute of Microbial Diseases, Osaka University, 3-1 Yamadaoka, Suita, Japan.
*10) Graduate School of Human and Environmental Sciences, University of Tokyo, Komaba, Meguro-ku, Tokyo, Japan.
†) Correspondence to: M. Suzuki at AIST-NIBHT.

Abstract: The complete genomic DNA sequence of the aero/anaero-facultative archaebacterium, Thermoplasma volcanium GSS1, has been determined. A number of DNA fragments were cloned by using the λ, cosmid, and BAC systems, and sequenced. The remaining 30 gaps were bridged by DNA fragments constructed using the polymerase chain reaction. The repetition in sequencing the same base positions was 13.1 ± 7.5 fold. The alignment of the DNA fragments and the completeness of the genomic sequence were confirmed by the consistency of the genomic sequence with the lengths and partial sequences of a second set of DNA fragments that altogether covered 88% of the genome. The number of bases found in the genomic sequence is 1,584,799, with a G/C content of 39.9%. The combination of the four types of bases in the new genomic sequence is compared with those in known genomic sequences of similar sizes.

Key words: Algorithmic information content; archaebacterium; dot matrix; sequencing repetition; Shannon's entropy; thermophile.

Introduction. We have determined the complete genomic DNA sequence of the archaebacterium, Thermoplasma volcanium GSS1 1) (Japan Collection of Microorganisms code 9571). This organism is a thermoacidophile, which can adapt to both aerobic and anaerobic environments. It has been hypothesized that this organism is closely related to an ancient archaebacterium that later evolved into the nuclei of eukaryotic cells.2) In this paper we report on the process used in the determination of the genomic sequence of T. volcanium.

Sequencing strategy. The number of bases that can be realistically determined in a single sequencing step, l, is far smaller than the total number of bases, L, found in a genomic DNA molecule. Thus, in order to determine a complete genomic sequence, the DNA molecule needs to be divided recursively until a set of fragments of lengths close to l is obtained. After sequencing these fragments, the complete sequence can be reconstructed by climbing up the recursive tree to its peak, connecting the fragmental sequences at each level. Since overlaps need to be created between the neighboring fragments at each level, the total number of bases that practically need to be sequenced, rL, becomes larger than L. Here, r defines the overall repetition in sequencing the same base positions.
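As a rough illustration of how r relates to L, the sketch below divides the total number of sequenced bases by the genome length. The function name is hypothetical, and treating every fragment as contributing exactly the mean read length is a simplifying assumption; the input figures (34,217 second-level fragments, about 600 bases each, a genome of 1,584,799 bases) are those reported elsewhere in this paper.

```python
def overall_repetition(n_fragments, mean_read_len, genome_len):
    """Return r, the total number of sequenced bases divided by L.

    Assumption: every fragment contributes exactly mean_read_len bases.
    """
    return n_fragments * mean_read_len / genome_len


# Figures quoted in this paper for the second sequencing level.
r = overall_repetition(n_fragments=34_217, mean_read_len=600, genome_len=1_584_799)
print(f"approximate overall repetition r = {r:.2f}")
```

Under this simplification the estimate comes out slightly below 13, in line with the overall repetition of 13.1 stated in the abstract.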
It is possible to determine the complete DNA sequences of genomes of approximately 2 M bases by using a single level of DNA fragments. The genomic DNA molecule is sonicated, and the resultant short fragments are sequenced, until the fragmental sequences altogether cover the whole genome. However, instead of applying this strategy, in this study two levels of DNA fragments were prepared.

The archaebacterium T. volcanium was cultured in an aerobic environment by a reported method.1) Its genomic DNA molecule was extracted following protocol 14 in ref. 3. Approximately 500 DNA fragments with apparent sizes close to 20 K bases were cloned using the λ 4) cloning system. In parallel, another set of 384 DNA fragments with sizes close to 0.5 K bases were cloned using the pUC8 5) vector. By hybridization of the pUC8 clones to the library of λ clones, pairs of λ clones that hybridized to the same pUC8 clones, and thus were expected to contain DNA fragments that overlapped each other, were identified. Altogether 121 λ clones were ordered into 46 contigs. Approximately 100 fragments of the Thermoplasma DNA were cloned using the cosmid 6) cloning system, and another 200 DNA fragments were cloned using the BAC 7) cloning system. By hybridization of these clones with the λ clones whose DNA fragments were identified to be positioned at the ends of the 46 contigs, 1 cosmid and 16 BAC clones were identified as bridging some of the contigs. The numbers of bases in the DNA fragments cloned by the λ, cosmid, and BAC systems were 15,576 ± 4,340, 35,040, and 22,640 ± 25,528, respectively. The whole genome was finally covered by bridging the remaining 30 gaps (Fig. 1a): the DNA fragments bracketed by the nucleotide sequences at both ends of the gaps, of approximately 20 bases each, were amplified by the polymerase chain reaction (PCR) 8) using the genomic DNA as the template. The numbers of bases in the PCR products were 7,499 ± 5,748. These and the 138 fragments cloned by the λ, cosmid, and BAC systems composed the first sequencing level.
At the second level, altogether 34,217 DNA fragments were created by sonication of the fragments at the first level, and sequenced. The average number of bases determined using these individual fragments was 600 ± 137.

Confirmation of the sequence. A slight complication was introduced to the sequence determination, as approximately 30% of the λ clones were found to contain 2-4 distinct fragments of the Thermoplasma DNA. These fragments occurred in succession at the GATC cloning sites, and were detected by comparing their sequences with the sequences of the DNA fragments expected to be overlapping.

In order to confirm the alignment of the DNA fragments and the completeness of the determined genomic sequence, a new set of 600 DNA fragments was cloned by using the λ system with a modified protocol. In the new protocol DNA fragments of K bases only were selected, by eliminating contaminating shorter fragments as much as possible, before the ligation reaction. Since the maximum size of a DNA fragment that can be cloned by the λ system is smaller than 20 K, only one fragment was expected to be cloned by each vector. Of 112 of these fragments bases at both ends were sequenced. The lengths of these fragments were estimated on the basis of their electrophoretic mobility, and these estimates differed by less than 2,000 bases from the lengths calculated from the determined genomic sequence. In addition, 7 cosmid clones were selected at random from the original library. Bases at both ends of these DNA fragments were sequenced, and were compared with the genomic sequence. Fragments of DNA were amplified by the PCR method in order to bridge 21 gaps. Altogether these covered 88.0% of the whole genome in 21 contigs (Fig. 1a). The remaining 12.0% corresponds to the regions which were sequenced multiple times by using at least two clones.

The genomic DNA molecule of T. volcanium is closed circular, and the number of bases found in the complete genomic sequence is 1,584,799.
This number is close to our earlier estimate of 1.6 M, made on the basis of the electrophoretic mobility of the genomic DNA molecule. The G/C content is calculated as 39.9%. This value is very close to our earlier estimate of 38%, made on the basis of an HPLC analysis of the bases obtained by treating the genomic DNA molecule with P1 nuclease. It is not so different from another estimate, 46%, made by another group on the basis of the melting temperature of the genomic DNA molecule.1) Of all the bases, 1,530,933 were determined by sequencing both DNA strands, altogether covering 96.6% of the genome. By sequencing either of the two strands singly, 53,866 bases were determined (i.e. 3.4% of the genome).

Repetition in sequencing the same base positions. In the past, the r value was defined as the practical number of bases sequenced in order to sequence the whole genome at least once. The value was an estimation of the trouble, and thus was referred to as the redundancy. However, the r value carries more useful information. Sequencing a genome only once can be erroneous, and thus, the overall r value needs to be reasonably high, although further repetition beyond some point is meaningless. The same overall r value can be produced not only by sequencing all the positions in the genome with exactly the same repetition, but also by sequencing a small fraction of the genome with an extremely high repetition, while sequencing the rest only once or twice. Therefore, a reasonably high overall r value is an indication of precision of the sequence determination, while a smaller standard deviation of r is an indication of higher efficiency.

Fig. 1. Clock-format representations for understanding the process of determining the complete genomic sequence of T. volcanium. (a) Map of the DNA fragments used for the sequence determination (outer) and those used for the confirmation of the determined sequence (inner). Different colors are used for representing the types of the fragments: in the outer set, green for those cloned by the λ system, blue for those cloned by the BAC system, red for that cloned by the cosmid system, and crimson for PCR-amplified fragments, and in the inner set, yellow green for those cloned by the λ system, light brown for those cloned by the cosmid system, and brown for PCR-amplified fragments. The 21 contigs created by the inner set are shown closest to the center. (b) Repetition in sequencing the same base positions averaged for each 25 bases around the genome. The overall average value, 13.1, is shown by a green circle. Some of the sharp peaks, i.e. peaks at around 12, 4:30, and 8 o'clock, correspond to short fragments amplified by PCR shown in (a). (c) Shannon's entropy calculated for the combination of the four types of bases in segments of 20 bases each around the genome of T. volcanium (black) overlapped onto the equivalent entropy calculated with the genome sequence of Methanococcus jannaschii 10) (red). The correlation between the positions in the two genomes is fixed arbitrarily.

A perfect sequence determination can be carried out, if the L to l ratio is close to 1. If this ratio were to be 1, at each sequencing step the whole genome could be sequenced. By repeating this step n times, no position-dependent deviation of r would be produced. However, in reality, the L to l ratio can be as high as 3,000. If the overall r value is kept constant at n, an increase in the L to l ratio will not change the mode value of r much (Fig. 2b), but will increase the position-dependent deviation of r, until the L to l ratio reaches 2-3. Beyond this point, this effect on the deviation of r rapidly decreases (Fig. 2b). It appears to be unlikely that the current technology of sequencing will improve so dramatically that the l number approaches L. By dividing the genome into ordered fragments of the size Lm, and by applying the two level sequencing strategy, the effective L to l ratio can be improved to a value close to Lm/l. However, the current realistic Lm/l number, 30-40, is larger than 2-3, and thus the distribution of r is insensitive to the value (Fig. 2b).
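The dependence of the deviation of r on the L to l ratio can be explored with a small simulation. The following is only a sketch under idealized assumptions that are not part of the procedure reported here: the genome is circular, every fragment has exactly the same length, and fragment start positions are uniformly random. The function name, the scaled-down genome length, and the seed are all hypothetical.

```python
import random
import statistics


def simulate_repetition(genome_len, read_len, mean_r, seed=0):
    """Simulate single-level sequencing of a circular genome.

    Returns (mean, standard deviation) of the per-position repetition.
    Assumes uniformly random start positions and fixed-length fragments.
    """
    rng = random.Random(seed)
    n_reads = mean_r * genome_len // read_len
    diff = [0] * (genome_len + 1)  # difference array for coverage counts
    for _ in range(n_reads):
        start = rng.randrange(genome_len)
        end = start + read_len
        diff[start] += 1
        if end <= genome_len:
            diff[end] -= 1
        else:
            # The fragment wraps around the origin of the circular genome.
            diff[genome_len] -= 1
            diff[0] += 1
            diff[end - genome_len] -= 1
    coverage, running = [], 0
    for i in range(genome_len):
        running += diff[i]
        coverage.append(running)
    return statistics.mean(coverage), statistics.pstdev(coverage)


if __name__ == "__main__":
    genome_len = 60_000  # scaled-down stand-in for L
    for ratio in (1, 2, 3, 5, 10, 50, 100):
        mean, sd = simulate_repetition(genome_len, genome_len // ratio, mean_r=13)
        print(f"L/l = {ratio:3d}: mean r = {mean}, deviation of r = {sd:.2f}")
```

Under these assumptions the number of fragments covering any given position is binomially distributed with variance r(1 - l/L), so the deviation is zero at an L to l ratio of 1 and levels off near the square root of r once the ratio exceeds a few units. A single simulated genome at small ratios fluctuates strongly from run to run, so individual values should be read with caution.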

Fig. 2. Number of bases, N, in the genome (ordinate) sequenced with different repetition, r (abscissa). (a) The repetition observed while determining the complete genomic sequence of T. volcanium (labeled 2), and a similar repetition observed for the determination of individual fragments at the first level (labeled 1) are shown by darker lines. Because of the overlap between fragments at the first level, the sum of the lengths of the fragments is larger than the number of bases in the genome. In order to produce a better comparison, distribution 1 is shown after compensation, so that the total number of bases included in the distribution becomes 1.58 M. The distributions of repetition obtained by simulation, carried out by keeping L at 1.58 M, Lm at 14.9 K, and l at 600, are shown by thinner lines. The percentage of Lm overlapped by the neighboring fragments was increased from 0% to 100% stepwise by 10% each. (b) The repetition obtained by simulation following the single level sequencing strategy. The average sequencing repetition is kept at 6 (distributions on the left) or 13 (distributions on the right). The L length was kept at 1.58 M. The L to l ratio was changed: 1, 2, 3, 5, 10, 50, 100, to 3,333.

Lm becomes close to that of l (i.e. by ordering DNA fragments of bases). Another disadvantage of applying the two level sequencing strategy is the necessity of creating overlaps between neighboring fragments at the first level. If, in order to determine the sequence of each fragment at the first level, bases need to be sequenced m times by using the fragments at the second level made from that fragment, overlaps at the first level produce a repetition of 2m locally. Therefore, the distribution of r becomes bimodal, with peaks at m and 2m, and broader (Fig. 2a).

In our process of the sequence determination, the average number of bases in the fragments at the first level was 14.9 K. On average, 36.8% of the bases in each such fragment were overlapped by other fragments. The repetition for climbing up the levels, m, was 8.1 (distribution 1 in Fig. 2a), while the overall sequencing repetition, r, was then 13.1 (distribution 2 in Fig. 2a). In theory, by applying the single level sequencing strategy with a similar overall repetition, a distribution of r that is characterized by a smaller standard deviation can be created (distribution 3333 with the r value of 13 in Fig. 2b).

Difficulty in assembling fragmental sequences. In reality, different types of uncertainty can be associated with the single level sequencing strategy. Fragments of DNA might not really be created at random with the r value as high as Cleavage of DNA by sonication tends to take place at particular combinations of bases (Ohfuku, Y. et al., unpublished). In addition, the target genomic sequence might possess some structure that would prevent application of this strategy; for example, genomic sequences such as [An][Tn][Gn][Cn] or (ATGC)n, where n is 400,000, cannot be determined by this strategy. More generally, if two sections separated by a distance larger than l have the same combination of bases, a serious problem will occur, since, upon assembling the fragmental sequences, the two sections are indistinguishable. However, this issue can be avoided, if the two level sequencing strategy is applied, and if the distance is larger than Lm.

In order to find such internal structure, dot matrices for calculating matches of two sequences were made by using the genomic sequence of T. volcanium as the query sequence as well as the reference sequence (Fig. 3b). As expected, a diagonal row of dots was observed, proving that the two sequences being compared were the same. This row is not important and thus was deleted from the matrix. Still, a number of dots remained, showing pairs of sections of 350 bases whose DNA sequences were similar to each other.
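The idea of comparing a sequence with itself and discarding the trivial diagonal can be sketched as follows. This is a deliberately simplified stand-in and not the dot-matrix algorithm used in this study: it reports only exactly identical windows rather than windows that are similar under an E-value threshold, and the function name, the toy sequence, and the window length of 12 are hypothetical.

```python
from collections import defaultdict


def repeated_window_pairs(seq, window):
    """Return sorted (i, j) pairs, i < j, of windows with identical bases.

    Pairs with i == j (the main diagonal of a self-comparison dot matrix)
    are never produced, mirroring the removal of the diagonal row of dots.
    """
    positions = defaultdict(list)
    for i in range(len(seq) - window + 1):
        positions[seq[i:i + window]].append(i)
    pairs = []
    for starts in positions.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                pairs.append((starts[a], starts[b]))
    return sorted(pairs)


# Toy sequence containing one 12-base section that occurs twice.
repeat = "GGGATTACAGGG"
toy = "ACGTACCA" + repeat + "CTCTCT" + repeat + "TGA"
print(repeated_window_pairs(toy, window=12))
```

Each reported pair marks two sections that could not be told apart by a fragment no longer than the window, which is the situation described above as a problem for assembly.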
The number of the pairs was counted by enhancing the signals (Fig. 3a), and was compared with equivalent numbers counted for known genomic DNA sequences (Fig. 3c). This number is an indicator for evaluating the difficulty of assembling fragmental sequences. Among the genomic sequences compared, that of T. volcanium had one of the lowest numbers, suggesting that application of the single level sequencing strategy to this genome is possible.

Fig. 3. Comparison of archaebacterial and eubacterial genomic DNA sequences. (a, b) Three dimensional representations of the number of dots found in each section of 10 K bases x 10 K bases, N, in the dot matrices of the genomic sequence of T. volcanium made by the algorithm of Suckow and Suzuki.23) The parameters used were the resolution of 350 bases, and the E-value being smaller than Filtration of noise was carried out for producing (a) from (b), by selecting segments of 350 bases satisfying the E-value threshold, found only in the same domains of 700 bases, that possessed 25 or larger numbers of such segments (see Fig. 2 in ref. 23). The diagonal rows of dots, proving that the query sequence and the reference sequence are exactly the same, are omitted from these matrices. (c) The number of dots divided by 2. Column 1 shows the numbers obtained from the matrices produced by the same procedure as (a), while column 2 shows those obtained by the same procedure as (b). Organisms from Methanobacterium thermoautotrophicum to Methanococcus jannaschii are archaebacteria, while organisms from Thermotoga maritima to Helicobacter pylori are eubacteria. Artificial sequences 1 and 2 are random sequences of the same size as that of T. volcanium. Artificial sequence 1 has the same contents of the four bases as those in T. volcanium, while artificial sequence 2 has an equal content of the four base types, i.e. 25%. (d) Evaluation of the local tendency of clustering the same types of bases. Each genomic sequence was divided into a series of segments of 20 bases each. The 2 - S value was calculated for each segment, where S was defined as $S = -\sum_i p_i \log_2 p_i$, and $p_i$ was defined as the frequency of each of the four types of bases. Fragments that scored 2 - S values higher than a threshold value ( , shown on top of the columns) were selected, and the sum of 2 - S values of these fragments was calculated. The calculation was repeated 20 times by shifting the 20 base phase, i.e. the first set of 20 bases being [1, 2, ..., 20] in the 1st calculation, [2, 3, 4, ..., 20, 21] in the 2nd calculation, [3, 4, ..., 22] in the 3rd calculation etc. The average and standard deviation of the sum are shown.

Shannon's entropy. The central chemical reaction in the standard sequencing procedure is the polymerase reaction that incorporates dye-labeled nucleotides into the complementary strand by following the template DNA strand. The efficiency of this reaction has some dependency on the nucleotide sequence; most notably, sequences having high local percentages of single types of bases tend to be incorrectly produced. Shannon's entropy 9) for the combination of the four types of bases is defined as $S = -\sum_i p_i \log_2 p_i$, where $p_i$ is the frequency of each type of base. The maximum value of S is 2, when all the frequencies, $p_A$, $p_T$, $p_G$, and $p_C$, are the same, i.e. 0.25. Since a higher value of S corresponds to a less structured state, another value, 2 - S, can be used for evaluating the difficulty in the polymerase reaction. For example, a sequence that has $p_A$ of 0.7, and the other three frequencies of 0.1, produces the 2 - S value of approximately 0.64.

The 2 - S value was calculated for each segment of 20 bases in the genomic sequence of T. volcanium (Fig. 1c). Sections whose 2 - S values were higher than a threshold were selected, and the sum of these 2 - S values was calculated (Fig. 3d). The equivalent sums were calculated for the genomic sequences of 4 other archaebacteria 10)-13) and 4 eubacteria 14)-17) of similar sizes, M bases. The same type of calculation was repeated with random sequences of the same size as that of T. volcanium. Artificial sequence 1 was created so that it possessed the same content of the four bases as those in T. volcanium, while artificial sequence 2 possessed an equal content of the four types.

The genomic sequence of Methanococcus jannaschii 10) showed extremely high values of the sum of 2 - S at various thresholds (Figs. 1c and 3d). Neglecting M. jannaschii, the archaebacterial sequences scored sum values that were, in general, smaller than those calculated with the eubacterial sequences. The sums calculated for the sequence of T. volcanium were, in general, higher than those calculated for the sequences of three other archaebacteria, Archaeoglobus fulgidus,11) Pyrococcus sp. OT3,13) and Methanobacterium thermoautotrophicum.12)

Algorithmic information. According to Shannon, any combination of the A/T/G/C bases in the same length can have the same information content.9) In contrast, the algorithmic information content 18)-20) is measured by the most concise algorithmic process that reconstructs a sequence. If a sequence does not possess any structure, describing the full sequence is the most concise algorithmic process. However, the sequences (ATGC)n and [An][Tn][Gn][Cn] can be reconstructed by much shorter processes. In a broad sense, difficulty in sequence determination originates with structure in the genomic sequence, and thus, the algorithmic information content of a sequence might be a good indicator of the overall difficulty of the sequence determination. Unfortunately, it has been proved by Chaitin 21) that precise calculation of the algorithmic information content is not possible.
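Although the exact content cannot be computed, the length of any concrete description that reconstructs a sequence gives an upper bound on it. The sketch below uses a general-purpose compressor for this purpose; the choice of zlib, the helper name, and the scaled-down value of n are illustrative assumptions and were not part of this study.

```python
import random
import zlib


def compressed_size(seq):
    """Size in bytes of a zlib-compressed copy of the sequence.

    This is only an upper bound on the algorithmic information content:
    a better description may exist that the compressor fails to find.
    """
    return len(zlib.compress(seq.encode("ascii"), 9))


n = 10_000  # scaled down from the n used in the text
periodic = "ATGC" * n                                   # (ATGC)n
blocks = "A" * n + "T" * n + "G" * n + "C" * n          # [An][Tn][Gn][Cn]
rng = random.Random(0)
shuffled = "".join(rng.choice("ATGC") for _ in range(4 * n))

for name, seq in (("(ATGC)n", periodic),
                  ("[An][Tn][Gn][Cn]", blocks),
                  ("random", shuffled)):
    print(f"{name:18s} length {len(seq):6d}  compressed {compressed_size(seq):6d} bytes")
```

The two structured sequences shrink to a small fraction of the size needed for the random sequence of the same length, while the random sequence cannot be stored in much less than 2 bits per base.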
There is always a possibility that unidentified structure remains in the sequence studied; the known content is therefore only the maximum limit. All the genomic sequences studied in this paper have structures (Fig. 3c, d), and their algorithmic information contents appear to be far smaller than what is expected for a "totally randomized" sequence. As has been pointed out by Shannon,22) this reduction is an important characteristic of a real message used for communication in a system.

Acknowledgements. The part of this work carried out at AIST-NIBHT was supported by the Core Research for Evolutional Science and Technology (CREST) program of the Japan Science and Technology Corporation (JST). The rest of this work was supported by administrative aids of Marine Biotechnology Institute (MBI) and by financial aids of New Energy and Industrial Technology Development Organization (NEDO). We thank Dr. Masahiro Yamagishi for his help at early stages of this study.

References
1) Segerer, A., Langworthy, T. A., and Stetter, K. O. (1988) System. Appl. Microbiol. 10.
2) Searcy, D. G., Stein, D. B., and Searcy, K. B. (1981) Ann. N. Y. Acad. Sci. 361.
3) Robb, F. T., and Place, A. R. (eds.) (1995) Archaea, A Laboratory Manual: Thermophiles, Cold Spring Harbor Laboratory Press, Cold Spring Harbor.
4) Sambrook, J., Fritsch, E. F., and Maniatis, T. (1989) Molecular Cloning, Cold Spring Harbor Laboratory Press, Cold Spring Harbor.
5) Vieira, J., and Messing, J. (1982) Gene 19.
6) Bates, P. (1987) Methods Enzymol. 153, Part D.
7) Shizuya, H., Birren, B., Kim, U.-J., Mancino, V., Slepak, T., Tachiiri, Y., and Simon, M. (1992) Proc. Natl. Acad. Sci. U.S.A. 89.
8) Saiki, R. K., Gelfand, D. H., Stoffel, S., Scharf, S. J., Higuchi, R., Horn, G. T., Mullis, K. B., and Erlich, H. A. (1988) Science 239.
9) Brillouin, L. (1956) Science and Information Theory, Academic Press, New York.
10) Bult, C. J., White, O., Olsen, G. J., Zhou, L., Fleischmann, R. D. et al. (1996) Science 273.
11) Klenk, H.-P., Clayton, R. A., Tomb, J.-F., White, O., Nelson, K. E. et al. (1997) Nature 390.
12) Smith, D. R., Doucette-Stamm, L. A., Deloughery, C., Lee, H., Dubois, J. et al. (1997) J. Bacteriol. 179.
13) Kawarabayasi, Y., Sawada, M., Horikawa, H., Haikawa, Y., Hino, Y. et al. (1998) DNA Res. 5.
14) Deckert, G., Warren, P., Gaasterland, T., Young, W. G., Lenox, A. et al. (1998) Nature 392.
15) Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A. et al. (1995) Science 269.
16) Tomb, J.-F., White, O., Kerlavage, A. R., Clayton, R. A., Sutton, G. G. et al. (1997) Nature 388.
17) Nelson, K. E., Clayton, R. A., Gill, S. R., Gwinn, M. L., Dodson, R. J. et al. (1999) Nature 399.
18) Solomonoff, R. J. (1964) Inform. Control 7.
19) Kolmogorov, A. N. (1965) Probl. Peredachi Inform. 1.
20) Chaitin, G. J. (1966) J. Assoc. Comp. Machin. 13.
21) Chaitin, G. J. (1987) Algorithmic Information Theory, Cambridge University Press, Cambridge.
22) Shannon, C. E. (1957) Bell System. Tech. J. 30.
23) Suckow, J. M., and Suzuki, M. (1999) Proc. Japan Acad. 75B.