ANALYSIS OF CLOSTRIDIUM BEIJERINCKII NRRL B-598 CODING REGIONS USING RNA-SEQ DATA OF A CLOSELY RELATED STRAIN

Sedlar K. (1), Branska B. (2), Kupkova K. (1), Koscova P. (1), Kolek J. (2), Vasylkivska M. (2), Patakova P. (2), Provaznik I. (1)

(1) Department of Biomedical Engineering, Brno University of Technology, Brno, Czechia
(2) Department of Biotechnology, University of Chemistry and Technology Prague, Prague, Czechia
sedlar@feec.vutbr.cz

Abstract

Modern research in biotechnology relies on methods of molecular biology and the associated bioinformatics techniques more than in the past. Genome mining of biotechnologically relevant organisms has become a standard procedure. After a genome is sequenced and annotated, further information about its genes, such as sequence variants and expression levels, still needs to be acquired. This can be achieved by additional sequencing of the transcriptome, so-called RNA-Seq. Here, we present an analysis of the coding regions of the recently reidentified strain Clostridium beijerinckii NRRL B-598, formerly misidentified as C. pasteurianum, using RNA-Seq expression data of a closely related strain, C. beijerinckii NCIMB 8052. We confirm the correctness of its reidentification, as the majority of reads match the genome. Although expression levels and single nucleotide variants cannot be properly explored this way, we are able to examine the genome annotation in silico, and the results support the annotation of biotechnologically relevant genes. Although this kind of analysis does not allow us to reannotate the genome or to reconstruct any gene regulatory networks, it still provides valuable information for planning our own RNA-Seq experiments, which will be performed in the near future.

Introduction

The strain Clostridium beijerinckii NRRL B-598 has become a relatively well described bacterium, at least concerning its genome sequence.
It is an oxygen-tolerant, spore-forming, mesophilic, heterofermentative anaerobe capable of acetone-butanol fermentation, with butanol as the main product. The organism therefore shows great potential for biorefinery applications. The strain was obtained from the Agricultural Research Service Culture Collection (NRRL), and its genome has been studied in detail over the last three years. These analyses led to the correct identification of the strain, which was formerly misidentified as Clostridium pasteurianum NRRL B-598. Although analysis of its first draft genome assembly [2] showed a higher similarity to C. beijerinckii NCIMB 8052 than to C. pasteurianum DSM 525 [3], the assembly did not contain the genes used for species identification, in particular the 16S rRNA gene [4]. Its first complete genome assembly [5] allowed proper phylogenetic as well as phylogenomic analysis and resulted in its reidentification. At the same time, the complete genome assembly was updated. The sequence can be found in the GenBank database under the version number CP and all the analyses in this paper relate to this version. The update reorganizes the sequence according to the position of the dnaA gene; the positions of all annotated genes therefore changed, and several genes were omitted compared with the previous version. Although the strain was reidentified as C. beijerinckii based on its genomic and phenotypic similarities to C. beijerinckii NCIMB 8052, it also has a very high average nucleotide identity (ANI) [6] to C. diolis DSM ( 98%). However, this identity is based on relatively low genome coverage ( 84%), because only a draft genome assembly is available for C. diolis DSM. Here, we confirm the correctness of the reidentification by successfully mapping RNA-Seq reads obtained from strain C. beijerinckii NCIMB 8052 to the genome sequence of strain C. beijerinckii NRRL B-598.
Although the two strains are very similar, they are not identical according to their dDDH [7] (digital DNA-DNA hybridization) value, which was computed to be 78% using GGDC [8] (Genome-to-Genome Distance Calculator). This technique replaces wet-lab DDH with an in silico comparison of complete genome sequences. DDH values > 70% indicate that strains belong to the same species, and values > 79% indicate the same subspecies. A substantial difference between the strains can be found, for example, in a specific type II R-M system requiring Dam and Dcm methylation-free DNA molecules for transformation [9]. On the other hand, the strains share very high sequence similarity (96-99%) in genes involved in solventogenesis, including Spo0A, the master regulator of sporulation, and the sol operon, consisting of the genes adhE (alcohol/acetaldehyde dehydrogenase), ctfA (CoA transferase subunit A), ctfB (CoA transferase subunit B), and adc (acetoacetate decarboxylase). We further investigate this similarity, as well as similarities over the whole cultivation, using RNA-Seq data from the better explored strain NCIMB 8052, and prepare a strategy for our own RNA-Seq experiments.
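The dDDH thresholds quoted above can be written as a small decision rule. The following is only an illustrative Python sketch of those two cut-offs (70% for species, 79% for subspecies); the function and constant names are hypothetical and are not part of GGDC.

```python
# Thresholds as quoted in the text; names are illustrative only.
SPECIES_THRESHOLD = 70.0      # dDDH > 70 %: same species
SUBSPECIES_THRESHOLD = 79.0   # dDDH > 79 %: same subspecies


def classify_ddh(ddh_percent: float) -> str:
    """Interpret a dDDH value (in percent) using the two thresholds above."""
    if ddh_percent > SUBSPECIES_THRESHOLD:
        return "same species, same subspecies"
    if ddh_percent > SPECIES_THRESHOLD:
        return "same species, subspecies threshold not met"
    return "species threshold not met"


# The value reported for NRRL B-598 versus NCIMB 8052 is 78 %.
print(classify_ddh(78.0))
```

With the reported value of 78%, the rule places both strains in the same species while falling just short of the subspecies threshold, which matches the statement that the strains are very similar but not identical.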

Materials and methods

Cultivation and cell growth analysis

Cultivation was performed at 37 °C in three parallel 1 L Multiforce bioreactors (Infors HT, Switzerland) with TYA medium according to Kolek et al. [11]. Cells were collected at regular intervals, and optical density (OD) was measured at 600 nm using a Varian Cary 50 Bio spectrophotometer (Varian, Inc.). The pH was measured directly in the bioreactors and recorded continuously.

Bioinformatics analysis

The values measured during cultivation were processed and visualized using MATLAB version 2014b (MathWorks). RNA-Seq raw reads for strain C. beijerinckii NCIMB 8052 were obtained from the NCBI SRA (Sequence Read Archive) database under the accession number SRA using the fastq-dump tool from the SRA Toolkit. In total, six samples consisting of Illumina 75 bp single-end reads covering the entire ABE (acetone-butanol-ethanol) fermentation were downloaded. The complete genome sequences were obtained from the NCBI GenBank database under the version numbers CP (NRRL B-598) and CP (NCIMB 8052). The reads were mapped onto the genome sequences using Bowtie2 [13]. An alignment was considered valid if its score reached a minimum given by a linear function of read length, 0.2 × read length, which for 75 bp single-end reads translates to a minimum alignment score of 15. The resulting large files with the mapping positions of the reads were further processed with SAMtools [14]. Sorted and indexed reads were processed into coverage plots using custom Perl scripts and were analyzed and visualized using Artemis [15] and DNAPlotter [16]. Computations were performed on a stand-alone PC with Core i (4 cores, 24 GB RAM) and on a computational cluster with Xeon E5645 (16 cores, 64 GB RAM).

Results and discussion

Growth kinetics

The two strains differ slightly in solvent production, sporulation, and other cultivation characteristics [11], as well as in cell growth during ABE fermentation. The growth kinetics of C. beijerinckii NCIMB 8052 are characterized by rapid exponential growth at the beginning of cultivation, with a stationary phase starting around 10 h, preceded by a change from acidogenesis to solventogenesis at approximately h [17]. The strain C. beijerinckii NRRL B-598 continues to grow even during the early solventogenic phase, as shown in Fig 1.

Figure 1. Fermentation kinetics of C. beijerinckii NRRL B-598 culture. (A) Cell growth curve with suggested points for RNA isolation indicated by circles. (B) pH curve over time of cultivation.

Due to the slight differences between the cell growth curves of the two strains, we suggest adjusting the sampling points for RNA isolation in our own RNA-Seq experiments relative to those used for strain NCIMB 8052 (2, 4.5, 10, 14, 17, 26.5 h), preferably on the basis of the pH curve. The first sample should be isolated in the first third of the exponential growth phase, when acid production leads to a massive decrease in pH but solvent production is not yet running. The second sample should be collected at the minimum pH, which corresponds to the maximum acid concentration and, simultaneously, to the acidogenic/solventogenic switch. The remaining samples will be taken, if possible, at equal intervals. The suggested sampling points (4, 7, 10, 13, 18, 23 h) are indicated by circles in Fig 1.
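The minimum alignment score quoted in the Bioinformatics analysis subsection follows from a linear function of read length. The sketch below only reproduces that arithmetic in Python; the zero intercept is an assumption (the text gives only the 0.2 coefficient), and the function and parameter names are illustrative rather than Bowtie2 options.

```python
def min_alignment_score(read_length: int,
                        slope: float = 0.2,
                        intercept: float = 0.0) -> float:
    """Minimum valid alignment score as a linear function of read length.

    slope=0.2 is the coefficient stated in the text; intercept=0.0 is an
    assumption made here so that a 75 bp read yields the quoted score of 15.
    """
    return intercept + slope * read_length


# 75 bp single-end reads, as used in the downloaded dataset.
threshold = min_alignment_score(75)
print(f"minimum alignment score for 75 bp reads: {threshold:.0f}")
```

Because the threshold scales with read length, shorter reads would be held to a proportionally lower minimum score under the same function.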

Mapping of raw reads

The whole RNA-Seq dataset consists of six samples with 83,844,609 reads in total. Using the Bowtie2 tool, % of these reads are mapped onto the C. beijerinckii NCIMB 8052 genome and only slightly fewer, specifically %, onto the C. beijerinckii NRRL B-598 genome. This confirms the high sequence similarity of the two strains; further details can be found in Table I.

Table I Summary of RNA-Seq raw reads mapping results
Columns: Time collected (h); Total No. of reads; C. beijerinckii NCIMB 8052 (No. of mapped, % of mapped); C. beijerinckii NRRL B-598 (No. of mapped, % of mapped)
2 8,988,633 8,689, ,617,
,457,480 9,254, ,162,
,011,531 7,687, ,603,
,448,929 7,904, ,829,
,363,535 9,996, ,925,
,574,501 37,827, ,621,
Total 83,844,609 81,360, ,759,

Moreover, the reads are mapped to very similar positions in the genomes. This confirms that both strains have a similar genome structure, even though their lengths and total numbers of genes differ slightly. The latter difference is caused mainly by the elaborate annotation of C. beijerinckii NCIMB 8052, which utilized RNA-Seq to remove pseudogenes and misannotated regions. Nevertheless, each strain has unique genes. These similarities and differences are shown in Fig 2, which contains coverage plots of the genomes for the 4.5 h collection point.

Figure 2. Circular plots of the reads from 4.5 h samples mapping to the (A) C. beijerinckii NCIMB 8052 genome and (B) C. beijerinckii NRRL B-598 genome. The outermost and second outermost circles represent CDS on the forward and reverse strands, respectively. The third circle represents pseudogenes. The inner shaded area represents genome coverage by RNA-Seq reads.

The sol operon, consisting of four genes involved in solventogenesis, plays an important role in the fermentation process. After revision of the first complete genome assembly of C. beijerinckii NRRL B-598 to the version CP, it is evident that the sol operon, regulated by the Spo0A master regulator, is carried on the forward strand in both strains and on average shares as much as 97% sequence similarity. Genes of the sol operon in both strains are summarized in Table II.

Table II Summary of the sol operon

Gene   C. beijerinckii NCIMB 8052, locus tag (position)   C. beijerinckii NRRL B-598, locus tag (position)   Identities
ald    Cbei_3832 (4,399,026..4,400,432)                    X276_06755 (4,539,268..4,540,674)                   1368/1407 (97%)
ctfA   Cbei_3833 (4,400,524..4,401,177)                    X276_06750 (4,540,766..4,541,419)                   635/654 (97%)
ctfB   Cbei_3834 (4,401,178..4,401,843)                    X276_06745 (4,541,420..4,542,085)                   647/666 (97%)
adc    Cbei_3835 (4,401,916..4,402,656)                    X276_06740 (4,542,158..4,542,898)                   722/741 (99%)

Although the expression of the sol operon genes cannot be properly derived from data for a different strain, the high sequence similarity at least allows us to map the reads to the C. beijerinckii NRRL B-598 sol operon and to show expression as coverage normalized to the genome-wide total number of unambiguously mapped reads for each sample, as in Fig 3. In the 2 h and 26.5 h samples, transcription of the sol operon genes is almost inactive. In the remaining samples, these genes are active and exhibit very similar levels of transcriptional activity. With the proposed sampling for strain NRRL B-598, we will be able to cover sol operon activity and further investigate its sequence, with a focus on single nucleotide variants. Moreover, by identifying housekeeping genes, we will be able to properly analyze its expression over the whole fermentation process.

Figure 3. Putative transcriptional profiles of the C. beijerinckii NRRL B-598 sol operon during the entire fermentation process, based on C. beijerinckii NCIMB 8052 RNA-Seq data.

Conclusion

Public databases provide a wide range of genome as well as transcriptome sequencing data. Although no final conclusion can be derived from data for a different organism, data for a closely related organism can still be highly beneficial for supporting current conclusions or for planning future experiments. Here, we confirmed the recent reidentification of C. beijerinckii NRRL B-598 by successfully mapping C. beijerinckii NCIMB 8052 RNA-Seq reads to its genome sequence. Moreover, we adjusted the collection points for RNA-Seq of our strain by comparing its cell growth and pH curves during fermentation. Besides the analysis of fermentation kinetics, we also compiled putative transcriptional profiles of the genes involved in solventogenesis, which further support the design of our future experiments.
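As a consistency check on Table II, the gene coordinates can be compared between the two strains. The following Python sketch is not part of the original analysis; it assumes the positions are 1-based and inclusive (so length = end - start + 1), and the dictionary layout and function names are illustrative.

```python
# Coordinates transcribed from Table II (assumed 1-based, inclusive start..end).
SOL_OPERON = {
    "ald":  {"NCIMB 8052": (4_399_026, 4_400_432), "NRRL B-598": (4_539_268, 4_540_674)},
    "ctfA": {"NCIMB 8052": (4_400_524, 4_401_177), "NRRL B-598": (4_540_766, 4_541_419)},
    "ctfB": {"NCIMB 8052": (4_401_178, 4_401_843), "NRRL B-598": (4_541_420, 4_542_085)},
    "adc":  {"NCIMB 8052": (4_401_916, 4_402_656), "NRRL B-598": (4_542_158, 4_542_898)},
}


def gene_length(start: int, end: int) -> int:
    """Length in bp of a feature with inclusive coordinates."""
    return end - start + 1


def summarize(operon: dict) -> list:
    """Return (gene, length in NCIMB 8052, length in NRRL B-598, start offset)."""
    rows = []
    for gene, coords in operon.items():
        ncimb = coords["NCIMB 8052"]
        nrrl = coords["NRRL B-598"]
        offset = nrrl[0] - ncimb[0]  # shift of the gene start between genomes
        rows.append((gene, gene_length(*ncimb), gene_length(*nrrl), offset))
    return rows


for gene, len_ncimb, len_nrrl, offset in summarize(SOL_OPERON):
    print(f"{gene}: {len_ncimb} bp vs {len_nrrl} bp, start offset {offset} bp")
```

Under these assumptions, each gene has the same length in both strains (1407, 654, 666, and 741 bp, matching the denominators in the Identities column), and all four genes are shifted by the same 140,242 bp, which is consistent with the operon having the same internal organization in both genomes.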

Acknowledgement

This work has been supported by grant project GACR S. Computational resources were partially provided by the CESNET LM and the CERIT Scientific Cloud LM, provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures".

References

1. Sedlar K., Kolek J., Provaznik I., Patakova P.: J. Biotechnol. 244, 1 (2017).
2. Kolek J., Sedlar K., Provaznik I., Patakova P.: Genome Announc. 2, e00192 (2014).
3. Poehlein A., Grosse-Honebrink A., Zhang Y., Minton N. P., Daniel R.: Genome Announc. 3, e01591 (2015).
4. Sedlar K., Skutkova H., Kolek J., Patakova P., Provaznik I.: Proceedings of 2nd International Conference on Chemical Technology, 435 (2014).
5. Sedlar K., Kolek J., Skutkova H., Branska B., Provaznik I., Patakova P.: J. Biotechnol. 214, 113 (2015).
6. Goris J., Konstantinidis K. T., Klappenbach J. A., Coenye T., Vandamme P., Tiedje J. M.: Int. J. Syst. Evol. Microbiol. 57, 81 (2007).
7. Auch A. F., Jan M., Klenk H.-P., Göker M.: Stand. Genomic Sci. 2, 117 (2010).
8. Meier-Kolthoff J. P., Auch A. F., Klenk H.-P., Göker M.: BMC Bioinformatics 14, 1 (2013).
9. Kolek J., Sedlar K., Provaznik I., Patakova P.: Biotechnol. Biofuels 9, 1 (2016).
10. Wang Y., Li X., Blaschek H. P.: Biotechnol. Biofuels 6, 138 (2013).
11. Kolek J., Branska B., Drahokoupil M., Patakova P., Melzoch K.: FEMS Microbiol. Lett. 363, fnw031 (2016).
12. Wang Y., Li X., Mao Y., Blaschek H. P.: BMC Genomics 13, 102 (2012).
13. Langmead B., Salzberg S. L.: Nat. Methods 9, 357 (2012).
14. Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R.: Bioinformatics 25, 2078 (2009).
15. Carver T., Berriman M., Tivey A., Patel C., Böhme U., Barrell B. G., Parkhill J., Rajandream M. A.: Bioinformatics 24, 2672 (2008).
16. Carver T., Thomson N., Bleasby A., Berriman M., Parkhill J.: Bioinformatics 25, 119 (2009).
17. Wang Y., Li X., Mao Y., Blaschek H. P.: BMC Genomics 12, 479 (2011).