Introduction on Several Popular Nucleic Acids Databases


Changmin Liao
Library, China West Normal University, Nanchong City, P. R. China

Abstract-Nucleic acids are major biological molecules essential for life, and every day a large number of nucleic acid sequences are released and searched. Many nucleic acid databases are now used to assemble and distribute sequence and structural information on nucleic acids over the internet. To help readers quickly find or submit nucleic acid sequences, this paper introduces seven popular nucleic acid databases: EMBL-Bank (European Molecular Biology Laboratory Nucleotide Sequence Database), GenBank, DDBJ (DNA Data Bank of Japan), GDB (Human Genome Database), EID (Exon-Intron Database), EPD (Eukaryotic Promoter Database) and RNAiDB (RNA Interference Database).

Keywords-Nucleic Acids; Database; Sequence; Submission and Search

I. INTRODUCTION

Nucleic acids, which include DNA (deoxyribonucleic acid) and RNA (ribonucleic acid), are the most important biological molecules in all living things, allowing organisms to transfer genetic information from one generation to the next. They were first isolated by Friedrich Miescher in 1869, and the finding was published in 1871 [1]. Research on nucleic acids is a major part of modern biology and medicine, and it forms the groundwork for genomics, biotechnology and the pharmaceutical industry. A large number of DNA and RNA sequences have been released and are available on the internet. To make it easier to submit or find a desired nucleic acid sequence, seven popular databases are briefly introduced here: EMBL-Bank (European Molecular Biology Laboratory Nucleotide Sequence Database), GenBank, DDBJ (DNA Data Bank of Japan), GDB (Human Genome Database), EID (Exon-Intron Database), EPD (Eukaryotic Promoter Database) and RNAiDB (RNA Interference Database).

II. POPULAR NUCLEIC ACIDS DATABASES

A. EMBL-Bank

EMBL-Bank was founded by the European Molecular Biology Laboratory (EMBL) in [...]. Currently, the database is maintained by the European Bioinformatics Institute (EBI). EMBL is at the forefront of innovation in life sciences research, technology development and transfer, and provides outstanding training and services to the scientific community in its member states [2]. This publicly funded non-profit institute is housed at five sites in Europe, and its expertise covers the whole spectrum of molecular biology.

EMBL-Bank, which is part of EMBL, incorporates, organizes and distributes nucleotide sequences derived from public sources. The database is part of an international collaboration with DDBJ (Japan) and GenBank (USA). Data are exchanged between the three collaborating databases on a daily basis to keep them synchronized.

A web-based submission tool is the preferred system for individual submission of nucleotide sequences, including Third Party Annotation (TPA) and alignment data. Automatic submission procedures are used for data from large-scale genome sequencing centers and from the European Patent Office. Database releases are produced quarterly, and the latest data collection can be accessed via FTP and WWW interfaces. EMBL-Bank's Sequence Retrieval System integrates and links the main nucleotide and protein databases as well as many other specialist molecular biology databases. For sequence similarity searching, a variety of tools (e.g. FASTA and BLAST) allow external users to compare their own sequences against the data in EMBL-Bank [3-5]. The homepage of EMBL-Bank is shown in Figure 1.

Fig. 1 The homepage of EMBL-Bank
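The similarity searches mentioned above rely on tools such as FASTA and BLAST. The sketch below is neither of those tools; it only illustrates, on toy data, the word (k-mer) matching idea that such heuristics start from before any alignment is attempted. The function names, the choice of k = 4 and the two example entries are assumptions made for illustration and do not come from EMBL-Bank.

```python
from collections import Counter

# Hypothetical two-entry "database" in FASTA format (toy data, not real records).
TOY_DATABASE = """\
>seqA hypothetical entry
ACGTACGTAC
>seqB hypothetical entry
TTTTTTTTTT
"""


def parse_fasta(text):
    """Parse FASTA-formatted text into a {identifier: sequence} dict."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word after '>'.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmer_score(query, subject, k=4):
    """Fraction of the query's k-mers (with multiplicity) also present in subject."""
    query_words = kmer_counts(query, k)
    subject_words = kmer_counts(subject, k)
    total = sum(query_words.values())
    if total == 0:
        return 0.0
    shared = sum(min(n, subject_words[word]) for word, n in query_words.items())
    return shared / total


def rank_hits(query, database, k=4):
    """Return (identifier, score) pairs sorted from best to worst score."""
    scores = [(name, shared_kmer_score(query, seq, k)) for name, seq in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_hits("ACGTACGT", parse_fasta(TOY_DATABASE)):
        print(name, round(score, 2))
```

A real search service adds alignment extension and statistical scoring on top of this kind of word screen, which is why submitting the query to the database's own FASTA or BLAST service remains the practical route.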

B. GenBank

GenBank was established by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM) located on the campus of the US National Institutes of Health (NIH) in Bethesda (MD, USA) [6]. GenBank is part of the International Nucleotide Sequence Database Collaboration and contains a large number of publicly available nucleotide sequences from almost [...] formally described species [7]. The most important source of new data is direct submission by scientists; GenBank staff assign accession numbers to these submissions upon receipt. Daily data exchange with EMBL-Bank and DDBJ ensures worldwide coverage.

GenBank is accessible through the NCBI Entrez retrieval system, which integrates data from the major DNA and protein sequence databases with taxonomy, mapping, genome, protein structure and domain information, and with the biomedical literature via PubMed. BLAST at NCBI provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database can be browsed and downloaded over the internet [8-11]. GenBank and its related retrieval and analysis services can be accessed from the NCBI homepage (Figure 2).

Fig. 2 Homepage of GenBank within NCBI

C. DDBJ

DDBJ, the DNA Data Bank of Japan, was founded in 1986. At present it is organized and managed by CIB-DDBJ (Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics) [12]. DDBJ is the sole nucleotide sequence data bank in Asia that is officially certified to collect nucleotide sequences from researchers and to issue internationally recognized accession numbers to data submitters. DDBJ collects sequence data mainly from Japanese researchers, but it also accepts data from, and issues accession numbers to, researchers in any other country.

Because DDBJ exchanges the collected data with EMBL-Bank and GenBank on a daily basis, the three data banks share virtually the same data at any given time. This virtually unified database is called INSD (International Nucleotide Sequence Database). DDBJ collected and released 3,637,446 entries / 2,272,231,889 bases between July 2009 and June [...]. [...]% of INSD data from Japanese researchers are submitted through DDBJ. The principal purpose of DDBJ operations is to improve the quality of INSD as a public-domain resource [13, 14]. The homepage of DDBJ is shown in Figure 3.
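Because the three data banks share accession numbers, a record submitted to any one of them can be requested by that accession from the others. As a rough sketch of what such a lookup can look like on the GenBank side, the code below checks whether a string has one of the common INSD nucleotide accession shapes and then builds a request URL for NCBI's E-utilities `efetch` service. The use of E-utilities is an assumption of this example (the text above only names Entrez), the pattern is deliberately simplified and does not cover every accession type (for instance whole-genome shotgun accessions), and the helper names are hypothetical. No network request is made.

```python
import re
from urllib.parse import urlencode

# Simplified pattern (assumption): one letter + five digits, or two letters +
# six or eight digits, optionally followed by a ".version" suffix.
ACCESSION_RE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})(?:\.\d+)?$")

# Base address of the NCBI E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def looks_like_accession(text):
    """Return True if text has one of the accession shapes accepted above."""
    return bool(ACCESSION_RE.match(text.strip().upper()))


def efetch_fasta_url(accession):
    """Build (but do not send) an efetch URL asking for a nucleotide record as FASTA."""
    accession = accession.strip().upper()
    if not looks_like_accession(accession):
        raise ValueError(f"not a recognized accession shape: {accession!r}")
    params = {
        "db": "nuccore",      # Entrez nucleotide database
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # U49845 is used here only as an example accession.
    print(efetch_fasta_url("U49845"))
```

Validating the accession before building the URL keeps obviously malformed input from being sent to the service; the returned URL could then be opened with any HTTP client.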

Fig. 3 Homepage of DDBJ

D. GDB

In 1989, the Howard Hughes Medical Institute provided funding to establish a central repository for human genetic mapping data. This project ultimately resulted in the creation of GDB in 1990 [15]. GDB was a key database in the Human Genome Project (HGP). Established under the leadership of Drs. Peter Pearson and Dick Lucier, GDB received significant financial support from the US Department of Energy and the NIH. Located at the Johns Hopkins University School of Medicine, GDB became a source of high-quality mapping data, which were made available both online and through numerous printed publications. In 1998, a change of focus in the HGP redirected funds that had previously been available for GDB. However, that same year Dr. A. Jamie Cuticchia obtained funding from Canadian public and private sources to continue the operations of GDB [16, 17].

GDB gathers only genome mapping data from the HGP, including research results on human DNA structure and over 100,000 human gene sequences from around the world. Currently, it is maintained by the Canada Children's Hospital Biological Information Center. Its contents include:
(1) a large body of sequence information, consisting of genes, clones, breakpoints, cytogenetic markers, fragile sites, repeat fragments, etc.;
(2) schematic maps of the human genome, including cytogenetic maps, linkage maps, radiation hybrid maps, integrated maps and so on;
(3) variation within the human genome, covering gene mutations and gene polymorphisms; and
(4) allele frequency data [18].
Information related to GDB is shown in Figure 4.

Fig. 4 Homepage of Human Genome Project including GDB

E. EID

EID was established by the National University of Singapore, the University of Chicago (USA) and the Ludwig Cancer Research Institute (USA) on August 1st, 1999 [18]. EID is a database of protein-coding, intron-containing genes. It contains gene information from humans, mice, rats and other eukaryotes, including genes from species whose genomes have not been completely sequenced. The database incorporates information on the exon-intron structure of eukaryotic genes: it stores the sequences of introns and exons and their locations within each gene. Features in the database include the intron nucleotide sequence, the amino acid sequence of the corresponding protein, the position of the introns at the amino acid level, and the intron phase.

Four additional databases have been generated from EID, whose entries contain predicted introns, experimentally defined introns, organelle introns or nuclear introns, respectively. Moreover, EID collects and summarizes related information from the internet and provides a search service. EID is accessible through a retrieval system with pointers to GenBank. The database can be searched by keyword, locus name, NID, accession number or protein length [19, 20]. Information related to EID is freely available (Figure 5).

Fig. 5 The web including relevant knowledge with EID

F. EPD

EPD was founded by the Weizmann Institute of Science (Israel) in 1988 and mainly records promoter sequences present in the EMBL nucleotide sequence database. The original purpose of the database was to provide a rich data resource for comparative analysis and to help in identifying transcription control elements and in developing prediction algorithms for eukaryotic promoters [18].

EPD is an annotated, non-redundant collection of eukaryotic RNA polymerase II promoters (including those of multicellular animals and plants) for which the transcription start point has been determined experimentally [21, 22]. Access to promoter sequences is provided by pointers to positions in nucleotide sequence entries. The number of promoters in EPD has grown along with the growth of EMBL-Bank. EPD is a specialized annotation database of the EMBL Data Library. It provides information about eukaryotic promoters available in the EMBL Data Library and is intended to assist experimental researchers, as well as computational analysts, in the investigation of eukaryotic transcription signals [23]. The homepage of EPD is shown in Figure 6.
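Since EPD identifies a promoter by a pointer into a nucleotide sequence entry rather than by a separate copy of the sequence, retrieving the promoter means slicing the entry around the transcription start point. The following is a minimal sketch of that step, assuming the pointer consists of a 1-based start position on the entry's forward strand plus a strand sign. This layout, the function names and the window sizes are illustrative assumptions, not EPD's actual record format.

```python
# Translation table used to complement DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def promoter_window(entry_seq, tss, strand="+", upstream=10, downstream=5):
    """Extract the region around a transcription start site (TSS).

    tss is the 1-based forward-strand coordinate of the first transcribed
    base (+1). The result covers positions -upstream..-1 and +1..+downstream,
    read 5'->3' on the transcribed strand, so its length is
    upstream + downstream.
    """
    if strand == "+":
        start = tss - 1 - upstream      # 0-based, inclusive
        end = tss - 1 + downstream      # 0-based, exclusive
    elif strand == "-":
        # On the minus strand, upstream lies at higher forward coordinates.
        start = tss - downstream
        end = tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    if start < 0 or end > len(entry_seq):
        raise ValueError("window extends beyond the sequence entry")
    window = entry_seq[start:end]
    return window if strand == "+" else reverse_complement(window)


if __name__ == "__main__":
    entry = "AAAAACCCCCGGGGGTTTTT"  # toy entry, not a real record
    print(promoter_window(entry, tss=11, strand="+", upstream=5, downstream=5))
    print(promoter_window(entry, tss=15, strand="-", upstream=5, downstream=5))
```

Reading the minus-strand window as a reverse complement keeps every returned promoter in the same 5' to 3' orientation relative to its gene, which is what comparative analysis of transcription signals needs.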

Fig. 6 The homepage of EPD

G. RNAiDB

RNAi (RNA interference) is the process by which double-stranded RNA suppresses gene expression at the transcriptional and translational levels, thereby regulating higher-order cellular activities [18]. RNAi technology is therefore of great theoretical and practical significance for the development of biomedical research. RNAiDB was first established in 2000 by Fabio Piano at Cornell University and by Marco Mangone and Lincoln Stein at Cold Spring Harbor Laboratory in the United States.

RNAiDB is used for the archival, distribution and analysis of phenotypic data. The database contains a compendium of publicly available data and provides information on phenotypic results and experimental methods, including raw data in the form of images and streaming time-lapse movies. Together with graphical displays of RNAi-to-gene mappings, phenotypic summaries allow quick, intuitive comparison of results derived from different RNAi assays and visualization of the gene product(s). RNAiDB can be searched with combinatorial queries and with a novel tool named PhenoBlast, which ranks genes on the basis of their overall phenotypic similarity. RNAiDB could serve as a model database for navigating and distributing in vivo functional information on genes from large-scale systematic phenotypic analyses in different organisms [24, 25]. The homepage of RNAiDB is shown in Figure 7.

Fig. 7 The homepage of RNAiDB
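To make the idea of ranking genes by overall phenotypic similarity more concrete, the sketch below scores toy phenotype profiles with the Jaccard index and sorts genes by that score. This is only an illustration of the general idea: the text above does not describe PhenoBlast's actual scoring scheme, the Jaccard index is a substitute chosen here for simplicity, and the gene names and phenotype labels are invented.

```python
def jaccard(a, b):
    """Jaccard index of two collections: shared items / all distinct items."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def rank_by_phenotype(query_gene, profiles):
    """Rank all other genes by similarity of their phenotype sets to the query.

    profiles maps a gene name to the set of phenotypes observed for it.
    Ties are broken alphabetically by gene name so the order is deterministic.
    """
    query = profiles[query_gene]
    scored = [
        (gene, jaccard(query, phenotypes))
        for gene, phenotypes in profiles.items()
        if gene != query_gene
    ]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical RNAi phenotype profiles (invented for illustration).
TOY_PROFILES = {
    "geneA": {"embryonic lethal", "sterile"},
    "geneB": {"embryonic lethal", "sterile", "slow growth"},
    "geneC": {"slow growth"},
    "geneD": {"embryonic lethal"},
}

if __name__ == "__main__":
    for gene, score in rank_by_phenotype("geneA", TOY_PROFILES):
        print(gene, round(score, 2))
```

A set-based score treats every phenotype as equally informative; a production tool would likely weight phenotypes and handle missing observations, which this sketch does not attempt.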

III. CONCLUSIONS

In this paper, seven popular and important nucleic acid databases were briefly introduced, including their founding, the sequence resources they offer, and how nucleic acid sequences can be submitted and searched. Because biology is developing rapidly, every database updates its data continuously. Researchers should therefore choose the database that best matches their studies when submitting or searching for nucleic acid sequences.

ACKNOWLEDGMENTS

This research was funded by the Scientific Research Fund of Sichuan Provincial Education Department of China (13ZA0012). I am grateful to Dr. Xiaohong Liu (College of Life Science, China West Normal University, P. R. China) for his critical reading of my manuscript.

REFERENCES

[1] R. Dahm, Discovering DNA: Friedrich Miescher and the early years of nucleic acid research, Human Genetics, vol. 122, no. 6, pp. [...], [...].
[2] T. Kulikova, P. Aldebert, N. Althorpe, W. Baker, K. Bates, P. Browne, et al., The EMBL nucleotide sequence database, Nucleic Acids Research, vol. 32, no. Database Issue, pp. D27-D30, [...].
[3] T. Kulikova, R. Akhtar, P. Aldebert, N. Althorpe, M. Andersson, A. Baldwin, et al., EMBL nucleotide sequence database in 2006, Nucleic Acids Research, vol. 35, no. Database Issue, pp. D16-D[...], [...].
[4] A. Danchin, Request from the international advisory committee to DDBJ/EMBL/GenBank, Journal of Virological Methods, vol. 135, no. 2, pp. [...], [...].
[5] H. Sugawara, O. Ogasawara, K. Okubo, T. Gojobori, and Y. Tateno, DDBJ with new system and face, Nucleic Acids Research, vol. 36, no. Database Issue, pp. D22-D24, [...].
[6] D. A. Benson, M. Cavanaugh, K. Clark, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers, GenBank, Nucleic Acids Research, vol. 41, no. Database Issue, pp. D36-D42, [...].
[7] [...]
[8] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler, GenBank, Nucleic Acids Research, vol. 36, no. Database Issue, pp. D25-D30, [...].
[9] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers, GenBank, Nucleic Acids Research, vol. 37, no. Database Issue, pp. D26-D31, [...].
[10] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers, GenBank, Nucleic Acids Research, vol. 38, no. Database Issue, pp. D46-D51, [...].
[11] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers, GenBank, Nucleic Acids Research, vol. 39, no. Database Issue, pp. D32-D37, [...].
[12] H. Sugawara, O. Ogasawara, K. Okubo, T. Gojobori, and Y. Tateno, DDBJ with new system and face, Nucleic Acids Research, vol. 36, no. Database Issue, pp. D22-D24, [...].
[13] E. Kaminuma, J. Mashima, Y. Kodama, T. Gojobori, O. Ogasawara, et al., DDBJ launches a new archive database with analytical tools for next-generation sequence data, Nucleic Acids Research, vol. 38, no. Database Issue, pp. D33-D38, [...].
[14] E. Kaminuma, T. Kosuge, Y. Kodama, H. Aono, J. Mashima, T. Gojobori, et al., DDBJ progress report, Nucleic Acids Research, vol. 39, no. Database Issue, pp. D22-D27, [...].
[15] S. Hideaki, Data release in human genome sequencing projects, Tanpakushitsu Kakusan Koso / Protein, Nucleic Acid and Enzyme, vol. 48, no. 13, pp. [...], [...].
[16] G. Sapp, Drawing the map of life: inside the human genome project, Library Journal, vol. 136, no. 4, pp. [...], [...].
[17] J. Witkowski, Drawing the map of life: inside the human genome project, Nature, vol. 466, no. 7309, pp. [...], [...].
[18] H. Xiao, et al., Search for life science and medicinal information, 1st ed., Science Press, Beijing, 2007, pp. [...].
[19] M. Sakharkar, F. Passetti, J. E. de Souza, M. Long, and S. J. de Souza, ExInt: an Exon Intron Database, Nucleic Acids Research, vol. 30, no. 1, pp. [...], [...].
[20] V. Shepelev and A. Fedorov, Advances in the exon-intron database (EID), Briefings in Bioinformatics, vol. 7, no. 2, pp. [...], [...].
[21] R. C. Perier, V. Praz, T. Junier, C. Bonnard, and P. Bucher, The eukaryotic promoter database (EPD), Nucleic Acids Research, vol. 28, no. 1, pp. [...], [...].
[22] V. Praz, R. Perier, C. Bonnard, and P. Bucher, The eukaryotic promoter database, EPD: new entry types and links to gene expression data, Nucleic Acids Research, vol. 30, no. 1, pp. [...], [...].
[23] C. D. Schmid, V. Praz, M. Delorenzi, R. Perier, and P. Bucher, The eukaryotic promoter database EPD: the impact of in silico primer extension, Nucleic Acids Research, vol. 32, no. Database Issue, pp. D82-D85, [...].
[24] V. Stribinskis and K. S. Ramos, Decoding the Riddle: The Dawn of RNAi for the Study of Gene-Gene and Gene-Environment Interactions, Environmental Health Perspectives, vol. 112, no. 4, pp. A210-A211, [...].
[25] K. C. Gunsalus, W. C. Yueh, P. MacMenamin, and F. Piano, RNAiDB and PhenoBlast: Web tools for genome-wide phenotypic mapping projects, Nucleic Acids Research, vol. 32, no. Database Issue, pp. D406-D410, [...].