The MethDB DAS Server


[Epigenetics 1:2, e1-e5, EPUB Ahead of Print: March/April 2006]; 2006 Landes Bioscience

Research Paper

The MethDB DAS Server
Adding an Epigenetic Information Layer to the Human Genome

Vincent Negre
Christoph Grunau*

Institut de Génétique Humaine; CNRS UPR 1142; Montpellier, France

*Correspondence to: Christoph Grunau; Institut de Génétique Humaine; CNRS UPR 1142; 141 rue de la Cardonille; Montpellier; France

Received 10/03/06; Accepted 04/05/06

This manuscript has been published online, prior to printing for Epigenetics, Volume 1, Issue 2. Definitive page numbers have not been assigned. The current citation is: Epigenetics 2006; 1(2). Once the issue is complete and page numbers have been assigned, the citation will change accordingly.

KEY WORDS
biological database, distributed annotation system, DNA methylation, human genome

ABBREVIATIONS
MethDB: DNA methylation database
DAS: distributed annotation system
LDAS: lightweight DAS server
CGI: CpG island

ACKNOWLEDGEMENTS
This work was supported by a grant of the BioSTIC Languedoc-Roussillon. We are grateful for technical support from the bioinformatics group of the Laboratoire d'informatique, de Robotique et de Microélectronique de Montpellier (LIRMM).

ABSTRACT
The DNA methylation database MethDB was developed to collect and standardize the dispersed data about this epigenetic phenomenon in a common resource. In the first version of MethDB, data were gathered by annotators and the database could only be queried. In a second step, we added an on-line data submission system that is open to the public. Here we present the DAS annotation server of MethDB, which allows the integration of MethDB into the network of biological databases via the Distributed Annotation System (DAS) and the representation of DNA methylation data as an epigenetic information layer on the human genome.
In order to validate our system and to incorporate the data of the first large-scale methylation analysis of the human genome, we assembled the sequences of the human CpG island tagging project into CpG islands and imported them into MethDB. The database now contains methylation content data and 5382 methylation patterns or profiles for 48 species, 1511 individuals, 198 tissues and cell lines and 79 phenotypes.

INTRODUCTION
Methylation of cytosine residues is a covalent modification of genomic DNA that adds epigenetic information to the primary DNA sequence. The goal of the DNA methylation database MethDB is to collect, standardize and annotate the available DNA methylation data and to make them available. The database is the major source for experimentally confirmed DNA methylation data. These data are regularly consulted via a dedicated web server. Recently, MethDB provided the entire training data set for the development of an algorithm to predict methylated sites in the human genome (1). Here we describe the establishment of a new service that allows the integration of epigenetic data stored in MethDB into the network of biological databases via the Distributed Annotation System (DAS) (2).

METHODS
Annotation files for the MethDB LDAS were generated with el-dasionator (atgc.lirmm.fr/cgi-bin/ldas/form.pl) (3). CpG island sequences were downloaded as a FASTA file and assembled into contigs with the Staden package (4). Details are shown in Table 1.

RESULTS
Establishment of a distributed annotation system annotation server for MethDB. Methylation can be thought of as an additional information layer on the DNA. The idea of information layers is also applied in the Distributed Annotation System, where annotations are anchored to a reference sequence (e.g., the human genome) and superposed in appropriate browsers as layers of annotations.
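As a rough illustration of how a client asks a DAS annotation server for the features anchored to a stretch of the reference sequence, the sketch below builds a features request and parses a DASGFF-style reply. The base URL http://example.org/das, the source name methdb, the coordinates and the sample XML are placeholders invented for illustration; they are not the actual MethDB address or data.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical values: the real MethDB DAS address is not given in the text.
BASE_URL = "http://example.org/das"
SOURCE = "methdb"


def features_url(segment, start, stop):
    """Build a DAS 'features' request for one region of the reference."""
    query = urlencode({"segment": f"{segment}:{start},{stop}"}, safe=":,")
    return f"{BASE_URL}/{SOURCE}/features?{query}"


# Invented reply in the DASGFF layout (one segment, one feature).
SAMPLE_REPLY = """<DASGFF>
  <GFF version="1.0" href="http://example.org/das/methdb/features">
    <SEGMENT id="17" start="100" stop="900">
      <FEATURE id="CpG_island:1" label="CpG_island:1">
        <TYPE id="CpG_island">CpG_island</TYPE>
        <START>150</START>
        <END>420</END>
        <SCORE>0</SCORE>
        <LINK href="http://example.org/methdb?id=1">details</LINK>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""


def parse_features(xml_text):
    """Return one dict per FEATURE element found in a DASGFF document."""
    root = ET.fromstring(xml_text)
    rows = []
    for seg in root.iter("SEGMENT"):
        for feat in seg.iter("FEATURE"):
            rows.append({
                "segment": seg.get("id"),
                "id": feat.get("id"),
                "type": feat.findtext("TYPE"),
                "start": int(feat.findtext("START")),
                "end": int(feat.findtext("END")),
                "score": float(feat.findtext("SCORE")),
                "link": feat.find("LINK").get("href"),
            })
    return rows


if __name__ == "__main__":
    print(features_url("17", 100, 900))
    for row in parse_features(SAMPLE_REPLY):
        print(row)
```

A genome browser would send such a request to every attached annotation server and draw each reply as a separate layer; the LINK element is what allows a displayed rectangle to point back to the source database.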
Data exchange between reference server and annotation server follows standardized protocols; consequently, several independent annotation servers can be connected to a reference sequence. We established an annotation server for MethDB using the Lightweight DAS (LDAS) server package (servers/). Human DNA sequences for which methylation data are available in MethDB were aligned to the Ensembl reference sequence, and LDAS-compatible annotation files were generated with el-dasionator. The MethDB DAS server is updated at regular intervals and can be accessed using the DAS protocol. The Ensembl ContigView (5) is a popular genome browser that uses the DAS protocol. It provides a reference server and allows for the attachment of external DAS sources. A detailed description of how the MethDB DAS server can be attached to the Ensembl ContigView is available on the MethDB home page. For a given computer and a given web browser, this attachment procedure has to be done only once. Since several JavaScripts have recently been included in the Ensembl browser, it appears that for compatibility reasons Firefox (www.mozilla.com) is the web browser of choice. A list of web browsers that have been successfully tested is available on our instruction page. After attachment, regions for which methylation data are available are represented in the Ensembl ContigView as colored rectangles superimposed on the corresponding genomic sequence. These rectangles are hyperlinked to MethDB; clicking on them displays basic information and provides direct access to the corresponding methylation data in MethDB. For a particular human locus, MethDB can now be queried by two alternative approaches: either directly via the generic query form of MethDB or via the Ensembl browser (or any other DAS-compatible system). To our knowledge, MethDB is the only DNA methylation database that can be directly integrated into Ensembl.

Integration of human CpG island sequences into MethDB. In mammals, methylation is predominantly found in cytosine residues followed by a guanine residue (CpG pair). CpG pairs are underrepresented in the genome except for CpG islands (CGI), where they occur at the statistically expected frequency. CpG islands in the 5' region of genes can be free of methylation while the rest of the genome is methylated.
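The statement that CpG pairs occur at the statistically expected frequency inside CGI is commonly quantified as an observed/expected ratio. The sketch below uses the widely used definition (number of CpG x sequence length) / (number of C x number of G); this measure is not taken from the present paper, and the two example sequences are invented.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G)."""
    seq = seq.upper()
    n_c = seq.count("C")
    n_g = seq.count("G")
    n_cpg = seq.count("CG")
    if n_c == 0 or n_g == 0:
        # No C or no G: the expected count is zero, so report 0.0.
        return 0.0
    return n_cpg * len(seq) / (n_c * n_g)


def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    island_like = "CGCGCGATCGCG"   # invented, CpG-rich
    depleted = "CAGTCAGTCAGT"      # invented, no CpG pairs
    for name, seq in [("island-like", island_like), ("depleted", depleted)]:
        print(name, round(gc_content(seq), 2), round(cpg_obs_exp(seq), 2))
```

A ratio close to or above 1 means CpG pairs are about as frequent as the base composition predicts, whereas bulk genomic DNA gives values well below 1.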
In an experimental approach based on this characteristic hypomethylation, Cross and colleagues had generated a library of human CpG islands using an affinity column that divides the genome into methylated and unmethylated parts (6). Later, this library was sequenced by the CpG island tagging project, and sequences were made available for download at HGP/cgi.shtml as a single file of concatenated, unordered FASTA sequences. In this form, the sequences could not be queried, and no information about their location on the genome was present. In addition, the sequence files still contained vector contaminations. In order to make accessible the information that these particular DNA sequences are hypomethylated, and to show that the MethDB DAS server is capable of handling large data sets, we downloaded the sequences, filtered them and assembled them into contigs. These 13786 sequences, representing experimentally confirmed hypomethylated areas in the genome, were imported into MethDB, and an arbitrary methylation score of 0 was assigned to them (0 = no methylation, 1 = maximum methylation). Of these, 1618 sequences do not contain CpG pairs and must probably be considered false positives. Data upload to MethDB was accomplished through a simple script, and further processing to the DAS server was done as outlined above.

Table 1. Assembly of sequences from the CpG island tagging project

Treatment | Number of sequences after treatment
Download |
First filter (100 bp, 5% unknown bases) |
Pre-gap with automatic vector clipping |
Gap4 alignment with 20 nucleotides initial match, 25 maximum pads per read, 5% maximum mismatch | contigs assembled, average length 163 bp (2265 without CpGs)
Final filter (50 bp, 5% unknown bases) | (1618 without CpGs)

Figure 1. Representation of independent cross-confirmation of experimental results. Ensembl ContigView representation of a region of chromosome 17. The upper lane of the annotations for the reverse strand shows the exon-intron structure (Ensembl trans.). The following lane (CpG islands) shows predicted CpG islands. The annotation layer CPG island clones is empty. The following layer contains DNA methylation annotations in MethDB and is labeled MethDB. Clicking on each colored rectangle provides additional information (not shown). Annotations in the MethDB layer are hyperlinked to MethDB. The predicted CGI and the independent methylation data in MethDB point toward the same region.

Figure 2. Superposition of independent annotation layers allows for the reconstruction of CGI. ContigView of a region of chromosome 11 around the CALCA gene (www.ensembl.org/homo_sapiens/contigview?). The green rectangle corresponds to the CGI assembled from annotation layers MethDB and CPG island clones.

DISCUSSION
Using a large CGI sequence data set, we showed that MethDB is capable of integrating several thousand sequence-anchored methylation data and representing them as an annotation layer via DAS. Because it is still not entirely clear what defines the genomic regions that become unmethylated or methylated during development, we believe that the superposition of different kinds of information, such as sites of transcription, exon positions and epigenetic data, will help to put forward experimentally testable hypotheses. In the following, we give examples of how MethDB data can be used in combination with other information layers to cross-confirm methylation data from different sources, identify the borders of CGI and visualize the methylation within, reconstruct methylation profiles along a gene, rapidly identify problems in the experimental set-up, identify new CGI, and better plan experiments based on readily available data.

Example 1: CGI of BRCA1. Cross-confirmation of experimental data. Figure 1 shows the superposition of three information sources representing the exon-intron structure of BRCA1, the location of a predicted CGI, and experimental methylation data in MethDB.
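Cross-confirmation of this kind comes down to asking which annotations from independent layers overlap the same stretch of the reference sequence. A minimal sketch, assuming each layer is a list of (name, start, end) tuples with inclusive coordinates; the layer contents, annotation names and coordinates below are invented and do not reproduce the BRCA1 region.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two inclusive intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def annotations_in_region(layers, start, end):
    """Map each layer name to the annotations that overlap the region."""
    hits = {}
    for layer_name, features in layers.items():
        hits[layer_name] = [
            name
            for name, f_start, f_end in features
            if overlaps(f_start, f_end, start, end)
        ]
    return hits


def confirmed_by(layers, start, end):
    """Names of the layers with at least one annotation in the region."""
    hits = annotations_in_region(layers, start, end)
    return sorted(layer for layer, names in hits.items() if names)


# Invented example data: three layers over an arbitrary coordinate range.
LAYERS = {
    "predicted CpG islands": [("predicted_1", 1000, 1800)],
    "CPG island clones": [],
    "MethDB": [
        ("CpG_island:A", 1100, 1500),
        ("5mC:B", 1400, 1700),
        ("5mC:C", 5000, 5200),
    ],
}

if __name__ == "__main__":
    print(annotations_in_region(LAYERS, 1000, 1800))
    print(confirmed_by(LAYERS, 1000, 1800))
```

A region reported by two or more independent layers is a candidate for mutual confirmation; an empty layer simply contributes nothing.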
The location of the transcription start, the G + C content and the CpG density suggest a CGI in this region (Fig. 1). The MethDB annotation CpG_island:10521 confirms the existence of an experimentally confirmed hypomethylated area in this region. MethDB annotations 5mC:101, 102 and 106 represent independent experiments that analyzed the methylation state of this region. These are literature data linked to the corresponding publication for further details. All independent data sources point toward the same area, resulting in mutual confirmation.

Example 2: CALCA. Reconstruction of a CGI. CGI can be predicted with bioinformatics tools. However, the sensitivity and specificity of such a prediction naturally depend on the chosen parameters, and, like any prediction, it needs experimental confirmation. A more appropriate approach would be to identify CGI based on their hypomethylation state. In large-scale CGI cloning approaches, large CGI will, for technical reasons, not be covered entirely but must be reconstructed. Superposition of different data sources facilitates this task. An example is shown in Figure 2. Three different annotation layers are shown: the intron-exon structure of CALCA, CPG island clones of an independent CGI cloning project, and annotations of MethDB. No CGI is predicted in this region by means of bioinformatics. Merging the data from the independent CGI cloning projects shows that the CGI actually spans from approximately position 14,949,850 to 14,953,050. For a subregion, further methylation data are available in MethDB (5mC:71). None of the individual projects alone would have delivered these data, but combining them results in a natural reconstruction of the most probable CGI.

Figure 3. Methylation data along the GSTP1 gene. Clicking on the MethDB annotation links leads to detailed methylation data for each sequence segment represented by blue rectangles. The information can be used to reconstitute methylation profiles using different data sources.

Figure 4. Direct visualization of strand-specific data. Ensembl ContigView of a region around the GLA gene (www.ensembl.org/homo_sapiens/contigview?). Two annotations are available in MethDB confirming the presence of a CGI. Annotation 5mC:6 points toward experimental data that were obtained analyzing the GLA locus, but the reverse strand was actually investigated.

Example 3: GSTP1. Generation of methylation profiles along a region. Technical constraints limit the length of sequence stretches that can be analyzed for site-specific methylation.
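Both the CGI reconstruction in Example 2 and the assembly of a profile from overlapping or neighboring fragments amount to taking the union of intervals drawn from one or several layers. The sketch below merges inclusive intervals that overlap or lie within a chosen gap of each other. The fragment coordinates are invented so that their union matches the span quoted for CALCA, and the max_gap parameter is an assumption of this sketch, not something described in the paper.

```python
def merge_intervals(intervals, max_gap=0):
    """Merge inclusive (start, end) intervals that overlap or are
    separated by at most max_gap positions."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + max_gap:
            # Extend the current merged interval if this one reaches further.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


# Invented fragments from two hypothetical layers.
CLONE_LAYER = [(14949850, 14950900), (14950700, 14951800)]
METHDB_LAYER = [(14951600, 14953050)]

if __name__ == "__main__":
    print(merge_intervals(CLONE_LAYER + METHDB_LAYER))
```

Setting max_gap above zero treats fragments separated by a short uncovered stretch as one region, which is a judgment call that depends on how densely the region was sampled.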
In general, overlapping or neighboring sequence fragments are analyzed. The reconstruction of methylation profiles can be relatively tedious. The representation as an annotation layer makes this task straightforward. An example is shown in Figure 3. Analysis of MethDB annotations 5mC:37-56 shows that the hypomethylated area actually spans from position 67,107,300 to 67,108,200.

Example 4: Problematic data. Misassignments of methylation patterns to specific genes or genomic locations are surprisingly frequent in the literature. Since MethDB is a curated database, most of these incorrect results have been eliminated during data processing. However, errors cannot be entirely excluded. Representation as an information layer facilitates the identification of problematic data. The MethDB annotation 5mC:6 in Figure 4 actually refers to a methylation analysis of the GLA gene on the reverse strand. Here, the wrong strand was investigated. Since DNA methylation is probably symmetric, this does not invalidate the conclusions drawn in the original publication. However, annotation CpG_island:597 (Fig. 4) confirms the existence of a CGI in this area that is shared between GLA and HNRPH2, and a critical re-evaluation of the original findings might be useful.

Example 5: Identification of new CGI. The superposition of different annotation layers allows for the identification of new CGI with potentially interesting methylation profiles. An example is shown in Figure 5. In a region upstream of p16-ink4a (CDKN2A), both the CPG island clones and MethDB contain several hints for previously unidentified CGI. We decided to further analyze this region and extracted a FASTA sequence of chromosome 9 from position 21,954,000 to 21,960,000. This can easily be done using the corresponding feature of the Ensembl ContigView. We next used MethPrimer for the prediction of CGI.
This combination of experimental data and subsequent detailed bioinformatics analysis allowed the identification of three new CGI. The most 5' CGI is probably associated with a gene for an mRNA with unknown function (EMBL ac.nr. AK128836) that was isolated from a testes library (Fig. 5, annotation layer Ensembl mrna); the two 3' CGI are located within CDKN2A and might have a function in the regulation of this gene. Investigation of their methylation status appears to be worthwhile.

Figure 5. Identification of new CpG islands. Upper lane: Ensembl ContigView of a region downstream of the CDKN2A gene with experimental data in annotation layers CPG island clones and MethDB that suggest the existence of three previously unidentified CGI, represented by green rectangles. Lower lane: representation of the bioinformatics analysis of the same region using MethPrimer. Both experimental data and the prediction software point toward the same regions.

Example 6: MethDB allows better experiment planning. Predicted CGI often span several hundred base pairs, and a detailed analysis of the entire region is costly and labor-intensive. Additional experimental information would be welcome to narrow down unmethylated candidate regions. We have recently investigated the methylation status of the promoter region of the human gene for cyclin D1 (CCND1). The predicted CGI had a length of more than 3 kb, and the analysis of the full sequence was not feasible. MethDB data from the CGI mapping project indicated two hypomethylated subregions within the predicted CGI (two CpG_island annotations). Further examination of the sequence with bioinformatics tools and reporter assays allowed us to identify the actual promoter region adjacent to CpG_island:13960, and the DNA methylation status was studied in a sequence fragment overlapping this annotated region. A detailed description of the experiments has been published (7), and the data will be available in MethDB with the next regular update.

From a technical point of view, it is noteworthy that the sequences of most CpG islands are now probably available in MethDB, and further methylation data that relate in general to CpG islands (e.g., from CGI micro-arrays) can be anchored to these sequences. External data can also be linked to MethDB using the sequence ID.

The MethDB DAS server is hosted by the Laboratoire d'informatique, de Robotique et de Microélectronique de Montpellier (LIRMM) on a recently acquired mainframe, resulting in excellent performance of this server. The speed of information display in the Ensembl ContigView depends on the number of information layers the user chooses to view. The user should be aware that each time the display is refreshed, an enormous amount of data is requested via the network, and should restrict the displayed features accordingly to shorten reply times. The increased amount of data in MethDB has also led to increased access times using the conventional web form, and we decided to update the server hardware. Even complex searches now take less than 6 seconds.
Large-scale DNA methylation mapping projects will in general develop their own databases and often provide some way to represent their data in relation to the genomic sequence. However, medium-size and single-locus studies will generally not make their data available in a database-compatible format. This is regrettable, since these studies usually deliver high-quality data that must be unearthed by conventional bibliographic searches. The availability of a DAS server for MethDB now allows such data to be integrated into an epigenetic information layer for the human genome, to be used concomitantly with other data sources, and to profit from mutual confirmation or the highlighting of conflicting results.

References
1. Bhasin M, Zhang H, Reinherz EL, Reche PA. Prediction of methylated CpGs in DNA sequences using a support vector machine. FEBS Lett 2005; 579:
2. Dowell RD, Jokerst RM, Day A, Eddy SR, Stein L. The distributed annotation system. BMC Bioinformatics 2001; 2:7
3. Negre V, Grunau C. el-dasionator: an LDAS upload file generator. BMC Bioinformatics 2004; 5:55
4. Staden R, Beal KF, Bonfield JK. The Staden package, Methods Mol Biol 2000; 132:
5. Hubbard T, Andrews D, Caccamo M, Cameron G, Chen Y, Clamp M, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, Down T, Durbin R, Fernandez-Suarez XM, Gilbert J, Hammond M, Herrero J, Hotz H, Howe K, Iyer V, Jekosch K, Kahari A, Kasprzyk A, Keefe D, Keenan S, Kokocinsci F, London D, Longden I, McVicker G, Melsopp C, Meidl P, Potter S, Proctor G, Rae M, Rios D, Schuster M, Searle S, Severin J, Slater G, Smedley D, Smith J, Spooner W, Stabenau A, Stalker J, Storey R, Trevanion S, Ureta-Vidal A, Vogel J, White S, Woodwark C, Birney E. Ensembl Nucleic Acids Res 2005; 33 Database Issue:D
6. Cross SH, Charlton JA, Nan X, Bird AP. Purification of CpG islands using a methylated DNA binding column. Nat Genet 1994; 6:
7. Krieger S, Grunau C, Sabbah M, Sola B. Cyclin D1 gene activation in human myeloma cells is independent of DNA hypomethylation or histone hyperacetylation. Exp Hematol 2005; 33:
