Analysis of Bisulfite Sequencing Data from Plant DNA Using CyMATE. Andrea M. Foerster, Jennifer Hetzl, Christoph Müllner, and Ortrun Mittelsten Scheid


Chapter 2

Abstract

Amplifying and sequencing DNA after bisulfite treatment of genomic DNA reveals the methylation state of cytosine residues at the highest resolution possible. However, a thorough analysis is required for statistical evaluation of methylation at all sites in each genomic region. Several software tools have been developed to assist in the quantitative evaluation of bisulfite sequencing data from the complex methylation patterns occurring in plants. This chapter describes the application of Cytosine Methylation Analysis Tool for Everyone (CyMATE). From aligned sequences, CyMATE quantifies and illustrates general and pattern-specific methylation at CG, CHG, and CHH (H = A, C, or T) sites, both per sequence and per position. CyMATE can also perform quality control of sequences and detect redundancy among individual clones. The software can reveal methylation patterns on complementary strands by handling data from hairpin bisulfite sequencing. The tool is freely available for non-commercial use.

Key words: DNA methylation, 5-methylcytosine (5mC), Bisulfite sequencing, Hairpin sequencing, Symmetric/asymmetric DNA methylation, Methylation context, CyMATE

1. Introduction

DNA methylation is an important component of epigenetic gene regulation. It is a more complex process in plants than in other eukaryotes, since it can modify cytosines in every sequence context. Patterns of 5-methylcytosine (5mC) can be detected by sodium bisulfite treatment, which converts nonmethylated cytosine to uracil, whereas 5mC remains unchanged. Following PCR amplification from bisulfite-treated genomic DNA, the converted positions appear as thymine in the amplified sequences.
Igor Kovalchuk and Franz Zemp (eds.), Plant Epigenetics: Methods and Protocols, Methods in Molecular Biology, vol. 631, DOI / _2, Springer Science + Business Media, LLC

Individual clones are sequenced and compared to the original, unmodified genomic template. Changes from C to T residues indicate cytosines that were not methylated, whereas remaining Cs specify methylated cytosines in the genomic template. Multiple clonal sequences obtained from the same biological sample are usually compared to get a statistical representation of DNA methylation at the genomic site under investigation. Manual evaluation is laborious and error-prone due to large data sets. Although there are several tools for computer-assisted evaluation of methylation patterns in mammalian DNA, their analysis is restricted to CG sites. Only a few software tools are available to analyse plant-specific DNA methylation patterns (1-3). This chapter focuses on the application of Cytosine Methylation Analysis Tool for Everyone (CyMATE), which is designed to distinguish and quantify the different DNA methylation classes occurring in plants in the context of CGN, CHG, or CHH. CyMATE analyses cytosine methylation by quantifying total and class-specific methylation as well as the patterns of individual templates or the patterns at specific positions within the master sequence. A color- and shape-coded graphical output in pattern matrix view is supplied together with a detailed statistical analysis and histograms representing the quantitative evaluation (Fig. 1). CyMATE is freely available for non-commercial use.

2. Program Input

CyMATE analyses cytosine methylation patterns based on bisulfite sequencing and the evaluation of transitions from C to T nucleotides in sequences representing individual genomic templates. The protocol for the bench work is described in the chapter "Analysis of DNA Methylation in Plants by Bisulfite Sequencing". The following section describes how to prepare input data for analysis with CyMATE.

2.1. Organization of Input

CyMATE is expected to read pre-aligned sequence data, i.e.
multiple sequence alignment (MSA) files, either in sequential (standard FASTA), interleaved (standard CLUSTAL) or NEXUS format. Steps 1-3 describe how to create an MSA from experimental raw data.

1. Define the reference sequence (the master genomic sequence without any conversion). We recommend using genomic DNA data as a reference sequence, e.g. from NCBI's Nucleotide ( nuccore&itool=toolbar). The reference sequence should not encompass flanking PCR primers, as they do not represent genomic DNA that has undergone conversion.

2. Define the sample sequence(s) (clone(s)) and the master sequence with unambiguous file names to support later identification and sorting. The sequences of experimental clones should not be manually edited or clipped, since trimming regions outside of the master sequence and detection of sequencing errors are much easier after alignment.

3. Combine the data by aligning the master and clonal sequences with appropriate software, e.g. ClustalW ( Tools/clustalw2/index.html) or the desktop version ClustalX. The region of interest can be manually selected using the option Save Sequences as... from the Edit Menu in ClustalX. This feature of ClustalX is also useful to remove any remaining primer/vector sequences from the alignment and to generate blunt-ended MSAs as needed for the use of CyMATE (see Note 1 and Fig. 2). Save the MSA with the master on top (see Note 2) in FASTA, CLUSTAL or NEXUS format. Do not save the file in any other format, e.g. binary DOC or DOCX format.

Fig. 1. A workflow in CyMATE analysis (panels: bisulfite conversion, PCR, cloning and sequencing; generation of multiple sequence alignment; CyMATE backend; output)
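The requirements from steps 1-3 (unambiguous names, a blunt-ended alignment) can be checked locally before an MSA is uploaded. The following Python sketch is an illustration only and is not part of CyMATE; the function names `read_fasta_alignment` and `check_alignment` are hypothetical, and the script assumes the MSA was saved in sequential FASTA format.

```python
def read_fasta_alignment(text):
    """Parse aligned FASTA text into a list of (label, sequence) pairs."""
    records = []
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                records.append((label, "".join(chunks)))
            label, chunks = line[1:].strip(), []
        elif label is not None:
            chunks.append(line.upper())
    if label is not None:
        records.append((label, "".join(chunks)))
    return records


def check_alignment(records):
    """Return a list of problems; an empty list means no problem was found.

    Hypothetical pre-upload checks, not CyMATE's own consistency checks.
    """
    problems = []
    if len(records) < 2:
        problems.append("need a master and at least one clone")
        return problems
    # Blunt-ended: every aligned sequence must have the same length.
    if len({len(seq) for _, seq in records}) != 1:
        problems.append("alignment is not blunt-ended (unequal lengths)")
    # Unambiguous names: every label must occur only once.
    labels = [label for label, _ in records]
    if len(set(labels)) != len(labels):
        problems.append("duplicate sequence labels")
    return problems


example = ">master\nACGTCG\n>clone1\nATGTCG\n"
print(check_alignment(read_fasta_alignment(example)))
```

The script cannot tell whether the first record really is the unconverted master; that still has to be confirmed by eye (see Note 2).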

Fig. 2. Generation of a multiple sequence alignment (MSA)

Fig. 3. A CyMATE analysis web form for single-strand analysis

2.2. Submission for Evaluation

The MSA files prepared in the previous section can now be submitted to CyMATE through the CyMATE website. The program allows an unlimited number of requests.

1. Open the website and, in the Perform Analysis section, select start analysis. Select single strand for the standard analysis (Fig. 3); alternatively, choose double strand (if sequences were generated by hairpin bisulfite sequencing) or other parameters as appropriate (see Note 3).

2. Complete the form field with your e-mail address and use the Choose file button to select and upload your MSA file.

3. Click the Analyse this! button. CyMATE will process your request (see Note 4), write the results into separate text and graphical files, and deliver them to your e-mail address. Within a short period of time (see Note 5), the files will be available for further evaluation.

3. Program Output

To analyse the data, open the e-mail entitled CyMATE Analysis Request Analysis Results, which contains the analysis results as attachments. A successful run of the program will produce up to four different files with the name of your MSA and the extensions .pdf, .txt, .fasta and .afa. The PDF file contains the graphic results of the bisulfite analysis, with filled symbols for methylated and blank symbols for nonmethylated cytosines, and with red circles for CG, blue squares for CHG, and green triangles for CHH sites (see Note 6). The plain text file includes the complete methylation analysis of the uploaded data file, mostly in tabular form (see Note 7). The FASTA file corresponds to the original input file, the AFA file to the converted MSA file.

4. Additional Features of CyMATE

4.1. Options for Single Strand Mode

Most routine applications will require only the simple and straightforward basic queries described above. For specific applications, however, a number of additional features can be selected through the CyMATE web interface. This section describes the optional features of CyMATE at a glance. In the Single strand mode, the analysis can be restricted to specific methylation classes (any one, any two, or all three). This option is available after selecting the analysis mode under Enter Parameters for the single strand analysis.
For example, de-selecting the Class 2 and Class 3 checkboxes restricts the analysis to Class 1 (CG) only (Fig. 3). Another option is Mutation search. By selecting this option, the text output will be extended by an additional part entitled rvdiff, providing a detailed mismatch analysis for each individual sample sequence indicating every heterogeneous position apart from a C-to-T transition with reference to the master sequence.
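The idea behind such a mismatch search can be sketched in a few lines of Python. This is a hedged illustration, not CyMATE's implementation: the function name `non_conversion_mismatches` is hypothetical, the layout of the actual rvdiff output is not reproduced, and both sequences are assumed to come from the same blunt-ended alignment.

```python
def non_conversion_mismatches(master, clone):
    """List (position, master_base, clone_base) for every mismatch that is
    not a C-to-T transition. Positions are 1-based alignment columns."""
    if len(master) != len(clone):
        raise ValueError("sequences must be aligned to the same length")
    hits = []
    for pos, (m, c) in enumerate(zip(master.upper(), clone.upper()), start=1):
        if m == c:
            continue
        if m == "C" and c == "T":
            continue  # expected bisulfite conversion, not a mutation
        hits.append((pos, m, c))
    return hits


# Column 2 is a C-to-T conversion and is ignored; column 5 (C to G) is reported.
print(non_conversion_mismatches("ACGTCA", "ATGTGA"))
```

In this sketch, gap characters in the clone are reported like any other mismatch, which may or may not match how CyMATE treats them.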

4.2. Redundancy Check

CyMATE offers useful features for analyzing sequences apart from differences in their methylation state. The analysis mode Redundancy (together with selecting the group-output option and excluding the master sequence) can be used to detect identical clones. In the case of methylated sequences, such clones most likely indicate redundancy produced by PCR rather than identical genomic templates, and they therefore reduce the significance of the results obtained. In the group-wise analysis of CyMATE, identical clones turn up first and appear together in a group. All but one member of this group should be removed from the data set.

The analysis mode Mismatch operates similarly, revealing differences with respect to a master sequence. As for the single strand mode, the reference sequence must be on top of the MSA file. This feature can be used either independently or in addition to the single-strand data analysis mode by selecting Mutation search (see Subheading 4.1). If the feature is used independently, a detailed text file will be created, showing all mismatches in the MSA. The mismatch analysis will include C-to-T conversions in each sequence for each position.

4.3. Double Stranded Analysis

While it is usually sufficient to analyze methylation patterns on one DNA strand (especially for symmetric methylation sites), it may sometimes be of interest to gain information about modification of the anti-parallel strand. The elegant method of hairpin bisulfite sequencing (4, 5), in which the two strands are ligated prior to denaturation and bisulfite conversion via a linker with a unique sequence fingerprint, allows the analysis of complementary strands from the same genomic template (see Note 8). CyMATE can process sequence information obtained by double strand analysis. A module called CyMATEads has been implemented and described recently (6). It is available in the Perform analysis section under double strand data and the analysis mode Double strand. It requires entering the hairpin linker (HPL) sequence in the field Hair-Pin-Linker in the Enter Parameters for the double strand-analysis section. CyMATE will automatically discriminate between the top and the bottom strand. The HPL, single-stranded overhang regions, and regions of pairing between the HPL and the genomic complementary sequence will be excluded from the analysis. CyMATE will deliver detailed results in PDF and plain text format by e-mail.

4.4. Analysis of Two Complementary Strands

CyMATE can further handle sequences generated by different primer sets, which specifically amplify the top or the bottom strand. These sequences do not necessarily represent strands from the same genomic template but are complementary. The analysis mode entitled Two strand will analyse the forward and reverse strand and deliver a detailed analysis in PDF and TXT format, similar to the double strand mode described earlier.
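Why bottom-strand data need separate handling can be shown with a small Python sketch. It rests on an assumption that is not taken from the CyMATE documentation: the bottom-strand clone is written in top-strand orientation, so that an unmethylated bottom-strand cytosine appears as a G-to-A change relative to the master. Reverse-complementing both sequences turns these changes back into ordinary C-to-T transitions. The function names are hypothetical and the sketch says nothing about how CyMATE itself processes the data.

```python
# Complement table for aligned DNA; gaps and ambiguity codes are left unchanged.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(aligned_seq):
    """Reverse-complement an aligned sequence (gap characters are kept)."""
    return aligned_seq.upper().translate(_COMPLEMENT)[::-1]


def bottom_strand_view(master, clone):
    """Return (master, clone) as seen from the bottom strand, so that
    unconverted bottom-strand cytosines read as C and converted ones as T."""
    return reverse_complement(master), reverse_complement(clone)


# Top-strand master AACGG; the bottom-strand clone shows one G-to-A change.
print(bottom_strand_view("AACGG", "AACGA"))
```

In the example, the bottom-strand view of the master is CCGTT and that of the clone is TCGTT: the first cytosine (CHG context) was converted and the second (CG context) was protected.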

4.5. CyMATE Updates

Like other software, CyMATE may be developed further and adapted to new needs if necessary. Therefore, please also consult the current information on the website (see Note 9).

5. Notes

1. Blunt-ended means that each sequence in the alignment has the same length. If required, leading or trailing gaps will be inserted at the start or the end of the sequence during the alignment procedure.

2. The master is expected to be the first sequence in the alignment. ClustalX and ClustalW offer the possibility to conserve the input order by checking the input option in the Alignment Output Format Options menu. There are no restrictions on the length of the sequences or on the total number of sequences following the master sequence.

3. A basic analysis can be done using default parameters. A detailed description of alternative settings is available (see Subheading 4).

4. CyMATE operates in three major phases. (1) For input reading and error detection, CyMATE reads aligned data and identifies each object by its label and its sequence data. CyMATE also differentiates between the master and clone type of the sequences. Furthermore, CyMATE considers data objects either as single strand (default for most analyses), double strand (for hairpin-bisulfite data) or two strand (complementary single strand data). It performs a number of consistency checks, e.g. for the file format. (2) During data analysis, CyMATE first determines all cytosine residues in the master sequence as potential methylation sites with their location (position index) and sequence context (methylation class). Subsequently, each clone is analysed separately with reference to the master. All clone profiles of one MSA are used to create statistics, e.g., the average number of methylated CHG sites at a specific position or the relative number of methylated CG sites in a specific sample. Multiple error checks are performed simultaneously with the evaluation procedures described above. (3) For the production of output files, methylation profiles of individual clones are summarized and written into a text file, including frequency and specificity of methylation per site, per sequence, per methylation class, and globally. Individual profiles are also written into a graphics file to yield the colored matrix-like plot.

5. While processing usually takes only a few seconds, the actual time depends on the internet connection and the number of other simultaneous CyMATE operations.

Fig. 4. The CyMATE pattern matrix output

6. This output file (Fig. 4) can be opened using any image processing software and can be edited and inserted into other documents. Besides the matrix-like plot with shape- and colour-coded symbols, it shows a ruler at the top indicating the relative location and possible clustering of cytosine residues in the sequence. At the bottom, numbers specify the position index of each cytosine residue within the MSA.

7. The text output delivered by CyMATE is divided into three parts. (1) The first part refers to the master sequence and lists the sum of all possible methylation sites in absolute and relative numbers. In addition, cytosines are assigned to class I for CGN, class II for CHG and class III for CHH sequence context, based on the two nucleotides following the cytosine in the sequence. For every class, methylation sites within the master sequence as well as pattern frequency within the master are indicated in absolute and relative values. (2) The second part of the text output contains information about the position of methylation. It specifies the occurrence of methylated (M) vs. nonmethylated (NM) cytosines at each potential methylation site, separately for each class and site, in absolute and relative values. Furthermore, it states the average methylation degree per class and in total. The OK column provides an additional quality control. A value of less than 100% at a specific position indicates, e.g., a sequencing error in the MSA. (3) The third part of the analysis report represents the examination of each individual cloned sequence, divided into relative values as a percentage and absolute values for methylated (M) and non-methylated (NM) sites. Relative values indicate the degree of methylation as a percentage for every single methylation class and in total (AVG). As described for the position-wise analysis, an OK column is included as a quality control. In an additional table, absolute and relative values indicate how many of all methylated residues of each individual clone are found in each methylation class. The plain text data output can be easily transferred into spreadsheets, e.g., Microsoft Excel, to generate histograms for an overview of the degree of methylation at a specific position (Fig. 5) or in total (Fig. 6).

Fig. 5. A histogram of position-based methylation analysis (axes: %mC by position)

Fig. 6. A histogram of global methylation analysis (axes: %mC for total, CGN, CHG and CHH)

8. Hairpin bisulfite PCR (4) is performed after cutting genomic DNA using a restriction enzyme (no cutting within the sequence to be analyzed) and ligating the complementary strands to each other with a stem-loop structure hairpin linker. During bisulfite treatment of the ligated DNA, the double-stranded target is denatured and can be amplified by PCR with primers specific for the top and bottom strand. The PCR products contain both complementary strands in linear but inverted orientation, from which the methylation status of the original double strand template can be deduced. A refinement of the technique (5) was achieved by inserting a degenerate sequence in the hairpin region, thereby distinguishing each genomic DNA template by its individual barcode tag and allowing redundancy reduction.

9. CyMATE is currently being modified to produce additional graphical output for statistical data and additional numerical data (in CSV format) for import into spreadsheet programs like Excel. For more information, updates and feedback, a detailed user guide, several example files, contact details and a Frequently Asked Questions section are available on www.cymate.org.
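The analysis phase outlined in Note 4 and the position-wise table described in Note 7 can be illustrated with a minimal Python sketch. It is not CyMATE's code: the function names, the dictionary keys and, in particular, the way the OK value is computed here (share of clones showing C or T at the site) are assumptions made for illustration. A cytosine followed by G is called CG even when no third base follows, which is a further simplification of the CGN definition.

```python
def classify_cytosines(master):
    """Map the 0-based alignment column of each master cytosine to its
    context ('CG', 'CHG' or 'CHH'); gaps in the master are skipped."""
    cols = [i for i, base in enumerate(master) if base != "-"]
    bases = [master[i].upper() for i in cols]
    sites = {}
    for k, base in enumerate(bases):
        if base != "C":
            continue
        following = bases[k + 1:k + 3]
        if following[:1] == ["G"]:
            sites[cols[k]] = "CG"
        elif len(following) == 2 and following[0] in "ACT":
            if following[1] == "G":
                sites[cols[k]] = "CHG"
            elif following[1] in "ACT":
                sites[cols[k]] = "CHH"
    return sites


def per_position_methylation(master, clones):
    """Count methylated (C) and non-methylated (T) calls per cytosine site."""
    if any(len(clone) != len(master) for clone in clones):
        raise ValueError("all clones must be aligned to the master length")
    table = {}
    for col, context in sorted(classify_cytosines(master).items()):
        calls = [clone[col].upper() for clone in clones]
        m, nm = calls.count("C"), calls.count("T")
        informative = m + nm
        table[col] = {
            "context": context,
            "M": m,
            "NM": nm,
            # Percent methylation among clones with a C or T call.
            "percent_M": 100.0 * m / informative if informative else None,
            # Assumed meaning of the OK column: share of C/T calls.
            "percent_OK": 100.0 * informative / len(calls) if calls else None,
        }
    return table


master = "ACGTCAGCTT"
clones = ["ACGTTAGTTT", "ATGTTAGTTT", "ACGTCAGATT"]
for col, row in per_position_methylation(master, clones).items():
    print(col + 1, row)
```

Results in this form can be written to a spreadsheet to draw histograms like those in Figs. 5 and 6.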

References

1. Hetzl J, Foerster AM, Raidl G, Mittelsten Scheid O (2007) CyMATE: a new tool for methylation analysis of plant genomic DNA after bisulphite sequencing. Plant J 51(3):
2. Gruntman E, Qi Y, Slotkin RK, Roeder T, Martienssen RA, Sachidanandam R (2008) Kismeth: analyzer of plant methylation states through bisulphite sequencing. BMC Bioinformatics 9:
3. Grunau C, Schattevoy R, Mache N, Rosenthal A (2000) MethTools - a toolbox to visualize and analyze DNA methylation data. Nucleic Acids Res 28(5):
4. Laird CD, Pleasant ND, Clark AD, Sneeden JL, Hassan KM, Manley NC et al (2004) Hairpin-bisulphite PCR: assessing epigenetic methylation patterns on complementary strands of individual DNA molecules. Proc Natl Acad Sci USA 101(1):
5. Miner BE, Stoger RJ, Burden AF, Laird CD, Hansen RS (2004) Molecular barcodes detect redundancy and contamination in hairpin-bisulphite PCR. Nucleic Acids Res 32(17):e
6. Muellner C, Hetzl J (2008) CyMATEads: Reliable analysis of cytosine methylation in plant and animal DNA using bisulphite sequence data. Schriftenreihe Informatik 26:43-52
