Detection of allele-specific methylation through a generalized heterogeneous epigenome model


BIOINFORMATICS Vol. 28 ISMB 2012, pages i163–i171
doi:10.1093/bioinformatics/bts231

Qian Peng 1,2,* and Joseph R. Ecker 2,3,4
1 Department of Computer Science and Engineering, University of California-San Diego, 9500 Gilman Drive, La Jolla, CA 92093 and 2 Plant Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, 92037, USA. 3 Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, 92037, USA. 4 Howard Hughes Medical Institute, The Salk Institute for Biological Studies, 10010 North Torrey Pines Road, La Jolla, California 92037, USA

ABSTRACT
Motivations: High-throughput sequencing has made it possible to sequence DNA methylation of a whole genome at single-base resolution. A sample, however, may contain a number of distinct methylation patterns. For instance, cells of different types and in different developmental stages may have different methylation patterns. Alleles may be differentially methylated, which may partially explain why large portions of epigenomes from single cell types are partially methylated, and may have major effects on transcriptional output. Approaches relying on DNA sequence polymorphism to identify individual patterns from a mixture of heterogeneous epigenomes are insufficient as methylcytosines occur at a much higher density than SNPs.
Results: We have developed a mixture model-based approach for resolving distinct epigenomes from a heterogeneous sample. In particular, the model is applied to the detection of allele-specific methylation (ASM). The methods are tested on a synthetic methylome and applied to an Arabidopsis single root cell methylome.
Contact: qpeng@cs.ucsd.edu

1 INTRODUCTION
The advancement of high-throughput sequencing has opened up many important areas of applications, one of which is epigenome sequencing.
DNA methylation may repress or activate transcription, and is known to be involved in embryogenesis, genomic imprinting and tumorigenesis in mammals, and transposon silencing in plants (Bestor, 2000; Li et al., 1992; Lippman et al., 2004; Rhee et al., 2002; Zhang et al., 2006; Zilberman et al., 2007). To understand the regulation and dynamics of DNA methylation, the locations of the modified cytosines need to be identified. The first single-base resolution mappings of DNA methylation were produced for the whole Arabidopsis thaliana genome (Cokus et al., 2008; Lister et al., 2008) and for selected subsets of sites in the mouse genome (Meissner et al., 2008) using various bisulfite sequencing technologies. DNA methylation is the modification of the DNA base cytosine (methylcytosine). A map of DNA methylation at single-base resolution is referred to as a methylome or epigenome. Cells of different types and in different developmental stages may have different methylation patterns. It has been observed that large portions of the Arabidopsis methylomes are partially methylated (Lister et al., 2008). This may be a result of the sample containing a number of distinct methylomes. Aberrant methylation is also a general feature of cancer genomes. A better understanding of methylation patterns in cancer genomes may lead to both new diagnostic markers and therapies based on the detection of methylation changes occurring early in tumorigenesis (Laird, 2003). In a tumor tissue, particularly at an early stage, however, cancerous cells and normal cells are often mixed together. DNA methylation patterns can also act as markers for tracing stem cell expansion and tumor growth (Kim et al., 2005; Shibata and Tavaré, 2006; Yatabe et al., 2001). Making use of methylation patterns in this way requires determining methylation patterns associated with individual cells or cells from the same clone.

*To whom correspondence should be addressed.
When comparing human fibroblast cell IMR90 and H1 embryonic stem cell (ESC) lines, it is observed that IMR90 has a lower level of methylation than H1 ESC (Lister et al., 2009). Both IMR90 and H1 are of a single cell type. While it is expected that large portions (80%) of the X chromosome are partially methylated as the IMR90 cell line is from a female and DNA methylation is known to play an important role in X chromosome inactivation (Riggs, 1975), it is unexpectedly observed that around 38% of IMR90 autosomes are identified as partially methylated domains (PMD). What is the nature of partial methylations in a single cell type? Might allelic differences contribute to the partial methylations? Answering these questions requires detecting allele-specific methylation (ASM) patterns. The methylcytosine is sometimes referred to as the fifth DNA base (Lister and Ecker, 2009). Applying methods for detection of single nucleotide polymorphism (SNP) to methylation, however, may present difficulties. Methylcytosine is much more dynamic than nucleotides and observations generally suggest that methylation of a cytosine site is a statistical event. Unlike SNPs where a nucleotide occurs at a rate of 0, 50 or 100% in a diploid individual (if sequencing errors may be ignored), the methylation level at a particular site may fall anywhere in the range from 0% to 100%. Whether a methylcytosine is allele-specific therefore cannot be determined by the site alone. It needs supporting evidence from the neighboring nucleotides. If a partially methylated cytosine is in the close vicinity of a SNP such that reads are long enough to cover both sites, then it is straightforward to determine from which allele the reads originate, thus determining the methylation level of the respective alleles. Some studies have shown that ASM is associated with SNPs (Kerkel et al., 2008; Shoemaker et al., 2010). The density of methylation across the whole genome, however, is much higher than that of DNA sequence polymorphism.
For instance, while there are close to 3 million SNPs discovered in the human genome, 62 million and 45 million methylcytosines were detected in H1 and IMR90 cells (Lister et al., 2009). It has also been observed that changes in cytosine methylation occur at a frequency much greater than that of DNA sequence mutations (Ossowski et al., 2010; Schmitz et al., 2011). As a result, SNPs are absent in large portions of the methylomes. We will describe a method in this article that detects ASM without the assistance of SNPs. In addition, even though the functionalities of ASM are not well understood except that they play an important role in imprinting (Hellman and Chess, 2007; Kerkel et al., 2008), it seems that the methylation level of an individual cytosine is less important than the overall levels of methylations within a region, which is also in contrast to the SNPs. We, therefore, focus our effort on detecting regions of ASM.

The remainder of the article is organized as follows. A mixture model is described in Section 2 for modeling the outcome of a methylation sequencing experiment where the sample may contain a mixture of heterogeneous epigenomes. It aims at predicting methylation levels for each cytosine in each individual epigenome. Section 3 lays out the details for detecting regions of ASM based on the mixture model and validates the methods on a synthetic methylome. The methods are then applied to an Arabidopsis root cell methylome and the results are listed in Section 4. Section 5 discusses the results and offers some future directions.

© The Author(s) 2012. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

2 A MIXTURE MODEL FOR HETEROGENEOUS EPIGENOMES
In bisulfite sequencing experiments, DNA fragments are treated with sodium bisulfite. The process converts unmethylated cytosines into uracils. The sequences of nucleotides (reads) in the converted fragments are subsequently determined by a sequencer. The reads produced by the sequencer are aligned to a reference genome. Usually only uniquely mapped reads are retained. As a result, what we have is a set of reads that are most similar in sequence to their respective mapped locations in the reference genome, which are presumably the genomic origins of the fragments that produced the reads.
In addition, each cytosine on every mapped read is labeled as either methylated or unmethylated. The methylation level of a particular cytosine is computed as follows: if there are x reads that map to the position, and y out of the x reads have a methylcytosine at this position, then the methylation level is y/x. Note that if the read depth at a cytosine position is below a certain threshold, which is determined by the allowed false positive rate, the methylation is not called, i.e. y = 0 (Lister et al., 2008).

If the original sample is composed of a mixture of epigenomes, be it from a set of different cell types, tissues or alleles, the mapped reads will reflect the mixture. Our goal is to infer the original makeup of the mixture from the mapped reads. It should be noted that the attainment of the goal depends on whether the original epigenomes are sufficiently heterogeneous so that we may distinguish them. As we are only concerned with methylation, we restrict the reads to the genomic positions where the methylation level is greater than zero. The epigenomes and reads may be represented as binary strings, where methylcytosine is set to 1 and the remainder to 0.

Let $R$ be a set of binary strings, which we assume are the reads produced by a bisulfite sequencing experiment further restricted to methylation sites. For string $r \in R$, let $x_{ri}$ be the letter appearing at position $i$ of $r$; let $[r_a, r_b]$ be the positions that $r$ spans. Let $C = \{c_j \mid j = 1, \ldots, k\}$ be the set of natural frequencies of epigenomes, where $c_j$ is the frequency of the $j$-th epigenome, and $k$ is the total number of epigenomes. When the model is used to detect the ASM of a diploid organism, $k$ equals 2. Let $M = \{m_{ij} \mid i = 1, \ldots, n,\ j = 1, \ldots, k\}$, where $m_{ij}$ is the probability of methylation of epigenome $j$ at position $i$, and $n$ is the length of the epigenome. The probability of observing string $r$ is

$$P(r) = \sum_{j=1}^{k} c_j\, p_{rj},$$

where $p_{rj}$ is the probability that string $r$ originates from epigenome $j$, and

$$p_{rj} = \prod_{i=r_a}^{r_b} \bigl( m_{ij} x_{ri} + (1 - m_{ij})(1 - x_{ri}) \bigr). \tag{1}$$

The probability of observing the set $R$ is therefore

$$P(M, C, R) = \prod_{r \in R} P(r),$$

or equivalently the log likelihood

$$l(M, C, R) = \sum_{r \in R} \log P(r).$$

The optimization goal is to determine parameters $C$ and $M$ such that the probability of observing the set $R$ is maximum, thus best explaining the reads. We estimate array $C$ and matrix $M$ by maximizing the likelihood $l$,

$$\operatorname*{argmax}_{M, C} \sum_{r \in R} \log P(r),$$

which can be solved by using the expectation-maximization (EM) algorithm (Dempster et al., 1977). First, we define a membership matrix $A = \{a_{rj} \mid r \in R,\ j = 1, \ldots, k\}$, where

$$a_{rj} = \begin{cases} 1 & \text{if } r \in j, \\ 0 & \text{if } r \notin j. \end{cases}$$

As the membership of string $r$ with respect to epigenome $j$ is unknown, it is estimated by its expected value as

$$a_{rj} = \frac{c_j\, p_{rj}}{\sum_{j'} c_{j'}\, p_{rj'}}. \tag{2}$$

The likelihood can then be rewritten as

$$l(M, C, R, A) = \sum_{r \in R} \sum_{j=1}^{k} a_{rj} \log (c_j\, p_{rj}), \tag{3}$$

and the optimization becomes

$$\operatorname*{argmax}_{M, C} \sum_{r \in R} \sum_{j=1}^{k} a_{rj} \log (c_j\, p_{rj}).$$

Solving the maximization constrained by $\sum_{j=1}^{k} c_j = 1$ yields the update equations at each M-step iteration of the EM algorithm as follows (see Appendix A for the detailed derivation),

$$c_j = \frac{\sum_{r} a_{rj}}{\sum_{j'} \sum_{r} a_{rj'}}, \qquad m_{ij} = \frac{\sum_{r} a_{rj}\, x_{ri}}{\sum_{r} a_{rj}},$$

where the sums in $m_{ij}$ run over the reads $r$ that span position $i$, since $x_{ri}$ is defined only for those reads. When the algorithm converges, the matrix $M$ contains the predicted methylation levels for each epigenome in the mixture.
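As an illustration only, the EM iteration of this section can be sketched in Python. The read encoding (a start index plus a list of 0/1 letters over the restricted methylation sites), the function name `em_mixture`, the fixed iteration count and the clamping constant `eps` are assumptions made for this sketch; they are not taken from the original implementation.

```python
import math
import random


def em_mixture(reads, n, k=2, iters=100, seed=0, eps=1e-6):
    """Fit frequencies C and methylation matrix M of a k-component mixture by EM.

    reads: list of (start, bits); bits[t] is x_ri for site i = start + t.
    n: number of methylation sites; k: number of epigenomes.
    Returns (C, M, A) with M[i][j] = m_ij and A[r][j] = a_rj.
    """
    rng = random.Random(seed)
    # Initialize with a random soft membership matrix A (rows sum to 1).
    A = []
    for _ in reads:
        row = [rng.random() + eps for _ in range(k)]
        total = sum(row)
        A.append([v / total for v in row])
    C = [1.0 / k] * k
    M = [[0.5] * k for _ in range(n)]
    for _ in range(iters):
        # M-step: update c_j and m_ij from the current memberships.
        col = [sum(a[j] for a in A) for j in range(k)]
        C = [col[j] / sum(col) for j in range(k)]
        num = [[0.0] * k for _ in range(n)]
        den = [[0.0] * k for _ in range(n)]
        for (start, bits), a in zip(reads, A):
            for t, x in enumerate(bits):
                for j in range(k):
                    num[start + t][j] += a[j] * x
                    den[start + t][j] += a[j]
        for i in range(n):
            for j in range(k):
                if den[i][j] > 0:
                    # Clamp away from 0 and 1 so the logarithms stay finite.
                    M[i][j] = min(1 - eps, max(eps, num[i][j] / den[i][j]))
        # E-step: a_rj = c_j p_rj / sum_j' c_j' p_rj' (Equation 2), in log space.
        for r, (start, bits) in enumerate(reads):
            logs = []
            for j in range(k):
                lp = math.log(max(C[j], eps))
                for t, x in enumerate(bits):
                    m = M[start + t][j]
                    lp += math.log(m if x else 1 - m)
                logs.append(lp)
            top = max(logs)
            w = [math.exp(v - top) for v in logs]
            s = sum(w)
            A[r] = [v / s for v in w]
    return C, M, A
```

Working in log space avoids underflow of the product in Equation (1) for reads that cover many sites. A fixed number of iterations stands in for a convergence test, which a real implementation would presumably base on the change in likelihood.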

3 DETECTION OF ASM
In this section, we describe how the model is used to detect allele-specific methylated regions in a diploid organism. Notice that ASM is not a precisely defined term. It generally refers to a significant difference between the methylation levels of the two alleles. First, the methylome of a diploid individual is scanned for partially methylated regions (PMRs) as candidates for further analysis. Second, for each candidate region, the reads that align to this region are computationally assigned to the two alleles and the methylation levels of individual cytosines from each allele are estimated. Last, regions are classified as allele-specific or non-specific methylated regions, based on read assignments and the predicted methylation levels from the previous step. A synthetic methylome is used to test the model and to illustrate the details of each step.

We remark that determining whether a read along with its methylcytosines has a higher probability to originate from one allele or the other relies on the differences between the reads, i.e. the methylation states of the cytosines on the reads. The density of methylcytosines of a genome relative to the read length in a sequencing experiment is therefore critical. For instance, if, on average, a read covers at most one methylcytosine, then there is very little hope to deconvolve the allelic methylation states without additional information. While anticipating the rapid growth of the read length in high-throughput sequencing technology, we first tested our method on Arabidopsis thaliana, which has a reasonably high methylation density. The median genomic distance, for instance, between consecutive methylcytosines on chromosome 1 of A. thaliana Col-0 is 15 bp, while the typical read length of an Illumina sequencer is between 100 and 150 bp presently.
3.1 Identify PMRs as candidates
To detect ASM regions, the whole methylome is scanned for PMRs as candidates, as there is obviously not much difference between the two alleles if the methylation level of a region is near nil or complete. A contiguous methylated region (CMR) refers to a genomic region where the genomic distance between any two consecutive methylcytosines is no larger than a separation threshold s, which is set at around a width comparable to the read length. Each CMR is scanned with a fixed-width sliding window where the window width is the number of methylcytosines. A fixed-width window inside a CMR is classified as a PMR if no fewer than 90% of the methylcytosines are at most 70% methylated (Lister et al., 2009). A PMR may be called with or without a specific lower bound for methylation levels. In the data we have analyzed, the average level of methylations in a CMR is at least 25%, because (i) the region has contiguous methylcytosines; and (ii) the allowed false positive rate for calling a methylcytosine (Lister et al., 2008) in combination with the sequencing coverage dictates an implicit lower bound on the methylation levels. Consecutive PMRs within the same CMR are merged into a single region.

The dataset being tested is a synthetic methylome that is made up by combining the reads of two methylomes from Arabidopsis root cells: epidermis (We+) and endodermis (Sc+). The root cells are obtained by flow sorting; their methylomes by MethylC-Seq bisulfite sequencing (Lister et al., 2008). The genomic length of the reads is 83 bp. The reads from one cell type in the mixture are treated as if they are from one allele of the synthetic methylome. The ASM of the synthetic methylome is, therefore, simulated by cell-specific methylation in the mixture.

Table 1. Samples of partially methylated regions for classification

            20 mc        35 mc        50 mc
Samples     CS    NS     CS    NS     CS    NS
Training    255   255    42    44     10    10
Testing     100   100    19    21     6     6

mc: methylcytosine; CS: cell-specific; NS: non-specific
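The candidate scan described at the start of this section can be sketched as follows. This is a minimal illustration under stated assumptions: methylcytosine positions are given as a sorted list with one methylation level per site, consecutive qualifying windows are merged when they overlap or adjoin, and the function names and default parameters (`s=100`, `width=20`, 90% of sites at most 70% methylated) are chosen here to mirror the text rather than taken from the authors' code.

```python
def contiguous_regions(positions, s=100):
    """Split sorted methylcytosine positions into CMRs.

    Consecutive sites stay in one CMR while their genomic distance is <= s.
    Returns a list of lists of indices into `positions`.
    """
    regions, current = [], []
    for idx, pos in enumerate(positions):
        if current and pos - positions[current[-1]] > s:
            regions.append(current)
            current = []
        current.append(idx)
    if current:
        regions.append(current)
    return regions


def partially_methylated_regions(positions, levels, s=100, width=20,
                                 site_frac=0.9, max_level=0.7):
    """Return merged PMRs as (first_index, last_index) pairs, inclusive.

    A window of `width` methylcytosines is a PMR if at least `site_frac`
    of its sites have a methylation level of at most `max_level`.
    """
    merged = []
    for cmr in contiguous_regions(positions, s):
        for w in range(len(cmr) - width + 1):
            window = cmr[w:w + width]
            partial = sum(1 for i in window if levels[i] <= max_level)
            if partial >= site_frac * width:
                first, last = window[0], window[-1]
                # Merge with the previous PMR when both lie in this CMR
                # and the windows overlap or adjoin.
                if merged and merged[-1][2] is cmr and first <= merged[-1][1] + 1:
                    merged[-1][1] = last
                else:
                    merged.append([first, last, cmr])
    return [(first, last) for first, last, _ in merged]
```

The returned pairs index into the list of methylcytosines, so genomic coordinates of a region can be recovered as `positions[first]` and `positions[last]`.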
Forward strand and reverse strand are processed separately. As ASM and cell-specific methylation arise from different biological processes, the patterns of differential methylations might, therefore, carry signatures unique to each type. The aforementioned synthetic methylome is appropriate for testing cell-specific methylations; yet whether it is a good surrogate for testing ASMs may be questioned. We argue that since the classification is largely based on overall methylation levels within a region rather than relations among methylation levels of individual cytosine sites, the criteria similarly apply to both cell-specific methylation and ASM.

The separation threshold s is set to 100 bp. Each CMR is scanned with a sliding window of width 20, 35 and 50 methylcytosines respectively. Each PMR in the synthetic methylome is labeled as either cell-specific (allele-specific), if no fewer than 90% of the cytosines in the region are methylated in only one of the two cells, or non-specific otherwise. For the purpose of learning, consecutive windows within the same CMR are merged into one region only if they are labeled as the same type. For a sliding window of width 20, both cell-specific regions and non-specific regions have a median width of 23 methylcytosines. All of the cell-specific regions and around the same number of randomly selected non-specific regions are kept as samples for further analysis. The samples are divided into training and testing samples (Table 1).

3.2 Predict allelic methylation levels
For each PMR, the algorithm described in Section 2 is applied with $k = 2$. For each sample of the synthetic methylome, the reads from the two individual methylomes are mixed together. One potential problem with the EM algorithm is that it may converge to a local optimum. There are various ways to initialize the parameters. One initialization is to randomly assign each read to a cluster, i.e. set $a_{rj} = 1$ at probability $1/k$. Another option is to set $c_j = 1/k$, and then randomize matrix $M$.
A third option is to randomize the membership matrix $A$, which appears to be the best option after testing with simulated data. We run the algorithm $L$ times, each with a new random matrix $A$ as the initialization. Let $A^1, A^2, \ldots, A^L$ be the membership matrices when each individual run converges, from which two new initializations are derived: $A^{L+1}$ and $A^{L+2}$, for two additional runs (see footnote 1). They are defined as follows: for $r \in R$, $j = 1, \ldots, k$,

$$a^{L+1}_{rj} = \frac{1}{L} \sum_{i=1}^{L} a^{i}_{rj}; \qquad a^{L+2}_{rj} = \begin{cases} 1 & \text{if } j = \operatorname*{argmax}_{j'=1,\ldots,k} \bigl(a^{L+1}_{rj'}\bigr), \\ 0 & \text{otherwise.} \end{cases}$$

Of the total $L+2$ runs, the one that converges to the largest likelihood is selected. The matrix $M$ yields a predicted level of methylations at each cytosine site. Figure 1 illustrates two samples of PMRs from the synthetic methylome: one cell-specific and the other non-specific. Both experimental and predicted methylation states are shown. The experimental data are from the individual methylomes and are thus treated as the gold standard. The predicted methylation states are used for further classification.

Footnote 1: A subtlety here concerns the ordering of the clusters at the end of each run. The details are omitted.

Fig. 1. Experimental (a and b) versus predicted (c and d) methylation levels of the synthetic methylome. Experimental methylation levels from We+ are shown in circles, Sc+ in stars. Two symbols in the bottom figures (c and d) indicate two predicted individual methylomes for the synthetic methylome. The left (a and c) is of an allele-specific region: chromosome 1: [7313026, 7313482] on forward strand. The region contains 50 methylcytosines (mc); 160 reads are aligned to this region. The right (b and d) is of a non-specific region: chromosome 3: [12698733, 12699133] on forward strand. The region contains 42 mc; 367 reads are aligned to the region.

3.3 Classify candidate regions with a support vector machine classifier
Once the methylation levels of individual methylcytosines in each allele are estimated and the membership of reads predicted by the model, what remains is to characterize and classify each region based on the estimations. Recall that we used a rather simplistic rule to automatically label the samples in the synthetic methylome when preparing the training and testing samples. The labels are derived from the knowledge of the two individual methylomes making up the synthetic methylome, and therefore are independent of the predictions made by the mixture model. One option is to use the same rule to classify a region from predicted methylation levels.
We hope to capture more characteristics of the two classes, however, so multiple measures are employed for this task when it is applied to the real epigenome. We also use the synthetic methylome to train a support vector machine (SVM) classifier that is later used for a real single cell methylome. Notice that the value $a_{rj}$ gives the probability that the read $r$ is from allele $j$. If a read needs to be assigned to a single allele, it should be assigned to the $j$ that has the larger value of $a_{rj}$, or formally, to cluster $j^{*} = \operatorname*{argmax}_{j} (a_{rj})$. The features for SVM are extracted from the estimated methylation levels ($M$) and allele frequencies ($C$). A total of nine features are used (details omitted due to exigencies of space). Both linear and radial basis function (RBF) kernels are tested; the latter yields better performance. Five-fold cross-validation is used to select the best parameters for the kernel function from each training set. The testing results are shown in Table 2. In some of the false positive (FP) samples, both cells have completely unmethylated reads and these reads are clustered together in the synthetic methylome, which are then classified as allele-specific (cell-specific). We hypothesize that these regions have ASM in both individual cell types.
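The accuracies reported in Table 2 follow directly from the confusion counts given there, as the short check below shows. The function name `accuracy` and the dictionary layout are illustrative choices for this sketch; the counts themselves are read from Table 2.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of test regions whose predicted class matches the label."""
    return (tp + tn) / (tp + fp + fn + tn)


# Confusion counts (TP, FP, FN, TN) from Table 2, keyed by window width in mc.
table2 = {20: (88, 13, 12, 87), 35: (17, 3, 2, 18), 50: (5, 1, 1, 5)}

for width, counts in table2.items():
    print(width, round(100 * accuracy(*counts), 1))
```

For the three window widths this gives 87.5%, 87.5% and 83.3%, matching the table.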

Table 2. Testing results for classification of partially methylated regions

               20 mc               35 mc        50 mc
Samples        CS        NS        CS    NS     CS    NS
Predicted CS   88 (TP)   13 (FP)   17    3      5     1
Predicted NS   12 (FN)   87 (TN)   2     18     1     5
Accuracy       87.5%               87.5%        83.3%

CS: cell-specific; NS: non-specific

Table 3. Classes based on differential averaged methylation

d            Class name                            Label
[0.0, 0.2]   Similarly methylated                  d_s
(0.2, 0.9)   Moderately differentially methylated  d_m
[0.9, 1.0]   Highly differentially methylated      d_h

Table 4. Predicted ASM regions

Method    svm   lm   dm         mww   Intersection
Regions   277   39   d_h: 19    362   18 (1)
                     d_m: 368         16 (2)
                     d_m              192 (3)

3.4 Identify ASM regions with multiple filters
Using an SVM classifier has its limitations. A classifier trained on one organism will certainly not be appropriate for other organisms. A classifier trained on one cell type may be questionable for another very different type. When parameters for the sequencing experiment change, the classifier should be retrained. In addition, there may not be sufficient data samples for learning at all. Additional measures are therefore necessary for real data.

The Mann–Whitney–Wilcoxon (mww) test is performed on the two clusters resulting from the mixture model. If the null hypothesis is rejected, one cannot readily claim that the methylations are allele-specific. But if on the other hand the null hypothesis is not rejected, it is unlikely that the methylations are allele-specific.

Two additional filters are based on methylation rates. One filter is called low methylation rate (lm). It computes whether one of the two clusters is at most 10% methylated at no fewer than 80% of the sites. The other is called differential averaged methylation rate (dm). The averaged methylation rate for each cluster in the region is defined as

$$\bar{m}_j = \frac{1}{n} \sum_{i=1}^{n} \frac{m_{ij}}{m_{i1} + m_{i2}},$$

and the differential averaged methylation as $d = |\bar{m}_1 - \bar{m}_2|$. We define three classes based on this measure, shown in Table 3.
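The two rate-based filters can be sketched as below. This is a hedged illustration, not the authors' code: it assumes the predicted levels of a region are available as a list of pairs $(m_{i1}, m_{i2})$, reads the lm criterion as "at most 10% methylated at no fewer than 80% of the sites", and skips any site where both predicted levels are zero (a choice made here only to avoid division by zero). The function names are hypothetical.

```python
def differential_methylation(M):
    """Return (mbar_1, mbar_2, d) for a region.

    M: list of (m_i1, m_i2) predicted methylation levels, one pair per site.
    Each level is normalized by the per-site sum before averaging.
    """
    shares = [(a / (a + b), b / (a + b)) for a, b in M if a + b > 0]
    if not shares:
        return 0.0, 0.0, 0.0
    n = len(shares)
    mbar1 = sum(s[0] for s in shares) / n
    mbar2 = sum(s[1] for s in shares) / n
    return mbar1, mbar2, abs(mbar1 - mbar2)


def dm_class(d):
    """Map d to the class labels of Table 3."""
    if d <= 0.2:
        return "d_s"
    if d < 0.9:
        return "d_m"
    return "d_h"


def low_methylation(M, max_level=0.1, site_frac=0.8):
    """lm filter: True if one cluster is <= max_level at >= site_frac of sites."""
    n = len(M)
    return any(
        sum(1 for pair in M if pair[j] <= max_level) >= site_frac * n
        for j in (0, 1)
    )
```

Because each site's two normalized shares sum to one, d always lies in [0, 1], which is consistent with the class boundaries of Table 3.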
4 DETECT ASM REGIONS FOR ARABIDOPSIS EPIDERMIS METHYLOME
The methods are applied to the methylome of a single cell type, a root epidermis (We+) cell from A. thaliana Col-0. There are a total of 452 partially methylated regions as defined in Section 3.1. The total genomic length of the regions is 84 397 bp, covering 9980 methylcytosines. A total of 22 926 reads are aligned to these regions. Once the methylation levels of the two alleles and the membership of all reads are predicted by the models, the results are subject to four methods for classification as mentioned in Sections 3.3 and 3.4. Recall that the SVM model is trained on the synthetic methylome made up of two Arabidopsis root methylomes, We+ being one of them.

Table 4 shows the number of PMRs predicted to be allele-specific by each method, and the intersection is taken so as to obtain the most conservative predictions. The first two rows of the table reflect the intersections taken between the d_h class and all other criteria, and the d_m class and other criteria, respectively, totaling 34 regions. Table 5 lists the details and annotations for group (1), and Table 6 for group (2). Based on the predicted methylation levels, both groups are highly allele-specific. While one of the two alleles has nearly no methylation, in the first group of 18 regions, the other allele is in general highly methylated; and in the second group of 16 regions, the methylated allele is more partially methylated. An example from each group is shown in Figure 2. If the criteria for ASM are relaxed a bit by removing the rather stringent filter lm, many more regions [the last row (3) in Table 4] are admitted for further examination. An example from this group is shown in Figure 3.

Table 7 summarizes the TAIR9 annotations (TAIR, 2009) for all three groups of a total of 226 predicted ASM regions. Many regions overlap with natural transposons as expected; the last row in the table reports the number of regions that have no other annotation than transposon. For the protein coding genes, the gene ontology annotations are summarized in Table 8. One region from group (3), on forward strand of chromosome 1: 11267775–11268327, is on the antisense of an exon of protein coding gene AT3G29360, an imprinted gene in A. thaliana seed reported by McKeown et al. (2011).

Table 5. Predicted ASM regions, group (1) in Table 4

ch  Coord start  Coord end  No. of mc  st  Gene model
1   6173395      6173517    20         +   < AT1G17940
1   17295458     17295563   24         +   AT1TE57315
1   17824930     17825022   20         +   AT1TE59180
1   18450185     18450342   20         -   AT1TE61145
1   21929675     21929885   21         +   > AT1G59660, < AT1G59670
2   7089119      7089281    21         -   AT2G16380 3' UTR +
2   7340522      7340662    23         +   AT2TE29970, < AT2G16930
2   14386759     14387012   23         +   > AT2G34060, > AT2G34070
3   12092053     12092151   20         +   AT3TE50300
3   16726743     16726938   29         -   < AT3G45570, AT3TE67795 +
4   2367251      2367352    20         +   < AT4G04670
4   10912792     10913005   25         +   > AT4G20210, AT4TE50030
4   11932172     11932467   24         +   AT4TE55215, > AT4G22690, < AT4G22700
4   13269127     13269376   28         -   AT4TE62330
5   5128508      5128716    22         -   AT5TE18530, > AT5G15725 +
5   8215115      8215303    24         +   AT5TE29690
5   22267311     22267477   25         -   AT5TE80145 +, < AT5G54810
5   26117272     26117538   20         +   AT5TE94040

ch: chromosome; st: strand; <: upstream; >: downstream; signs in the last column indicate opposite strands.

Table 6. Predicted ASM regions, group (2) in Table 4

ch  Coord start  Coord end  No. of mc  st  Gene model
1   18054504     18054609   20         -   < AT1G48820 +, < AT1G48810
1   21249227     21249472   20         +   AT1TE70195, AT1TE70200
1   29696108     29696354   26         +   < AT1G78960
2   2168534      2168695    21         -   AT2TE09960, > AT2G05752
2   11844725     11844944   20         -   AT2TE51520, > AT2G27780
2   11844762     11845016   25         -   AT2TE51520, > AT2G27780
3   10865759     10866031   21         +   > AT3TE45185
3   12092064     12092269   25         +   AT3TE50300
3   16266681     16266822   21         +   AT3TE65915, < AT3G44718
3   16934173     16934369   20         -   AT3TE68630 +, > AT3G46110 +
4   4097347      4097458    20         +   > AT4TE17760
4   4547945      4548048    39         -   AT4TE19110, < AT4G07747
5   13812896     13813035   21         +   AT5TE49235
5   15205421     15205719   23         +   AT5TE55020
5   15268332     15268498   21         -   AT5TE55235, < AT5G38220 +
5   17406449     17406557   22         +   AT5TE62865

ch: chromosome; st: strand; <: upstream; >: downstream; signs in the last column indicate opposite strands.

Fig. 2. Read assignments (a and b) and methylation levels (c and d) of the predicted ASM regions of Arabidopsis We+ cell. Circle and star symbols in c and d represent two alleles. Lines in top (a and b) figures are restricted reads (solid and dotted lines represent two alleles); small diamonds on lines are methylations. Left (a and c): a region of Table 4(1): chrom 4: [13269127, 13269376] - strand. Right (b and d): a region of Table 4(2): chrom 5: [17406449, 17406557] + strand.

Fig. 3. An example from Table 4 group (3): region: chrom 1: [77393, 77508] - strand, 20 mc. (a) Read assignments. (b) Predicted methylation levels.

Table 7. Annotation summary of all ASM regions in Table 4

Annotation            Overlap         Upstream   Downstream
Protein coding gene   4 (4) exon      50 (24)    56 (25)
                      9 (7) intron
                      1 (1) 3' UTR
TE gene               19 (8)          11 (3)     4 (1)
Pseudogene            2               2          3
mi, t, other RNA      1               5 (2)      3
Transposon (only)     66 (39)

Numbers in parentheses are on the opposite strands. Upstream and downstream are within 1 kb.

Table 8. GO annotation summary for protein coding genes in Table 7

Functional category (columns: Gene body, Upstream, Downstream)
Unknown cellular components              2 15 16
Chloroplast                              2 8 6
Other intracellular components           2 6 6
Other cellular components                7 6
Other cytoplasmic components             2 6 4
Other membranes                          3 4 3
Nucleus                                  2 2 5
Plastid                                  1 5 1
Plasma membrane                          1 3 3
Unknown molecular functions              2 12 19
Other enzyme activity                    5 7 8
Other binding                            3 6 6
Protein binding                          1 4 3
Transferase activity                     2 4
Other molecular functions                4 2
Transferase activity                     5
Transporter activity                     1 4
Hydrolase activity                       3
DNA or RNA binding                       3
Nucleotide binding                       3
Other metabolic processes                8 19 12
Unknown biological processes             1 17 18
Other cellular processes                 4 18 12
Response to stress                       4 5 7
Response to abiotic or biotic stimulus   5 6
Transcription, DNA-dependent             1 3 4
Protein metabolism                       5 1
Transport                                1 4
Others                                   10 15 10
The egion has 26 methylcytosines. 5 CONCLUSION AND DISCUSSION We have descibed a computational model fo esolving distinct epigenomes fom a heteogeneous sample. In paticula, we applied this model to identify allele-specific methylated egions. The eads fom potentially multiple epigenomes ae mapped to a common efeence genome. The goal is essentially to infe the distinct methylation pattens fom the mapped eads. Ou appoach is diffeent fom pevious attempts in that it does not ely on SNPs, which ae few and fa between when compaed with methylcytosines. The model was tested on a synthetic methylome. The classification based on the mixtue model in conjunction with an SVM classifie yielded an accuacy of 87.5%. Even though the SVM appoach is not always applicable to eal methylomes, since all the featues ae deived fom the pedicted methylation levels and cluste fequencies, the test esults eflect the eliability of the pedictions made by the model. Additional multiple filtes fo ASMs may futhe educe the numbe of false positives. Additional appoaches may be employed to validate the methods fo ASM detection. One appoach is to use ASMs detemined with SNP data as gound tuth, although such data ae pesently spase and anecdotal. Anothe appoach to validate ASM is to use heitable epi-alleles in combination with phasing infomation obtained fom cossing of plants. i169

We applied the methods to regions of the genome with relatively high density of methylcytosines, with each region being treated independently. By taking advantage of paired-end reads and other information, it will also be possible to do phasing and to extend and connect the regions.

Our model assumes that methylations are independent of each other. Methylations in some regions have a tendency to occur in clusters, which indicates a certain dependency. While our model gives a reasonably good first-order approximation, Markov chain-based models may be explored to take the dependency into consideration.

Identifying ASM is still at an initial research stage. Another direction for future research is to focus on understanding the functions of ASM, for instance, how they are related to allele-specific expression.

More complicated heterogeneous epigenome samples may arise from a mixture of various cell types, or a mixture of cancerous cells at various stages, which present yet more and greater challenges than the allelic methylations of a diploid cell. Such samples will enable an ultimate test for the power of the methods. The initial steps are to develop scenarios and criteria for validation, as it becomes less obvious what defines cell-specific methylations in the context of multiple cell types.

ACKNOWLEDGEMENT

We would like to thank Drs Ryan Lister and Ronan O'Malley from the Salk Institute for providing the experimental data, and Professor Sanjoy Dasgupta from UC-San Diego for insightful discussions.

Funding: This work was supported by the Howard Hughes Medical Institute and the Gordon and Betty Moore Foundation to JRE. JRE is an HHMI-GBMF Investigator.

Conflict of Interest: none declared.

REFERENCES

Bestor,T.H. (2000) The DNA methyltransferases of mammals. Hum. Mol. Genet., 9, 2395–2402.
Cokus,S.J. et al. (2008) Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature, 452, 215–219.
Dempster,A. et al. (1977) Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B, 39, 1–38.
Hellman,A. and Chess,A. (2007) Gene body-specific methylation on the active X chromosome. Science, 315, 1141–1143.
Huala,E. et al. (2001) The Arabidopsis Information Resource (TAIR): a comprehensive database and web-based information retrieval, analysis, and visualization system for a model plant. Nucleic Acids Res., 29, 102–105.
Kerkel,K. et al. (2008) Genomic surveys by methylation-sensitive SNP analysis identify sequence-dependent allele-specific DNA methylation. Nat. Genet., 40, 904–908.
Kim,J.Y. et al. (2005) Counting human somatic cell replications: Methylation mirrors endometrial stem cell divisions. Proc. Natl. Acad. Sci. USA, 102, 17739–17744.
Laird,P.W. (2003) The power and the promise of DNA methylation markers. Nat. Rev. Cancer, 3, 253–266.
Li,E. et al. (1992) Targeted mutation of the DNA methyltransferase gene results in embryonic lethality. Cell, 69, 915–926.
Lippman,Z. et al. (2004) Role of transposable elements in heterochromatin and epigenetic control. Nature, 430, 471–476.
Lister,R. and Ecker,J.R. (2009) Finding the fifth base: Genome-wide sequencing of cytosine methylation. Genome Res., 19, 959–966.
Lister,R. et al. (2008) Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell, 133, 523–536.
Lister,R. et al. (2009) Human DNA methylomes at base resolution show widespread epigenomic differences. Nature, 462, 315–322.
McKeown,P. et al. (2011) Identification of imprinted genes subject to parent-of-origin specific expression in Arabidopsis thaliana seeds. BMC Plant Biol., 11, 113.
Meissner,A. et al. (2008) Genome-scale DNA methylation maps of pluripotent and differentiated cells. Nature, 454, 766–U91.
Ossowski,S. et al. (2010) The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana. Science, 327, 92–94.
Rhee,I. et al. (2002) DNMT1 and DNMT3b cooperate to silence genes in human cancer cells. Nature, 416, 552–556.
Riggs,A. (1975) X inactivation, differentiation, and DNA methylation. Cytogenet. Cell Genet., 14, 9–25.
Schmitz,R.J. et al. (2011) Transgenerational epigenetic instability is a source of novel methylation variants. Science, 334, 369–373.
Shibata,D. and Tavaré,S. (2006) Counting divisions in a human somatic cell tree: How, what and why? Cell Cycle, 5, 610–614.
Shoemaker,R. et al. (2010) Allele-specific methylation is prevalent and is contributed by CpG-SNPs in the human genome. Genome Res., 20, 883–889.
Yatabe,Y. et al. (2001) Investigating stem cells in human colon by using methylation patterns. Proc. Natl. Acad. Sci. USA, 98, 10839–10844.
Zhang,X. et al. (2006) Genome-wide high-resolution mapping and functional analysis of DNA methylation in Arabidopsis. Cell, 126, 1189–1201.
Zilberman,D. et al. (2007) Genome-wide analysis of Arabidopsis thaliana DNA methylation uncovers an interdependence between methylation and transcription. Nat. Genet., 39, 61–69.

APPENDIX A

Section 2 described the model for a mixture of epigenomes. This section details the derivation of the update equations at each M-step iteration of the EM algorithm. Recall from Section 2 that the optimization goal is to determine the methylation probability matrix M and the epigenome frequency array C such that the likelihood l(M,C,R,A) as defined by Equation (3) is maximized given the set of observed restricted reads R. The membership matrix A is estimated by Equation (2) at each iteration. As the maximization of the likelihood l(M,C,R,A) is constrained by $\sum_{j=1}^{k} c_j = 1$, i.e. the epigenome frequencies sum up to 1, we introduce a Lagrange multiplier λ and maximize the unconstrained function

$$
l(M,C,R,A,\lambda)=\sum_{r\in R}\sum_{j=1}^{k} a_{rj}\log(c_j\,p_{rj})-\lambda\Big(\sum_{j=1}^{k}c_j-1\Big),
$$

where r indexes the restricted reads in R, so that $a_{rj}$ is the entry of A for read r and epigenome j, $p_{rj}$ is the probability of read r under epigenome j, and $x_{ri}$ is the methylation state of read r at position i. Substituting in $p_{rj}$ given by Equation (1), we have

$$
\begin{aligned}
l &= \sum_{r\in R}\sum_{j=1}^{k}\big(a_{rj}\log p_{rj}+a_{rj}\log c_j\big)-\lambda\Big(\sum_{j=1}^{k}c_j-1\Big)\\
&= \sum_{r\in R}\sum_{j=1}^{k}\sum_{i=a}^{b} a_{rj}\log\big(m_{ij}x_{ri}+(1-m_{ij})(1-x_{ri})\big)+\sum_{r\in R}\sum_{j=1}^{k}a_{rj}\log c_j-\lambda\Big(\sum_{j=1}^{k}c_j-1\Big)\\
&= \sum_{r\in R}\sum_{j=1}^{k}\sum_{i=a}^{b} a_{rj}\big(x_{ri}\log m_{ij}+(1-x_{ri})\log(1-m_{ij})\big)+\sum_{r\in R}\sum_{j=1}^{k}a_{rj}\log c_j-\lambda\Big(\sum_{j=1}^{k}c_j-1\Big),
\end{aligned}
$$

where the last step uses the fact that each $x_{ri}$ is either 0 or 1.

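Maximizing l with respect to M, C and λ by setting the respective partial derivatives to zero yields the following set of equations, in which the sums over r run over the restricted reads in R (for $m_{ij}$, only reads covering position i contribute):

$$
\frac{\partial l}{\partial m_{ij}}=0
\;\Rightarrow\;
\sum_{r} a_{rj}\Big(\frac{x_{ri}}{m_{ij}}-\frac{1-x_{ri}}{1-m_{ij}}\Big)=0
\;\Rightarrow\;
\sum_{r} a_{rj}\,(x_{ri}-m_{ij})=0,
$$

$$
\frac{\partial l}{\partial c_{j}}=0
\;\Rightarrow\;
\frac{\sum_{r} a_{rj}}{c_j}-\lambda=0
\;\Rightarrow\;
c_j=\frac{1}{\lambda}\sum_{r} a_{rj},
$$

$$
\frac{\partial l}{\partial \lambda}=0
\;\Rightarrow\;
\sum_{j=1}^{k}\Big(\frac{1}{\lambda}\sum_{r} a_{rj}\Big)=1
\;\Rightarrow\;
\lambda=\sum_{j=1}^{k}\sum_{r} a_{rj}.
$$

Solving for $m_{ij}$ and substituting λ into the expression for $c_j$ lead to the update equations at each M-step iteration of the EM algorithm as follows:

$$
m_{ij}=\frac{\sum_{r} a_{rj}\,x_{ri}}{\sum_{r} a_{rj}},\qquad
c_j=\frac{\sum_{r} a_{rj}}{\sum_{j'=1}^{k}\sum_{r} a_{rj'}}.
$$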
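The two M-step updates can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `m_step`, the representation of each restricted read as a dictionary from position to methylation state (0 or 1), and the list-of-lists layout of the membership matrix are all hypothetical choices made for this example. The E-step (Equation (2)) is not shown; the memberships are taken as given.

```python
def m_step(reads, memberships, k):
    """One M-step update (illustrative sketch, hypothetical data layout).

    reads:       list of dicts; reads[r][i] is the methylation state
                 (0 or 1) of read r at position i
    memberships: list of lists; memberships[r][j] is the membership
                 of read r in epigenome j
    k:           number of epigenomes
    Returns (m, c): m[j][i] is the methylation probability at position i
    in epigenome j, and c[j] is the frequency of epigenome j.
    """
    # c_j = sum_r a_rj / sum_j sum_r a_rj
    column_sums = [sum(a[j] for a in memberships) for j in range(k)]
    total = sum(column_sums)
    c = [s / total for s in column_sums]

    # m_ij = sum_r a_rj * x_ri / sum_r a_rj, over reads covering position i
    m = []
    for j in range(k):
        num, den = {}, {}
        for x, a in zip(reads, memberships):
            for i, x_i in x.items():
                num[i] = num.get(i, 0.0) + a[j] * x_i
                den[i] = den.get(i, 0.0) + a[j]
        # Positions with zero total membership are left undefined.
        m.append({i: num[i] / den[i] for i in num if den[i] > 0})
    return m, c


# Toy input: three reads over positions 0..2, two epigenomes.
example_reads = [{0: 1, 1: 1}, {0: 0, 1: 0}, {1: 1, 2: 1}]
example_memberships = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
example_m, example_c = m_step(example_reads, example_memberships, 2)
print(example_c)
print(example_m)
```

In a full EM loop this function would alternate with an E-step that recomputes the memberships from the current `m` and `c`; positions that no read supports for a given epigenome are simply omitted from the returned dictionaries.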