Measuring methylation: from arrays to sequencing

1 Measuring methylation: from arrays to sequencing Jovana Maksimovic, github.com/jovmaksimovic Bioinformatics Winter School, 3 July 2017

2 Talk outline Epigenetics DNA methylation Measuring DNA methylation Methylation arrays How do they work? What do they measure? Example analysis Methylation sequencing What are the challenges? How does it work? Suggested analysis pipeline Summary

3 I work at MCRI. I mostly work on human development & disease, sometimes using mice or other models. I analyse a lot of epigenetic data (ChIP-seq, ATAC-seq, BS-seq, microarrays) and a lot of gene expression data (RNA-seq, microarrays). I write software for analysing methylation array data (missMethyl).

4 What is epigenetics? Epigenetics refers to stable heritable traits not explained by changes in DNA sequence Greek prefix "epi" means "on top of" genetics Chromosome modifications that affect gene expression Histones, DNA methylation Anything that isn't DNA! Essential for normal development Can be modified by environment Can be disrupted in disease

5 Epigenetics brings DNA to life! Identical DNA in every cell, different epigenetic patterns Important in all species [Figure: embryogenesis (sperm, egg, zygote, blastocyst, embryonic stem cells) giving rise to many cell types, e.g. neuronal progenitor cell, neuron, astrocyte, haematopoietic stem cell, red blood cell, B cell, T cell, germ cell, sperm cell, gland cell, hormone-secreting cell, fat cell, lung cell, skin cell, intestine cell, kidney cell, muscle cell] Modified from

6 Epigenetics is CRAZY complicated! Roy et al. (2010), Science New sequencing & microarray technologies are enabling us to learn A LOT more about epigenetics Different data types need different analysis Today I'm only focussing on DNA methylation

7 What is DNA methylation? DNA methylation primarily occurs at CpG dinucleotides

8 DNA methylation in the genome The human genome contains ~30,000,000 CpGs (~1%) VERY different between different species CpGs are not evenly spaced across the genome Tend to be present in clusters called CpG islands CpG methylation is spatially correlated Patterson et al. 2011, J Vis Exp Methylation correlation with distance ~500bp Eckhardt et al. 2007, Nature Genetics

9 Methylation can regulate gene expression Methylation at a single CpG vs. gene expression Each point is one sample Plot from Peter Hickey

10 Methylation changes coat colour of Agouti mice Dolinoy 2008, Nutr Rev. This gene controls coat colour in Agouti mice These CpG sites in the promoter change PS1A expression depending on methylation These mice are genetically identical Hypomethylated Hypermethylated Coat colour different due to different maternal diet i.e. environment!

11 Methylation makes worker bees! These larvae are genetically identical Hypomethylated Cridge et al. 2015, Nutrients Hypermethylated

12 Methylation is cool What do we usually want to know about it?

13 Finding methylation differences can tell us a lot Methylation is critical in determining cell type Regulatory T-cell vs. Naïve T-cell Methylation can be disrupted in disease Cancer vs. Normal Methylation is affected by the environment Smokers vs. Non-smokers Collect appropriate samples Extract DNA and measure methylation Statistical analysis Normal Cancer

14 Epigenome-wide association studies (EWAS) Similar to GWAS Compare lots of cases to lots of controls Often looking for small effects e.g. complex disease or environmental effects Need lots of samples 100s or 1000s of cases & controls

15 How do we measure methylation?
Bisulphite conversion: create SNPs; single nucleotide resolution; read out by array or sequencing
Enrichment of methylated DNA: restriction enzymes or affinity; regional resolution; read out by array or sequencing

16 What is bisulphite conversion? Chemical process Unmethylated Cs get converted to Us, which are read as Ts after PCR Methylated Cs are protected Creates SNP Used to call methylation
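The conversion rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function name and the choice of methylated positions are assumptions, and the sequence is the genome reference shown later in the talk.

```python
def bisulphite_convert(seq, methylated_positions=()):
    """Simulate bisulphite conversion plus PCR on one DNA strand.

    Unmethylated C is read as T; C at a methylated position stays C.
    Positions are 0-based indices into seq.
    """
    methylated = set(methylated_positions)
    converted = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated:
            converted.append("T")  # unmethylated C -> U -> read as T
        else:
            converted.append(base)  # methylated C (or any other base) is kept
    return "".join(converted)


reference = "CCGGCATGTTTAAACGCT"
# Assumption for illustration: both CpG cytosines (indices 1 and 14) are methylated.
read = bisulphite_convert(reference, methylated_positions=[1, 14])
print(read)  # TCGGTATGTTTAAACGTT
```

Comparing the converted read back to the unconverted reference shows a C/T difference at every unmethylated cytosine, which is the "SNP" used to call methylation.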

17 Methylation arrays What are they and how do they work?

18 Illumina Infinium HumanMethylation BeadChips
27k array (2009): 1 chip = 12 samples; >27,000 unique CpG sites measured in each sample
450k array (2011): 1 chip = 12 samples; >450,000 unique CpG sites measured in each sample
850k array (2015): 1 chip = 8 samples; >850,000 unique CpG sites measured in each sample
Human only Gene biased; selected to be relevant to human development & disease, e.g. TSS, promoters, CpG islands, enhancers, ... Modified slide from Belinda Phipson

19 Methylation arrays are based on SNP array technology What is this base? Measure fluorescence intensity Methylation array SNPs (C/T) are created by bisulphite conversion Comparing the intensity of C/T gives the proportion of methylation at single CpG

20 What methylation values can we get? A sample CH 3 CH 3 CH 3 On an array, we measure methylation in a population of cells Individual cell can be either 0, 0.5 or 1 at one CpG Across a population we get a continuous measurement between [0-1] Many cells in single sample
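A small sketch of how discrete per-cell states average into a continuous population value. The function name and the example cell values are made up for illustration.

```python
def population_methylation(cell_values):
    """Average per-cell methylation (0, 0.5 or 1) across a population of cells."""
    allowed = {0, 0.5, 1}
    if not cell_values:
        raise ValueError("need at least one cell")
    if any(v not in allowed for v in cell_values):
        raise ValueError("per-cell values must be 0, 0.5 or 1")
    return sum(cell_values) / len(cell_values)


# Hypothetical sample of 10 cells at a single CpG.
cells = [1, 1, 0.5, 0, 1, 0.5, 1, 0, 1, 1]
print(population_methylation(cells))  # 0.7
```

With thousands of cells in a real sample, the average can take essentially any value between 0 and 1.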

21 Measures of methylation Arrays measure both methylated (C) and unmethylated (T) signal to get proportion of methylation at a CpG
Beta value: $\beta = \frac{\mathrm{Meth}}{\mathrm{Meth} + \mathrm{Unmeth}}$ Intuitive, easy to interpret, great for visualisation
M value: $M = \log_2\left(\frac{\mathrm{Meth}}{\mathrm{Unmeth}}\right)$ Better statistical properties, recommended for statistical testing
Can convert between them via a logit transformation: $M = \log_2\left(\frac{\beta}{1 - \beta}\right)$ Du et al. 2011, BMC Bioinformatics
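The two measures and the conversion between them can be written out directly. This is a minimal sketch: the function names are illustrative, the intensities are invented, and the small offsets that real pipelines often add to avoid division by zero are left out.

```python
import math


def beta_value(meth, unmeth):
    """Proportion of methylated signal, between 0 and 1."""
    return meth / (meth + unmeth)


def m_value(meth, unmeth):
    """Log2 ratio of methylated to unmethylated signal."""
    return math.log2(meth / unmeth)


def beta_to_m(beta):
    """Logit (base 2) transformation from beta value to M value."""
    return math.log2(beta / (1 - beta))


def m_to_beta(m):
    """Inverse of beta_to_m."""
    return 2 ** m / (2 ** m + 1)


# Hypothetical probe intensities.
meth, unmeth = 8000, 2000
print(round(beta_value(meth, unmeth), 3))
print(round(m_value(meth, unmeth), 3))
print(round(beta_to_m(beta_value(meth, unmeth)), 3))
```

An M value of 0 corresponds to a beta value of 0.5; beta values near 0 or 1 map to large negative or positive M values.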

22 What does the data look like? Table of M-values CpG sites Sample A1 Sample A2 Sample A3 Sample B1 Sample B2 Sample B

23 Array analysis pipeline
QC: β density plots, control probes, MDS/clustering plots, ... Remove bad samples and poor performing probes (CpGs). Software: minfi, methylumi, limma
Normalization: within and between arrays. Transform data to remove unwanted variation. Software: minfi, missMethyl, wateRmelon
Statistical testing for differential methylation, CpGs & regions. Estimate means and variances and borrow information across probes. Software: limma, bumphunter, DMRcate
Annotation to genes, gene set testing, visualization, ... Think about biological interpretation. Software: missMethyl, Gviz
Combine with other data types e.g. gene expression. Software: GenomicRanges

24 [Figure: samples M28, M29, M30; cell types rTreg, naive, activated rTreg, activated naive]

25 After QC, data exploration is your friend! MDS plots showing largest sources of variation in the data (Dimension 1 vs. Dimension 2; Dimension 3 vs. Dimension 4) Clustering by individual and cell type

26 Statistical testing: Look for differences at single CpGs Differential methylation Linear model: $y = X\beta + \varepsilon$ Can take into account any other covariates One test per CpG!
Moderated t-statistic: $\tilde{t} = \frac{\bar{y}_{can} - \bar{y}_{norm}}{\tilde{s}\sqrt{v}}$, where $\tilde{s}^2$ is the empirical Bayes variance (Smyth, 2004)
Adjust the p-values using Benjamini and Hochberg's FDR Phipson & Oshlack 2015, Genome Biology Lots of differences between immune cell types! Modified slide from Belinda Phipson & Alicia Oshlack
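Because there is one test per CpG, the multiple-testing adjustment matters a great deal. The sketch below implements the standard Benjamini-Hochberg step-up adjustment in plain Python; the function name and the p-values are illustrative, and in practice this step is done for you by the analysis packages.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for five CpGs.
p = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted = benjamini_hochberg(p)
significant = [i for i, q in enumerate(adjusted) if q < 0.05]
print(significant)  # [0, 1]
```

Note that the third and fourth CpGs have raw p-values below 0.05 but do not survive adjustment in this toy example.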

27 Statistical testing: Differences across CpG dense region Recall: CpG methylation is spatially correlated Can we find consistent group-average level differences between CpGs that are close together? More functionally relevant than differences at individual CpGs? Aryee et al. 2014, Bioinformatics Lots of DMRs between immune cell types!
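The idea of looking for consistent group-average differences across nearby CpGs can be sketched as follows. This is a deliberately simplified illustration and not the bumphunter or DMRcate algorithm: there is no smoothing and no permutation testing, and the function name and thresholds (`max_gap`, `min_diff`, `min_cpgs`) are arbitrary assumptions.

```python
def find_candidate_regions(positions, group_a, group_b,
                           max_gap=500, min_diff=0.2, min_cpgs=3):
    """Group nearby CpGs whose mean methylation differs in the same direction.

    positions: sorted genomic coordinates on one chromosome.
    group_a, group_b: one list of beta values per CpG, for each group.
    Returns (start, end, n_cpgs, mean_difference) tuples.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    diffs = [mean(a) - mean(b) for a, b in zip(group_a, group_b)]
    regions, current = [], []
    for i, (pos, d) in enumerate(zip(positions, diffs)):
        passes = abs(d) >= min_diff
        extends = (
            bool(current)
            and pos - positions[current[-1]] <= max_gap
            and (d > 0) == (diffs[current[-1]] > 0)
        )
        if passes and extends:
            current.append(i)
        else:
            # Close the open run, keeping it only if it has enough CpGs.
            if len(current) >= min_cpgs:
                regions.append(current)
            current = [i] if passes else []
    if len(current) >= min_cpgs:
        regions.append(current)
    return [
        (positions[r[0]], positions[r[-1]], len(r), mean([diffs[i] for i in r]))
        for r in regions
    ]


# Hypothetical data: five CpGs, two samples per group.
positions = [100, 250, 400, 5000, 5100]
group_a = [[0.8, 0.9], [0.85, 0.75], [0.9, 0.8], [0.5, 0.5], [0.2, 0.3]]
group_b = [[0.3, 0.4], [0.4, 0.3], [0.35, 0.45], [0.5, 0.6], [0.25, 0.2]]
regions = find_candidate_regions(positions, group_a, group_b)
print(regions)
```

Here the first three CpGs form one candidate region; the two distant CpGs show no meaningful difference and are ignored.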

28 You can do other cool stuff! Unmethylated regions in rTreg compared to naïve cells enriched for FOXP3 binding motifs! DMR consensus motif matches Forkhead-binding motif [Figure: Forkhead-binding motif compared with consensus motif from DMR seqs.] Differences in cell types controlled by FOXP3! Modified slide from Alicia Oshlack

29 Methylation array analysis is very mature: lots of methods!

30 Methylation sequencing AKA bisulphite sequencing: the good, the bad and the ugly

31 Two main types of bisulphite sequencing Whole-genome bisulphite sequencing (BS-seq) Gold standard Genome-wide (~30,000,000 CpGs in human) Expensive but covers almost everything Need high (10-30x) coverage to reliably call methylation Targeted BS-seq Only sequence regions of interest Reduced representation BS-seq (restriction enzyme) Capture BS-seq (similar principle to exome) Cheaper but can miss a lot of stuff Can usually do higher (20-60x) coverage

32 What was bisulphite conversion again? DNA fragment All four of these can be sequenced!

33 What are the challenges? Like calling SNPs, methylation in BS-seq inferred by comparison to unconverted reference sequence Correct alignment is critical More challenging than usual! Aligned sequences do not exactly match reference Complexity of libraries is reduced Many Cs become Ts, so less info for mapping! Methylation is not symmetrical Two strands of DNA in the reference genome must be considered separately

34 DNA fragment Mapping (Bismark) BS conversion & PCR

35 DNA fragment Mapping (Bismark) BS conversion & PCR TCGGTATGTTTAAACGTT

36 DNA fragment Mapping (Bismark) BS conversion & PCR TCGGTATGTTTAAACGTT In silico read conversion C-to-T TTGGTATGTTTAAATGTT G-to-A TCAATATATTTAAACATT

37 Mapping (Bismark) DNA fragment after BS conversion & PCR: TCGGTATGTTTAAACGTT
In silico read conversion: C-to-T TTGGTATGTTTAAATGTT; G-to-A TCAATATATTTAAACATT
Align to in silico bisulphite converted genome:
Fwd strand C-to-T converted genome TTGGTATGTTTAAATGTT; reverse complement AACCATACAAATTTACAA
Fwd strand G-to-A converted genome CCAACATATTTAAACACT; reverse complement GGTTGTATAAATTTGTGA

38 Mapping (Bismark) DNA fragment after BS conversion & PCR: TCGGTATGTTTAAACGTT
In silico read conversion: C-to-T TTGGTATGTTTAAATGTT; G-to-A TCAATATATTTAAACATT
Align to in silico bisulphite converted genome:
Fwd strand C-to-T converted genome TTGGTATGTTTAAATGTT; reverse complement AACCATACAAATTTACAA
Fwd strand G-to-A converted genome CCAACATATTTAAACACT; reverse complement GGTTGTATAAATTTGTGA
Read all alignment outputs simultaneously to determine if sequence can be mapped uniquely
G-to-A converted read TCAATATATTTAAACATT against each of the four converted genome strands: mismatches (x) in every alignment

39 Mapping (Bismark) DNA fragment after BS conversion & PCR: TCGGTATGTTTAAACGTT
In silico read conversion: C-to-T TTGGTATGTTTAAATGTT; G-to-A TCAATATATTTAAACATT
Align to in silico bisulphite converted genome:
Fwd strand C-to-T converted genome TTGGTATGTTTAAATGTT; reverse complement AACCATACAAATTTACAA
Fwd strand G-to-A converted genome CCAACATATTTAAACACT; reverse complement GGTTGTATAAATTTGTGA
Read all alignment outputs simultaneously to determine if sequence can be mapped uniquely
C-to-T converted read TTGGTATGTTTAAATGTT: exact match to the fwd strand C-to-T converted genome; mismatches (x) against the other three strands
G-to-A converted read TCAATATATTTAAACATT: mismatches (x) against all four strands
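The mapping walkthrough on these slides can be reproduced with a toy comparison. This is a sketch only: it compares equal-length strings position by position rather than running a real aligner as Bismark does, the helper names are made up, and the second strand is written as the plain complement so that it lines up left to right with the slide.

```python
def c_to_t(seq):
    """In silico C-to-T conversion."""
    return seq.replace("C", "T")


def g_to_a(seq):
    """In silico G-to-A conversion."""
    return seq.replace("G", "A")


def complement(seq):
    """Base-wise complement, kept in the same left-to-right order as the slide."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))


def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


genome = "CCGGCATGTTTAAACGCT"  # unconverted reference from the talk
read = "TCGGTATGTTTAAACGTT"    # read after BS conversion & PCR

converted_reads = {
    "C-to-T read": c_to_t(read),
    "G-to-A read": g_to_a(read),
}
converted_genomes = {
    "C-to-T genome": c_to_t(genome),
    "C-to-T genome, opposite strand": complement(c_to_t(genome)),
    "G-to-A genome": g_to_a(genome),
    "G-to-A genome, opposite strand": complement(g_to_a(genome)),
}

# Keep only the read/genome pairs that agree at every position.
perfect = [
    (read_name, genome_name)
    for read_name, read_seq in converted_reads.items()
    for genome_name, genome_seq in converted_genomes.items()
    if mismatches(read_seq, genome_seq) == 0
]
print(perfect)  # [('C-to-T read', 'C-to-T genome')]
```

Exactly one combination matches, so the read can be placed uniquely; the methylation state is then read off by comparing the original (unconverted) read with the original genome at that location.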

40 Calling methylation
Genome reference: CCGGCATGTTTAAACGCT
10 aligned reads:
TCGGTATGTTTAAATGTT
TTG TATGTTTAAATGTT
TCGGTATGTTTAAATGTT
TCGGTATGTT AAACGTT
TCGGTATGTTTAAATGTT
TCGGTATGTTT ATGTT
TCGGTATGTTTAAATGTT
TCGGTATGTTTAAAT TT
TTGGTATGTTTA ATGTT
TCGGTATGTTTAAACGT
First CpG: C retained in 8 of 10 reads = 80% methylation
Second CpG: C retained in 2 of 10 reads = 20% methylation
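Counting C versus T at each reference CpG is all that methylation calling amounts to in this example. The sketch below uses a simplified version of the slide's pileup (ten full-length reads with no missing bases, chosen to give the same 80% and 20%); the function name and the assumption that every read starts at reference position 0 on the forward strand are for illustration only.

```python
def call_methylation(reference, reads):
    """Count methylated (C) and unmethylated (T) reads at each CpG cytosine.

    Assumes every read is aligned to the forward strand starting at index 0.
    Returns {position: (n_methylated, n_unmethylated, percent_methylated)}.
    """
    calls = {}
    for pos in range(len(reference) - 1):
        if reference[pos:pos + 2] != "CG":
            continue  # only cytosines followed by G are CpG sites
        meth = sum(1 for r in reads if pos < len(r) and r[pos] == "C")
        unmeth = sum(1 for r in reads if pos < len(r) and r[pos] == "T")
        total = meth + unmeth
        pct = 100 * meth / total if total else None
        calls[pos] = (meth, unmeth, pct)
    return calls


reference = "CCGGCATGTTTAAACGCT"
# Simplified, hypothetical pileup of 10 reads.
reads = (
    ["TCGGTATGTTTAAATGTT"] * 6    # first CpG methylated, second unmethylated
    + ["TTGGTATGTTTAAATGTT"] * 2  # both CpGs unmethylated
    + ["TCGGTATGTTTAAACGTT"] * 2  # both CpGs methylated
)
calls = call_methylation(reference, reads)
print(calls)  # {1: (8, 2, 80.0), 14: (2, 8, 20.0)}
```

Bases that are neither C nor T at a CpG position (sequencing errors, for example) are simply not counted towards coverage in this sketch.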

41 Calling methylation Genome reference: CCGGCATGTTTAAACGCT Aligned reads: TCGGTATGTTTAAATGTT, TTG TATGTTTAAATGTT, TCGGTATGTTTAAATGTT Good coverage is very important for reliable methylation calls!
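One way to see why coverage matters is to treat each read as an independent draw and look at the binomial standard error of the estimated methylation proportion. This is a simplifying assumption (real reads are affected by PCR duplicates, mapping bias and so on), and the function name is illustrative.

```python
import math


def methylation_standard_error(p, coverage):
    """Binomial standard error of a methylation proportion at a given read depth."""
    return math.sqrt(p * (1 - p) / coverage)


# True methylation of 80% observed at increasing coverage.
for coverage in (3, 10, 30, 60):
    se = methylation_standard_error(0.8, coverage)
    print(f"{coverage}x coverage: 0.80 +/- {se:.3f}")
```

The uncertainty shrinks only with the square root of coverage, so going from 10x to 30x helps far more than the first few reads suggest, while very low coverage leaves the call close to uninformative.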

42 Some real BS-seq mapping results

43 Methylation calling output Columns: chr, position of C in genome, % methylation (e.g. 80%), no. methylated reads, no. unmethylated reads Sum the methylated and unmethylated read counts for total coverage This is what we work with!
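A minimal sketch of reading such a table and filtering on total coverage. The column order (chromosome, position, % methylation, methylated count, unmethylated count), the tab separator, the function name and the rows themselves are all assumptions for illustration; check the actual layout produced by your methylation caller.

```python
def parse_calls(lines, min_coverage=1):
    """Parse tab-separated methylation calls and keep sites with enough coverage.

    Assumed columns: chrom, position, percent methylation,
    methylated read count, unmethylated read count.
    """
    records = []
    for line in lines:
        chrom, pos, _pct, meth, unmeth = line.rstrip("\n").split("\t")
        meth, unmeth = int(meth), int(unmeth)
        coverage = meth + unmeth  # total coverage is the sum of both counts
        if coverage < min_coverage:
            continue
        records.append({
            "chrom": chrom,
            "pos": int(pos),
            "coverage": coverage,
            "pct_meth": 100 * meth / coverage,
        })
    return records


# Hypothetical rows.
rows = [
    "chr1\t1000\t80\t8\t2",
    "chr1\t1050\t20\t2\t8",
    "chr1\t2000\t100\t1\t0",  # 100% methylated, but from a single read
]
records = parse_calls(rows, min_coverage=10)
print(records)
```

The last row is dropped: a site called 100% methylated from one read is exactly the kind of unreliable call that a coverage filter is meant to remove.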

44 Analysis pipeline Thorough QC is VERY important for BS-seq Krueger et al. 2012, Nature Methods Need to be brutal with trimming off poor quality bases and adapters As with SNP calling, removing PCR duplicates is a good idea for better methylation calling Other stuff to find cool biology!

45 Summary Methylation arrays very popular Only for human Great for EWAS Analysis very mature Bioconductor is the place to go! BS-seq best option for genome-wide single nucleotide resolution Only option for species other than human Pre-processing, mapping, etc. pretty good Statistical analysis still developing Bioconductor is a valuable resource Downstream analysis dependent on biological question Methylation is interesting & we know how to measure it Best technology for the job depends on what you want to know!

46 Acknowledgments Murdoch Childrens Research Institute Alicia Oshlack Belinda Phipson MCRI Bioinformatics group! Johns Hopkins University Peter Hickey github.com/jovmaksimovic missmethyl release/bioc/html/missmethyl.html