DOUBLE-STRAND DNA BREAK PREDICTION USING EPIGENOME MARKS AT KILOBASE RESOLUTION


1 DOUBLE-STRAND DNA BREAK PREDICTION USING EPIGENOME MARKS AT KILOBASE RESOLUTION Raphaël MOURAD, Assist. Prof. Centre de Biologie Intégrative Université Paul Sabatier, Toulouse III

2 INTRODUCTION

3 Double-strand DNA breaks Double-strand breaks (DSBs) result from an attack on both DNA strands and arise from multiple sources: ionizing radiation, reactive oxygen species, replication, and nucleases. [Figure: exposure to multiple sources leads to a DSB.]

4 Risks of double-strand DNA breaks DSBs are particularly hazardous to the cell because they can lead to chromosomal rearrangements and are linked to cancer.

5 DNA repair: two main mechanisms in mammals

6 Chromatin remodeling induced by double-strand breaks DSBs induce opening of chromatin, acetylation of histones H2A and H4, spreading of γH2A.X, exchange of histone H2A for the variant H2A.Z, and recruitment of repair proteins (ATM, p53).

7 Genome-wide mapping of DSBs is important Because of their mutagenic potential, it is important to identify the precise locations of DSBs. DSBs were recently mapped genome-wide for the first time (Crosetto et al., Nat. Methods 2013; Lensing et al., Nat. Methods 2016). To date, very few DSB mapping datasets have been generated because of high sequencing costs and experimental difficulties!

8 Can we predict DSBs from the epigenetic and chromatin context? Endogenous DSBs were recently mapped at <1 kb resolution by the DSBCapture technique and were shown to be enriched for many histone marks and protein binding sites (Lensing et al., Nat. Methods 2016).

9 A large amount of publicly available data to address these questions! Thousands of NGS datasets are available from GEO and ENCODE: ChIP-seq and DNase-seq, plus protein-binding motifs from JASPAR.

10 DSB PREDICTION USING COMPUTATIONAL MODELS

11 The predictive approach

12 Input data for predictions [Figure: along the chromosome, DSBCapture data give the response variable Y (breaks), while DNase and ChIP-seq data give the model variables X1 (openness), X2 (Prot A), ..., Xp (Prot X).]
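The input layout on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bin size, the toy peak coordinates, the binary "overlaps a peak" encoding, and the helper names `overlaps` and `build_matrix` are all hypothetical and are not taken from the talk or from the authors' R code.

```python
def overlaps(bin_start, bin_end, intervals):
    """Return 1 if the half-open bin [bin_start, bin_end) overlaps any interval, else 0."""
    return int(any(s < bin_end and e > bin_start for s, e in intervals))


def build_matrix(chrom_length, bin_size, dsb_peaks, feature_peaks):
    """Cut one chromosome into fixed-size bins and return (y, X, names).

    y[i] is 1 if bin i overlaps a DSB peak; X[i][j] is 1 if bin i overlaps
    a peak of feature names[j]. Binary encoding is an assumption made here.
    """
    names = sorted(feature_peaks)
    y, X = [], []
    for start in range(0, chrom_length, bin_size):
        end = min(start + bin_size, chrom_length)
        y.append(overlaps(start, end, dsb_peaks))
        X.append([overlaps(start, end, feature_peaks[n]) for n in names])
    return y, X, names


# Made-up coordinates in bp (half-open intervals), for illustration only.
dsb_peaks = [(1200, 1800), (4100, 4300)]
feature_peaks = {
    "DNase": [(1000, 2000), (4000, 4500)],
    "CTCF": [(1500, 1600)],
    "H3K4me1": [(3000, 3500), (4200, 4400)],
}

y, X, names = build_matrix(5000, 1000, dsb_peaks, feature_peaks)
print(names)  # ['CTCF', 'DNase', 'H3K4me1']
print(y)      # [0, 1, 0, 0, 1]
for row in X:
    print(row)
```

Each row of `X` describes one 1 kb bin and each column one epigenomic track, which is the shape most classifiers expect. Read counts or signal averages could replace the 0/1 values without changing the layout.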

13 Random forests Machine learning prediction models. Very accurate models that can handle high-dimensional data (such as genomic data) and capture non-linear relations and interactions. [Figure: example decision tree that splits on DNase, CTCF and H3K4me3 (yes/no) and ends in leaves labeled DSB or Non-DSB.]
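To make the idea concrete, here is a minimal standard-library sketch of a random forest on binary features: each tree is grown on a bootstrap sample, each split considers a random subset of features, and the forest averages the tree votes. The toy labeling rule (DSB when DNase is present together with CTCF or H3K4me3), the feature order, and every function name are assumptions made for illustration. This is not the authors' implementation, which is in R.

```python
import itertools
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def grow_tree(X, y, rng, n_try, depth=0, max_depth=3):
    """Grow one tree; a leaf is an int label, a node is (feature, left, right)."""
    majority = int(2 * sum(y) >= len(y))
    # Only features that still vary in this node can produce a split.
    usable = [j for j in range(len(X[0])) if len({row[j] for row in X}) > 1]
    if depth == max_depth or len(set(y)) == 1 or not usable:
        return majority
    best_score, best_j = None, None
    for j in rng.sample(usable, min(n_try, len(usable))):
        left = [t for row, t in zip(X, y) if row[j] == 0]
        right = [t for row, t in zip(X, y) if row[j] == 1]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best_score is None or score < best_score:
            best_score, best_j = score, j
    rows0 = [(row, t) for row, t in zip(X, y) if row[best_j] == 0]
    rows1 = [(row, t) for row, t in zip(X, y) if row[best_j] == 1]
    return (
        best_j,
        grow_tree([r for r, _ in rows0], [t for _, t in rows0], rng, n_try, depth + 1, max_depth),
        grow_tree([r for r, _ in rows1], [t for _, t in rows1], rng, n_try, depth + 1, max_depth),
    )


def predict_tree(tree, x):
    """Walk from the root to a leaf and return its 0/1 label."""
    while isinstance(tree, tuple):
        tree = tree[2] if x[tree[0]] == 1 else tree[1]
    return tree


def fit_forest(X, y, n_trees=50, n_try=2, seed=0):
    """Fit n_trees trees, each on a bootstrap sample of the rows."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _row in X]
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], rng, n_try))
    return forest


def predict_proba(forest, x):
    """Fraction of trees voting 'DSB' for feature vector x."""
    return sum(predict_tree(t, x) for t in forest) / len(forest)


# Toy data: columns are [DNase, CTCF, H3K4me3]; the labeling rule is invented.
combos = [list(c) for c in itertools.product([0, 1], repeat=3)]
X = combos * 10
y = [int(d == 1 and (c == 1 or h == 1)) for d, c, h in X]

forest = fit_forest(X, y)
print(predict_proba(forest, [1, 1, 0]))  # high: open chromatin plus CTCF
print(predict_proba(forest, [0, 1, 1]))  # low: no DNase signal
```

The averaged vote behaves like a probability score, which is what an AUC calculation needs. Real analyses would normally use an established library rather than a hand-written forest.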

14 Lasso logistic regression Another efficient predictive model from statistics. Model: $\ln \frac{\mathrm{Prob}(Y=1 \mid X)}{1 - \mathrm{Prob}(Y=1 \mid X)} = \beta_0 + \beta X$. Variable $Y$ indicates whether the genomic locus has a double-strand break ($Y = 1$) or not ($Y = 0$). Variable set $X = \{X_1, \dots, X_p\}$ is the set of $p$ proteins of interest and $\beta = \{\beta_1, \dots, \beta_p\}$ denotes the set of corresponding slope parameters (one parameter $\beta$ for each protein).
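The slide gives the model but not the fitting criterion. In the standard lasso formulation, the parameters are estimated by minimizing the negative log-likelihood plus an L1 penalty on the slopes. The notation below is an assumption added for illustration: $n$ is the number of genomic loci, $y_i$ and $x_{ij}$ are the observed values of $Y$ and $X_j$ at locus $i$, and $\lambda \ge 0$ is the penalty weight, usually chosen by cross-validation.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0, \beta}
  \left\{ -\frac{1}{n} \sum_{i=1}^{n}
    \left[ y_i \eta_i - \ln\left(1 + e^{\eta_i}\right) \right]
  + \lambda \sum_{j=1}^{p} \left| \beta_j \right| \right\},
\qquad
\eta_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}
```

Larger values of $\lambda$ shrink more of the $\beta_j$ exactly to zero, so the fitted model keeps only a subset of the proteins as predictors.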

15 Prediction from epigenome and chromatin Endogenous DSBs can be very accurately predicted using epigenome and chromatin data (AUC=0.97). Predictions remain accurate using only DNase and H3K4me1 (AUC>0.95).
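The AUC quoted here is the area under the ROC curve. It equals the probability that a randomly chosen DSB locus receives a higher score than a randomly chosen non-DSB locus. A minimal sketch of that pairwise definition follows; the labels and scores are made up, and the function name `roc_auc` is chosen only for this example.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from 0/1 labels and real-valued scores.

    Counts, over all (positive, negative) pairs, how often the positive
    is scored higher; ties count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: 3 DSB loci and 3 non-DSB loci with model scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))  # 0.889
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic in the number of loci, so a rank-based formula is preferable for genome-scale data.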

16 Prediction from epigenome and chromatin DNase, CTCF, p63 and H3K4me1/me2/me3 are the best predictors among those tested.

17 Predictions outperform BLESS mapping technique Model predictions outperform BLESS (Crosetto et al., Nat. Methods 2013), a common technique to map DSB sites.

18 Predictions outperform BLESS mapping technique The model learned from BLESS data predicts DSBs that were not identified by BLESS but which were detected by DSBCapture!

19 Genome-wide predictions Accurate predictions for all regions of the genome.

20 Model learned in one cell type to predict DSBs in another cell type Predictions in another cell type are accurate. We find the same predictive features in both NHEK and U2OS.

21 Model learned in one cell type to predict DSBs in another cell type DSBs predicted over the genome in U2OS are enriched in U2OS BLESS reads (DSB data), in DSB repair proteins RAD51 and XRCC4, and depleted in DSB histone mark γ-H2A.X.

22 CONCLUSION

23 Conclusion Here we show, for the first time, that DSBs can be computationally predicted using public epigenomic data, even when the availability of data is limited (e.g. DNase I and H3K4me1). By using state-of-the-art computational models, we achieve excellent prediction accuracy, paving the way for a better understanding of DSB formation depending on developmental stage or cell-type specific epigenetic marks.

24 R code availability Code is available on GitHub:

25 Contributions
Centre de Biologie Intégrative, Team Cuvier (Toulouse): Raphael Mourad, Olivier Cuvier. Main results.
Centre de Biologie Intégrative, Team Legube (Toulouse): Krzysztof Ginalski (now University of Warsaw, Poland), Gaelle Legube. BLESS results.
Reference: Raphael Mourad, Krzysztof Ginalski, Gaelle Legube and Olivier Cuvier. Predicting double-strand DNA breaks using epigenome marks or DNA at kilobase resolution. Genome Biology, 2018, 19:34.
Corresponding author: Raphael Mourad

26 BIBLIOGRAPHY

27 Bibliography
Stefanie V. Lensing, et al., and Shankar Balasubramanian. DSBCapture: in situ capture and sequencing of DNA breaks. Nature Methods, 13(10), August 2016.
Nicola Crosetto, et al., and Ivan Dikic. Nucleotide-resolution DNA double-strand break mapping by next-generation sequencing. Nature Methods, 10(4), April 2013.
Anthony Mathelier, Beibei Xin, Tsu-Pei Chiu, Lin Yang, Remo Rohs, and Wyeth W. Wasserman. DNA shape features improve transcription factor binding site predictions in vivo. Cell Systems, 3(3).
Brendan D. Price and Alan D. D'Andrea. Chromatin remodeling at DNA double-strand breaks. Cell, 152(6), March.
John W. Whitaker, Zhao Chen, and Wei Wang. Predicting the human epigenome from DNA motifs. Nature Methods, 12(3), March.
Leo Breiman. Random forests. Machine Learning, 45(1):5-32, October.
Pierre Caron, et al., and Gaelle Legube. Cohesin protects genes against γ-H2AX induced by DNA double-strand breaks. PLOS Genetics, 8(1):1-17.
Andrea Kinner, Wenqi Wu, Christian Staudt, and George Iliakis. γ-H2AX in recognition and signaling of DNA double-strand breaks in the context of chromatin. Nucleic Acids Research, 36(17):5678, 2008.