CircadiOmics: Integrating Circadian Genomics, Transcriptomics, Proteomics, and Metabolomics

Vishal R Patel, Kristin Eckel-Mahan, Paolo Sassone-Corsi, Pierre Baldi

Supplementary Figure 1: CircadiOmics Webserver

[Figure panels A and B. Panel B legend: Metabolite Centric View; Gene Centric View. Nodes: Transcription Factor, Gene, mRNA, Enzyme or Protein, Metabolite. Edges: TF binding site, TF binding site on Open Chromatin, TF binding according to ChIP-seq or ChIP-ChIP, Transcription, Translation, Protein-Protein, Enzyme-Metabolite, Metabolite Co-reaction.]

(A) Screenshot from the CircadiOmics website showing the time courses for Uracil in wild-type liver, Clock-/- liver, and wild-type muscle. The website allows users to enter any metabolite or gene transcript of interest, select a subset of the available tissues and conditions, and view and compare the corresponding time series. Clicking on View Network brings up the associated subnetwork, which in this case would be a metabolite-centric view centered on Uracil. Also shown are the statistics from JTK_Cycle [1], a nonparametric algorithm for detecting rhythmic components. (B) Symbolic representation of the metabolite-centric and gene-centric subnetworks.
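The rhythmicity statistics mentioned in the caption can be illustrated with a minimal sketch of a nonparametric rhythm scan. This is not the JTK_Cycle implementation: it only correlates the ranks of a time series with cosine templates of a fixed period and reports the best-matching phase, and it computes no p-value. The function names, the 24-hour period, and the phase step are assumptions made for illustration.

```python
import math
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a rank correlation between two equal-length sequences."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def best_cosine_match(times, values, period=24.0, phase_step=0.5):
    """Scan cosine templates over phase (hypothetical helper).

    Returns (tau, peak_time) for the template whose ranks agree best
    with the measured values.
    """
    best_tau, best_phase = -1.0, 0.0
    phase = 0.0
    while phase < period:
        template = [math.cos(2 * math.pi * (t - phase) / period) for t in times]
        tau = kendall_tau(template, values)
        if tau > best_tau:
            best_tau, best_phase = tau, phase
        phase += phase_step
    return best_tau, best_phase
```

Called on a series sampled every two hours, `best_cosine_match(times, values)` returns a tau close to 1 for a clean 24-hour oscillation, together with the estimated peak time. A full analysis would also need a significance test and a scan over candidate periods.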

Supplementary Figure 2: Subnetwork of the Ass1 gene with available time courses. The black node represents the DNA region with the coordinates of the gene, including the flanking promoter. The green mRNA node shows liver transcript levels obtained from DNA microarray experiments. The red enzyme node shows liver protein levels obtained from mass spectrometry experiments. The blue metabolite nodes show liver metabolite levels from mass spectrometry experiments. The brown transcription factors that bind to the promoter region of Ass1 are connected to the black DNA node by grey arrows representing the existence of a binding site. Thicker grey arrows represent binding sites that fall in open chromatin regions (obtained from mouse liver ENCODE data).

Supplementary Methods

CircadiOmics uses a three-tier computing architecture to integrate specific circadian experimental data with relevant background data in order to generate and display tissue-specific networks. This data integration is designed to generate new hypotheses related to circadian rhythms.

1. Specific Circadian Experimental Data

Recently, the interplay between metabolism and circadian rhythms has been studied in various contexts, including the role of hormones as nutrient signals [27] and the reproductive fitness of the organism [26]. While each of these studies looks at a specific piece of the puzzle, understanding the molecular links between metabolism and its regulation by the circadian clock requires a systems approach [28]. To enable such a systems approach, our laboratory and others are generating tissue- and condition-specific metabolomics, transcriptomics, and proteomics high-throughput data related to circadian rhythms. However, no single resource currently stores, integrates, and mines these data. CircadiOmics aims to fill precisely this gap.

CircadiOmics starts by compiling the corresponding time series from these approaches and analyzing them with the statistical package JTK_Cycle [1]. JTK_Cycle implements a nonparametric approach for detecting rhythmic components and estimating periodicities, amplitudes, and time lags. Data from new high-throughput studies of circadian rhythms are integrated into the system as they become publicly available. Hence CircadiOmics aims to provide a platform for comparing time series for transcripts and metabolites, as well as proteins (although multiple-time-point data are currently scarce for proteins), across multiple tissues and multiple conditions.

Table A: List of high-throughput specific circadian experimental data currently in CircadiOmics.

Tissue, Condition | Transcriptome | Metabolome
Liver, Wild-Type | Yes [2,3] | Yes [4]
Liver, Clock Altered | Yes [2] (Clock mutant) | Yes [4] (Clock knock-out)
Liver, High Fat Diet | No | Yes*
Skeletal Muscle, Wild-Type | Yes [2] | No
Skeletal Muscle, Clock Altered | Yes [2] (Clock mutant) | No
Muscle-Tibialis Anterior, Wild-Type | Yes* | Yes*
Muscle-Tibialis Anterior, Bmal1 KO | Yes* | Yes*

Asterisks (*) indicate that the dataset is available internally and will be made available upon publication of the corresponding papers.

2. Background Data

In addition to high-throughput specific circadian experimental data, CircadiOmics extracts information from a variety of biological databases and webservers to build an extensive knowledge base. While all data sets come with intrinsic noise and limitations, it is through their integration that different lines of evidence can be combined to achieve a more accurate view of the underlying network. The table below lists the main data sources and tools that are integrated into CircadiOmics, together with a brief description and the corresponding URL.

Table B: Databases, web services, and tools used by CircadiOmics

Database | Description | URL
TRANSFAC [5] | Databases of transcription factors and transcription factor weight matrices. |
JASPAR [6,7] | Databases of transcription factors and transcription factor weight matrices. |
MotifMap [8,9] | Genome-wide maps of regulatory binding sites. MotifMap uses transcription factor weight matrices and the Bayesian Branch Length Score to assess evolutionary conservation and identify DNA regulatory elements across entire genomes. |
KEGG [10,11] | Metabolic pathway information, knowledgebase for metabolites |
Circa [2,3] | Circadian gene expression profiles |
NCBI Nucleotide [12] | Curated information about mouse mRNA and genes; transcriptional and translational relationships between genes, mRNA, and proteins |
UCSC Genome Browser [13,14] | Mouse genome, alignments, ChIP-seq, and other genomic datasets |
BioGRID [15] | Protein-protein interactions |
IntAct [16] | Protein-protein interactions |
Mint [17] | Protein-protein interactions |
UniProtKB [18] | Protein knowledgebase |
GEO [19] | Gene expression and genomic datasets |
DAVID [20,21] | ID conversions, name resolution, and Gene Ontology enrichment analysis |
Mouse Genome Informatics [22] | Gene knowledgebase |
Cytoscape Web [23] | Tool for network visualization and manipulation, as well as web deployment and display within a web browser |
ENCODE [24] | DNase I hypersensitivity data, ChIP-seq data, etc. |
Proteome [25] | Protein levels from mouse liver |

3. Computing Technology

CircadiOmics uses a three-tier architecture comprising a database backend for information storage and retrieval, a software middle layer for intelligent data integration to build comprehensive networks, and a webserver frontend to provide public access to these networks and to the circadian high-throughput experimental data.

3.1 Database Backend

Information extracted from all the data sources listed in Table B is stored in a MySQL database and on the local file system. The information is updated periodically and automatically to keep the database current. All the available external data sources (e.g. KEGG, BioGRID, IntAct, MotifMap, ENCODE, UCSC Genome Browser) are periodically downloaded and all the CircadiOmics networks are recomputed to reflect any updated information. For example, if a new enzymatic reaction is entered in KEGG, then after the subsequent update the reaction will be displayed in all the CircadiOmics networks where it is relevant. Integration of relevant datasets from published articles requires human curation and is done by members of our team. Furthermore, users can submit new datasets or information that ought to be integrated into the system through the address listed on the website.

3.2 Data Integration and Artificial Intelligence Module

First, using an extensive set of rules, this module integrates information from the background data to produce comprehensive biological maps. These maps are then pruned and customized using the circadian-specific experimental data described in Section 1.
Finally, to manage complexity, these maps are subdivided and stored as subnetworks centered on individual metabolites (for metabolite-centric views) or genes (for gene-centric views). This module is implemented primarily in Python and runs on a large computer cluster.

More specifically, in the first step, all the genes, mRNA transcripts, proteins, and metabolites are instantiated as nodes. The global network is then built progressively by adding the following types of edges:

- TF binding site, between transcription factors and the corresponding target genes
- ChIP-seq or ChIP-ChIP, between transcription factors and the corresponding target genes
- Transcription, between genes and the corresponding gene transcripts
- Translation, between gene transcripts and the corresponding protein products
- Protein-Protein, between different proteins, including transcription factors and enzymes, that physically interact with each other (or self-interactions in the case of, for instance, homodimers)
- Enzyme-Metabolite, between enzymes and the corresponding metabolites that participate in a common reaction
- Metabolite Co-reaction, between metabolites that participate in a common reaction

In the second step, this large graph is modulated using tissue- or condition-specific information. For instance, genes that are not expressed in the tissue of interest are removed, while relevant tissue-specific ChIP-seq and other datasets are added. The time series measurements from the circadian-specific experiments are then extracted and displayed over the remaining nodes. Hence, this step customizes the network for the given tissues and conditions.

In the final step, this large map is divided and stored in the backend database as metabolite-centric or gene-centric subnetworks. The metabolite-centric view focuses on a specific metabolite and shows all the metabolites that react with it, the enzymes that regulate its reactions, and the transcription factors that regulate the expression of these enzymes (Supplementary Fig. 1B). The gene-centric view focuses on a specific gene and visualizes the flow of information starting from the transcription factors that regulate the gene, down to the transcripts and proteins it produces, and the metabolites that these proteins regulate (Supplementary Fig. 1B). In a gene-centric view, the transcripts and proteins associated with the central gene are displayed as separate nodes. However, for other proteins in the same graph, the corresponding transcript and protein nodes are collapsed into a single node to avoid excessive clutter. For any such combined transcript-protein node, the solid curve inside the node represents mRNA levels whereas the dotted curve represents protein levels (when these are available).
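The three steps described above (build a typed global network, prune it with tissue-specific information, and extract a centered subnetwork) can be sketched as follows. This is a minimal illustration under stated assumptions, not the CircadiOmics code: the node names, the toy edge list, the function names, and the hop-based definition of a subnetwork are all hypothetical. The actual subnetworks are defined by biological roles (transcription factors, enzymes, metabolites) rather than by a hop count.

```python
from collections import defaultdict

# Hypothetical toy records: (source, target, edge type). Names are illustrative only.
EDGES = [
    ("TF_A", "GeneX", "TF binding site"),
    ("GeneX", "GeneX_mRNA", "Transcription"),
    ("GeneX_mRNA", "EnzymeX", "Translation"),
    ("EnzymeX", "MetaboliteM", "Enzyme-Metabolite"),
    ("MetaboliteM", "MetaboliteN", "Metabolite Co-reaction"),
    ("TF_B", "GeneY", "TF binding site"),
    ("GeneY", "GeneY_mRNA", "Transcription"),
]


def build_network(edges):
    """Step 1: create nodes implicitly and record typed edges in both directions."""
    graph = defaultdict(set)
    for source, target, kind in edges:
        graph[source].add((target, kind))
        graph[target].add((source, kind))
    return graph


def prune(graph, removed_nodes):
    """Step 2: drop nodes (e.g. genes not expressed in the tissue) and their edges."""
    removed = set(removed_nodes)
    return {
        node: {(nbr, kind) for nbr, kind in nbrs if nbr not in removed}
        for node, nbrs in graph.items()
        if node not in removed
    }


def subnetwork(graph, center, max_hops):
    """Step 3: collect every node within max_hops edges of the central node."""
    seen = {center}
    frontier = [center]
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for nbr, _kind in graph.get(node, ()):
                if nbr not in seen:
                    seen.add(nbr)
                    next_frontier.append(nbr)
        frontier = next_frontier
    return seen
```

For example, `subnetwork(prune(build_network(EDGES), ["GeneY"]), "MetaboliteM", 3)` collects the co-reacting metabolite, the enzyme, its transcript, and its gene around the hypothetical metabolite `MetaboliteM`. Storing one such node set per metabolite or gene mirrors the idea of precomputing centered views so that the browser never needs the entire graph.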
In all cases, blue curves correspond to WT

3.3 Webserver

The webserver allows users to visualize and interactively manipulate the metabolite-centric and gene-centric subnetworks within their web browser, removing the need to download the entire graph. This module is implemented using Apache, Python CGI, and Cytoscape Web [23]. Users can also search the high-throughput experimental data and interactively plot time courses across different conditions and tissues. For instance, Supplementary Fig. 1A shows a screenshot from the CircadiOmics website for the metabolite Uracil. Uracil is rhythmic in the wild-type liver, whereas it is arrhythmic in the Clock-/- liver and shows no oscillation in the muscle. Users can click on View Network to interactively visualize the Uracil-centric subnetwork, showing the enzymes and metabolites directly related to Uracil and the transcription factors that regulate those enzymes.

References:

[1] Hughes, M. E., Hogenesch, J. B., Kornacker, K., Journal of Biological Rhythms 25, 372 (2010).
[2] Miller, B. H. et al. Proceedings of the National Academy of Sciences of the United States of America 104(9), February (2007).
[3] Hughes, M. E. et al. PLoS Genetics 5(4), e April (2009).
[4] Eckel-Mahan, K. L., Patel, V. R., Mohney, R. P., Vignola, K. S., Baldi, P. & Sassone-Corsi, P., Proceedings of the National Academy of Sciences of the USA, in press, (2012).
[5] Matys, V. et al. Nucleic Acids Research 34, D108-D110 (2006).

[6] Sandelin, A., Alkema, W., Engström, P., Wasserman, W. W. & Lenhard, B. Nucleic Acids Research 32, D91-94 (2004).
[7] Portales-Casamar, E. et al. Nucleic Acids Research 38, D105-D110 (2010).
[8] Xie, X., Rigor, P., and Baldi, P. Bioinformatics 25(2), January (2009).
[9] Daily, K., Patel, V., Rigor, P., Xie, X., and Baldi, P. BMC Bioinformatics 12(1), 495+ (2011).
[10] Kanehisa, M. and Goto, S. Nucleic Acids Research 28(1), January (2000).
[11] Kanehisa, M., Goto, S., Sato, Y., Furumichi, M., and Tanabe, M. Nucleic Acids Research 40(Database issue), D109-D114, January (2012).
[12] Pruitt, K. D., Tatusova, T., Klimke, W., and Maglott, D. R. Nucleic Acids Research 37(Database issue), D32-D36, January (2009).
[13] Kent, W. J. et al. Genome Research 12(6), June (2002).
[14] Fujita, P. A. et al. Nucleic Acids Research 39(suppl 1), D876-D882, October (2010).
[15] Stark, C. et al. Nucleic Acids Research 34(Database issue), D535-D539, January (2006).
[16] Kerrien, S. et al. Nucleic Acids Research 40(D1), D841-D846, January (2012).
[17] Ceol, A. et al. Nucleic Acids Research 38(Database issue), D532-D539, January (2010).
[18] The UniProt Consortium. Nucleic Acids Research 40(D1), D71-D75, January (2012).
[19] Edgar, R., Domrachev, M., & Lash, A. E. Nucleic Acids Research 30(1), January (2002).
[20] Huang, D. W., Sherman, B. T., and Lempicki, R. A. Nat. Protocols 4(1), December (2008).
[21] Huang, D. W. et al. Nucleic Acids Research 35(suppl 2), W169-W175, July (2007).
[22] Blake, J. A. et al. Nucleic Acids Research 39(suppl 1), D842-D848, January (2011).
[23] Lopes, C. T. et al. Bioinformatics 26(18), July (2010).
[24] ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science Oct 22;306(5696):
[25] Reddy, A. B. et al. Circadian Orchestration of the Hepatic Proteome, Current Biology, 16(11),
[26] K. Xu, J. R. DiAngelo, M. E. Hughes, J. B. Hogenesch, A. Sehgal, Cell Metabolism 13, 639 (2011).
[27] C. B. Peek, K. M. Ramsey, B. Marcheva, J. Bass, Trends in Endocrinology and Metabolism: TEM (2012).
[28] J. E. Baggs, J. B. Hogenesch, Current Opinion in Genetics & Development 20, 581 (2010).