The PHOENIX Center: the hub of proteomics in the age of big data

Fuchu He

National Center for Protein Sciences (Beijing), 38 Life Science Park Road, Changping District, Beijing, China.

Beijing Proteome Research Center (BPRC), a non-profit research organization, was established a decade ago, when proteomics and proteome research had just started. The proteome is a collective term for all proteins in a cell or tissue, and proteomics is the science of studying the proteome. In the past ten years, BPRC has spearheaded proteomic research and development in China and has come to the realization that a new facility is needed because of the ever-increasing demand for powerful mass spectrometry. BPRC was awarded an infrastructure grant of US$102.4 million from the central government to lead the design and construction of the National Center for Protein Sciences (Beijing), which will focus on proteomics and has been dubbed the PHOENIX Center (Pilot Hub Of ENcyclopedic proteomIX). The PHOENIX Center is mandated by the funding agency to strengthen and facilitate ongoing protein science research in China, contribute to national bio-economic growth, provide a support system for protein scientists, and stimulate and attract investigators from China and around the world to apply proteomic approaches in their own research areas. The PHOENIX Center is housed in a newly constructed, 400,000-square-foot research building at Zhongguancun Life Science Park (ZLS Park) in the northwestern suburbs of Beijing. After the official launch of the PHOENIX Center on 14 October 2015, BPRC became the PHOENIX Center's sister institute and an important partner in talent exchange, and the two institutions will share a common cause in the protein sciences.

Big science, big data and lifeomics

The twentieth century was remarkable in the history of science and technology in many ways. It was the century when the human race entered the information age, an age made possible by the revolution in computation hardware and software. It was also the century in which, for the first time in history, humans started to design, organize and execute projects in science and technology that were too enormous and complex to be carried out by individuals. The success of big projects such as the Manhattan Project and the Apollo Project not only has had a far-reaching impact on our society and our way of life, but has also demonstrated ways of approaching scientific and technological research that had been unthinkable in the past. We call it big science. A direct consequence of big science is big data: the production of data in quantities so huge that they exceed our imagination. And yet grand discoveries can be, and are being, made with big science and big data.

Human life as a biological system is the most complex physical system known thus far. From a reductionist point of view, the human body can be divided into four layers: organs, tissues, cells and molecules. From there, the subject of biomedical research can be divided into anywhere from tens to millions of individual parts, depending on the layer being viewed. By taking the reductionist approach to the extreme, we can get to the bottom of the problem of cellular composition and turn it into a theory or a general principle. From a systems point of view, to understand the physiology and pathophysiology of the human body it is imperative that we study not only the multi-layered human biology, but also the complex physical and chemical factors of the human environment, the numerous symbiotic microbes of human ecology, and the various psychosocial factors of human society. Such a holistic approach seeks to integrate all constituents into one single system, in which a simple question can lead to a highly complex project.

Figure 1: An artist's impression of the PHOENIX Center building.

The completion of the Human Genome Project (HGP) ushered in a new era in the life sciences, an era that is witnessing the rise of the holistic approach at the molecular level. The births of genomics, transcriptomics, metabolomics and proteomics, which study all of the DNA, RNA, small metabolites and proteins in a cell or organ, are pushing biological research to a level beyond the reductionist approach. The reductionist approach begins with the 'ome' and ends with 'o, m, e', whereas the holistic approach begins with 'o, m, e' and ends with the 'ome' [1]. Now we have come to the realization that studying one gene or protein at a time is far from sufficient in our quest to interpret and understand the phenomenon of life. It is the collective function of all molecules in a cell, including DNA, RNA, proteins and small molecules, that makes a cell a cell, with its unique characteristics and functions. Without a full description of the molecules and their dynamics in a cell, one would not be able to understand the cell as the basic unit of life. It is time to refer to genomics, transcriptomics, metabolomics and proteomics collectively as lifeomics. Lifeomics will intrinsically produce big data; its full potential will not be realized unless we look at it from the big data perspective and study it with tools from other scientific disciplines, especially mathematics and chemistry. Although still in its infancy, lifeomics will lead to grand scientific discoveries not only in our lifetime, but also in the future.

Among the lifeomics disciplines, proteomics is relatively young and less developed than genomics, but it is predicted to take centre stage because proteins execute the fundamental functions of life. Instructions coded in DNA and RNA are decoded into proteins; proteins and their modified forms reflect the physiological and pathological states of an organism. Proteins, as the last component of the central dogma of molecular biology, are the key to understanding the dynamics of life. It is not surprising that proteomics is rapidly evolving into the largest data generator after DNA and RNA sequencing.

A rich collection of omics data has been generated by lifeomics. With the cost of omics experiments per data unit rapidly shrinking, and the throughput of omics technology platforms steadily increasing, the volume of omics data has exploded. The rapid growth of omics data in type and volume, further complicated by their complex relationships and the diversity of biological systems, has brought us, even at this early stage of omics studies, an unprecedented informatics challenge in data collection, storage, processing, analysis, distribution and application. We need to take a data-driven discovery approach to translate both reductive and holistic experimental measurements into information, scientific knowledge and, ultimately, biomedical applications.

If the past ten years of research and development at BPRC have taught us anything, it is that proteomics produces big data, and that we must approach it with a set of tools different from those traditionally used in the biological sciences.

Big projects: HPP and HLPP

After the human genome sequence was uncovered, protein scientists around the world recognized the opportunity to decipher the human proteome and formed the Human Proteome Organization (HUPO) [2]. A prototypical human proteome project (HPP) was initiated as an international collaboration with the ultimate goal of systematically mapping the entire human proteome using currently available and emerging techniques such as mass spectrometry, antibodies and knowledge bases (the three pillar technology platforms of the HPP) [3,4]. Chinese protein scientists played a very active and significant role in initiating and leading the HPP. We led the Human Liver Proteome Project (HLPP) [5,6], the first human organ proteome project. This was the first international scientific project led by a Chinese team. We designed and advocated the road map for the HLPP (that is, two profiles: expression and modification; two maps: interaction and localization; three repositories: sample, antibody and data; and two outputs: physiology and pathology), which was presented at the first HUPO Workshop for Human Proteome Initiatives (Bethesda, Maryland, US, April 2002). Our first version of the human liver proteome (HLP), which took five years to complete, comprised 6,788 proteins and was the largest dataset for any human organ proteome [7,8]. This HLPP dataset showed that proteins involved in liver-specific functions, such as bile transport, bile acid synthesis and bilirubin metabolism, are well represented, as expected. Proteins involved in metabolism, nutrient transport and blood coagulation, as well as the complement system, constitute the majority of the HLP dataset. By utilizing this reference map and yeast two-hybrid technology, we drafted a protein–protein interaction map of liver proteins. A network of 3,484 interactions among 2,582 proteins was identified [9]. This dataset represented the first comprehensive description of the human liver protein interaction network. The Chinese team also developed a high-efficiency antibody and a workflow to enrich acetylated peptides from liver samples, and found that lysine acetylation is a prevalent modification of enzymes involved in metabolism, including almost all of the enzymes in glycolysis, gluconeogenesis, the tricarboxylic acid cycle, the urea cycle, fatty acid metabolism and glycogen metabolism [10]. These findings established lysine acetylation as another major form of post-translational modification that rivals phosphorylation.

Big centre: Introduction to the PHOENIX Center

In the past decade, China has undergone phenomenal growth in proteomics research, encompassing both methodology development and the biomedical application of these techniques. Proteomics was identified as a national strategic priority area targeted for development, and the nation is in an exciting expansion phase for research and development in science and technology. Big projects, including the HLPP and the HPP, call for big research centres that have a top-down design to manage and execute the projects. Uniform experimental protocols and data quality control can be implemented in such centres. Currently, proteomics research in China is conducted in multiple institutions, programmes and centres with substantial redundancy and suboptimal support.
The PHOENIX Center will facilitate close interaction and collaboration among these diverse research entities. As current proteomics methods, especially high-throughput technologies, are technically demanding and prohibitively expensive for individual investigators, the establishment of key proteomics technology platforms in the PHOENIX Center will encourage and promote protein research in China, support existing protein science investigators, attract and recruit new investigators to benefit from sophisticated proteomics techniques, and create new partnerships and stimulate the development of new projects in biomedical research. We aim to build a research centre that (1) integrates the capacity for discovery, validation, functional analysis and translational medicine research, (2) encourages and validates the practice of discovery-driven research, (3) produces standardized, high-quality data from which knowledge can be extracted and mined, (4) solves important biomedical problems to impact human health, and (5) shares and disseminates information and knowledge to the world.

To accomplish these goals, we have designed the PHOENIX Center with two central components: (1) technology platforms, which include analytical proteomics, functional proteomics and bioinformatics, and (2) support facilities. These two central components will support the scientific activities of the centre, which include intramural and extramural research and academic activities (Figure 2). Researchers in these components will work closely with scientific investigators to promote collaboration, scientific exchange and programme enrichment. Proven and mature state-of-the-art technologies will be employed as the pillar technologies of these platforms. However, equal attention will be paid to keeping up with the development of cutting-edge technologies, so that research undertaken at the PHOENIX Center will remain relevant and be of world-class quality.

Two highlights of the PHOENIX Center are its proteomics and bioinformatics platforms. With a high concentration of modern high-resolution, high-mass-accuracy mass spectrometers, the proteomics platform has the ability to produce proteomes on a daily basis at a proteome coverage of 6,000 to 8,000 gene products. The bioinformatics platform has set up the hardware to meet these big data challenges, featuring a 200+ node high-performance computing (HPC) cluster with a peak computing power of 200 teraflops for distributable computing tasks; a group of 64-core servers for RAM-intensive sequential computational applications such as databases; and a 4-petabyte (PB) tiered network storage system connected to both of these computing systems by high-speed networks. The suite of system software tools deployed (or to be deployed) includes a laboratory information management system to facilitate data and metadata capture; a task scheduler to support concurrent computing tasks and workflows running in the HPC environment; and a distributed computing ecosystem, such as the Apache Hadoop framework, to provide a big-data-friendly environment for large and complex datasets. A Galaxy-based proteomics data analysis platform standardizes the data processing and analysis pipelines in a high-throughput environment. Galaxy is a scientific workflow, data integration, analysis and publishing platform that aims to make computational biology accessible to research scientists who do not have computer programming experience.
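To illustrate how such a standardized, Galaxy-based pipeline can be driven programmatically, the Python sketch below submits a single raw dataset to a Galaxy server through the open-source BioBlend client and invokes a shared workflow. This is a minimal illustration only: the server URL, API key, workflow name and file name are hypothetical placeholders, not PHOENIX resources.

    # Minimal sketch of scripting a Galaxy-based analysis with BioBlend.
    # The URL, API key, workflow name and input file are placeholders.
    from bioblend.galaxy import GalaxyInstance

    gi = GalaxyInstance(url="https://galaxy.example.org", key="YOUR_API_KEY")

    # Create a history to hold the results for one sample.
    history = gi.histories.create_history(name="CNHPP_sample_001")

    # Upload a converted raw file (for example, mzML) into that history.
    upload = gi.tools.upload_file("sample_001.mzML", history["id"])
    dataset_id = upload["outputs"][0]["id"]

    # Find the shared, standardized identification/quantification workflow
    # by name and invoke it with the uploaded dataset as its single input.
    workflow = gi.workflows.get_workflows(name="standard_proteome_pipeline")[0]
    invocation = gi.workflows.invoke_workflow(
        workflow["id"],
        inputs={"0": {"id": dataset_id, "src": "hda"}},
        history_id=history["id"],
    )
    print("Workflow invocation:", invocation["id"])

Because every sample passes through the same named workflow with the same parameters, results from different laboratories and cohorts remain directly comparable.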

Figure 2: The overall structure of the PHOENIX Center. Technology platforms (proteomics, functional proteomics and bioinformatics) and support facilities (animal facility and biobank) underpin intramural research (analytical proteomics, functional proteomics, preclinical translational medicine and bioinformatics), extramural research, and academic activities (programme utilization and enrichment, requests for applications, a pilot and feasibility programme, and seminar and training programmes).

Novel data analysis and mining algorithms and tools are being developed by bioinformaticians to facilitate the transformation of experimental data into biomedical knowledge. A sophisticated database system is required to meet the big data information management and retrieval requirements of big projects such as the Chinese Human Proteome Project (CNHPP). Component databases include a raw data repository as the data source for reanalysis; a metadata repository essential for data interpretation and annotation; and a knowledge base for an integrated view of well-annotated, quantitative experimental results. The internal knowledge discovery gateway is the web-based primary entry point for access to HPP-related information and tools. It provides an integrated browsing and search interface for the HPP raw data repository, analyzed proteomics data, reference databases and the knowledge base; an interface to the standard data analysis platform; and a collection of data mining tools for discovery.

Based on submissions to the international ProteomeXchange database collaboration, China is already the fourth-largest proteomics data producer worldwide. The extensive computing resources at PHOENIX will allow workflows to progress rapidly through data processing, which is increasingly the rate-limiting step in modern proteomics. A PHOENIX national node of the international ProteomeXchange collaboration will provide an efficient, high-quality resource of public proteomics data to the world, as well as optimize national access to global proteomics data [11]. The critical mass of PHOENIX expertise will foster a much stronger contribution of national proteomics science to international efforts, not only in the context of the Human Proteome Project, but also in standardization efforts such as the HUPO Proteomics Standards Initiative. Exploiting the excellent facilities in the new PHOENIX building, we will develop the centre into a hub of international exchange, communication and professional training for the local, national and international proteomics community.

Although a large amount of data and knowledge has been generated in healthcare and biomedical research in recent decades, progress in translating those exciting findings into novel diagnostics and therapies has been very limited. The development of an effective therapeutic molecule is still regarded as an intimidating endeavour because of the high failure rate of target selection and validation. The goal of the division of translational medicine in the PHOENIX Center is to transform this process by bridging proteomics and informatics for innovation in the biomarker and drug discovery process. In the lifeomics era, big data analysis of genetic links and biological networks in disease is transforming the biomarker and drug research and development strategy from the screening of isolated targets to integrated network analysis of clusters of potential target genes.
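One simple, generic form of such network analysis is to prioritize candidate target genes by how densely they are connected within a protein–protein interaction network. The Python sketch below does this with the open-source NetworkX library on a toy network; the interactions listed are illustrative and do not come from CNHPP data.

    # Toy illustration of network-based target prioritization: rank genes by
    # degree centrality in a small, illustrative protein-protein interaction network.
    import networkx as nx

    # Illustrative interactions among candidate genes.
    interactions = [
        ("TP53", "MDM2"), ("TP53", "EP300"), ("TP53", "ATM"),
        ("MDM2", "MDM4"), ("EGFR", "GRB2"), ("EGFR", "ERBB2"),
        ("GRB2", "SOS1"), ("ATM", "CHEK2"),
    ]

    network = nx.Graph()
    network.add_edges_from(interactions)

    # Degree centrality: the fraction of other nodes each gene interacts with.
    centrality = nx.degree_centrality(network)

    # Highly connected genes are candidate hubs worth deeper functional study.
    for gene, score in sorted(centrality.items(), key=lambda item: -item[1])[:5]:
        print(f"{gene}\t{score:.2f}")

Real analyses would of course combine much richer evidence, but the same network-centric logic scales to proteome-wide interaction maps such as the human liver network described above.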
We believe that close collaboration among the different divisions of the PHOENIX Center, and with other research organizations, will offer unprecedented opportunities for the creation of effective diagnostic and therapeutic molecules.

For example, cancer immunotherapy is a major focus of drug development. Genomic, proteomic and functional research data from broad and in-depth analyses of mutations in cancer driver genes, and of how the immune system tracks those mutations, will be a great resource for identifying novel targets for the development of biomarkers and therapeutics at the PHOENIX Center.

The first phase of the CNHPP is being executed in the PHOENIX Center. The CNHPP aims to build an encyclopedia of human proteomes in health and disease. The pilot experiments of the initial phase are to map the proteomic landscapes of liver, lung and gastric cancers. The management and execution of the CNHPP epitomize the concepts behind big projects, big centres, big data and grand discoveries. We took a centralized approach: uniform experimental protocols are used to prepare tumour samples, uniform instrument parameters are used to acquire mass spectrometry data on the proteomics platform, and the data are processed for protein identification, quantification and bioinformatics analysis on the bioinformatics platform. As all the experiments and analyses are done in the same centre with standardized protocols and analysis algorithms, we are starting to see directly comparable, uniform lifeomics datasets that cover three cancers, leaving ample room for imagination in the next stage of big data analysis and integration.

Because the PHOENIX Center is a protein-centric lifeomics research facility and significant computational infrastructure has been laid down as its bioinformatics platform, an upgrade of its big data capability is being considered. The PHOENIX Center has acquired another important role in building up Chinese big data capacity: the National Center for Omics Data (NCOD). Once built, NCOD will have a database system with more than 100 databases of various kinds, hosting 1,000 genomics and proteomics experimental datasets at the petabyte level, and 15 or more data analysis pipelines. It will become the Chinese hub for biomedical data archiving, sharing and dissemination.

REFERENCES

1. He, F. Lifeomics leads the age of grand discoveries. Sci. China Life Sci. 56 (2013).
He, F. Human liver proteome project: plan, progress, and perspectives. Mol. Cell Proteomics 4 (2005).
7. Sun, A. et al. Liverbase, a comprehensive view of human liver biology. J. Proteome Res. 9 (2010).
9. Wang, J. et al. Toward an understanding of the protein interaction network of the human liver. Mol. Syst. Biol. 7, 536 (2011).
10. Wang, Q. et al. Acetylation of metabolic enzymes coordinates carbon source utilization and metabolic flux. Science 327 (2010).