INTERACTIVE E-SCIENCE CYBERINFRASTRUCTURE FOR WORKFLOW MANAGEMENT COUPLED WITH BIG DATA TECHNOLOGY


Denis Nasonov 1, Alexander Visheratin 1, Konstantin Knyazkov 1, Sergey Kovalchuk 1
1 ITMO University, Russian Federation

15th International SGEM GeoConference

ABSTRACT

The paper presents a technology for building e-science cyberinfrastructure that enables integration of a regular cloud computing environment with Big Data facilities and stream data processing. The developed technology is aimed at supporting uniform dynamic interaction with the user during composite application building and execution, as well as during result analysis. The core concept of the proposed approach is based on a set of domain-specific knowledge, including descriptions of a) the semantics of problem-domain objects; b) the software and data services used; c) data formats and access protocols. Linking all these knowledge parts together facilitates automatic resolution of technological integration issues. It enables providing the user with high-level domain-specific tools for describing complex tasks, which can be automatically translated into particular calls of cloud computing services or Big Data analytics tasks. The developed technology uses the interactive workflow (IWF) technique to interconnect services of different kinds: computation, data analytics, external data sources, and interactive visualization systems.

Keywords: Big Data, workflow, cloud computing, e-science, cyberinfrastructure

INTRODUCTION

The contemporary e-science toolbox is often built around the concept of workflow and workflow management systems [1], which are focused on providing high-level access to computational resources, usually organized within a grid or cloud computing infrastructure [2]. Geoinformatics is one of the scientific areas that needs such systems, especially meteorology and hydrometeorology, as well as GIS (Geographic Information Systems). On the other hand, contemporary tasks are often related to the processing of large data sets (see, e.g., the idea of the Fourth Paradigm in science by Microsoft [3]). Today a set of technologies for processing large data arrays is being intensively developed within the Big Data area [4], which raises a set of new issues related to solving scientific tasks on existing infrastructure. One of the ideas behind the Big Data principles is implementation of the code-to-data approach (moving pieces of the application to the resources where the data is stored) instead of the more common data-to-code approach (transferring data and parameters to the computational resource) [5]. This difference between the paradigms leads to the need for a joined architecture that provides the capability to develop solutions
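To make the contrast concrete, the two paradigms can be sketched in a few lines of Python; the node classes below are illustrative stand-ins, not part of any actual platform:

```python
# Illustrative sketch (not a real platform API) contrasting data-to-code,
# where the dataset moves to the compute resource, with code-to-data,
# where a small piece of code moves to the nodes that store the data.

class ComputeNode:
    def __init__(self):
        self.local_data = []

    def upload(self, dataset):
        self.local_data = list(dataset)  # simulate transferring the data

    def execute(self, task):
        return task(self.local_data)

class StorageNode:
    def __init__(self, partition):
        self.partition = partition       # data already resides here

    def execute(self, task):
        return task(self.partition)      # only the code is shipped

def run_data_to_code(task, dataset, node):
    node.upload(dataset)                 # full data transfer
    return node.execute(task)

def run_code_to_data(task, nodes, combine):
    partials = [n.execute(task) for n in nodes]  # runs where the data lives
    return combine(partials)             # only small partial results move

if __name__ == "__main__":
    data = list(range(10))
    total = run_data_to_code(sum, data, ComputeNode())
    distributed = run_code_to_data(sum, [StorageNode(data[:5]), StorageNode(data[5:])], sum)
    print(total, distributed)  # both compute the same result: 45 45
```

In the data-to-code case the full dataset crosses the network; in the code-to-data case only the small task and the partial results do, which is what makes the latter attractive for large data arrays.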

exploiting both paradigms (see, e.g., [6], [7]). Still, most coupled solutions divide the parts of composite applications (e.g., by providing the Big Data infrastructure as one of the services available for incorporation into the workflow). The goal of the presented work is to develop and implement a platform architecture that enables seamless integration of different resources (computational resources within an infrastructure, data storages, and data services) within a single composite application defined at a high, domain-specific level according to the basic requirements of the e-science area.

COUPLING BIG DATA WITH WORKFLOWS

Requirements analysis. After considering e-science tasks (mainly within the simulation-based approach) and current challenges within this area [4], [8], we have defined the following issues to be managed by the developed platform.
1. The proposed solution should provide the capability to develop high-level task descriptions without explicit reference to a particular architecture or data processing style.
2. The developed platform should incorporate different classes of resources: computational resources, data storages, services, etc. All resource management procedures should be performed implicitly, in an automatic way.
3. The resources should have high-level unified access at the workflow level, with further automatic separation and translation into particular data-to-code or code-to-data requests.
4. Tasks should be processed dynamically, with real-time exchange of data sets and parameters enabled. This should involve exploiting data streams to support data processing immediately after it appears.
5. Interaction with the user should be performed in a unified (domain-neutral) way, with the user involved in the simulation process. This is a further step toward system-level exploration [9] as a next-generation way of solving e-science tasks.
Additionally, considering the specifics of Big Data technology within e-science tasks, the following requirements can be defined:
6. The platform should support integration of various data sources with different formats, access protocols, usage rules, etc. Semantically identical data should be available to the user (within the scope of a workflow) in a unified way.
7. Data analytics tools for processing large arrays of data should be integrated implicitly, in an automatic way (i.e., without direct coding of MapReduce procedures or similar development activities).
8. The platform should provide the capability to integrate specific Big Data visualization tools that can act interactively as part of a workflow executed using the platform.

Technological background. To develop the platform according to the proposed requirements, the following technologies and concepts can be used as a basis. The iPSE (Intelligent Problem Solving Environment) concept [10] was developed to provide a knowledge-based conceptual framework for solving e-science tasks using merged knowledge from three basic domains: the task-specific problem domain, IT knowledge, and the simulation domain. The conceptual hierarchy of expressive technologies [11], which

organizes and integrates a set of domain-specific languages with textual or graphical notations for expressing and using knowledge from different problem domains, enables automatic processing of e-science tasks. The CLAVIRE cloud computing platform [12] enables high-level abstract workflow definition and execution. The platform uses a set of knowledge-based technologies to describe the available software and hardware resources as well as the domain-specific objects to be investigated within e-science tasks. The IWF (Interactive Workflow) technology [13] was created to ensure real-time data exchange within the workflow during execution. It extends the basic workflow concept by introducing ports that enable data exchange via data streams. Additionally, it supports building interactive simulation environments that involve human-computer interaction as part of the workflow. The VSO (Virtual Simulation Objects) concept and technology [14] was developed to organize a high-level domain-specific simulation environment in which the user can describe the investigated system by its structural semantic model (as a set of interconnected objects), which in turn can be automatically translated into an executable workflow structure. A dynamic DSL for Big Data analytics [15] is being developed to support high-level description of Big Data analytic requests using a set of domain-specific libraries that extend the basic structure of the language. It can be used as an intermediate language for translating parts of a coupled workflow into Big Data analytics requests.

Workflow classification. Generally, workflows for e-science tasks fall into four major categories (examples of workflows from several categories are shown in Fig. 1).
1. Targeted (local) workflows. Workflows in this category run locally on dedicated resources. If any software forming a workflow step is missing on the resource, it is automatically deployed from the package repository.
Targeted WFs, or locally operated WFs, are tightly coupled with predefined specific conditions of the local computational environment that force them to be executed on the local resource. The following cases clearly show such conditions:
a. Data-driven WFs, which can be processed efficiently only on data-storing nodes, in order to save time on data-copying overheads;
b. WFs that must take into account a certain level of confidentiality. It may be important to run a WF locally when unique information should be processed only on the owner's resource;
c. Offline WFs that can be executed on resources that are only periodically available to the system, such as laptops, smartphones, and so on.
2. Computational (traditional) workflows. These workflows are executed on the most profitable resources according to the system's multi-criteria scheduling algorithm, based on the available computational facilities. These WFs are typical in the most widespread cases, among them: (1) high-performance computing tasks that require significant computational resources; (2) WFs with many fork-join structures that should be processed immediately; (3) WFs that contain a combination of unique software packages deployed on different resources; (4) collaborative WFs with a group of people engaged in solving one global task.

Figure 1. Different types of workflows in use

3. System workflows. According to their purpose, system workflows serve as functional elements that support consistent, continuous platform operation. This is significant for activities such as external data source and infrastructure monitoring. System WFs should provide online support of the platform infrastructure or of systems built on top of the platform. The crucial aspect for them is uninterrupted and robust execution over time: even if the platform fails, they should continue running on dedicated resources and restore availability after system recovery. Common cases are: (a) processing WFs, which help to process incoming data from external resources on a periodic or permanent (streaming) basis; (b) monitoring WFs, which are used for infrastructure monitoring based on the analysis of environment parameters.
4. Hybrid workflows combine the capabilities of all three previous classes. Hybrid WFs are used in practice for implementing complex system solutions and may combine the benefits of all the mentioned approaches. A prototype of a flood prevention system built as a hybrid workflow is presented as an example below.

IMPLEMENTATION DETAILS

Solution architecture. Considering the proposed idea of coupling Big Data technology with a computational workflow management infrastructure, the architecture of the platform (see Fig. 2) was developed. It meets all the proposed criteria and requirements. The architecture is based on components of the CLAVIRE platform and extends its functional capabilities with high-level processing technologies using a set of metadata storages. The main idea of the proposed architecture focuses on several issues that are important to manage.

Figure 2. Architecture of the platform (user interfaces: parameter management GUI CLAVIRE/POI, workflow management GUI CLAVIRE/Ginger, interactive visualization; metadata: domain semantics CLAVIRE/VSO, software descriptions CLAVIRE/PackageBase, resource descriptions CLAVIRE/ResourceBase, data format descriptions; data management: data collecting CLAVIRE/Crawler, distributed data storage CLAVIRE/Storage; interactive execution management; resource management CLAVIRE/Executor; execution services on available resources and data storing nodes)

Firstly, the architecture is developed to automatically interconnect different classes of available resources: (a) regular cloud computing nodes (execution services), which usually execute available software using data provided by the user and transferred to those services (data-to-code); (b) distributed data storage nodes, which can be used either for general data storage or for distributed Big Data processing using transferred code (code-to-data) and local software; (c) data sources, which can be processed either as streaming sources or as regular external data storages. The seamless integration of all these classes of resources within a single composite application is the main goal of the Data management subsystem within the architecture. This subsystem controls all the data streams within the platform as well as the background management of the data kept in the storage (including crawling data from external data sources and managing replication within the distributed storage).
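Routing of composite-application parts across these resource classes can be pictured as a small dispatch step; a schematic sketch (the step structure and "kind" tags are illustrative assumptions, not the platform's actual request format):

```python
# Schematic sketch of routing composite-application parts across the
# resource classes above: computing requests go to cloud nodes
# (data-to-code), Big Data requests go to storage nodes (code-to-data).
# The step dictionaries and tags are illustrative assumptions.

def decompose(composite_app):
    """Split a composite application into per-class request lists."""
    computing, bigdata = [], []
    for step in composite_app:
        if step["kind"] == "bigdata":
            # code block to be shipped to the distributed storage
            bigdata.append({"code": step["body"], "target": "storage-nodes"})
        else:
            # regular package call executed on cloud computing nodes
            computing.append({"package": step["body"], "target": "cloud-nodes"})
    return computing, bigdata

if __name__ == "__main__":
    app = [
        {"kind": "compute", "body": "swan_model"},
        {"kind": "bigdata", "body": "pattern_search"},
        {"kind": "compute", "body": "plan_maker"},
    ]
    computing, bigdata = decompose(app)
    print(len(computing), len(bigdata))  # 2 1
```

In the real architecture this routing is performed implicitly by the platform; the sketch only illustrates the separation of the two request kinds.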
Secondly, to support unified work with a composite application regardless of the nature of its parts (data-to-code or code-to-data), the user interface of the CLAVIRE platform (including the workflow management system and visualization toolbox), as well as the workflow language used (EasyFlow), should be extended with high-level blocks that enable automatic interpretation of the joined composite application. The composite application in that case should include (a) the workflow structure; (b) the corresponding data and parameters to be taken from the user; (c) implicit code (composed from EasyFlow constructs) to be transferred to the distributed storage as Big Data requests along with the required parameters. The decomposition and two-way interpretation of the composite application are performed by the Interactive execution management subsystem. This subsystem controls all the parts of the composite application and interconnects them with the help of the IWF technology.
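The port-based interconnection that IWF contributes can be illustrated with a minimal stream-processing sketch; the Port and FilterBlock classes are hypothetical, chosen only to show how a block reacts to each arriving item rather than to a fully staged input file:

```python
# Minimal illustration of the IWF idea: workflow blocks exchange data
# through ports backed by streams, so items are processed as they
# arrive instead of after the whole input is staged. Names are hypothetical.
from queue import Queue

class Port:
    """A named stream endpoint attached to a workflow block."""
    def __init__(self):
        self._q = Queue()

    def send(self, item):
        self._q.put(item)

    def receive(self):
        return self._q.get()

class FilterBlock:
    """A workflow step that consumes its input port item by item."""
    def __init__(self, predicate):
        self.input, self.output = Port(), Port()
        self.predicate = predicate

    def step(self):
        item = self.input.receive()      # reacts to each arriving item
        if self.predicate(item):
            self.output.send(item)       # passes it downstream immediately

if __name__ == "__main__":
    block = FilterBlock(lambda x: x >= 0)
    for value in [3, -1, 7]:
        block.input.send(value)
        block.step()
    # only the non-negative readings pass through the streaming filter
```

Because ports decouple producers from consumers, the same mechanism can feed a block from a sensor stream, from another block, or from an interactive user session.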

Finally, all the processes are supported by the knowledge libraries residing in the metadata storage. The following knowledge libraries are derived from the basic CLAVIRE platform: (a) descriptions of the available computational resources (ResourceBase); (b) descriptions of the software packages available for calling in batch or interactive mode (PackageBase); (c) high-level domain-specific objects that enable description of the investigated system (VSO technology). Additionally, a library describing the processing of data in different formats is introduced within the developed platform to support automatic data management.

Solution development. To implement the platform described above, with an architecture based on the mentioned technologies and the CLAVIRE core, several main functional blocks should be highlighted for the implementation phase.
Data unification module. This block includes services that provide all functionality connected with data processing, such as: (a) a template-based approach to organizing external data management in monitoring workflows, covering data acquisition, interpretation, and organization of storage; (b) a unified data management service to be used for metadata operations; (c) integration of the package base parameter descriptions with the created unified database types to provide seamless data usage throughout the whole platform.
EasyFlow extension. This block is crucial for workflow development and use of the extended functionality. Besides the required condition and procedure operations, one of the most important features is the ability to issue Big Data requests with embedded data WFs from a traditional computational WF. Another valuable feature is online workflow modification.
Data WF module. In order to provide local WF execution on distributed storage nodes, an assembly of supporting services is needed; a distributed software deployment service, a resource service extension, and a CLAVIRE storage plug-in are some of them.
IWF.
There is a need to provide steering capability and online data monitoring to involve users in the execution process when the system solution can be improved by the user's interactive activity. Support of IWF reconfiguration is also necessary to balance and dynamically change the environment.

FLOOD PREVENTION SYSTEM EXAMPLE

The flood prevention application in Fig. 3a represents a prototype of an Early Warning System (EWS) core. It involves three different modes of data operation within workflows during the execution of the system stages. The first block, the Monitoring WF, represents the activity that helps to detect upcoming hazards. It collects data from external sources such as sensors, web services, or remote directories; then it performs preprocessing, such as filtering or recovery, and finally saves the result to the distributed storage. The data is saved according to Big Data principles. In parallel, in order to check for hazard occurrence, the Monitoring WF launches the computational workflow. The first part of the Computational WF blocks is required for discovering potential hazards. Using the provided aggregated data, the Computational WF runs the SWAN and BSM models in order to obtain a water level forecast, which is used to detect flood hazards. If no flood is detected, the Computational WF ceases execution. Otherwise, two functional branches are executed. The first branch performs uncertainty analysis of the provided atmospheric data forecast and of its impact on the water level prediction model. The main part of this branch is a fork-join structure that implements the Monte Carlo method. The results of the estimated uncertainty are used in the Plan maker step.
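The fork-join Monte Carlo structure of this branch can be sketched as follows; the forecast() function is a toy placeholder, not the actual SWAN/BSM model chain:

```python
# Toy fork-join Monte Carlo sketch of the uncertainty-analysis branch:
# fork N runs of a forecast model with perturbed atmospheric input,
# join by aggregating the ensemble spread. forecast() is a placeholder
# stand-in, not the real SWAN/BSM physics.
import random
import statistics

def forecast(wind_speed):
    """Placeholder water-level model (arbitrary illustrative formula)."""
    return 0.05 * wind_speed ** 1.5

def monte_carlo_uncertainty(wind_speed, sigma, runs=1000, seed=42):
    rng = random.Random(seed)
    # fork: independent runs with perturbed atmospheric input
    levels = [forecast(wind_speed + rng.gauss(0.0, sigma)) for _ in range(runs)]
    # join: aggregate the ensemble into an uncertainty estimate
    return statistics.mean(levels), statistics.stdev(levels)

if __name__ == "__main__":
    mean_level, spread = monte_carlo_uncertainty(20.0, sigma=2.0)
    print(round(mean_level, 2), round(spread, 2))
```

In the platform, each forked run would be an independent workflow task scheduled onto available resources, which is why the fork-join shape matters for the multi-criteria scheduler.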

Figure 3. Demo applications: (a, b) the flood prevention application built within a hybrid workflow

The second branch forms a data WF to find out another type of uncertainty within the Big Data nodes. First, a search by atmospheric forecast pattern is performed over the retrospective data. The detected cases are used to calculate the water level forecast and to find more cases produced by other atmospheric forecasts. Finally, the cases are compared with measurements. The uncertainty results are transferred to the Plan maker step. An important feature is the possibility to submit heavy computational tasks such as SWAN and BSM back to the platform in order to obtain results faster and prevent overloading of the data nodes. Apart from that, the user can be provided with a steering option to make decisions during searches with inaccurate matches. In the last part of the WF, the Plan maker step produces several plans that are ordered by a multi-criteria algorithm. Then a group of experts chooses the best one. The CLAVIRE implementation is shown in Fig. 3b. The first screen demonstrates an abstract composition of the described workflow; the second and third screens present the completed workflow and the selected safe plan.

CONCLUSION

The presented work is aimed at the development of a platform that combines the benefits of traditional workflow-based systems with Big Data solutions while keeping the e-science user (a domain specialist) away from the technical details of the two execution paradigms and the multiplicity of various technologies. This goal is reached by combining several knowledge-based technologies that enable high-level definition of a task and its automatic interpretation. The developed platform is capable of seamlessly integrating various resources (computational resources, data storages, services) within a single workflow, which creates a basis for system-level scientific exploration and for the implementation of a wide range of application classes.

ACKNOWLEDGEMENTS

This paper is financially supported by the Ministry of Education and Science of the Russian Federation, agreement #14.578.21.0077 (24.11.2014). This work was also financially supported by the Government of the Russian Federation, Grant 074-U01, and by the project "Big data management for computationally intensive applications" (project #14613).

REFERENCES
[1] Yu J., Buyya R. A taxonomy of workflow management systems for grid computing // Journal of Grid Computing. 2005. Vol. 3, No. 3-4. pp. 171-200.
[2] Foster I. et al. Cloud computing and grid computing 360-degree compared // Grid Computing Environments Workshop (GCE'08). IEEE, 2008. pp. 1-10.
[3] Tansley S. et al. (eds.). The fourth paradigm: data-intensive scientific discovery. Redmond, WA: Microsoft Research, 2009. Vol. 1.
[4] Assunção M. D. et al. Big Data computing and clouds: Trends and future directions // Journal of Parallel and Distributed Computing. 2014.
[5] Manjunatha A. et al. Getting Code Near the Data: A Study of Generating Customized Data Intensive Scientific Workflows with DSL. 2010.
[6] Baranowski M., Belloum A., Bubak M. MapReduce Operations with WS-VLAM WMS // Procedia Computer Science. 2013. Vol. 18. pp. 2599-2602.
[7] Gil Y. et al. Time-bound analytic tasks on large datasets through dynamic configuration of workflows // Proceedings of the 8th Workshop on Workflows in Support of Large-Scale Science. ACM, 2013. pp. 88-97.
[8] Gil Y. et al. Examining the challenges of scientific workflows // IEEE Computer. 2007. Vol. 40, No. 12. pp. 26-34.
[9] Foster I., Kesselman C. Scaling system-level science: Scientific exploration and IT implications // Computer. 2006. No. 11. pp. 31-39.
[10] Boukhanovsky A. V., Kovalchuk S. V., Maryin S. V. Intelligent software platform for complex system computer simulation: conception, architecture and implementation // Izvestiya VUZov. Priborostroenie. 2009. Vol. 10. pp. 5-24.
[11] Knyazkov K. V. et al. CLAVIRE: e-science infrastructure for data-driven computing // Journal of Computational Science. 2012. Vol. 3, No. 6. pp. 504-510.
[12] Kovalchuk S. V. et al. Knowledge-based Expressive Technologies within Cloud Computing Environments // PAoIS. Springer Berlin Heidelberg, 2014. pp. 1-11.
[13] Knyazkov K. V. et al. Interactive workflow-based infrastructure for urgent computing // Procedia Computer Science. 2013. Vol. 18. pp. 2223-2232.
[14] Kovalchuk S. V. et al. Virtual Simulation Objects concept as a framework for system-level simulation // arXiv preprint arXiv:1211.7080. 2012.
[15] Kovalchuk S. V. et al. A Technology for Big Data Analysis Task Description Using Domain-specific Languages // Procedia Computer Science. 2014. Vol. 29. pp. 488-498.