Flink meet DC/OS. Deploying Apache Flink at Scale. Elizabeth K. Ravi FlinkForward San Francisco

Size: px
Start display at page:

Download "Flink meet DC/OS. Deploying Apache Flink at Scale. Elizabeth K. Ravi FlinkForward San Francisco"

Transcription

1 FlinkForward San Francisco Flink meet DC/OS Deploying Apache Flink at Scale Elizabeth K. Ravi 1

2 Talk Outline Part 1 Part 2 Introduction to Apache Mesos, Marathon, and DC/OS Demonstration of demo data pipeline + Installing Flink on DC/OS Part 3 DC/OS 1.9 key features for data services and beyond 2

3 Apache Mesos: The datacenter kernel 3

4 Marathon Mesos can t run applications on its own. A Mesos framework is a distributed system that has a scheduler. Schedulers like Marathon start and keep your applications running. A bit like a distributed init system. Mesos mechanics are fair and HA Learn more at hon/ 4

5 Introducing DC/OS Solves common problems Resource management Task scheduling Container orchestration Self-healing infrastructure Logging and metrics Network management Universe of pre-configured apps (including Flink, Kafka ) Learn more and contribute at 5

6 DC/OS Architecture Overview Services & Containers HDFS Jenkins Marathon Cassandra Flink Spark Docker Kafka MongoDB +30 more... DC/OS Container Orchestration Security & Governance Monitoring & Operations User Interface & Command Line ANY INFRASTRUCTURE 6

7 Interact with DC/OS (1/2) Web-based GUI t/usage/webinterface/ 7

8 Universe 8

9 Interact with DC/OS (2/2) CLI tool API 9

10 Flink on Apache Mesos and DC/OS According to the December 2016 data Artisans-organized Apache Flink user survey just under 30% of respondents were running Flink on Apache Mesos You may already be using Apache Mesos! Version 1.2 of Flink includes support for Apache Mesos and DC/OS, it is now possible to run an highly available Flink cluster on Mesos & 10

11 DEMOS Demo data pipeline + Installing Flink on DC/OS 11

12 DC/OS Data Services Ecosystem DATA SERVICES ECOSYSTEM Alluxio Couchbase Datastax DSE Elastic (ELK) Redis Apache Flink OPERATIONS Remote Container Shell Unified Metrics Deployment failure analyzer WORKLOADS Pods GPU based scheduling

13 DC/OS Operations DATA SERVICES ECOSYSTEM OPERATIONS Remote Container Shell Unified Metrics Unified Logging Deployment Failure Debugging Upgrades & Configuration updates WORKLOADS Pods GPU based scheduling

14 DC/OS 1.9 DC/OS: OPERATIONS REMOTE CONTAINER SHELL Open encrypted, interactive, remote session to your containers my-laptop$ dcos task exec my-task /bin/bash Starting /bin/bash in my-task... Connecting to remote my-task Remotely execute commands for real time app troubleshooting Provide developers access to their own applications, not the entire host or cluster 14

15 DC/OS 1.9 DC/OS: OPERATIONS UNIFIED LOGGING Access application, DC/OS and OS logs Easily troubleshoot applications with critical metadata such as container id and app id Integrate easily with existing logging systems 15

16 DC/OS 1.9 DC/OS: OPERATIONS UNIFIED METRICS Single API for system, container and application metrics Metadata such as host id and container id are automatically added to assist in debugging Container Integrate easily with existing metrics systems 16

17 DC/OS 1.9 DC/OS: OPERATIONS DEPLOYMENT FAILURE DEBUGGING Understand why your application is not deploying Understand which nodes in the cluster can accommodate the role, constraints, cpu, mem, disk and port requirements for your app 17

18 DC/OS 1.9 DC/OS: OPERATIONS UPGRADES AND CONFIG UPDATES Generate new config for cluster nodes $ dcos_generate_config.sh --generate-node-upgrade-script <installed_cluster_version> Single command upgrade script for individual nodes $ curl -O <Node upgrade script URL> $ sudo bash./dcos_node_upgrade.sh 18

19 DC/OS Workloads DATA SERVICES ECOSYSTEM OPERATIONS Remote Container Shell Unified Logging Unified Metrics Deployment failure analyzer WORKLOADS Pods GPU based scheduling

20 DC/OS 1.9 DC/OS: WORKLOADS PODS Schedule, deploy and scale multiple containers on the same host(s) while sharing IP address and storage volumes All containers in a pod instance run as if they are running on a single host in pre-container world Useful for migrating legacy applications or building advanced micro services (side car containers) 20

21 DC/OS 1.9 DC/OS: WORKLOADS PODS: MIGRATING LEGACY APPS TO CONTAINERS Traditional monolithic apps on VMs usually have support services such as log shipper, message queuing clients Many support services assume col-location on same host, and local-host access to networking and storage Pods simplify moving legacy monolithic apps to containers, reducing risk and accelerating migrations 21

22 DC/OS 1.9 DC/OS: WORKLOADS PODS: SUPPORT SERVICES (SIDE-CAR CONTAINERS) Advanced Micro Services patterns require colocating containers together Support services include for example: Logging or monitoring agents, Backup tooling & Proxies Data change watchers & Event publishers Pods simplify the building and maintenance of complex such microservices 22

23 DC/OS 1.9 DC/OS: WORKLOADS GPU: WHY GPU? GPUs are needed for many machine learning and deep learning applications GPUs are essential for real-time or near real time machine learning models GPUs deliver from 10X to 100X performance for some applications, resulting lower $$$/IOPS and more productivity to data science teams GPU applications include real time fraud detection, genome sequencing, cohort analysis and many others

24 DC/OS 1.9 DC/OS: WORKLOADS GPU BASED SCHEDULING Test Locally with Nvidia-Docker, deploy to production with DC/OS Isolate GPU instances and schedule workloads just like CPU and memory, guaranteeing performance Efficiently Share GPU resources across data science team Simplify migrating machine learning models across from dev to production, and across clouds 24

25 DC/OS 1.9 DC/OS OTHER IMPROVEMENTS Mesos 1.2 Marathon 1.4 Docker 1.12 and 1.13 (17.03-ce) support Centos 7.3 and CoreOS support Performance improvements across all networking features. CNI support for 3rd party CNI plugins. 100s of additional bugfixes and tests 25

26