Introduction to Big Data (Hadoop) Eco-System: The Modern Data Platform for Innovation and Business Transformation


1 Introduction to Big Data (Hadoop) Eco-System: The Modern Data Platform for Innovation and Business Transformation. Roger Ding, Cloudera. February 3rd,

2 Agenda
- Hadoop History
- Introduction to Apache Hadoop Eco-System
- Transition from Legacy Data Platform to Hadoop
- Resources, Q & A

3 Legacy RDBMS Quick Check
- Centralized storage, centralized computing: send data to compute
- Bottlenecks: network bandwidth, slow disk I/O
- Scale-up: add more memory, upgrade CPUs, replace servers every few years
- High cost: high-end processing and storage; hard to plan capacity
- Time to data: up-front modeling, schema-on-write; transforms lose data; no agility

4 Google 1999: Indexing the Web

5 The Original Inspirations for Hadoop

6 The Beginning: Building Hadoop (2006). Core Hadoop: HDFS, MapReduce

7 Agenda
- Hadoop History
- Introduction to Apache Hadoop Eco-System
- Transition from Legacy Data Platform to Hadoop
- Resources, Q & A

8 Hadoop Eco-System Primer
Hadoop consists of 3 core components:
- HDFS (Hadoop Distributed File System): self-healing, distributed storage framework
- MapReduce: distributed computing framework
- YARN (Yet Another Resource Negotiator): distributed resource management framework
Many other projects are built around core Hadoop and are referred to as Hadoop Ecosystem projects: Spark, Pig, Hive, Impala, HBase, Flume, Sqoop, etc. A set of machines running Hadoop software is known as a Hadoop cluster; the individual machines are known as nodes.

9 HDFS: Economically Feasible to Store More Data
Self-healing, high-bandwidth clustered storage, affordable and attainable at roughly $300-$1,000 per TB. HDFS breaks incoming files into blocks and stores them redundantly across the cluster.
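The block-splitting and redundancy idea can be sketched in a few lines. This is an illustrative model, not the real HDFS implementation: the block size and replication factor below are HDFS defaults, and the round-robin placement is a made-up simplification (real HDFS placement is rack-aware).

```python
BLOCK_SIZE = 128 * 1024 * 1024   # HDFS default block size: 128 MB
REPLICATION = 3                  # HDFS default replication factor

def plan_blocks(file_size, nodes):
    """Split a file into blocks and pick REPLICATION distinct nodes per block."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    placement = []
    for b in range(num_blocks):
        # Naive round-robin placement; real HDFS is rack-aware.
        replicas = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
        placement.append((b, replicas))
    return placement

# A 300 MB file on a 4-node cluster splits into 3 blocks, each on 3 nodes,
# so any single node can fail without losing data.
plan = plan_blocks(300 * 1024 * 1024, ["node1", "node2", "node3", "node4"])
for block_id, replicas in plan:
    print(block_id, replicas)
```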

10 MapReduce: Power to Predictably Process Large Data
A distributed computing framework: MapReduce processes large jobs in parallel across many nodes and combines the results.
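The model can be illustrated with the classic word-count example. This is a minimal single-process sketch of the map, shuffle, and reduce phases; real Hadoop distributes the map and reduce tasks across the cluster's nodes.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in a line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: combine the values for each key.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big platform", "data platform"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'platform': 2}
```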

11 A Decade of Hadoop: A Platform That Won't Stop Growing
The stack grew in layers, each stage adding new projects on top of everything before it:
- Core Hadoop (HDFS, MapReduce)
- + Solr, Pig
- + HBase, ZooKeeper
- + Hive, Mahout
- + Sqoop, Avro
- + Flume, Bigtop, Oozie, HCatalog, Hue
- + YARN, Spark, Tez, Impala, Kafka, Drill
- + Parquet, Sentry
- + Knox, Flink
- + Kudu, RecordService, Ibis, Falcon

12 Some Hadoop Eco-System Projects
- Data storage: HDFS, HBase, Kudu
- Computing frameworks: MapReduce, Spark, Flink
- Data ingestion: Sqoop, Flume, Kafka
- Data serialization in HDFS: Avro, Parquet
- Analytics: Pig, Hive, Impala
- Orchestration: ZooKeeper
- Workflow and coordination: Oozie
- Security (authorization): Sentry
- Search: Solr

13 Hadoop Eco-System: Storage Engines
- HDFS (2006): large files, block storage
- HBase (2008): key-value store
- Kudu (2016): storage for structured data

14 Hadoop Eco-System: Computing Frameworks
Spark (2012)
- Originated at UC Berkeley AMPLab
- In-memory computing framework: processes data in memory rather than through MapReduce's two-stage, disk-based paradigm
- Can perform 10 to 100 times faster than MapReduce for certain applications
- Flexible APIs (Scala, Java, Python) vs. MapReduce (Java)
- Includes 4 components on top of core Spark: Spark Streaming, GraphX, MLlib, Spark SQL
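The key ideas behind Spark's speed, lazy chained transformations plus in-memory caching of intermediate results, can be sketched in plain Python. This is NOT the PySpark API; `MiniRDD` is a made-up toy class that mimics the pattern: nothing runs until an action (`collect`) is called, and `cache()` keeps a computed result in memory so later actions skip recomputation, unlike MapReduce, which writes every intermediate result to disk.

```python
class MiniRDD:
    """Toy stand-in for a Spark RDD: lazy transformations, optional caching."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []    # pending lazy transformations
        self._cached = None

    def map(self, f):
        return MiniRDD(self._data, self._ops + [("map", f)])

    def filter(self, f):
        return MiniRDD(self._data, self._ops + [("filter", f)])

    def cache(self):
        # Materialize once and keep the result in memory.
        self._cached = self._compute()
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        out = self._data
        for kind, f in self._ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def collect(self):
        # Action: this is the point where computation actually happens.
        return self._compute()

rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).cache()
print(rdd.collect())  # [0, 4, 16, 36, 64]
```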

15 Hadoop Eco-System: Analytics
Hive (2010)
- Originated at Facebook
- Compiles SQL queries into MapReduce or Spark jobs
- The data warehouse tool of the Hadoop Eco-System
- Good for ETL and batch, long-running jobs
Impala (2013)
- Originated at Cloudera
- MPP (Massively Parallel Processing) SQL engine
- Much faster than Hive or Spark SQL and supports high concurrency, but has no fault tolerance
- Good for short-running, BI-style ad-hoc queries; BI tools such as Tableau and MicroStrategy connect to Impala through ODBC/JDBC
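What "compiles SQL queries into MapReduce jobs" means can be sketched conceptually: an aggregation such as `SELECT dept, COUNT(*) FROM emp GROUP BY dept` turns into a map phase that emits the grouping key and a reduce phase that aggregates per key. The table and column names below are invented for the example; this is a conceptual sketch, not Hive's actual query planner.

```python
from collections import Counter

emp = [                       # hypothetical "emp" table rows
    {"name": "ann", "dept": "eng"},
    {"name": "bob", "dept": "eng"},
    {"name": "cat", "dept": "sales"},
]

# Map phase: emit the GROUP BY key for each row.
keys = [row["dept"] for row in emp]

# Reduce phase: COUNT(*) per key.
result = dict(Counter(keys))
print(result)  # {'eng': 2, 'sales': 1}
```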

16 Hadoop Data Processing Pattern
- Distributed storage, distributed computing: send compute to data
- Scale-out: add more nodes
- Cost effective: commodity hardware
- Time to data: no up-front modeling, schema-on-read; 100% fidelity of the original data; data agility
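The schema-on-read idea can be made concrete with a small sketch: raw records are stored untouched (100% fidelity), and a schema is applied only when a query runs, so asking a new question needs no re-ingestion. The field names here are made up for illustration.

```python
import json

# Raw events land as-is; no up-front modeling (schema-on-read).
raw = [
    '{"user": "u1", "action": "click", "ms": 120}',
    '{"user": "u2", "action": "view"}',            # a missing field is fine
]

def read_with_schema(records, fields):
    """Apply a schema at read time; absent fields become None."""
    return [{f: json.loads(r).get(f) for f in fields} for r in records]

# Two different "schemas" projected over the same stored bytes:
print(read_with_schema(raw, ["user", "action"]))
print(read_with_schema(raw, ["user", "ms"]))
```

With schema-on-write, the second record would have been rejected or padded at load time; here the original bytes survive and each consumer decides its own view.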

17 Agenda
- Hadoop History
- Introduction to Apache Hadoop Eco-System
- Transition from Legacy Data Platform to Hadoop
- Resources, Q & A

18 Data Silos
Engineering, Marketing, Sales, HR, Customer Service
Silos slow down your company:
- They limit communication and collaboration
- They decrease the quality and credibility of data

19 Cloudera Enterprise Data Hub: Making Hadoop Fast, Easy, and Secure
A new kind of data platform: one place for unlimited data with unified, multi-framework data access. Cloudera makes it fast for business, easy to manage, and secure without compromise.
- Process, analyze, serve: batch (Spark, Hive, Pig, MapReduce), stream (Spark), SQL (Spark SQL, Impala), search (Solr), other (Kite)
- Unified services: resource management (YARN), security (Sentry, RecordService)
- Store: filesystem (HDFS), relational (Kudu), NoSQL (HBase), other (object store)
- Integrate: structured (Sqoop), streaming (Kafka, Flume)
- Operations: Cloudera Manager, Cloudera Director
- Data management: Cloudera Navigator, Encrypt and KeyTrustee, Optimizer

20 Data Management Chain
Data sources, data ingest, data storage & processing, then serving, analytics & machine learning:
- Data sources: connected things/data sources, structured data sources
- Data ingest: Apache Flume (stream ingestion), Apache Kafka (stream ingestion), Apache Sqoop (ingestion of data from relational sources)
- Data storage & processing: Apache Hadoop (HDFS storage and deep batch processing), Apache HBase (NoSQL data store for real-time applications), Apache Kudu (storage and serving for fast-changing data)
- Serving, analytics & ML: Apache Hive (batch processing, ETL), Apache Spark (batch, stream and iterative processing, ML), Apache Impala (MPP SQL for fast analytics), Cloudera Search (real-time search)
Underpinning it all: security, scalability, and easy management in the Enterprise Data Hub, with deployment flexibility across datacenter and cloud.

21 The best-in-class organizations use Cloudera
- #1 largest payer in the US, which will be covering 123 million lives and paying out $950B to providers
- Cloudera customers are making some of the top 10 cancer drugs expected by 2020
- Over 150 health & life science organizations use enterprise-class Cloudera software
- #1 largest health data company, with 500M+ anonymous patient records
- #1 largest biotech in the world
- #1 commercial hospital chain worldwide
- #1 largest global genomic repository
- #1 most-utilized Patient Centered Medical Home program
- A hospital that was among the first four to receive Stage 7 status from HIMSS (the highest possible distinction in electronic medical records implementation) uses Cloudera to host a variety of data and was awarded a Gold Medal of Honor by the US DHHS
- #1 largest health IT company in the world ($3B+ in revenue), running thousands of nodes of Cloudera

22 The new version of the Broad Institute's industry-standard GATK pipeline is based on Apache Spark; over 20,000 global users may migrate to Spark. Thanks to the contributions of Cloudera engineers, GATK4 now uses Apache Spark both for traditional local multithreading and for parallelization on Spark-capable compute infrastructure and services, such as Google Dataproc. "It has been a privilege collaborating with the Broad Institute over the last two years to ensure that GATK4 can use the power of Apache Spark to make genomics workflows more scalable than previous approaches," said Tom White, principal data scientist at Cloudera.

23 Seattle Children's Research Institute
- 200+ PIs at Seattle Children's Research Institute, across 9 research centers including cancer, brain, birth, and infectious disease
- There was no integrated data platform across the 9 centers; multiple packaged applications were evaluated, all costing multiple millions of dollars
- Selected Cloudera as the platform and created their own web user interface
- Benefit: previously, a single lab at SCRI could evaluate and diagnose one patient per week after receiving the whole exome and clinical record; after implementation, the lab could diagnose 4-5 patients per week

24 Agenda
- Hadoop History
- Introduction to Apache Hadoop Eco-System
- Transition from Legacy Data Platform to Hadoop
- Resources, Q&A

25 Start Your Big Data Journey
Download the Cloudera QuickStart Virtual Machine today. Practice! Practice!! Practice!!!

26 Meetups
- AI + Big Data Healthcare Meetup (Data-Healthcare-Meetup/)
- Washington DC Area Apache Spark Interactive (DC-Area-Spark-Interactive/), 2,700+ members

27 Thank you!