Building Your Big Data Team


A white paper by Mark Kidwell, Principal Big Data Consultant, DesignMind, September 2014.

With all the buzz around Big Data, many companies have decided they need some sort of Big Data initiative in place to stay current with modern data management requirements. Platforms like Hadoop promise scalable data storage and processing at lower cost than traditional data platforms, but the newer technology requires new skill sets to plan, build, and run a Big Data program. Companies not only have to determine the right Big Data technology strategy, they also need to determine the right staffing strategy to be successful.

What comprises a great Big Data team? Luckily, the answer isn't that far from what many companies already have in house managing their data warehouse or data services functions. The core skill sets and experience in the areas of system administration, database administration, database application architecture / engineering, and data warehousing are all transferable. What usually makes someone able to make the leap from, say, Oracle DBA to Hadoop administrator is motivation to learn and curiosity about what's happening under the covers when data is loaded or a query is run.

Here are the key roles your (Hadoop) Big Data team needs to fill to be complete. The first three roles are about building and operating a working Hadoop platform:

1. Data Center and OS Administrator

This person is the one who stands up machines, virtual or otherwise, and presents a fully functional system with working disks, networking, etc. They're usually also responsible for all the typical data center services (LDAP/AD, DNS, etc.) and know how to leverage vendor-specific tools (e.g., Kickstart and Cobbler, or vSphere) to provision large numbers of machines quickly. This role is becoming increasingly sophisticated and/or outsourced with the move to infrastructure-as-a-service approaches to provisioning, with the data center role installing hardware and maintaining the provisioning system, and developers interfacing with that system rather than with humans.

While this role usually isn't (and shouldn't be) part of the core Big Data team, it's mentioned here because it's absolutely critical to building a Big Data platform. Anyone architecting a new platform needs to partner closely with the data center team to specify requirements and validate the build out.

BACKGROUND/PREVIOUS ROLE: This role typically already exists, either in the organization or via a cloud-hosted solution.

TRAINING/RETOOLING: If you still run your own data center operations, definitely look at large scale virtualization and server management technologies that support self-service provisioning, if you haven't already.

2. Hadoop Administrator

This person is responsible for the setup, configuration, and ongoing operation of the software comprising your Hadoop stack. This person wears a lot of hats, and typically has a DevOps background where they're comfortable writing code and/or configuration with tools like Ansible to manage their Hadoop infrastructure. Skills in Linux and MySQL / PostgreSQL administration, security, high availability, disaster recovery, and maintaining software across a cluster of machines are an absolute must. For these reasons, many traditional DBAs aren't a good fit for this type of role even with training (but see below).

BACKGROUND/PREVIOUS ROLE: Linux compute / web cluster admin, possibly a DBA or storage admin.

TRAINING/RETOOLING: Hadoop admin class from Cloudera, Ansible training from ansible.com, plus lots of lab time to play with the tools.

3. HDFS/Hive/Impala/HBase/MapReduce/Spark Admin

Larger Big Data environments require specialization, as it's unusual to find an expert on all the different components: tuning Hive or Impala to perform at scale is very different from tuning HBase, and different again from the skills required for a working high availability implementation of Hive or HttpFS using HAProxy. Securing Hadoop properly can also require dedicated knowledge for each component and vendor, as Cloudera's Sentry (Hive / Impala / Search security) is a different animal than Hortonworks' Knox perimeter security. Some DBAs can successfully make the transition to this area if they understand the function of each component.

BACKGROUND/PREVIOUS ROLE: Linux cluster admin, MySQL / Postgres / Oracle DBA. See Gwen Shapira's excellent blog post on Cloudera's site, the Hadoop FAQ for Oracle DBAs.

TRAINING/RETOOLING: Hadoop admin and HBase classes from Cloudera, plus lots of lab time to play with all the various technologies.
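Much of the day-to-day work in the two admin roles above is scripted rather than clicked through. As a small illustration, here is a minimal Python sketch of an operational check against HDFS over the WebHDFS REST API; the hostname, port (50070 was the default NameNode web port in Hadoop 2.x), and path are assumptions, not prescriptions.

```python
# Minimal sketch: an HDFS health check over the WebHDFS REST API.
# Hostname, port, and path below are hypothetical.
import requests

NAMENODE = "http://namenode-host:50070"  # default NameNode web port in Hadoop 2.x

def hdfs_dir_listing(path):
    """Return the list of file statuses under an HDFS path via WebHDFS."""
    resp = requests.get(f"{NAMENODE}/webhdfs/v1{path}", params={"op": "LISTSTATUS"})
    resp.raise_for_status()
    return resp.json()["FileStatuses"]["FileStatus"]

# A simple operational check an admin might wire into monitoring:
# alert if the landing zone has stopped receiving files.
files = hdfs_dir_listing("/landing/orders")
print(f"{len(files)} files in /landing/orders")
```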

The next four roles are concerned with getting data into Hadoop, doing something with it, and getting it back out:

4. ETL / Data Integration Developer

This role is similar to the one in most data warehouse environments: get data from source systems of record (other databases, flat files, etc.), transform it into something useful, and load it into a target system for querying by apps or users. The differences are the tool chain and interfaces the Hadoop platform provides, and the fact that Hadoop workloads are larger. Hadoop provides basic tools like Sqoop and Hive that any decent ETL developer who can write bash scripts and SQL can use, but understanding that best practice is really ELT (pushing the heavy lifting of transformation into Hadoop) and knowing which file formats to use for optimal Hive and Impala query performance are the types of things that have to be learned.
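As a concrete, hedged illustration of that ELT pattern, the Python (PySpark) sketch below lands raw delimited files as-is, pushes the transformation into the cluster, and publishes a Parquet table that Hive and Impala can query efficiently. The paths, database, and column names are hypothetical.

```python
# Minimal ELT sketch in PySpark: read raw landed files, transform in-cluster,
# and publish a Parquet table for Hive/Impala. All paths and names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("orders-elt")
         .enableHiveSupport()  # allows writing tables to the Hive metastore
         .getOrCreate())

# Extract/Load: read the raw files exactly as they landed in HDFS.
raw = spark.read.csv("hdfs:///landing/orders/", header=True, inferSchema=True)

# Transform inside the cluster (the "ELT" part): cleanse and derive columns.
orders = (raw
          .where(F.col("order_id").isNotNull())
          .withColumn("order_date", F.to_date("order_ts")))

# Publish in Parquet, a columnar format Hive and Impala query efficiently.
orders.write.mode("overwrite").format("parquet").saveAsTable("analytics.orders")
```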

A dangerous anti-pattern is to use generic data integration tools and developers to load your Big Data platform. They usually don't do a good job producing high performance data warehouses on non-Hadoop databases, and they won't do a good job here.

BACKGROUND/PREVIOUS ROLE: Data warehouse ETL developer with experience on high-end platforms like Netezza or Teradata.

TRAINING/RETOOLING: Cloudera Data Analyst class covering Hive, Pig, and Impala, and possibly the Developer class.

5. Data Architect

This role is typically responsible for data modeling, metadata, and data stewardship functions, and is important for keeping your Big Data platform from devolving into a big mess of files and haphazardly managed tables. In a sense, they own the data hosted on the platform; they often have the most experience with the business domain and understand the data's meaning (and how to query it) better than anyone else except possibly the business analyst. The difference between a traditional warehouse data architect and this role is Hadoop's typical use for storing and processing unstructured data that falls outside the traditional role's domain, such as access logs and data from a message bus. For less complex environments, this role is only a part-time job and is typically shared with the lead platform architect or business analyst.

An important caveat for data architects and governance functions in general: assuming your Big Data platform is meant to be used for query and application workloads (vs. simple data archival), it's important to focus on delivering a usable data product quickly, rather than modeling every last data source across the enterprise and exactly how it will look as a finished product in Hadoop or Cassandra. (A sketch of what this lightweight approach can look like follows this role's training notes.)

BACKGROUND/PREVIOUS ROLE: Data warehouse data architect.

TRAINING/RETOOLING: Cloudera Data Analyst class covering Hive, Pig, and Impala.
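To make that stewardship point concrete, here is a hedged sketch of the kind of lightweight convention a data architect might enforce from day one: every published table gets documented columns, a table comment, and sensible partitioning, without waiting for an enterprise-wide model. The database, table, and column names are purely illustrative.

```python
# Lightweight stewardship sketch: publish tables with documented columns,
# a table comment, and partitioning from day one. All names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.page_views (
        user_id BIGINT    COMMENT 'Anonymized visitor id',
        url     STRING    COMMENT 'Page requested',
        view_ts TIMESTAMP COMMENT 'Time of request, UTC'
    )
    COMMENT 'Clickstream page views, loaded hourly from web access logs'
    PARTITIONED BY (view_date DATE COMMENT 'Derived from view_ts')
    STORED AS PARQUET
""")
```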

6. Big Data Engineer / Architect

This developer is the one writing more complex data-driven applications that depend on core Hadoop applications like HBase, Spark, or SolrCloud. These are serious software engineers developing products from their data for use cases like large scale web sites, search, and real-time data processing that need the performance and scalability a Big Data platform provides. They're usually well versed in the Java software development process and the Hadoop toolchain, and are typically driving the requirements around what services the platform provides and how it performs.

For the Big Data architect role, all of the above applies, but this person is also responsible for specifying the entire solution that everyone else on the team is working to implement and run. They also have a greater understanding of, and appreciation for, problems with large data sets and distributed computing, and usually some system architecture skills to guide the build out of the Big Data ecosystem.

BACKGROUND/PREVIOUS ROLE: Database engineers would have been doing the same type of high performance database-backed application development, but may have been using Oracle or MySQL as a backend previously. Architects would have been leading the way in product development that depended on distributed computing and web scale systems.

TRAINING/RETOOLING: Cloudera's Data Analyst, Developer, Building Big Data Applications, and Spark classes.
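As one small, hedged taste of the application work this role does, the Python sketch below uses the happybase client to write and read a row in an HBase table backing a real-time serving layer; the host, table, and column family names are assumptions.

```python
# Sketch of application code against HBase via the happybase client.
# Assumes a 'recommendations' table with column family 'rec' already exists
# and an HBase Thrift gateway at hbase-host; all names are hypothetical.
import happybase

connection = happybase.Connection(host="hbase-host")
table = connection.table("recommendations")

# Writes are keyed by row; columns are addressed as 'family:qualifier'.
table.put(b"user42", {b"rec:top_product": b"sku-1017",
                      b"rec:score": b"0.87"})

# Fast point reads by row key are what make HBase suit real-time serving.
row = table.row(b"user42")
print(row[b"rec:top_product"])

connection.close()
```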

7. Data Scientist

This role crosses boundaries between analyst and engineer, bringing together skills in math and statistics, machine learning, programming, and a wealth of experience and expertise in the business domain. All of these combined allow a data scientist to search for insights in vast quantities of data stored in a data warehouse or Big Data platform, and to perform the basic research that often leads to the next data product for engineers to build. The key difference between traditional data mining experts and modern data scientists is the more extensive knowledge of tools and techniques for dealing with ever growing amounts of data. My colleague Andrew Eichenbaum has written a great article on hiring data scientists.

BACKGROUND/PREVIOUS ROLE: Data mining, statistician, applied mathematics.

TRAINING/RETOOLING: Cloudera's Intro to Data Science and Data Analyst classes.

The final roles are the traditional business-facing roles for data warehouse, BI, and data services teams:

8. Data Analyst

This role is largely what you'd expect: using typical database client tools as the interface, they run queries against data warehouses, produce reports, and publish the results. The skills required to do this are a command of SQL, including the dialects used by Hive and Impala, knowledge of data visualization, and an understanding of how to translate business questions into something a data warehouse can answer. (A minimal sketch of such a query appears after role 9 below.)

BACKGROUND/PREVIOUS ROLE: Data analysts can luckily continue to use a lot of the same analytics tools and techniques they're used to, and benefit from improvements in both performance and functionality.

TRAINING/RETOOLING: Cloudera's Data Analyst class covering Hive, Pig, and Impala.

9. Business Analyst

Like the data analyst, this role isn't too different from what already exists in most companies today. Gathering requirements from end users, describing use cases, and specifying product behavior will never go away, and are even more important when the variety of data grows along with the volume. What's most different is the tools involved, their capabilities, and the general approach. The business analyst needs to be sensitive to the difference in end-user tools with Hadoop, and determine what training users need.

BACKGROUND/PREVIOUS ROLE: Business / system analyst for previous data products, like a data warehouse / BI solution or a traditional database-backed web application.

TRAINING/RETOOLING: Possibly Cloudera's Data Analyst class covering Hive, Pig, and Impala, especially if this role will provide support to end users and analysts.
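To ground the data analyst workflow mentioned under role 8, here is a minimal Python sketch using the impyla client to run a query through Impala and fetch the results; any SQL client or BI tool would serve the same purpose. The host and table names are assumptions (21050 is Impala's usual HiveServer2-protocol port).

```python
# Sketch of an analyst-style query against Impala via the impyla client.
# Host and table names are hypothetical; 21050 is Impala's usual
# HiveServer2-protocol port.
from impala.dbapi import connect

conn = connect(host="impalad-host", port=21050)
cursor = conn.cursor()

# A typical business question in Impala's SQL dialect:
# which ten products sold the most units last month?
cursor.execute("""
    SELECT product_id, SUM(quantity) AS units
    FROM analytics.orders
    WHERE order_date >= '2014-08-01' AND order_date < '2014-09-01'
    GROUP BY product_id
    ORDER BY units DESC
    LIMIT 10
""")

for product_id, units in cursor.fetchall():
    print(product_id, units)

conn.close()
```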

What's the Right Approach?

So with the above roles defined, what's the right approach to building out a Big Data team? A Big Data program needs a core group of at least one each of an architect, an admin, an engineer, and an analyst to be successful, and on a team that small there's still a lot of crossover between roles. If an existing data services or data warehouse team is taking on the Big Data problem, it's important to identify who will take on those roles and train them, or to identify gaps that have to be filled by new hires. The team then needs to scale up as workload (users, apps, and data) is added, also accounting for existing needs. Transforming into a Big Data-capable organization is a challenging prospect, so it's also a good idea to start with smaller POCs and use a lab environment to test the waters.

All of the above roles mention Hadoop and target data warehousing use cases, but many of the same guidelines apply if Apache Cassandra or a similar NoSQL platform is being built out. And of course there are roles needed for any software or information management project: project managers, QA/QC, etc. For those roles, the one-day Cloudera Hadoop Essentials class might be the best intro.

The time invested in training and building out the team will pay off when it's time to leverage your data at scale on the production platform, improving the odds of your Big Data program being a success.

Mark Kidwell specializes in designing, building, and deploying custom end-to-end Big Data solutions and data warehouses.

About DesignMind: DesignMind uses leading edge technologies, including SQL Server, SharePoint, .NET, Tableau, Hadoop, and Platfora, to develop big data solutions. Headquartered in San Francisco, we help businesses leverage their data to gain competitive advantage.

DesignMind, 150 Spear Street, Suite 700, San Francisco, CA 94105. 415-538-8484. designmind.com

Copyright 2014 DesignMind. All rights reserved.