Building a Data Lake on AWS

Contents

Introduction
The Big Data Challenge
Benefits of a Data Lake
Building a Data Lake on AWS
Featured Data Lake Partner: Bronze Drum Consulting
Case Study: Capital Markets Deep Learning Powered by a Serverless Data Lake
Getting Started

Introduction

The world's most valuable resource is no longer oil. It's data. And just as extracting value from crude oil takes a complex, intertwined system of technology, infrastructure, and people, so too does unlocking the value of data. The challenge is that today's demands on data storage, management, and analysis technologies have outstripped the capabilities of legacy systems; they're simply not up to the job.

One of the most promising new approaches to emerge in today's data-driven world, the data lake, promises to simplify data storage and speed up data analysis. Data lakes store information in its native format, no matter where it came from. Structured or unstructured, it doesn't matter. But it's not just about storage. Data lakes enable governance, sharing, scheduling, and workload orchestration, and they provide tools for data processing and analysis. The overarching idea is agility: data lakes allow projects to succeed quickly or fail fast and cheaply.

In this ebook from Bronze Drum Consulting and AWS, you'll learn the benefits of a data lake on AWS, the basics of building a data lake on AWS, and how one sector of the financial services market is taking advantage of a data lake solution developed by Bronze Drum Consulting to extract high-value insights.

The Big Data Challenge

Organizations today manage greater volumes of data, from more sources, and of more types than ever before. In the face of massive, heterogeneous volumes of data, many organizations find that to deliver timely business insights, they need a storage and analytics solution that offers more speed and flexibility than legacy systems.

The common assumption is that legacy refers simply to on-premises hardware and systems. But to fully understand the limitations of old systems, it helps to recognize that legacy can also mean running modern programs, such as MapReduce, Apache's Cassandra NoSQL database, or Apache's Kafka stream-processing platform, on on-premises gear without the on-demand elasticity the cloud brings. Under that scenario, firms cannot run or manage, on premises, the services required to support a data lake.

Big data is one of the key technologies underpinning digital transformation. It's a gateway to business innovation, fail-fast prototyping, business insights, and competitive advantage. It's also an area with a critical skills shortage, a shortage that smart firms are tackling by turning to the cloud.

A data lake built on AWS addresses many of these challenges by allowing an organization to store all of its data in one centralized repository. Since data can be stored in its original form, there is no need to convert it to a predefined schema before it's put to work, giving you a way to store and process all of your data, both structured and unstructured, with minimal lead time. With a data lake on AWS, you no longer need to know what questions you want to ask of your data before you store it, giving you a flexible platform for data analysis.

At the heart of an AWS-based data lake is Amazon Simple Storage Service (Amazon S3), which provides secure, cost-effective, durable, and scalable cloud storage. AWS also offers an extensive set of services to help you provide strong security for your data lake, including access controls, virtual isolation, 256-bit encryption, logging and monitoring, and more.

There are a variety of ways to transfer data to your data lake:

- Amazon Kinesis, which enables you to ingest data in real time (a minimal sketch follows this list)
- AWS Import/Export, a service where you can send a portable storage device with your data to AWS
- AWS Snowball, a secure appliance AWS sends you for ingesting data in batches
- AWS Storage Gateway, which enables you to connect your on-premises software appliances with your AWS cloud-based storage
- AWS Direct Connect, which gives you dedicated network connectivity between your data center and AWS
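To illustrate the real-time path, here is a minimal sketch of writing a record to an Amazon Kinesis stream with the AWS SDK for Python (boto3). The region, stream name, and record fields are hypothetical assumptions, not taken from this ebook:

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def send_event(event: dict) -> None:
    """Put one JSON record onto a Kinesis stream for ingestion into the lake."""
    kinesis.put_record(
        StreamName="market-data-stream",         # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),  # records travel as raw bytes
        PartitionKey=event["symbol"],            # spreads records across shards
    )

# Example: ingest a single market-data event in real time.
send_event({"symbol": "AMZN", "price": 102.5, "ts": "2017-06-01T14:30:00Z"})
```

From the stream, a consumer application (or an Amazon Kinesis Firehose delivery stream) can land the raw records in Amazon S3 without any upfront transformation.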

Benefits of a Data Lake

Here are some of the benefits of hosting your data lake on AWS:

Cost-Effective Data Storage
Amazon S3 provides cost-effective and durable storage, allowing you to store nearly unlimited amounts of data of any type, from any source. Because storing data in Amazon S3 doesn't require upfront transformations, you have the flexibility to apply schemas for data analysis on demand (a sketch of this pattern follows at the end of this section). This enables you to answer new questions as they come up and improve time to value. This agility is key to better business outcomes.

Easy Data Collection and Ingestion
There are a variety of ways to ingest data into your data lake, including Amazon Kinesis, which enables you to ingest data in real time; AWS Snowball, a secure appliance AWS sends you for ingesting data in batches; AWS Storage Gateway, which enables you to connect on-premises software appliances with your AWS cloud-based storage; and AWS Direct Connect, which gives you dedicated network connectivity between your data center and AWS.

Security and Compliance
When hosting your data lake on AWS, you gain access to a highly secure cloud infrastructure and a deep suite of security offerings designed to keep your data safe. As an AWS customer, you benefit from a data center and network architecture built to meet the requirements of the most security-sensitive organizations. AWS also actively manages dozens of compliance programs in its infrastructure, helping organizations meet compliance standards such as PCI DSS, HIPAA, and FedRAMP.

Most Complete Platform for Big Data
AWS gives you fast access to flexible and low-cost IT resources, so you can rapidly scale virtually any big data application, including data warehousing, clickstream analytics, fraud detection, recommendation engines, event-driven ETL, serverless computing, and internet-of-things processing. With AWS, you don't need to make large, upfront investments in time and money to build and maintain infrastructure. Instead, you can provision exactly the right type and size of resources needed to power big data analytics applications.
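To make the schema-on-demand idea concrete, here is a hedged sketch of applying a schema to raw JSON that already sits in Amazon S3 by creating an external table with Amazon Athena. The bucket, table, and column names are illustrative assumptions, not part of the ebook:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical schema applied on read; the raw objects in S3 stay untouched.
DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS trades (
  symbol string,
  price  double,
  ts     string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://example-data-lake/raw/trades/'
"""

athena.start_query_execution(
    QueryString=DDL,
    ResultConfiguration={"OutputLocation": "s3://example-data-lake/athena-results/"},
)
```

Because the table is only metadata, you can define a different schema over the same objects tomorrow to answer a different question.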

Building a Data Lake on AWS

A data lake solution on AWS, at its core, leverages Amazon Simple Storage Service (Amazon S3) for secure, cost-effective, durable, and scalable storage. You can quickly and easily collect data into Amazon S3 from a wide variety of sources by using services like AWS Snowball or Amazon Kinesis Firehose delivery streams. Amazon S3 also offers an extensive set of features to help you provide strong security for your data lake, including access controls and policies, data transfer over SSL, encryption at rest, logging and monitoring, and more.

To manage the data, you can use services such as Amazon DynamoDB and Amazon Elasticsearch Service to catalog and index the data in Amazon S3. Using AWS Lambda functions triggered directly by Amazon S3 in response to events such as new data being uploaded, you can easily keep your catalog up to date; a sketch of this pattern follows the diagram summary below. With Amazon API Gateway, you can create an API that acts as a front door for applications to access data quickly and securely, authorizing access via AWS Identity and Access Management (IAM) and Amazon Cognito.

For analyzing and accessing the data stored in Amazon S3, AWS provides fast access to flexible and low-cost services like Amazon EMR, Amazon Redshift, and Amazon Machine Learning, so you can rapidly scale any analytical solution. Example solutions include data warehousing, clickstream analytics, fraud detection, recommendation engines, event-driven ETL, and internet-of-things processing. By leveraging AWS, you can provision exactly the resources and scale you need to power big data applications, meet demand, and improve innovation.

[Reference architecture diagram]
- Data Ingestion: get your data into S3 quickly and securely (Kinesis, Direct Connect, Snowball, Database Migration Service)
- Central Storage: secure, cost-effective storage in Amazon S3
- Catalog & Search: access and search metadata (Glue, Athena)
- Access & User Interface: give your users easy and secure access (API Gateway, Identity & Access Management, Cognito)
- Protect and Secure: use entitlements to ensure data is secure and users' identities are verified (Identity & Access Management, Security Token Service, CloudWatch, CloudTrail, Key Management Service)
- Processing & Analytics: use predictive and prescriptive analytics to gain better understanding (Machine Learning, QuickSight, EMR, Redshift)
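The catalog pattern described above can be sketched as an AWS Lambda handler subscribed to S3 ObjectCreated events that records each new object's metadata in DynamoDB. The table name and item attributes are assumptions for illustration only:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
catalog = dynamodb.Table("data-lake-catalog")  # hypothetical catalog table

def handler(event, context):
    """Triggered by S3 ObjectCreated events; indexes each new object."""
    for record in event["Records"]:
        s3_info = record["s3"]
        catalog.put_item(Item={
            "object_key": s3_info["object"]["key"],  # assumed partition key
            "bucket": s3_info["bucket"]["name"],
            "size_bytes": s3_info["object"]["size"],
            "event_time": record["eventTime"],
        })
```

Because the function runs on every upload, the catalog stays current without a scheduled crawl; searching it is then a DynamoDB query, or an Elasticsearch search if the items are also indexed there.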

Bronze Drum Consulting

Bronze Drum works with financial services companies, hedge funds, and insurance firms to transform real-time data into business insights. With an exclusive focus on AWS, Bronze Drum can simplify risk pricing, deep learning, and data acquisition from data providers, helping to unlock new capabilities and competitive advantage for customers. The company's deep AWS platform expertise translates into well-architected solutions optimized for security, cost, availability, and rapid time to market.

Bronze Drum Consulting focuses on designing the critical combination of capabilities firms need to succeed with big data: the DevOps workflow for applications, scale-out clusters for deep learning, and the data lakes and continuous analysis that power some of the world's most innovative firms. Bronze Drum gives customers the option to access insights as needed by developing API solutions that enable simple, secure access to critical insights.

Case Study: Capital Markets Deep Learning Powered by a Serverless Data Lake

Traders are always looking for an edge. Increasingly, that edge is found in data. But finding, accessing, and analyzing the right data for the job has been a challenge. Bronze Drum Consulting designed this capital markets solution to efficiently store and analyze terabytes to petabytes to exabytes of traditional capital markets data, as well as alternative data. The system had to be cost effective from day one.

Critical Success Factors: Security, Elasticity, Cost Efficiency, Performance
The team needed to prove the value of the models early, and then be able to scale them out cost effectively. The data lake solution needed to deliver insights by querying very large data sets, yet be cost efficient as a discovery, research, and modeling solution. Because security was a priority, Bronze Drum selected advanced technologies that minimize exposure to user and configuration error: the fewer people who need to configure and manage a solution, the simpler it is to manage its attack surface and configuration. The solution also had to be easy to use. In this case, two data scientists with backgrounds as researchers in artificial intelligence needed to focus on analysis, not on running and wrangling large-scale clusters. And, of course, performance was key.

Be Ambitious
Bronze Drum set the ambitious goal of designing a solution that could operate with very little user intervention, yet at the same time be highly personalized to the needs of end users. "Combine ease of use and personalization with the need to scale out to thousands of nodes and you've just described the kind of challenge Bronze Drum's team loves to tackle," says Brian McCallion, founder of Bronze Drum Consulting.

Bronze Drum designed a continuous delivery solution that enabled the data scientists to check in code and deploy it seamlessly on immutable containers running Torch 7, custom code, and data sets. The solution automatically deploys custom code written by the data scientists, creates a container, then runs that container on thousands of nodes in parallel. To achieve this, Bronze Drum created events that enable steps in a process to trigger subsequent events and to enrich the data passed to the next event. In this way, automation, personalization, and scale could be achieved, while the data scientists could focus on coding, training data, and analyzing the results of that training.

Bronze Drum selected CodeCommit, CodeBuild, CodePipeline, and AWS Batch for this high-performance compute solution. By leveraging AWS Spot pricing, the company saves on average 40% to 90% relative to On-Demand pricing, enabling the data scientists to analyze more data when prices are best.

The data lake powers the data sets analyzed by the deep learning process and is queried by both portfolio managers and data scientists. By leveraging Amazon S3 storage, Amazon Athena, AWS Lambda, Kinesis, and Glue, the solution is able to acquire and continuously process Bloomberg data as well as nontraditional data sets. Both business users and data scientists can query the massive historical data warehouse on demand, performing sophisticated queries on a cluster that scales seamlessly. A further benefit is that storing large historical data sets is cost effective while keeping them fully queryable: billions of records can be queried when and as needed at just $5 per TB of data scanned, and many queries cost just pennies. No longer constrained by infrastructure, business outcomes are driven by the ability to analyze and gain insights across vast amounts of structured and unstructured data (a sketch of this pay-per-query pattern follows the quote below).

"A traditional data warehouse didn't fit how we worked at all. We needed a data lake, as we retain very large data sets that are continuously processed. Bronze Drum's serverless data warehouse on AWS enables us to perform ad hoc analytic queries at any scale, often for pennies. Our data scientists need to focus on training our deep learning engine, not wrangling with data from sources like Bloomberg. Working with a continuous delivery pipeline, our data scientists can check in code, where it is seamlessly deployed to process the latest data sets," says the Managing Director of a large, U.S.-based hedge fund.
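As a rough sketch of the ad hoc, pay-per-query pattern the quote describes, the following runs an Athena query over a table in the lake and polls for the result. The database, table, and output location are hypothetical, not details of the Bronze Drum solution:

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start an ad hoc aggregation over a (hypothetical) historical trades table.
query_id = athena.start_query_execution(
    QueryString="SELECT symbol, avg(price) AS avg_price FROM trades GROUP BY symbol",
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://example-data-lake/athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes. Cost is billed per TB scanned, which is why
# partitioned, compressed data keeps most queries at pennies.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows[:5])  # header row plus the first few result rows
```

There is no cluster to size or manage here; the query runs against data in place in Amazon S3, and you pay only for the data each query scans.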

Getting Started

For more information about data lakes on AWS, visit:

> Amazon Web Services website
> Big Data on AWS
> Building a Data Lake on AWS (video)

About AWS
For 10 years, Amazon Web Services has been the world's most comprehensive and broadly adopted cloud platform. AWS offers over 70 fully featured services for compute, storage, databases, analytics, mobile, Internet of Things (IoT), and enterprise applications from 33 Availability Zones (AZs) across 13 geographic regions in the U.S., Australia, Brazil, China, Germany, Ireland, Japan, Korea, and Singapore. AWS services are trusted by more than a million active customers around the world, including the fastest-growing startups, largest enterprises, and leading government agencies, to power their infrastructure, make them more agile, and lower costs. To learn more about AWS, visit aws.amazon.com.

© 2017, Amazon Web Services, Inc. or its affiliates. All rights reserved. © Bronze Drum Consulting, Inc. All rights reserved.