White Paper

Guide to Modernize Your Enterprise Data Warehouse: How to Migrate to a Hadoop-based Big Data Lake

Motivation for Modernization

It is now a well-documented realization among Fortune 500 companies and high-tech start-ups that Big Data analytics can transform the enterprise, and organizations that lead the way will drive the most value. But where does that value come from, and how is it sustained? Is it just from the data itself? No. The real value of Big Data does not come from the data in its raw form, but from its analysis: the insights derived, the products created, and the services that emerge.

Big Data allows for dramatic shifts in enterprise-level decision making and product/service innovation, but to reap its real rewards, organizations must keep pace at every level, from management approaches to technology and infrastructure. As your business demands more and more from your data, chances are strong that your existing data warehouse is also near capacity. In fact, according to Gartner, 70% of all data warehouses are straining the limits of their capacity and performance levels. If this is true for you, it is time to modernize your data warehouse environment. This paper addresses the need to modernize today's data warehouse environment and outlines best practices and approaches.

Does This Sound Like You?

Enterprise data warehouses were originally created for exploration and analysis, but with the arrival of Big Data, they have frequently become archival data repositories. What's worse, for many organizations getting data into them requires expensive, time-consuming ETL (extraction, transformation, and loading) work.
The standard analytics environment at most enterprise-level companies includes the operational systems that serve as the sources for data; a data warehouse or group of associated data marts that house and sometimes integrate the data for a range of analysis functions; and a set of business intelligence and analytics tools that enable insight discovery and decision making through queries, visualization, dashboards, and data mining.

Most big companies have invested millions of dollars in their analytics ecosystems: hardware platforms, database systems, ETL software, analytics tools, BI dashboard middleware, and storage systems, all with their attendant maintenance contracts and software upgrades. Ideally, these environments have given enterprises the power to understand their customers and, as a result, have also helped them streamline their businesses, optimize their products, and enhance their brands. In the worst case, however, the current data warehouse infrastructure cannot affordably scale to deliver on the full promise and value of Big Data.

Enterprises today run data warehouse modernization programs to combine the best of their legacy data warehouse with the new power of Big Data technology, creating a best-of-both-worlds environment. Our experienced team of experts delivers a repeatable methodology and a customizable range of services, including assessment and planning, implementation, and data quality validation, to support these modernization programs.

Make a Move to Modern Data Architecture

If you need to modernize your data architecture, your foundation will no doubt begin with Hadoop. It is as much a must-have as it is a game-changer from both an IT and a business perspective. Hadoop is a cost-effective, scale-out storage system with parallel computing and analytical capability.
It simplifies the procurement and storage of diverse data sources, whether structured, semi-structured (e.g., sensor feeds, machine data), or unstructured (e.g., web logs, social media, image, video, audio). It has become the framework of choice to accelerate time-to-insight and reduce the overall costs of managing data. Hadoop will play a positive and profound role in your long-term data storage, management, and analysis capabilities, and in realizing the critical value of your data to sustain competitiveness. While the Hadoop ecosystem offers powerful capabilities and virtually unlimited horizontal scalability, it does not provide the complete set of functionality you need for enterprise-level Big Data analysis.
These gaps must otherwise be filled through complex manual coding by large teams of engineers, analysts, and support staff. This slows Hadoop adoption and can frustrate management teams who are eager to derive and deliver results.

Impetus offers a comprehensive, end-to-end Data Warehouse Workload Migration (WM) solution that allows you to identify and safely migrate data, perform ETL processing, and enable large-scale analytics, moving from the enterprise data warehouse (EDW) to a Hadoop-based Big Data warehouse. WM not only seamlessly moves schemas, data, views, and more, but also transforms procedural language scripts and migrates complete Role-Based Access Control (RBAC) definitions and reports. This ensures that you reap the benefits of modern Big Data warehousing while protecting and reusing your investments in existing traditional RDBMSs and other information infrastructure.

Implementing the Data Lake

Adopting Hadoop involves introducing a Data Lake into your analytics ecosystem. The Data Lake can serve as your organization's central data repository. What makes the Data Lake a unique and differentiated repository framework is its ability to unify and connect your data. It lets you access your entire body of data simultaneously, unleashing the true power of Big Data: correlated, collaborative insights and superior analysis. Because the repository can hold data in any form, it supports a wide variety of need-based analyses. While there are many purposes it can serve, such as feeding both your production and sandbox environments, the first step and most immediate opportunity is often off-loading the ETL (extract, transform, and load) routines from the traditional data warehouse. Building a robust Data Lake is a gradual process.
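Whichever workloads are off-loaded first, each migrated table should be verified against its source before the EDW copy is retired. The following Python sketch shows one simple, order-independent way to do that with row counts and hashed fingerprints; the function names and row-tuple representation are illustrative assumptions, not part of the WM tool's API:

```python
import hashlib


def table_fingerprint(rows):
    """Order-independent fingerprint of a table extract.

    XOR-combines a per-row hash so that row order does not affect the
    result; equal fingerprints strongly suggest the offloaded copy
    matches the source.
    """
    fp = 0
    for row in rows:
        digest = hashlib.sha256("|".join(map(str, row)).encode()).digest()
        fp ^= int.from_bytes(digest[:8], "big")
    return fp


def validate_offload(source_rows, target_rows):
    """Row-count check plus fingerprint comparison after migration."""
    return (len(source_rows) == len(target_rows)
            and table_fingerprint(source_rows) == table_fingerprint(target_rows))
```

Because the fingerprint is order-independent, the check still passes when Hadoop returns rows in a different order than the source RDBMS did.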
With the right tools, a clearly planned platform, and a strong, uniform vision that includes innovation around advanced analytics, your organization can architect an integrated, rationalized, and rigorous Data Lake repository. We specialize in modernizing data warehouses and implementing data lakes, and we have experience with every stage of the Big Data transformation curve. We enable you to:

- Work with unstructured data
- Facilitate democratized data access
- Apply machine learning algorithms to enrich data quality
- Contain costs while continuing to do more with the data
- Ensure that you do not end up in a data swamp
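One common data-quality task of the kind mentioned above is de-duplicating near-identical records (e.g., the same customer spelled two ways). As a toy stand-in for an ML-based matcher, the sketch below uses Python's standard-library string similarity; the record shape and threshold are assumptions for illustration:

```python
from difflib import SequenceMatcher


def dedupe_records(records, threshold=0.85):
    """Greedy fuzzy de-duplication on a normalized name field.

    Keeps a record only if its lowercased, stripped name is not
    near-identical (similarity >= threshold) to one already kept.
    A real pipeline would block/compare on several fields.
    """
    kept = []
    for rec in records:
        name = rec["name"].strip().lower()
        if not any(
            SequenceMatcher(None, name, k["name"].strip().lower()).ratio() >= threshold
            for k in kept
        ):
            kept.append(rec)
    return kept
```

For example, "Acme Corp" and "ACME Corp." collapse to a single record, while "Globex" survives as a distinct entry.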
Four Steps to Building a Data Lake

Step 1: Acquire & Transform Data at Scale
This first stage involves putting the architecture together and learning to handle and ingest data at scale. At this stage, the analytics consist of simple transformations; however, it's an important step in discovering how to make Hadoop work for your organization.

Step 2: Focus on Analysis
Now you're ready to focus on enhancing data analysis and interpretation. To fully leverage the Data Lake, you will need to use various tools and frameworks to begin combining and integrating the EDW and the Data Lake.

Step 3: Collaborate
This is where you will start to witness a seamless synergy between the EDW and the Hadoop-based Data Lake. The strengths of each architecture will begin to make themselves visible in your organization as this porous, all-encompassing data pool allows analytics and intelligence to flow freely across your enterprise.

Step 4: Unify
In this last stage, you reach maturity, tying together enterprise capabilities and large-scale unification, from information governance, compliance, security, and auditing to the management of metadata and information lifecycle capabilities.

Workload Migration includes an auto-recommendation engine that supports intelligent migration by suggesting offload-able parameters and metrics. Its recommendations range from clustering, partitioning, and splitting of schema and data to offload-able tables, queries, optimization parameters, and choice of query engine, helping you optimize the schema and form the data lake effectively.

Challenges in Migrating to the Data Lake

Setting up a Hadoop-based Data Lake can be challenging for organizations that do not have experience migrating Big Data.
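To make the idea of recommending offload-able tables concrete, here is a deliberately simplified heuristic in Python: large tables that are rarely queried are the cheapest wins to move out of the EDW. The table metadata fields and thresholds are illustrative assumptions, not how the WM engine actually scores workloads:

```python
def recommend_offload(tables, min_size_gb=100, max_queries_per_day=10):
    """Flag cold, large tables as offload candidates.

    Takes a list of dicts with 'name', 'size_gb', and
    'queries_per_day' keys and returns candidate table names,
    largest first, so the biggest EDW capacity wins come up top.
    """
    candidates = [
        t for t in tables
        if t["size_gb"] >= min_size_gb and t["queries_per_day"] <= max_queries_per_day
    ]
    candidates.sort(key=lambda t: t["size_gb"], reverse=True)
    return [t["name"] for t in candidates]
```

A real recommendation engine would also weigh query complexity, SLA constraints, and data dependencies, but even this two-factor filter shows why automated assessment beats manual inventory spreadsheets.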
Organizations often encounter some of the following challenges:

- Identifying which data sources to offload
- Data validation and quality checks
- Issues with SQL compatibility
- Lack of available user-defined functions in Hadoop libraries
- Lack of procedural support
- Workflows locked in proprietary data integration tools
- The high costs and effort of migration
- Exception handling
- Lack of a unified view and dashboard for offloading data
- Governance controls on the migration system and data
Impetus Workload Migration provides an automated migration toolset consisting of utilities that our team of experts or your in-house staff can use to automate the migration and conversion of data for execution in the Hadoop environment. It also allows you to run data quality functions to standardize, cleanse, and de-duplicate data. If required, you can re-upload the processed data back to the source EDW for reporting purposes. We provide pre-built conversion logic for Teradata, Netezza, Oracle, Microsoft SQL Server, and IBM DB2 source data stores. Additionally, Workload Migration includes a library of advanced data science machine learning algorithms for solving difficult data quality challenges.

The Impetus Data Warehouse Workload Migration Tool

What it does

The Impetus Data Warehouse Workload Migration tool covers migration, validation, and execution. It:

- Ingests data rapidly via our fast, fault-tolerant, parallel data ingestion component.
- Transforms SQL and procedural SQL from RDBMS, MPP, and other databases into compatible HQL and Spark SQL queries using our foundational, intelligent transformation engine.
- Provides a smart user interface that allows you to orchestrate migration pipelines in just a few clicks.
- Integrates with your firm's LDAP to enable single sign-on for your users.
- Delivers rapid response times and dependable performance through our integrated cache.
- Tracks all metadata in source and target data stores.
- Provides strict governance controls, including access, roles, and security, that can be built into the migration process to keep your data safe.
- Caters to a multitude of data sources to bring data in seamlessly and safely after data validation and quality checks.
- Runs checks and balances on data migration using our library of data quality and data validation algorithms, available as operators.
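To illustrate the kind of dialect translation involved in moving SQL from a legacy RDBMS to Hive, here is a minimal, regex-based sketch of a few Teradata-to-HiveQL rewrites. It is only a shape-of-the-problem example, assuming Hive 2.3+ for EXCEPT support; a production transformation engine like the one described above would be parser-based, not pattern-based:

```python
import re

# Illustrative Teradata-to-HiveQL rewrite rules (hypothetical subset).
RULES = [
    (re.compile(r"\bSEL\b", re.I), "SELECT"),    # Teradata SELECT shorthand
    (re.compile(r"\bMINUS\b", re.I), "EXCEPT"),  # set difference keyword
    (re.compile(r"\bBYTEINT\b", re.I), "TINYINT"),  # 1-byte integer type
]


def to_hql(teradata_sql):
    """Apply the rewrite rules and normalize whitespace."""
    out = teradata_sql
    for pattern, replacement in RULES:
        out = pattern.sub(replacement, out)
    return re.sub(r"\s+", " ", out).strip()
```

Real workloads also need handling for QUALIFY clauses, stored procedures, and vendor-specific functions, which is where simple pattern substitution breaks down and a full parser earns its keep.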
- Offloads Teradata, SQL Server, and DB2 views easily.
- Executes migration pipelines, monitors them for various metrics and health checks, and lets the administrator stop or resume any pipeline at any point using our job processing engine.
- Deploys and monitors components in real time using our automated cluster management and monitoring utility.
- Shows comprehensive stage-wise reports for migration, transformation, registration, and execution.
- Assesses workloads automatically for intelligent migration, including recommendations on a number of offloading parameters.
- Provides seamless connectivity from BI tools such as Tableau and QlikView, allowing you to easily run Teradata or Oracle reports while migrating your data.
How it helps you

The Impetus Data Warehouse Workload Migration tool puts migration to a modern warehouse architecture within easy, rapid reach. Our proven tools and methodologies and our experienced team of Big Data experts can help you:

- Accelerate offloading time
- Save 50%-80% of labor costs compared to manual offloading
- Automate assessment and receive expert recommendations for offloading business-critical data
- Minimize data quality risk using our full library of data validation and quality checks, as well as our advanced monitoring and metrics mechanisms
- Optimize performance with advanced partitioning and clustering features
- Accelerate parallel and SQL processing using Hadoop, along with streaming ETL options
- Maximize existing SQL and stored procedure investments and reuse of tools
- Reduce Hadoop migration project risks through proven best practices and automated quality assurance checks for data and logic

Ready to Modernize?

To learn more about our workload migration solution or how we can help you on your data warehouse modernization journey, visit www.impetus.com or write to us at bigdata@impetus.com.

2016 Impetus Technologies, Inc. All rights reserved. Product and company names mentioned herein may be trademarks of their respective companies. Feb 2016

Impetus is focused on creating big business impact through Big Data solutions for Fortune 1000 enterprises across multiple verticals. The company brings together a unique mix of software products, consulting services, Data Science capabilities, and technology expertise. It offers full life-cycle services for Big Data implementations and real-time streaming analytics, including technology strategy, solution architecture, proof of concept, production implementation, and ongoing support. To learn more, visit www.impetus.com or write to us at inquiry@impetus.com.