Bringing Big Data to Life: Overcoming The Challenges of Legacy Data in Hadoop
No discussion of big data is complete without addressing mainframe data. According to IBM, about 80 percent of all the transactional data in the world is stored on mainframes. This transactional data is a gold mine of reference data that can be used to make sense of enterprise-wide data and drive your big data analytics. How big of a gold mine? Here's how significant mainframes really are in the age of IoT and streaming data:
- Roughly 80% of the world's data either originates on or is stored on mainframes.
- The IBM z13 system can process up to 2.5 billion transactions per day.
- 71% of Fortune 500 companies have mainframes.
- According to our recent survey of more than 250 IT decision-makers, accessing mainframe data in Hadoop is increasing in importance, with over 70% of respondents stating that integrating mainframe data with Hadoop is valuable.
However, getting data off the mainframe is, well, challenging. That is especially true if you need to get it off the mainframe yet keep the mainframe data format. In this ebook, we'll explore the challenges associated with integrating mainframe data into Hadoop, and how to solve them while still allowing organizations to work with mainframe data in Hadoop or Spark in its native format.
Challenge: Big Data Governance: Bridging the Gap between Mainframe and Apache Hadoop
New data sources are easily captured in modern enterprise data hubs, but businesses also need to reference customer or transaction history data to make sense of these newer sources. Sensor or mobile data streamed through Apache Kafka still needs to be enriched and integrated with transaction history or customer reference data, which are often stored on mainframes and in legacy databases. This is a complex process, fraught with governance and compliance challenges. Some of the most promising data analytics insights and initiatives are taking place in highly regulated industries such as finance, healthcare and insurance. In order to use data such as personal health records or financial transactions for advanced analytics, enterprises must be able to access it in a secure way, maintain and archive a copy in its original mainframe file format, and track where the data has been. Security and lineage become critical for cross-platform data access. To address the data governance and lineage requirements, Hadoop distributors introduced metadata management solutions, such as Cloudera Navigator and Apache Atlas.
Solution: Companies need a utility that allows them to easily access and integrate mainframe data into Hadoop without having to convert the data into a different format for storage or processing. By using Syncsort DMX-h, you can easily get end-to-end data lineage across platforms, accessing and processing mainframe data in Hadoop or Spark, on premises or in the cloud. DMX-h securely accesses mainframe data, even in its original EBCDIC format, and makes it available to be processed on the cluster like any other data source. Better still, it doesn't take specialized mainframe or Hadoop skills to use DMX-h for offloading data from the mainframe to Hadoop securely. It assures data lineage for governance purposes, while delivering the lowest possible levels of latency. You can populate your Hadoop data lake in just a few easy clicks. Data scientists do not need to worry about understanding mainframe data and can focus on business insights. Syncsort DMX-h can make data from hundreds of VSAM and sequential files, or from databases like DB2/z and IMS, available in Hadoop. It can also map complex COBOL copybook metadata to the Hive metastore automatically. Alternatively, the data can be kept in its original mainframe record format, fixed or variable, for archive purposes or simply to leverage the cluster for scalable and cost-effective computing. This data can then be written back to the mainframe without format changes, meeting audit and compliance requirements. In essence, Syncsort DMX-h makes mainframe data distributable for Hadoop and Spark processing. Syncsort DMX-h also secures the entire process with certified Apache Sentry and Apache Ranger integration, native Kerberos and LDAP support, and secure connectivity. This flexibility and these capabilities were driven by the use cases of our joint customers.
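To make the format gap concrete, here is a minimal, illustrative sketch (not Syncsort code, and the copybook is hypothetical) of what working with mainframe data involves: a fixed-length EBCDIC record described by a COBOL copybook can be decoded in Python using the standard cp037 (EBCDIC) codec.

```python
# Illustrative only: decode one fixed-length EBCDIC record whose layout
# comes from a hypothetical COBOL copybook:
#   01 CUSTOMER-REC.
#      05 CUST-ID    PIC X(6).
#      05 CUST-NAME  PIC X(10).
LAYOUT = [("cust_id", 0, 6), ("cust_name", 6, 16)]  # (field, start, end) offsets

def decode_record(raw: bytes) -> dict:
    """Slice a fixed-length record and decode each field from EBCDIC (cp037)."""
    return {name: raw[start:end].decode("cp037").strip()
            for name, start, end in LAYOUT}

# A 16-byte sample record, encoded in EBCDIC for the demonstration
record = "000042ACME CORP ".encode("cp037")
print(decode_record(record))  # {'cust_id': '000042', 'cust_name': 'ACME CORP'}
```

Tools like DMX-h do this mapping at scale from real copybooks; the point of the sketch is simply that EBCDIC bytes are opaque to standard Hadoop tooling until something interprets the record layout.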
Challenge: How to Assure Your Mainframe Data Is Secure in Hadoop
Data security is one of the topmost concerns for businesses and IT departments today. Last year businesses experienced the second-highest number of verified and tracked data breaches since these statistics first began to be tracked in 2005. The Identity Theft Resource Center tracked some 781 breaches in 2015, which does not include an unknown number that were either never detected or never reported. Data security on the mainframe is famously good. That's one of the reasons the mainframe still carries the lion's share of the world's most sensitive transactions, such as credit card payments, and stores so much consumer data. On the other hand, Hadoop is all but essential for getting the kind of business and operational intelligence today's organizations need to survive and remain competitive. In the early days, Hadoop wasn't exactly known for its high level of security, but over time developers have built enterprise-class security features and measures into the system. Now it's as potent for securing your data as it is for processing it and delivering valuable business insight and intelligence. When accessing data on the mainframe, the process needs to be secured from the point of access, through the offloading process, and in the Hadoop cluster as well. Now that Hadoop has security support from the likes of Kerberos and LDAP, plus Hadoop-specific solutions such as Apache Sentry and Apache Ranger, organizations can have total confidence that their data is secure from beginning to end. This helps businesses stay within compliance, as well as providing protection against a legal and PR quagmire.
Solution: With Syncsort, your data can be as safe and secure in Hadoop (and during the ingestion process) as it is on the mainframe. Syncsort's DMX-h takes care of your security and compliance worries with support for FTPS and Connect:Direct data transfers, and also features native support for both Kerberos and LDAP. It integrates seamlessly with popular security systems like Apache Sentry as it handles processing within the Hadoop cluster. Many businesses operate within industries, such as finance, that require data to be kept in its original format. DMX-h makes this possible, and it is the easiest way to access and integrate mainframe data into Hadoop because DMX-h data integration tasks work directly with mainframe data, without having to convert the data into a different format for storage or processing. DMX-h is the ideal solution for heavily regulated industries like banking, insurance, and healthcare, which have struggled in the past to leverage Hadoop and Spark cost-effectively. These industries must deal with massive mainframe data sets while keeping the original EBCDIC format, which Hadoop cannot process natively. DMX-h is the only software that makes this possible.
Challenge: Addressing Hadoop Connectivity Issues with the Mainframe
It's been problematic to integrate mainframe data into Hadoop because Hadoop has no native connectivity to, or processing capabilities for, mainframe data. It can take a frustrating amount of time and effort to load database tables into Hadoop, primarily because developers must build individual loads for each and every table. Access to mainframe data is often limited to short windows in which users have to extract extremely large quantities of data, and attempting to translate and unpack the data in transit takes too much time.
Solution: Syncsort DMX-h solves this issue, allowing organizations to work with mainframe data in Hadoop or Spark in its native format, which is essential for maintaining data lineage and compliance. Since Syncsort is a contributor to both the Apache Sqoop and Apache Spark open source libraries for accessing the mainframe, DMX-h extends these connections to offer additional support for file types, data types, and COBOL copybooks. Additionally, with DMX-h Data Funnel, you can easily ingest hundreds of DB2 tables into Hadoop in one fell swoop: it allows you to extract and migrate entire database schemas in a single invocation. As one customer put it: "Syncsort's utility has been a powerful tool in our Data Lake strategy. We were able to ingest into Hadoop over 800 tables from one source system with one press of the button, all while leveraging our existing DMX-h install. Its configuration-based approach provides great flexibility from source to target." With Syncsort DMX-h, data can be copied from the mainframe to Hadoop very efficiently while keeping the mainframe formatting. After the data is in Hadoop, DMX-h can take advantage of the distributed resources of the cluster to access and integrate the data natively, without staging a translated copy. Alternatively, if you need your mainframe data in an open format like ASCII, Parquet or Avro, DMX-h can translate your data in flight, or on the cluster to avoid a bottleneck on the edge node.
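The in-flight translation step can be pictured with another small, illustrative sketch (again, not DMX-h internals): mainframe numeric fields are commonly stored as COBOL COMP-3 packed decimal, which packs two digits per byte with the sign in the final nibble, and must be unpacked before tools expecting ASCII, Parquet or Avro can use the values.

```python
def unpack_comp3(raw: bytes, scale: int = 0) -> float:
    """Decode a COBOL COMP-3 (packed decimal) field.

    Each byte holds two decimal digits, except the last byte, whose high
    nibble is the final digit and whose low nibble is the sign
    (0xD = negative; 0xC/0xF = positive). `scale` is the implied number
    of decimal places from the PICTURE clause.
    """
    digits = ""
    for byte in raw[:-1]:
        digits += f"{byte >> 4}{byte & 0x0F}"   # two digits per byte
    last = raw[-1]
    digits += str(last >> 4)                    # final digit
    sign = -1 if (last & 0x0F) == 0x0D else 1   # sign nibble
    return sign * int(digits) / (10 ** scale)

# 0x12 0x34 0x5C packs +12345; with an implied scale of 2 that is 123.45
print(unpack_comp3(bytes([0x12, 0x34, 0x5C]), scale=2))  # 123.45
```

This is exactly the kind of per-field interpretation that makes naive byte-level copies of mainframe data unusable in Hadoop, and why translating in flight or on the cluster (rather than on a single edge node) matters at scale.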
Summary
The significance of mainframe data is ever more apparent in our daily lives. Every time you swipe your credit card, you are accessing a mainframe; every time you make a payment with your mobile phone, you are accessing a mainframe; and of course, your social security checks are generated from data stored on mainframes. Leaving these critical data assets outside of big data analytics platforms and excluding them from enterprise data lakes is a missed opportunity. Making these data assets available in the data lake for predictive and advanced analytics opens up new business opportunities and significantly increases business agility. Syncsort's DMX-h software allows you to quickly access mainframe data unchanged and work with it like any other data source, without the need for specialized skills in either Hadoop or the mainframe. By ingesting or loading the data via DMX-h, you can preserve data lineage for governance purposes while eliminating much of the latency often associated with these tasks. It takes just a few simple clicks.
About Syncsort
Syncsort is a provider of enterprise software and the global leader in Big Iron to Big Data solutions. As organizations worldwide invest in analytical platforms to power new insights, Syncsort's innovative and high-performance software harnesses valuable data assets while dramatically reducing the cost of mainframe and legacy systems. Thousands of customers in more than 85 countries, including 87 of the Fortune 100, have trusted Syncsort to move and transform mission-critical data and workloads for nearly 50 years. Now these enterprises look to Syncsort to unleash the power of their most valuable data for advanced analytics. Whether on premises or in the cloud, Syncsort's solutions allow customers to chart a path from Big Iron to Big Data. Experience Syncsort at www.syncsort.com and syncsort.com/liberate.
© 2017 Syncsort Incorporated. All rights reserved. All other company and product names used herein may be the trademarks of their respective companies. DMXH-EB-011817US