White Paper: SAS and Apache Hadoop For Government
Unlocking Higher Value From Business Analytics to Further the Mission

Inside:
Using SAS and Hadoop Together
Design Considerations for Your SAS and Hadoop Project
Kick-Starting Your SAS and Hadoop Implementation
About This Paper

Enterprises in government are awash in more data than they can make sense of. This has given rise to the current Big Data phenomenon, in which opportunities for turning data into knowledge using analytics call for new solutions. Challenges such as scalability, performance and the ability to handle new and different types of data make it difficult to unlock the value in the data while it is still current. One of the most important architectural trends enterprises should consider today is the integration of new Hadoop-centric Big Data approaches with user-focused business analytics capabilities. The powerful combination of SAS and Hadoop for business analytics can address the many threats, as well as opportunities, government agencies face today.

What Is Business Analytics?

Tom Davenport defines business analytics as "the broad use of data and quantitative analysis for decision making within organizations. It encompasses query and reporting, but aspires to greater levels of mathematical sophistication. It includes analytics, of course, but involves harnessing them to meet defined business objectives." Business analytics empowers people in the organization to make better decisions, improve processes and achieve desired outcomes. It brings together the best of data management, analytic methods and the presentation of results in a closed-loop cycle for continuous learning and improvement. (From: The New World of Business Analytics, March 2010)

SAS business analytics software is focused on delivering actionable value from enterprise data holdings. The long-term, consistent vision and continuous innovation of SAS have kept it the market leader in business analytics. This remains true in the age of Hadoop, where SAS has brought the power of user-focused business analytics to big data.
Apache Hadoop

At its core, Hadoop is an open-source framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Hadoop is a free, Java-based programming framework that essentially accomplishes two tasks: massive data storage and faster processing.

The open-source Apache Hadoop framework has become the foundation of mission-focused data modernization activities throughout industry and government. The platform enjoys widespread adoption and has a large development community focused on continual improvement.

The power of the Hadoop framework is its ability to analyze data at scale. By using distributed computing models in which many processors work over different parts of the data at once, Hadoop enables very fast analysis, and the framework supports diverse types of data. Hadoop delivers this analysis on cost-efficient commodity hardware, which makes the approach particularly economical.

The typical enterprise use of Hadoop today is as a comprehensive platform that stores and retains data in any state, form or volume. This concept is known as an Enterprise Data Hub (EDH). With data storage costs declining to roughly $1.5K per terabyte, Hadoop has made retaining data at this scale economically feasible. The attractiveness of the solution comes from its ability to meet mission needs economically and efficiently, and from an open design that ensures future missions can be supported without forklift upgrades.

Using SAS and Hadoop Together

SAS and Hadoop are a particularly good match for each other. Hadoop's ability to store and manage all data types and execute operations over the data in distributed ways has brought new power to SAS business analytics applications. SAS is deeply committed to research and development of its software, with user experience always top of mind. Its engineers have built a technical connection to Hadoop that abstracts away complexity for users while bringing the full power of all the data in the enterprise to them. Users see easy-to-use business analytics tools that enable more powerful support to missions.
And this is done without the need for technologists to craft queries in programming languages. It really just works. Analytic capabilities once available only to coders are now available through drag-and-drop interfaces in the latest SAS software: creating reports, visualizing trends, identifying anomalies and outliers, and spotting variable correlations. SAS and Hadoop work especially well in situations where advanced analytic techniques are applied to large volumes of data. Some current use cases leveraging this approach include:
- Tracking down ground zero and the root cause of disease outbreaks such as Ebola and measles.
- Identifying likely drug traffickers at border crossings.
- Detecting fraudulent medical claims.
- Identifying money laundering and terrorist financing rings.
- Spotting insider threats by recognizing anomalous patterns of behavior.

The SAS Approach to Hadoop: From, With, and In

To the analyst user of SAS, the business analytic tools work seamlessly and produce results. How this is done is something only the enterprise architects need to track. Architects see a design where SAS can be connected in three key ways:

From: SAS accesses and extracts data from Hadoop to a SAS server for processing, and writes results back as required. SAS can move the right data from any source, including Hadoop, and for some analytical workloads this is the right approach: run a query and move the data to a SAS analytic tool.

With: SAS accesses and processes Hadoop data on SAS servers while keeping the data and computations massively parallel. This more powerful mode is SAS working in conjunction with a Hadoop cluster, where some analytical tasks are performed with SAS and others are farmed out to the cluster. Results are presented in dynamic ways for analysts to iterate on and analyze.

In: SAS processes data directly in the Hadoop cluster. SAS's embedded process agents, combined with the distributed data framework of Hadoop itself, make this most powerful mode possible. This approach presents information to analysts quickly and enables rapid iteration over results that take into account all the data holdings of an organization.

The Benefits of SAS and Hadoop Together

SAS support for big data implementations and Hadoop centers on one goal: helping the analyst know more, faster, so better decisions can be made in a more timely manner.
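To make the practical difference between these modes concrete, the following is a minimal Python sketch. Every name in it is a hypothetical stand-in, not a SAS or Hadoop API; it simply contrasts the "From" pattern (move all the rows, then analyze) with the "In" pattern (push the computation to where the data lives, so only results move).

```python
# Illustrative sketch only: hypothetical functions, not SAS or Hadoop APIs.
# Contrast moving data to the analytics engine ("From") with pushing
# computation down to the data ("In").

records = [{"claim": i, "amount": a} for i, a in enumerate([120, 9500, 80, 20000, 60])]

def run_from(data):
    # "From": extract all rows to the analytics server, then filter locally.
    moved = list(data)  # simulates moving every row over the network
    return [r for r in moved if r["amount"] > 1000], len(moved)

def run_in(data):
    # "In": the filter runs where the data lives; only results move.
    results = [r for r in data if r["amount"] > 1000]
    return results, len(results)

flagged_from, rows_moved_from = run_from(records)
flagged_in, rows_moved_in = run_in(records)
assert flagged_from == flagged_in       # same answer either way
assert rows_moved_in < rows_moved_from  # "In" moves far fewer rows
```

The answer is identical in both modes; what differs is how much data crosses the network, which is why the "In" approach scales to far larger holdings.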
The engineering to achieve this goal has resulted in a SAS and Hadoop architecture that:

- Allows queries in the SAS business analytic tools to run faster than if they were run in Hadoop alone.
- Improves the performance of Hadoop to the point where queries are so fast that analysts can iterate their questions rapidly. Analysts see incredible speed from their SAS business analytic tools.
- Allows analytics on very large data sets in enterprise situations that other vendors cannot handle.
- Combines SAS predictive analytics, forecasting and data visualization capabilities with the power and large-data capabilities of Apache Hadoop, making SAS analytical procedures and applications even more powerful.
- Enables target identification, fraud detection and other data-intensive analysis to run faster, using the same user-friendly business analytic tools analysts already rely on, only on a far larger volume of data.
- Makes direct contributions to operational decisions by using machine learning to process data in new ways.

All the power of Hadoop and more is brought to the analyst via tools designed specifically for them. While SAS allows for coding, Java developers are not needed for queries, and analysts do not need to write MapReduce jobs.
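For context on what the platform abstracts away: a MapReduce job follows a map, shuffle, reduce pattern. The sketch below is a minimal, single-process Python illustration of that pattern, not Hadoop's actual Java API; in a real cluster the map and reduce phases run in parallel across many nodes over blocks of distributed data.

```python
from collections import defaultdict

# Single-process sketch of the map/shuffle/reduce pattern that Hadoop
# runs in parallel across a cluster; not Hadoop's actual API.

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input split.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the framework
    # does between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big analytics", "data data everywhere"]
intermediate = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(intermediate))
assert counts == {"big": 2, "data": 3, "analytics": 1, "everywhere": 1}
```

Writing, tuning and chaining such jobs by hand is developer work; the point of the SAS tools described above is that analysts get the same distributed result without ever seeing this layer.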
The SAS and Hadoop Ecosystem

Figure 1. SAS helps users manage data on Hadoop through an intuitive user interface, so it's easy to perform self-service data preparation tasks with minimal training.

The most critical part of the diagram in Figure 1 is the user interface. SAS business analytic tools are used not only because they are powerful, but also because they are focused on the needs of humans, and that remains true in a combined SAS and Hadoop architecture. Architects will also appreciate the interoperability and functionality the diagram implies.

Design Considerations for Your SAS and Hadoop Project

We interviewed Doug Liming and John McCue, two of SAS's leading big data engineers, seeking insights that can help architects optimize their SAS and Hadoop implementations. The result of these interviews is a succinct list of the top principles for SAS and Hadoop project success. Our recommendations for planners include:
Architect so humans do what humans do best and computers do what computers do best: Organizations are optimized for analysis when they design systems that empower their analysts to do what they do best and leverage IT to do what it does best. Analysts leverage the greatest processor on earth: their brains. They are paid to think and generate knowledge that supports their organization's mission. Humans develop insights and inferences and produce actionable intelligence for decision makers to act upon. Humans are great at utilizing their pattern recognition and sensemaking abilities, but only up to a point; even the most trained and experienced analyst can process only a limited number of objects at any one time, and beyond that threshold human processing power degrades rapidly. Using SAS as the business analytics platform empowers analysts with very capable ways to access and interact with all the data in the enterprise, and does so in a way that leverages the strengths of the human mind.

Understand and focus on current use cases: The mission of your organization is key, and that is what your business analytic tools and your overall data architecture should support. Ensure this through dialog over well-thought-out and well-staffed use cases. This will help planners identify and clarify the most important objectives and design goals for your project. Determining the prioritized data flows for the first use cases will help ensure demonstrable success early in a project's lifecycle.

Ensure the design focuses on outputs: Identify the analytical queries and algorithms required to generate the desired outputs. This enables capturing the advanced analytics requirements and interactive query needs the system must meet.

Plan for future expansion of use cases: First successes will be measured by how well they meet current agency needs.
But the power of a well-engineered SAS and Hadoop solution is that it can support many new use cases and future workloads. The key action in planning for expansion is to listen to the challenges faced by mission owners, and to be prepared to iteratively incorporate new workloads and new data flows into the solution as lessons are learned.

Consider the full design: Consider compute, networking, data storage and the software framework together as the data platform. SAS is the business analytics component, and Hadoop is the data framework. Optimizing them
should include consideration for communications and storage that perform to your expectations.

Ask for design help: Repeatable patterns from other enterprises are available for reference. Engineers from SAS and their partners can help refine a functional reference architecture and turn it into a technical design that will rapidly bring new functionality to the agency mission.

Kick-Starting Your SAS and Hadoop Implementation

Ready to move out? Here are four steps to consider as you do:

1) Evaluate your enterprise in light of the criteria recommended above, and use that evaluation to build your plan.
2) Enlist the aid of your analyst community to prioritize the analytical capabilities to deliver.
3) After prioritizing the analytical capabilities your mission requires, address the enterprise technology gaps that stand in the way of better mission support.
4) Track improvements to your enterprise like a project: watch cost, schedule and performance, and use those metrics to drive toward goals.

Concluding Thoughts

Government organizations recognize the importance of big data, and they understand the value it can bring when analytics are used to extract insights. As the cost of storage has decreased, Hadoop has become an affordable means to accommodate these voluminous collections of data, as well as to enable a level of analytic capability never possible before. The combination of SAS analytics and SAS data management tools for Hadoop brings analytics to a higher level of scalability and performance and overcomes many of the obstacles preventing government organizations from extracting real, timely value from their data. SAS also reduces the burden on IT by allowing users to be more self-sufficient, offering tools that let users with minimal data skills access and prepare their own data for their own analysis. SAS is flexible in working with any hardware or database vendor and integrates with legacy and new technologies in the government enterprise today, including Hadoop and data warehouse capabilities.

The success of your big data analytics project will depend on the value it brings to the organization. Together, SAS and Hadoop can unlock value that you are not experiencing today. This is the driving reason to consider SAS and Hadoop together for your enterprise data mission needs.

For more information on SAS and Hadoop visit: http://sas.com/hadoopvision
More Reading

For more on federal technology and policy issues visit:

CTOvision.com - A blog for enterprise technologists with a special focus on Big Data.
CTOlabs.com - A reference for research and reporting on all IT issues.
J.mp/ctonews - Sign up for the government technology newsletters, including the Government Big Data Weekly.

About the Authors

Bob Flores is a co-founder and partner at Cognitio. Bob spent 31 years at the Central Intelligence Agency. While at CIA, Bob held various positions in the Directorate of Intelligence, Directorate of Support, and the National Clandestine Service. He was the agency's Chief Technology Officer. Bob serves on numerous government and industry advisory boards.

Bob Gourley is a co-founder of Cognitio and editor-in-chief of CTOvision.com. He is a former federal CTO. His career included service in operational intelligence centers around the globe, where his focus was operational all-source intelligence analysis. He was the first director of intelligence at DoD's Joint Task Force for Computer Network Defense, served as director of technology for a division of Northrop Grumman, and spent three years as the CTO of the Defense Intelligence Agency. Bob serves on numerous government and industry advisory boards.

Roger Hockenberry is a co-founder, partner and CEO at Cognitio. Following a two-decade career in industry, first as a technology consultant and later as a management consultant and Managing Partner at Gartner, Roger spent four years in government service in the intelligence community, where he was charged with driving the realization of the vision he had helped craft as a consultant.

For More Information

If you have questions or would like to discuss this report, please contact me. As an advocate for better IT use in enterprises, I am committed to keeping this dialogue open on technologies, processes and best practices that will keep us all continually improving our capabilities and our ability to support organizational missions.
Contact: Bob Gourley
bob.gourley@cognitiocorp.com
CTOlabs.com