Hortonworks, Inc. (HDP)


Americas/United States
Equity Research
Software

Rating: OUTPERFORM* [V]
Price (05 Jan 15, US$): 26.14
Target price (US$): 35.00¹
52-week price range
Market cap. (US$ m)
*Stock ratings are relative to the coverage universe in each analyst's or each team's respective sector.
¹Target price is for 12 months. [V] = Stock considered volatile (see Disclosure Appendix).

Share price performance: HDP vs. the indexed S&P 500, Dec 12, 2014 to Jan 05, 2015 (12/12/14 = US$26.38).

Research Analysts
Philip Winslow, CFA, philip.winslow@credit-suisse.com
Sitikantha Panigrahi, sitikantha.panigrahi@credit-suisse.com
Michael Baresich, michael.baresich@credit-suisse.com
Joanna Kamien, joanna.kamien@credit-suisse.com

Hortonworks, Inc. (HDP)
INITIATION

Horton Hears An Outperform!

Initiating Coverage with Outperform: We believe that Hortonworks' 100% open-source business model and unique competitive strategy, combined with the massive market opportunity and the early-stage adoption of Hadoop and Big Data technologies, will produce significant, sustained revenue growth. As such, we are initiating coverage with an Outperform rating and a target price of $35.

The Big Deal About Big Data: The digital data universe is forecast to expand from 4.4 zettabytes in 2013 to 44 zettabytes by 2020. Of all data growth, 85% is coming from new types of data, including social data, clickstreams, server logs, sensors, and the Internet of Things (IoT), but existing relational databases are not well suited to harness these types of unstructured data, leaving Hadoop as the preferred platform to manage Big Data. 1,2 We estimate the market opportunity for paid Hadoop subscriptions to be approximately $3.5 billion if the same amount of analytical data currently stored in relational databases were to be stored in Hadoop clusters. (See Exhibit 5.)
We view this market sizing as conservative owing to its assumptions: inclusion of only relational data, relative data stagnation, and a paid attach rate similar to Linux's.

Unique Competitive Position and Strategy: Although other distributions exist, the primary competitors in the pure-play Hadoop distribution market are Hortonworks, Cloudera, and MapR. In our view, a key difference between these three competitors is "open-source purity." 3 Whereas Hortonworks is a 100% open-source distribution, MapR is approximately 80% open-source, and Cloudera is approximately 85% open-source. 4 We believe that Hortonworks maintains a unique competitive position in the commercial open-source Hadoop market, based on the following strategic pillars: (1) leading the innovations of the open-source core of Hadoop (i.e., largest committer in terms of lines of code to Apache Hadoop and related projects); (2) evolving Hadoop into an enterprise-grade, mission-critical system (i.e., YARN); and (3) enabling the Hadoop ecosystem (i.e., YARN Ready).

Financial and valuation metrics: Price/sales (x) 15.1; BV/share (Next Qtr., US$) -2.5; P/BVPS (x) -7.6. Source: Company data, Credit Suisse estimates.

DISCLOSURE APPENDIX AT THE BACK OF THIS REPORT CONTAINS IMPORTANT DISCLOSURES, ANALYST CERTIFICATIONS, AND THE STATUS OF NON-US ANALYSTS. US Disclosure: Credit Suisse does and seeks to do business with companies covered in its research reports. As a result, investors should be aware that the Firm may have a conflict of interest that could affect the objectivity of this report. Investors should consider this report as only a single factor in making their investment decision.
CREDIT SUISSE SECURITIES RESEARCH & ANALYTICS BEYOND INFORMATION Client-Driven Solutions, Insights, and Access

Hortonworks, Inc. (HDP). Price (05 Jan 15): US$26.14, Rating: OUTPERFORM [V], Target Price: US$35.00

Income statement (US$ m)     12/13A   12/14E   12/15E   12/16E
EBITDA                         (54)    (108)    (146)    (152)
Depr. & amort.                  (1)      (1)      (5)     (10)
EBIT                           (55)    (109)    (151)    (162)
PBT                            (55)    (108)    (150)    (161)
Income taxes                  (0.05)   (0.10)   (0.17)   (0.18)
Net profit (CS adj.)           (55)    (108)    (151)    (162)
Other NPAT adjustments           5        5       (7)     (10)
Reported net income            (50)    (103)    (157)    (172)

Cash flow (US$ m)            12/13A   12/14E   12/15E   12/16E
EBIT                           (55)    (109)    (151)    (162)
Cash flow from operations      (46)     (77)     (77)     (56)
Free cash flow to the firm     (46)     (77)     (77)     (56)
Cash flow from investments       2      (85)      65      (17)
Net debt at end                (39)    (196)    (118)     (48)

Per share data               12/13A   12/14E   12/15E   12/16E
CS adj. EPS (US$)             (4.54)   (4.06)   (3.33)   (3.26)
Key ratios and valuation       12/13A   12/14E   12/15E   12/16E
EBITDA margin (%)             (223.4)  (232.2)  (190.1)  (124.9)
EBIT margin (%)               (227.5)  (234.5)  (197.2)  (133.2)
Pretax margin (%)             (226.9)  (233.5)  (196.3)  (132.9)
Net margin (%)                (227.1)  (233.8)  (196.6)  (133.0)
EV/EBITDA (x)                  (25.2)   (12.6)    (9.3)    (8.9)
EV/EBIT (x)                    (26.1)   (11.7)    (8.9)    (8.8)
P/E (x)                         (5.8)    (6.4)    (7.8)    (8.0)
P/B (x)                         (3.5)    (7.7)    (5.0)    (3.3)
Net debt/equity (%)           (312.7)  (121.3)  (802.3)    33.3
FCF per share (US$)            (3.83)   (2.90)   (1.71)   (1.13)

Quarterly EPS (US$)            12/13A   12/14E   12/15E   12/16E
Q1                              (0.89)   (0.84)   (0.94)   (0.88)
Q2                              (0.98)   (0.98)   (0.86)   (0.84)
Q3                              (1.00)   (1.14)   (0.81)   (0.79)
Q4                              (0.94)   (1.04)   (0.73)   (0.75)

Source: Company data, Credit Suisse estimates

Share price performance: HDP vs. the indexed S&P 500, Dec 12, 2014 to Jan 05, 2015 (12/12/14 = US$26.38).

Table of Contents
The Download 4
Key Charts 5
Investment Analysis 8
Investment Positives 8
Big Data Is a Big Deal! 8
Unique Competitive Position and Strategy 11
Large Addressable Market 20
Framing Risk/Reward 25
Potential for Long-Term Margin Expansion 27
Risks 27
Limited Track Record in a Young Ecosystem 27
Potential Difficulties in Monetizing Open-Source Software 28
High Cash Burn 31
Lack of VSOE 31
Valuation Ain't Cheap 32
Market Overview 35
The Evolution of Data Management Technologies 35
Relational Database 35
SQL and MapReduce in an MPP Data Warehouse 37
NoSQL 38
What is Hadoop? 39
Hadoop 1.0
Hadoop 2.0
The Ecosystem of Hadoop Projects 52
Competitive Review 57
Pure-Play Hadoop Distributions 57
Cloudera 58
MapR 62
Other Technology Vendors 63
Pivotal 64
Oracle 64
IBM 65
Microsoft 66
Teradata 68
Company Overview 72
Company Background 72
Corporate History 72
Headcount & Management Team 73
Platform Overview 73
Hortonworks Data Platform (HDP) 73
Subscription/Services Overview 76
Support Subscriptions 76
Professional Services 77
Sales/Distribution Overview 77
YARN Ready Program 78
Accounting Overview 78
Revenue Recognition 78
Estimates 80
Sources & References 84

The Download

What's the Call? We are initiating coverage on Hortonworks with an Outperform rating and a $35 target price.

What's Consensus Missing? Using Red Hat as an analog with respect to the Linux market, we can surmise that open-source technologies operate in roughly a "winner-takes-most" market. (See Exhibit 17.) Continuing the analogy of the commercial Linux market, a key to monetizing an open-source software product is to be the vendor driving the direction of the platform. This can be measured by lines of code contributed and number of contributors. 1 In a market poised to grow as exponentially as Hadoop, the company contributing the most lines of code to the ecosystem, which is Hortonworks, could potentially be that winner. Given Hortonworks' strong positioning within the Hadoop ecosystem, we believe the company will be able to capture a sizable portion of the addressable market, analogous to Red Hat in the paid Linux market. Hortonworks has a substantial committer presence within the Apache Hadoop project: more than twice the number of individual committers as Cloudera, more than twice the number of lines of code contributed and changed as Cloudera, and more than any other individual company on both measures. (See Exhibit 16 and Exhibit 18 to Exhibit 21, respectively.) This substantial presence in key areas of the Hadoop community enables Hortonworks not only to strongly influence the direction of Hadoop's development but also to ensure that its support offering is backed by the most informed knowledge base possible. 1

What's the Stock Thesis? We believe that Hortonworks maintains a unique competitive position in the commercial open-source Hadoop market, based on three strategic pillars: (1) leading the innovations of the open-source core of Hadoop, (2) evolving Hadoop into an enterprise-grade, mission-critical system, and (3) enabling the Hadoop ecosystem.
We believe that Hortonworks' 100% open-source business model and unique competitive strategy, combined with the massive market opportunity and the early-stage adoption of Hadoop and Big Data technologies, will produce significant, sustained revenue growth. (See Exhibit 3.)

What's the Impact to the Model? For fiscal 2014, we estimate $46.4 million in revenue and EPS of ($4.06).

What's the Next Catalyst/Data Point? Hadoop World will be hosted on February 17-20, 2015, and Hadoop Summit will take place on June 9-11, 2015.

What's Valuation? Hortonworks currently trades at a premium enterprise value to next-12-months (NTM) revenue multiple. However, AMR Research forecasts the Hadoop software market to grow from $400 million in 2013 to $11.2 billion. We estimate the market opportunity for paid Hadoop subscriptions to be approximately $3.5 billion if the same amount of analytical data currently stored in relational databases were to be stored in Hadoop clusters. (See Exhibit 29.) Therefore, although we have forecast Hortonworks' revenue to grow 65.3% in 2015 (a level substantially higher than the software industry average), we view this forecast as conservative and meaningfully biased to the upside.

Key Charts

Exhibit 1: Businesses are Looking for Ways to Unlock the Value of Their Data
Source: IDC, IDG Enterprise, AMR Research, Credit Suisse.

Exhibit 2: Hadoop Bridges that Gap
Source: Company data, Credit Suisse.

Exhibit 3: Hortonworks' Competitive Strategy
Source: Hortonworks.

Exhibit 4: Hortonworks vs. Cloudera vs. MapR
Source: Company data, Credit Suisse.

Exhibit 5: Framework for Sizing Worldwide Hadoop Market If Storing an Amount of Data Equivalent to the Amount of Analytical Data Currently in Relational Database Management Systems (RDBMS)
Terabytes and US$, unless otherwise stated

Data Warehousing/Data Mart    2,143,028  3,177,695  3,516,455  5,159,506   7,673,174
Data Analysis/Data Mining     1,274,467  1,473,423  1,822,824  2,686,425   3,818,056
Total Storage Shipments       3,417,495  4,651,118  5,339,279  7,845,931  11,491,230

Sum of Terabytes Shipped                        32,745,052
TBs per Hadoop Node                                     12
Hadoop Nodes                                     2,728,754
Price of Annual Subscription per Hadoop Node        $2,000
Attach Rate of Paid Hadoop                             65%
Market                                      $3,547,380,589

Source: Company data, IDC, Credit Suisse estimates.
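The arithmetic behind Exhibit 5 can be reproduced directly from the figures quoted in the exhibit. The sketch below treats the five storage-shipment columns as a single series (their year labels did not survive transcription); small rounding differences versus the report's $3,547,380,589 estimate are expected.

```python
# Sketch of the Exhibit 5 Hadoop market-sizing framework.
# Inputs are the terabyte shipment figures quoted in the exhibit.
data_warehousing = [2_143_028, 3_177_695, 3_516_455, 5_159_506, 7_673_174]
data_mining = [1_274_467, 1_473_423, 1_822_824, 2_686_425, 3_818_056]

# Total storage shipments per period, then summed across all periods.
total_shipments = [dw + dm for dw, dm in zip(data_warehousing, data_mining)]
total_tb = sum(total_shipments)              # ~32.7 million terabytes

TB_PER_NODE = 12                             # terabytes stored per Hadoop node
SUBSCRIPTION_PER_NODE = 2_000                # US$ per node per year
ATTACH_RATE = 0.65                           # share of nodes on paid subscriptions

nodes = total_tb / TB_PER_NODE               # implied Hadoop node count
market = nodes * SUBSCRIPTION_PER_NODE * ATTACH_RATE

print(f"Hadoop nodes: {nodes:,.0f}")
print(f"Paid-subscription market: ${market:,.0f}")   # ~$3.55 billion
```

Note that the framework is linear in each assumption, so halving the attach rate or doubling terabytes per node scales the market estimate proportionally.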

Investment Analysis

Investment Positives

Big Data Is a Big Deal!

Even with the continued innovation delivered by the database industry, organizations are struggling with an ever-increasing amount and variety of data that they must handle, sift, and retain and/or dispose of every day. 5 Data is doubling in size every two years, with the digital universe forecast to expand from 4.4 zettabytes in 2013 to 44 zettabytes by 2020. (See Exhibit 6.)

Exhibit 6: Big Data Expansion
Source: Hortonworks.

At its heart, the Big Data revolution is about finding new value outside of conventional data sources. 6 As the amount of data generated by businesses continues to grow at exponential rates, organizations are struggling with how to manage vast and diverse datasets that include not only traditional structured data but also faster-growing unstructured data types. These large, untapped datasets define a category of information known as "Big Data." 9 This data can provide useful operational insights into user behavior, security risks, capacity consumption, peak usage times, fraudulent activity, customer experience, and so on. 5 (See Exhibit 7.) Enterprises are not only inundated with increasing amounts of data but also struggle with more types of data that are less easily managed by traditional datacenter architectures. 1 As such, organizations must manage vast and diverse datasets that include traditional structured data as well as semistructured or unstructured data types, including sensor data, Web pages, Web log files, clickstreams, AVI files, search indexes, text messages, etc. 7 (See Exhibit 8.)

Exhibit 7: Big Data Goals
Source: Hortonworks.

Exhibit 8: Big Data Problems
Source: Hortonworks.

Because of the architectural limits of traditional data management systems, the vast majority of data an organization generates today is either neglected or not utilized, as the data is often nonstandard, time-series, and/or constantly changing. 5 According to The Economist, only 5% of the information that is created is structured, which further compounds the problem of how to derive quality business insights from the remaining 95% of data, which is multi-structured in nature and is increasing at an exponential rate that far outpaces the growth of structured data. These multi-structured data types are fundamentally different from the scalar, structured numbers and text that organizations have been storing in relational data warehouses for the past three decades. 6 Managing different types of data (e.g., search results, sales data, inventory and customer data, click-through data, and so on) is very expensive and technically challenging through a relational database management system (RDBMS). (See Exhibit 9.)

Exhibit 9: Businesses are Looking for Ways to Unlock the Value of Their Data
Source: IDC, IDG Enterprise, AMR Research, Credit Suisse.

The increased diversity of datasets that include traditional structured data as well as semistructured or unstructured data types has sparked the emergence of new approaches to

data management that allow this information to be effectively understood and analyzed. The growing need for large-volume, multi-structured "Big Data" analytics, as well as the emergence of in-memory data architectures that characterize "Fast Data," have positioned the industry at the cusp of the most radical revolution in database architectures in 20 years. We believe that the economics of data (not just the economics of applications) will increasingly drive competitive advantage, resulting in high, sustained growth in the market for Big Data technologies.

In comparison to relational database management systems, Hadoop allows users to store as much data as they want, in whatever form they need, simply by adding more servers to a Hadoop cluster. Each new server (which can be a relatively inexpensive x86 machine) adds more storage and processing power to the overall cluster. This makes data storage with Hadoop far less costly than prior methods of data storage. 8

The vast amounts of Big Data typically contain mostly irrelevant detail, but some "hidden gems" may be useful for further analysis or for enriching other data sources. Despite storing these data outside of traditional databases, some customers do want to integrate them with data stored in the database; the goal of such integration is to extract information that is of value to the business user. 9 In fact, studies show that a sophisticated algorithm applied to a small amount of data can be less accurate than a simple algorithm applied to a large volume of data. 10 For example, social networking alone could bring huge external unstructured datasets into the enterprise, either as actual data or metadata, as well as links from blogs, communities, Facebook, YouTube, Twitter, LinkedIn, and others. Too much information certainly is a storage issue, but too much data in too many forms is also a massive analysis issue.
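The "store in whatever form, structure later" property described above is often called schema-on-read: raw records are kept as they arrive and a schema is imposed only at analysis time. A minimal illustrative sketch (the record layouts and field names here are hypothetical, not from any Hortonworks API):

```python
import json

# Schema-on-read sketch: raw events are stored as-is (mixed JSON
# clickstream records and plain-text server log lines), and structure
# is imposed only when the data is read for analysis.
raw_events = [
    '{"user": "alice", "action": "click", "ts": 1420416000}',  # JSON clickstream
    '2015-01-05T00:00:01 server42 GET /index.html 200',        # web server log
]

def parse(line):
    """Impose structure at read time instead of at load time."""
    try:
        return json.loads(line)                  # structured path
    except ValueError:                           # fall back to log-line layout
        ts, host, method, path, status = line.split()
        return {"ts": ts, "host": host, "method": method,
                "path": path, "status": int(status)}

records = [parse(line) for line in raw_events]
```

The contrast with an RDBMS is that no table definition or ETL step was needed before the log line could be stored; the cost of interpretation is paid at query time instead.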
Since the project's inception in 2006, a key goal of Apache Hadoop has been to enable companies to affordably hold all of their data as opposed to just portions. All data becomes equal and equally available, so business scenarios can be run with raw data at any time as needed, without limitation or assumption. Since Hadoop also lets companies store data as it comes in, structured or unstructured, they do not need to spend money and time configuring data for relational databases and rigid tables. Hadoop offers a cost-effective, scalable platform for capturing and analyzing data that is coming from multiple sources at once. 8 (See Exhibit 10.)

Exhibit 10: Hadoop Bridges that Gap
Source: Company data, Credit Suisse.

Hortonworks offers Hortonworks Data Platform (HDP), an enterprise-ready, 100% open-source, 100% Apache Hadoop-compatible distribution of Hadoop. Hortonworks tests HDP with large-scale, high-stability deployments in mind, key criteria for enterprise adoption of

Hadoop as a mainstream platform. All solutions in HDP are developed as projects through the Apache Software Foundation (ASF), and there are no proprietary extensions in HDP. 1,12

Unique Competitive Position and Strategy

The primary competitors in the pure-play Hadoop distribution market are Hortonworks, Cloudera, and MapR. These three companies focus solely on developing, supporting, and marketing unique Hadoop distributions, add-on innovations, and services. 13 (See Exhibit 11.) Although the company has a competitive solution, MapR has lagged behind Cloudera and Hortonworks in terms of market awareness. 13,14 In fact, one of the biggest talking points at the most recent Hadoop Summit (June 3-5, 2014) and Hadoop World (October 15-17, 2014) was the race between Hortonworks and Cloudera, with the competition heating up as the market matures. 15

Exhibit 11: Hortonworks vs. Cloudera vs. MapR
Source: Company data, Credit Suisse.

Hortonworks' largest differentiator is that it is the only 100% open-source distributor of Apache Hadoop that is truly enterprise-grade. The Hortonworks Data Platform offers linearly scalable storage and compute across batch, interactive, and real-time access methods with no proprietary extensions and is available on-premise, off-premise, or from an appliance, across Windows and Linux. Hortonworks currently employs 153 committers and 104 PMC members across various Hadoop projects, and our customer checks have noted switching to Hortonworks from competitors such as Cloudera to avoid vendor lock-in while taking advantage of an agile, responsive, and open provider. 1 (See Exhibit 12.)

Exhibit 12: Hortonworks vs. Other Pure-Play Hadoop Competitors
Source: Hortonworks, Credit Suisse.

We believe that Hortonworks maintains a unique competitive position in the commercial open-source Hadoop market, based on three strategic pillars: (1) leading the innovations of the open-source core of Hadoop, (2) evolving Hadoop into an enterprise-grade, mission-critical system, and (3) enabling the Hadoop ecosystem. (See Exhibit 13.)

Exhibit 13: Hortonworks' Competitive Strategy
Source: Hortonworks.

#1: Lead the Innovation of the Open-Source Core of Hadoop

Unlike proprietary software, in which the vendor writes software, locks that intellectual property up under a proprietary license, and then sells the right to use the software, open-source software follows a community-based development model in which thousands of developers contribute to the development and testing of the software, which is then freely available to download under GNU-based licenses. However, certain customers are willing to pay for technical support, consulting, and enterprise-grade features, as well as for legal protection. Therefore, the key to monetization lies in being known as the

contributor of the source code. As a result, the source code contributors are recognized as the best source for support, updates, and add-on components for the open-source software. 16 In our view, the key difference between Hortonworks, Cloudera, and MapR is "open-source purity." 3 Whereas Hortonworks is a 100% open-source distribution, MapR is approximately 80% open-source, and Cloudera is approximately 85% open-source. 4 (See Exhibit 14.)

Exhibit 14: Hortonworks Open-Source Process for Enterprise Hadoop
Source: Hortonworks.

For some background, the Apache Software Foundation (ASF) is a nonprofit corporation that provides support for the Apache community of open-source software projects, which offers software products for the public good. The Apache projects are characterized by a collaborative, consensus-based development process and an open and pragmatic software license. Each Apache project consists of a Project Management Committee (PMC), which is responsible for the management and oversight of the project; a vice president, an unpaid volunteer responsible for organizational oversight; committers, who have earned write access to the project; a release manager, a committer responsible for the logistics of major releases; and contributors, who provide bug reports, testing, documentation, and design feedback, among other contributions. 1 Given that Hadoop is an open-source project from the Apache Software Foundation, Hortonworks, as an open-source provider of Hadoop, is well positioned to take advantage of this market growth.
Hortonworks employs the largest number of active Apache Software Foundation committers and Project Management Committee (PMC) members of any company for the enterprise-grade Hadoop projects within the Hortonworks Data Platform, including Apache Hadoop, Apache Hive, Apache Pig, Apache Tez, Apache HBase, Apache Accumulo, Apache Storm, Apache Ambari, Apache Knox, Apache Falcon, Apache Oozie, Apache Sqoop, Apache Flume, and Apache ZooKeeper. The number of active committers and active PMC members employed by Hortonworks and focused on the Apache Hadoop project individually, as well as in total across all of the Apache projects listed above, is more than twice the total of the next-largest employer of such committers. These employees enable Hortonworks to drive innovation, define a roadmap for the future of Hadoop, ensure predictable and reliable enterprise-quality releases, and provide comprehensive, enterprise-class support. 1 (See Exhibit 15.)
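The committer counts behind the "more than twice the next-largest employer" claim appear in Exhibit 16 and can be tallied directly (counts as compiled by Credit Suisse from Apache Software Foundation data; "Others" aggregates smaller employers):

```python
# Committers to Hadoop and related projects by employer (Exhibit 16).
committers = {
    "Hortonworks": 27, "Cloudera": 11, "Yahoo": 10,
    "Facebook": 5, "LinkedIn": 2, "IBM": 2, "Others": 23,
}

total = sum(committers.values())                 # committers tracked in total
hw_share = committers["Hortonworks"] / total     # Hortonworks' share

# Hortonworks employs more than twice as many committers as Cloudera,
# and more than any other single named company.
more_than_twice_cloudera = committers["Hortonworks"] > 2 * committers["Cloudera"]

print(f"{total} committers tracked; Hortonworks share: {hw_share:.0%}")
```

On these figures, Hortonworks alone accounts for roughly a third of all tracked committers, which is the quantitative basis for the influence argument made in the text.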

Exhibit 15: Key Hortonworks Founders: Some of the Original Architects of Hadoop
Source: Hortonworks, Credit Suisse.

Hortonworks has a substantial committer presence, with more than twice the number of individual committers as Cloudera and more than any other individual company. (See Exhibit 16.) This substantial presence in key areas of the Hadoop community enables Hortonworks not only to strongly influence the direction of Hadoop's development but also to ensure that its support offering is backed by the most informed knowledge base possible. Hortonworks also contributes 100% of its developed code back to the Apache Software Foundation, rather than developing proprietary technology, negating the need to support separate proprietary code bases outside of the community. 1,17

Exhibit 16: Committers to Hadoop and Related Projects by Contributing Company
Hortonworks, 27; Cloudera, 11; Yahoo, 10; Facebook, 5; LinkedIn, 2; IBM, 2; Others, 23
Source: Apache Software Foundation, Credit Suisse.

Hortonworks operates on an open-source business model similar to Red Hat's. Hortonworks compiles its own distribution of open-source products, which the company makes available freely, and sells services and support, offering users (1) the latest versions of

Hadoop as well as patches for Hadoop bug fixes; (2) Hortonworks' experience with the Hadoop platform; and (3) a purely open-source Hadoop deployment. Hortonworks' strong base of committers to the various Hadoop-related Apache products ensures that its offering is able to remain purely open-source while still evolving to meet the needs of enterprises, such as when Hortonworks purchased XA Secure and implemented its security technology into the open-source Hadoop release. 12

While having proprietary technology can provide an advantage while the market is nascent, as the market for Hadoop grows, we believe it becomes difficult for companies providing proprietary technology to innovate faster than the ecosystem itself. The company fueling the ecosystem, as Hortonworks does, ultimately has the opportunity to gain a competitive advantage using an open-source business model. Because Hortonworks' solution does not implement a proprietary layer of technology alongside or on top of Hadoop, as Cloudera and MapR do, the company can offer the latest version of Hadoop technology as fast as it is released, rather than having to adapt proprietary technology to the latest releases as they happen. Proprietary Hadoop extensions can be made open-source simply by publishing them to GitHub, but compatibility issues will creep in, and as the extensions diverge from the trunk, so too does reliance on the extension's vendor deepen. In comparison, filling feature gaps and offering an agile Hadoop platform enables Hortonworks to drive adoption of Hadoop by enterprises, driving demand for support services. By leveraging its experience with and closeness to the Hadoop platform's development, we believe Hortonworks can position itself to be the "go-to" provider of Hadoop support. 12

Using Red Hat as an analog with respect to the Linux market, we can surmise that open-source technologies operate in roughly a "winner-takes-most" market.
Within the Linux market, Red Hat has emerged as the dominant leader in this open-source market, holding a significantly higher market share than competitors such as SUSE or Oracle, and is known as the "de facto standard" for enterprise Linux implementations. While there are several distributors of Linux, Red Hat maintained a dominant 72.9% share of the Linux server operating system market in 2013, followed by SUSE with 18.1% and Oracle with 6.7%. Within the Linux operating system market, Red Hat has been steadily gaining share, from 67.7% in 2008 to 72.9% in 2013. (See Exhibit 17.)

Exhibit 17: Worldwide Linux Server Operating System Revenue Market Share by Vendor, 2008-2013
Red Hat: 67.7%, 66.7%, 68.3%, 69.1%, 69.8%, 72.9%
SUSE (Attachmate): 27.4%, 28.5%, 26.3%, 23.6%, 20.5%, 18.1%
Oracle and Other: remaining share (Oracle 6.7% in 2013)
Source: IDC.

Continuing the analogy of the commercial Linux market, the key to monetizing an open-source software product is to be the vendor driving the direction of the platform. This can be measured by lines of code contributed. In Linux, Red Hat is the single largest

corporate contributor, contributing 19% of all changes to the Linux Kernel between versions 3.2 and 3.10, according to The Linux Foundation. (See Exhibit 18.) This is second only to the sum total of all contributions by unaffiliated individuals, at 24% of total changes, and three times as many changes as SUSE, Red Hat's closest competitor in enterprise Linux. Hortonworks occupies a similar role in the Hadoop community: it contributed more than 420,000 lines of code to Hadoop and related projects between 2006 and May 2013, significantly more than any other vendor, including its largest competitor, Cloudera. 1,17 (See Exhibit 19.)

Exhibit 18: Changes to the Linux Kernel (from Versions 3.2 to 3.10) by Contributing Company
No affiliation, 24%; Red Hat, 19%; Intel, 16%; Linaro, 7%; Texas Instruments, 7%; SUSE, 6%; IBM, 6%; Unknown, 6%; Samsung, 5%; Google, 4%
Source: The Linux Foundation.

Exhibit 19: Cumulative Lines of Code Contributed to Apache Hadoop and Related Projects (as of May 2013) by Contributing Company
Hortonworks, 48%; Yahoo!, 19%; Cloudera, 17%; InMobi, 5%; Facebook, 4%; WANdisco, 2%; Apple, 2%; LinkedIn, 2%; Twitter, 1%
Source: Hortonworks, Credit Suisse.

Exhibit 20: Number of Lines of Code Changed within Apache Hadoop by Contributing Company
Source: Hortonworks.

Exhibit 21: Number of Lines of Code Changed within Apache Hadoop by Contributing Company
Source: Hortonworks.

Hortonworks also offers patches, training, and support services for the Hadoop platform, making the Hortonworks Data Platform compelling to use as opposed to downloading the basic Hadoop platform from the Apache Website and piecing together a Hadoop solution in-house. Operating with an open-source business model, Hortonworks can increase the velocity at which the technology is built and how rapidly it can be innovated. Thus,

Hortonworks drives the enterprise roadmap with transparency across the complete platform, and many customers enjoy participating in the open-source model. 1 In addition, certain open-source components for Hadoop are only available at an enterprise-ready level through Hortonworks, which is the only company that supports Apache Falcon, Apache Knox, Apache Tez, Apache Ambari, and Apache Argus (previously XA Secure). 18 If Hadoop can achieve mission-critical status and vast deployment, Hortonworks stands at the cusp of significant growth. While Hortonworks is fueling the growth of the open-source-based Hadoop ecosystem by contributing 100% of its developed code back into Hadoop, its competitors face the challenge of having to develop faster than the Hadoop ecosystem to sustain growth. As evidenced by Red Hat and its successful monetization of Linux, open-source software technologies typically operate in a "winner-takes-most" environment, with one vendor taking the majority of market share. In a market poised to grow as exponentially as Hadoop, the company contributing the most lines of code to the ecosystem, which is Hortonworks, could potentially be that winner.

#2: Evolve Hadoop into an Enterprise-Grade, Mission-Critical System

The long-term success of Hortonworks will partially be determined by its ability to establish Hadoop as an enterprise-grade, mission-critical system. 19 The Hadoop platform has expanded to incorporate a range of Apache projects that are required components of a complete enterprise-grade data platform. Hortonworks intends to leverage its leadership position in the Apache Software Foundation to strengthen the quality and capabilities of enterprise-grade Hadoop by continually enhancing its governance, security, and operations capabilities, rigorously testing new versions of enterprise-grade Hadoop, and strengthening its position as a trusted distributor for enterprise Hadoop deployments.
1 For example, today's enterprises are looking to go beyond batch processing and integrate existing applications with Hadoop to realize the benefits of real-time processing and interactive query capabilities. 12 MapReduce (which was closely tied to the Hadoop 1.0 platform) is well suited to many applications but not all. Other programming models better serve requirements such as graph processing (e.g., Google Pregel/Apache Giraph) and iterative modeling using Message Passing Interface (MPI). Much of the enterprise data is often already available in HDFS, so having multiple paths for processing it is a clear necessity. Furthermore, because MapReduce is essentially batch-oriented, support for real-time and near-real-time processing has become an important requirement for the user base. 20 To address this initial shortcoming of Hadoop 1.0, Hortonworks engineers created the initial architecture and developed the technology for YARN (Yet Another Resource Negotiator) within the Apache Hadoop community, leading to the release of YARN in October 2013. YARN was conceived by Hortonworks founder Arun Murthy, who submitted JIRA MAPREDUCE-279 in January 2008 and has been working on the project ever since. Hortonworks has written 80% of the YARN code. 21 YARN transformed Hadoop (i.e., Hadoop 2.x), recasting Apache Hadoop as a much more powerful system by moving the platform beyond MapReduce into additional frameworks. YARN is designed to allow individual applications to run both MapReduce and non-MapReduce tasks on cluster resources in a shared, secure, and multi-tenant manner. 20 (See Exhibit 22.)

Exhibit 22: YARN Has Fundamentally Changed Hadoop. Source: Hortonworks.

YARN eliminates the need to silo data sets and enables a single cluster to store a wide range of shared data sets on which mixed workloads can simultaneously process with predictable service levels. YARN is designed to serve as a common data operating system that enables the Hadoop ecosystem to natively integrate applications and leverage existing technologies and skills while extending consistent security, governance, and operations across the platform. With these capabilities, YARN can facilitate mainstream Hadoop adoption by enterprises of all types and sizes for production use cases at scale. 1 YARN is essentially what makes Hadoop enterprise-ready and mission-critical, and Hortonworks is the company that drove YARN. While Hortonworks' direct competitors (e.g., Cloudera and MapR) support YARN in a rudimentary way, given that YARN is an open-source project, Hortonworks is the only vendor that supports YARN as its central infrastructure. (See Exhibit 23.)

Exhibit 23: Hortonworks Data Platform (HDP). Source: Hortonworks.

As more organizations move from single-application Hadoop clusters to a versatile, integrated Hadoop 2.0 data platform hosting multiple applications, YARN is strategically positioned as the true integration point of today's enterprise data layer. At the architectural center of Hadoop, YARN provides access to the core elements of the platform. To capitalize on the capabilities of Hadoop 2.0, Hortonworks introduced the YARN Ready Program, which includes tools, guides, sample code, access to technical resources, and a simple mechanism for certification. Tools and applications that are YARN Ready have been certified to deeply integrate with the Hortonworks Data Platform. 12
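The shared, multi-tenant scheduling that YARN provides is typically configured through its CapacityScheduler. As a minimal illustration (the queue names and capacity splits below are hypothetical, not taken from this report), a cluster shared by a batch ETL tenant and an interactive analytics tenant might be carved up in capacity-scheduler.xml as follows:

```xml
<!-- Hypothetical capacity-scheduler.xml fragment: two tenant queues sharing one cluster -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>etl,analytics</value> <!-- split the root queue into two tenants -->
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.etl.capacity</name>
    <value>60</value> <!-- guarantee 60% of cluster resources to batch ETL -->
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.analytics.capacity</name>
    <value>40</value> <!-- guarantee 40% to interactive analytics -->
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.analytics.acl_submit_applications</name>
    <value>analytics-group</value> <!-- restrict who may submit to this queue -->
  </property>
</configuration>
```

Each YARN application, MapReduce or otherwise, is then submitted to a named queue, and the ResourceManager enforces the configured shares, which is what lets mixed workloads run on one cluster with predictable service levels.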

Several early vendors of traditional Hadoop have not made the transition from the bolt-on architecture of traditional Hadoop to fully embracing YARN in their respective offerings. 1 YARN is ultimately Hortonworks' unique offering that allows customers to achieve resource scale. While competitors such as Cloudera and MapR do support YARN, the extent to which their products offer the full scale and resource efficiency that Hortonworks offers is unclear. As customers add more data to Hadoop, specifically customer, corporate, and proprietary data, they will have more mission-critical applications that leverage YARN. Furthermore, as these customers add applications that are certified as YARN Ready and utilize YARN as a resource manager for Hadoop, we believe they are likely to pay for a subscription from the company that developed YARN, which is Hortonworks. Thus, Hortonworks offers the infrastructure, security, and support for adding data and applications to Hadoop, making the data platform mission-critical.

#3: Enable the Hadoop Ecosystem

Hortonworks' strategy of delivering enterprise Hadoop as 100% open-source has resulted in close alignment and tight partnerships with a broad Hadoop ecosystem of partner vendors. 12 In particular, YARN is designed to serve as a common data operating system that enables the Hadoop ecosystem to natively integrate applications and leverage existing technologies and skills while extending consistent security, governance, and operations across the platform. With these capabilities, YARN can facilitate mainstream Hadoop adoption by enterprises of all types and sizes for production use cases at scale. 1 Hortonworks' partner programs are designed to expand, support, and accelerate the growth of a vibrant Apache Hadoop ecosystem by providing technical enablement, joint marketing opportunities, design assistance, technical support, and training.
Partners accepted into the program are eligible for the Certified Partner Program and the YARN Ready Program. 12 (See Exhibit 24 and Exhibit 25.)

Exhibit 24: Certified Partner Program and YARN Ready Program. Source: Hortonworks.

Exhibit 25: Selected YARN Ready Program Partners. Source: Hortonworks.

We believe the combination of Hortonworks' 100% open-source solution and its ecosystem of more than 500 partners provides compelling solutions for enterprises across a wide variety of use cases. 1 For example, major infrastructure vendors like Microsoft, Red Hat, and Rackspace are contributing to the future of Hadoop and integrating their tools to enable the enterprise to take advantage of the promise of Hadoop. Leading data management and analytics vendors such as Teradata and SAS are integrating in a way that creates seamless interoperability with enterprise Hadoop. 12 (See Exhibit 26.)

Exhibit 26: Examples of Partner Ecosystem/Integrations. Source: Hortonworks.

These partners contribute to and augment Hadoop with incremental functionality, and this combination of core and ecosystem provides compelling solutions for enterprises, whatever their use case. The ecosystem uses many different points of integration to tie its products into HDP, enabling the enterprise to reuse its existing skills and infrastructure investments. 12

Large Addressable Market

Big Data and analytics initiatives are a key driver of infrastructure spending as the transformation to data-driven enterprises continues. According to AMR Research, the global Hadoop market spanning hardware, software, and services is expected to grow from $2.0 billion in 2013 to $50.2 billion by 2020, representing a CAGR of 58%. AMR Research pegged the Hadoop software market at $400 million as of 2013 and forecasts it growing to $11.2 billion by 2020. (See Exhibit 27.) Given Hortonworks' strong positioning within the Hadoop ecosystem, we believe the company will be able to capture a sizable portion of the addressable market, analogous to Red Hat in the paid Linux market.

Exhibit 27: Worldwide Hadoop Market, 2013-2020E (US$ in billions, unless otherwise stated). Source: Hortonworks, AMR Research.

While still in the early phases of adoption, Hadoop has already generated tremendous value by enabling enterprises to cost-effectively address the growth of their data and to do more with Big Data. 1 Companies across multiple industries use the Hortonworks Data Platform (HDP) to improve many functions of their businesses, including product design, R&D, advertising, marketing, sales, and customer experience. (See Exhibit 28.)

Exhibit 28: Diverse Customer Use Cases of Hadoop. Source: Hortonworks.

To gauge the size of the Hadoop market beyond the aforementioned, high-level approach from AMR Research, we first took a conservative, bottom-up approach, making several assumptions around (1) the total amount of data suitable for Hadoop, (2) the amount of data managed by one Hadoop node, (3) the standard price per Hadoop node for support, and (4) the paid attach rate for Hadoop. (See Exhibit 29.)

Exhibit 29: Framework for Sizing Worldwide Hadoop Market If Storing an Amount of Data Equivalent to the Amount of Analytical Data Currently in Relational Database Management Systems (RDBMS)
Terabytes and US$, unless otherwise stated

                               2010        2011        2012        2013        2014
Data Warehousing/Data Mart     2,143,028   3,177,695   3,516,455   5,159,506   7,673,174
Data Analysis/Data Mining      1,274,467   1,473,423   1,822,824   2,686,425   3,818,056
Total Storage Shipments        3,417,495   4,651,118   5,339,279   7,845,931   11,491,230

Sum of Terabytes Shipped, 2010-2014:           32,745,052
TBs per Hadoop Node:                           12
Hadoop Nodes:                                  2,728,754
Price of Annual Subscription per Hadoop Node:  $2,000
Attach Rate of Paid Hadoop:                    65%
Market:                                        $3,547,380,589

Source: Company data, IDC, Credit Suisse estimates.

We estimate that the current market opportunity for paid Hadoop support on existing relational data is approximately $3.5 billion. In the following bullets, we describe the steps and assumptions used to determine this figure.

- To assess the amount of enterprise data that could be stored in a Hadoop cluster, we summed the last five years of storage shipments in terabytes from IDC's Storage Workloads model, specifically the data warehousing/data mart and data analysis/data mining workload segments, totaling 32.7 million terabytes of relational data from 2010 to 2014.
- We determined the number of Hadoop nodes this amount of data would translate to by assuming approximately 12 TB of data per Hadoop node, which, when divided into our aforementioned data estimate, equates to approximately 2.7 million nodes.
- We assumed subscription pricing for paid Hadoop of $2,000 per node. Based on the list pricing of other open-source technologies, we view $2,000 per node for paid Hadoop as reasonable. (See Exhibit 30.)
Exhibit 30: Subscription Pricing for Open-Source Technologies by Vendor
US$, unless otherwise stated

MySQL Database (per 1-4 socket server): Standard $2,000 | Enterprise $5,000 | Cluster Carrier Grade $10,000
JBoss Application Platform: 16-core Standard $6,200 | 16-core Premium $9,000 | 64-core Standard $22,500 | 64-core Premium $32,000; with Management: $8,000 | $12,000 | $29,500 | $42,000
Red Hat Enterprise Linux Server (2 sockets, 1 physical or 2 virtual nodes): Standard $799 | Premium $1,299 | Self Support (no virtual) $349; with Smart Management: $991 | $1,491 | $541
RHEL Server for Virtual Datacenters (2 sockets, unlimited virtual): Standard $1,999 | Premium $3,249; with Smart Management: $2,575 | $3,825
SUSE Linux Enterprise Server: AMD64/Intel64, Physical, 2 socket: Basic $349 | Standard $799 | Priority $1,499; AMD64/Intel64, Virtual, 2 socket: $529 | $1,199 | $1,939; POWER, per socket: $750 | $850 | $1,000
Oracle Linux (per system): Network $119 | Basic Limited $499 | Basic $1,199 | Premier Limited $1,399 | Premier $2,299

Source: Company data, Credit Suisse.

With an estimated $2,000 per node for subscription pricing, we then assumed a 65% paid attach rate for support. This attach rate is based on the attach rate for paid Linux.
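The bottom-up arithmetic behind Exhibit 29 can be sketched in a few lines (inputs are the report's own; rounding differs trivially from the exhibit's $3,547,380,589):

```python
# Sketch of the bottom-up Hadoop market-sizing arithmetic from Exhibit 29.
tb_shipped = 32_745_052    # TB of analytical RDBMS storage shipped, 2010-2014 (IDC)
tb_per_node = 12           # assumed TB of data managed per Hadoop node
price_per_node = 2_000     # assumed annual subscription price per node, US$
paid_attach_rate = 0.65    # assumed share of nodes on paid support (Linux analogy)

nodes = tb_shipped / tb_per_node                    # ~2.7 million Hadoop nodes
market = nodes * price_per_node * paid_attach_rate  # ~$3.5 billion
print(f"{nodes:,.0f} nodes -> ${market / 1e9:.2f}B market")
```

The market estimate is linear in each input, so the sensitivity is easy to read off: every 5 points of paid attach rate is worth roughly $0.27 billion of market at these data and pricing assumptions.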

We believe that this is a conservative assumption, as there is potential for more mission-critical data to be stored within Hadoop clusters, which would prompt even technically capable enterprises to pay for support rather than attempting to self-service entirely. Red Hat's open-source approach was, and continues to be, fairly unique in its ability to focus on and build a sound infrastructure upon which all of its subsequent innovations were built. 22 The paid attach rate for Linux in 2013 was 64.3% and is estimated to stay relatively stable over the next five years, and the total addressable market for Linux is approximately $1.7 billion. 23 (See Exhibit 31.)

Exhibit 31: Worldwide Linux Server Operating Environment Paid and Nonpaid Installed Base, 2009-2018E (in thousands; chart shows paid units, nonpaid units, and the paid attach rate). Source: IDC, Credit Suisse.

Comparing Linux to Hadoop, there is the possibility of attaining an equal if not greater paid attach rate, given that Hadoop has become synonymous with Big Data, a market that is poised to grow exponentially over the next five years. Since Hadoop is an open-source-based analytics platform that could hold confidential customer and corporate data, its subscription attach rate is likely to be higher than that of Linux, which is an operating system with much broader use cases. Therefore, the paid attach rate of Hadoop could potentially exceed 70% over the next five years. We view our market sizing as particularly conservative owing to its assumption of relative data stagnation and its inclusion of only relational data shipped over the last five years.
IDC forecasts that annual shipments in the two data segments used in our calculation will reach a total of over 30 million TB in 2017; total shipments over the next three years are forecast to be 67.1 million TB, more than double the size of the data shipments in 2010 through 2014. Beyond this existing structured relational data lies a much larger pool of unstructured data, as well as un-utilized structured and semi-structured data. Businesses currently use only approximately 12% of their data volumes, largely owing to the difficulty associated with capturing, storing, and subsequently utilizing this unstructured data. 25 (See Exhibit 32.) In addition, enterprises want to do more with data by extracting value from the increasing quantities and varieties of context-rich data to create more intelligent applications and to increase business productivity. Reinforcing these statistics, according to Forrester Research, most firms estimate that they are analyzing only 12% of the data they have under management. The unanalyzed 88% represents a missed opportunity for additional insight; however, traditional datacenter architectures restrict enterprises' ability to capture and process new types of data and bring this data under management.

Exhibit 32: Companies Do Not Use Most of Their Data Today. Source: SAP, Forrsights Strategy.

In addition, unstructured data is projected to grow far faster than structured data. As such, there exists the potential for truly massive amounts of data to be captured in Hadoop clusters. Industry estimates point to 85% of all data growth coming from new types of data, such as social data, clickstreams, server logs, sensors, and the Internet of Things (IoT). (See Exhibit 33.) Existing relational databases are not well suited to harness these types of unstructured data, leaving Hadoop as the preferred platform to manage these growing data sets. 1,2,26,27

Exhibit 33: Sources of New Data Growth. Source: IDC, IDG Enterprise, AMR Research.

As another attempt to size the total addressable market for Hadoop, assuming an opportunity set of 1,500 large enterprises (based on 75% of the Global 2000 deploying a paid version of Hadoop) and 1,000 paid Hadoop nodes in production per company would translate into a market opportunity of 1,500,000 paid Hadoop nodes within the Global 2000. With a price per node of $2,000, the Hadoop subscription total addressable market within the Global 2000 could be $3.0 billion. 1 (See Exhibit 34.)
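This Global 2000 cross-check, including the x86-penetration footnote to Exhibit 34, reduces to a one-line calculation (inputs are the report's own assumptions):

```python
# Sketch of the Global 2000 cross-check behind Exhibit 34.
g2000_adopters = 0.75 * 2000     # 75% of the Global 2000 -> 1,500 enterprises
nodes_per_company = 1_000        # assumed paid Hadoop nodes in production
price_per_node = 2_000           # assumed annual subscription price per node, US$

paid_nodes = g2000_adopters * nodes_per_company  # 1,500,000 paid nodes
tam = paid_nodes * price_per_node                # $3.0 billion

x86_base = 50_000_000                            # global installed x86 servers
penetration = paid_nodes / x86_base              # ~3.0% of the installed base
```

That 1.5 million nodes equals only about 3% of the ~50 million installed x86 servers, which is why the report characterizes this cross-check as conservative.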

Exhibit 34: Framework for Sizing Worldwide Hadoop Market within Global 2000
US$, unless otherwise stated
Source: Hortonworks. 1. Assumes only ~3.0% penetration of the global installed base for x86 servers (~50 million units).

Framing Risk/Reward

While the valuation for Hortonworks is not cheap, we believe that analyzing Red Hat's valuation, as well as the valuations of both M&A transactions in the open-source software market and recent venture capital funding of open-source software companies, suggests a sizable upside return relative to downside risk for Hortonworks' stock from current levels. Within the $1.7 billion Linux market, Red Hat is known as the "de facto standard" for enterprise Linux implementations, with 72.9% share. Given its position as the dominant leader in this open-source market, Red Hat trades at an enterprise value of $10.5 billion. Even based on our aforementioned, somewhat conservative estimates, the potential market for paid Hadoop could be multiple times larger than the paid Linux market. Given Hortonworks' strong positioning within the Hadoop ecosystem, we believe the company will be able to capture a sizable portion of the addressable market, analogous to Red Hat in the paid Linux market. Therefore, the medium- to long-term upside to Hortonworks' valuation could be closer to that of Red Hat. At current levels, Hortonworks trades at an EV/NTM revenue multiple of 18.9x, representing an enterprise value of $1,240 million. We believe the potential for M&A, combined with the company's current valuation, would (at the very least) provide a valuation floor for the stock, which reinforces our thesis that Hortonworks offers long-term investors an attractive risk/reward at current levels. (See Exhibit 35.) For example, even SUSE Linux, which has the second-highest (though far distant) market share in paid Linux behind Red Hat, was acquired for $210 million in November 2003, when the paid Linux market was meaningfully smaller than it is today.
Furthermore, MySQL, the world's most widely used open-source relational database management system (RDBMS) for the enterprise, was acquired by Sun Microsystems for $1 billion in January 2008.

Exhibit 35: Notable M&A in Open-Source Software

Target | Product/Open-Source Project | Acquirer | Value | Date
Cygnus Solutions | Binutils, eCos, Cygwin | Red Hat | $674 million | November 1999
Bluecurve | GNOME theme | Red Hat | $37 million | May 2000
Wirespeed Communications | Embedded Linux | Red Hat | $350 million | August 2000
Hell's Kitchen Systems | CCVS | Red Hat | $85 million | August 2000
C2Net | Stronghold | Red Hat | $40 million | September 2000
SuSE Linux | Linux | Novell | $210 million | November 2003
Android | Android OS | Google | $50 million | August 2005
JBoss | JBoss | Red Hat | $350 million | April 2006
XenSource | Xen | Citrix Systems | $500 million | August 2007
Zimbra | Zimbra | Yahoo! | $350 million | September 2007
MySQL | MySQL | Sun Microsystems | $1 billion | January 2008
Symbian | Symbian | Nokia | $410 million | June 2008
Qumranet | KVM | Red Hat | $107 million | September 2008
SpringSource | Spring Framework | VMware | $420 million | August 2009
Zimbra | Zimbra | VMware | ~$107 million | January 2010
Magento | Magento | eBay | ~$180 million | June 2011
Cloud.com | CloudStack | Citrix Systems | $200 million | July 2011
Gluster | Gluster | Red Hat | $136 million | October 2011
ManageIQ | ManageIQ/CloudForms | Red Hat | $104 million | December 2012
CentOS | CentOS | Red Hat | N/A | January 2014
Inktank Storage | Ceph | Red Hat | $175 million | April 2014
Jaspersoft | Jaspersoft | TIBCO | $185 million | August 2014
FeedHenry | FeedHenry MBaaS | Red Hat | $82 million | September 2014

Source: Credit Suisse.

When comparing the company's enterprise value with the current valuations of private companies in the open-source software distribution market, based on their most recent funding rounds, Hortonworks' valuation falls in line with, or even somewhat below, its peers.
Couchbase, an open-source NoSQL database vendor, was valued at $1.0 billion as of June 2014. MongoDB, another open-source NoSQL database vendor, was valued at $1.2 billion as of its last funding round in October 2013. Cloudera and MapR, Hortonworks' two most direct, pure-play Hadoop competitors, were valued at $4.1 billion and more than $1 billion as of March 2014 and June 2014, respectively. (See Exhibit 36.) Given that Hortonworks currently commands a lower valuation (in terms of absolute dollars) than the last funding round of its most direct competitor, coupled with its market position as the driver of the Apache Hadoop ecosystem, we believe the potential of Hortonworks as an acquisition target provides a reasonable valuation floor for the stock.

Exhibit 36: Notable Venture Capital Funding in Open-Source Software

Vendor | Product/Open-Source Project | Valuation | Date of Last Funding Round
Acquia | Drupal | >$272 million | May 2014
Alfresco | Alfresco | >$300 million | August 2014
Cloudera | Hadoop | $4.1 billion | March 2014
Couchbase | CouchDB | $1.0 billion | June 2014
Docker.io | Docker | $400 million | September 2014
DataStax | Cassandra | $830 million | September 2014
GitHub | Git | $750 million | July 2012
MapR | Hadoop | >$1.0 billion | June 2014
MongoDB | MongoDB | $1.2 billion | October 2013

Source: Credit Suisse.

Potential for Long-Term Margin Expansion

Hortonworks remains on an accelerated growth path and is expected to double its revenue in 2014 versus 2013. Approximately 64% of Hortonworks' revenue currently comes from subscription and support, with a long-term target of more than 80% of revenue. We believe that, as enterprises begin adding nodes to their Hadoop clusters and analyzing mission-critical data, and as Hortonworks maintains a continued focus on selling its subscription and support services, the company is well positioned for long-term margin expansion. Hortonworks' long-term goal is to maintain gross margins of 75-80% and expand operating margins to 20%. (See Exhibit 37.)

Exhibit 37: Long-Term Operating Model

                               2012    2013    2014E   Long-Term Target
Gross Margin                   46%     24%     31%     75-80%
Operating Expenses:
  Research & Development       110%    79%     77%     12-15%
  Sales & Marketing            84%     123%    145%    35-40%
  General and Administrative   45%     49%     44%     7-10%
Operating Margin               -194%   -227%   -235%   20%

Source: Hortonworks, Credit Suisse estimates.

However, the company currently remains in investment mode, continuing to spend aggressively on sales and marketing to drive market share gains and on headcount growth and training, as well as on R&D, in an effort to broaden enterprise adoption of Hadoop and expand operational capacity. We expect this spending to moderate as a percentage of revenue over the longer term, which in turn should drive improvements in operating margin and cash flow.

Risks

Limited Track Record in a Young Ecosystem

Although Hadoop is seeing increasing enterprise adoption, Hortonworks is a very young company; it was founded in 2011 and launched its main product, the Hortonworks Data Platform (HDP), in June 2012. The first release of HDP that included YARN, a key part of expanding the Hadoop ecosystem, came in October 2013. Hortonworks' relationships with its customers are consequently very young.
The company has offered HDP for only eight quarters and, as such, has been through only two support subscription renewal cycles. Hortonworks therefore has limited visibility into the future dynamics of renewals, which are a key driver of revenue and of the viability of the company. 1,12 By comparison, Hortonworks' major competitor, Cloudera, was founded in October 2008 and launched its Cloudera distribution of Hadoop in March 2009. MapR, another Hadoop competitor, was founded in 2009. 1,28,29 A survey by IDC of enterprise IT professionals in 2013 found that 32% of respondents had already deployed Hadoop, 31% planned to do so in the next 12 months, and 37% expected that it would be more than 12 months before they deploy Hadoop. (See Exhibit 38.) Of the Hadoop-using enterprises in the survey, nearly 25% of respondents said that they were using Cloudera for their Hadoop distribution, with over 20% using MapR. Only 15% use Hortonworks, illustrating the lead that Cloudera and MapR gained by being earlier to market than Hortonworks. 30,31 (See Exhibit 39.)

Exhibit 38: IDC 2013 Survey "Has Your Organization Deployed or Considered Deploying Hadoop?" (Yes, already deployed: 32%; Yes, planning to deploy in the next 12 months: 31%; Yes, but it's likely to be longer than 12 months before we deploy it: 37%.) Source: IDC.

Exhibit 39: IDC 2013 Survey "What Hadoop Distribution Do You Use?" (Bar chart ranking Cloudera, MapR, Hortonworks, and Other on a 0-25% scale.) Source: IDC.

Furthermore, although many of Hortonworks' contributors to the Apache Hadoop project have been participating since the beginnings of Hadoop, the Hadoop ecosystem itself is still relatively young. Hadoop was created in 2005 by Doug Cutting and adopted by Yahoo in 2006 but was not fully implemented at Yahoo until several years later. 1,30,31,32

Potential Difficulties in Monetizing Open-Source Software

One of the initial drivers of open-source adoption in the 2000s was cost: IT organizations are able to download the software for free and use it. The open-source model provides the freedom to use open-source software without any third-party commercial vendor. While a few aggressive technology adopters (e.g., Google and other Web companies) may be capable of supporting a certain OSS codebase with internal resources, most mainstream adopters cannot. 7 Although end-user IT organizations may have access to the source code, they may not have the requisite skills and/or bandwidth to fully support it by themselves. Alternatively, they may turn to the community as an excellent knowledge base to augment self-support efforts, especially for sufficiently mature open-source projects. However, these communities never come with a contracted service-level guarantee, and a 24x7 telephone support line is not provided. An IT organization might find answers to a wide range of common technical problems from community resources, but there are no guarantees as to how quickly or accurately those answers will come, or whether they will come at all.
7 Furthermore, this is a particularly risky scenario when open-source solutions are deployed within truly mission-critical workloads. As a result, most organizations rely on a commercial third-party vendor (e.g., Red Hat) for contracted service and support, providing sufficient technical and legal (e.g., warranties and indemnities) assurance. Furthermore, commercial open-source vendors (e.g., Red Hat and Hortonworks) perform extensive testing, tuning, and troubleshooting of the open-source software across a wide range of hardware, configurations, and applications before they package it into their subscription products. Hadoop vendors depend on the growth, goodwill, and subscriptions of the enterprise (i.e., customer satisfaction with their Hadoop support services). A Hadoop distribution sales cycle is also different: organizations often first download free versions of a Hadoop distribution and then come to the vendors as educated customers, which minimizes buyer's remorse and increases rapport. Vendors compete on their competency and support, as well as on their good relationships with partners and customers. Doug Cutting, founder of Hadoop, has explained, "We don't do open-source [only] because giving back is a good thing. What you get when you contribute is the ability to influence and the knowledge that is important." 18

However important the Hadoop ecosystem itself may be, the success of Hortonworks will be determined by its ability to monetize open-source software, a task so far only truly accomplished by Red Hat. As enterprises grow more comfortable with Hadoop and begin adding more nodes to their Hadoop clusters, the data and applications under management will likely become mission-critical themselves. Thus, customers are likely to pay for the subscription support, training, and services offered by Hortonworks in order to leverage this data properly. Companies attempting to implement Hadoop without commercial support unnecessarily spend resources on testing and integrating their Hadoop stack. They run into bugs that could quickly be fixed by people deeply familiar with a particular project. Because of the growing number of Hadoop projects, testing and integration are becoming increasingly daunting tasks. For example, when Hadoop becomes critical for business operations, many companies will require high availability and disaster recovery, leading to additional expenditures on hardware to manage the increased complexity that comes with scale. Additional license fees for third-party software are also required if the software is priced per core or per node. In this situation, there is much more at stake, and access to support and a knowledge base provided by the community is insufficient; enterprises need the commitment and experience of commercial vendors. 18 Though Hadoop is an open-source project and anyone can download it, implementing a Hadoop cluster without the support of a vendor like Hortonworks is potentially difficult. Many IT managers are now specifically in charge of Hadoop deployment, database technology, and data warehousing as organizations continuously search for ways to derive value from Big Data.
Hadoop is on its way to becoming today's de facto data management platform because it (1) offers lower-cost storage than traditional systems; (2) has open-source innovation (which, as we have seen in the case of Red Hat and Linux, is becoming increasingly important to enterprises); (3) can scale, as its distributed file system is suitable for storing and processing Big Data; and (4) has opened firms' eyes to the power and profit potential of Big Data. 13 Hortonworks provides its main product, the Hortonworks Data Platform, for free on an open-source basis. The company then charges for a support subscription or professional services. A support subscription entitles a customer to direct support for implementation of HDP as well as updates, bug fixes, and patches for Hadoop applications. Professional services consist of consulting and training services designed to help enterprises get started with deploying Hadoop and to expand the number of skilled Hadoop users. If Hortonworks cannot convince enterprises that its support subscription and professional services are worth paying for, then it cannot sustain itself as a business. 1 Hortonworks' paid open-source model is highly analogous to that of Red Hat. Red Hat's Linux distribution runs many key systems for enterprises, which is reflected in the paid attach rate for Linux server environments worldwide. The paid attach rate of Linux has been approximately 64% of deployments from 2009 through 2013, and IDC forecasts it to remain relatively stable in the near future. 23 (See Exhibit 40.) Hadoop has the potential to reach, if not exceed, the paid attach rate of Linux, but doing so will depend on whether enterprises come to view the platform as mission-critical.

30 Exhibit 40: Worldwide Linux Server Operating Environment Paid and Nonpaid Installed Base, E in thousands, unless otherwise stated 25,000 20,000 15,000 10,000 5,000 0 Source: IDC E 2015E 2016E 2017E 2018E Paid Nonpaid Paid Attach Rate 100.0% In addition, just because an open-source platform is popular does not ensure that a large paid support ecosystem must arise around it., Apache Webserver has been the most popular HTTP Web server since 1996, serving over 50% of all sites, but does not have a prominent paid support market. 33 (See Exhibit 41.) Exhibit 41: Market Share of All Sites by Web Server 90.0% 80.0% 70.0% 60.0% 50.0% 40.0% 30.0% 20.0% 10.0% 0.0% Source: Should Hadoop be adopted as a mission-critical technology (e.g., managing all of an enterprise's information using HDFS and accessing it using YARN-compatible apps), we believe enterprises will be more likely to pay for support rather than waiting for public releases of patches or attempting to maintenance the platform themselves. However, if Hadoop remains more of a side project, or just used for particular applications rather than as a critical platform, it is very possible enterprises will only pay for minimal support, if at all. According to IDC's 2013 enterprise Hadoop deployment study, nearly 90% of respondents surveyed indicated that they had deployed or planned to deploy 100 or fewer nodes. 1,31 As Hadoop adoption increases, enterprises that are unfamiliar with the platform will likely need help implementing the technology, driving demand for Hortonworks' support subscription and professional services offerings. Once HDP is implemented in the enterprise, we believe enterprises will likely retain Hortonworks' support subscription. A support subscription would provide the enterprise with immediate access to Hortonworks' Hortonworks, Inc. (HDP) 30

first patch in the event of the discovery of a major issue, rather than having to wait for the patch to be integrated into the open-source version of Hadoop, which would be especially important for mission-critical systems running on Hadoop. In addition, while enterprises could choose to task their own developers with writing patches for Hadoop given its open-source nature, we believe that the time and resources would be better spent developing applications that sit on top of the platform. 1

The potential for Hortonworks to achieve the same success as Red Hat in monetizing an open-source project hinges on the company's ability to establish the Hadoop ecosystem as mission-critical. The Hadoop ecosystem is still young, and therefore enterprises may be skeptical about moving sensitive or proprietary data onto their Hadoop clusters or purchasing additional Hadoop clusters. Although Hortonworks does supply the necessary patches, bug fixes, and updated releases to run Hadoop successfully, its success will be determined by how much support enterprises are willing to pay for.

High Cash Burn

Hortonworks is a young company that remains in investment mode, using cash heavily to grow the business. As such, Hortonworks expects to maintain negative operating cash flow, as well as negative free cash flow, for some time. Although we forecast Hortonworks to maintain a viable cash balance through 2016, the company may need to raise capital at some point in the next few years to continue to invest for the long-term market opportunity. (See Exhibit 42.)

Exhibit 42: Cash Balance and Cash Flow Estimates (US$ in millions, unless otherwise stated)
[Table: Cash (Balance Sheet), Operating Cash Flow, and Free Cash Flow, 2014E-2017E. Source: Company data, Credit Suisse estimates.]

Lack of VSOE

Hortonworks generates revenue primarily under multiple-element arrangements (~90% of contracts) that include support subscription offerings combined with consulting and/or training services.
However, Hortonworks has not yet established vendor-specific objective evidence (VSOE) of fair value for its support subscriptions. Therefore, for arrangements in which the company provides both support subscription and professional services offerings, Hortonworks recognizes the entire arrangement fee ratably over the subscription period, although the timing of revenue recognition must be evaluated on an arrangement-by-arrangement basis. 1 As such, the company recognizes revenue from its support subscription and professional services offerings ratably over the period beginning when both services have commenced and ending at the conclusion of the services period. Under the company's multiple-element arrangements, the support subscription element generally has the longest service period, and the professional services element is performed during the earlier part of the support subscription period. 1

Hortonworks separates its customer deals into three phases: (1) customer preparation, (2) implementation, and (3) expand and renew. The customer preparation phase typically lasts for a period of days while the customer is procuring the hardware. During this time, Hortonworks is incurring support subscription expenses but cannot recognize revenue from the deal (support subscription or professional services) until the professional services begin, roughly a month or two later. During the implementation phase, the Hortonworks team helps the customer architect its Hadoop solution and can begin to recognize revenue for the deal once the professional services begin. As Hortonworks exits the implementation phase and enters the expand and renew phase, it is selling support for more nodes and different types of support, with less of a professional services

mix in the contract. 1 The timing of these three phases should create variability in recognized revenue and GAAP operating profit on an arrangement-by-arrangement basis. (See Exhibit 43.)

Exhibit 43: Example of Deal Accounting Timeline
[Chart: subscription and PSO billings across the customer preparation, implementation, and expand and renew phases. Source: Company data, Credit Suisse. Graph is indicative only and does not purport to represent actual billings, revenue, or operating profit for any period or any particular customer contract.]

Valuation Ain't Cheap

Hortonworks currently trades at an enterprise value to next-12-months revenue multiple of 18.9x, a 397% premium to the software industry average multiple of 3.8x. (See Exhibit 44.) However, we conservatively forecast Hortonworks' revenue to grow 57.9% over the next twelve months, a growth rate 49.0 percentage points higher than that of the software sector over the same timeframe.

Exhibit 44: Valuation (US$ in millions, unless otherwise stated)
                     2014E    2015E    2016E
Target
  EV/R               26.7x    16.2x    10.2x
  EV/Subscription    45.6x    26.9x    16.1x
  P/E (Pro Forma)    NM       NM       NM
  EV/CFO             NM       NM       NM
  EV/FCF             NM       NM       NM
  EV/UFCF            NM       NM       NM
Current
  EV/R               37.5x    22.7x    14.3x
  EV/Subscription    63.9x    37.7x    22.6x
  P/E (Pro Forma)    NM       NM       NM
  EV/CFO             NM       NM       NM
  EV/FCF             NM       NM       NM
  EV/UFCF            NM       NM       NM
Source: Company data, Credit Suisse.
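The ratable, multiple-element recognition described under "Lack of VSOE" above can be sketched in a few lines. The figures below are hypothetical; the deal size, subscription length, and preparation period are our own illustrative assumptions, not company data:

```python
def ratable_schedule(total_fee, sub_months, services_start_month):
    """Recognize the entire arrangement fee evenly over the months from
    professional services commencement through subscription end."""
    recognition_months = sub_months - services_start_month
    monthly = total_fee / recognition_months
    # No revenue is recognized during the customer preparation phase.
    return [0.0] * services_start_month + [monthly] * recognition_months

# Hypothetical deal: $120,000 arrangement fee, 12-month subscription,
# professional services begin after a 2-month preparation phase.
schedule = ratable_schedule(120_000, 12, 2)
# Months 1-2: $0; months 3-12: $12,000 per month.
```

The point of the sketch is the deferral: billings may occur at signing, but no revenue is recognized until services commence, which is why deal timing creates quarter-to-quarter variability.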

We have compared Hortonworks' valuation with that of other high-growth software companies. Specifically, we analyzed revenue growth and enterprise value to revenue multiples of comparable companies based on 2015 and 2016 consensus estimates. (See Exhibit 45.)

Exhibit 45: Revenue Growth and EV/Revenue Multiples for Comparable Companies
                                        Revenue Growth      EV/Revenue
                                        2015E    2016E     2015E   2016E
Blackbaud Inc (BLKB)                    10.3%     4.7%      3.3x    3.1x
Callidus Software Inc (CALD)            20.8%    20.1%      4.6x    3.8x
Covisint Corp (COVS)                     2.6%      NA       0.6x     NA
Salesforce.com Inc (CRM)                22.6%    19.6%      6.0x    5.0x
Castlight Health Inc (CSLT)             83.5%    68.9%     10.7x    6.4x
Cornerstone OnDemand Inc (CSOD)         30.8%    28.5%      5.4x    4.2x
Constant Contact Inc (CTCT)             17.0%    15.8%      2.5x    2.2x
Tableau Software Inc (DATA)             41.2%    35.2%      9.6x    7.1x
Digital River Inc (DRIV)                 6.1%     7.7%      1.4x    1.3x
Demandware Inc (DWRE)                   38.1%    37.6%      8.3x    6.1x
E2open Inc (EOPN)                       19.1%      NA       2.6x     NA
FireEye Inc (FEYE)                      47.1%    39.1%      6.8x    4.9x
Rocket Fuel Inc (FUEL)                  51.4%    24.3%      1.0x    0.8x
Guidewire Software Inc (GWRE)           10.3%    14.3%      7.6x    6.7x
LogMeIn Inc (LOGM)                      16.8%    16.1%      3.8x    3.3x
Marketo Inc (MKTO)                      35.4%    30.5%      6.2x    4.8x
Marin Software Inc (MRIN)               19.2%    20.5%      1.9x    1.6x
ServiceNow Inc (NOW)                    39.9%    31.9%     10.3x    7.8x
NetSuite Inc (N)                        30.6%    28.2%     11.2x    8.7x
Palo Alto Networks Inc (PANW)           40.6%    30.9%      9.7x    7.4x
Paycom Software Inc (PAYC)              29.2%    28.2%      7.0x    5.5x
Proofpoint Inc (PFPT)                   25.9%    23.4%      7.2x    5.8x
Rackspace Hosting Inc (RAX)             16.0%    15.3%      3.1x    2.7x
RingCentral Inc (RNG)                   26.8%    26.1%      3.2x    2.5x
incontact Inc (SAAS)                    20.1%    22.5%      2.5x    2.1x
Synchronoss Technologies Inc (SNCR)     19.6%    16.1%      3.1x    2.7x
Splunk Inc (SPLK)                       38.1%    32.4%     10.9x    8.3x
SPS Commerce Inc (SPSC)                 21.6%    20.5%      4.9x    4.1x
Tangoe Inc (TNGO)                       13.7%    13.3%      1.8x    1.6x
DealerTrack Technologies Inc (TRAK)     20.4%    15.1%      2.9x    2.5x
Ultimate Software Group Inc (ULTI)      21.9%    22.1%      6.6x    5.4x
Veeva Systems Inc (VEEV)                32.1%    19.1%      7.7x    6.5x
Workday Inc (WDAY)                      48.9%    38.1%      6.1x    4.4x
Zendesk Inc (ZEN)                       44.8%    37.5%      9.3x    6.8x
Average                                 28.2%    25.1%      5.7x    4.6x
Weighted Average                        25.9%    33.7%      5.9x    4.6x
Median                                  24.8%    22.9%      6.0x    4.6x
Hortonworks Inc (HDP)                   63.9%    52.2%     14.8x    9.7x
Source: Company data, Thomson, Credit Suisse.

Hortonworks' revenue grew 176.2% in 2013. For 2014 and 2015, we conservatively forecast Hortonworks' revenue to grow 92.5% and 65.3%, respectively. (See Exhibit 46.)

Exhibit 46: Historical and Forecast Revenue Growth
2013: 176.2%; 2014E: 92.5%; 2015E: 65.3%; 2016E: 58.5%; 2017E: 46.2%
Source: Company data, Credit Suisse.

Given that Hortonworks maintains revenue growth rates well above the software industry average, and given the large total addressable market for Big Data, we believe that Hortonworks deserves a multiple at a meaningful premium to the company's peer group average. As a result, we set our target price at $35, which implies an enterprise value to revenue multiple of 16.2x based on our 2015 revenue estimate of $76.6 million.
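The target multiple arithmetic can be checked directly; the share count and net cash needed to bridge from enterprise value to the $35 price are not reproduced in this excerpt, so the sketch stops at implied enterprise value:

```python
# Implied enterprise value at the target multiple stated above.
target_ev_to_rev = 16.2   # target EV/2015E revenue multiple
rev_2015e = 76.6          # CS 2015 revenue estimate, US$ in millions
implied_ev = target_ev_to_rev * rev_2015e
print(round(implied_ev, 1))  # roughly US$1,240.9 million of enterprise value
```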

Market Overview

The Evolution of Data Management Technologies

Big Data consists of structured, semi-structured, unstructured, and raw data in many different formats. These multi-structured data types are fundamentally different from the scalar, structured numbers and text that organizations have been storing in relational data warehouses for the past three decades. 43 (See Exhibit 47.)

Exhibit 47: Classification of Data
Definition        Description
Structured        Relational database (i.e., full ACID support, referential integrity, strong type and schema support)
Semistructured    Structured data files that include metadata and are self-describing (e.g., netCDF and HDF5)
Semistructured    XML data files that are self-describing and defined by an XML schema
Quasistructured   Data that contains some inconsistencies in data values and formats (e.g., Web clickstream data)
Unstructured      Text documents amenable to text analytics
Unstructured      Images and video
Source: Gartner, Credit Suisse.

The increased diversity of data sets, which include traditional structured data as well as semistructured or unstructured data types, has sparked the emergence of new approaches to data management that allow this information to be effectively understood and analyzed. (See Exhibit 48.)

Exhibit 48: Data Management Technologies
[Chart. Source: Credit Suisse, Splunk.]

Relational Database

In the 1980s, relational databases became popular, and highly structured data was stored in database tables with rows and columns, similar to Excel spreadsheets. 34 The relational model separates data into many interrelated tables that contain rows and columns, and tables reference each other through foreign keys that are also stored in columns. 35 (See Exhibit 49.)

Exhibit 49: Relational Data Model
[Diagram of interrelated tables linked by foreign keys.]

SQL (Structured Query Language) was one of the first commercial declarative languages designed for managing structured data in relational database management systems (RDBMS) and is essentially the standard query language for requesting information from a relational database. Although non-transactional data is sometimes stored in a data warehouse, approximately 95-99% of the data is usually transactional data. 36 While relational databases were much more intuitive for developers, joining multiple database tables to obtain the needed information required complex logic as well as highly structured, rather inflexible data structures. 34

Single RDBMS: Symmetric multi-processing (SMP) data warehousing (DW) is the most common architecture for data warehouses under 50 TB. These systems are characterized by a single instance of an RDBMS sharing all resources (including CPU, memory, and disk) when executing workloads. However, as SMP-based DW implementations grow in terms of data size and query complexity, designing and maintaining an efficient data warehouse infrastructure becomes more and more challenging. 37 When applied to large-scale analytics, with its very broad and complex database queries, the SMP approach is particularly vulnerable to a variety of bottlenecks and inefficiencies that prevent the large-scale parallel execution of queries, including long response times for data-intensive queries (e.g., full table scans). As such, SMP-based decision support databases suffer substantial efficiency losses when SMP hardware is clustered to build out larger system sizes. 38

MPP RDBMS: MPP is a fundamentally different relational database architecture, uniquely designed at every level for large-scale decision support rather than transaction processing.
Massively parallel processing (MPP) architectures, consisting of independent processors or servers executing in parallel, provide high query performance and platform scalability. As opposed to shared-memory and shared-disk approaches, most MPP architectures implement a "shared-nothing" architecture in which each server operates self-sufficiently and controls its own memory and disk. Data warehouse appliances distribute data onto dedicated disk storage units connected to each server in the appliance. This distribution allows data warehouse appliances to resolve a relational SQL query by scanning data on each server in parallel. This divide-and-conquer approach delivers high performance and scales linearly as new servers are added to the architecture. 39

RDBMS Sharding: Sharding is a type of database partitioning that separates very large databases into many much smaller databases that share nothing and can be spread across multiple servers. The governing concept behind sharding is that as the size of a database and the number of transactions per unit of time made on the database increase linearly, the response time for querying the database increases exponentially. The costs of creating and maintaining a very large database in one place can also increase exponentially, because the database will require high-end computers. In contrast, data shards can be distributed across a number of much less

expensive commodity servers. However, sharding can be a more complex process in some scenarios. Sharding a database that holds less structured data, for example, can be very complicated, and the resulting shards may be difficult to maintain. 40

Despite the omnipresence of relational data management technologies, existing RDBMS systems are simply not engineered to handle the high-volume, variable, and dynamic nature of Big Data. 41 (See Exhibit 50.) For example, vertical scaling (i.e., scaling up) of an RDBMS, which involves running on the most powerful single server available, is both very expensive and limiting. No single server available today or in the foreseeable future has the necessary power to process so much data in a timely manner. 42 Clustering beyond a handful of RDBMS servers is notoriously hard, 42 and even the most modern massively parallel processing (MPP) RDBMS would still struggle with the petabytes of data commonly associated with genetics, physics, aerospace, counterintelligence, and other scientific, medical, and government applications. 43 Furthermore, a large number of programming problems cannot be easily solved in a SQL dialect, as SQL is designed for retrieving data and performing relatively simple transformations, not for complex programming tasks. 44 In addition, an RDBMS enforces a strict, pre-defined structure when loading data (i.e., tables and columns), which limits the data types that can be handled and reduces flexibility. 45 To anticipate what the data might yield and pre-define a matching database structure is a daunting if not impossible task. 5

Exhibit 50: New Data Types Are Placing Crippling Pressure on Traditional RDBMS Architectures
[Chart. Source: Company data, Credit Suisse.]

SQL and MapReduce in an MPP Data Warehouse

If an enterprise needs indexes, relationships, transactional guarantees, and lower latency, a database is needed.
If a database is needed, a massively parallel processing (MPP) data warehouse that supports MapReduce will allow for more expressive analysis than one that supports only SQL. 46
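To make "more expressive than SQL" concrete: computations that depend on ordering and state, such as sessionizing a clickstream, are awkward to express in plain SQL but natural in a MapReduce-style program, where a reduce function receives all of one user's events. A minimal sketch (the 30-minute idle gap and the timestamps are illustrative assumptions of ours):

```python
def sessionize(timestamps, gap=30 * 60):
    """Split one user's click timestamps (in seconds) into sessions,
    starting a new session whenever the idle gap exceeds `gap`."""
    sessions, current = [], []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > gap:
            sessions.append(current)  # idle gap exceeded: close session
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

# Three clicks in quick succession, then one more than an hour later:
print(sessionize([0, 60, 120, 3900]))  # → [[0, 60, 120], [3900]]
```

In a MapReduce job, the map phase would emit (user_id, timestamp) pairs and the framework would group them by user, so this reduce-side logic runs once per user.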

In recent years, a significant amount of research and commercial activity has focused on integrating MapReduce and relational database technology. There are two approaches to this problem: (1) starting with a parallel database system and adding MapReduce features (e.g., Aster Data, Greenplum, Oracle 11g's Table Functions) or (2) starting with MapReduce and adding database system technology (e.g., HadoopDB and Hadapt). 47 Extended RDBMS systems cannot, however, be the only solution for Big Data analytics. At some point, tacking non-relational data structures and non-relational processing algorithms onto the basic, coherent RDBMS architecture becomes unwieldy and inefficient. 6 For example, to anticipate what machine-generated IT data might yield and pre-define a matching database structure is a daunting if not impossible task. 5

NoSQL

Originally designed for Web-scale databases, Not Only SQL (NoSQL) databases provide mechanisms for storing and retrieving data using looser consistency models than those of traditional relational databases. The NoSQL taxonomy includes key-value stores, document stores, column stores, and graph databases. 48 (See Exhibit 51 and Exhibit 52.)

Exhibit 51: NoSQL Databases
[Chart.]

Exhibit 52: NoSQL Classification Based on Features
             Performance   Scalability       Flexibility   Complexity   Functionality
Key-Value    High          High              High          None         Variable (None)
Column       High          High              Moderate      Low          Minimal
Document     High          Variable (High)   High          Low          Variable (Low)
Graph        Variable      Variable          High          High         Graph Theory
Source: Christof Strauch, "NoSQL Databases."

Key-value: Key-value data stores allow developers to store arbitrary types of data using attributes that are defined as needed. Unlike relational databases, key-value data stores do not require a predefined schema with all attributes defined. This model is suited for applications storing simply structured data with the potential for a large number of frequently changing attributes.
Even on low-end hardware, key-value data stores can achieve read and write rates of 100,000 I/Os per second. 49

Column: Column-store databases work well for Big Data applications that can benefit from MapReduce analysis. Despite the increased performance that comes with additional hardware, however, column data stores are not known for supporting complex queries and rapid query response times. 49 A column family resembles a table in an RDBMS. Column families contain rows and columns, and each row is uniquely identified by a row key. Unlike a table in an RDBMS, different rows in the same column family do not have to share the same set of columns, and a column may be added to one or multiple rows at any time. 50

Graph: Graph databases use nodes and links between nodes as the basic building blocks. Networks are easily modeled with graph databases, making them suitable for social network analysis, workflow modeling, and other systems of linked or interacting entities. Graph databases allow one to easily create queries about paths and

relationships while providing read performance comparable to that of other NoSQL databases. However, because write performance may not achieve the same levels as other NoSQL databases, graph databases might not be the best option for write-intensive applications. 49

Document: Document databases store data in denormalized data structures called documents. Like key-value pair databases, document databases do not require a fixed schema. Schema flexibility combined with wide programming language support (e.g., Java, Python, C#, JavaScript, Ruby) makes document databases a good option for developers who need flexibility. One disadvantage that comes with schema flexibility is the lack of a complex query language. As a result, writing application code might be needed for operations that would have been performed in SQL when using a relational database. 49

Regardless of the flavor of NoSQL, motivations for this approach include simplicity of design, horizontal scaling, and finer control over availability. NoSQL databases are often intended for retrieval and appending operations, with the goal of significant performance benefits in terms of latency and throughput. Many NoSQL systems are also referred to as "Not only SQL" to emphasize that they do in fact allow SQL-like query languages to be used. Despite the increasing adoption of non-relational databases, even these NoSQL approaches fail to meet all of today's increasing data management requirements. 1

What is Hadoop?

Although often used interchangeably, MapReduce and Hadoop are not synonymous. Hadoop, an open-source project under the Apache Software Foundation, is a generic processing framework designed to execute queries and other operations against massive datasets that can be tens or hundreds of terabytes and even petabytes in size.
As a key component of the platform, Hadoop leverages an open-source implementation of the MapReduce paradigm; Hadoop's is the most popular and well-known implementation of MapReduce, distinct from other implementations by Google, Aster Data, Greenplum, Splunk, and others. 51 Hadoop, created by Doug Cutting and named after his son's toy elephant, is a software framework for running applications that process vast amounts of data in parallel on large clusters of commodity hardware (potentially thousands of nodes) in a reliable, fault-tolerant manner. Hadoop allows organizations to achieve storage and high-quality query capabilities on a large dataset in an efficient and relatively inexpensive manner over a distributed file system, known as the Hadoop Distributed File System (HDFS), which can easily scale out. 52 Inasmuch as a computer exists to process data, Hadoop in effect transforms lots of cheap little computers into one big computer that is especially good at analyzing indexed text. 53 As a result, the popularity of Hadoop has grown over the past several years, especially with organizations that require analysis of multi-structured data, including highly unstructured, text-based data as well as machine-generated logs. 52

Hadoop has been particularly useful in environments in which massive server farms are being used to collect the data. Hadoop is able to process parallel queries as big, background batch jobs on the same server farm, which saves the user from having to acquire additional hardware for a traditional database system to process the data. Most importantly, Hadoop also saves the user from having to load the data into another system, as the huge amount of data that would need to be loaded can make this impractical. 42 Many of the ideas behind the open-source Hadoop project originated from the Internet search community, most notably Google and Yahoo!.
Search engines employ massive farms of inexpensive servers that crawl the Internet retrieving Webpages into local files. They then process this data using massive parallel queries to build indexes to enable search. 42

The actual algorithms used by search engines to determine relevance to search terms and the quality of Webpages are very specialized, sophisticated, and highly proprietary; these algorithms are the secret sauce of search. The application of the algorithms to each Webpage and the aggregation of the results into indexes to enable search is done through MapReduce processes and is more straightforward (although done on a massive scale). The map function identifies each use of a potential search parameter in a Webpage. The reduce function aggregates this data (e.g., determining the number of times a search parameter is used in a page). 42

Some large Websites use Hadoop to analyze usage patterns from log files or clickstream data generated by hundreds or thousands of their Web servers. The scientific community can use Hadoop on huge server farms that monitor natural phenomena and/or the results of experiments. The intelligence community needs to analyze vast amounts of data gathered by server farms monitoring phone, e-mail, instant messaging, travel, shipping, and other activity to identify potential terrorist threats. 42

Hadoop 1.0

The Apache Software Foundation announced the initial release of Hadoop 1.0 on December 10, 2011, which included the HDFS file system and the MapReduce processing engine. With 1.0, the main development components of Hadoop were contained in a single coherent release. 54 (See Exhibit 53.)

Exhibit 53: Hadoop 1.0: Batch
[Diagram. Source: Hortonworks.]

The term "Hadoop" commonly refers to the main components of the base platform, the ones from which others offer higher-level services.
In Hadoop 1.0, these components comprised the storage framework and the processing framework: (1) the Hadoop Distributed File System library and (2) the MapReduce library. Both work together with a core library, known as Hadoop Common, a set of utilities that contains the JAR files and scripts needed to start Hadoop and also provides source code, documentation, and a contribution section that includes projects from the Hadoop community, to enable the higher-level services of Hadoop. These represent the first Hadoop projects, which established a foundation for the others. 55 (See Exhibit 54.)

Exhibit 54: Hadoop 1.0 Base Platform and Ecosystem
[Diagram.]

Hadoop MapReduce: Hadoop MapReduce implements the MapReduce functionality over HDFS. A MapReduce job usually splits the input dataset into independent chunks, which are processed by the map tasks in a completely parallel manner. (See Exhibit 55.) The framework sorts the outputs of the maps, which are then input to the reduce tasks. Both the input and the output of the job are typically stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. 56
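The split-map-sort-reduce data flow described above can be imitated in a few lines of single-process Python. This is only a toy illustration (the real framework distributes the splits across TaskTrackers and handles failures); word counting is the conventional example:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(split):
    # Map: emit a (key, value) pair for every word in an input split.
    return [(word, 1) for word in split.split()]

def run_job(splits):
    # The framework's role: run maps over independent splits,
    # sort/shuffle the intermediate pairs by key, then reduce per key.
    intermediate = [pair for split in splits for pair in map_phase(split)]
    intermediate.sort(key=itemgetter(0))          # the sort/shuffle step
    return {key: sum(v for _, v in group)         # Reduce: aggregate per key
            for key, group in groupby(intermediate, key=itemgetter(0))}

print(run_job(["big data big", "data platform"]))
# → {'big': 2, 'data': 2, 'platform': 1}
```

Note that each map task touches only its own split, which is what lets the real framework scale the map phase across thousands of nodes.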

Exhibit 55: Hadoop MapReduce Lifecycle
[Diagram.]

Hadoop Distributed File System (HDFS): The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS has many similarities with existing distributed file systems; however, the differences are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large datasets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was designed with these goals in mind: 57

o Hardware Failure: Hardware failure is the norm rather than the exception. The system recovers quickly from nodes that do not return results in a timely fashion. 57

o Large Datasets: Applications that run on HDFS have large datasets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files, providing high aggregate data bandwidth and scaling to hundreds of nodes in a single cluster. HDFS should support tens of millions of files in a single instance. 57

o "Moving Computation Is Cheaper than Moving Data": A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the dataset is huge. HDFS provides interfaces for applications to move themselves closer to where the data is located. 57

HDFS is designed to reliably store very large files across machines in a large cluster. HDFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation

time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. 57

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients; thus, the NameNode is a single point of failure in a given cluster. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. A file is internally split into one or more blocks, and these blocks are stored in a set of DataNodes. (See Exhibit 56.) A Hadoop client is a user's terminal into the Hadoop cluster that initiates processing; no actual job code runs on it. The NameNode executes file system namespace operations, such as opening, closing, and renaming files and directories, and determines the mapping of blocks to DataNodes. A single master JobTracker serves as the query coordinator, handing out tasks to one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing failed tasks; the slaves execute the tasks as directed by the master. 58 The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. 57 (See Exhibit 57.)

Exhibit 56: HDFS Cluster Architecture
[Diagram. Source: Oracle, Credit Suisse.]

Exhibit 57: Querying Data from an HDFS Cluster
[Diagram. Source: Oracle, Credit Suisse.]

A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine, but in a real deployment that is rarely the case.
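The block mechanics described above are easy to make concrete. In this sketch, the block size and replication factor use the Hadoop 1.x defaults (64 MB and 3), and the round-robin placement is a deliberate simplification of ours; the real NameNode's placement policy is rack-aware:

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024   # Hadoop 1.x default block size (64 MB)
REPLICATION = 3                 # default replication factor

def split_into_blocks(file_size):
    """A file is stored as a sequence of fixed-size blocks;
    only the last block may be smaller."""
    full, remainder = divmod(file_size, BLOCK_SIZE)
    return [BLOCK_SIZE] * full + ([remainder] if remainder else [])

def place_blocks(blocks, datanodes):
    """Toy NameNode decision: assign each block's replicas to
    consecutive DataNodes in round-robin order."""
    ring = itertools.cycle(datanodes)
    return [[next(ring) for _ in range(REPLICATION)] for _ in blocks]

blocks = split_into_blocks(200 * 1024 * 1024)          # a 200 MB file
placement = place_blocks(blocks, ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks))   # → 4 (three 64 MB blocks plus one 8 MB block)
```

Losing any single DataNode leaves at least two replicas of every block, which is what lets HDFS treat hardware failure as the norm rather than the exception.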
The NameNode makes all decisions regarding the replication of blocks. The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly; a Blockreport contains a list of all blocks on a DataNode. 57

In addition to the Hadoop core, an ecosystem of Hadoop subprojects exists that relies on HDFS for input and output data and on MapReduce for processing, each serving different needs and focuses. Some of these, which depend on the availability of the platform, include HBase (columnar database), Hive (data warehouse/data mining), Pig (scripting), and Chukwa (log analysis). Conversely, ZooKeeper (coordination service) is independent of Hadoop availability and is used by HBase, and Avro

(serialization/deserialization) is designed to support the main service component requirements. 55 (See Exhibit 58.)

Exhibit 58: Hadoop 1.0 Ecosystem and Processes
[Diagram.]

What Were the Limitations of Hadoop 1.0?

Even though Hadoop is a free project, and companies want it because they have heard they need it, they cannot always think of applications for it. Another issue is that large, distributed systems are complex, and many traditional companies want infrastructure that meets their requirements around such things as security and reliability but that can be managed without an entire team of people. This issue is exactly what companies like Hortonworks spend their time working on. 30

Hadoop 1.0 had a single JobTracker, which had to deal with thousands of TaskTrackers and MapReduce tasks. This architecture limited scalability and allowed a cluster to run only a single application at a time. In addition, there were several other issues with Apache Hadoop 1.0. First, there was only one NameNode, which managed the whole cluster; it handled all metadata operations and stored metadata in RAM. With scalability limited to approximately 4,000 nodes and 40,000 tasks, this node was a single point of failure. It was also impossible to update the Hadoop components on only some of the nodes, and the MapReduce paradigm could be applied to only limited types of tasks. Hadoop 1.0 offered no data processing model other than MapReduce, and the resources of a cluster were not utilized in the most effective way. 59

HDFS specifically also had limitations in its architecture. HDFS originally had two main layers, Namespace and Block Storage, which allowed only a single namespace for the entire cluster, managed by a single NameNode. This resulted in a tight coupling of the two layers, which made alternate implementations of NameNodes challenging and prevented other services from using the block storage directly.
60 Furthermore, while HDFS cluster storage scales horizontally with the addition of datanodes, the namespace does not. The namespace can be scaled only vertically, on a single Namenode, which stores the entire file system metadata in memory. This limits the number of blocks, files, and directories supported on the file system to what can be accommodated in the memory of a single Namenode. A typical large deployment at Yahoo! includes an HDFS cluster with 2,700-4,200 datanodes, 180 million files and blocks, and ~25 PB of storage. At Facebook, HDFS has around 2,600 nodes and 300 million files and blocks, addressing up to 60 PB of storage. While these are very large systems and good enough for the majority of Hadoop users, a few deployments might want to grow even larger and, as a result, find the namespace scalability limiting. 60
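The vertical-scaling ceiling described above is easy to quantify with back-of-the-envelope arithmetic. The figure of roughly 150 bytes of Namenode heap per file or block object is a commonly cited rule of thumb rather than an exact number, so the sketch below is illustrative only.

```python
def max_namespace_objects(heap_bytes, bytes_per_object=150):
    """Rough ceiling on the files + blocks one Namenode can track.

    Every file and every block is a metadata object held in the
    Namenode's RAM; ~150 bytes per object is a commonly cited rule
    of thumb (an assumption here, not an exact figure)."""
    return heap_bytes // bytes_per_object

# Even a large 64 GB heap caps the namespace at roughly 450 million
# objects -- no matter how many datanodes are added to the cluster.
objects = max_namespace_objects(64 * 1024**3)
```

Under this assumption, a 64 GB heap tops out on the same order of magnitude as the 300-million-object Facebook deployment described above, which is why adding datanodes alone cannot grow the namespace.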

At Yahoo! and many other deployments, the cluster is a multi-tenant environment shared by many organizations. A single Namenode offers no isolation in this setup: a separate namespace for a tenant is not possible, and an experimental application that overloads the Namenode can slow down other production applications. A single Namenode also does not allow segregating different categories of applications (such as HBase) onto separate networks. 60 While most distributions were developed to address these limitations, they did not introduce any significant architectural changes compared with the open-source version. This is what made Hadoop 2.0 a real breakthrough when it emerged. In particular, Hadoop 2.0 features YARN (Yet Another Resource Negotiator), a layer between HDFS and data processing applications that turned Hadoop from a batch processing solution into a real multi-application platform. Hadoop 2.0 eliminated a number of issues; in particular, it eliminated the vulnerability of a system with a single NameNode and increased the possible number of nodes in a cluster. YARN also extended the range of tasks that can be successfully solved with Hadoop. 59

Hadoop 2.0

Hadoop 2.0's major innovation was the introduction of YARN, which allows for multiple ways to interact with the data in HDFS (batch with MapReduce, streaming with Apache Storm, and interactive SQL with Apache Hive and Apache Tez). 61 MapReduce is great for many applications, but not everything; other programming models better serve requirements such as graph processing (e.g., Google Pregel/Apache Giraph) and iterative modeling using the Message Passing Interface (MPI). As is often the case, much of the enterprise data is already available in Hadoop HDFS, and having multiple paths for processing is critical and a clear necessity.
Furthermore, since MapReduce is essentially batch-oriented, support for real-time and near real-time processing has become an important issue for the user base. A more robust computing environment within Hadoop enables organizations to see an increased return on their Hadoop investments by lowering operational costs for administrators, reducing the need to move data between Hadoop HDFS and other storage systems, and providing other such efficiencies. 20 The need for an updated version of Hadoop arose from the desire of users and customers to store all data in HDFS and interact with that data in multiple ways, including (1) real-time processing of events (sensor, telecommunications, fraud, etc.) even before they land on HDFS, (2) interactive query capabilities for interrogating new data for data analysts (SQL) and data scientists (SQL plus scripting), and (3) the need to productionize the resulting insight via batch processing and reporting in a well-defined and timely manner. 61 YARN significantly changes the game, recasting Apache Hadoop as a much more powerful system by moving it beyond MapReduce into additional frameworks. YARN is designed to allow individual applications (via the ApplicationMaster) to utilize cluster resources in a shared, secure, and multi-tenant manner. YARN provides a framework for managing both MapReduce and non-MapReduce tasks of greater size and complexity, and it provides the framework to apply low-cost commodity hardware to virtually any Big Data problem. 20 In fact, YARN is the architectural center of Apache Hadoop 2.0, enabling more efficient cluster utilization. More and more customers are asking about tools and applications that are integrated into YARN to maximize the value of their Hadoop clusters. 21 (See Exhibit 59.)

Exhibit 59: Key Driver of Hadoop Adoption: Enterprise Data Lake
Source: Hortonworks.

The enhancements and major features of Apache Hadoop 2.0 are YARN, high availability for HDFS, HDFS Federation, HDFS Snapshots, NFSv3 access to data in HDFS, binary compatibility for MapReduce applications between Hadoop 1.0 and Hadoop 2.0 to ease migration, support for running Hadoop on Microsoft Windows, and integration testing for the entire Apache Hadoop ecosystem at the ASF. 61 (See Exhibit 60.)

Exhibit 60: Hadoop 2.0: Multiple Workloads
Source: Hortonworks.

YARN

To improve on this early functionality, Hortonworks engineers created the initial architecture and developed the technology for YARN within the Apache Hadoop community, leading to the release of YARN in October. This technology advancement transformed Hadoop (i.e., Hadoop 2.x) into a platform that allows for multiple ways of interacting with data, including interactive SQL processing, real-time processing, and online data processing, along with its traditional batch data processing. Several early vendors of traditional Hadoop have not made the transition from the bolt-on architecture of traditional Hadoop to fully embracing YARN in their respective offerings. 1 (See Exhibit 61.)

Exhibit 61: YARN Development Framework
Source: Hortonworks.

YARN is a significant innovation, eliminating the need to silo data sets and reducing total cost of ownership by enabling a single cluster to store a wide range of shared data sets on which mixed workloads spanning batch, interactive, and real-time use cases can simultaneously process with predictable service levels. YARN is designed to serve as a common data operating system that enables the Hadoop ecosystem to natively integrate applications and leverage existing technologies and skills while extending consistent security, governance, and operations across the platform. With these capabilities, YARN can facilitate mainstream Hadoop adoption by enterprises of all types and sizes for production use cases at scale. 1 The fundamental idea of YARN is to split up the two major responsibilities of the JobTracker (i.e., resource management and job scheduling/monitoring) into separate daemons: a global ResourceManager and a per-application ApplicationMaster (AM). The ResourceManager and the per-node slave, the NodeManager (NM), form the new, generic system for managing applications in a distributed manner. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is, in effect, a framework-specific entity and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the component tasks. The ResourceManager has a pluggable Scheduler, which is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues, etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application and offers no guarantees on restarting failed tasks, whether due to application failure or hardware failure.
The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a Resource Container, which incorporates resource elements such as memory, CPU, disk, and network. The NodeManager is the per-machine slave, which is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network), and reporting the same to the ResourceManager. The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status, and monitoring progress. From the system perspective, the ApplicationMaster itself runs as a normal container. 62 (See Exhibit 62.)
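The division of labor described above (a pure Scheduler granting Resource Containers, with a per-application ApplicationMaster tracking only its own tasks) can be sketched as a toy model. The classes below are illustrative simplifications of our own, not YARN's actual APIs.

```python
from dataclasses import dataclass

@dataclass
class Container:
    """A granted right to use resources on one node (simplified)."""
    node: str
    memory_mb: int
    vcores: int

class Scheduler:
    """Pure scheduler: allocates containers against per-node capacity and
    does no monitoring, tracking, or restarting of application tasks."""

    def __init__(self, nodes):
        self.free = dict(nodes)  # node name -> (free memory MB, free vcores)

    def allocate(self, memory_mb, vcores):
        for node, (mem, cores) in self.free.items():
            if mem >= memory_mb and cores >= vcores:
                self.free[node] = (mem - memory_mb, cores - vcores)
                return Container(node, memory_mb, vcores)
        return None  # no node can currently satisfy the request

class ApplicationMaster:
    """Per-application entity: negotiates containers from the Scheduler
    and tracks only the containers it was granted."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.containers = []

    def request(self, memory_mb, vcores):
        granted = self.scheduler.allocate(memory_mb, vcores)
        if granted is not None:
            self.containers.append(granted)
        return granted

sched = Scheduler({"n1": (4096, 4), "n2": (2048, 2)})
am = ApplicationMaster(sched)
first = am.request(3072, 2)   # fits on n1
second = am.request(2048, 2)  # n1 now too small, lands on n2
```

Note that the toy Scheduler simply returns `None` when capacity is exhausted: restarting or retrying is the ApplicationMaster's problem, mirroring the "pure scheduler" split described in the text.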

Exhibit 62: Architecture Enabled by YARN
Source: Hortonworks.

As a cluster operating system, YARN provides two distinct capabilities to applications: resource management and workload management. Resource management refers to resource allocation and attendant resource isolation across a cluster of many thousands of nodes and tens of thousands of applications, including aspects such as tracking node failures and the availability of resources at individual machines. Workload management refers to the mechanics of deciding whom to allocate resources to (applications, users, queues), SLAs for allocation (e.g., via preemption), and so on. 63 While YARN is a major step toward making Hadoop enterprise-ready, it ensures compatibility for existing MapReduce applications and users; one of the crucial implementation details for MapReduce within the new YARN system is the reuse of the existing MapReduce framework without any major changes. 20 Hadoop 1.0 was a combination of the MapReduce environment with HDFS. Prior to YARN, MapReduce was the primary means of getting at HDFS data and developing Hadoop applications.
YARN changes this by allowing users to leverage Hadoop's storage and cluster management capabilities without going through MapReduce if they so choose. YARN provides an accessible resource management layer over HDFS. 64 With YARN, Hadoop is no longer just a MapReduce-based batch environment; users can run many applications on it concurrently. The goal is to be able to cater to streaming applications (e.g., data being analyzed and acted upon as it is streamed into Hadoop for storage and later use), interactive applications (e.g., OLTP usage), and Big Data applications (e.g., extensive queries against high volumes of data and associated analytics workloads). Such a capability extends Hadoop to a new level, as all of these distinct categories of applications can easily have an interest in the same data stored in HDFS. 64 (See Exhibit 63.)

Exhibit 63: Apache YARN Architecture
Source: Hortonworks.

ResourceManager: Arbitrates resources among all the applications in the system. 65 It is strictly limited to arbitrating available resources in the system among the competing applications. It optimizes for cluster utilization (keeping all resources in use all the time) against various constraints such as capacity guarantees, fairness, and SLAs. To allow for different policy constraints, the ResourceManager has a pluggable scheduler that enables different algorithms, such as capacity and fair scheduling, to be used as necessary. 20

NodeManager: The per-machine slave, which is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network), and reporting the same to the ResourceManager. 65

ApplicationMaster: Negotiates appropriate resource containers from the Scheduler, tracking their status and monitoring progress. 65 The ApplicationMaster is, in effect, an instance of a framework-specific library and is responsible for negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the containers and their resource consumption. 20 The ApplicationMaster design enables YARN to offer the following important new features:

o Scale: The ApplicationMaster provides much of the functionality of the traditional resource manager, so the entire system can scale more dramatically. Simulations have shown jobs scaling to 10,000-node clusters composed of modern hardware without significant issue. As a pure scheduler, the ResourceManager does not, for example, have to provide fault tolerance for resources across the cluster. By shifting fault tolerance to the ApplicationMaster instance, control becomes local rather than global.
Furthermore, since there is an instance of an ApplicationMaster per application, the ApplicationMaster itself is not a common bottleneck in the cluster. 20

o Open: Moving all application-framework-specific code into the ApplicationMaster generalizes the system so that the platform can now support multiple frameworks, such as MapReduce, MPI, and graph processing. 20

Container: The unit of allocation, incorporating resource elements such as memory, CPU, disk, and network, used to execute a specific task of the application (similar to map/reduce slots in MRv1). 65 While a Container, as described above, is merely a right to use a specified amount of resources on a specific machine (NodeManager) in the cluster, the ApplicationMaster has to provide considerably more information to the NodeManager to actually launch the Container. YARN allows applications to launch any process and, unlike existing Hadoop MapReduce, is not limited to Java applications. 20 The YARN Container launch specification API is platform agnostic and contains (1) the command line to launch the process within the container; (2) environment variables; (3) local resources necessary on the machine prior to launch, such as jars, shared objects, and auxiliary data files; and (4) security-related tokens. 20 This design allows the ApplicationMaster to work with the NodeManager to launch containers ranging from simple shell scripts to C/Java/Python processes on Unix/Windows to fully fledged virtual machines. 20 Apache Hadoop 2.x is powered by YARN as its architectural center. 63 (See Exhibit 64.) The release of Apache Hadoop YARN provides many new capabilities to the existing Hadoop Big Data ecosystem. While the scalable MapReduce paradigm has enabled previously intractable problems to be efficiently managed on large clustered systems, YARN provides a framework for managing both MapReduce and non-MapReduce tasks of greater size and complexity, and it provides the framework to apply low-cost commodity hardware to virtually any Big Data problem. 20

Exhibit 64: Hadoop 2.0
Source: Hortonworks.
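The four-part launch specification just described can be pictured as a simple data structure. The field names and the example task below are illustrative inventions of ours, not the actual YARN API.

```python
from dataclasses import dataclass, field

@dataclass
class ContainerLaunchSpec:
    """Illustrative stand-in for the launch specification a NodeManager
    needs to start a process inside a granted container (field names are
    ours, not the real YARN API)."""
    command: list                 # (1) command line to launch the process
    environment: dict = field(default_factory=dict)      # (2) environment variables
    local_resources: dict = field(default_factory=dict)  # (3) jars, shared objects, data files
    tokens: bytes = b""           # (4) security-related tokens

# A hypothetical Python task shipped into the container before launch,
# illustrating that the spec is platform agnostic and not Java-only.
spec = ContainerLaunchSpec(
    command=["python", "task.py", "--shard", "7"],
    environment={"JAVA_HOME": "/usr/lib/jvm/default"},
    local_resources={"task.py": "hdfs:///apps/demo/task.py"},
)
```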
YARN essentially provides the resource management and pluggable architecture to enable a wide variety of data access methods to operate on data stored in Hadoop with predictable performance and service levels. Engines such as Apache Tez and Apache Slider provide powerful frameworks to rapidly integrate third-party processing and services, and YARN APIs can be used natively for complete control where needed. 66

Apache Tez: Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex DAG (directed acyclic graph) of tasks. By eliminating unnecessary tasks, synchronization barriers, and reads from and writes to HDFS, Tez speeds up data processing across both small-scale, low-latency and large-scale, high-throughput workloads. 66 Applications such as Apache Pig and Apache Hive run on Apache Tez. 63

Apache Slider: Slider is an engine that runs other applications in a YARN environment. With Slider, distributed applications that are not YARN-aware can now participate in the YARN ecosystem, usually with no code modification. Slider allows applications to use Hadoop's data and processing resources, as well as the security, governance, and operations capabilities of enterprise Hadoop. 66 Applications such as Apache HBase, Apache Accumulo, and Apache Storm run on Apache Slider. 63

HDFS2

The Hadoop Distributed File System is the reliable and scalable data core of the Hortonworks Data Platform. In HDP 2.0, YARN and HDFS combine to form the distributed operating system for an enterprise's data platform, providing resource management and scalable data storage to the next generation of analytical applications. 67 The HDFS enhancements and major features of Apache Hadoop 2.0 are high availability for HDFS, HDFS Federation, HDFS Snapshots, and NFSv3 access to data in HDFS. 61 In Hadoop 1.0, a single Namenode managed the entire namespace for a Hadoop cluster. (See Exhibit 65.) With HDFS Federation, multiple Namenode servers manage namespaces, which allows for horizontal scaling, performance improvements, and multiple namespaces. (See Exhibit 66.) The implementation of HDFS Federation allows existing Namenode configurations to run without changes. For Hadoop administrators, moving to HDFS Federation requires formatting Namenodes, updating to the latest Hadoop cluster software, and adding additional Namenodes to the cluster. 68

Exhibit 65: HDFS in Hadoop 1.0
Source: Hortonworks.

Exhibit 66: HDFS Federation in Hadoop 2.0
Source: Hortonworks.
The major breakthrough in HDFS Federation is that it allows horizontal scalability, so users can keep adding cheap servers to the cluster. With linear scaling, users can overcome many of the limitations of the original HDFS. HDFS Federation improves the existing HDFS architecture through a clear separation of namespace and storage, enabling a generic block storage layer and support for multiple namespaces in the cluster to improve scalability and isolation. Federation also opens up the architecture, expanding the applicability of the HDFS cluster to new implementations and use cases. 60
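The separation of namespace and block storage described above can be sketched as a toy model: several independent namenodes generate block IDs for their own block pools while sharing the same datanodes. The classes below are illustrative simplifications of ours, not HDFS code.

```python
import itertools

class FederatedNamenode:
    """One federated namenode: owns a namespace and its block pool and
    generates block IDs without coordinating with other namespaces
    (toy model; not HDFS code)."""

    def __init__(self, namespace):
        self.namespace = namespace
        self._next_id = itertools.count()
        self.files = {}  # path -> list of (namespace, block id)

    def create_file(self, path, num_blocks):
        blocks = [(self.namespace, next(self._next_id)) for _ in range(num_blocks)]
        self.files[path] = blocks
        return blocks

class Datanode:
    """Common block storage: registers with every namenode and keeps one
    block pool per namespace."""

    def __init__(self):
        self.pools = {}  # namespace -> set of block ids

    def store(self, namespace, block_id):
        self.pools.setdefault(namespace, set()).add(block_id)

nn_user, nn_hbase, dn = FederatedNamenode("user"), FederatedNamenode("hbase"), Datanode()
for ns, bid in nn_user.create_file("/user/a.txt", 2) + nn_hbase.create_file("/hbase/t1", 2):
    dn.store(ns, bid)
# Block IDs collide across namespaces (both pools hold IDs 0 and 1),
# but the per-namespace block pools keep them fully isolated.
```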

In order to scale the name service horizontally, HDFS Federation uses multiple independent namenodes and namespaces. The namenodes are federated, meaning that they are independent and do not require coordination with each other. The datanodes are used as common storage for blocks by all the namenodes: each datanode registers with all the namenodes in the cluster, sends periodic heartbeats and block reports, and handles commands from the namenodes. 60 Under HDFS Federation, each namespace has a block pool, a set of blocks that belong to that single namespace. Datanodes store blocks for all the block pools in the cluster, but each block pool is managed independently of the others. This allows a namespace to generate block IDs for new blocks without the need for coordination with the other namespaces, and the failure of a namenode does not prevent the datanodes from serving the other namenodes in the cluster. A namespace and its block pool together are called a Namespace Volume, which is a self-contained unit of management: when a namenode or namespace is deleted, the corresponding block pool at the datanodes is deleted, and each Namespace Volume is upgraded as a unit during a cluster upgrade. 60 HDFS Federation offers scalability and isolation, since support for multiple namenodes horizontally scales the file system namespace, and separate volumes for different users and categories improve isolation. Block pool storage also opens up the architecture for future innovation: new file systems can be built on top of block storage, and new applications can be built directly on the block storage layer without the need to use a file system interface. New block pool categories are also possible; examples would include a block pool for MapReduce tmp storage with different garbage collection schemes or a block pool that caches data to make distributed cache more efficient. 60 Hortonworks chose to go with federation because it is significantly simpler to design and implement.
Namenodes and namespaces are independent of each other and require very little change to existing namenodes. Federation also preserves backward compatibility of configuration. Most of the changes are in the datanode, to introduce the block pool as a new hierarchy in storage, the replica map, and other internal data structures. 60

The Ecosystem of Hadoop Projects

In addition to the Hadoop core, an ecosystem of multiple Hadoop subprojects exists that makes up the services required by an enterprise to deploy, integrate, and work with Hadoop. Each project has been developed to deliver an explicit function, and each has its own community of developers and individual release cycles. 12 The key services address a variety of needs, including (1) data access, (2) governance and integration, (3) security, and (4) operations management.

Data Access

Apache Hive is the most widely adopted data access technology, although there are many specialized engines. For instance, Apache Pig provides scripting capabilities, Apache Storm offers real-time processing, Apache HBase offers columnar NoSQL storage, and Apache Accumulo offers cell-level access control. All of these engines can work across one set of data and resources thanks to YARN and intermediate engines such as Apache Tez for interactive access and Apache Slider for long-running applications. YARN also provides flexibility for new and emerging data access methods, such as Apache Solr for search and programming frameworks such as Cascading. 12

Apache Pig: In MapReduce frameworks such as Hadoop, the user must break any distinct retrieval task into the map and reduce functions. For this reason, both Google and Yahoo! have built layers on top of a MapReduce infrastructure to simplify this for end users. Google's data processing system is called Sawzall, and Yahoo!'s is called Pig. A program written in either of these languages can be automatically converted into a MapReduce task and run in parallel across a cluster.
These layered alternatives are certainly easier than native MapReduce frameworks, but they still require writing an extensive amount of code. 69
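To make concrete what "breaking a task into map and reduce functions" means, here is word count, the canonical MapReduce example, simulated in plain Python with no cluster involved. A one-line Pig Latin or Sawzall script would generate equivalent stages automatically, which is precisely the convenience those layers sell.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Group intermediate pairs by key (the framework normally does this)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big deal"])))
```

Even for this trivial query, the programmer has to supply both a map and a reduce function and trust the framework to shuffle between them, which is why higher-level abstractions like Pig caught on.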

Apache Pig is a platform for analyzing large datasets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large datasets. This high-level language for ad hoc analysis allows developers to inspect data stored in HDFS without the need to learn the complexities of the MapReduce framework, thus simplifying access to the data. 70 The Pig Latin scripting language is not only a higher-level data flow language but also has operators similar to SQL (e.g., FILTER and JOIN) that are translated into a series of map and reduce functions. Pig Latin, in essence, is designed to fill the gap between the declarative style of SQL and the low-level procedural style of MapReduce. 71

Apache Hive: Hive is an open-source data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data, query the data using an SQL-like language called HiveQL, and execute these statements against data stored in the Hadoop Distributed File System (HDFS). 72 The execution of a HiveQL statement generates a MapReduce job to transform the data as required. 73 At the same time, the language also allows traditional MapReduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. 72 Two differentiators between HiveQL and SQL are that HiveQL jobs are optimized for scalability (i.e., all rows returned) rather than latency (i.e., first row returned) and that HiveQL implements only a subset of the SQL language.
73 In the MapReduce/Hadoop world, Pig and Hive are widely regarded as valuable abstractions that allow the programmer to focus on database semantics rather than programming directly in Java. 6 HiveQL is designed for the execution of analysis queries and supports aggregates and sub-queries in addition to select and join operations. Complex processing is supported through user-defined functions implemented in Java and by embedding MapReduce scripts directly within the HiveQL. As Gartner points out, the promise of HiveQL is the ability to access data from within HDFS, HBase, or a relational database (via Java Database Connectivity [JDBC]), thus allowing interoperability between data within existing BI domains and the Big Data analytics infrastructure. Although Pentaho for Hadoop is an early implementer of this technology, wider support for HiveQL among BI tools vendors is currently very limited; we would expect the number of BI vendors supporting HiveQL to increase as the stability and feature set of Hive mature and the adoption of Hadoop increases. 73

Apache HBase: Modeled after Google's Bigtable, HBase is an open-source, distributed, versioned, column-oriented database, providing the capability to perform random read/write access to data. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. 74 SQL-like support for HBase via Hive is in development; however, Hive is based on HDFS, which is not generally suitable for low-latency requests. 75 HBase is not a direct replacement for a classic SQL database, as HBase does not support complex transactions, SQL, or ACID properties. However, HBase's performance has improved recently, and it now serves several data-driven websites, including Facebook's Messaging Platform. A principal differentiator of HBase from Pig or Hive is its ability to provide real-time, random read and write access to very large datasets.
6 HDFS is a distributed file system that is well suited to the storage of large files; its documentation states that HDFS is not a general-purpose file system and does not provide fast individual record lookups in files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. 75
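The random read/write access that sets HBase apart can be illustrated with a toy versioned cell store in the Bigtable style. This sketches the data model only; the class below is our own invention, not HBase's API.

```python
class ToyCellStore:
    """Bigtable-flavored map of (row key, column) to versioned values,
    supporting random reads and in-place updates -- the access pattern a
    plain HDFS file does not offer. Toy model; not HBase's API."""

    def __init__(self):
        self.cells = {}  # (row, column) -> list of (timestamp, value), newest first

    def put(self, row, column, value, timestamp):
        # Assumes puts arrive in increasing timestamp order.
        self.cells.setdefault((row, column), []).insert(0, (timestamp, value))

    def get(self, row, column):
        # Random-access read of the newest version of a single cell.
        versions = self.cells.get((row, column))
        return versions[0][1] if versions else None

store = ToyCellStore()
store.put("user42", "msg:last", "hi", timestamp=1)
store.put("user42", "msg:last", "bye", timestamp=2)  # update in place
```

A single-cell `get` or `put` touches one key, whereas the HDFS access pattern described above would mean scanning or rewriting a large file, which is the contrast the text draws.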

Apache HCatalog: Apache HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools (Apache Pig, Apache MapReduce, and Apache Hive) to more easily read and write data on the grid. HCatalog's table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored. HCatalog displays data from RCFile format, text files, or sequence files in a tabular view. It also provides REST APIs so that external systems can access these tables' metadata. 76

Apache Avro: The ecosystem around Apache Hadoop has grown at a tremendous rate, allowing organizations to use many different pieces of software to process their large datasets. For example, data collected by Flume might be analyzed by Pig and Hive scripts, and data imported with Sqoop might be processed by a MapReduce program. To achieve this data interoperability, each system must be able to read and write a common format. 77 Avro is a data serialization system that provides rich data structures; a compact, fast, binary data format; a container file to store persistent data; remote procedure call (RPC); and simple integration with dynamic languages. 78

Apache Mahout: Mahout's goal is to build scalable machine learning libraries. Mahout's core algorithms for clustering, classification, and batch-based collaborative filtering are implemented on top of Apache Hadoop using the MapReduce paradigm.
Mahout currently supports four main use cases: (1) recommendation mining takes user behavior and from that tries to find items users might like; (2) clustering takes text documents, for example, and groups them into clusters of topically related documents; (3) classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabeled documents to the (hopefully) correct category; and (4) frequent item set mining takes a set of item groups (e.g., terms in a query session or shopping cart contents) and identifies which individual items usually appear together. 79

Apache Kafka: Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system. Kafka is often used in place of traditional message brokers, such as JMS- and AMQP-based systems, because of its higher throughput, replication, and fault tolerance. 80

Apache Spark: Apache Spark is designed to be the primary processing framework for new Hadoop workloads, making the in-memory engine a better candidate for enterprise use. The goal of integrating Spark more deeply with YARN is to enable it to operate more efficiently alongside other engines, such as Hive, Storm, and HBase, on a single data platform and to remove the need for dedicated Spark clusters. 81 Spark, released in early 2014, has been lauded for being faster than MapReduce (in memory and on disk) and easier to program. This means it is well suited to next-generation Big Data applications that might require lower-latency queries, real-time processing, or iterative computations on the same data (e.g., machine learning). 82

Apache Accumulo: Apache Accumulo is a high-performance data storage and retrieval system with cell-level access control. It is a scalable implementation of Google's Bigtable design that works on top of Apache Hadoop and Apache ZooKeeper. Cell-level access control is important for organizations with complex policies governing who is allowed to see data.
It enables the intermingling of different data sets with different access control policies and the proper handling of individual data sets that have some sensitive portions. Without Accumulo, those policies are difficult to enforce systematically. Accumulo encodes those rules for each individual data cell and allows fine-grained access control. 83

Apache Storm: Apache Storm is a distributed real-time computation system for processing fast, large streams of data. Storm adds reliable real-time data processing capabilities to Apache Hadoop 2.x. Storm in Hadoop helps enterprises capture new business opportunities with low-latency dashboards, security alerts, and operational enhancements integrated with other applications running in their Hadoop clusters. 84

Apache Solr: Apache Solr is the open-source platform for searching data stored in HDFS. Solr powers the search and navigation features of many of the world's largest Internet sites, enabling powerful full-text search and near real-time indexing. Solr supports searches of a variety of data types, including tabular, text, geolocation, and sensor data. 85

Data Governance & Integration

Apache Falcon provides policy-based workflows for data governance, while Apache Flume and Sqoop enable easy data ingestion, as do the NFS and WebHDFS interfaces to HDFS. 12

Apache Sqoop: Sqoop, which is short for "SQL to Hadoop," is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores (e.g., relational databases). Organizations can use Sqoop to import data from external, structured data stores into the Hadoop Distributed File System and related systems (i.e., Hive and HBase). Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured data stores, including relational databases and enterprise data warehouses. 86

Apache Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data from applications to Hadoop's HDFS. Flume has a simple and flexible architecture based on streaming data flows and uses a simple, extensible data model that allows for online analytic applications. 87 Flume was developed after Chukwa and has many similarities, as both have the same overall structure and do agent-side replay on error. However, there are some notable differences as well. In Flume, there is a central list of ongoing data flows, stored redundantly in Zookeeper.
Whereas Chukwa handles this end-to-end, Flume adopts a more hop-by-hop model. In Chukwa, agents on each machine are responsible for deciding what to send. 88

Apache Falcon: Apache Falcon is a framework for simplifying data management and pipeline processing in Apache Hadoop. It enables users to automate the movement and processing of datasets for ingest, pipeline, disaster recovery, and data retention use cases. Instead of hard-coding complex dataset and pipeline processing logic, users can now rely on Apache Falcon for these functions, maximizing reuse and consistency across Hadoop applications. 89

Security
Security is provided at every layer of the Hadoop stack, from HDFS and YARN to Hive and the other Data Access components, on up through the entire perimeter of the cluster via Apache Knox. 12

Apache Knox: The Knox Gateway ("Knox") is a system that provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal of the project is to simplify Hadoop security both for users who access the cluster data and execute jobs and for operators who control access and manage the cluster. Knox runs as a server (or a cluster of servers) that serves one or more Hadoop clusters. 90

Apache Ranger: Apache Ranger delivers a comprehensive approach to security for a Hadoop cluster, providing central security policy administration across the core enterprise security requirements of authorization, accounting, and data protection. 91

Operations
Apache Ambari offers the necessary interface and APIs to provision, manage, and monitor Hadoop clusters and integrate with other management console software. 12

Apache Ambari: Apache Ambari is a completely open operational framework for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari includes an intuitive collection of operator tools and a set of APIs that mask the complexity of Hadoop, simplifying the operation of clusters. 92

Apache Oozie: Apache Oozie is a Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. It is integrated with the Hadoop stack and supports Hadoop jobs for Apache MapReduce, Apache Pig, Apache Hive, and Apache Sqoop. It can also be used to schedule jobs specific to a system, such as Java programs or shell scripts. 93

Apache ZooKeeper: ZooKeeper is a centralized service for maintaining configuration information, naming, enabling distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications, and each time they are implemented, a lot of work goes into fixing the inevitable bugs and race conditions. Because these services are difficult to implement, applications initially tend to skimp on them, which makes them brittle in the presence of change and difficult to manage. ZooKeeper aims to distill the essence of these different services into a very simple interface to a centralized coordination service. The service itself is distributed and highly reliable. Consensus, group management, and presence protocols are implemented by the service so that applications do not need to implement them on their own. Application-specific uses consist of a mixture of specific ZooKeeper components and application-specific conventions. 94
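The coordination service ZooKeeper provides is organized as a hierarchical namespace of "znodes." A toy, in-process model (simplified assumptions; this is not the real ZooKeeper client API, which talks to a replicated server ensemble) illustrates the typical group-membership pattern:

```python
# Toy in-process model of ZooKeeper's hierarchical znode namespace.
# Real ZooKeeper replicates this tree across a server ensemble for
# reliability; a single dict stands in here to show the usage pattern.

class ToyZooKeeper:
    def __init__(self):
        self._nodes = {"/": b""}  # path -> data

    def create(self, path, data=b""):
        """Create a znode; like ZooKeeper, the parent must already exist."""
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self._nodes:
            raise KeyError(f"no parent znode: {parent}")
        if path in self._nodes:
            raise KeyError(f"znode exists: {path}")
        self._nodes[path] = data
        return path

    def get(self, path):
        return self._nodes[path]

    def get_children(self, path):
        """List the direct children of a znode, as in service discovery."""
        prefix = path.rstrip("/") + "/"
        return sorted(p[len(prefix):] for p in self._nodes
                      if p.startswith(prefix) and "/" not in p[len(prefix):])

# Typical coordination pattern: services register themselves under a
# group znode; clients read the children list for group membership.
zk = ToyZooKeeper()
zk.create("/services")
zk.create("/services/worker-1", b"host-a:9000")
zk.create("/services/worker-2", b"host-b:9000")
print(zk.get_children("/services"))   # ['worker-1', 'worker-2']
```

In the real service, such registrations are usually "ephemeral" znodes that vanish when a worker's session dies, which is what makes the children list a live membership view.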

Competitive Review
The market for Big Data is dominated by both new entrants and legacy technology vendors. Given that Hadoop is an Apache open-source project that anyone can download for free, success in this market will be determined by vendors' ability to support, extend, and augment Hadoop, as well as to add differentiated features that make their solutions attractive to enterprises. 13

Pure-Play Hadoop Distributions
Although other distributions exist, the primary competitors in the pure-play Hadoop distribution market are Hortonworks, Cloudera, and MapR. (See Exhibit 67.) These three companies focus solely on developing, supporting, and marketing unique Hadoop distributions, add-on innovations, and services. They sell their solutions directly to customers but also have an aggressive channel strategy of selling through partners, such as large enterprise software vendors. 13 One of the biggest talking points at the most recent Hadoop Summit (June 3-5, 2014) and Hadoop World (October 15-17, 2014) was the race between Hortonworks and Cloudera, with the competition heating up as the market matures, particularly in light of Cloudera's recent funding round led by Intel. 15

Exhibit 67: Hortonworks vs. Cloudera vs. MapR
Source: Company data, Credit Suisse.

The key difference between Hortonworks and both Cloudera and MapR is that Hortonworks is committed to being completely open-source, while Cloudera and MapR support Apache Hadoop projects in addition to their own proprietary software. Hortonworks consistently reiterates its strategy of innovating the core of Hadoop, extending Hadoop as an Enterprise Data Platform, and then enabling the ecosystem by allowing leaders in the datacenter to easily adopt and extend their platforms. 1 The key differentiator for Cloudera

and MapR would be if one of those two companies could innovate its proprietary software faster than the entire Hadoop ecosystem.

Hortonworks: Hortonworks is completely open-source. Everything on its platform is available from the Apache Hadoop distribution. The distribution is available as a free download or with paid support. 95

Cloudera: Cloudera offers the open-source Apache Hadoop distribution as well as management tools built for the Cloudera Distribution. The distribution is available as a free download or with paid support for the additional tools. 95

MapR: MapR offers a version of Hadoop that replaces HDFS with the proprietary MapR File System (MFS). Everything else in its stack is based on the open-source Apache distribution. MapR offers a free M3 version along with paid M5 and M7 versions. 95

While the three companies share similarities, it is the differences that play the deciding role in choosing one vendor over the other two.

Cloudera
Cloudera is a leading vendor of enterprise-level Hadoop implementation and support. 28 Founded in 2008, Cloudera was the first provider and supporter of Apache Hadoop for the enterprise. The company also offers software for business-critical data challenges including storage, access, management, analysis, security, and search. 96 Cloudera has several enterprise software programs overlaid on its open-source distributions to aid customers, whereas Hortonworks strives to provide a framework comprising only open-source projects. 97 Cloudera and Hortonworks are built upon the same core of Apache Hadoop. As such, they have more similarities than differences. Both companies offer enterprise-ready Hadoop distributions, paid training, and professional services to help early adopters and customers familiarize themselves with the complex and relatively new Hadoop approach to Big Data and analytics. Both companies support MapReduce as well as YARN.
97 Cloudera has a commercial license, while Hortonworks has an open-source license. Cloudera also allows the use of its open-source projects free of cost, but the free package does not include Cloudera's proprietary management suite (Cloudera Manager) or any other proprietary software. 97 Cloudera's approach to innovation is to be loyal to core Hadoop but to innovate quickly and aggressively to meet customer demands and differentiate its solutions (e.g., Cloudera Manager and Impala). 13 In other words, the key difference between Hortonworks and Cloudera is that Cloudera surrounds its Hadoop distribution with proprietary technology, whereas Hortonworks is wholly open-source. 15 Although Hortonworks is compared with Cloudera more than with any other company, Cloudera is distancing itself from the turf war, preferring instead to frame itself against data giants like IBM. Cloudera is now building proprietary products on top of open-source Hadoop code, which means that the company is taking on the task of building a full platform to compete with the big players. Hortonworks has devoted itself to innovating exclusively in the open-source space and is reaching customers through the existing products of its partners, which include Microsoft, Teradata, and SAP. 98 Cloudera's strengths include the ability to process unstructured data, structured data, and data types that are in between (e.g., XML) with a variety of solutions, such as batch compute (MapReduce), interactive SQL (Impala), and text search (Cloudera Search). Examples of Cloudera capabilities include the training and scoring of predictive models via push-down to SAS and R (programming languages) and offering a wide array of prepackaged machine-learning libraries. Reference customers report high confidence in

Cloudera's personnel, its specific skills in deploying Hadoop distributions, and the Cloudera-developed intellectual property (e.g., Impala). 99 Cloudera is vying for a spot from which to lead market execution in the era of Big Data. 99 However, Hortonworks has emerged as a threat to Cloudera, as Cloudera customers have left the company in favor of Hortonworks' open-source model. In September 2013, Spotify, the music streaming service with over 24 million users worldwide, migrated its 690-node Hadoop cluster from Cloudera's software distribution to HDP and Hortonworks enterprise support. One of the largest Hadoop implementations in Europe, Spotify's cluster is used to develop analytics that drive the company's personalized services, such as Spotify Radio. The company cited that Hortonworks' true open-source approach and the work the company had done to improve the Apache Hive data warehouse system aligned well with the needs of Spotify, given that it uses Hive extensively for ad-hoc queries and the analysis of large data sets. 100 Although Cloudera is the oldest competitor in the Hadoop market, we believe Hortonworks is quickly catching up, having made more innovations in the Hadoop ecosystem in the recent past. 97 In particular, YARN, a significant architectural change in Hadoop to make the platform enterprise ready, was developed by Hortonworks. Cloudera adopted YARN with its release of Cloudera Enterprise 5 (containing CDH 5.0.0 and Cloudera Manager 5.0.0) in April 2014. CDH 5.0.0 is the first release of the Cloudera distribution in which YARN and MapReduce 2 (MR2) constitute the default MapReduce execution framework. 101 Although Cloudera now supports YARN, given the nature of open-source, we believe that customers could be inclined toward Hortonworks' distribution, since Hortonworks contributed 85% of this open-source software project. There are other key differences between Hortonworks and Cloudera.
For example, while Cloudera CDH can be run on Windows Server, HDP is available as a native component on Windows Server. A Windows-based Hadoop cluster can be deployed on Windows Azure through the HDInsight Service. Cloudera has proprietary management software (Cloudera Manager), a SQL query interface (Impala), and real-time data search (Cloudera Search). Customers might fear vendor lock-in with such proprietary components of Hadoop. In contrast, Hortonworks has no proprietary software, using Ambari for management, Stinger for handling queries, and Apache Solr for searches of data. 97

Cloudera Distribution including Apache Hadoop (CDH)
The Cloudera Distribution including Apache Hadoop (CDH) offers an integrated Apache Hadoop-based stack containing all the components needed for production use, tested and packaged to work together. 28 Just as a Linux distribution delivers more than Linux, CDH delivers the core elements of Hadoop (scalable storage and distributed computing) along with additional components such as a user interface, plus necessary enterprise capabilities such as security, and integration with a broad range of hardware and software solutions. 102 Included in CDH are Hadoop, Flume, HBase, HCatalog, Hive, Hue, Impala, Mahout, Oozie, Pig, Cloudera Search, Sentry, Spark, Sqoop, Whirr, and ZooKeeper.

Cloudera Impala: Impala is a massively parallel processing (MPP) SQL query engine that runs natively in Apache Hadoop. The Apache-licensed, open-source Impala project combines modern, scalable parallel database technology with the power of Hadoop, enabling users to directly query data stored in HDFS and Apache HBase without requiring data movement or transformation. Impala is designed from the ground up as part of the Hadoop ecosystem and shares the same flexible file and data formats, metadata, security, and resource management frameworks used by MapReduce, Apache Hive, Apache Pig, and other components of the Hadoop stack. 103 (See Exhibit 68.)
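The MPP execution model behind engines like Impala, in which many nodes scan their local data partitions in parallel and a coordinator merges the partial results, can be sketched in miniature (a pure-Python illustration with made-up data, not Impala's actual engine):

```python
# Miniature massively-parallel-processing (MPP) aggregation: each "node"
# scans only its own partition and produces a partial aggregate; a
# coordinator merges the partials. This is the shape of an MPP-style
# SELECT col, COUNT(*) ... GROUP BY col over data spread across nodes.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data partitions, standing in for HDFS blocks on 3 nodes.
partitions = [
    ["ads", "search", "ads"],
    ["search", "mail", "ads"],
    ["mail", "mail", "search"],
]

def scan_partition(rows):
    """Per-node work: local scan plus partial aggregation (no data moves)."""
    return Counter(rows)

with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(scan_partition, partitions))

# Coordinator: merge the partial aggregates into the final result.
result = sum(partials, Counter())
print(dict(result))   # {'ads': 3, 'search': 3, 'mail': 3}
```

The point of the design is that only the small partial aggregates cross the network, never the raw rows, which is also why circumventing MapReduce's batch stages yields interactive latencies.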
Hortonworks, Inc. (HDP) 59

Exhibit 68: Cloudera Impala
Source: Cloudera.

Impala raises the bar for SQL query performance on Apache Hadoop while retaining a familiar user experience. With Impala, users can query data stored in HDFS or Apache HBase (including SELECT, JOIN, and aggregate functions) in real time. Furthermore, Impala uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Hue Beeswax) as Apache Hive, providing a familiar and unified platform for batch-oriented or real-time queries. (For that reason, Hive users can utilize Impala with little setup overhead.) To avoid latency, Impala circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs. The result is order-of-magnitude faster performance than Hive, depending on the type of query and configuration. 103 Cloudera built Impala as a competitor to Hive, and Chief Strategy Officer Mike Olson announced the company's strategy to develop Impala from the ground up as a new project rather than improving the existing Apache Hive project, arguing that Hive was the wrong architecture for real-time distributed SQL processing. However, 20 months after this announcement, Cloudera unveiled a plan to enable Apache Hive to run on Apache Spark, arguably a recognition of Hortonworks' Stinger Initiative in modernizing Hive's architecture for interactive SQL applications while preserving the investments of Hive's users and the broader ecosystem. One could interpret this action as Cloudera essentially acknowledging that it could not innovate its proprietary product faster than the entire Hadoop ecosystem. 104

Sentry: Apache Sentry (incubating) is a unified authorization mechanism that enables enterprises to store sensitive data in Hadoop. Sentry is a fully integrated component of CDH and provides fine-grained authorization and role-based access control, all through a single system.
Sentry currently integrates with the open-source SQL query frameworks Apache Hive and Cloudera Impala, as well as the open-source search engine Cloudera Search, and can also extend to other computing engines within the Hadoop ecosystem. Sentry is a crucial part of compliance-ready security and is available with CDH and Cloudera Enterprise out of the box. With Sentry, customers can gain comprehensive control of user access to subsets of data, simplify permissions management based on functional roles, and delegate security management to individual administrators. 105

Cloudera Search: Cloudera Search brings full-text, interactive search and scalable, flexible indexing to CDH and an enterprise data hub. Powered by Apache

Hadoop and Apache Solr, the enterprise standard for open-source search, Cloudera Search brings scale and reliability to a new generation of integrated, multi-workload search. Through its unique integrations with CDH, Cloudera Search gains the same fault tolerance, scale, visibility, security, and flexibility provided to other enterprise data hub workloads. 106 With Search, business users can analyze both structured and unstructured data together and (1) break down "data silos" by interactively correlating multiple, disparate data sets across multiple attributes; (2) improve Big Data ROI by uncovering patterns; (3) streamline operations and costs with simplified deployment, provisioning, and monitoring of large-scale, multi-purpose clusters through centralized management; and (4) use Search's broad range of indexing options to accommodate a growing number of diverse use cases. 107

Spark: Apache Spark is an open-source, parallel data processing framework that complements Hadoop, enabling the development of fast, unified Big Data applications that combine batch, streaming, and interactive analytics on all of a customer's data. In addition, Spark Streaming extends Spark with an API for working with streams, providing full fault tolerance for mission-critical applications. 28

HBase: Apache HBase is a distributed, scalable data store that runs on top of Apache Hadoop's file system, the Hadoop Distributed File System (HDFS). HBase is a key component of an enterprise data hub (EDH), as its design caters to applications that require fast, random access to significant data sets. HBase, which is modeled after Google's BigTable, can handle massive data tables containing billions of rows and millions of columns.
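The BigTable-style "wide column" model that HBase (and MapR's M7, discussed below) implements — sparse rows keyed by a row key, with cells addressed by column family and qualifier — can be pictured with a small sketch (a simplification; the real HBase data model also versions every cell by timestamp):

```python
# Sketch of a BigTable/HBase-style wide-column store: a map of
# row key -> {(column_family, qualifier) -> value}. Rows are sparse --
# each row stores only the columns it actually has, which is how a
# table can nominally span millions of columns.

table = {}  # row_key -> {(family, qualifier): value}

def put(row, family, qualifier, value):
    table.setdefault(row, {})[(family, qualifier)] = value

def get(row, family, qualifier):
    """Random access by (row key, family, qualifier); None if absent."""
    return table.get(row, {}).get((family, qualifier))

# Clickstream-style rows: different rows carry different columns.
put("user#1001", "clicks", "2015-01-05", "homepage")
put("user#1001", "profile", "country", "US")
put("user#1002", "clicks", "2015-01-04", "checkout")

print(get("user#1001", "profile", "country"))  # US
print(get("user#1002", "profile", "country"))  # None (sparse row)
```

The row-key-ordered layout is what gives HBase its fast random reads and writes on very large tables, in contrast to HDFS's append-oriented file access.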
28 Cloudera Express is a free download that combines CDH, Cloudera's 100% open-source and enterprise-ready distribution of Apache Hadoop, with Cloudera Manager, which provides robust cluster management capabilities such as automated deployment, centralized administration, monitoring, and diagnostic tools. Designed specifically for mission-critical environments, Cloudera Enterprise is available on a subscription basis in three editions: (1) Basic Edition, (2) Flex Edition, and (3) Data Hub Edition. 28 (See Exhibit 69.)

Exhibit 69: Cloudera Product Comparison
Source: Cloudera.

Cloudera Enterprise includes CDH, as well as advanced system management (including Cloudera Manager) and data management tools (including Cloudera Navigator), plus dedicated support and community advocacy from Cloudera. 28 (See Exhibit 70.)

Cloudera Manager: Cloudera Manager is a sophisticated management application that delivers granular visibility into, and centralized control over, the enterprise data hub

at scale, empowering operators to improve performance, enhance quality of service, increase compliance, and reduce administrative costs. The application automates the installation process, reducing deployment time from weeks to minutes; gives administrators a cluster-wide, real-time view of the nodes and services running; provides a single, central console to enact configuration changes across an enterprise's cluster; and incorporates reporting and diagnostic tools to help optimize performance and utilization. 28

Cloudera Navigator: Cloudera Navigator is a native governance solution for Apache Hadoop-based systems. Through a single user interface, Cloudera Navigator provides visibility for administrators, data managers, data scientists, and analysts to secure, govern, and explore the large amounts of diverse data that land in Hadoop. Cloudera Navigator is part of Cloudera Enterprise's comprehensive data security and governance offering and is a key part of meeting compliance and regulatory requirements. 28

Exhibit 70: Cloudera Distribution including Apache Hadoop (CDH) Enterprise Data Hub
Source: Cloudera, Credit Suisse.

MapR
Founded in 2009, MapR Technologies offers a Hadoop distribution with storage optimizations, high-availability improvements, and administrative and management tools, and it uses the Network File System (NFS) instead of HDFS. The company's recent M7 release eliminates region servers (replacing them with automated region splits) and adds disaster recovery capabilities (data assurance and process recovery). MapR also markets a robust version of Apache HBase, the table-style NoSQL database built on several components from the Apache Hadoop stack. (See Exhibit 71.) The company has a rich partner ecosystem and provides its products through multiple cloud infrastructure firms as well as on-premises. 99

Exhibit 71: MapR Data Platform
Source: MapR.

Thus, instead of focusing on a pure open-source distribution like Hortonworks, or on adding additional proprietary capabilities to the core project like Cloudera, MapR makes fundamental innovations to the underlying data platform that make Hadoop more "friendly" and resilient to use in conjunction with existing business technology. 14 Based on Google BigTable, MapR M7 supports a wide-column or "table style" NoSQL data model to manage a wide variety of operational data formats, including log data, sensor data, metadata, clickstreams, user profiles, session states, and links, semantics, and relationship data. It is compatible with the Apache HBase core API for running existing HBase applications. 108 The key difference between Hortonworks and MapR is "open-source purity." 3 MapR's CEO, John Schroeder, estimates that MapR is approximately 80% open-source while Cloudera is about 85% open-source. 4 In comparison, Hortonworks is a pure open-source distribution. MapR has grown by adding some proprietary software to help manage the installation, configuration, and operation of its distribution. M.C. Srivas, its founder, has taken significant parts of Hadoop and re-implemented them in an API-compatible manner. 3 MapR is betting that its approach of combining architectural advances with open-source innovations provides a lasting advantage over others. While the vast majority of MapR's Hadoop distribution consists of Apache Hadoop code, in a few key areas MapR felt that the gap was too wide and needed to be closed to make Hadoop suitable for enterprise use. A key part of MapR's strategy is to continue to build on the advantages that its underlying platform provides, especially with respect to the HDFS and HBase projects. However, MapR has lagged behind its two primary pure-play competitors, Cloudera and Hortonworks, in terms of market awareness.
MapR's success will only come if enterprise buyers really see the value in its additional functionality. 14 However, we believe that MapR may be innovating too far down the stack to gain the traction of Hortonworks or Cloudera. In fact, Forrester notes that its clients often ask about Cloudera and Hortonworks but not about MapR. In other words, although the company has a competitive solution, MapR lacks the market visibility of its competitors. 13

Other Technology Vendors
All of the major enterprise software vendors have a Hadoop strategy, because Hadoop is emerging as an essential data management technology. Many of them partner with one or more pure-play vendors. For example, Oracle partners with Cloudera, while SAP partners with Hortonworks. Others (e.g., IBM, Microsoft, Pivotal, and Teradata) are in various

stages of launching their own unique distributions and securing partnerships with pure-play vendors. For example, Microsoft partners with Hortonworks and has used this partnership as a base to create HDInsight for Windows Azure, and Teradata partners with Hortonworks and has ramped up a significant engineering and services organization to offer the Teradata Hadoop distribution built on Hortonworks. 13

Pivotal
Pivotal became an independent entity in April 2013 but carries assets of EMC and VMware. Pivotal has a strong vision to integrate its products to form the Pivotal Data Platform, supporting operational use cases combined with analytics (with HDFS) for common persistence. Gartner reports that customers view speed as the largest benefit from Pivotal and that they utilize it extensively for complex analysis when combining diverse and large datasets across the range of information types. 99 Pivotal HD 2.0, released in May 2014, is the vendor's first distribution based on Apache Hadoop 2.2, the latest release of the open-source platform incorporating YARN. The release also integrates and supports GraphLab, an open-source framework for derivatives monitoring, recommendations, and graph analytics. Pivotal also offers GemFire XD, an in-memory database designed to execute algorithms and analytics on data in real time. Blending elements of Pivotal's GemFire and SQLFire, GemFire XD puts a SQL-compliant, in-memory database on top of HDFS, from which it can read or write data with low latency. 109 Pivotal also offers HAWQ, its SQL-on-Hadoop query engine, which is based on the Greenplum database. While Hortonworks denounces HAWQ's commercial roots, Pivotal argues that its proprietary technology has advantages over Hive, Impala, and other open-source SQL-on-Hadoop options: HAWQ takes advantage of Greenplum's history as a massively parallel processing analytical query engine, and thus delivers impressive performance.
109 Matching Cloudera's "enterprise data hub" concept, Pivotal developed a "business data lake" architecture, with HD 2.0 at the center of enterprise data management. However, the company is still catching up in that its proprietary HAWQ and GemFire XD components cannot, as of yet, be managed by YARN. That is something Pivotal is working on, but for now, companies will have to use the combination of the Pivotal Command Center and Virtual Resource Planner tools and YARN to separately manage the resources and workloads within a data lake environment. 109 Despite its stated goals in the Hadoop market, Pivotal laid off approximately 60 employees in November 2014, according to CRN. About half a dozen salespeople were included in the cuts, but the majority of those laid off were employees who worked on Pivotal's Big Data products. These include Pivotal HD, the vendor's Hadoop distribution, and its Greenplum and GemFire database offerings. 110 In July 2014, Pivotal teamed up with Hortonworks, one of its main rivals in the Big Data market, to work together on Apache Ambari, an open-source project for managing and monitoring Hadoop clusters. 110

Oracle
Oracle announced a partnership with Cloudera in January 2012, in which Oracle integrated the Cloudera Distribution including Apache Hadoop (CDH) and Cloudera Manager, which includes an open-source distribution of R, into the Oracle Big Data Appliance, an Engineered System designed to provide a scalable data processing environment for Big Data. 111 Oracle continues to be the relational DBMS market share leader (Gartner estimates over 42% of the market in 2013) and shows good execution in the data warehouse market. 99 At Oracle OpenWorld 2014, Oracle announced a new set of tools called Oracle Big Data Discovery, which allows database managers and engineers to work more effectively with Hadoop, as the "visual face of Hadoop."
Oracle Big Data Discovery lets users profile, explore, and analyze Hadoop data and do prediction and correlation. 112 With Big Data

Discovery 12c, business users can browse, annotate, build models, and share data via self-service data preparation. Oracle cites this product's differentiators as a simple search capability and highly interactive analysis. 113 Oracle Big Data Discovery first samples, profiles, and catalogs the data set available within a Hadoop cluster. Machine-learning algorithms are then applied behind the scenes to surface interesting correlations and offer suggested visualizations for exploring attributes. Search and guided navigation features offer further data exploration. 114 In addition, the Oracle Big Data Appliance is optimized for acquiring, organizing, and loading unstructured data into the Oracle Database. The Oracle Big Data Appliance includes the Cloudera Distribution including Apache Hadoop (CDH), Oracle NoSQL Database, Oracle Data Integrator with Application Adapter for Apache Hadoop, Oracle Loader for Hadoop, an open-source distribution of R, Oracle Linux, and the Oracle Java HotSpot Virtual Machine. 28 Oracle's Big Data SQL on the Big Data Appliance (Hadoop) automatically combines Oracle and Hadoop data. Announced in July 2014, Big Data SQL lets users perform SQL queries across Hadoop, NoSQL databases, and the Oracle Database. As the name suggests, Oracle Big Data SQL does not address advanced analytics, machine learning, or Big Data correlation techniques, but the system is "friendly" for any data-management professional who is accustomed to analyzing data with Hadoop. 114 (See Exhibit 72.)

Exhibit 72: Oracle Big Data SQL on Big Data Appliance (Hadoop)
Source: Oracle.

Oracle's Big Data SQL 11g and 12c also allows standard Oracle SQL to query across Hadoop, which opens up data in Hadoop to people who know Oracle SQL. Customers can offload scanning to the storage tier, which improves query performance and allows for faster analysis. (See Exhibit 73.)

Exhibit 73: Oracle Big Data SQL 11g & 12c
Source: Oracle.
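The "offload scanning to the storage tier" idea — evaluating the filter where the data lives so that only matching rows travel to the query engine — is the same predicate-pushdown technique found throughout SQL-on-Hadoop systems. A toy sketch (illustrative names and data only, not Oracle's implementation):

```python
# Toy predicate pushdown across two data tiers: instead of shipping
# every row to the query tier and filtering there, the predicate runs
# at each storage tier and only matching rows are returned -- less
# data moved across the network, faster queries.

# Two hypothetical stores: a relational table and a Hadoop file dump.
oracle_rows = [{"id": 1, "region": "EU"}, {"id": 2, "region": "US"}]
hadoop_rows = [{"id": 3, "region": "US"}, {"id": 4, "region": "APAC"}]

def scan(rows, predicate):
    """Storage-tier scan: apply the pushed-down predicate locally."""
    return [r for r in rows if predicate(r)]

def federated_query(predicate):
    """Query tier: union the pre-filtered results from both stores."""
    return scan(oracle_rows, predicate) + scan(hadoop_rows, predicate)

us_rows = federated_query(lambda r: r["region"] == "US")
print([r["id"] for r in us_rows])   # [2, 3]
```

In the toy, half the rows never leave their store; at warehouse scale, that reduction in data movement is where most of the claimed performance benefit comes from.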
IBM
IBM offers stand-alone DBMS solutions, as well as data warehouse applications and a z/OS solution. Its various appliances include the IBM zEnterprise Analytics System, PureData

System for Analytics, IDAA, the IBM Smart Analytics System, and others. IBM offers data warehouse managed services and professional services. The company has delivered products that support the LDW and is a leader in execution, according to Gartner. A rearchitected PureData and the new BLU (IMDBMS) were released in 2013. IBM offers all five form factors for data warehouses: software only, managed services, appliances, cloud, and reference architectures. 99 IBM's road map includes continuing to integrate the BigInsights Hadoop solution with related IBM assets like SPSS advanced analytics, workload management for high-performance computing, BI tools, and data modeling tools. IBM currently has more than 100 Hadoop deployments, some of which are fairly large and run to petabytes of data. 13 However, Gartner inquiry data does not show any increase in IBM's competitive presence. At the same time, IBM reports new customer wins for data warehouse products. This makes IBM's current ability to grow outside its very large existing customer base unclear. 99

Microsoft
Microsoft first partnered with Hortonworks in October 2011, shortly after Hortonworks' founding, with the intent of developing Hadoop-based solutions in both on-premise (Windows Server) and cloud (Windows Azure) configurations. In May 2013, Hortonworks released the first Windows version of HDP, followed by Microsoft's announcement of Windows Azure HDInsight, a Hadoop-based Azure service built on HDP, in October 2013. Enabling HDP on Windows as well as developing Microsoft's Windows Server and Azure Hadoop solutions was the culmination of a multi-year subscription and partnership arrangement. Co-engineering efforts concluded in October 2013, but Microsoft continues to maintain a support subscription arrangement with Hortonworks. 1 Microsoft and Hortonworks offer the following distinct solutions based on HDP. (See Exhibit 74.)
HDInsight: This is a cloud-hosted service, available to Azure subscribers, that uses Azure clusters to run HDP and integrates with Azure storage. 115

Hortonworks Data Platform (HDP) for Windows: This is a complete package that can be installed on Windows Server to build fully configurable Big Data clusters based on Hadoop. HDP for Windows can be installed on physical on-premises hardware or in virtual machines in the cloud. 115

Microsoft Analytics Platform System (APS): This is a combination of the massively parallel processing (MPP) engine in Microsoft Parallel Data Warehouse (PDW) with Hadoop-based Big Data technologies. The Microsoft Analytics Platform System uses HDP to provide an on-premises solution that contains a region for Hadoop-based processing, together with PolyBase, a connectivity mechanism that integrates the MPP engine with HDP, Cloudera, and remote Hadoop-based services such as HDInsight. APS allows data in Hadoop to be queried and combined with on-premises relational data, and data to be moved into and out of Hadoop. 115

Exhibit 74: HDInsight Within the Microsoft Data Platform Source:

In our view, Hortonworks' partnership with Microsoft benefits the company in two ways. First, the availability of HDP on Windows Server expands Hortonworks' potential addressable market to Windows-only enterprises and provides further optionality for on-premise cluster deployments. Expanding the number of potential enterprise customers and making it easier to start using HDP drives demand for Hortonworks' support subscription and professional services offerings. Second, Hortonworks is paid by Microsoft to provide support for the company's HDInsight Azure service. The service is a key component of Microsoft's Azure-based Big Data offerings. Users spin up Hadoop clusters in minutes, as with any other Azure service. Pricing is based on usage and number of nodes. At its core, HDInsight uses a variety of HDP components, such as Ambari, HBase, Hive, Oozie, and Storm. Due to the integration between HDP and HDInsight, users can move Hadoop data between on-site datacenters and the Azure cloud (see Exhibit 75), and using the Microsoft Analytics Platform System, users can query both on-premise and cloud data at the same time. HDInsight can also be integrated into other Azure apps and can export data directly into Microsoft BI tools such as Excel and Power BI. 1,116,117

Exhibit 75: Azure HDInsight and HDP Hybrid Deployment Models Source:

Microsoft sells a Microsoft-branded offering of the Hortonworks Data Platform to its end-user customers. Hortonworks receives a fee for providing support subscription offerings to Microsoft. Revenue from Microsoft accounted for 55.3% of Hortonworks' total revenue for the year ended April 30, 2013, 37.8% of the company's total revenue for the eight months ended December 31, 2013, and 22.4% of Hortonworks' total revenue for the nine months ended September 30, 2014.

Teradata

On March 3, 2011, Teradata, one of Hortonworks' stockholders, announced the acquisition of Aster Data, an advanced analytics database and platform provider. The ncluster analytic platform, Aster Data's flagship product, comprises an analytics engine integrated with a massively parallel hybrid row- and column-oriented database. 118 (See Exhibit 76.)

Exhibit 76: Aster Data ncluster Diagram Source: Aster Data.

Aster Data ncluster, an MPP hybrid row and column analytical database architecture, runs on commodity hardware clusters for large-scale data management and advanced analytics by taking advantage of the MapReduce framework for parallelization and scalable data processing. In particular, Aster Data uses a different approach to implementing MapReduce by running it in-database and combining it with SQL to allow developers to take advantage of MapReduce within standard SQL. This architectural approach

combined with its ability to run application processing in the database makes Aster Data ncluster well suited for complex analysis on massive datasets, especially those in the petabyte range. 118 Aster Data's combination of MapReduce parallelization, SQL-MapReduce, and in-database processing of both data and procedural code enables ncluster to serve as a powerful platform for large-scale data warehousing and analytics that allows more data types and sources to be included within an analytic application framework. (See Exhibit 77.) Aster Data's implementation of MapReduce and its pre-packaged library of SQL-MapReduce functions make ncluster particularly suitable for analytic applications that analyze click-streams, mine online customer transaction data for interesting patterns of behavior, or analyze connections and social networks for marketing, fraud detection, and behavior analysis. 119

Exhibit 77: The Best Approach by Workload and Data Type Teradata, Aster, and Hadoop Source: Teradata.

On June 26, 2013, Teradata and Hortonworks announced a strategic partnership to allow Teradata to resell and offer support for the Hortonworks Data Platform, whereby Teradata typically performs level one support for its end-user customers. 1 The Teradata Portfolio for Hadoop offers the Teradata Appliance for Hadoop and the Teradata Aster Big Analytics Appliance. The Teradata Appliance for Hadoop provides a tightly integrated hardware and software appliance optimized for enterprise-class data storage and management. 119 Teradata Appliance for Hadoop is an enterprise platform that is pre-configured and optimized for Big Data storage and refining. A purpose-built, integrated hardware and software solution for data at scale, the appliance runs Teradata Open Distribution for Hadoop (TDH), which is built on Hadoop from Hortonworks. The appliance can hold up to 205 TB of uncompressed data per cabinet, with the entire system scaling up to 13 PB.
The appliance is networked by Teradata's fabric-based computing, a high-throughput BYNET V5 on a 40Gb/s InfiniBand interconnect for fast data exchange between Teradata, Teradata Aster, and Hadoop appliances. Teradata provides the following technologies to integrate data among Hadoop, Teradata, and the Teradata Aster Discovery Platform. 121

Teradata QueryGrid: Teradata QueryGrid gives business analysts and data scientists access to Hadoop data and pushdown processing into Hadoop from a single Teradata or Teradata Aster database query. Teradata QueryGrid uses Teradata analytics platforms, Hive, HCatalog, and the HDFS file system to access and analyze data without special tools or IT intervention. 121

Teradata Connector for Hadoop (TDCH): TDCH is a set of APIs that support high-performance parallel bi-directional data movement between Teradata systems and the Hadoop ecosystem of products, and can function as an end-user tool with its own command-line interface. It can also serve as a building block for integration with other end-user tools, such as Sqoop, through a Java API. 121

Teradata Aster File Store (AFS): AFS enables ingestion of multi-structured data into the Teradata Aster Discovery Platform. AFS is binary compatible with HDFS and is a complementary technology to share data between a Hadoop data lake and data used for interactive analytics on the Teradata Aster Discovery Platform. 121

Exhibit 78: Teradata Workload-Specific Platforms Source: Teradata.

The joint approach between Hortonworks and Teradata extends the value of Teradata with the scale of Hadoop. (See Exhibit 79.) As HDP provides a data platform for capturing, processing, and refining data, the Teradata Aster Discovery Platform integrates SQL, MapReduce, and the library of analytic functions, allowing business analysts to uncover new insights through interactive analytics. 122
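A concrete example of what Aster's pre-packaged SQL-MapReduce functions do is clickstream sessionization: partition events by user, then apply a procedural map function to each partition. The following is a hedged, self-contained sketch of that pattern in pure Python with toy data; it illustrates the idea, not Aster's actual API.

```python
from itertools import groupby

# Toy clickstream: (user, epoch_seconds) events, the kind of multi-structured
# data a sessionization SQL-MapReduce function operates on.
clicks = [("u1", 0), ("u1", 600), ("u1", 5000), ("u2", 100)]
SESSION_GAP = 1800  # start a new session after a 30-minute gap

def sessionize(events):
    """Map step per user partition: assign a session id, incrementing on long gaps."""
    session, last = 0, None
    for user, ts in events:
        if last is not None and ts - last > SESSION_GAP:
            session += 1
        last = ts
        yield user, ts, session

# "PARTITION BY user" (as SQL-MapReduce would), then apply the map function.
sessions = []
for user, events in groupby(sorted(clicks), key=lambda e: e[0]):
    sessions.extend(sessionize(list(events)))

print(sessions)
# [('u1', 0, 0), ('u1', 600, 0), ('u1', 5000, 1), ('u2', 100, 0)]
```

The appeal of doing this in-database is that the partitioning and parallel execution are handled by the MPP engine, while the analyst expresses only the per-partition logic.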

Exhibit 79: Teradata and Hortonworks Unified Big Data Architecture Source: Hortonworks.

Hortonworks receives a fixed dollar amount per customer transaction from Teradata based on volume, regardless of the amount that Teradata bills to its end-user customers. In April 2012, Hortonworks received a nonrefundable prepayment of $9.5 million from Teradata as consideration for the development services expected to be performed by Hortonworks over the three-year term of the agreement. As of September 30, 2014, this prepayment had a remaining balance of $6.4 million. For the years ended April 30, 2012 and 2013, the eight months ended December 31, 2013, and the nine months ended September 30, 2014, revenue from Teradata was $0, $394,000, $682,000, and $1.1 million, respectively. Either party may terminate the agreement under certain circumstances, including if the other party breaches a material term of the agreement and fails to cure the breach within 30 days. Moreover, Teradata may terminate the agreement, without cause, upon 60 days' prior written notice, and Hortonworks may terminate the agreement, without cause, at the end of the initial term, upon 120 days' prior written notice, and thereafter at any time upon 120 days' prior written notice. The initial term of the agreement will continue until June 30, 2016, and will automatically be extended until terminated by either party pursuant to the terms of the agreement. Revenue from Teradata accounted for 3.6% of Hortonworks' total revenue for the year ended April 30, 2013, 3.8% of total revenue for the eight months ended December 31, 2013, and 3.3% of Hortonworks' total revenue for the nine months ended September 30, 2014.

Company Overview

Company Background

Corporate History

In 2005, while working at Yahoo!, Eric Baldeschwieler (former CTO at Hortonworks) challenged Owen O'Malley (co-founder of Hortonworks), Doug Cutting, Arun Murthy (co-founder of Hortonworks), and several others to solve how to store and process the data on the Internet in a simple, scalable, and economically feasible way. With oversight from Baldeschwieler and Raymie Stata (former CTO at Yahoo!), the team turned to the Apache Software Foundation and began to work with the open-source community on what became known as Apache Hadoop, specifically HDFS and MapReduce. 12 Hortonworks was founded in June 2011 when its early senior management team partnered with a core group of Hadoop developers at Yahoo! with the goal of expanding and developing the platform. The company has since expanded from its original 24 founding engineers to over 420 employees, operating out of its headquarters in Palo Alto, California, with additional offices in the United Kingdom and South Korea. 12 Hortonworks had 292 customers as of September 30, 2014, after only eight quarters of sales activity. (See Exhibit 80.)

Exhibit 80: Selected Key Customers Source: Company data, Credit Suisse.

In 2012, Hortonworks launched its primary product, the Hortonworks Data Platform. HDP is an enterprise-grade Hadoop-based platform for Big Data management, designed to address some of the limitations of traditional Hadoop. Hortonworks released HDP on an open-source basis and provides paid support and professional services to enterprises utilizing HDP. The company employs a substantial number of core committers to enterprise-grade Hadoop projects within the Apache Software Foundation's open-source ecosystem. This ensures that Hortonworks' support offering is backed by developers who are on the cutting edge of Hadoop.
Simply put, because the company contributes so significantly to the development and expansion of Hadoop, no one understands Hadoop better than Hortonworks. 12 Furthermore, by virtue of its heavy influence on Hadoop development, Hortonworks can ensure that the innovation roadmap drives toward enterprise-grade quality and predictability, as well as filling any major feature gaps that enterprises would otherwise have to work around. For example, in May 2014, Hortonworks acquired XA Secure, a provider of enterprise-ready data security and governance software, and is converting the acquired technology into a security layer included in the open-source HDP. By continually refining the Hadoop platform with the enterprise in mind, Hortonworks enhances the appeal of the Hadoop ecosystem and thus drives further demand for its own paid services. 1 Hortonworks has also partnered with a number of enterprise technology providers including HP, Microsoft, Rackspace, Red Hat, SAP, and Teradata, along with a continuous relationship with Yahoo! since the inception of the company. These strategic partnerships

ensure smooth interoperation between HDP and partner products as well as serving to further Hortonworks' enterprise credibility. In addition, HP, Microsoft, and Teradata resell Hortonworks' support and professional services offerings. 1,12 (See Exhibit 81.)

Exhibit 81: Selected Strategic Partners

Microsoft: HDInsight and HDP for Windows; only Hadoop distribution for Windows Azure and Windows Server; native integration with SQL Server, Excel, and System Center; extends Hadoop to the .NET community.

Teradata: Teradata Portfolio for Hadoop; seamless data access between Teradata and Hadoop (SQL-H); simple management and monitoring with Viewpoint integration; flexible deployment options. Resells Hortonworks Data Platform support subscription offerings, whereby Teradata typically performs level one support for its end-user customers; Hortonworks receives a fixed dollar amount per customer transaction from Teradata based on volume, regardless of the amount that Teradata bills to its end-user customers.

Red Hat: Bring enterprise Apache Hadoop to the open hybrid cloud; make Hadoop easier for developers, analysts, data architects, and operators; engineer solutions for a seamless customer experience and ensure integrated customer support.

SAP: Instant access plus infinite scale; SAP can assure its customers they are deploying an SAP HANA plus Hadoop architecture fully supported by SAP; enables analytics apps (BOBJ) to interact with Hadoop.

Source: Company data, Credit Suisse.

Headcount & Management Team

Hortonworks employs over 420 employees across its main Palo Alto office and satellite offices in the United Kingdom and South Korea. Hortonworks' management team has experience spanning open-source-focused companies such as Red Hat, SpringSource, JBoss, and Black Duck, as well as traditional enterprise companies such as VMware, Oracle, Microsoft, and Quest Software. (See Exhibit 82.) The company's board includes members with ties to Yahoo! and HP, among other enterprises.
1,12

Exhibit 82: Hortonworks Management Source: Company data, Credit Suisse.

Platform Overview

Hortonworks Data Platform (HDP)

Hortonworks Data Platform (HDP) is Hortonworks' enterprise-ready distribution of Hadoop, which incorporates a number of key services including YARN. Hortonworks developed

HDP with a focus on making the disruption more palatable for the enterprise. HDP consists of 100% open-source technology, which Hortonworks distributes for free and supplements through paid subscription support and training services. Hortonworks tests HDP with large-scale, high-stability deployments in mind, key criteria for enterprise adoption of Hadoop as a mainstream platform. The key services bundled in HDP serve a variety of needs, including core data access, data governance and integration, security, and data operations management. (See Exhibit 83.) Rather than needing to separately implement and connect these services themselves, enterprises can use HDP to more quickly set up a Hadoop environment. 1,12

Exhibit 83: Hortonworks Data Platform (HDP) Source: Hortonworks.

Data Access: HDP includes a variety of platforms for batch, interactive, online, and real-time processing, all intended to be run on YARN. Four frameworks are supported. First, MapReduce-based legacy Hadoop 1.0 applications can continue to be used for batch processing. Second, the Apache Tez API and framework enable applications to use interactive processing on HDFS data, as well as supporting batch processing with reduced latency on the same engine. Third, the upcoming Apache Slider API, made available as a technical preview in HDP in June 2014, is designed for use with "always on" real-time or online services. In addition, Slider enables non-YARN-aware distributed applications to utilize YARN, often without code changes. Lastly, the YARN Native framework enables custom or packaged applications to take close control of Hadoop cluster resources that are directly managed by YARN. HDP includes a variety of YARN-enabled services, such as Apache Pig (simplified MapReduce), Hive (SQL queries), Storm (fast, low-latency data processing), HBase (NoSQL database on top of HDFS), Accumulo (cell-level data access control), and Solr (rapid index and search of HDFS data).
The platform also supports YARN-ready partner applications from vendors such as Informatica, SAS, Oracle, and HP. 12

Governance and Integration: HDP includes a number of data workflow, lifecycle, and governance tools designed to manage the flow of data in and out of a Hadoop system in a simple and reliable way. These tools ease the integration of Hadoop into the rest of an enterprise's data architecture, facilitating adoption of Hadoop by the enterprise. Apache Falcon automates and simplifies dataset processing for retention, recovery, and data pipelines. Apache Oozie manages Hadoop workflow scheduling and is able to combine multiple component jobs into single logical units. Apache Sqoop is designed to efficiently import and export data on a bulk scale between external structured data stores and Hadoop. Sqoop is compatible with common relational databases including Teradata, Oracle, and MySQL. Apache Flume handles streaming log data flows from multiple sources into Hadoop in real time. 1,12
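Conceptually, what Sqoop automates is simple: read a relational table in splits and write each split out as delimited records for HDFS, in parallel map tasks. The toy sketch below illustrates that split-and-export idea, with sqlite3 and in-memory buffers standing in for a source database and HDFS files; it is not Sqoop's actual interface, and all names are hypothetical.

```python
import io
import json
import sqlite3

# Source "relational" table (sqlite3 stands in for Teradata/Oracle/MySQL).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 12.0), (3, 3.25)])

def export_table(conn, table, num_splits=2):
    """Distribute rows across `num_splits` output 'files', much as Sqoop
    splits a table across parallel map tasks writing to HDFS."""
    splits = [io.StringIO() for _ in range(num_splits)]
    for i, row in enumerate(conn.execute(f"SELECT * FROM {table}")):
        splits[i % num_splits].write(json.dumps(row) + "\n")
    return [s.getvalue() for s in splits]

parts = export_table(db, "orders")
print(parts[0])  # rows 1 and 3 as newline-delimited JSON
```

Sqoop additionally handles type mapping, incremental imports, and the reverse (export) direction, but the parallel split is the core of its performance story.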

Operations: HDP uses Apache Ambari and ZooKeeper to manage the operations of the actual Hadoop cluster within an enterprise's data infrastructure. Ambari simplifies the management of the cluster itself with provisioning, lifecycle and configuration management, and monitoring tools. ZooKeeper is a service that coordinates processes between nodes in the cluster by maintaining a hierarchy of data registers, similar to a file system, which is replicated across a set of machines for redundancy. 12,123

Security: Ensuring Hadoop's security is a key factor in driving adoption by enterprises. Several security features are already integrated into Hadoop, such as built-in Kerberos authentication and access control commands in Apache Hive. The Apache Knox system operates as a gateway that provides a single point of access and authentication for a Hadoop cluster. Hortonworks also recently acquired XA Secure, a provider of enterprise-ready data security and governance software, and is extending its centralized security administration and enforcement features across the Hadoop ecosystem. 1,12

HDP is also designed to slot into an existing enterprise IT architecture, utilizing existing data sources, interfacing with standard BI applications, and coexisting with other data systems. 12 (See Exhibit 84.)

Exhibit 84: Examples of Partner Ecosystem/Integrations Source: Hortonworks.

The Hortonworks Data Platform offers linear-scale storage and compute across batch, interactive, and real-time access methods with no proprietary extensions and is available on-premise, off-premise, or from an appliance across Windows and Linux. (See Exhibit 85.)
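ZooKeeper's data model, a replicated hierarchy of small data registers addressed by slash-separated paths, can be pictured in a few lines of code. The sketch below models only the namespace (no replication, watches, or ephemeral nodes) and is illustrative rather than ZooKeeper's client API; the paths and payloads are hypothetical.

```python
class ZNodeTree:
    """Toy model of ZooKeeper's namespace: a tree of 'znodes', each
    addressed by a slash-separated path and holding a small payload."""

    def __init__(self):
        self.nodes = {"/": b""}

    def create(self, path, data=b""):
        # Like ZooKeeper, require the parent znode to exist first.
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"no parent znode {parent!r}")
        self.nodes[path] = data

    def get_data(self, path):
        return self.nodes[path]

    def children(self, path):
        base = "" if path == "/" else path
        return sorted(p[len(base) + 1:] for p in self.nodes
                      if p != "/" and p.rsplit("/", 1)[0] == base)

tree = ZNodeTree()
tree.create("/cluster")
tree.create("/cluster/node1", b"10.0.0.1")  # e.g. a worker registering itself
tree.create("/cluster/node2", b"10.0.0.2")
print(tree.children("/cluster"))  # ['node1', 'node2']
```

Because every participant sees the same replicated tree, cluster processes can use it for leader election, membership, and configuration, which is exactly the coordination role it plays inside HDP.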

Exhibit 85: Hortonworks Data Platform (HDP) The ONLY Completely Open Hadoop Provider Available On-Premise, Off-Premise, or from an Appliance across Windows and Linux Source: Hortonworks.

Subscription/Services Overview

Hortonworks compiles and makes available for free its own distribution of open-source products and sells services and support, offering users the latest versions of Hadoop, patches for Hadoop bug fixes, Hortonworks' experience with the Hadoop platform, and a purely open-source Hadoop deployment. Customers pay Hortonworks fees for (1) subscription support, (2) professional and consulting services, and (3) training services. (See Exhibit 86.)

Exhibit 86: Subscription/Services Overview Mission-critical Hadoop support subscriptions (HDP Enterprise and HDP Enterprise Plus); consulting (architecture, implementation, migration, cluster tuning, best practices); and training (public and on-site classes) across architecture and design, development, implementation, and production. Source: Company data, Credit Suisse.

Support Subscriptions

Annual or multi-year support subscriptions are the primary source of revenue for Hortonworks. A support subscription entitles a customer to direct support for their implementation of HDP as well as updates, bug fixes, and patches for Hadoop applications. Support services are intended to help enterprises at any stage of their Hadoop deployment, from proof-of-concept to post-launch new product development. Two editions of the support subscription are offered: Enterprise and Enterprise Plus. Enterprise Plus subscriptions offer support for more HDP components. Both editions provide for 24/7 support with one-hour response time for the highest-severity issues. Support subscriptions

are generally paid for in advance and are priced according to the number of Hadoop nodes the customer has deployed, the amount of data under management, or the extent of the services provided to the customer. Hortonworks also derives revenue from support subscription offerings that are resold by partners such as HP, Microsoft, and Teradata. 1,12

Professional Services

Hortonworks also offers paid professional services, which consist of consulting and training. These offerings serve to aid adoption of HDP and thus drive further demand for Hortonworks' support subscription services. Services are sometimes bundled with support subscriptions in multiple-element arrangements. While professional services are lower margin than support subscriptions, offering services is essential to developing the Hadoop ecosystem, given its relatively early stages of adoption by the enterprise. 1,12

Sales/Distribution Overview

Hortonworks' sales organization consists of a direct sales team and reseller partners who work in collaboration with the direct sales team to identify new sales prospects, sell its subscriptions and professional services, and provide post-sale support. The company's direct field sales organization specifically targets large enterprise and government customers across a broad range of geographies and industry verticals. Key verticals that the company targets include online services, retail/e-commerce, financial services, manufacturing, media/entertainment, and telecommunications. 1 Hortonworks is currently rapidly expanding its sales organization, with a substantial amount of hiring beginning in the second and third quarters of 2014. (See Exhibit 87.) Hortonworks expects to continue to grow its sales headcount across all of its markets, with an emphasis on broadening its geographical coverage beyond its existing primarily U.S.-centric customer base.
It takes approximately nine months for a new sales representative to ramp up to full productivity, and as such, a large portion of the sales force will be relatively green for some time to come. Even currently ramped reps still have a short track record, as Hortonworks only launched HDP in mid-2012.

Exhibit 87: Hortonworks Sales Representative Hiring vs. Subscription Customers (subscription customers, total sales reps, and ramped sales reps by quarter) Source: Company data, Credit Suisse.

The direct inside sales organization focuses on medium-sized enterprises and smaller organizations, with inbound traffic aided by Hortonworks' HDP Sandbox, a distribution of HDP designed to be set up in minutes for tutorial purposes. Sales engineers supplement
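The nine-month ramp assumption can be turned into a simple cohort model: a rep counts as "ramped" only once nine months have passed since hiring. The toy calculation below (hypothetical hire dates, not company data) shows why a burst of hiring depresses the ramped share of the sales force for several quarters.

```python
from datetime import date

RAMP_MONTHS = 9  # per the report, ~9 months to full productivity

def months_between(start, end):
    return (end.year - start.year) * 12 + (end.month - start.month)

def ramped_reps(hire_dates, as_of):
    """Count reps hired at least RAMP_MONTHS before `as_of` (toy model)."""
    return sum(1 for d in hire_dates if months_between(d, as_of) >= RAMP_MONTHS)

# Hypothetical cohort: heavy mid-year hiring leaves most reps unramped at year end.
hires = [date(2014, 1, 15), date(2014, 3, 1), date(2014, 6, 1),
         date(2014, 7, 1), date(2014, 8, 1)]
print(ramped_reps(hires, date(2014, 12, 31)), "of", len(hires), "ramped")
```

A model like this is what drives the gap between the "total" and "ramped" rep lines in Exhibit 87.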

sales representatives with deep technical expertise across pre-sales technical support, proof-of-concept work, and solutions engineering, as well as serving as information liaisons between customers and Hortonworks' product development groups. Hortonworks' business development team coordinates between direct field sales and strategic and reseller partners in order to maintain a closer relationship with key accounts. 1

YARN Ready Program

To further expand the ecosystem and accelerate the adoption of Hadoop 2.0/YARN, Hortonworks launched the YARN Ready Program in June 2014 as part of the Hortonworks Partner Certification Program. The objective of the YARN Ready Program is to provide partners with the assurance that their YARN Ready tools and applications are fully compatible with the Hortonworks Data Platform. The program includes tools, guides, sample code, access to technical resources, and a simple mechanism for certification to build deep integration with YARN and the Hortonworks Data Platform. YARN Ready partners include HP, Informatica, Microsoft, Oracle, SAP, Splunk, Tableau, and Teradata. 124 (See Exhibit 88.)

Exhibit 88: Selected YARN Ready Partners Source: Hortonworks.

Today's enterprises are looking to go beyond batch processing and integrate existing applications with Hadoop to realize the benefits of real-time processing and interactive query capabilities. Specifically created to capitalize on the capabilities of Hadoop 2.0, the YARN Ready Program certifies that new and existing tools and applications have been deeply integrated with YARN. 12 As more organizations move from single-application Hadoop clusters to a versatile, integrated Hadoop 2.0 data platform hosting multiple applications, YARN is strategically positioned as the true integration point of today's enterprise data layer.
At the architectural center of Hadoop, YARN provides access to the core elements of the platform. Tools and applications that are YARN Ready have been certified to deeply integrate with the Hortonworks Data Platform. 12

Accounting Overview

Revenue Recognition

Hortonworks generates revenue primarily under multiple-element arrangements that include support subscription offerings combined with consulting and/or training services.
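In multiple-element accounting, the bundled fee is typically allocated across the deliverables in proportion to their standalone selling prices before each piece is recognized on its own schedule. The sketch below illustrates that relative-selling-price allocation with hypothetical numbers; it is a generic accounting illustration, not Hortonworks' actual policy or price list.

```python
def allocate(arrangement_fee, standalone_prices):
    """Allocate a bundled fee across elements in proportion to their
    standalone selling prices (relative-selling-price method)."""
    total = sum(standalone_prices.values())
    return {element: round(arrangement_fee * price / total, 2)
            for element, price in standalone_prices.items()}

# Hypothetical deal: a $100k bundle of support subscription + consulting + training.
print(allocate(100_000, {"subscription": 80_000, "consulting": 30_000, "training": 10_000}))
# {'subscription': 66666.67, 'consulting': 25000.0, 'training': 8333.33}
```

The subscription portion would then be recognized ratably over the support term, while the services portions are recognized as delivered.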


More information

SAS & HADOOP ANALYTICS ON BIG DATA

SAS & HADOOP ANALYTICS ON BIG DATA SAS & HADOOP ANALYTICS ON BIG DATA WHY HADOOP? OPEN SOURCE MASSIVE SCALE FAST PROCESSING COMMODITY COMPUTING DATA REDUNDANCY DISTRIBUTED WHY HADOOP? Hadoop will soon become a replacement complement to:

More information

IBM Business Perspective Patricia Murphy Vice President, Investor Relations

IBM Business Perspective Patricia Murphy Vice President, Investor Relations IBM Business Perspective 2011 Patricia Murphy Vice President, Investor Relations Certain comments made in this presentation may be characterized as forward looking under the Private Securities Litigation

More information

Accelerating Your Big Data Analytics. Jeff Healey, Director Product Marketing, HPE Vertica

Accelerating Your Big Data Analytics. Jeff Healey, Director Product Marketing, HPE Vertica Accelerating Your Big Data Analytics Jeff Healey, Director Product Marketing, HPE Vertica Recent Waves of Disruption IT Infrastructu re for Analytics Data Warehouse Modernization Big Data/ Hadoop Cloud

More information

Global Headquarters: 5 Speen Street Framingham, MA USA P F

Global Headquarters: 5 Speen Street Framingham, MA USA P F Global Headquarters: 5 Speen Street Framingham, MA 01701 USA P.508.872.8200 F.508.935.4015 www.idc.com WHITE PAPER Why Linux Is Good for ISVs Sponsored by: Red Hat and Intel Julie Tiley August 2005 IDC

More information

Guide to Modernize Your Enterprise Data Warehouse How to Migrate to a Hadoop-based Big Data Lake

Guide to Modernize Your Enterprise Data Warehouse How to Migrate to a Hadoop-based Big Data Lake White Paper Guide to Modernize Your Enterprise Data Warehouse How to Migrate to a Hadoop-based Big Data Lake Motivation for Modernization It is now a well-documented realization among Fortune 500 companies

More information

Insights to HDInsight

Insights to HDInsight Insights to HDInsight Why Hadoop in the Cloud? No hardware costs Unlimited Scale Pay for What You Need Deployed in minutes Azure HDInsight Big Data made easy Enterprise Ready Easier and more productive

More information

EXAMPLE SOLUTIONS Hadoop in Azure HBase as a columnar NoSQL transactional database running on Azure Blobs Storm as a streaming service for near real time processing Hadoop 2.4 support for 100x query gains

More information

THE IMPACT OF OPEN SOURCE SOFTWARE ON DEVELOPING IoT SOLUTIONS

THE IMPACT OF OPEN SOURCE SOFTWARE ON DEVELOPING IoT SOLUTIONS THE IMPACT OF OPEN SOURCE SOFTWARE ON DEVELOPING IoT SOLUTIONS EXECUTIVE SUMMARY Worldwide IoT spending is projected to surpass $1 trillion in 2020, with annual growth of 15 percent over the next several

More information

Sr. Sergio Rodríguez de Guzmán CTO PUE

Sr. Sergio Rodríguez de Guzmán CTO PUE PRODUCT LATEST NEWS Sr. Sergio Rodríguez de Guzmán CTO PUE www.pue.es Hadoop & Why Cloudera Sergio Rodríguez Systems Engineer sergio@pue.es 3 Industry-Leading Consulting and Training PUE is the first Spanish

More information

DLT AnalyticsStack. Powering big data, analytics and data science strategies for government agencies

DLT AnalyticsStack. Powering big data, analytics and data science strategies for government agencies DLT Stack Powering big data, analytics and data science strategies for government agencies Now, government agencies can have a scalable reference model for success with Big Data, Advanced and Data Science

More information

Preliminary Results for the year ended 31 December March 2014

Preliminary Results for the year ended 31 December March 2014 WANdisco plc Preliminary Results for the year ended 31 December 2013 20 March 2014 2013 Strategic Update David Richards CEO Powering Big Data Highlights Financial - Bookings increased 86% year-on-year

More information

Access, Transform, and Connect Data with SAP Data Services Software

Access, Transform, and Connect Data with SAP Data Services Software SAP Brief SAP s for Enterprise Information Management SAP Data Services Access, Transform, and Connect Data with SAP Data Services Software SAP Brief Establish an enterprise data integration and data quality

More information

Business is being transformed by three trends

Business is being transformed by three trends Business is being transformed by three trends Big Cloud Intelligence Stay ahead of the curve with Cortana Intelligence Suite Business apps People Custom apps Apps Sensors and devices Cortana Intelligence

More information

Datametica DAMA. The Modern Data Platform Enterprise Data Hub Implementations. What is happening with Hadoop Why is workload moving to Cloud

Datametica DAMA. The Modern Data Platform Enterprise Data Hub Implementations. What is happening with Hadoop Why is workload moving to Cloud DAMA Datametica The Modern Data Platform Enterprise Data Hub Implementations What is happening with Hadoop Why is workload moving to Cloud 1 The Modern Data Platform The Enterprise Data Hub What do we

More information

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics ESSENTIALS EMC ISILON Use the industry's first and only scale-out NAS solution with native Hadoop

More information

EXECUTIVE BRIEF. Successful Data Warehouse Approaches to Meet Today s Analytics Demands. In this Paper

EXECUTIVE BRIEF. Successful Data Warehouse Approaches to Meet Today s Analytics Demands. In this Paper Sponsored by Successful Data Warehouse Approaches to Meet Today s Analytics Demands EXECUTIVE BRIEF In this Paper Organizations are adopting increasingly sophisticated analytics methods Analytics usage

More information

Hortonworks Powering the Future of Data

Hortonworks Powering the Future of Data Hortonworks Powering the Future of Simon Gregory Vice President Eastern Europe, Middle East & Africa 1 Hortonworks Inc. 2011 2016. All Rights Reserved MASTER THE VALUE OF DATA EVERY BUSINESS IS A DATA

More information

Next Generation Services for Digital Transformation: An Enterprise Guide for Prioritization

Next Generation Services for Digital Transformation: An Enterprise Guide for Prioritization IDC Executive Brief Sponsored by: Computacenter Authors: Chris Barnard, Francesca Ciarletta, Leslie Rosenberg, Roz Parkinson March 2019 Next Generation Services for Digital Transformation: An Enterprise

More information

KnowledgeSTUDIO. Advanced Modeling for Better Decisions. Data Preparation, Data Profiling and Exploration

KnowledgeSTUDIO. Advanced Modeling for Better Decisions. Data Preparation, Data Profiling and Exploration KnowledgeSTUDIO Advanced Modeling for Better Decisions Companies that compete with analytics are looking for advanced analytical technologies that accelerate decision making and identify opportunities

More information

Hadoop Stories. Tim Marston. Director, Regional Alliances Page 1. Hortonworks Inc All Rights Reserved

Hadoop Stories. Tim Marston. Director, Regional Alliances Page 1. Hortonworks Inc All Rights Reserved Hadoop Stories Tim Marston Director, Regional Alliances EMEA Page 1 @timmarston Page 2 Plans for Hadoop Adoption (Gartner, May 2015) Start within 1 year 11% Start within 2 years 7% Already doing 27% No

More information

E-Guide THE EVOLUTION OF IOT ANALYTICS AND BIG DATA

E-Guide THE EVOLUTION OF IOT ANALYTICS AND BIG DATA E-Guide THE EVOLUTION OF IOT ANALYTICS AND BIG DATA E nterprises are already recognizing the value that lies in IoT data, but IoT analytics is still evolving and businesses have yet to see the full potential

More information

Analytics in the Cloud, Cross Functional Teams, and Apache Hadoop is not a Thing Ryan Packer, Bank of New Zealand

Analytics in the Cloud, Cross Functional Teams, and Apache Hadoop is not a Thing Ryan Packer, Bank of New Zealand Paper 2698-2018 Analytics in the Cloud, Cross Functional Teams, and Apache Hadoop is not a Thing Ryan Packer, Bank of New Zealand ABSTRACT Digital analytics is no longer just about tracking the number

More information

Big Data The Big Story

Big Data The Big Story Big Data The Big Story Jean-Pierre Dijcks Big Data Product Mangement 1 Agenda What is Big Data? Architecting Big Data Building Big Data Solutions Oracle Big Data Appliance and Big Data Connectors Customer

More information

CREATING A FOUNDATION FOR BUSINESS VALUE

CREATING A FOUNDATION FOR BUSINESS VALUE CREATING A FOUNDATION FOR BUSINESS VALUE Building initial use cases to drive predictive and prescriptive analytics ABSTRACT This white paper highlights three initial big data use cases that can help your

More information

Spark and Hadoop Perfect Together

Spark and Hadoop Perfect Together Spark and Hadoop Perfect Together Arun Murthy Hortonworks Co-Founder @acmurthy Data Operating System Enable all data and applications TO BE accessible and shared BY any end-users Data Operating System

More information

Big Data Introduction

Big Data Introduction Big Data Introduction Who we are Experts At Your Service Over 50 specialists in IT infrastructure Certified, experienced, passionate Based In Switzerland 100% self-financed Swiss company Over CHF8 mio.

More information

Luxoft and the Internet of Things

Luxoft and the Internet of Things Luxoft and the Internet of Things Bridging the gap between Imagination and Technology www.luxoft.com/iot Luxoft and The Internet of Things Table of Contents Introduction... 3 Driving Business Value with

More information

Outline of Hadoop. Background, Core Services, and Components. David Schwab Synchronic Analytics Nov.

Outline of Hadoop. Background, Core Services, and Components. David Schwab Synchronic Analytics   Nov. Outline of Hadoop Background, Core Services, and Components David Schwab Synchronic Analytics https://synchronicanalytics.com Nov. 1, 2018 Hadoop s Purpose and Origin Hadoop s Architecture Minimum Configuration

More information

Analytics in Action transforming the way we use and consume information

Analytics in Action transforming the way we use and consume information Analytics in Action transforming the way we use and consume information Big Data Ecosystem The Data Traditional Data BIG DATA Repositories MPP Appliances Internet Hadoop Data Streaming Big Data Ecosystem

More information

WELCOME TO. Cloud Data Services: The Art of the Possible

WELCOME TO. Cloud Data Services: The Art of the Possible WELCOME TO Cloud Data Services: The Art of the Possible Goals for Today Share the cloud-based data management and analytics technologies that are enabling rapid development of new mobile applications Discuss

More information

Big Data: A BIG problem and a HUGE opportunity. Version MAY 2013 xcommedia

Big Data: A BIG problem and a HUGE opportunity. Version MAY 2013 xcommedia Big Data: A BIG problem and a HUGE opportunity. Version 1.0 22 MAY 2013 xcommedia 2013 www.xcommedia.com.au Page 1 Introduction The volume and amount of data in the world has been increasing exponentially

More information

Introduction to Big Data(Hadoop) Eco-System The Modern Data Platform for Innovation and Business Transformation

Introduction to Big Data(Hadoop) Eco-System The Modern Data Platform for Innovation and Business Transformation Introduction to Big Data(Hadoop) Eco-System The Modern Data Platform for Innovation and Business Transformation Roger Ding Cloudera February 3rd, 2018 1 Agenda Hadoop History Introduction to Apache Hadoop

More information

Realising Value from Data

Realising Value from Data Realising Value from Data Togetherwith Open Source Drives Innovation & Adoption in Big Data BCS Open Source SIG London 1 May 2013 Timings 6:00-6:30pm. Register / Refreshments 6:30-8:00pm, Presentation

More information

BIG DATA PROCESSING A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW

BIG DATA PROCESSING A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW BIG DATA PROCESSING A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW TOPICS COVERED 1 2 Fundamentals of Big Data Platforms Major Big Data Tools Scaling Up vs. Out SCALE UP (SMP) SCALE OUT (MPP) + (n) Upgrade

More information

Microsoft Azure Essentials

Microsoft Azure Essentials Microsoft Azure Essentials Azure Essentials Track Summary Data Analytics Explore the Data Analytics services in Azure to help you analyze both structured and unstructured data. Azure can help with large,

More information

Aurélie Pericchi SSP APS Laurent Marzouk Data Insight & Cloud Architect

Aurélie Pericchi SSP APS Laurent Marzouk Data Insight & Cloud Architect Aurélie Pericchi SSP APS Laurent Marzouk Data Insight & Cloud Architect 2005 Concert de Coldplay 2014 Concert de Coldplay 90% of the world s data has been created over the last two years alone 1 1. Source

More information

Angat Pinoy. Angat Negosyo. Angat Pilipinas.

Angat Pinoy. Angat Negosyo. Angat Pilipinas. Angat Pinoy. Angat Negosyo. Angat Pilipinas. Four megatrends will dominate the next decade Mobility Social Cloud Big data 91% of organizations expect to spend on mobile devices in 2012 In 2012, mobile

More information

DRIVE YOUR OWN DISRUPTION

DRIVE YOUR OWN DISRUPTION DRIVE YOUR OWN DISRUPTION Unleash new growth potential in Industrial Equipment with an intelligent supply chain GET 360 DEGREES ALL-AROUND SMART As your products get smart, your supply chain must get smarter

More information

By: Shrikant Gawande (Cloudera Certified )

By: Shrikant Gawande (Cloudera Certified ) By: Shrikant Gawande (Cloudera Certified ) What is Big Data? For every 30 mins, a airline jet collects 10 terabytes of sensor data (flying time) NYSE generates about one terabyte of new trade data per

More information

ENABLING GLOBAL HADOOP WITH DELL EMC S ELASTIC CLOUD STORAGE (ECS)

ENABLING GLOBAL HADOOP WITH DELL EMC S ELASTIC CLOUD STORAGE (ECS) ENABLING GLOBAL HADOOP WITH DELL EMC S ELASTIC CLOUD STORAGE (ECS) Hadoop Storage-as-a-Service ABSTRACT This White Paper illustrates how Dell EMC Elastic Cloud Storage (ECS ) can be used to streamline

More information

Modern Data Architecture with Apache Hadoop

Modern Data Architecture with Apache Hadoop Modern Data Architecture with Apache Hadoop Automating Data Transfer with Attunity Replicate Presented by Hortonworks and Attunity Executive Summary Apache Hadoop didn t disrupt the datacenter, the data

More information

Operational Hadoop and the Lambda Architecture for Streaming Data

Operational Hadoop and the Lambda Architecture for Streaming Data Operational Hadoop and the Lambda Architecture for Streaming Data 2015 MapR Technologies 2015 MapR Technologies 1 Topics From Batch to Operational Workloads on Hadoop Streaming Data Environments The Lambda

More information

Analyze Big Data Faster and Store it Cheaper. Dominick Huang CenterPoint Energy Russell Hull - SAP

Analyze Big Data Faster and Store it Cheaper. Dominick Huang CenterPoint Energy Russell Hull - SAP Analyze Big Data Faster and Store it Cheaper Dominick Huang CenterPoint Energy Russell Hull - SAP ABOUT CENTERPOINT ENERGY, INC. Publicly traded on New York Stock Exchange Headquartered in Houston, Texas

More information

Big Data Analytics for Retail with Apache Hadoop. A Hortonworks and Microsoft White Paper

Big Data Analytics for Retail with Apache Hadoop. A Hortonworks and Microsoft White Paper Big Data Analytics for Retail with Apache Hadoop A Hortonworks and Microsoft White Paper 2 Contents The Big Data Opportunity for Retail 3 The Data Deluge, and Other Barriers 4 Hadoop in Retail 5 Omni-Channel

More information

StackIQ Enterprise Data Reference Architecture

StackIQ Enterprise Data Reference Architecture WHITE PAPER StackIQ Enterprise Data Reference Architecture StackIQ and Hortonworks worked together to Bring You World-class Reference Configurations for Apache Hadoop Clusters. Abstract Contents The Need

More information

Getting Big Value from Big Data

Getting Big Value from Big Data Getting Big Value from Big Data Expanding Information Architectures To Support Today s Data Research Perspective Sponsored by Aligning Business and IT To Improve Performance Ventana Research 2603 Camino

More information

Insights-Driven Operations with SAP HANA and Cloudera Enterprise

Insights-Driven Operations with SAP HANA and Cloudera Enterprise Insights-Driven Operations with SAP HANA and Cloudera Enterprise Unleash your business with pervasive Big Data Analytics with SAP HANA and Cloudera Enterprise The missing link to operations As big data

More information

The Value- Driven CFO. kpmg.com

The Value- Driven CFO. kpmg.com The Value- Driven CFO kpmg.com 2 Leading the Way in a Data-Driven Enterprise Several years of global uncertainty have made even the toughest executives flinch, and that s certainly true for chief financial

More information

H2O Powers Intelligent Product Recommendation Engine at Transamerica. Case Study

H2O Powers Intelligent Product Recommendation Engine at Transamerica. Case Study H2O Powers Intelligent Product Recommendation Engine at Transamerica Case Study Summary For a financial services firm like Transamerica, sales and marketing efforts can be complex and challenging, with

More information

IBM Db2 Warehouse. Hybrid data warehousing using a software-defined environment in a private cloud. The evolution of the data warehouse

IBM Db2 Warehouse. Hybrid data warehousing using a software-defined environment in a private cloud. The evolution of the data warehouse IBM Db2 Warehouse Hybrid data warehousing using a software-defined environment in a private cloud The evolution of the data warehouse Managing a large-scale, on-premises data warehouse environments to

More information

Four IoT Platform Must-Haves That Can Accelerate Your IoT Deployment

Four IoT Platform Must-Haves That Can Accelerate Your IoT Deployment Four IoT Platform Must-Haves That Can Accelerate Your IoT Deployment INTRODUCTION Connect Things to Apps with Speed, Ease, and Scale At the center of the Internet of Things is the massive volume of data

More information

Big data is hard. Top 3 Challenges To Adopting Big Data

Big data is hard. Top 3 Challenges To Adopting Big Data Big data is hard Top 3 Challenges To Adopting Big Data Traditionally, analytics have been over pre-defined structures Data characteristics: Sales Questions answered with BI and visualizations: Customer

More information

Analytics for All Your Data: Cloud Essentials. Pervasive Insight in the World of Cloud

Analytics for All Your Data: Cloud Essentials. Pervasive Insight in the World of Cloud Analytics for All Your Data: Cloud Essentials Pervasive Insight in the World of Cloud The Opportunity We re living in a world where just about everything we see, do, hear, feel, and experience is captured

More information

Bringing Big Data to Life: Overcoming The Challenges of Legacy Data in Hadoop

Bringing Big Data to Life: Overcoming The Challenges of Legacy Data in Hadoop 0101 001001010110100 010101000101010110100 1000101010001000101011010 00101010001010110100100010101 0001001010010101001000101010001 010101101001000101010001001010010 010101101 000101010001010 1011010 0100010101000

More information

From Information to Insight: The Big Value of Big Data. Faire Ann Co Marketing Manager, Information Management Software, ASEAN

From Information to Insight: The Big Value of Big Data. Faire Ann Co Marketing Manager, Information Management Software, ASEAN From Information to Insight: The Big Value of Big Data Faire Ann Co Marketing Manager, Information Management Software, ASEAN The World is Changing and Becoming More INSTRUMENTED INTERCONNECTED INTELLIGENT

More information

On-Premise or Cloud? The Choice is Yours

On-Premise or Cloud? The Choice is Yours Empowering ERP Asset Management Solutions On-Premise or Cloud? The Choice is Yours Premier EAM add-on software providers offer both delivery methods By Chris McIntosh and Sean Licata, VIZIYA Corp. Enterprise

More information

IBM High Performance Services for Hadoop

IBM High Performance Services for Hadoop IBM Terms of Use SaaS Specific Offering Terms IBM High Performance Services for Hadoop The Terms of Use ( ToU ) is composed of this IBM Terms of Use - SaaS Specific Offering Terms ( SaaS Specific Offering

More information

In-Memory Analytics: Get Faster, Better Insights from Big Data

In-Memory Analytics: Get Faster, Better Insights from Big Data Discussion Summary In-Memory Analytics: Get Faster, Better Insights from Big Data January 2015 Interview Featuring: Tapan Patel, SAS Institute, Inc. Introduction A successful analytics program should translate

More information

Building Your Big Data Team

Building Your Big Data Team Building Your Big Data Team With all the buzz around Big Data, many companies have decided they need some sort of Big Data initiative in place to stay current with modern data management requirements.

More information

Financial Model. Mark Loughridge Senior Vice President and Chief Financial Officer, Finance and Enterprise Transformation

Financial Model. Mark Loughridge Senior Vice President and Chief Financial Officer, Finance and Enterprise Transformation Financial Model Mark Loughridge Senior Vice President and Chief Financial Officer, Finance and Enterprise Transformation 2015 Roadmap Base revenue growth ~2% excluding divestitures Shift to faster growing

More information

Welcome! 2013 SAP AG or an SAP affiliate company. All rights reserved.

Welcome! 2013 SAP AG or an SAP affiliate company. All rights reserved. Welcome! 2013 SAP AG or an SAP affiliate company. All rights reserved. 1 SAP Big Data Webinar Series Big Data - Introduction to SAP Big Data Technologies Big Data - Streaming Analytics Big Data - Smarter

More information

with Dell EMC s On-Premises Solutions

with Dell EMC s On-Premises Solutions 902 Broadway, 7th Floor New York, NY 10010 www.theedison.com @EdisonGroupInc 212.367.7400 Lower the Cost of Analytics with Dell EMC s On-Premises Solutions Comparing Total Cost of Ownership of Dell EMC

More information

GPU ACCELERATED BIG DATA ARCHITECTURE

GPU ACCELERATED BIG DATA ARCHITECTURE INNOVATION PLATFORM WHITE PAPER 1 Today s enterprise is producing and consuming more data than ever before. Enterprise data storage and processing architectures have struggled to keep up with this exponentially

More information

Nouvelle Génération de l infrastructure Data Warehouse et d Analyses

Nouvelle Génération de l infrastructure Data Warehouse et d Analyses Nouvelle Génération de l infrastructure Data Warehouse et d Analyses November 2011 André Münger andre.muenger@emc.com +41 79 708 85 99 1 Agenda BIG Data Challenges Greenplum Overview Use Cases Summary

More information

Predictive Analytics Reimagined for the Digital Enterprise

Predictive Analytics Reimagined for the Digital Enterprise SAP Brief SAP BusinessObjects Analytics SAP BusinessObjects Predictive Analytics Predictive Analytics Reimagined for the Digital Enterprise Predicting and acting in a business moment through automation

More information

Confidential

Confidential June 2017 1. Is your EDW becoming too expensive to maintain because of hardware upgrades and increasing data volumes? 2. Is your EDW becoming a monolith, which is too slow to adapt to business s analytical

More information

IBM Business Perspective 2013

IBM Business Perspective 2013 IBM Business Perspective 2013 Patricia Murphy Vice President, Investor Relations 2009 IBM Corporation Certain comments made in this presentation may be characterized as forward looking under the Private

More information

Preface About the Book

Preface About the Book Preface About the Book We are living in the dawn of what has been termed as the "Fourth Industrial Revolution" by the World Economic Forum (WEF) in 2016. The Fourth Industrial Revolution is marked through

More information

Embracing the Hybrid Cloud using Power BI in CSP. Name Role Group

Embracing the Hybrid Cloud using Power BI in CSP. Name Role Group Embracing the Hybrid Cloud using Power BI in CSP Name Role Group Agenda Cloud Vision & Opportunity What is Power BI Power BI in CSP Power BI in Action Summary Microsoft vision for new era Unified platform

More information

DIGITAL TRANSFORMATION SOLUTIONS

DIGITAL TRANSFORMATION SOLUTIONS DIGITAL TRANSFORMATION SOLUTIONS BUSINESS AND TECHNOLOGY ARE CHANGING We are in the initial stages of a new era and the next industrial revolution, popularly termed Industry 4.0. What does that mean for

More information

Table of Contents. Are You Ready for Digital Transformation? page 04. Take Advantage of This Big Data Opportunity with Cisco and Hortonworks page 06

Table of Contents. Are You Ready for Digital Transformation? page 04. Take Advantage of This Big Data Opportunity with Cisco and Hortonworks page 06 Table of Contents 01 02 Are You Ready for Digital Transformation? page 04 Take Advantage of This Big Data Opportunity with Cisco and Hortonworks page 06 03 Get Open Access to Your Data and Help Ensure

More information

Financial Discussion. James Kavanaugh Senior Vice President and Chief Financial Officer IBM

Financial Discussion. James Kavanaugh Senior Vice President and Chief Financial Officer IBM Financial Discussion James Kavanaugh Senior Vice President and Chief Financial Officer IBM 1 IBM 2018 Investor Briefing Our differentiated value proposition is driven by innovative technology, industry

More information