2012 SNIA Analytics and Big Data Summit. Insert Your Company Name. All Rights Reserved.




2 A Working Definition of Big Data
"Data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time."
Wikipedia, 4/26/2011

3 A Better Definition of Big Data
"The intersection of scale-out data analysis tools with scale-out data storage."
Rob Peglar, May 2011

4 As Good as it Gets Definition of Big Data
"I don't want to have to run file system check on thing, ever."
All the Storage Admins I Know, June 2011

5 What is Big Data in a Datacenter
File-based data & NAS access & >100 TB file system
- File-based enterprise IT applications: home directories, virtualization, file archiving
- Vertical line-of-business markets: M&E, life sciences, R&D, engineering, Internet/Web 2.0, gov't, higher ed, oil & gas

6 By 2012, 80% of all storage capacity sold will be for file-based data Source:

7 Big Data Companies Coming out of the Woodwork
Cloudera, Mu Sigma, Infochimps, Riptano, Pervasive, IRI, Jive (bought Proximal Labs), Karmasphere, Infobright, npario, Qlik, Datasift, MetrixLab, Alpine Data, EMC (bought Greenplum), IBM (bought Netezza), HP (bought Vertica), Teradata (bought Aster)

8 What's the Big Deal about Big Data?
- McKinsey calls it "the next frontier for innovation, competition and productivity" (McKinsey Global Institute, May 2011)
- Fueled by an explosion of smart devices
  - Human-oriented devices: handhelds, tablets, cameras
  - Non-human-oriented devices: sensors, embedded CPUs
- Social networking messages & data grow exponentially
  - Twitter feeds, Facebook updates, LinkedIn messages
- Increasingly, business is conducted digitally or digitized
- Big Data is global: any source to any target

9 Social Media Not as Easy as Some Think

10 What's the Big Deal about Big Data?
Some research by McKinsey (McKinsey Global Institute, May 2011)
- $6000 worth of HDD can store all recorded original music
  - But not all the copies of it!
- 5 billion mobile phones in use in 2010, and growing
  - Moving to multiple devices per person; ~7 billion people now on Earth
- 30 billion content pieces shared by Facebook users per month in 2011
- Digital data is growing globally at 40% per annum
  - Compare to IT budgets, which are growing at 5% per annum
  - Estimate for 2012 is 28 EB by enterprises and 36 EB by consumers
- Total data stored in 2011 is 295 exabytes (accumulated in history)
- 1,300 exabytes/yr (1.3 ZB) of data transferred on the Internet by 2016
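The gap between the two growth rates on this slide compounds quickly. The following is a minimal sketch of that arithmetic, assuming both rates stay constant and both quantities are normalized to 1.0 at year 0; the function names are illustrative and not from the talk.

```python
DATA_GROWTH = 0.40    # from the slide: digital data grows 40% per annum
BUDGET_GROWTH = 0.05  # from the slide: IT budgets grow 5% per annum


def compound(start: float, rate: float, years: int) -> float:
    """Value after `years` of compound growth at a constant annual `rate`."""
    return start * (1.0 + rate) ** years


def data_to_budget_ratio(years: int) -> float:
    """How much faster data has grown than budget after `years`.

    Assumption: both are normalized to 1.0 at year 0 and the rates never change.
    """
    return compound(1.0, DATA_GROWTH, years) / compound(1.0, BUDGET_GROWTH, years)


if __name__ == "__main__":
    for years in (1, 5, 10):
        print(
            f"year {years:2d}: data x{compound(1.0, DATA_GROWTH, years):.2f}, "
            f"budget x{compound(1.0, BUDGET_GROWTH, years):.2f}, "
            f"ratio {data_to_budget_ratio(years):.2f}"
        )
```

Under these assumptions, data grows more than fivefold in five years while budgets grow by under 30%, so each budget dollar has to cover roughly four times as much data.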

11 What's the Big Deal about Big Data?
More research by McKinsey (McKinsey Global Institute, May 2011)
- Estimated value of healthcare data is $300B just in US
  - E.g. CDC public health warnings, cancer genomics, drug design
- Tapping into value could reduce US HC spend by 8%
  - I.e. stay within normal inflation instead of hyper-inflation
- $600B est. commercial value of consumer location data
  - E.g. from smartphones, tablets, GPS devices, etc.
- 140,000 new data analyst/data scientist positions and 1.5 million more data managers needed to tap into value
- Transactional data, positioning data, captured data
- Consumption meters, usage tracking, embedded devices creating

12 Big Data Applications and Management
- Big data is nearly all file-based, not block-based
- Hadoop is an application written to analyze big data: open source, Java-based
- Big data can mean billions to trillions of files
- Each file can be gigabytes to terabytes in size
- Directed graph analysis, Collaborative Filtering, A/B testing, Associative Rule Learning, Classification, Natural Language Processing, Data Mining, Pattern Matching, Sentiment Analysis, Comparative Effectiveness, and Clinical Decision Support are examples of big data techniques
- This means petabytes to exabytes of data
- Enterprises ingesting > 1 PB data per day within 5 yrs
- LCF to SLAC data transfer goal = 1 PB in eight hours over ESnet
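The two rates on this slide (1 PB ingested per day, and 1 PB moved in eight hours) translate into sustained throughput as follows. This is a back-of-the-envelope sketch that assumes decimal units (1 PB = 10^15 bytes) and a perfectly steady stream with no protocol overhead; the helper name is illustrative.

```python
PB = 10**15  # assumption: decimal petabyte, in bytes


def sustained_rate(total_bytes: int, hours: float) -> float:
    """Average bytes per second needed to move `total_bytes` in `hours`."""
    return total_bytes / (hours * 3600)


# 1 PB in eight hours (the LCF to SLAC goal quoted on the slide)
transfer_gbytes_per_s = sustained_rate(PB, 8) / 1e9      # about 34.7 GB/s
transfer_gbits_per_s = sustained_rate(PB, 8) * 8 / 1e9   # about 278 Gbit/s

# 1 PB ingested per day
ingest_gbytes_per_s = sustained_rate(PB, 24) / 1e9       # about 11.6 GB/s

if __name__ == "__main__":
    print(f"1 PB in 8 h : {transfer_gbytes_per_s:.1f} GB/s "
          f"({transfer_gbits_per_s:.0f} Gbit/s)")
    print(f"1 PB per day: {ingest_gbytes_per_s:.1f} GB/s")
```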

13 Big Data Applications and Management
Popular systems for Big Data and its analysis:
- BigTable (Google, built on GFS): structured big data
- Cassandra: open-source DBMS for distributed data
- Dynamo (Amazon, distributed data system)
- Hadoop: the Big Data system of choice for many
- Map/Reduce: software framework for data reduction
- Pig: software for analysis of very large datasets
- Stream processors for real-time event data (sensors)
All these systems rely on massive collections of files, read/written sequentially into compute clusters
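To make "Map/Reduce" concrete, here is a single-process sketch of the programming model (map, group by key, reduce) applied to word counting. It is plain Python and does not use Hadoop's API; the function names are illustrative, and a real framework would spread each phase across a compute cluster and read its input from files.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(record: str) -> Iterator[Tuple[str, int]]:
    """Map: emit one (word, 1) pair for every word in an input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(records: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big files", "data data"]))
```

Because each record is mapped independently and each key is reduced independently, both phases can run in parallel, which is what lets the model scale out across many nodes.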

14 Social Networking Analysis Courtesy of NSF Workshop on Social Modeling

15 The Internet in 60 seconds from GoGlobe.com

16 Big Data's Impact on Business
- Big data allows companies to experiment digitally
  - What-if scenarios, simulations, extrapolations
- Big data can allow companies to segment populations
  - Based on analysis of individuals' contributed data
- Financial services & insurance have huge potential
  - Each client's characteristics can be digitally analyzed
- Consumer products & retail have huge potential
  - Loyalty program data growing exponentially
- Security and management are top challenges

17 How do you manage and design for Big Data?
- Big data necessitates a scale-out architecture
  - Must grow with ingestion rates & provide archive space
- Big data must be protected on ingestion
  - But not necessarily backed up
- Much big data is temporal: ingest, crunch, archive
- Big data is optimally managed as a single filesystem
  - No links, no stubs, no multiple mount points, no cataloging
- Typical/traditional RAID does not match big data
- Big data is typically write-once, processed sequentially
  - GB/sec for data; IOPS for metadata; scale linearly
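If throughput and capacity really do scale linearly with node count, sizing a scale-out cluster reduces to two divisions. The sketch below assumes exactly that; the per-node throughput (1 GB/s) and capacity (500 TB) are hypothetical figures chosen for illustration, not numbers from the talk.

```python
import math

SECONDS_PER_DAY = 86_400


def nodes_for_ingest(bytes_per_day: float, node_bytes_per_s: float) -> int:
    """Nodes needed to sustain the daily ingest rate, assuming linear scaling."""
    return math.ceil(bytes_per_day / SECONDS_PER_DAY / node_bytes_per_s)


def nodes_for_capacity(total_bytes: float, node_capacity_bytes: float) -> int:
    """Nodes needed to hold the retained data, ignoring protection overhead."""
    return math.ceil(total_bytes / node_capacity_bytes)


def cluster_size(bytes_per_day: float, retention_days: int,
                 node_bytes_per_s: float, node_capacity_bytes: float) -> int:
    """The cluster must satisfy both constraints, so take the larger count."""
    return max(
        nodes_for_ingest(bytes_per_day, node_bytes_per_s),
        nodes_for_capacity(bytes_per_day * retention_days, node_capacity_bytes),
    )


if __name__ == "__main__":
    # Hypothetical node: 1 GB/s sustained writes, 500 TB usable capacity.
    print(cluster_size(10**15, 30, 10**9, 500 * 10**12))
```

With 1 PB/day of ingest and 30 days of retention, capacity rather than throughput drives the node count in this example, which illustrates why the design must grow with both ingestion rate and archive space.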

18 The Conundrum
- Petabytes are not the challenge
- Exabytes are the real challenge until around 2024
- Zettabytes are the challenge 2024 and beyond
- 1 TB systems in 2000; 1 PB in 2008; 1 EB in 2016
- The architecture of systems for big data is key
- Patterson, Gibson, Katz: RAID paper (1988)
- 4 TB drives coming early '13; 6 & 8 TB in PB
- HAMR promising; ~60 TB drives in 2016?
- RAID + unstructured data is a bad match: drive BER
- To meet the challenge we must do file-level encoding
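The "drive BER" point can be illustrated with a rough estimate of how likely a RAID 5 rebuild is to hit an unrecoverable read error as drives grow. This is a sketch under stated assumptions: bit errors are independent, the rebuild reads every surviving drive in full, and the error rate is 1 in 10^14 bits read (a commonly quoted spec-sheet figure, assumed here and not taken from the talk).

```python
import math


def p_unrecoverable_read(bytes_read: float, ber: float) -> float:
    """Probability of at least one unrecoverable bit error while reading.

    Computes 1 - (1 - ber) ** bits using log1p/expm1 to stay accurate
    for very small `ber`. Assumption: bit errors are independent.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-ber))


def raid5_rebuild_risk(drives: int, drive_bytes: float, ber: float) -> float:
    """Risk that rebuilding one failed drive hits an error on the survivors.

    Assumption: the rebuild reads all (drives - 1) surviving drives in full.
    """
    return p_unrecoverable_read((drives - 1) * drive_bytes, ber)


if __name__ == "__main__":
    ASSUMED_BER = 1e-14  # assumption: one unrecoverable error per 10^14 bits
    for terabytes in (1, 4, 8):
        risk = raid5_rebuild_risk(8, terabytes * 1e12, ASSUMED_BER)
        print(f"8 x {terabytes} TB RAID 5: rebuild risk {risk:.0%}")
```

Under these assumptions an eight-drive set of 4 TB drives has roughly an 89% chance of encountering a read error during rebuild, which is the arithmetic behind the slide's claim that drive-level RAID is a poor fit at this scale.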

19 THANK YOU