Bad Data Is Polluting Big Data: Enterprises Struggle with Real-Time Control of Data Flows

A Global Survey of Big Data Professionals
June
Sponsored by:

Executive Summary

This report finds that companies leveraging big data are rarely capable of controlling their data flows. Almost 9 out of 10 companies report that bad data pollutes their data stores, and nearly three-quarters indicate that there is bad data in their stores right now. The findings also reveal a chasm between the problem detection capabilities data experts have today and the capabilities they want. This translates into a lack of real-time visibility and control over data flows, operations, quality and security.

The big data market is still maturing, especially as it relates to data in motion, as evidenced by the lack of best practices or consistent processes to clean data and manage data quality. Companies that use big data to optimize current business operations or to make strategic decisions must ensure that their big data teams have real-time visibility and control over the data at all times.

Key Findings

87% state bad data pollutes their data stores, while 74% state bad data is currently in their data stores
Ensuring data quality was the most common challenge, cited by 68% of respondents, and only 34% claimed to be good at detecting divergent data
77% responded that they hand code their data flows, while 53% claimed they have to change each pipeline at least several times a month
Tremendous gaps exist between the capabilities of today's big data flow management tools and what is needed
Only 12% of respondents rated their performance as good or excellent across all 5 key data flow operational performance areas
72% desire a single pane of glass solution to manage all data flows
81% state there is a significant operational impact when they upgrade big data components

METHODOLOGY AND PARTICIPANTS

Goals and Methodology

Research Goal
The primary research goal was to capture how companies manage the flow of big data. The research also investigated and documented current tools' capabilities, data quality, and efforts to maintain big data pipelines and infrastructure.

Methodology
Big data professionals worldwide were invited to participate in a survey on the topic of big data and ensuring data flow operations and data quality. The survey was administered electronically, and participants were offered a token compensation for their participation.

Participants
A total of 314 participants who manage big data operations completed the survey.

Companies Represented

Industry
Technology: 18%
Financial Services: 18%
Manufacturing: 12%
Healthcare: 10%
Education: 6%
Services: 6%
Government: 6%
Telecommunications: 5%
Energy and Utilities: 5%
Transportation: 5%
Retail: 4%
Non-Profit: 1%
Media and Advertising: 1%
Hospitality and Entertainment: 1%
Food and Beverage: 1%
Other: 2%

Size
500-1,000: 25%
1,000-5,000: 29%
5,000-10,000: 16%
More than 10,000: 30%

Participant Demographics

Role
IT staff responsible for implementing and operating data infrastructure (e.g. database: 56%
IT manager responsible for delivering data initiatives: 52%
IT executive with data initiatives in my portfolio: 34%
BI or Analytics Technology Owner (e.g. data architect, head of data platform): 17%
Business stakeholder who uses data to make decisions: 8%
Business analyst: 6%

Location
United States or Canada: 75%
Europe: 14%
Mexico, Central America, or South America: 4%
Australia or New Zealand: 3%
Middle East or Africa: 2%
Asia: 2%

DETAILED FINDINGS

Top 3 Challenges for Big Data Flows are Quality, Security and Reliable Operation

What challenges does your company face when managing your big data flows?

Ensuring the quality of the data (accuracy, completeness, consistency): 68%
Complying with security and data privacy policies: 60%
Keeping data flow pipelines operating effectively: 52%
Building pipelines for getting data into the data store: 47%
Upgrading big data infrastructure components (Kafka, Hadoop, etc.): 40%
Adapting pipelines to meet new requirements: 32%
We have no challenges: 1%

87% State Bad Data Pollutes Their Data Stores

Does bad data occasionally get into your data stores?

Yes: 87%
No: 13%

74% State Bad Data is Currently in Their Data Stores

Do you believe there is any bad data in your data stores currently?

Yes: 74%
No: 26%

77% of Companies Still Use Hand Coding to Build Big Data Flows

How does your company build big data flow pipelines today?

Coding with Python, Java, etc. or low-level frameworks such as Sqoop, Flume or Kafka: 77%
Using ETL or data integration tools: 63%
Using big data ingestion tools such as StreamSets, NiFi, etc.: 27%

53% Change Data Flow Pipelines At Least Several Times a Month

On average, how often are changes or fixes made to a typical data flow pipeline?

Several times a day: 3%
Several times a week: 19%
Several times a month: 31%
Several times a quarter: 26%
Several times a year: 12%
Less often than several times a year: 8%

85% State Unexpected Structure and Semantic Changes Have Substantial Impact on Dataflow Operations

When data structure or semantics unexpectedly change, how big is the impact on the operation of your big data flows (failures, slowdowns, data corruption, etc.)?

Significant impact: 31%
Moderate impact: 54%
Minor impact: 11%
Structure and semantic changes have no effect on our big data flows: 2%
Data structure and semantic changes never occur: 2%

More Than Half of Companies Lack Real Time Information About Data Flow Quality

How would you assess your ability to detect each of the following issues in real-time?

Issue: Excellent / Good / Average / Poor / None
A specific data flow pipeline has stopped operating: 16% / 46% / 29% / 9% / 1%
Data flow throughput is degrading or latency is growing: 7% / 37% / 37% / 17% / 1%
Error rates are increasing: 7% / 37% / 38% / 16% / 1%
The values of incoming data are diverging from historical norms: 5% / 29% / 43% / 20% / 3%
Personally identifiable information (credit card numbers, social security numbers) is being inappropriately placed in a data store: 18% / 33% / 30% / 13% / 6%

Only 12% Rated Their Performance as Good or Excellent Across All Five Key Data Flow Metrics

Number of Key Data Flow Metrics Participants Rated as Good or Excellent
0 Metrics: 19%
1 Metric: 17%
2 Metrics: 19%
3 Metrics: 20%
4 Metrics: 12%
All 5 Metrics: 12%

Five Key Data Flow Metrics
1. A specific data flow pipeline has stopped operating
2. Data flow throughput is degrading or latency is growing
3. Error rates are increasing
4. The values of incoming data are diverging from historical norms
5. Identify personal information within the data flows

Substantial Value In Real-Time Data Flow Detection Capabilities

In your opinion, how valuable would it be to be able to detect each of these issues in real-time?

Issue: Very valuable / Valuable / Average value / Limited value
A specific data flow pipeline has stopped operating: 42% / 42% / 14% / 3%
Data flow throughput is degrading or latency is growing: 28% / 49% / 20% / 3%
Error rates are increasing: 33% / 46% / 17% / 4%
The values of incoming data are diverging from historical norms: 23% / 46% / 26% / 4%
Identify personal information within the data flows: 40% / 35% / 18% / 6%

Gap Between Current Pipeline Real-Time Visibility Capabilities and Stated Value

A specific data flow pipeline has stopped operating

Real-time ability (62% excellent or good): Excellent 16%, Good 46%, Average 29%, Poor 9%
Assessed value (84% very valuable or valuable): Very valuable 42%, Valuable 42%, Average value 14%, Limited value 3%

Chasm Between Today's Data Flow Throughput Metrics and What is Needed

Data flow throughput is degrading or latency is growing

Real-time ability (44% excellent or good): Excellent 7%, Good 37%, Average 37%, Poor 17%, None 1%
Assessed value (77% very valuable or valuable): Very valuable 28%, Valuable 49%, Average value 20%, Limited value 3%, Not valuable 1%

Significant Gap Between Error Rate Visibility Value and Current Capabilities

Error rates are increasing

Real-time ability (44% excellent or good): Excellent 7%, Good 37%, Average 38%, Poor 16%
Assessed value (79% very valuable or valuable): Very valuable 33%, Valuable 46%, Average value 17%, Limited value 4%

Chasm Between Value of Detecting Divergent Data and Current Capabilities

The values of incoming data are diverging from historical norms

Real-time ability (34% excellent or good): Excellent 5%, Good 29%, Average 43%, Poor 20%, None 3%
Assessed value (69% very valuable or valuable): Very valuable 23%, Valuable 46%, Average value 26%, Limited value 4%, Not valuable 1%

Large Gap Between Data Privacy Value and Current Capabilities

Identify personal information within the data flows

Real-time ability (51% excellent or good): Excellent 18%, Good 33%, Average 30%, Poor 13%, None 6%
Assessed value (75% very valuable or valuable): Very valuable 40%, Valuable 35%, Average value 18%, Limited value 6%, Not valuable 2%
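
The headline percentages on the five gap slides are top-two-box sums: Excellent plus Good for real-time ability, and Very valuable plus Valuable for assessed value. The following is a minimal Python sketch of that arithmetic, using only the percentages transcribed from those slides. The names RESPONSES, top_two_gap and gap_table are illustrative and do not come from the report.

```python
# Top-two-box shares (percent) as transcribed from the five gap slides.
# Each entry maps an issue to:
#   (current ability: Excellent, Good), (stated value: Very valuable, Valuable)
# The dictionary and function names are hypothetical, chosen for illustration.
RESPONSES = {
    "Pipeline has stopped operating": ((16, 46), (42, 42)),
    "Throughput degrading or latency growing": ((7, 37), (28, 49)),
    "Error rates increasing": ((7, 37), (33, 46)),
    "Incoming values diverging from historical norms": ((5, 29), (23, 46)),
    "Personal information in data flows": ((18, 33), (40, 35)),
}


def top_two_gap(ability, value):
    """Return (ability_total, value_total, gap), all in percentage points."""
    ability_total = sum(ability)
    value_total = sum(value)
    return ability_total, value_total, value_total - ability_total


def gap_table(responses=RESPONSES):
    """Compute the top-two-box totals and the gap for every issue."""
    return {name: top_two_gap(a, v) for name, (a, v) in responses.items()}


if __name__ == "__main__":
    for name, (ability, value, gap) in gap_table().items():
        print(f"{name}: ability {ability}%, value {value}%, gap {gap} points")
```

Under these figures the gap is smallest for detecting a stopped pipeline and largest for detecting rising error rates and divergent data values.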

72% Desire A Single Pane of Glass Solution To Manage All Data Flows

How valuable is it to have a single control panel for comprehensive visibility and management across all of your data flows?

Very valuable: 24%
Valuable: 48%
Average value: 24%
Limited value: 4%

50% State that Data Cleansing at the Source is the Most Effective Quality Practice

Which of the following do you consider to be the most effective approach to ensuring data quality?

Cleanse data as it flows in from the source: 50%
Cleanse and update data once it is in the store: 27%
Data scientists or business analysts cleanse data before using it: 23%

81% State There is Significant Operational Impact to Upgrading Big Data Components

What is the operational impact of upgrading big data components (ingest technologies, message queues, data stores, search stores, etc.)?

Heavy impact: 17%
Moderate impact: 64%
Minor impact: 17%
No impact: 2%

For more information

About Dimensional Research
Dimensional Research provides practical marketing research to help technology companies make smarter business decisions. Our researchers are experts in technology and understand how corporate IT organizations operate. Our qualitative research services deliver a clear understanding of customer and market dynamics. For more information, visit

About StreamSets
Place holder For more information, visit

APPENDIX

Tremendous Gaps Exist Between Current Big Data Flow Management Tool Capabilities and What is Needed

Ability to Detect Area in Real-Time Compared Against Stated Value To Detect in Real-Time

Columns: Excellent or Very valuable / Good or Valuable / Average or Average value / Poor or Limited value / None or Not valuable

A specific data flow pipeline has stopped operating
Stated Value: 42% / 42% / 14% / 3%
Current Ability: 16% / 46% / 29% / 9% / 1%

Data flow throughput is degrading or latency is growing
Stated Value: 28% / 49% / 20% / 3% / 1%
Current Ability: 7% / 37% / 37% / 17% / 1%

Error rates are increasing
Stated Value: 33% / 46% / 17% / 4% / 0%
Current Ability: 7% / 37% / 38% / 16% / 1%

The values of incoming data are diverging from historical norms
Stated Value: 23% / 46% / 26% / 4% / 1%
Current Ability: 5% / 29% / 43% / 20% / 3%

Personally identifiable information (credit card numbers, social security numbers) is being inappropriately placed in a data store
Stated Value: 40% / 35% / 18% / 6% / 2%
Current Ability: 18% / 33% / 30% / 13% / 6%

Various Approaches To Managing Data Quality Indicate a Lack of Best Practice

Which of the following approaches for ensuring data quality does your company utilize?

Cleanse and update data once it is in the store: 55%
Cleanse data as it flows in from the source: 54%
Data scientists or business analysts cleanse data before using it: 43%

Many Must Perform Maintenance and Troubleshooting on Data Flows Routinely

Approximately what percentage of data flow changes and fixes are made for day-to-day maintenance and troubleshooting purposes?

More than 80%: 36%
60% - 80%: 27%
40% - 60%: 24%
20% - 40%: 10%
Less than 20%: 3%