Audit and Validation Testing For Big Data Applications


Ravi Shukla, Specialist Senior, Deloitte Consulting Pvt. Ltd.

Abstract

In today's world, we are awash in a flood of data. Across a broad range of application areas, data is being collected at unprecedented scale. Decisions that previously were based on guesswork, or on painstakingly constructed models of reality, can now be made based on the data itself. Big data analytics now drives a whole range of applications that affect us daily, in areas such as retail, manufacturing, healthcare, mobile services and financial services. Organizations see big data analytics as a means to reduce costs and improve coordination, quality and outcomes. To better manage their businesses, organizations are ensuring that their data, spread across different systems, is migrated to a distributed file system.

Good data is helpful in providing insights; armed with these, businesses can improve the day-to-day decisions they make. If the accuracy of data is low at the beginning of the process, insight suffers, and the decisions that data influences are also likely to be poor. Therefore, organizations must recognize the criticality of data and understand that quality is more important than quantity. Most teams prioritize gathering information without giving importance to its accuracy or to whether and how it can be used for further processing. In big data testing, QA engineers verify the successful processing of petabytes of data using commodity clusters and other supporting components. It demands a high level of testing skill, as the processing is very fast. Together, these factors place immense focus on testing data migration activities, which has become a buzzword in many industries these days. This paper attempts to highlight the significance of the Audit & Validation testing approach in the big data application landscape.

What is Big Data?

Big data is a term that describes the large volume of data, both structured and unstructured, that inundates a business on a day-to-day basis. This data is so large that it is difficult to process using traditional database and software techniques. Big data can help organizations improve operations by enabling better decisions and providing accurate insights that lead to strategic business moves.

While the term "big data" is relatively new, the act of gathering and storing large amounts of information for eventual analysis is ages old. The concept gained momentum in the early 2000s, when industry analysts articulated the now-mainstream definition of big data as the three Vs:

Fig. The three Vs of big data

1. Volume: Organizations collect data from a variety of sources, including business transactions, social media and sensor or machine-to-machine data. In the past, storing it would have been a problem, but new technologies (such as Hadoop) have eased the burden.
2. Velocity: Data streams in at unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to handle torrents of data in near-real time.
3. Variety: Data comes in all types of formats, from structured, numeric data in traditional databases to unstructured text documents, video, audio, stock ticker data and financial transactions.

Testing challenges in Big Data Analytics

"Without big data analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway."
Geoffrey Moore, big data author and consultant

Data is the lifeline of an organization, and it is getting bigger each day. In 2011, experts predicted that big data would become the next frontier of competition, innovation and productivity. Today, businesses face data challenges in terms of volume, variety and sources. Structured business data is supplemented with unstructured and semi-structured data from social media and other third parties. Finding essential data within such a large volume is becoming a real challenge for businesses, and quality analysis is the only option.

Fig. Audit & Validation key challenges

QA teams face multiple challenges in testing big data. These are detailed below:

Large volumes of diversified data
Testing any large volume of data is the biggest challenge in itself. A decade ago, a data pool of 10 million records was considered gigantic. Today, businesses have to store petabytes or exabytes of data,

extracted from various online and offline sources, to conduct their daily business. Testers are required to audit such voluminous data to ensure that it is fit for business purposes and consumption.

Data analysis
For a big data testing strategy to be effective, testers need to continuously monitor and validate the three Vs of data: volume, variety and velocity. Understanding the data and its impact on the business is the real challenge faced by any big data tester. It is not easy to scope the testing effort and strategy without proper knowledge of the nature of the available data. Testers need to understand the business rules and the relationships between different subsets of data.

Inaccurate data
The data available for big data applications is extracted from a wide variety of sources. This data is generally complex and potentially inaccurate. It needs to be tested to ensure its accuracy before being loaded onto the target big data systems.

Need for technical expertise
Technology is growing, and everyone is struggling to understand the algorithms for processing big data. Big data testers need to understand the components of the big data ecosystem thoroughly. Today, testers understand that they have to think beyond the regular parameters of automated and manual testing. Big data, with its unexpected formats, can cause problems that automated test cases fail to catch. Creating automated test cases for such a big data pool requires expertise and coordination between team members. The testing team should coordinate with the development and marketing teams to understand data extraction from different sources, data filtering, and pre- and post-processing algorithms.

Additional costs and resources
The big data testing process involves spending time and money on an additional set of resources performing data validation and verification activities.
If the testing process is not standardized and strengthened for re-use and optimization of test case sets, the test cycle and test suite grow beyond what was intended, which in turn causes increased costs, maintenance issues and delivery slippages. In manual testing, test cycles might stretch into weeks or even longer.

Audit & Validation process: how can it be a solution?

To help resolve the above challenges, an Audit and Validation (A&V) process allows the verification and validation of data flowing into the big data systems. Under the A&V process, the approach is to set up processes that perform the same transformation logic as that of the development team responsible for extracting data from different sources and loading large datasets into target systems. The extracts generated are compared with those generated by the development team. The proposed Audit and Validation process categorizes the testing as follows:

Auditing test results: Validates the extract criteria and tests whether all the data has been extracted from the source systems and loaded into the big data system (target).

Validating test results: Verifies the transformation logic applied to the data during conversion. This ensures that the transformation rules have been applied correctly to the source data extracted for loading into the target.

Objective of A&V: The objective of the Audit and Validation testing process is to ensure that the data being migrated is both validated and verified.

Fig. Audit & Validation checkpoints (business needs, checkpoints, solution validation, data verification)

Once the implementation is done, we need to verify the data against the implemented solution. In parallel, we also need to ensure that the migrated data satisfies the business need, which we can do by validating the data against the business case. These critical steps of verification and validation are performed through the Audit and Validation process.
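The idea of the QA team independently re-implementing the development team's transformation logic and comparing outputs can be sketched as follows. This is a minimal, illustrative example only: the transformation rule (normalizing a country code) and the field names are hypothetical, not taken from any specific project.

```python
# Illustrative sketch: the QA team independently re-implements the
# transformation rule and compares its output with the development
# team's extract. The rule and field names here are hypothetical.

def qa_transform(record):
    """QA team's independent implementation of the transformation rule."""
    return {
        "id": record["id"],
        "country": record["country"].strip().upper(),  # hypothetical rule
    }

def compare_extracts(dev_extract, qa_extract):
    """Return the ids of records whose transformed values disagree."""
    qa_by_id = {r["id"]: r for r in qa_extract}
    return [r["id"] for r in dev_extract if qa_by_id.get(r["id"]) != r]

# Toy data standing in for a real source extract
source = [{"id": 1, "country": " us "}, {"id": 2, "country": "in"}]
dev_extract = [{"id": 1, "country": "US"}, {"id": 2, "country": "IN"}]

qa_extract = [qa_transform(r) for r in source]
mismatches = compare_extracts(dev_extract, qa_extract)
print(mismatches)  # an empty list means both implementations agree
```

In a real pipeline the two extracts would come from distributed jobs over the actual source systems rather than in-memory lists, but the comparison principle is the same.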

The Audit & Validation process involves the following: the QA team sets up processes that implement the same transformation logic as the development team.

Audit process: This involves comparing the number of records extracted by the A&V and development teams. It detects two kinds of errors:
o More records are pulled than the expected number of records
o Fewer records are pulled than the expected number of records
This process is performed after the data is extracted and before the data is loaded.

Validation process: This involves comparing the data, field by field, that is loaded by the development team against the data extracted by the Audit and Validation team. This testing validates the quality of the data as well as the fallouts that occur while loading it.

The A&V process flow can be classified into two steps:

Fig. Audit & Validation process flow in a big data application (data sources such as Twitter videos, Fitbit and other semi-structured data; data ingestion; the Audit & Validation process producing an A&V health report; a data storage layer with a main node and data nodes; real-time processing, statistical analytics, predictive modeling, text analytics and semantic analytics; and a data visualization layer)

#1: The first level of testing is performed as soon as the data is extracted.
#2: The second level of testing is performed after the data is loaded into the big data system.
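The two checks described above can be sketched in code. This is a minimal, illustrative example with in-memory lists standing in for real source extracts and the big data target (such as Hadoop); the field names are hypothetical.

```python
# Illustrative sketch of the Audit and Validation checks.

def audit(expected_count, extracted_records):
    """Audit step: flag over- or under-extraction before load."""
    actual = len(extracted_records)
    if actual > expected_count:
        return "FAIL: more records pulled than expected"
    if actual < expected_count:
        return "FAIL: fewer records pulled than expected"
    return "PASS"

def validate(loaded, qa_extract, key="id"):
    """Validation step: field-by-field comparison after load.

    Returns (record_key, field, loaded_value, expected_value)
    tuples for every mismatching field.
    """
    expected = {r[key]: r for r in qa_extract}
    fallouts = []
    for rec in loaded:
        ref = expected.get(rec[key], {})
        for field, value in rec.items():
            if ref.get(field) != value:
                fallouts.append((rec[key], field, value, ref.get(field)))
    return fallouts

qa_extract = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]
loaded     = [{"id": 1, "amount": 100}, {"id": 2, "amount": 205}]  # one bad load

print(audit(expected_count=2, extracted_records=loaded))
print(validate(loaded, qa_extract))  # reports the 'amount' mismatch on id 2
```

The audit check maps to step #1 (post-extract, pre-load) and the field-by-field validation to step #2 (post-load).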

Relevance of the Audit & Validation process

"You can't manage what you don't measure."
Peter Drucker

For organizations handling large amounts of big data, managers can now measure, and know, radically more about their businesses, and directly translate that knowledge into improved decision making and performance. When there is huge organizational data movement, a quality check to ensure that the data has moved as expected from the source systems to the target big data application becomes imperative. Among the different techniques for big data testing, which include performance and functional testing, growing importance is being placed on data quality testing, which is best achieved by building a well-constructed Audit and Validation process. The Audit and Validation approach provides a checkpoint to ensure that the data loaded into big data systems such as Hadoop is accurate and consistent with the data sent from the different source systems.

Fig. Relevance of the Audit & Validation process (heterogeneous data passes through Audit & Validation to produce refined and validated data)

Case Study

Netflix, the world's leading internet television network, uses big data analytics (the Amazon Kinesis system) to analyze billions of bytes of data across more than 150,000 application instances daily, in real time, enabling it to optimize user experience, reduce costs and improve application resilience. Netflix is said to account for one third of peak-time internet traffic in the USA. Data from users is collected and monitored in an attempt to understand viewing habits. But its data isn't just big in the literal sense; it is the combination of this data with cutting-edge analytical techniques that makes Netflix a true big data company. The key to Netflix's success has always been predicting what its customers will enjoy watching, and big data analytics is the fuel that fires the recommendation engines designed to serve this purpose. Netflix uses big data analytics for:

Predicting viewing habits
Improving search quality
Finding the next smash-hit series
Ensuring a high-quality experience for end users
Recommendation engines
Improved ratings

Netflix has made use of audit and validation techniques to ensure that all data from various sources, such as devices and program searches, is collected and loaded into the target systems. The availability of such data ensures correct analytics for predicting user viewing habits and making recommendations based on users' search criteria.

Benefits of Audit & Validation

Audit and Validation implementation provides the following capabilities to the testing team:

Fig. Benefits of Audit & Validation

Independent validation ensuring that an accurate quantity of records is available in target big data systems for analysis
Ensures data quality
Results in accurate data analysis, which in turn enables more data-driven, customer-centric marketing, providing the opportunity to deliver more targeted messages and develop a one-to-one relationship with customers
Allows organizations to perform risk analysis
Reduces maintenance costs, as the massive amount of data available for analysis allows organizations to spot issues and predict when they might occur, resulting in a much more cost-effective replacement strategy and less downtime

Conclusion

All in all, big data testing has great prominence for today's businesses. If the right test strategies are embraced and best practices are followed, defects can be identified in early stages and overall testing costs can be reduced while achieving high big data quality. The Audit and Validation process empowers testing teams to accurately determine whether there are any inconsistencies in the data flowing into the system, and helps organizations take corrective measures when discrepancies are identified. A&V provides a capability to analyze astonishingly large data sets quickly and cost-effectively. These capabilities are neither theoretical nor trivial. They represent a genuine leap forward and a clear opportunity to realize enormous gains in efficiency, productivity, revenue and profitability. The age of big data is here, and these are truly revolutionary times if business and technology professionals continue to work together and deliver on the promise.


Author Biography

Ravi Shukla is a software professional with 12 years of industry experience, having worked primarily on data warehousing projects in the healthcare domain. He is currently working as a Test Program Manager for a leading healthcare provider based in California, USA. He is an avid traveler and loves hiking and playing volleyball.
