Getting Started: Modeling the Structure and Operations of Big Data

1 Getting Started: Modeling the Structure and Operations of Big Data
Session BG2, February 11, 2019
Deepesh Chandra, Associate Partner & Senior Expert
Pierre-Arnaud Klaskala, Associate Partner, Director of Product & Technology

2 Conflict of Interest
Deepesh Chandra, Associate Partner & Senior Expert
Pierre-Arnaud Klaskala, Associate Partner, Director of Product & Technology
Have no real or apparent conflicts of interest to report.

3 Learning Objectives
  Provide a technical overview of big data analytics
  Describe big data storage, frameworks, and other critical aspects of usable healthcare data structures
  Explore uses of healthcare structured/unstructured data and metadata
  Discuss transforming legacy data into trusted and actionable data structures
  Assess data analytics, data visualization, and business intelligence and their roles in big data

4 Contents
  Introduction
  Big data components
  Building trusted and usable data structures
  Analytics and visualization in big data
  Key learnings

5 The opportunity represented by advanced analytics and digital in healthcare, and the urgency to act
The Challenge (1)
  $3.0T: Spent on healthcare in 2015 in the US, >18% of GDP
  1.9%: Healthcare spending in the US grows 1.9 percentage points faster than GDP growth (OECD historical rate)
  0.5%: Annual growth in healthcare labor productivity in the US over this same period
The Current State (2)
  Despite massive investment in IT, the industry still lags in maturity of advanced analytics (AA) and digital capabilities
  12th out of 13 industries in the McKinsey Advanced Analytics maturity index
  8th out of 9 industries in the McKinsey Digitization maturity index
  11th out of 13 industries in terms of readiness to adopt and employ AI
The Opportunity (3)
  20%: In 2017, 20% of all local VC investment in SF went into the AI, Big Data & Analytics sub-sector
SOURCE:
  (1) OECD Policy Implications of the New Economy (2001); Global Insight WMM; Espicom: World Pharmaceutical Fact Book 2008; International Monetary Fund, World Economic Outlook Database, October 2009; McKinsey
  (2) McKinsey Global Institute: AI the Next digital frontier; The age of analytics: competing in a data-driven world
  (3) Fuel by McKinsey

6 Effective healthcare advanced analytics and digital transformations require work across the entire analytics workflow
Analytics-to-insights, Insights-to-impact
  Source of value
  Data ecosystem
  Modeling insights
  Workflow integration
  Adoption
Technology and infrastructure
Organization and governance
SOURCE: McKinsey Analytics; McKinsey Global Institute analysis

7 Big data, advanced analytics, and digital need to be combined to capture business opportunities
Big data: Ingest, manage, integrate, and analyze large and complex data
Advanced analytics: Enable more sophisticated predictive and prescriptive analytics, and work against large, incomplete, or unstructured data
Digital: Application of modern (digital) technologies to core business processes

8 Healthcare data spans the spectrum of data complexity
80% of all data is unstructured (1) AND it's growing at a CAGR of 36% (2)
Spectrum: Structured, Semi-Structured, Unstructured
Examples shown:
  Sensors and fitness trackers
  Scheduling data
  Clinical notes
  PDF, PPTX, DOCX
  EDI communications
  Social media
  Audio recording
  Medical images
  Healthcare claims
(1) Source: International Data Corporation, EMC Corporation, Harmony Healthcare IT
(2) Source: International Data Corporation

9 New opportunities create requirements that traditional data stacks cannot meet
Master blueprint for a data architecture transformation
  improving business transparency
  enabling new business insights
  increasing business agility
  lowering cost of IT and operations
SOURCE: Digital McKinsey Big Data and Advanced Analytics Compendium: "From Garage to Factory" Big Data architecture and technologies

11 What is a data lake?
Data lake
  Persists all raw source data in a common place (including history)
  Stores relational data as well as media, emails, PDFs and more (unstructured)
  Allows searching and integrating data without knowing the exact schema of the data
  Easily connects with data discovery tools to explore data
  Provides data storage and processing at extremely low cost
A data lake is NOT a data warehouse
  No facility to generate reports
  No harmonization or integration of data
  Data may be wrong or inaccurate
SOURCE: Digital McKinsey Big Data and Advanced Analytics Compendium: "From Garage to Factory" Big Data architecture and technologies
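
The "persist all raw data, including history" and "no exact schema needed" properties can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the slides: the `RawStore` class, the `read_fields` helper, and the sample records are invented for illustration, and an in-memory list stands in for real lake storage.

```python
import json


class RawStore:
    """Hypothetical append-only store: every delivery is kept as raw bytes."""

    def __init__(self):
        self._objects = []  # never updated in place, so history is preserved

    def put(self, source, name, payload):
        """Persist a raw payload without parsing or validating it."""
        self._objects.append({"source": source, "name": name, "payload": payload})

    def history(self, source, name):
        """Return every stored version of an object, oldest first."""
        return [o for o in self._objects
                if o["source"] == source and o["name"] == name]


def read_fields(stored, fields):
    """Schema-on-read: the reader chooses which JSON fields to project."""
    record = json.loads(stored["payload"])
    return {f: record.get(f) for f in fields}


store = RawStore()
store.put("emr", "patient-1", b'{"id": 1, "name": "A"}')
store.put("emr", "patient-1", b'{"id": 1, "name": "A", "allergy": "latex"}')

versions = store.history("emr", "patient-1")
latest = read_fields(versions[-1], ["id", "allergy"])
```

Because nothing is validated on write, the second delivery can add a field the first one lacked; a reader asking an older version for `allergy` simply gets `None`. That flexibility is also why the slide warns that lake data may be wrong or inaccurate.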

12 The data lake is the first step of the analytics journey and the center of the big data stack
Data Lake: Collection of a comprehensive and valid data set (transfer / clean up / expand)
Analytics Garage: Analytics Garage with a variety of tools for analyzing the data (analyze / test / optimize)
Rapid Prototyping: Fast development of a prototype based on convincing ideas (visualize / test / improve)
App Factory: Development of successful prototypes as solutions (develop / automate / operate)
Supporting elements: Data transfer, Workplace, Backups, External data sources, Workflow automation
SOURCE: Digital McKinsey Big Data and Advanced Analytics Compendium: "From Garage to Factory" Big Data architecture and technologies

13 Landing zone, data lake and analytics environment constitute the central elements of the data lake architecture
Architecture and data flow
  Data sources
    Plain data without tagging
  A Landing zone
    Plain data with basic tagging
  B Data lake
    D Raw data fully tagged
  C Advanced Analytics Environment
    Prepared data, Data for Analysis
SOURCE: Digital McKinsey Big Data and Advanced Analytics Compendium: "From Garage to Factory" Big Data architecture and technologies
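
One reading of this flow is that a data set gains tags as it moves from source to landing zone to data lake. The sketch below assumes that reading. The `Dataset` class, the function names, and the specific tags (`source_system`, `owner`, `sensitivity`) are hypothetical; the slide does not say which tags count as "basic" or "full".

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Dataset:
    name: str
    zone: str = "source"          # source -> landing -> data_lake
    tags: dict = field(default_factory=dict)


def land(ds, source_system):
    """Move source data into the landing zone and attach a basic tag."""
    return replace(ds, zone="landing",
                   tags={**ds.tags, "source_system": source_system})


def promote_to_lake(ds, owner, sensitivity):
    """Complete the tagging and move the data set into the data lake."""
    if ds.zone != "landing":
        raise ValueError("only landed data can be promoted to the data lake")
    return replace(ds, zone="data_lake",
                   tags={**ds.tags, "owner": owner, "sensitivity": sensitivity})


claims = Dataset("claims_2018")                 # plain data without tagging
landed = land(claims, "billing")                # basic tagging
in_lake = promote_to_lake(landed, owner="finance", sensitivity="PHI")
```

Each step returns a new object rather than mutating the old one, so the untagged and partially tagged states remain inspectable, and the zone check keeps untagged source data from skipping the landing zone.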

14 The data lake is structured into different zones that distinguish raw and production data
Landing zone
Data Lake
  Governance: Data catalogue, Taxonomy, Lineage, Access management, Retention management
  Raw zone: Tagging describes data
  Preparation
  Production zone
  Storage: File storage, Relational DB, Graph DB (accessed via API)
Advanced Analytics Environment
SOURCE: Digital McKinsey Big Data and Advanced Analytics Compendium: "From Garage to Factory" Big Data architecture and technologies

15 The production zone is composed of further subzones for specialized production purposes
Landing zone
Data Lake
  Governance: Data catalogue, Taxonomy, Lineage, Access management, Retention management
  Raw zone: Tagging describes data
  Preparation
  Production zone: I Use case, II Corporate, III Satellite (analytics zone, production zones)
Advanced analytics env. (via APIs): Analytics workbench, Analytical apps, DWH
SOURCE: Digital McKinsey Big Data and Advanced Analytics Compendium: "From Garage to Factory" Big Data architecture and technologies

16 Big data reference architecture
Data Sources
  Structured Data: Electronic Medical Records, Billing and Charge Data, PO/Supply Chain, HR and operational data
  Unstructured Data: Medical images
  External Sources: Web Logs, Social Media
Data Ingestion
  Batch Processing: Batch Ingestion (Extract & Load)
  Near Realtime/Realtime Processing: Streaming Ingestion (Extract & Load)
Data Lake
  Enterprise Data Lake, Big Data Preparation Tool, Transient Landing Zone, Stream processing layer
Data preparation Layer
  Curated Zone, ODS Layer (Warm Data), Cleansed, Validated Customer data, Multi-Domain MDM, Golden Records, Delivery Hub to Source System
Data Access Layer
  Real time Views
Analytics Layer
  Collaborative Data science Platform, Streaming Analytics
Serving Layer
  Data marts, Real time analytical decisions
Frontend Layer
  Customer 360 degree Platform, Business Intelligence Dashboards, Analytical Apps
Hot path to support streaming use cases
Hosting, Security, Monitoring and Scheduling
Meta data management, Data Governance, Data Lineage

17 The big data and analytics tool vendor landscape is immensely diverse and highly dynamic
Vendor landscape mapped to the reference architecture layers: Data Sources, Data Ingestion, Data Lake, Data preparation Layer, Analytics Layer, Serving Layer, Frontend Layer
SOURCE: Digital McKinsey Big Data and Advanced Analytics Compendium: "From Garage to Factory" Big Data architecture and technologies

19 Key data governance processes and supporting tools
Dimensions and key things to have
Data governance
  Data owners defined
  Data governance body
  Define data governance process
Tools
  Data quality: Data quality tool deployed, covering data profiling, matching, cleansing, monitoring
  Metadata mgmt: Business glossary, Metadata management software, Data lineage, ETL code generation automated
  Master data mgmt: MDM tool, Integration with other systems and processes
SOURCE: Digital McKinsey - Building best-in-class Data Management Architecture

20 Data quality diagnostic criteria
Quality
  "Good": Values are presented fully and sufficiently (filled in for 90% and above)
  "Satisfactory": Insignificant gaps (<30%) in at least one attribute
  "Poor": >30% of gaps in at least one attribute
Correctness
  No outliers (>500% of the median) (1)
  More than 1% of outliers with a delta of more than 500% of the median
Time completeness
  Number of entries per month from the start of data acquisition deviates by less than 50% from the median (1)
  Number of entries deviates from the mean by more than 50% in at least one of the periods
  Number of entries deviates from the mean more than 2 times in at least one of the periods
Normalization (2)
  Table refers to clear directories; there is a unique key
  Data are stored in a big table, no directories available; key is not available
(1) Except for pre-agreed cases
(2) Optional criterion for organizing data in Vertica or DB2
SOURCE: Digital McKinsey - Building best-in-class Data Management Architecture
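
The fill-rate and outlier thresholds on this slide can be turned into a small check. This is a sketch under stated assumptions, not the tool the presenters used: missing values are represented as `None`, a gap share of exactly 30% is treated as "Poor" (the slide leaves that boundary open), and an outlier is read as a value whose distance from the median exceeds 500% of the median. The function names are hypothetical.

```python
from statistics import median


def completeness_rating(values):
    """Rate one attribute by its share of missing (None) values."""
    gap_share = sum(v is None for v in values) / len(values)
    if gap_share <= 0.10:       # filled in for 90% and above
        return "Good"
    if gap_share < 0.30:        # insignificant gaps (<30%)
        return "Satisfactory"
    return "Poor"               # assumption: exactly 30% also counts as poor


def outlier_share(values):
    """Share of present values whose delta from the median exceeds 500% of it."""
    present = [v for v in values if v is not None]
    med = median(present)
    # Assumption: "delta" means absolute distance from the median.
    outliers = [v for v in present if abs(v - med) > 5 * abs(med)]
    return len(outliers) / len(present)


length_of_stay = [3, 4, 2, None, 5, 3, 4, 3, 2, 40]
rating = completeness_rating(length_of_stay)
share = outlier_share(length_of_stay)
```

An attribute would then be flagged on correctness when `outlier_share` exceeds 0.01, matching the slide's "more than 1% of outliers" wording. Note that a median of zero makes every nonzero value an outlier under this reading, so such attributes need a pre-agreed exception.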

21 Example of end product data quality diagnostics

22 Data catalog tools usually come with 8 core functionalities
Data catalog capabilities
  1. Metadata repositories
  2. Business glossary
  3. Data lineage
  4. Impact analysis
  5. Rules management
  6. Semantic frameworks
  7. Metadata ingestion
  8. Collaboration
SOURCE: Digital McKinsey - Data catalogs as metadata management solution
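
Data lineage and impact analysis are closely related: once the catalog records which data set feeds which, impact analysis is a downstream traversal of that graph. The sketch below is a minimal illustration, not the API of any catalog product; the `Lineage` class and the data set names are hypothetical.

```python
from collections import defaultdict, deque


class Lineage:
    """Hypothetical lineage graph: edges point from upstream to downstream."""

    def __init__(self):
        self._downstream = defaultdict(set)

    def add_edge(self, upstream, downstream):
        """Record that `downstream` is derived from `upstream`."""
        self._downstream[upstream].add(downstream)

    def impacted_by(self, dataset):
        """Impact analysis: everything reachable downstream of `dataset`."""
        seen, queue = set(), deque([dataset])
        while queue:
            current = queue.popleft()
            for nxt in self._downstream.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen


lineage = Lineage()
lineage.add_edge("claims_raw", "claims_clean")
lineage.add_edge("claims_clean", "cost_dashboard")
lineage.add_edge("claims_clean", "readmission_model")

affected = lineage.impacted_by("claims_raw")
```

A change to `claims_raw` therefore touches the cleaned table and both consumers built on it, while a leaf such as `cost_dashboard` has no downstream impact.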

24 Analytics and visualization are fed from the data lake
Data Lake: Collection of a comprehensive and valid data set (transfer / clean up / expand)
Analytics Garage: Analytics Garage with a variety of tools for analyzing the data (analyze / test / optimize)
Rapid Prototyping: Fast development of a prototype based on convincing ideas (visualize / test / improve)
App Factory: Development of successful prototypes as solutions (develop / automate / operate)
Supporting elements: Data transfer, Workplace, Backups, External data sources, Workflow automation
SOURCE: Digital McKinsey Big Data and Advanced Analytics Compendium: "From Garage to Factory" Big Data architecture and technologies

25 A typical big data stack has a range of coding and visualization tools
Clients: Plain coding, Graphical coding, Exploration and Visualization
Supporting infrastructure services
Options for compute engines
  Application server compute (analyst workbench)
  Plain compute (analyst backend)
  Database compute (data lake)
Ext. APIs
Other labels: Sparkling Water, Specific Use Cases, MapReduce, Server
SOURCE: Digital McKinsey Big Data and Advanced Analytics Compendium: "From Garage to Factory" Big Data architecture and technologies

27 We believe that effective healthcare advanced analytics and digital transformations require work across the entire analytics workflow
Analytics-to-insights, Insights-to-impact
  Source of value
  Data ecosystem
  Modeling insights
  Workflow integration
  Adoption
Technology and infrastructure
Organization and governance
SOURCE: McKinsey Analytics; McKinsey Global Institute analysis

28 Five insights into building a great big data analytic platform
  #1 - Ensure everything you do starts delivering impact within six months
  #2 - Use existing data to build in bite-size chunks
  #3 - Deploy analytics only to solve tangible business problems
  #4 - Invest twice as much in your talent, culture, and processes as in tools
  #5 - Democratize data across your business to catalyze innovation from within

29 Questions
Deepesh Chandra, Associate Partner & Senior Expert
Pierre-Arnaud Klaskala, Associate Partner, Director of Product & Technology
Please complete the online session evaluation!