The real life of Data Scientists, or why we created the Data Science Studio

1 The real life of Data Scientists or why we created the Data Science Studio

2 About me
Marc Batty
Co-founded Dataiku in 2013, Chief Customer Officer
Formerly: Unilog, American Express, Logica Business Consulting
Data (small), Data (big), Data (smart), Data (open)

3 What is all this fuss about data scientists?

4 No, they are not unicorns!

5 The A-team
- Tech
- Stats & Analytics
- Comm. & storytelling
- Business

6 So what's different now?
Yeah, I already have a
[ ] research analyst
[ ] business analyst
[ ] data miner
[ ] data analyst
[ ] statistician
[ ] engineer
[X] somebody-who-fills-my-slides-with-figures-and-graphs

7 Data as a factor of production

8 Big Data
- Smart Cities
- Logistics Optimization
- Fraud Detection
- Optimized Targeting
- Churn Reduction
- Customer Segmentation
- Parking Prediction
- Predictive Maintenance

9 The Data Deluge is an opportunity
More Data > Clever Math

10 Data Scientists, always on the hunt for new signals

11 Reproducibility, Transparency
They embrace best practices from research and software engineering
DRY, KISS
Manifesto for Agile Software Development
We are uncovering better ways of developing software by doing it and helping others do it. Through this work we have come to value:
- Individuals and interactions over processes and tools
- Working software over comprehensive documentation
- Customer collaboration over contract negotiation
- Responding to change over following a plan
That is, while there is value in the items on the right, we value the items on the left more.

12 They build simple things first, add complexity when needed
Be ready to throw some stuff away and iterate fast

13 Let the serendipity happen
- Bring strong sponsorship
- Break data silos
- Use a dedicated infrastructure (or go to the cloud!)
- Give data scientists a bit of time for exploring datasets
Beware of Big Data Barry on the way*
"Listen, I've been in this field for 22 years. The Bayesian guys in the modeling group are never gonna talk to the IT guys because they don't speak the same language. In my 22 years of experience, what we need are tighter standards around what the processes should be for requesting data, how that data should be stored, and who should have access to the data. Also privacy. Privacy is a thing about which I have no clue, but nonetheless I'm compelled to steamroll even the most benign use of our data for anything beyond occupying a database. Oh, and speaking of databases and my 22 years of experience, we need stricter governance about the schemas and policies that inform the ways the data gets federated, so the model guys will stop trying to implement things that'll never work..."
* Borrowed from Kaggle's Just the Basics

14 Data Scientists, be ready!

15 Work with a few different techs

16 Don't be fooled by (one) algorithm
[Flowchart: choosing a method for an analytical dataset. Decision questions shown: I just want to explore / I want to predict a variable; I'm looking variable by variable, or pairs; all my variables are numeric; I have a distance matrix; I'm looking for clusters; I know how many groups to look for; I value interpretability; small dataset (<<1K) / medium dataset (<<100K); I can sample.]
Methods shown, grouped by family:
- Exploratory Data Analysis: Data Viz, Correlation Analysis, parametric and non-parametric statistical tests, PCA, CA, MDS
- Unsupervised Learning: Partitioning (K-means), K-means + Gap / Silhouette, GMM, DP GMM, HCA, Affinity Propagation, Mean Shift, 2-step clustering
- Supervised Learning*: Generalized Linear Model (GLM), Simple Decision Tree, Generalized Additive Model, MARS, K-Nearest Neighbors, Support Vector Machines, Neural Networks, Ensembles (Random Forest, Gradient Boosted Trees)
* Methods generally working for both classification & regression
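One way to avoid being fooled by a single algorithm is to score several candidates on the same data with the same validation scheme. The sketch below is a hypothetical illustration, not part of the original deck: the toy dataset and the three tiny classifiers (a majority-class baseline, 1-nearest-neighbor and nearest-centroid) are assumptions chosen so that the example runs with the Python standard library only.

```python
from collections import Counter
from math import dist  # Python 3.8+

# Hypothetical toy dataset (assumption): 2-D points, two well-separated classes.
data = [
    ((1.0, 1.2), "a"), ((0.8, 1.0), "a"), ((1.2, 0.9), "a"),
    ((0.9, 1.1), "a"), ((1.1, 1.3), "a"),
    ((3.0, 3.1), "b"), ((3.2, 2.9), "b"), ((2.9, 3.0), "b"),
]

def majority_class(train, point):
    """Baseline: ignore the point and predict the most common training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]

def nearest_neighbor(train, point):
    """1-nearest-neighbor: predict the label of the closest training point."""
    return min(train, key=lambda row: dist(row[0], point))[1]

def nearest_centroid(train, point):
    """Predict the label whose class mean is closest to the point."""
    groups = {}
    for features, label in train:
        groups.setdefault(label, []).append(features)
    centroids = {
        label: tuple(sum(col) / len(col) for col in zip(*pts))
        for label, pts in groups.items()
    }
    return min(centroids, key=lambda label: dist(centroids[label], point))

def leave_one_out_accuracy(model, rows):
    """Hold out each row in turn, train on the rest, and score the prediction."""
    hits = 0
    for i, (features, label) in enumerate(rows):
        train = rows[:i] + rows[i + 1:]
        hits += model(train, features) == label
    return hits / len(rows)

# Score every candidate with the same leave-one-out procedure.
scores = {
    model.__name__: leave_one_out_accuracy(model, data)
    for model in (majority_class, nearest_neighbor, nearest_centroid)
}
best = max(scores, key=scores.get)
print(scores, best)
```

On real data the candidate list would come from the relevant branch of the flowchart, and the comparison would weigh interpretability and dataset size as well as accuracy.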

17 And because all this can be daunting

18 What data scientists need is

19 Data Science Studio
For all profiles
- A platform designed for the whole team
- A smooth learning curve
- Advanced parameters and integrated code editors

20 Create value with data-driven applications

21 Create value with data-driven applications Enrich / Combine / Compute

23 Make your team the Data Science Super Star
CODE + VISUAL
- A web-based, collaborative platform for all team members
- Visual interfaces + development features in the same tool
- Easy to learn and progress

24 Ease and speed up the whole workflow
VISUAL DATA PREPARATION + GUIDED MACHINE LEARNING
- The one-stop-shop tool for Data People
- Visual Data Preparation & Guided Machine Learning to boost your innovation cycle

25 Get ready for production
Development turns directly into a reproducible workflow, ready to be scheduled and run every day
- Build / Run
- Project / Application / API
- Acquire, Prepare, Explore, Model, Assess, Deploy
- Advanced Scheduler, Monitoring & Dashboarding, Model Lifecycle Management
- API Builder, REST API, self-contained environment
[Diagram labels: Overview, Target solution, Source data, API logs, Fraud analysis & modeling, Production application, Scoring engine, Coverage request (PEC), Fraud score + decision, Internal applications, External applications; Data Science Studio (Data Lab), Data Science Server (Production), Data Science API (Real Time Integration)]

26 Our customers (40+)
Web
- Web journey analysis and behavioral segmentation
- Subscriber churn anticipation
- Sales forecasting
Industry & Infrastructure
- Preventive maintenance and reduced impact of hardware failures
- Logistics optimization
- Smart Cities
Banking & Insurance
- Fraud detection
- Risk anticipation (payment default)
- Detection of life events

27 Try DSS: