Continuous Live Monitoring of Machine Learning Models with Delayed Label Feedback


1 Continuous Live Monitoring of Machine Learning Models with Delayed Label Feedback. Patrick Baier, Lorand Dali. Zalando Payments, June 2018

2 OUTLINE Who we are and what we do. Why we should monitor. Prediction monitoring. Our implementation.

3 WHO WE ARE AND WHAT WE DO

4 WHO WE ARE Patrick Baier: Data Scientist at Zalando (~3.5 years), PhD in Computer Science from the University of Stuttgart. Lorand Dali: Data Scientist at Zalando (~1.5 years), Diploma in Computer Science from the Technical University of Cluj-Napoca.

5 WHAT WE DO Detect and prevent payment fraud across the order flow: Shopping, Shipping, Payment.

7 MODEL TRAINING ML Model (LR, RF, GBT, …)

8 RUNTIME SYSTEM x → ML Model → p_fraud. REST service; Scala Play service with Spark bindings; response time: < 1 second.
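The transcript only names the stack, so as an illustration here is a minimal sketch of what such a Play scoring endpoint could look like; the controller name, JSON shape, and stub model are assumptions, not the production service.

```scala
// Minimal sketch of a Play REST endpoint that scores one request.
// Controller name, JSON shape and the stub model are illustrative assumptions.
import javax.inject.{Inject, Singleton}
import play.api.libs.json.{JsValue, Json}
import play.api.mvc.{AbstractController, Action, ControllerComponents}

@Singleton
class FraudScoringController @Inject()(cc: ControllerComponents) extends AbstractController(cc) {

  // Stand-in for the Spark-trained model loaded by the real service.
  private def predict(features: Map[String, Double]): Double = 0.5

  // POST a JSON object of feature name -> value, get back p_fraud.
  def score: Action[JsValue] = Action(parse.json) { request =>
    val features = request.body.as[Map[String, Double]]
    Ok(Json.obj("p_fraud" -> predict(features)))
  }
}
```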

9 OUR TECH STACK

10 WHY WE SHOULD MONITOR

11 SCENARIO Let's deploy a model for fraud detection in an online shop! Steps we take: 1. Collect training data. 2. Train a model. 3. Deploy it to production.

12 COLLECT DATA Go through the systems (log data, database, data warehouse) and collect the data for training.

13 TRAINING DATA Table with columns #, Feature-1, Time-to-order [s], …, Feature-N, Label; the example rows carry the labels not-fraud, fraud, not-fraud, not-fraud, fraud (feature values not preserved in the transcript).

14 FEATURE DISTRIBUTION (Plot: distribution of the Time-to-order [s] feature in the training data.)

15 GO LIVE Microservice → x → ML Model → p_fraud. Once we are live, the features x are sent over by a different microservice in real time.

16 MONITORING x → ML Model → p_fraud. Standard service metrics: CPU usage, memory usage, latency, …

17 CHANGE OF MOODS Some weeks later, people are angry: "We fail to detect fraud, our business is ruined!" What happened?

18 INVESTIGATION x → ML Model → p_fraud. Check the service metrics: CPU usage, memory usage, latency, …

19 INVESTIGATION x → ML Model → p_fraud. Look at the Time-to-order feature on live data.

20 INVESTIGATION x → ML Model → p_fraud. Time-to-order: the mean shifted from 200 to …!

21 INVESTIGATION Time-to-order: the feature is sent to us in milliseconds (not in seconds)! All our predictions are corrupted!

22 PROBLEMS 1. We lost a lot of money. 2. We did not detect it in time. 3. We could have detected it in time and provided a fix.

23 CONCLUSIONS We need to make sure that the distributions of the input features at serving time are (always) the same as in training.

24 PREDICTION MONITORING

25 FAILING FEATURES Monitor failing input features. Table with columns feature name and fraction; rows: feature one, feature two, feature three, feature four (values not preserved in the transcript).
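As an illustration of this report, here is a minimal sketch that computes, per feature, the fraction of scored requests in which the feature was missing; the request representation is an assumption, not the talk's actual schema.

```scala
// Minimal sketch of the "failing features" report.
object FailingFeatures {
  // A scored request with possibly missing feature values (assumed schema).
  case class ScoredRequest(features: Map[String, Option[Double]])

  // For each feature, the fraction of requests in which it was missing.
  def failingFraction(requests: Seq[ScoredRequest]): Map[String, Double] = {
    val featureNames = requests.flatMap(_.features.keys).distinct
    featureNames.map { name =>
      val failed = requests.count(r => r.features.get(name).flatten.isEmpty)
      name -> failed.toDouble / requests.size
    }.toMap
  }
}
```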

26 LIVE MONITORING Compare feature distributions and the output probability: a Quality Monitor compares the feature distribution on live data against the feature distribution on test data.

27 LIVE MONITORING

28 LIVE MONITORING Compare distributions with the KS distance. Table: rows are the feature names (feature one, feature two, feature three, feature four) and the prediction; columns are this vs previous, this vs test, previous vs test (distance values not preserved in the transcript).
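For reference, a minimal sketch of the two-sample KS distance behind this comparison (the maximum absolute difference between two empirical CDFs); function and variable names are illustrative.

```scala
object KsDistance {
  // Two-sample Kolmogorov-Smirnov distance, in [0, 1]; larger = bigger shift.
  def ksDistance(a: Seq[Double], b: Seq[Double]): Double = {
    val sa = a.sorted.toIndexedSeq
    val sb = b.sorted.toIndexedSeq
    // Empirical CDF: fraction of sample values <= x.
    def ecdf(sorted: IndexedSeq[Double], x: Double): Double = {
      val firstGreater = sorted.indexWhere(_ > x)
      val count = if (firstGreater < 0) sorted.length else firstGreater
      count.toDouble / sorted.length
    }
    // The supremum is attained at one of the observed points.
    (sa ++ sb).map(x => math.abs(ecdf(sa, x) - ecdf(sb, x))).max
  }
}
// e.g. KsDistance.ksDistance(thisWindowTimeToOrder, testSetTimeToOrder)
```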

29 WINDOW SIZE How big should the window size for data aggregation be? t = 1h: + fast detection of anomalies, - suffers from short-term seasonalities. t = 12h: + deals with short-term seasonalities, - slow detection of anomalies.

30 EXECUTION SCHEDULE How often should we analyze? Every hour? (Figure: successive t = 12h windows on the timeline after go-live.)

31 EXECUTION SCHEDULE How often should we analyze? Every hour? More often: + detect anomalies more quickly, - higher complexity, higher costs. Less often: + less complex, lower costs, - delayed anomaly detection.

32 LIVE MONITORING Possible discoveries: technical problems, seasonalities, changes of behaviour, fraud waves, fraud patterns, deviations from expectations.

33 IMPLEMENTATION

34 DISTANCE BETWEEN TWO DISTRIBUTIONS Sum the pointwise differences between the two distributions and normalize to [0, 1].
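The exact formula is not preserved in the transcript; below is a minimal sketch of one distance consistent with "sum and normalize to [0, 1]", assuming the pointwise differences between the two CDFs are averaged over a fixed evaluation grid.

```scala
object CdfDistance {
  // Assumption: the slide's distance averages |F_A(x) - F_B(x)| over an
  // evaluation grid; each term is in [0, 1], so the mean is too.
  // cdfA and cdfB would come from the digests of the two windows being compared.
  def cdfDistance(cdfA: Double => Double, cdfB: Double => Double,
                  grid: Seq[Double]): Double =
    grid.map(x => math.abs(cdfA(x) - cdfB(x))).sum / grid.size
}
```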

35 WE USE THE CDF (Figure panels: cdf, pdf, percentiles, histogram.)

36 USING TDIGEST TO OBTAIN THE CDF From an in-memory collection and from a distributed collection.
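A minimal sketch of the in-memory case, assuming the com.tdunning:t-digest Java library (method names should be checked against the version in use); the distributed case is only hinted at here via Spark's approxQuantile as one alternative way to get percentiles.

```scala
// Minimal sketch: build a t-digest from values and read the CDF / percentiles.
import com.tdunning.math.stats.TDigest

object CdfFromTDigest {
  // From an in-memory collection:
  def digestOf(values: Seq[Double], compression: Double = 100.0): TDigest = {
    val td = TDigest.createMergingDigest(compression)
    values.foreach(v => td.add(v))
    td
  }

  def main(args: Array[String]): Unit = {
    val timeToOrderSeconds = Seq(12.0, 80.0, 150.0, 200.0, 260.0, 900.0) // toy data
    val td = digestOf(timeToOrderSeconds)
    println(td.cdf(300.0))    // P(Time-to-order <= 300 s)
    println(td.quantile(0.5)) // median
    // From a distributed collection, Spark's approxQuantile is one alternative:
    // df.stat.approxQuantile("time_to_order", (1 to 99).map(_ / 100.0).toArray, 0.001)
  }
}
```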

37 PREDICTION SERVING

38 PREDICTION SERVING (Figure: prediction service → SQS → (Lumberjack) process.)

39 PROCESSING THE SQS MESSAGES
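The transcript does not show the consumer itself; as an illustration, here is a minimal sketch of polling SQS with the AWS SDK for Java v1, with a hypothetical queue URL and message handling.

```scala
// Minimal sketch of an SQS polling loop; queue URL and body handling are hypothetical.
import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import com.amazonaws.services.sqs.model.ReceiveMessageRequest
import scala.jdk.CollectionConverters._

object SqsConsumer {
  def main(args: Array[String]): Unit = {
    val sqs = AmazonSQSClientBuilder.defaultClient()
    val queueUrl = "https://sqs.eu-central-1.amazonaws.com/123456789012/prediction-logs" // hypothetical
    while (true) {
      val request = new ReceiveMessageRequest(queueUrl)
        .withMaxNumberOfMessages(10)
        .withWaitTimeSeconds(20) // long polling
      val messages = sqs.receiveMessage(request).getMessages.asScala
      messages.foreach { m =>
        // Persist the prediction event (features, p_fraud, timestamp) for later analysis.
        println(m.getBody)
        sqs.deleteMessage(queueUrl, m.getReceiptHandle)
      }
    }
  }
}
```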

40 PUTTING IT TOGETHER IN AWS DATA PIPELINE Logs → process logs → group by models → get failed features → create t-digests → create histograms → feature distances → Reports and Alerts.

41 FINAL NOTES If you have an ML system deployed in production, then you have to monitor it somehow. Monitoring is especially important if performance feedback is delayed. Start simple and non-intrusive, and keep the reports flexible. Automate as much as possible. To measure how far you are with monitoring, go through the questions in this paper from Google: "What's your ML Test Score? A rubric for ML production systems".

42 THANK YOU! Patrick Baier & Lorand Dali