Continuous Live Monitoring of Machine Learning Models with Delayed Label Feedback
Transcription
1 Continuous Live Monitoring of Machine Learning Models with Delayed Label Feedback. Patrick Baier, Lorand Dali, Zalando Payments, June 2018
2 OUTLINE: Who we are and what we do; Why we should monitor; Prediction monitoring; Our implementation
3 WHO WE ARE AND WHAT WE DO
4 WHO WE ARE: Patrick Baier, Data Scientist at Zalando (~3.5 years), PhD in Computer Science from Uni Stuttgart. Lorand Dali, Data Scientist at Zalando (~1.5 years), Diploma in Computer Science from the Technical University of Cluj Napoca.
5 WHAT WE DO: Detect and prevent payment fraud across Shopping, Payment, and Shipping.
7 MODEL TRAINING: ML Model (LR, RF, GBT, ...)
8 RUNTIME SYSTEM: features x go in, p_fraud comes out. REST service; Scala Play service with Spark bindings; response time: <1 second.
9 OUR TECH STACK
10 WHY WE SHOULD MONITOR
11 SCENARIO: Let's deploy a model for fraud detection in an online shop! Steps we take: 1. Collect training data. 2. Train a model. 3. Deploy it to production.
12 COLLECT DATA: Go through the systems (log data, database, data warehouse) and collect data for training.
13 TRAINING DATA: a table of labeled examples with columns #, Feature-1, ..., Time-to-order [s], ..., Feature-N, and Label (fraud / not-fraud).
14 FEATURE DISTRIBUTION: distribution of the Time-to-order [s] feature in the training data.
15 GO LIVE: Once we are live, the features x are sent over to the ML model by a different microservice in real time.
16 MONITORING: around the ML model we monitor CPU usage, memory usage, latency, ...
17 CHANGE OF MOODS: Some weeks later, people are angry: "We fail to detect fraud, our business is ruined!" What happened?
18 INVESTIGATION: CPU usage, memory usage, and latency around the model all look fine.
19 INVESTIGATION: look at the distribution of the Time-to-order feature going into the model.
20 INVESTIGATION: the mean of Time-to-order has shifted far away from 200!
21 INVESTIGATION: the feature is sent to us in milliseconds (not in seconds)! All our predictions are corrupt!
22 PROBLEMS: 1. We lost a lot of money. 2. We did not detect it in time. 3. We could have detected it in time and provided a fix.
23 CONCLUSIONS: We need to make sure that the distributions of the input features are (always) the same as in training.
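A check along these lines could have caught the milliseconds incident automatically. A minimal Python sketch; the function name, request data, and the 3x alerting threshold are illustrative assumptions, not the talk's actual implementation:

```python
import statistics

def check_feature_drift(train_values, live_values, max_ratio=3.0):
    """Flag a feature whose live mean drifts far from its training mean.

    max_ratio is a hypothetical alerting threshold: alert if the live
    mean differs from the training mean by more than this factor.
    """
    train_mean = statistics.mean(train_values)
    live_mean = statistics.mean(live_values)
    ratio = live_mean / train_mean if train_mean else float("inf")
    return ratio > max_ratio or ratio < 1.0 / max_ratio

# Time-to-order was trained in seconds (mean ~200), then suddenly
# arrives in milliseconds:
train = [180, 200, 220, 190, 210]
live_ok = [195, 205, 210]            # still seconds -> no alert
live_bad = [195000, 205000, 210000]  # milliseconds -> alert

print(check_feature_drift(train, live_ok))   # False
print(check_feature_drift(train, live_bad))  # True
```

A mean-ratio check is the crudest possible drift detector; the later slides replace it with a proper distance between full distributions.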
24 PREDICTION MONITORING
25 FAILING FEATURES: Monitor failing input features, as a table of feature name (feature one, feature two, feature three, feature four) vs. the fraction of requests in which that feature fails.
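Such a table can be filled by counting, per feature, the fraction of live requests in which the feature is missing or null. A small Python sketch; the request format and feature names are hypothetical:

```python
from collections import Counter

def failing_fractions(requests, feature_names):
    """Fraction of requests in which each feature is missing or None."""
    failed = Counter()
    for req in requests:
        for name in feature_names:
            if req.get(name) is None:
                failed[name] += 1
    n = len(requests)
    return {name: failed[name] / n for name in feature_names}

requests = [
    {"time_to_order": 200, "basket_value": 50.0},
    {"time_to_order": None, "basket_value": 80.0},
    {"time_to_order": 150},  # basket_value missing entirely
    {"time_to_order": 300, "basket_value": 20.0},
]
print(failing_fractions(requests, ["time_to_order", "basket_value"]))
# {'time_to_order': 0.25, 'basket_value': 0.25}
```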
26 LIVE MONITORING: Compare feature distributions and the output probability: a Quality Monitor compares the feature distribution on live data against the feature distribution on test data.
27 LIVE MONITORING
28 LIVE MONITORING: Compare distributions with the KS distance. For every feature (feature one, ..., feature four) and for the prediction itself, compute the distance for three window pairs: this vs. previous, this vs. test, and previous vs. test.
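The KS distance between two samples is the maximum absolute difference between their empirical CDFs. A self-contained Python sketch (in practice one would typically use scipy.stats.ks_2samp; the window contents below are made up):

```python
import bisect

def ks_distance(sample_a, sample_b):
    """Kolmogorov-Smirnov distance: max |ECDF_a(x) - ECDF_b(x)|."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_xs, x):
        # fraction of values <= x
        return bisect.bisect_right(sorted_xs, x) / len(sorted_xs)

    # The maximum difference is attained at one of the sample points.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

test_window = [180, 200, 220, 190, 210]   # "test" distribution (seconds)
this_window = [185, 205, 215, 195, 200]   # current live window, still fine
broken_window = [185000, 205000, 215000]  # unit change to milliseconds

print(ks_distance(test_window, this_window))    # small (0.2)
print(ks_distance(test_window, broken_window))  # 1.0 (disjoint samples)
```

The KS distance lies in [0, 1] by construction, which makes thresholds comparable across features with very different scales.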
29 WINDOW SIZE: How big should the window size for data aggregation be? t = 1h: + fast detection of anomalies, - suffers from short-term seasonalities. t = 12h: + deals with short-term seasonalities, - slow detection of anomalies.
30 EXECUTION SCHEDULE: How often should we analyze the t = 12h windows? Every hour?
31 EXECUTION SCHEDULE: How often should we analyze? More often: + detect anomalies more quickly, - higher complexity, higher costs. Less often: + less complex, lower costs, - delayed anomaly detection.
32 LIVE MONITORING: possible discoveries: technical problems, seasonalities, change of behaviour, fraud waves, fraud patterns, deviation from expectations.
33 IMPLEMENTATION
34 DISTANCE BETWEEN TWO DISTRIBUTIONS: take the differences between the two CDFs, then sum and normalize to [0, 1].
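One way to read this: evaluate both CDFs at a set of points, sum the absolute differences, and divide by the number of points so the result lands in [0, 1]. A Python sketch; the evaluation grid and the exponential CDFs are made-up examples, not the talk's actual feature:

```python
import math

def cdf_distance(cdf_a, cdf_b, grid):
    """Sum |CDF_a(x) - CDF_b(x)| over a grid, normalized to [0, 1].

    Each absolute difference is at most 1, so dividing by the number
    of grid points keeps the total in [0, 1].
    """
    return sum(abs(cdf_a(x) - cdf_b(x)) for x in grid) / len(grid)

# Hypothetical example: Time-to-order modeled as exponential,
# once with mean ~200 (training, seconds) and once with mean
# ~200000 (live, accidentally milliseconds).
cdf_train = lambda x: 1 - math.exp(-x / 200.0)
cdf_live = lambda x: 1 - math.exp(-x / 200000.0)

grid = [50, 100, 200, 400, 800, 1600]
print(cdf_distance(cdf_train, cdf_train, grid))  # 0.0
print(cdf_distance(cdf_train, cdf_live, grid))   # large (well above 0.5)
```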
35 WE USE THE CDF: the CDF, estimated via percentiles, rather than the PDF, estimated via a histogram.
36 USING TDIGEST TO OBTAIN THE CDF: a t-digest can be built from an in-memory collection or from a distributed collection.
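A real t-digest (Ted Dunning's sketch, with Java and Python implementations) compresses a stream into a bounded set of centroids while still answering CDF and percentile queries, and digests built on separate partitions can be merged. As a toy stand-in that shows only the build/merge/query interface (class name and data are made up; unlike a real t-digest, it keeps every value):

```python
import bisect

class SimpleDigest:
    """Toy stand-in for a t-digest: keeps all values, sorted.

    A real t-digest compresses values into centroids so the sketch
    stays small; this toy version only illustrates the interface:
    build one digest per partition, merge them, then query the CDF.
    """
    def __init__(self, values=()):
        self.values = sorted(values)

    def merge(self, other):
        merged = SimpleDigest()
        merged.values = sorted(self.values + other.values)
        return merged

    def cdf(self, x):
        # fraction of observed values <= x
        return bisect.bisect_right(self.values, x) / len(self.values)

# "Distributed" usage: one digest per partition, merged at the end.
part1 = SimpleDigest([100, 150, 200])
part2 = SimpleDigest([250, 300, 400])
digest = part1.merge(part2)
print(digest.cdf(200))  # 0.5
print(digest.cdf(400))  # 1.0
```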
37 PREDICTION SERVING
38 PREDICTION SERVING: prediction logs are shipped (Lumberjack) to an SQS queue for processing.
39 PROCESSING THE SQS MESSAGES
40 PUTTING IT TOGETHER IN AWS DATA PIPELINE: process logs, group by model, get failed features, create t-digests, create histograms, compute feature distances; output reports and alerts.
41 FINAL NOTES: If you have an ML system deployed in production, then you have to monitor it somehow. Monitoring is especially important if performance feedback is delayed. Start simple and non-intrusive, and keep the reports flexible. Automate as much as possible. To measure how far you are with monitoring, go through the questions in this paper from Google: "What's your ML Test Score? A rubric for ML production systems".
42 THANK YOU! Patrick Baier & Lorand Dali