Predicting Corporate Influence Cascades In Health Care Communities

Shouzhong Shi, Chaudary Zeeshan Arif, Sarah Tran
December 11, 2015

Part A: Introduction

The standard model of drug prescription choice holds that a physician's decision depends heavily on personal experience, for example specialty, years in practice, and location. At the same time, physicians refer to other physicians' decisions and leverage their knowledge, for instance by learning from more experienced researchers or practitioners; they therefore also rely on information obtained through the community and social interaction. By applying machine learning methods, we attempted to explore how a physician's individual experience drives prescription decisions and whether the various corporations making non-medical payments to physicians exert any influence on those decisions. We narrowed the given datasets down to a set of useful factors and propose future directions for a more concrete predictive model.

Part B: Materials and Methods

1. Data Sources

General Payment Data Detailed Dataset, 2013 Reporting Year: This dataset describes physician financial relationships. It covers 685,296 healthcare providers, 1,554 healthcare companies, and 2,098 teaching hospitals, with 14.8 million general payment records, 825,947 research-related payments, and 9,768 ownership-related payments, aggregated into 3.6 million payment records.

Medicare Provider Utilization and Payment Data: 2013 Part D Prescriber: This dataset contains line items of prescriptions written by doctors, with 23,650,520 prescription records and 3,500 drug types.

2. Preprocessing and Feature Extraction

Feature extraction required extensive mining of the large datasets to compose vectors that provide meaningful input to the learning algorithms. We first filtered out information that would not be useful: records of payments made to an institution rather than to a physician, and records where the payment did not involve any activity related to a drug manufactured by the paying company. Next, we joined the payment and prescription files. Initial attempts to join on the physician's profile did not work because the two datasets use different identifiers; after reviewing the data, we joined the datasets on a drug-name match. The difficulty is that a payment can list zero to five associated drugs, and each can match either the drug name or the generic name in a prescription record. Matching these conditions carefully, we established two cases:

1. A physician was paid by a company and also prescribed the same drug that was involved in the payment. This case is labeled 1.
2. A physician was paid by a company and did not prescribe the drug involved in the payment anywhere in the prescription data. This case is labeled 0.

We initially included physician-specialty indicator vectors in our feature set, about 540 of them. During the feature analysis stage we found that they had an adverse effect on generalization, so we removed them; essentially, the influence of non-medical payments does not depend significantly on the physician's specialty. The overall application flow is: filter both datasets, join them on drug name, construct feature vectors, and train and evaluate the classifiers.
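As a rough illustration of the filtering, drug-name join, and labeling described above, a Spark sketch might look like the following. This is not the authors' actual pipeline: the file names, the column names (recipient_type, drug1..drug5, drug_name, generic_name), and the physician_key used to line up the two datasets are hypothetical placeholders, and the real Open Payments and Part D schemas differ.

# Sketch only; all file and column names are illustrative placeholders, and
# "physician_key" stands in for whatever physician matching the pipeline used.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("influence-cascades-features").getOrCreate()

payments = spark.read.csv("payments_2013.csv", header=True, inferSchema=True)
prescriptions = spark.read.csv("partd_2013.csv", header=True, inferSchema=True)

# Keep only payments made directly to a physician (not an institution)
# that name at least one associated drug.
drug_cols = ["drug1", "drug2", "drug3", "drug4", "drug5"]   # 0-5 drugs per payment
has_drug = F.lit(False)
for c in drug_cols:
    has_drug = has_drug | F.col(c).isNotNull()
payments = payments.filter((F.col("recipient_type") == "Physician") & has_drug)

# One row per (payment, associated drug), lower-cased for name matching.
payments_long = (
    payments
    .withColumn("paid_drug", F.explode(F.array(*[F.lower(F.col(c)) for c in drug_cols])))
    .filter(F.col("paid_drug").isNotNull())
)

# A prescription matches if either the brand name or the generic name equals the paid drug.
rx = prescriptions.select(
    "physician_key",
    F.lower(F.col("drug_name")).alias("brand"),
    F.lower(F.col("generic_name")).alias("generic"),
)
joined = payments_long.join(
    rx,
    (payments_long["physician_key"] == rx["physician_key"])
    & ((F.col("paid_drug") == F.col("brand")) | (F.col("paid_drug") == F.col("generic"))),
    how="left",
)

# label = 1 if the paid physician ever prescribed the paid drug, else 0.
labels = (
    joined.withColumn("prescribed", F.when(F.col("brand").isNotNull(), 1).otherwise(0))
    .groupBy(payments_long["physician_key"], "paid_drug")
    .agg(F.max("prescribed").alias("label"))
)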

The features we extracted for the learning algorithms are as follows (column numbers refer to positions in the feature vector):

Column 1 - Same "business" state as company: 1 if the company making the payment is in the same state in which the physician practices, else 0.
Columns 2-6 - Same "license" state as company: the data lists up to five license states per physician; each column indicates whether the corresponding license state matches the state of the company making the payment.
Columns 7-12 - Paid amount: payments categorized into the ranges >$100, >$500, >$1,000, >$10,000, >$25,000, >$50,000. If a physician received a payment in excess of $50,000, all six columns are 1.
Columns 13-16 - Drug claim count (per prescription): thresholds >=100, >250, >500, >750, encoded the same way as paid amount.
Columns 17-21 - Drug day supply (per prescription): thresholds >50, >100, >500, >1000, >1500, encoded the same way as paid amount.
Columns 22-26 - Drug cost (per prescription): thresholds >$100, >$1,000, >$10,000, >$25,000, >$50,000, encoded the same way as paid amount.
Column 27 - Physician ownership indicator: whether the physician has an ownership interest related to the payment being made.
Columns 28-41 - Nature of payment: 14 payment categories (e.g., food and beverages, travel, conference, consulting fee). For each payment transaction, the applicable categories are encoded as 0/1 in the feature vector.

3. Related Work

ProPublica [3] has investigated many aspects of the data files publicly released by Medicare, but mostly from a reporting perspective rather than a predictive one. Work in [4] and related areas has focused on predicting a particular drug prescription from patient symptoms and other drug facts.

4. Technology

We used a 5-node Spark cluster on AWS to perform feature extraction over the large datasets and ran PCA on the resulting feature matrix using the Spark APIs. We used Weka for the classifier algorithms and Matlab both for classifier runs and to generate the plots for the bias/variance analysis.

5. Classification Methods

For this project we evaluate two machine learning algorithms: logistic regression and support vector machines. Both are briefly described in this section.

Logistic regression. We chose logistic regression because the outcome variable is categorical, which violates the linearity assumption of ordinary regression. The dependent variable in our dataset is binary (0 = did not prescribe, 1 = prescribed), and logistic regression is designed for exactly this kind of dummy dependent variable. The binary logistic regression model is

P(Y=1 | X=x) = 1 / (1 + e^-(w0 + w1*x1 + w2*x2 + ...)) = 1 / (1 + e^-z)

and the log-likelihood is

l(w) = Σ_i [ y_i * log P(x_i; w) + (1 - y_i) * log(1 - P(x_i; w)) ]

Gradient ascent on l(w) yields the maximum likelihood estimate of w.

Support Vector Machine. The SVM is a state-of-the-art classification algorithm. It is explicitly based on a theoretical model of learning and has several advantages: it is not affected by local minima and does not suffer from the curse of dimensionality. It maximizes the margin around the separating hyperplane, and the solution is specified by only a subset of the training samples, the support vectors. With the added regularization (slack) term, it is quite tolerant of misclassified training points.
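To make the two classifiers concrete, here is a minimal scikit-learn sketch. It assumes the 0/1 feature matrix and labels have been exported to a file named features.csv (a hypothetical name, with the label in a column called "label"); the report's actual models were trained in Weka and Matlab, so this only illustrates equivalent training with the train/test proportions quoted in Part C.

# Illustrative sketch only: the authors trained these models in Weka/Matlab.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

data = pd.read_csv("features.csv")            # hypothetical export of the feature vectors
X = data.drop(columns=["label"]).to_numpy()
y = data["label"].to_numpy()

# Roughly the 71.78% / 28.22% split used in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2822, random_state=0, stratify=y
)

# Logistic regression: P(y=1|x) = 1 / (1 + exp(-(w0 + w.x))), fit by maximum likelihood.
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Soft-margin linear SVM, as in the optimization problem given below.
svm = LinearSVC(C=1.0).fit(X_train, y_train)

for name, model in [("logistic regression", logreg), ("linear SVM", svm)]:
    print(name, "test accuracy:", accuracy_score(y_test, model.predict(X_test)))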

The soft-margin SVM optimization problem is

minimize    ||w||^2 + C * Σ_i ξ_i
subject to  y_i * (w^T x_i + b) >= 1 - ξ_i,   ξ_i >= 0 for all i

Part C: Results and Analysis

We used Weka to train the models on the dataset. The dataset contains 77,599 examples in total, split into a training set of 55,699 examples (71.78%) and a test set of 21,900 examples (28.22%). We also used Matlab to generate the bias/variance plots for data and algorithm analysis. Using the extracted feature set and the two classifier algorithms, we obtained the following hypothesis performance.

Logistic Regression Analysis

Training the logistic regression model on the full training set took 172.99 seconds. The model correctly classified 18,845 test examples (86.0502%) and incorrectly classified 3,055 (13.9498%).

Class          TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
0              0.766    0.050    0.936      0.766   0.843      0.898
1              0.950    0.234    0.811      0.950   0.875      0.898
Weighted Avg.  0.861    0.144    0.872      0.861   0.859      0.898

Confusion matrix (rows = actual, columns = predicted):
                 Pred. Negative  Pred. Positive
Actual Negative  8176            2492
Actual Positive  563             10669

SVM Analysis

Training the SVM classifier on the full training set took 172.99 seconds. The model correctly classified 18,736 test examples (85.5525%) and incorrectly classified 3,164 (14.4475%).

Class          TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
0              0.736    0.031    0.958      0.736   0.832      0.853
1              0.969    0.264    0.794      0.969   0.873      0.853
Weighted Avg.  0.856    0.150    0.874      0.856   0.853      0.853

Confusion matrix (rows = actual, columns = predicted):
                 Pred. Negative  Pred. Positive
Actual Negative  7852            2816
Actual Positive  348             10884

Bias/Variance Analysis

The SVM with a linear kernel achieved a 14% error rate, but we wanted to see whether we could predict with better accuracy. Based on our bias/variance analysis, we observed a bias problem, as depicted in Fig. 1, and we attempted a set of approaches to reduce the error rate.
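For reference, the per-class statistics and confusion matrices above follow Weka's standard output (TP rate, FP rate, precision, recall, F-measure, and ROC area per class). A hypothetical scikit-learn equivalent, assuming the fitted `svm` model and the held-out `X_test`, `y_test` from the earlier sketch, would be:

# Hypothetical reproduction of the Weka-style evaluation above.
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

y_pred = svm.predict(X_test)

# Rows: actual negative/positive; columns: predicted negative/positive.
print(confusion_matrix(y_test, y_pred))

# Per-class precision, recall (TP rate), and F-measure, as in the tables above.
print(classification_report(y_test, y_pred, digits=3))

# ROC area, computed from the SVM's decision scores.
print("ROC area:", roc_auc_score(y_test, svm.decision_function(X_test)))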

To address the bias issue, we tried:

1. Modeling the dataset with a more complex model, such as a non-linear kernel, to capture interactions between the features.
2. Collecting more attributes that would be more indicative of the likelihood of prescribing.
3. Improving the feature set by removing noise using PCA.

Fig. 1: Bias/variance analysis for the linear kernel.

Non-linear Kernel

Although the training set was modeled better, converging to a 12% error rate, the non-linear kernel did not improve the generalization (test) error, which remained at roughly 14%. On analysis we found a variance problem: the model was overfitting the training set.

Class          TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
0              0.745    0.033    0.956      0.745   0.837      0.856
1              0.967    0.255    0.800      0.967   0.876      0.856
Weighted Avg.  0.859    0.147    0.876      0.859   0.857      0.856

Confusion matrix (rows = actual, columns = predicted):
                 Pred. Negative  Pred. Positive
Actual Negative  7952            2716
Actual Positive  370             10862

Fig. 2: Bias/variance analysis for a polynomial kernel of degree 2.
Fig. 3: Bias/variance analysis for a polynomial kernel of degree 3.
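The bias/variance plots in Figs. 1-3 were generated in Matlab; the sketch below shows one way the same learning-curve diagnosis could be done in scikit-learn. The kernels, training-set sizes, and cross-validation settings are illustrative choices, not the report's exact configuration, and `X_train`, `y_train` are assumed from the earlier sketch.

# Learning-curve (bias/variance) sketch; subsample X_train first if this is too slow.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

for label, model in [("linear kernel", SVC(kernel="linear")),
                     ("poly kernel, degree 2", SVC(kernel="poly", degree=2))]:
    sizes, train_scores, val_scores = learning_curve(
        model, X_train, y_train, train_sizes=np.linspace(0.1, 1.0, 5), cv=3
    )
    train_err = 1 - train_scores.mean(axis=1)  # both curves high and close together -> bias
    val_err = 1 - val_scores.mean(axis=1)      # persistent gap between curves -> variance
    plt.plot(sizes, train_err, label=f"{label}: train error")
    plt.plot(sizes, val_err, label=f"{label}: validation error")

plt.xlabel("training set size")
plt.ylabel("error rate")
plt.legend()
plt.show()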

PCA Analysis

Using PCA, we kept the 68 most meaningful attributes from the feature set and removed the remaining noisy features. In particular, the 500+ physician-specialty columns were found to be redundant and to negatively impact the hypothesis. After the PCA analysis we ran the SVM with a linear kernel again; removing these features reduced the test error by about 1% (from 14.4475% to 13.4715%).

Class          TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
0              0.761    0.035    0.955      0.761   0.847      0.863
1              0.965    0.239    0.808      0.965   0.880      0.863
Weighted Avg.  0.865    0.139    0.880      0.865   0.864      0.863

Confusion matrix (rows = actual, columns = predicted):
                 Pred. Negative  Pred. Positive
Actual Negative  8145            2562
Actual Positive  388             10803

Part D: Conclusion and Future Work

Dealing with two such large datasets required significant preprocessing and feature extraction. We originally had many features, but not all of them were useful; using PCA, we removed some noisy features. To test our hypothesis we used two classification algorithms, logistic regression and SVM. With a linear model the two algorithms achieved similar results in our experiments, although SVM took more computation time; for further analysis we used the SVM model. The experimental results indicate that our original hypothesis holds: non-medical payments appear to have an influence in the physician community, especially when geo-location information is taken into account. However, with a 14% error rate it is difficult to say this definitively. To gain better prediction quality and mitigate the bias issue, we will need to collect more features, possibly in the form of detailed drug facts and patient symptom records. We believe physicians tend to recommend well-established, well-known drugs more often, and we think patient symptoms and preferences play a significant role in prescription decisions; these facts are not represented in our current dataset. Finally, based on our bias/variance analysis, our models represent the current dataset well, and using a more complex model causes overfitting on the training data.

Bibliography

1. https://openpaymentsdata.cms.gov/
2. https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Part-D-Prescriber.html
3. http://projects.propublica.org/checkup/
4. DOI: 10.1007/978-3-642-15431-7_16. In: Artificial Intelligence: Methodology, Systems, and Applications, 14th International Conference, AIMSA 2010, Varna, Bulgaria, September 8-10, 2010, Proceedings.