Unravelling Airbnb Predicting Price for New Listing

Similar documents
The Airbnb Community in the Czech Republic

Predicting user rating on Amazon Video Game Dataset

Rank hotels on Expedia.com to maximize purchases

Predicting Corporate Influence Cascades In Health Care Communities

Predicting Airbnb Bookings by Country

Prediction of Google Local Users Restaurant ratings

Applying Regression Techniques For Predictive Analytics Paviya George Chemparathy

Overview of the Airbnb Community in Barcelona and Catalonia in 2016

The Impact of an AirBnB Host s Listing Description Sentiment and Length On Occupancy Rates

New restaurants fail at a surprisingly

Using Decision Tree to predict repeat customers

3 Ways to Improve Your Targeted Marketing with Analytics

Predicting user rating for Yelp businesses leveraging user similarity

Predicting Yelp Restaurant Reviews

Sawtooth Software. Sample Size Issues for Conjoint Analysis Studies RESEARCH PAPER SERIES. Bryan Orme, Sawtooth Software, Inc.

Data Mining Applications with R

CREDIT RISK MODELLING Using SAS

Insights from the Wikipedia Contest

Unit 6: Simple Linear Regression Lecture 2: Outliers and inference

Reducing Inventory by Simplifying Forecasting and Using Point of Sale Data

CASE STUDY: WEB-DOMAIN PRICE PREDICTION ON THE SECONDARY MARKET (4-LETTER CASE)

The Airbnb Community: Denmark

Jialu Yan, Tingting Gao, Yilin Wei Advised by Dr. German Creamer, PhD, CFA Dec. 11th, Forecasting Rossmann Store Sales Prediction

Building the In-Demand Skills for Analytics and Data Science Course Outline

Retail Sales Forecasting at Walmart. Brian Seaman WalmartLabs

Intro Logistic Regression Gradient Descent + SGD

Chapter 13 Knowledge Discovery Systems: Systems That Create Knowledge

CHAPTER 8 T Tests. A number of t tests are available, including: The One-Sample T Test The Paired-Samples Test The Independent-Samples T Test

Marketing Strategy: Based on First Principles and Data Analytics

Data Mining. Chapter 7: Score Functions for Data Mining Algorithms. Fall Ming Li

2 Maria Carolina Monard and Gustavo E. A. P. A. Batista

Data mining and Renewable energy. Cindi Thompson

Treatment of Influential Values in the Annual Survey of Public Employment and Payroll

Machine Learning Logistic Regression Hamid R. Rabiee Spring 2015

RC View DAT Dispatch: Quick Sheet

Online Algorithms and Competitive Analysis. Spring 2018

IBM SPSS Statistics Editions

SAP Predictive Analytics Suite

AutoClerk User Guide. Tape Chart, Marketing, Yield Management

Examination of Cross Validation techniques and the biases they reduce.

Predictive Modeling Using SAS Visual Statistics: Beyond the Prediction

Customer Relationship Management in marketing programs: A machine learning approach for decision. Fernanda Alcantara

+? Mean +? No change -? Mean -? No Change. *? Mean *? Std *? Transformations & Data Cleaning. Transformations

Predicting Yelp Ratings From Business and User Characteristics

AcaStat How To Guide. AcaStat. Software. Copyright 2016, AcaStat Software. All rights Reserved.

Gasoline Consumption Analysis

Predictive Modelling for Customer Targeting A Banking Example

The Impact of Agile. Quantified.

Netflix Optimization: A Confluence of Metrics, Algorithms, and Experimentation. CIKM 2013, UEO Workshop Caitlin Smallwood

Chapter 5 Evaluating Classification & Predictive Performance

Real Estate Appraisal

Evaluation of Machine Learning Algorithms for Satellite Operations Support

Weka Evaluation: Assessing the performance

The Dummy s Guide to Data Analysis Using SPSS

Predictive Analytics With Oracle Data Mining

Environmental correlates of nearshore habitat distribution by the critically endangered Māui dolphin

Getting Started with OptQuest

+? Mean +? No change -? Mean -? No Change. *? Mean *? Std *? Transformations & Data Cleaning. Transformations

System Secureholiday.net 5.0. On-line booking On-line Availabilities. User Guide

APPLICATION OF SEASONAL ADJUSTMENT FACTORS TO SUBSEQUENT YEAR DATA. Corresponding Author

CFSI Financial Health Score TM Toolkit: A Note on Methodology

Getting Started with HLM 5. For Windows

2016 INFORMS International The Analytics Tool Kit: A Case Study with JMP Pro

Knowledge-Guided Analysis with KnowEnG Lab

Package goseq. R topics documented: December 23, 2017

Keyword Performance Prediction in Paid Search Advertising

7 Steps to Data Blending for Predictive Analytics

Multiple Regression. Dr. Tom Pierce Department of Psychology Radford University

Who Matters? Finding Network Ties Using Earthquakes

ASSIGNMENT SUBMISSION FORM

Credit Card Marketing Classification Trees

CS 5984: Application of Basic Clustering Algorithms to Find Expression Modules in Cancer

Top-down Forecasting Using a CRM Database Gino Rooney Tom Bauer

NHS Improvement An Overview of Statistical Process Control (SPC) October 2011

A Personalized Company Recommender System for Job Seekers Yixin Cai, Ruixi Lin, Yue Kang

Attachment 1. Categorical Summary of BMP Performance Data for Solids (TSS, TDS, and Turbidity) Contained in the International Stormwater BMP Database

ANALYSING QUANTITATIVE DATA

Ensemble Modeling. Toronto Data Mining Forum November 2017 Helen Ngo

Liftago On-Demand Transport Dataset and. Market Formation Algorithm Based on Machine Learning

ASSIGNMENT SUBMISSION FORM

Tutorial Segmentation and Classification

Predicting Startup Crowdfunding Success through Longitudinal Social Engagement Analysis

How much is my car worth? A methodology for predicting used cars prices using Random Forest

Ranking Potential Customers based on GroupEnsemble method

Analysis of Factors Affecting Resignations of University Employees

Foundational banking. Guidance for conversation on banking services. Trainers notes for very basic banking with clients

SPM 8.2. Salford Predictive Modeler

Amadeus Hotel Store. User guide 16 March Taking hotel consolidator content to a new level with Transhotel

Solving Transportation Logistics Problems Using Advanced Evolutionary Optimization

Market Sentiment Helps Explain the Price of Bitcoin

Bid Like a Pro. Optimizing Bids for Success with AdWords. AdWords Best Practices Series

TDWI strives to provide course books that are contentrich and that serve as useful reference documents after a class has ended.

Brian Macdonald Big Data & Analytics Specialist - Oracle

Hotel IT CHAM Manual. User Guide. Online Hotel channel management System. Hotel IT Solution 1

E-Commerce Sales Prediction Using Listing Keywords

Runs of Homozygosity Analysis Tutorial

Credit Risk Models Cross-Validation Is There Any Added Value?

Determining NDMA Formation During Disinfection Using Treatment Parameters Introduction Water disinfection was one of the biggest turning points for

ENGG1811: Data Analysis using Excel 1

Engagement Portal. Employee Engagement User Guide Press Ganey Associates, Inc.

Transcription:

Unravelling Airbnb Predicting Price for New Listing Paridhi Choudhary H John Heinz III College Carnegie Mellon University Pittsburgh, PA 15213 paridhic@andrew.cmu.edu Aniket Jain H John Heinz III College Carnegie Mellon University Pittsburgh, PA 15213 avjain@andrew.cmu.edu Rahul Baijal H John Heinz III College Carnegie Mellon University Pittsburgh, PA 15213 rbaijal@andrew.cmu.edu

TABLE OF CONTENTS Executive Summary... 3 Problem Statement... 4 Motivation... 4 Hypothesis... 4 Exploring the Data... 4 Data Source... 4 Exploratory Data Analysis... 4 Methodology... 6 Hypothesis 1: Predicting the optimum Price... 6 Hypothesis 2: Likelihood of availability... 9 Segregating the listings into high/low availability Clustering + Naïve Bayes... 9 Results and Conclusion... 9 Future Scope... 10 References... 10 2

EXECUTIVE SUMMARY This project analyzes Airbnb listings in the city of San Francisco to better understand how different attributes such as bedrooms, location, house type amongst others can be used to accurately predict the price of a new listing that is optimal in terms of the host s profitability yet affordable to their guests. This model is intended to be helpful to the internal pricing tools that Airbnb provides to its hosts. Furthermore, additional analysis is performed to ascertain the likelihood of a listing s availability for potential guests to consider while making a booking. The analysis begins with exploring and examining the data to make necessary transformations that can be conducive for a better understanding of the problem at large while helping us make hypotheses. Moving further, machine learning models are built that are intuitive to use to validate the hypotheses on pricing and availability and run experiments in that context to arrive at a viable solution. The project then concludes with a discussion on the business implications, associated risks and future scope. The presentation of the project can be found at: https://youtu.be/kt0y1ymtdxm 3

PROBLEM STATEMENT Motivation Airbnb is a privately-owned accommodation rental website which allows house owners to rent out their properties to guests looking for a place to stay. Serving as an aggregator for both the house owners and the guests, Airbnb s total valuation exceed 31 Billion dollars in May 2017, with 4.5 million properties listed in 191+ countries. Airbnb offers complete independence to its hosts to price their properties, with only minimal pointers that allow hosts to compare similar listings in their neighborhood in order to come up with a competitive price. Hosts may incorporate a premium price for any additional amenities they may find necessary. With the number of hosts using Airbnb increasing, coming up with the right price to remain competitive in a host s neighborhood is imperative. On the flip side, guests using Airbnb are often plagued with non-availability of accommodation due to a variety of reasons. This may include seasonal rush, abrupt booking cancellations, hosts preferences etc. We propose to analyze Airbnb s publicly available listing information for the past three years to try and validate our hypotheses to alleviate some of Airbnb s issues. Hypothesis 1. The hosts on Airbnb experiment and charge an optimal price. So, for a new listing can we analyze similar listings in the past to recommend an optimal the host should charge for the new listing? Goal: Hosts get a decent idea how much to charge for a new listing that has no reviews. 2. It is uncertain when a listing is hosted or when it becomes unavailable. Hosts/Guests change plans. So, having the information about availability in the past can we recommend guests with a likelihood of a listing being available? Goal: Guests can decide whether to move on or wait for a listing to become available depending on the likelihood. EXPLORING THE DATA Data Source The data is scraped from Airbnb s website [1], and hence, is in snapshot format for the years 2015, 2016 and 2017. Data is made available by insideairbnb.com. Data is available for multiple cities all over the world, although our focus for training and testing will be completely restricted to the cities of San Francisco and New York. The data has been analyzed, cleansed, and aggregated where appropriate to facilitate public discussion. Exploratory Data Analysis The distribution of listing s price is skewed towards the lower price of USD 100 (Figure 1). Although most of the listings fall in this ballpark, roughly 5% of the listings have prices as high as USD 30000. 4

These outliers may influence the regression lines if a linear regression is fit on this data. We would perform an exercise to remove these outliers. Another interesting attribute of the listing is the number of bedrooms which varies from 0 to 15 (Figure 2). Although more than 97% of the listings have bedrooms below 4 but these outlier listings can also affect the model trained using this attribute. Such listings are rare, and we do not have a large set of data points to be able to confirm a pattern for these listings. There can be a correlation Figure 1: Price Distribution for Listings between the number of bedrooms and high prices or they can be independent too. But having listings with very high price may skew the regression fits which is why we will try to remove these outliers. As people travel over the weekends more where they may want to book an Airbnb and also through personal experience, we had a hypothesis that hosts charge a higher price over weekends than on weekdays. On exploring the data to find if this is true, we found very minimal difference between them and in fact on the contrary, weekend prices were lower than weekday prices (Figure 3). However, surprised by this finding, we explored the counts of entries of weekend and weekday and found the dataset to be highly imbalanced towards weekday prices. This could be Figure 2: Distribution of # Bedrooms because Airbnb owners use their properties over the weekend themselves and hence less listings are available over the weekend. We balanced the dataset and found the weekend prices to be higher in this case but again the difference was insignificant. Weekday Median Price (USD) Weekend Median Price (USD) Imbalanced 186 180 Balanced 164 169 Figure 3: Weekday Weekend Price Difference For understanding if there are particular neighborhoods which have higher prices than others, we made a heatmap (Figure 4) of prices over latitude and longitude pairs. The heatmap can be seen which Figure 4: Price Heatmap of San Francisco indicates central San Francisco to have a higher price rather than suburban areas but again areas near seashores have higher prices. [2] Distribution of certain numerical variables like price, cleaning fee, security deposit and charges for extra people were not normal because of the outliers and hence, we log transformed them to be able to use algorithms which assume normal distribution of the predictors such as linear regression. This change in distribution is depicted for the price variable below: 5

There are no strong correlations (> 0.5 Pearson Correlation). However, Price is slightly positively correlated with the following features of a listing: People Accommodated, Bathrooms, Bedrooms, Cleaning Fee METHODOLOGY Figure 5.1: Price Distribution for Listings Figure 5.2: Log of Price Distribution for Listings HYPOTHESIS 1: PREDICTING THE OPTIMUM PRICE We started the prediction of price for a listing by fitting a linear regression and analyzing which predictors are given high importance and check the residuals of the plot. The RMSE of the prediction came out to be around USD 131 while the mean percentage error is around 28.02%. On a closer look at the distribution of the prediction error (Fig. 6), we can see heavy concentration within an error of USD 30. Checking the percentage of listings in different ranges of errors. Prediction Error (USD) Percentage <= 5 8% <=10 15.93% <=20 30.63% <=30 44.89% Figure 7: Distribution of Prediction Error Figure 6: Distribution of Prediction Error This indicates that almost 50% of the predictions had less than USD 30 error and therefore, some listings are easy to predict while others are little harder. (Fig. 7) Based on these results and suggestions from our professor we thought we should separate cases of low error and just focus on the low error cases or easy ones. We need to classify each listing into whether it will be easy to predict or hard to predict before deciding whether we will be predicting it. To achieve this, a random forest classifier was trained [3]. The features that were deemed important according to this classifier are given in the figure on the right. The order indicates decreasing importance of predictors (Figure 8). 6 Figure 8: Important features

On checking how the model differentiates between easy and hard to predict samples, we could see some predictors as good differentiators like below. Cleaning fee if less than 4 is less likely to be easy to predict (Fig 9.1). However, some features though deemed important did not provide an obvious differentiating line like below. Every month has similar frequency of easy and hard samples (Fig 9.2). Figure 9.1: Easy vs Hard Cleaning Fee distribution Figure 9.2: Easy vs Hard Month distribution The model that was trained for easy cases had encouragingly low training error but discouragingly high test error. This implied overfitting in the training which was then followed by two suggestions: 1. Apply Cross Validation: Train and test against different folds to identify more robust set of easy. Then train the classifier to test against the unseen cases. 2. Upsample and Downsample: Until this point we had some listings for which we have a lot of price instances (as high as 4000) and others which have only 1 instance. This could lead to bias the model towards the highly frequent listings thinking there is a pattern underlying. Thus, we were advised to balance the dataset by upsampling the listings which were rare and downsampling the ones that were frequent. Methodology to upsample and downsample The idea is to have equal number of instances for each listing so that they have equal weights in the model. We had 7000 unique listings in our dataset. We decide to use 100 as the optimal number of instances required for each listing as that we will give us a dataset of around 700,000 samples which is optimal to work with. We needed to perform both upsampling and downsampling. Upsampling: For upsampling, we took a listings unique entry in the dataset and repeated them to reach the length of 100. Downsampling: For downsampling, we found the median of the dataset which gives a row in the dataset which is the center of the dataset looking at all columns together. Then we find the Euclidean distance of each row from this median row [4]. This is followed by sorting the rows according to their distance from the median and take the top 100 values. We used this approach to find most diverse combinations of features so that we cover a larger space and ignore samples which are very similar to each other in feature values and prices. After balancing the dataset, we wanted to tune the hyper parameters for the random forest regressor. We used sklearn s RandomizedSearchCV [5] and ran 100 iterations for a set of parameters. The grid of parameters used was: N_estimators - [100-200] Max_features - ['auto', 'sqrt'] 7

max_depth - [10-20, None] min_samples_split - [2, 5, 10] min_samples_leaf - [1, 2, 4] bootstrap - [True, False] The 100 iterations were run using 10-fold cross validation to find best parameters which can be used to train the dataset. Top 3 ranked parameters were: 1. Model with rank: 1 Mean validation score: 0.980 (std: 0.001) Parameters: {'n_estimators': 131, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'auto', 'max_depth': None, 'bootstrap': True} 2. Model with rank: 2 Mean validation score: 0.979 (std: 0.001) Parameters: {'n_estimators': 142, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': None, 'bootstrap': True} 3. Model with rank: 3 Mean validation score: 0.978 (std: 0.001) Parameters: {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': None, 'bootstrap': False} The RMSE of the prediction was USD 28 with a mean percentage error of 0.04%. The figure below shows the distribution of the prediction error of samples which had lower than USD 5. 99% listings had prediction error less than USD 5 (Fig 10). We also tried to check the performance of the model on increasing the number of trees used to train the model. The training and test error are averaged over 10 folds of 10-Fold CV (Fig 11). Figure 10: Distribution of Prediction Error Figure 11: Random Forest Regressor performance with change in number of trees 8

HYPOTHESIS 2: LIKELIHOOD OF AVAILABILITY Segregating the listings into high/low availability Clustering + Naïve Bayes Our motivation for applying clustering to our data was to ascertain the structure of given availabilities (#days/year that the property is available to be rented out) of property listings. The data presents 4 types of availabilities namely 30 days/year, 60 days/year, 90 days/year and 365 days/year. Our aim is to group each type into clusters of low and high availability (Fig 12). (Note: X-Axis is normalized over #days) For 30 days/year, clusters were made of equal size possibly because of the small range of days. For 30 days/year and 90 days/year, clusters were better defined. Further, we wished to classify each listing into high and low availability. Taking zip code and room type as features, we trained a multinomial Naïve Bayes model for classification. This model was 64.1% accurate, whereas the default model s accuracy was 62.8%. Future Recommendations: 1. Include more features to get sharper classification of high or low availability. 2. Obtain an optimum threshold over the Figure 12: Resulting Distribution of Clustering number of days to classify new listings as high or low available. This gives way to answer the question: If I host a new listing, how busy will it s booking will be in the next 30/60/90 days. RESULTS AND CONCLUSION The difference in the performance of same RandomForestRegressor on the balanced and imbalanced dataset confirms the hypothesis that high frequency of some samples was skewing the predictions and overfitting for them. Therefore, we decided to go ahead with the balanced dataset and RandomForestRegressor. Considering the initial hypothesis of whether we will be able to predict price of a new listing based upon its features, we would say that we will be able to predict the prices within USD 29. This is aligned with the project goals because the user can use our initial price to start with and then make necessary changes based on interaction with the customers. BUSINESS IMPLICATION AND RISK MITIGATION Using just publicly available data about the Airbnb listings we have looked at approaches building upon predicting an optimal price for a new listing. Hosts can make use of historical data to get an idea about how much other hosts are charging for similar listings. We believe this adds practical value to the client as they can not only get an idea of the prices they should charge but also gives a chance to mark up/down the prices. Airbnb might have price tips based on multiple sources of information. Although, we or the hosts do not know if they bias these prices based on personal interests. An example can be that Airbnb wants their listing prices varied so more customers are converted. So, the additional value this tool adds is an unbiased estimator. 9

Considering just for San Francisco, the neighborhoods play a less significant role in setting prices. The hosts should make the best of the proposed features such as time seasonality, listing characteristics, analysis of current prices for similar listings on Airbnb and finally make use of this tool to get the optimum price they should charge for their new listing. The assumption is that the publicly available prices are successful prices for hosts as they experiment over time and better their prices against customer affordability and maximize income. The risk is that this assumption can be wrong and the booking history for guests could provide more useful information as successful prices. The way to mitigate this risk is to do validate the similarity between the prices guests are charged and the prices the hosts are listing. FUTURE SCOPE We have features that were not used in this exercise because of time limitations. These are majorly text based features and would require natural language processing to convert into features. In continuation of the work, these features can be used: 1. Amenities (Text) 2. Street 3. Description (Text) 4. Reviews (Text) 5. Interactions between owner and guest (Text) 6. Ratings 7. Reviews_per_month Also, we identified that for each neighborhood, there are centres where the price is higher but as you move away from the centre, the price reduces. It is visible in the heatmap shown previously. REFERENCES 1. http://insideairbnb.com/get-the-data.html 2. GMAPS tutorial: https://github.com/pbugnion/gmaps 3. Random Forest Classifier: http://scikitlearn.org/stable/modules/generated/sklearn.ensemble.randomforestclassifier.html 4. Dataframe Median: https://pandas.pydata.org/pandasdocs/stable/generated/pandas.dataframe.median.html 5. RandomizedSearchCV http://scikitlearn.org/stable/modules/generated/sklearn.model_selection.randomizedsearchcv.html 10