CASE STUDY: WEB-DOMAIN PRICE PREDICTION ON THE SECONDARY MARKET (4-LETTER CASE)
MAY 2016
MICHAEL.DOPIRA@DATA-TRACER.COM
TABLE OF CONTENTS
SECTION 1 Research background
SECTION 2 Study design
SECTION 3 Results
APPENDIX 1 Benchmark models
APPENDIX 2 Random forest
SECTION 1 Research background
US DOMAIN INDUSTRY ALONE IS ESTIMATED AT $2B
US domain industry, 2015: $2B revenue, 8,250 employees, 4,539 businesses.
[Chart: Annual premium web domain sales, USD M, 2005-2015, +17%. The market is growing in line with the global e-commerce industry.]
The domain industry is growing: over 2005-2015 the CAGR was 17%. It is expected to grow further as web-based businesses expand overall. The rapid development of Chinese e-commerce is the latest trend fueling the growth of the web-domain secondary market. The industry is already of considerable size: its market value in the US alone has reached 2 billion US dollars.
Sources: Domain Name Prices (Dnpric.es), IBISWorld, Quartz.
AVERAGE DOMAIN PRICE ON THE SECONDARY MARKET IS STEADILY GROWING
Average price of sold domain, 2013-2015 (index, base year 2006)
2013/01  156
2013/06  163
2014/01  181
2014/06  201
2015/01  212
Growth annotation on the chart: +8%
Top 5 most expensive deals, USD M (excluding adult websites)
Fund.com     $10.0
We.com       $8.0
Diamond.com  $7.5
Z.com        $6.8
Slots.com    $5.5
The average price of a domain on the secondary market has been growing steadily since 2013. The number of unregistered domains (especially short and attractive ones) is constantly declining, which drives the growth of the secondary market. The most sought-after domains have reached seven-figure price tags.
Sources: DNJournal, Sedo.com; 1- National Association of Securities Dealers Automated Quotations; 2- The Domain Name Price Index
MACHINE LEARNING IS NECESSARY TO PREDICT DOMAIN PRICES
Share of domain sales (by number of sales) in different price segments
up to $100      62%
$100-$1000      26%
$1000-$10000    11%
$10000+         1%
Examples of domains priced at $100 or less
Domain     Price
rzwv.com   $1
ulpq.com   $3
xcoi.com   $5
kxoy.com   $10
pjov.com   $20
vugz.com   $40
mosf.com   $80
ogev.com   $100
The majority of domains are priced below $100. However, it is extremely difficult to guess the price without applying machine learning techniques. The problem is that the lion's share of domains priced below $100 do not contain real words.
PROJECT FEATURES
Project objective: predict the price of an arbitrary 4-letter domain offered on the secondary market
Data used: over 120,000 domain sales since 2000
Predictors: 200+ features reflecting linguistic, topic-interest and marketplace information
Methods employed: non-parametric regression (Random Forest)
Results: predictive accuracy on the test dataset is 82.9% (measured by goodness of fit, R²)
Possible next steps: development of a general predictive model (for all types of domains)
Out-of-the-box solutions: inclusion of Google search data as well as the letter-combination frequencies compiled by Peter Norvig
SECTION 2 Study design
THE GOAL OF THE STUDY IS CREATION OF WEB-DOMAIN PRICE PREDICTING MODEL
[Diagram: three inputs (linguistic characteristics, marketplace information, topic interest) feed an advanced data mining tool (Random Forest), which outputs the web-domain price prediction.]
THREE TYPES OF INPUT FEATURES ARE USED
Linguistic: consonant-vowel pattern; letter repetition pattern; letter place pattern; frequency of letter combination usage; presence of undesirable letters; whether a real word is contained
Market place: seller; date of the deal; price of the previous deal of the same domain
Topic interest: number of Google searches (bid & competition) for the word contained in the domain
Domain extension (.com, .org, .tele, etc.)
Total number of variables in the dataset: 238
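A minimal sketch of how two of the linguistic features listed above could be derived from a domain name. The function names, the output encoding and the choice to treat "y" as a consonant are illustrative assumptions, not the study's actual feature definitions.

```python
VOWELS = set("aeiou")  # assumption: 'y' is counted as a consonant


def cv_pattern(name: str) -> str:
    """Map each letter to 'c' (consonant) or 'v' (vowel)."""
    return "".join("v" if ch in VOWELS else "c" for ch in name.lower())


def repetition_pattern(name: str) -> str:
    """Encode letter repetition: first distinct letter -> 'A', second -> 'B', ..."""
    labels = {}
    encoded = []
    for ch in name.lower():
        if ch not in labels:
            labels[ch] = chr(ord("A") + len(labels))
        encoded.append(labels[ch])
    return "".join(encoded)


def linguistic_features(domain: str) -> dict:
    """Split 'name.extension' and compute the two hypothetical features."""
    name, _, extension = domain.lower().partition(".")
    return {
        "name": name,
        "extension": extension,
        "cv_pattern": cv_pattern(name),
        "repetition_pattern": repetition_pattern(name),
    }


features = linguistic_features("Mams.com")
```

Categorical outputs such as these patterns can be passed to a tree-based model as indicator variables, which is one plausible way a few linguistic rules could expand into a large number of predictors.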
SECTION 3 Results
THE MODEL OF RANDOM FOREST HAS SUBSTANTIAL PREDICTIVE POWER
[Scatter plot: price vs. predicted price]
Random forest performed well in domain price forecasting. The goodness of fit (R²) is 82.9%, which means that the model explains 82.9% of the variation in domain prices. The random forest's results were compared with linear regression and decision tree models as benchmarks, and its predictions proved more accurate on the test dataset (details can be found in the appendix).
Note: The scatter plot shows the fit for a randomly selected sample of 100 observations, using logarithmic prices.
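A minimal sketch of the goodness-of-fit measure, assuming the standard coefficient of determination (R² = 1 - SS_res / SS_tot) is what is reported and that it is computed on logarithmic prices, as in the scatter plot. The three price pairs are the ones shown on the example slide of this deck (Thai.co, Mams.com, Yftm.com); three points only illustrate the calculation and say nothing about the reported 82.9%.

```python
import math


def r_squared(actual, predicted):
    """Share of the variation in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Prices in USD taken from the example slide of this deck.
true_prices = [2200, 3850, 30]
predicted_prices = [2239, 3935, 30.14]

# Evaluate on the log scale so that cheap and expensive domains
# contribute comparably to the measure.
log_true = [math.log(p) for p in true_prices]
log_pred = [math.log(p) for p in predicted_prices]
example_r2 = r_squared(log_true, log_pred)
```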
VARIABLES WITH THE HIGHEST PREDICTIVE POWER
Partner (Seller) indicator: indicator of the company that sold the domain name
Previous price: price of the domain at the time of its last sale
Date indicator: year and month indicator
Consonant-vowel pattern: pattern describing the positions of consonants and vowels in the word
Frequency of 2-letter combinations: number of times a two-letter combination appeared in the set of texts analyzed by Peter Norvig
Google Searches of containing word: if the domain contains a real word, the number of Google searches for this word
Domain extensions: indicator of the domain extension
EXAMPLE OF PRICE PREDICTION ALGORITHM
                            Thai.co    Mams.com   Yftm.com
Is a real word contained?   yes        yes        no
Year of the deal            2014       2016       2011
Seller                      Afternic   Sedo       GoDaddy
Predicted price, USD        2239       3935       30.14
True price, USD             2200       3850       30
As the final output, the client receives a model that returns the predicted price of a domain once its characteristics are entered.
PROJECT SUMMARY
The market set-up that explains domain prices is complex and depends on many factors. These factors cannot be easily observed, and their effects on prices are not obvious.
Low- and medium-price deals constitute the lion's share of the market. Accurate price prediction in this segment is challenging but lucrative.
To take numerous factors into account simultaneously, we used an advanced machine learning technique, Random Forest, which is robust to overfitting.
The developed statistical model is flexible and can therefore be applied to other similar problems (e.g. price prediction for domains of any length).
The research is based on open-source data.
The model shows good forecasting power (R² is 82.9%).
APPENDIX 1 Benchmark models
RANDOM FOREST IS BETTER THAN BENCHMARKS
[Bar chart: goodness of fit and cross-validation* for Random Forest, Decision Tree and the linear regression model. Values shown: 87.3%, 87.0%, 82.9%, 80.6%, 77.3%, 74.5%.]
The decision tree performs almost as well as Random Forest when predicting on the total sample. However, thanks to its higher resistance to overfitting, Random Forest produces more accurate estimates on the test dataset.
Note: Cross-validation means that goodness of fit is measured on the basis of the test dataset (which was not used for model fitting).
APPENDIX 2 Random forest
DECISION TREE IS BASIC ELEMENT OF THE RANDOM FOREST
Illustrative example of a Decision Tree segment built on the training data
GENERAL IDEA: A decision tree classifies cases into groups or predicts values of a dependent (target) variable based on values of independent (predictor) variables. Independent variables are chosen so that the groups are separated as well as possible.
EXAMPLE EXPLANATION: The model determines how a combination of various factors affects the price of a domain. In the example only one branch of the tree is displayed in full; it shows how the average price of a domain sold on the SEDO platform changes with domain extension, price of the previous sale and consonant-vowel pattern.
[Tree diagram: Partner Sedo (Yes/No) -> Extension (com, org, net, other) -> Previous price (<$80, >=$80) -> Pattern. Leaf average prices: cvcv* $252, vcvc $212, vccv $150, cvvc $140, ccvv $90.]
The order of variables and the size of the tree are determined statistically.
The tree grows from every node on every level (only some branches are displayed here).
*Note: c stands for consonant, v stands for vowel
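The general idea (choosing a predictor and cut-off so that the groups are separated as well as possible) can be sketched as follows. This assumes the common regression-tree criterion of minimizing the weighted variance of prices within the two resulting groups; the deck does not state which criterion was used, and the toy data below are hypothetical, not taken from the study.

```python
def variance(values):
    """Population variance of a non-empty list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_threshold(xs, ys):
    """Return (threshold, weighted_variance) for the split x < threshold
    that leaves the two groups of ys as homogeneous as possible.
    Returns None if xs holds a single distinct value (no split possible)."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        score = (len(left) * variance(left) + len(right) * variance(right)) / len(ys)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: previous sale price and current sale price, USD.
previous_price = [20, 50, 70, 80, 120, 300]
current_price = [30, 40, 35, 200, 220, 260]

split = best_threshold(previous_price, current_price)
```

A full tree repeats this search over every predictor at every node and keeps splitting the resulting groups, which is why the order of variables and the size of the tree follow from the data rather than being set by hand.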
THE DECISION TREE CAN BECOME QUITE LARGE AND COMPLICATED
Illustrative example of a section of the full Decision Tree built on the training data set
When all predictors are used in the analysis, the tree becomes very large. However, a single tree is not a sufficiently robust method, so Random Forest is preferred.
RANDOM FOREST IS AGGREGATION OF DECISION TREES
[Diagram: each of the decision trees 1, 2, ..., N* is built on a random data subset and a random variable subset.]
Results of the individual decision trees (typically 200-1000 trees) are aggregated and average prices are computed. The importance of each variable is calculated.
Note: (*) The optimal number of trees is determined during analysis; usually about 500 trees are built.
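A minimal, standard-library sketch of the aggregation step described above: each tree is fitted on a random data subset (a bootstrap sample) and a random variable subset, and the forest prediction is the average of the individual predictions. To stay short, every "tree" here is a single-split stump, whereas a real random forest grows much deeper trees. All names and the toy data are hypothetical, and this is not the implementation used in the study.

```python
import random


def fit_stump(rows, targets, feature_indices):
    """Fit a one-split regression tree using only the given features.
    Returns (feature, threshold, left_mean, right_mean)."""
    best = None
    for f in feature_indices:
        for threshold in sorted({row[f] for row in rows})[1:]:
            left = [y for row, y in zip(rows, targets) if row[f] < threshold]
            right = [y for row, y in zip(rows, targets) if row[f] >= threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((y - left_mean) ** 2 for y in left)
                   + sum((y - right_mean) ** 2 for y in right))
            if best is None or sse < best[0]:
                best = (sse, f, threshold, left_mean, right_mean)
    if best is None:  # no usable split in this sample: predict the mean
        mean = sum(targets) / len(targets)
        return (None, None, mean, mean)
    return best[1:]


def predict_stump(stump, row):
    feature, threshold, left_mean, right_mean = stump
    if feature is None:
        return left_mean
    return left_mean if row[feature] < threshold else right_mean


def fit_forest(rows, targets, n_trees=500, n_features=1, seed=0):
    """Build n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]          # random data subset
        features = rng.sample(range(len(rows[0])), n_features)  # random variable subset
        forest.append(fit_stump([rows[i] for i in sample],
                                [targets[i] for i in sample], features))
    return forest


def predict_forest(forest, row):
    """Aggregate by averaging the predictions of all trees."""
    return sum(predict_stump(stump, row) for stump in forest) / len(forest)


# Hypothetical toy data: [previous price in USD, contains real word (0/1)] -> price.
rows = [[20, 0], [50, 1], [70, 0], [80, 1], [120, 0], [300, 1]]
prices = [30, 40, 35, 200, 220, 260]

forest = fit_forest(rows, prices, n_trees=200)
high = predict_forest(forest, [300, 1])
low = predict_forest(forest, [20, 0])
```

Averaging many trees built on different subsets smooths out the quirks of any single tree, which is the intuition behind the resistance to overfitting mentioned in this deck.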
RANDOM FOREST IS SUITABLE TOOL FOR DOMAIN PRICE PREDICTION
The model does not require the data to follow a specific distribution.
Both categorical and scale variables can be used.
Weak predictors are effectively incorporated in the model.
The model is robust and not prone to overfitting.
The predictive power of the model does not deteriorate when a large number of predictors is used.
The final output of the model is a price, which can be used as a predictor of the future sale of the domain.
If you have any questions, please contact us:
Skype: michael.dopira
Email: michael.dopira@data-tracer.com