Exploration of Google Trends for Data Analysis


Yi Xu
Registration number 4851404

Exploration of Google Trends for Data Analysis

Supervised by Dr Beatriz De la Iglesia

University of East Anglia
Faculty of Science
School of Computing Sciences

Abstract

Google Trends contains information on search term popularity over the years, at different locations and at different times. Some of that information is beginning to be used to understand phenomena ranging from flu outbreaks to stock market fluctuations. This study investigates whether there is an association between the UEA Application data and the Google Trend data, and seeks a significant model that can predict the UEA Application data from several predictors: the Google Trend data and the Home, EU, and Overseas data. The UEA Application data was normalised using a formula so that it could be compared with the Google Trend data and the other predictors. SPSS and MATLAB were used to analyse the data. Three methods were used to determine the best model: Linear Regression with the Google Trend data as the predictor; Multiple Regression with the additional predictors Home, EU, and Overseas data; and Time Series Analysis with all four predictors. The MAE and R Squared were used to select the best model among those generated by the three methods. The results show that there is an association between the Google Trend data and the UEA Application data. They further show that ARIMA (1,0,0) was the best model for predicting the UEA Application data, using previous values of the UEA data and the Google Trend data.

Acknowledgements

This paper could not have been written to its fullest without Dr Beatriz De la Iglesia, who served as my supervisor and who challenged and encouraged me throughout my time studying under her. I express my gratitude to her again, as she was abundantly helpful and offered invaluable assistance, support and guidance. She would never have accepted anything less than my best efforts, and for that, I thank her.

Contents

1 Introduction
  1.1 Background and Motivation
  1.2 Aim and objective
  1.3 Risk and Knowledge required
2 Literature Review
  2.1 Introducing the uses of Google Trends in different areas
  2.2 Selection of Search Terms
  2.3 Using shorter terms
  2.4 Language independent
  2.5 Category of section terms
3 Explore Google Trends
  3.1 Functionality
  3.2 Worldwide
  3.3 Set time
  3.4 Category setting
  3.5 Regional interest
4 Data Collection
  4.1 UEA Application rate data
  4.2 Google Trends data
  4.3 Normalising data
  4.4 Comparison with competitor Universities
  4.5 Selecting a good model
5 Analysis
  5.1 Distribution of the UEA Application, Google Trend, and Home, EU, and Overseas data
  5.2 Linear Regression
  5.3 Multiple Regression Analysis
  5.4 Time Series Analysis
  5.5 Selecting the best model
6 Discussion and Conclusions
References

List of Figures

2.1 Harvard compared with Harvard University in Google Trends
2.2 UEA compared with University of East Anglia in Google Trends
2.3 University of East Anglia compared with University of East Anglia under the Jobs & Education category in Google Trends
3.1 Worldwide setting changed to United Kingdom in Google Trends
3.2 Set time function from October 2011 to October 2014 in Google Trends
3.3 Regional interest
3.4 The results of Regional interest after searching in Google Trends
4.1 Search of the five universities in Google Trends
4.2 Regional interest of search term University of East Anglia
4.3 Regional interest of search term University of Kent
4.4 Regional interest of search term University of Essex
4.5 Regional interest of search term University of Warwick
4.6 Regional interest of search term University of Sussex
5.1 Sequence Chart for UEA Application data
5.2 Sequence Chart for Google Trend data
5.3 UEA Application Rate for Home Student
5.4 UEA Application Rate for EU Student
5.5 UEA Application Rate for Overseas Student
5.6 [...] of Google Trends Data
5.7 Linear Regression Table
5.8 Scatter Plot with Linear Trend for UEA Application and Google Trend data
5.9 Multiple Regression Coefficients
5.10 Sequence Chart for the Normalized UEA Application data and the predicted values from the Model
5.11 Autocorrelation Function for UEA Application data
5.12 Partial Autocorrelation Function for UEA Application data
5.13 Model Statistics
5.14 ARIMA Model Parameters
5.15 Model fit sequence chart
5.16 Comparison of MAE by Model

1 Introduction

1.1 Background and Motivation

Google Trends is a versatile, publicly available tool. It analyses a percentage of Google web searches to determine how many searches have been done for the terms you've entered, compared to the total number of Google searches done during that time. With a number of different features, it allows you to gain an understanding of the hottest search trends of the moment along with those growing in popularity over time.

Search engines have taken on a bigger role in society and have given people easier access to valuable information. They have also dramatically reduced the time it takes to find relevant content and data. Forecasters use Google Trends to predict interest rates, election outcomes, stock market behaviour and even the weather. These predictions are then used to shape expectations and make decisions (Humphrey, 2010).

Search engines use intelligent algorithms to identify the most relevant material based on the search terms input by the user. This user input is potentially very valuable. It can tell us what people find important, what questions they are asking and what they are searching for, whilst giving us access to a wealth of other useful information. Search engines allow businesses to capitalise on such user queries.

UEA is rated as one of the top universities in the UK in the Times Higher Education Student Experience Survey 2015 (Grove, 2015). It boasts notable alumni in fields ranging from international politics and government to literature, science, the arts, media, and business and economics. The institution looks closely at all aspects of the student experience, from learning to the social programmes run by the students' union and the environment in which students live.
The objective of this project is to compare Google Trends data with data on UEA applications and to analyse whether the service can help predict future applications. Understanding the Google Trends data can help in predicting UEA Applications, so that the institution can anticipate how many students will apply and provide better services and a better learning experience. We will be experimenting with specific data sets, namely the Google Trend data and the EU, Home, and Overseas data, in order to find out the outcome of such predictability.

1.2 Aim and objective

The aim was to establish and utilise relevant material so that university institutions, such as UEA, can better analyse their markets. This means predicting and analysing how the demand for courses changes, taking into account variables such as different countries and time scales. The objectives of this paper are as follows:

1. To determine if there is a trend in the UEA Application and Google Trend data.
2. To determine if there is an association between the UEA Application data and the Google Trend data.
3. To determine a significant model that can predict the UEA Application data given the different predictors, namely the Google Trend data and the Home, EU, and Overseas data.

1.3 Risk and Knowledge required

We need to carefully consider various factors that can cause problems within this research.

1. First, we need to know how Google Trends works and what it can offer, because it is the main object of the project. A better understanding of the tool will help considerably in deciding how to approach and analyse the data, which is essential to the success of the project.

2. There is a risk that the data may not be accurate, as Google includes a disclaimer that the trends data may contain inaccuracies for a number of reasons, including data sampling issues and a variety of approximations that are used to compute results. This means that the figures may not give an accurate idea of how much search traffic a term really receives.

3. Google Trends data is normalised to make it easier to compare search data across regions. The UEA Application data is not normalised, so we have to find a way to normalise it so that the two data sets are comparable.

2 Literature Review

Google, the biggest search engine, provides the Trends software. Google Trends is a public web application that enables the user to find popular keywords or to compare search terms. It is a valued platform for comparing data from a national to a global scale (Castro, 2014).

Google has become incredibly powerful. The company aims to help people solve search problems and constantly strives to make its services more efficient and intelligent. Through continued iteration on difficult and complex problems, it has been able to provide continuous improvements to a service that already makes finding information a fast and seamless experience for millions of people around the world. Its aim is to make information more accessible throughout its range of products, which include Gmail, Google Maps and Drive (Trends Help, 2015). Within its company philosophy, it aims to extend the power of search to others and to help people access relevant information in this age of information and big data.

We can use Google to provide reliable information and, by analysing its data, to find answers that are relevant to popular search terms. We want to see whether or not Google Trends can be used to predict future trends. The project aims to find out how reliable specific methodologies are when they rely on Google Trends data to predict future results. We will be experimenting with specific data sets in order to discover the outcome of such predictability.
Google Trends is an online search tool that allows the user to see how often specific keywords, subjects and phrases have been queried over a specific period of time. It analyses a portion of Google searches to find out how many searches for specific terms have taken place, relative to the total number of searches performed via Google over the same time. The data provided by Google Trends is updated daily, but a disclaimer from Google does say that it "may contain inaccuracies for a number of reasons, including data sampling issues and a variety of approximations that are used to compute results."

We can query up to five different search terms or topics simultaneously with Google Trends. The results appear in a graph which Google calls the Search Volume Index. The data can be exported into a .csv file, which can be opened with Excel and other spreadsheet programs. A feature known as Hot Searches shows a list of the day's top 40 search queries in the US.

The Google Trends for Websites feature looks at website traffic rather than results for specific terms. The data includes information about unique visitors as well as a regions column, which shows the percentage of visitors from a specific geographical area. Other useful columns include "also visited" and "also searched for". These provide valuable information about the behaviour of site visitors and tell us where else they are likely to go whilst browsing.

Google also launched Insights for Search. This service provides advanced features for showing search trends data. It is used by many search engine marketing teams, allowing them to fine-tune their keyword strategies and explore certain patterns and search cycles, whilst showing keyword popularity by location and time range (Mellon, 2014).

The Google data is not given in absolute volumes but is indexed to the highest observed search volume, which is set to 100. Consequently, it is not possible to ascertain the frequency of searches that take place at any given time; we can only see how the searches have changed over time.

2.1 Introducing the uses of Google Trends in different areas

Google Trends has been taken up by many researchers from numerous fields of study.
By examining the methodologies and analytical techniques used in this literature, we can learn a lot that benefits our project. With the comparison of data and the use of Google Trends in mind, we focus mainly on informative literature that lets us weigh the advantages and disadvantages outlined in each article. Google Trends has become a favoured tool for many analysts gathering data.

The first ground-breaking paper introducing Google Trends was published by Choi and Varian (2009). It pointed out that Google Trends is a good tool for making short-term predictions about economic activities including motor vehicles and parts, unemployment benefits, travel, car sales and retail sales. An updated version by Choi and Varian (2012) offered more detail, showing that simple seasonal AR models that include Google Trends variables tended to outperform models that exclude them by 5% to 20%. This indicated that short-term predictions can be used for relevant research. More academic papers using Google Trends for research purposes started to appear soon after Choi and Varian's work was published.

Many analysts have been looking at Google Trends and how it relates to the medical industry. Researchers including Sudhakar et al. (2014) suggest that the use of Google Trends to study health phenomena has been approached in many different ways, but that better methods are needed, as the reproducibility of some findings has been precluded. Katı and Selek (2012) used the search frequencies of medical terms on Google Trends and found that search terms with higher search rates over the years indicate that a disease is more widespread, although noise levels can be affected by certain events and special dates.

Google Flu Trends was designed for the medical industry (Copeland et al., 2013). The purpose of Google Flu Trends (GFT) is to use search keyword trends from Google.com to produce a daily estimate, or nowcast, of the occurrence of flu two weeks in advance. Patients may use search engines to look for keywords that reflect their symptoms or to aid initial self-diagnosis, while physicians may use search engines to identify available resources on the web (Seifter et al., 2010). Another influential study, Ginsberg et al. (2009), published in Nature, demonstrated that Google Trends data could be used to detect flu outbreaks. Araz et al. (2014) state that Google Trends can statistically improve the performance of predicting influenza and illness rates. Trends can offer timely and accurate estimates of Emergency department volume during flu season, helping hospitals to plan their Emergency department resources accordingly, lower costs, optimise relationships with suppliers and improve service quality.

Lazer et al. (2014) report that Google Flu Trends predicted more than double the proportion of doctor visits for influenza-like illness than the Centers for Disease Control and Prevention. They explain that, up to 2013, Google Flu Trends overestimated by a big margin, especially during the flu season, and predicted wrongly for 100 out of 108 weeks starting in August 2011, owing to a change of algorithm. We also need to keep in mind that Google Trends has changed its algorithm as well.

Vosen and Schmidt (2011) suggest that Google Trends is a much more reliable tool for analysis and prediction than traditional survey-based methods. They say Google Trends outperformed conventional alternatives in almost all the experiments that were carried out and that, if used in a skilled manner by experienced professionals, it will offer more benefits to forecasters at private companies. More data modelling tools and indicators, such as ARIMA, MCSI and CCI, were later used to gain more accurate results. Schmidt and Vosen (2013) indicate that Google Trends exceeds the performance of prominent survey-based indicators and can thus serve as a very useful tool for professional forecasters of consumer spending.

In the movie industry, Goel et al. (2010) used query volume to predict box-office revenue. Hand and Judge (2012) have also suggested that Google Trends search data is related to cinema visit levels. They suggest that searches for specific films have the potential to increase the accuracy of cinema admissions forecasting models.

Carneiro and Mylonakis (2009) noted that, in more developed countries, Google Trends shows higher prevalence rates for disease-related searches, because a larger population contributes to the search volume. Shimshoni et al. (2009) found that over half of the most popular Google search queries, when entered with categories, are almost 90% more accurate than individual searches.
During tough economic times, many businesses have also taken advantage of the predictive power of Google Trends. Carrière-Swallow and Labbé (2013) suggest that Google data is a promising source of information for nowcasting in the automobile industry. They also argue that such short-term models can capture aggregate demand in emerging markets. Ayoubkhani (2012), of the UK Office for National Statistics, found that Google Trends data appeared to have the most potential for quality assurance in retail sales, although the overall results were mixed and depended on the search terms used. Du et al. (2015) found that big data sources like Google Trends will shape the ability of marketers to monitor the evolution of online consumer tastes.

In the education field, Vaughan and Romero-Frías (2014) found a strong relationship between search volume for a university's name and the university's status, and showed, with examples, that people in different countries need to use different searching methods to derive the most value from the data. More advanced work on mining data from web search queries has followed, with Vaughan and Chen (2015) comparing Google Trends with Baidu Index. Their paper emphasises that not every country uses the same search engine. It is crucial to remember that Google does not dominate every search market.

2.2 Selection of Search Terms

The selection of search terms is an important starting point (Vaughan and Romero-Frías, 2014). Vaughan and Romero-Frías looked at two search terms which, without context, mean very similar (or the same) things. In one example, they found that searching for "Harvard" and for "Harvard University" led to very different results. "Harvard" had a much higher search rate than "Harvard University", which indicates the importance of selecting the most appropriate search term. Careful term selection is essential if we are to see accurate and relevant results.

We used Google Trends to test how important this point is. Figure 2.1 below shows the results, which help illustrate the ideas of Vaughan and Romero-Frías.


The blue line represents the search term Harvard and the red line represents Harvard University. We can see a massive difference between the two terms.

Figure 2.1: Harvard compared with Harvard University in Google Trends

2.3 Using shorter terms

A search engine user might decide to search for a stock in Google by using either its ticker or company name. Identifying search frequencies by company name can cause problems for two reasons. Firstly, investors may search the company name for reasons not connected to investing. For example, one may search "Best Buy" for online shopping rather than to collect financial information about the firm. This problem is more severe if the company name has multiple meanings (e.g., "Apple" or "Amazon") (Da et al., 2011).

We also need to be careful when using shorter terms for universities. Vaughan and Romero-Frías (2014) point out, for example, that searching for a university by an abbreviation such as USC could return very different results, e.g. University Student Council. These factors clearly need to be taken into account. Products or terms that are known by an abbreviation can act like homonyms and compromise the value of the data and search results. Figure 2.2 shows the difference between using the shorter term and the full term.

The blue line represents UEA (the abbreviation of University of East Anglia) and the red line represents the full name. The result shows that there is a big difference between the two terms. Our data will be affected if the terms are not used correctly.

Figure 2.2: UEA compared with University of East Anglia in Google Trends

2.4 Language independent

Search queries are mostly language-dependent. As Google is a worldwide search engine, if we were to analyse data for different countries or locations, we would have to think about the relevant search term for each country. Schmidt and Vosen (2013) point out that English and Spanish search terms are different and that direct translations may be used in different contexts.

Zhu et al. (2012) carried out a baseline query search using the most frequently used word in the Chinese language ("de", a preposition equivalent to "of" in English in both semantics and popularity). The purpose of charting such a trivial word is to obtain a measure that approximates the total search volume on both search engines throughout the period being studied, which is not publicly known. The baseline query gives us control over artificial trends in substantive queries. These artificial trends can be caused by a mere increase in Internet users (this has happened in Shenzhen, elsewhere in China and in many other countries) or by interruptive events (e.g., Google's decision to move its search engine servers from mainland China to Hong Kong).

They go on to compare the usage of the English and Spanish languages in Google searches, concluding that in Ginsberg's paper Google Trends is better at explaining the present than at predicting the future. Great importance is placed on limiting the search category and focusing on a narrow set of criteria in order to extract the most relevant data. They also claim that Google is the most commonly used search engine in each of the two countries (USA and Spain). This suggests that Google Trends would be the most appropriate platform to use, given that it reflects the population most accurately.

When comparing universities across the world, it is vitally important that we understand each country's culture well and understand how terms would be searched in that country. Faulty data would undermine the analysis, so terms must be chosen to make the data comparable and the results accurate.

2.5 Category of section terms

It is also important to consider how to use categories appropriately. Brynjolfsson et al. (2014) said that the advantages of using search volume based on Google's predefined categories include its ease of use and its ability to encompass multiple relevant search terms. The terms we are searching for are mainly focused on different university characteristics, e.g. course ranking. These are categorised in the education section. However, other relevant categories can influence our data collection and the accuracy of the project.

Choi and Varian (2012) used different categories to test the outcome of a prediction. Topics assessed included vehicles and parts, claims for unemployment benefits, travel and consumer confidence. They found that Google Trends variables improved short-term predictions in each of these categories. What we learn from this is that it is important to have real data in order to confirm or validate assumptions made when using our own data.

Without a category, results can mislead: the term "apple" may appear to be searched for disproportionately compared to any other fruit simply because it shares its name with a popular company. Therefore, we need to take into consideration the possible consequences of misleading results from unrelated but similar search terms.

Figure 2.3 shows the effect of adding the Jobs & Education category to a Google Trends search for University of East Anglia. The blue line represents the search for University of East Anglia and the red line represents the same search under the Jobs & Education category. Again, we can see the importance of searching within the appropriate category.

Figure 2.3: University of East Anglia compared with University of East Anglia under the Jobs & Education category in Google Trends

3 Explore Google Trends

The w3schools website shows that, from 2011 to 2014, Google Chrome was the most used web browser in the world. The popularity of the browser has increased continuously since October 2011, when 32.3% of users were choosing Chrome for internet usage, and this rose to 64.4% by the end of that period (w3schools, 2015). The premise here is that, because Google Chrome is so widespread and popular, more people will be using Google as their primary search engine and the data collected will be more reflective of the population as a whole, with the company receiving more and more useful data all the time. Its nearest competitors, Internet Explorer (IE) and Mozilla Firefox, are used much less, with Microsoft recently announcing plans to replace the much-criticised IE.

Google Trends may not tell us what is going to happen in the future, but it gives us an insight into patterns of searching. Its graphs are quite similar to those used to reflect the financial markets, although trends can be much more volatile and less well-behaved.

Although it is not a good tool for long-range prediction, it might be used to forecast the direction of trends in the near future (Choi and Varian, 2012). To search for our terms properly, we need to understand the important functionality of the software in detail. In this part of the project we describe how the software retrieves the data and how Google has made it user-friendly.

3.1 Functionality

From a user's point of view, the simple searching methods work fine, but we need to be more consistent when searching a particular subject, as the project focuses on one aspect. Google Trends offers functions that allow us to limit our parameters and concentrate on the topic we are really looking for. By explaining each function offered in Google Trends we can get a better understanding of what we are searching for.

3.2 Worldwide

Figure 3.1: Worldwide setting changed to United Kingdom in Google Trends

With the Worldwide setting, it is more difficult to limit the noise from other countries, which could have a misleading effect on the results of the intended search. By changing the parameter to United Kingdom we can reduce this margin for error. This means the results will cover one language and perhaps more context.

3.3 Set time

Figure 3.2: Set time function from October 2011 to October 2014 in Google Trends

We can also be precise by setting a particular time range to improve the accuracy of our data. In Google Trends a formula automatically generates the data or graphs that are most appropriate or relevant for the user. Therefore we can use these constraints to match the data that we have, i.e. October 2011 to October 2014.

3.4 Category setting

The category function is another setting that enables us to be more precise when searching for the correct terms. With this function, we can set the category of our searches specifically to Jobs & Education, which narrows the search queries to ones relating specifically to education, as required.

Figure 3.3: Regional interest

3.5 Regional interest

Figure 3.4: The results of Regional interest after searching in Google Trends

After searching a term in Google Trends, this function (under Regional interest) shows results for each region or area and the percentage for each of those regions. We can use this data to discover which regions attract the most searches and compare this to the application rate from UEA, to identify whether or not search interest has affected the application rate in that region.

4 Data Collection

During data collection it is important to find relevant pieces of data in the vast swamp of information, and to find out specifically what the data sets mean as well as what information can be extracted from where (Kristoufek, 2013). Google Trends data is reported with a weekly (Sunday to Saturday) frequency, so we can have the most up-to-date data, and we have the flexibility to compare data on a daily, weekly or yearly basis. Raw search volume data is sensitive information that Google is not able to publish publicly. Google Trends does not provide concrete numbers, but it does provide the Search Volume Index of a keyword, which is a processed version of the data (Lui et al., 2011).

Collecting primary governmental data such as an annual report can provide a third-party, independent view for comparison and analysis. Vaughan and Romero-Frías (2014) stressed the importance of real data in their paper and used QS, a professional body which produces world university rankings, to compare alternative data sources. They intended to compare data specifically between Spain and America. However, QS did not hold sufficient data on Spanish universities. Thus, they used another guide, Docampo, which included more information on Spanish universities for better comparison. Specifically, they collected data on the student and faculty population sizes, which were used to normalise the Google Trends search volume data. Based on their methodology, we could incorporate something similar.

4.1 UEA Application rate data

We have been given data from UEA on their application rate as primary data. This data covers the latest three years and features three categories: Home student, EU student and Overseas student.
The data shows how many applications were received in each week over the three years.
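Because Google Trends reports weeks running from Sunday to Saturday, the application counts need to be keyed to the same weeks before the two series can be compared. The following is a minimal sketch of one way this could be done, assuming (hypothetically) that the application records carry individual dates or that the UEA weeks start on a different day; the source data is only described as weekly, and the function names and sample dates below are invented for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def week_start(day: date) -> date:
    """Return the Sunday that opens the Sunday-Saturday week containing `day`."""
    # date.weekday() gives Monday=0 ... Sunday=6, so adding one (mod 7)
    # counts the days elapsed since the most recent Sunday.
    days_since_sunday = (day.weekday() + 1) % 7
    return day - timedelta(days=days_since_sunday)


def weekly_counts(application_dates):
    """Count applications per Sunday-Saturday week, keyed by that week's Sunday."""
    return Counter(week_start(d) for d in application_dates)


# Hypothetical application dates, invented purely for illustration.
sample_dates = [
    date(2011, 10, 2),  # a Sunday
    date(2011, 10, 5),
    date(2011, 10, 8),  # the Saturday closing the same week
    date(2011, 10, 9),  # the next Sunday, so a new week
]
counts = weekly_counts(sample_dates)
for sunday in sorted(counts):
    print(sunday.isoformat(), counts[sunday])
```

Keying each week by its Sunday means the resulting counts can be joined directly to Google Trends rows that are labelled by the week's first day, assuming the export uses that labelling.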

4.2 Google Trends data

Data from Google Trends was collected from the Google website, with the time range set to match the application data, and downloaded as a CSV file containing the up-to-date search rates for comparison with the application data.

4.3 Normalising data

Normalising data is a crucial part of the project. It is important to have the two data sets on an equal scale to be able to compare the data fairly and, most importantly, accurately. There are many methods that we can use to do this. According to Google, the method they use to normalise their data is to present it on a scale from 0 to 100. Each point on the graph is divided by the highest point and multiplied by 100. When they do not have enough data, the value is shown as 0.

For our data we followed a similar approach to Google, using the application rate data for comparison. I took the weekly application count, divided it by the total number of applications and multiplied by 100 to create an index comparable to Google's index. This means we can compare the two data sets.

Another method that we used is scaling (min-max) normalisation. The scaling method is much more suitable, as its formula is the closest to the Google Trends data. The formula is

\[ X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \times 100 \]

where X' is the normalised value, which lies between 0 and 100, X is the current application value, X_min is the smallest number in the application data and X_max is the largest number in the application data. After putting the numbers into the formula and multiplying by 100, we have the normalised data, which we can compare with the Google index data. Google Trends data is already normalised, so the formula only needs to be applied to raw data such as the application counts.
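The two rescaling rules described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the SPSS or MATLAB procedure actually used in the study; the function names and the sample counts are hypothetical.

```python
def min_max_index(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 100."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal, so min-max scaling is undefined")
    # Multiply before dividing to keep the arithmetic exact for integer counts.
    return [(v - lo) * 100 / (hi - lo) for v in values]


def peak_index(values):
    """Divide each value by the highest value and multiply by 100,
    the rule the text attributes to Google Trends."""
    peak = max(values)
    if peak == 0:
        raise ValueError("the highest value is zero, so the index is undefined")
    return [v * 100 / peak for v in values]


# Hypothetical weekly application counts, invented for illustration.
weekly_applications = [20, 50, 120, 80]

print(min_max_index(weekly_applications))
print(peak_index(weekly_applications))
```

The two rules agree at the top of the scale, since the busiest week maps to 100 in both, but differ at the bottom: min-max scaling forces the quietest week to 0, whereas the peak-based index only reaches 0 when the raw value itself is 0.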

4.4 Comparison with competitor Universities

The universities I have selected for their similar location and ranking (those most likely to be of the same pedigree as our university) are the University of Essex (shown in yellow) and the University of Sussex (shown in purple). These are ideal as they are both universities with a long history. The other two I selected are the University of Kent (shown in red) and the University of Warwick (shown in green): Warwick is ranked higher than UEA (shown in blue) and Kent is ranked at around the same level as UEA. I therefore expected them to show similar patterns when the data are compared. I used the Google Trends function that restricts the search region to the UK in order to reduce noise and make sure that all the searches are relevant.

Figure 4.1: Search of the five universities in Google Trends

The graph above shows that all the universities follow a very similar pattern: the most searched period is before the start of term.

The trends for all five universities follow the same pattern; the difference between them is that the search rate for some universities is higher than for others. I used the university league table from The Complete University Guide (The Complete University Guide Independent Trusted, 2015) to obtain accurate data, and listed the ranking of each university up to now to see if there is any link between reputation and search rate. We can assume that higher ranking universities should have a higher search rate as their reputation is higher.

University of East Anglia

Figure 4.2: Regional interest of search term University of East Anglia

University of East Anglia: rank 15, 2014 rank 20, 2013 rank 27, rank 27, 2011 rank 28

University of Kent

Figure 4.3: Regional interest of search term University of Kent

University of Kent: rank 22, 2014 rank 28, 2013 rank 33, rank 34, rank 38

University of Essex

Figure 4.4: Regional interest of search term University of Essex

University of Essex: rank 39, 2014 rank 39, 2013 rank 39, rank 38, rank 37

University of Warwick

University of Warwick: 2015 rank 7, 2014 rank 8, rank 6, rank 8, rank 7

Figure 4.5: Regional interest of search term University of Warwick

University of Sussex

Figure 4.6: Regional interest of search term University of Sussex

University of Sussex: rank 38, 2014 rank 31, 2013 rank 21, 2012 rank 19, rank 19

Looking at the rankings of each university, we can see that the higher ranked universities have a higher search rate, but we can also see that they all have a similar trend line, which means that the searches follow the same pattern in any particular period. We can assume that, because all the universities in the UK have similar application deadlines, the search rates on Google Trends are similar. Another interesting finding is that most of the search rate for each university comes from the university's own region. For instance, for the University of East Anglia the top search location is Norwich, for the University of Kent it is Kent, for the University of Essex it is Tiptree, for the University of Warwick it is Warwick and for the University of Sussex it is Brighton. These results suggest that local residents pay more attention to their own university, which indicates that each of the five universities is likely to attract more local students than students from other areas. These assumptions will help with the Home application data when we make the comparison in the analysis part of the project.

4.5 Selecting a good model

Model selection is an important part of any statistical analysis. In this paper, apart from determining whether there is a relationship between Google Trends data and UEA Application data, we will use different modelling methods to determine which model best predicts the values. There are different ways of comparing models:

1. Error measures in the estimation period: root mean squared error and mean absolute error.
2. Residual diagnostics and goodness-of-fit tests: plots of actual and predicted values; plots of residuals versus time, versus predicted values, and versus other variables; residual autocorrelation plots, cross-correlation plots, and tests for normally distributed errors; measures of extreme or influential observations; tests for excessive runs, changes in mean, or changes in variance.
3. Qualitative considerations: intuitive reasonableness of the model, simplicity of the model, and above all, usefulness for decision-making.

The root mean square error (RMSE) has been used as a standard statistical metric to measure model performance in different fields. It is more sensitive than other measures to the occasional large error. The formula for RMSE is

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ are the observed values and $\hat{y}_i$ are the fitted values from the model.
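To make the difference between the two error measures listed above concrete, the following is a minimal Python sketch of RMSE alongside the mean absolute error. The observed and fitted values are hypothetical numbers chosen for illustration, not results from this study.

```python
import math


def rmse(observed, fitted):
    """Root mean squared error: square root of the mean squared difference."""
    squared = [(y - y_hat) ** 2 for y, y_hat in zip(observed, fitted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(observed, fitted):
    """Mean absolute error: mean of the absolute differences."""
    absolute = [abs(y - y_hat) for y, y_hat in zip(observed, fitted)]
    return sum(absolute) / len(absolute)


# Hypothetical observed values and two sets of fitted values.
observed = [10, 20, 30, 40]
evenly_off = [12, 18, 32, 38]   # every prediction is off by 2
one_big_miss = [10, 20, 30, 48]  # three exact predictions, one off by 8

print(mae(observed, evenly_off), rmse(observed, evenly_off))
print(mae(observed, one_big_miss), rmse(observed, one_big_miss))
```

Both sets of fitted values have the same MAE, but the set containing a single large miss has the larger RMSE, which illustrates the sensitivity of RMSE to occasional large errors.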

The Mean Absolute Error (MAE) measures how far the predicted values are from the observed values. The formula for the MAE is

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|x_{\mathrm{pred},i} - x_{\mathrm{obs},i}\right|$$

where $x_{\mathrm{pred},i}$ is the predicted value and $x_{\mathrm{obs},i}$ is the observed value.

In this paper, I opted to use the Mean Absolute Error and then compare the plots of the actual and predicted values, because these are easily accessible in SPSS and simple to interpret. Willmott and Matsuura (2005) suggested that the RMSE is not a good indicator of average model performance and might be a misleading indicator of average error, so MAE would be a better metric for that purpose. Dr. Robert Nau said that, when comparing among models, one should put the most weight on the error measures in the estimation period: most often the RMSE (or standard error of the regression, which is RMSE adjusted for the relative complexity of the model), but sometimes MAE or MAPE. Chai and Draxler (2014) conducted a study to compare RMSE and MAE. Their findings indicate that MAE is a more natural measure of average error and (unlike RMSE) is unambiguous; dimensioned evaluations and inter-comparisons of average model-performance error should therefore be based on MAE. By rule, we choose the model with the lowest MAE as the best model.

5 Analysis

The variables in the study include the following:

1. The Target Variable is the UEA Application data.
2. The predictors are the weekly Google Trend data, collected with the time range set from October 2011 to October 2014, and the Application Rate data for the past three years in three categories: Home, EU, and Overseas students.

The data was normalized to put the data sets on an equal scale so that they can be compared fairly and accurately.

The formula for normalisation is:

$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \times 100$$

A first order Autoregressive Model, or AR(1), was tested to predict the Application Rate data; that is, we test how well this model fits the Application Rate data. Autoregression is a regression model in which $Y_t$ is regressed against its own lagged values. In this study, this means that the current value of the application data depends directly on the immediately preceding value. The four predictors were also included in the time series model to determine whether there is a significant effect alongside the Google Trend data. SPSS and MATLAB were used to analyse the data. The following are the tools used in the analysis:

1. Sequence Chart shows the distribution of the UEA Application data, Google Trend data, and the Application Rate (Home, EU, and Overseas), as well as their normalized versions.
2. Scatter Plot with Linear Model (R Square) shows the relationship or association between two variables. In this paper, UEA Application data is plotted against the Google Trend data to see if there is a trend. The R Square shows the relationship and fit of the Google Trend and UEA Application data. The closer the value is to 1, the better.
3. Multiple Linear Regression Analysis is used when we want to predict the value of a variable based on the values of two or more other variables. In this paper, it shows the relationship between the target variable (UEA Application data) and the predictors (Google Trend data, Home, EU, and Overseas Application Rate). This analysis uses the following diagnostic tools:
   a) R Square is used to determine how close the data are to the fitted regression line.
   b) The Coefficients Table is used to determine which predictors (Google Trend data, Home, EU, and Overseas Application Rate) are significant for our Target Variable, the UEA Application data.

4. Time Series Analysis examines the relationship between successive values in data that represent consecutive measurements taken at equally spaced time intervals.
   a) Autocorrelation and Partial Autocorrelation Functions determine the significant lags in the sequence data.
   b) Model Fit Statistics (Ljung-Box Q) determine whether the model exhibits a fit. A p-value greater than 0.05 indicates that the residuals are random. The Ljung-Box test determines whether the data are independently distributed, which means there are no serial correlations in the data being studied; in simple terms, the data is randomly distributed (Ljung and Box, 1978). The hypotheses for the Ljung-Box test are:
      H0: The residuals are random, thus the model does not exhibit lack of fit.
      Ha: The residuals are not random, thus the model exhibits lack of fit.
      Given a time series Y of length n, the test statistic is defined as

      $$Q = n(n+2)\sum_{k=1}^{m}\frac{\hat{r}_k^{2}}{n-k}$$

      where $\hat{r}_k$ is the estimated autocorrelation of the series at lag k, and m is the number of lags being tested. This test is a diagnostic tool used to test the lack of fit of a time series model. The test examines m autocorrelations of the residuals. If the autocorrelations are very small, we conclude that the model does not exhibit significant lack of fit. A lack of fit test makes sure that the proposed model fits well with the set of observations.
   c) Stationary R Squared and MAE (Mean Absolute Error) are used as tests in comparing models. A higher Stationary R Squared and a lower MAE indicate a better model fit.
   d) ARIMA Model Parameters show the significant predictors in the model (Araz et al., 2014).
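The Ljung-Box statistic defined in item 4 b) can be computed directly from its formula. The sketch below is a plain Python illustration under stated assumptions: the function names are hypothetical, the residual series is made up, and the step of converting Q into a p-value (which SPSS reports, and which requires a chi-square distribution) is deliberately left out.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((value - mean) ** 2 for value in series)
    numerator = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numerator / denominator


def ljung_box_q(residuals, max_lag):
    """Q = n(n + 2) * sum over k = 1..m of r_k^2 / (n - k)."""
    n = len(residuals)
    total = sum(
        autocorrelation(residuals, k) ** 2 / (n - k)
        for k in range(1, max_lag + 1)
    )
    return n * (n + 2) * total


# Hypothetical residuals that alternate in sign, i.e. strongly autocorrelated.
alternating = [1, -1, 1, -1, 1, -1, 1, -1]
print(ljung_box_q(alternating, max_lag=1))
```

A strongly patterned residual series such as the alternating one above produces a large Q, whereas residuals with very small autocorrelations produce a Q close to zero, which is the situation in which the model is judged not to exhibit lack of fit.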

5.1 Distribution of the UEA Application, Google Trend, and Home, EU, and Overseas data

Figure 5.1: Sequence Chart for UEA Application data

Figure 5.1 shows the sequence chart of the UEA data. Each year there is a noticeable increase in UEA Applications at around the 40th week, and a noticeable decrease around the last weeks of the year (50th to 53rd).

Figure 5.2: Sequence Chart for Google Trend data

Figure 5.2 shows the Google Trend search data for the past three years. There is a significant decrease in the last weeks of the year (50th to 53rd), similar to the UEA Application data. However, there is also a significant increase in the 17th week that cannot be seen in the UEA Application data.

Figure 5.3: UEA Application Rate for Home Student

Figure 5.3 shows the application rate for UK Home students at UEA over the past three years. There is a trend each year: the beginning of the graph, which starts in September, has the highest number of applicants; the number keeps dropping every month after that and increases again as the next year approaches. The fact that this pattern repeats across three consecutive years suggests that the data is reliable. We can use this chart to observe differences from the Google Trends data for the past three years and see whether the two are correlated.

Figure 5.4: UEA Application Rate for EU Student

Figure 5.4 shows the EU student application rate from the data provided by UEA. The graph shows some similarity to Figure 5.3: the majority of applications also arrive during the first 20 weeks, with a drop each week after that. The number of EU applicants is, however, much lower than the number of Home applicants. We can therefore compare these with the Overseas student application rate to see whether Overseas students show a pattern similar to both graphs.

Figure 5.5: UEA Application rate for Overseas Student

Figure 5.5 shows all the applications from Overseas students over the three years. We can clearly see that every year the busiest time for UEA to receive applications is the first 5.7 weeks, after which the number decreases until the end of each year. This is because students apply to the university in the first 5.7 weeks, and also because of postgraduate students, many of whom are international students, applying to enter at the start of the January term.

Figure 5.6: Google Trends Data

Three years of Google Trends data are put together for comparison, to determine whether they are correlated. The graph shows a yearly trend in the data: a significant increase in the first 10 weeks, a significant drop in the 12th week, and then an increase from the 15th week up to the last week.

5.2 Linear Regression

Figure 5.7: Linear Regression Table (Coefficients for Model 1; rows: (Constant), Google Trend Data; columns: Unstandardized Coefficients B and Std. Error, Standardized Coefficients Beta, t, Sig.; the table also reports R Squared; dependent variable: index UEA)

Figure 5.7 shows the Linear Regression wherein the predictor is the Google Trend data. The target variable is the normalized UEA Application rate data.

The Unstandardized Coefficients (B) are the regression coefficients. In this paper, the equation takes the form UEA Application data = constant + B × (Google Trend data).

The Standard Errors are the standard errors of the regression coefficients. They can be used for hypothesis testing and constructing confidence intervals.

The Standardized Coefficients are the values for a regression equation if all of the variables are standardized to have a mean of zero and a standard deviation of one.

The t statistic tests the null hypothesis that a population regression coefficient B is 0.

The Sig., also known as the p-value, is used in testing this null hypothesis. It shows which of the predictors have a significant relationship with the target variable. In this paper, the significant predictor is the Google Trend data.

Figure 5.8 plots the Normalized UEA Application data (Y) against the Google Trend data. The R Square value is 0.370, which is moderately low. The line represents the model and shows an increasing trend: as the Google Trend data increases, so does the UEA Application data.

5.3 Multiple Regression Analysis

We have seen the relationship between the Google Trend data and the Normalized UEA Application data. Let us see whether the model improves if we add the Home, EU, and Overseas data.
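As a point of reference before adding predictors, the single-predictor fit from Section 5.2 (a constant, one coefficient B, and the R Square) can be sketched in plain Python. The study itself used SPSS; the function names and the two short series below are hypothetical and serve only to show how the quantities are obtained.

```python
def fit_simple_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_residual / ss_total


# Hypothetical normalised indices (illustrative only, not the study data).
google_trend_index = [20, 40, 60, 80, 100]
uea_index = [15, 30, 35, 60, 70]

intercept, slope = fit_simple_regression(google_trend_index, uea_index)
print(intercept, slope, r_squared(google_trend_index, uea_index, intercept, slope))
```

A positive slope corresponds to the increasing trend line in the scatter plot, and an R Square closer to 1 indicates that the line explains more of the variation in the application index.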

Figure 5.8: Scatter Plot with Linear Trend for UEA Application and Google Trend data.

Figure 5.9: Multiple Regression Coefficients (Coefficients for Model 1; rows: (Constant), Google Trend Data, Home, EU, Overseas; columns: Unstandardized Coefficients B and Std. Error, Standardized Coefficients Beta, t, Sig.; the table also reports R Square; dependent variable: indexuea)

Figure 5.9 shows the Multiple Regression Coefficients wherein the predictors are the Home, EU, and Overseas Acceptance data and the Google Trend data. The target variable is the normalized UEA Application rate data.

The Unstandardized Coefficients (B) are the regression coefficients. In this paper, the equation takes the form UEA Application data = constant + B × (Google Trend data).

The Standard Errors are the standard errors of the regression coefficients. They can be used for hypothesis testing and constructing confidence intervals.

The Standardized Coefficients are the values for a regression equation if all of the variables are standardized to have a mean of zero and a standard deviation of one.

The t statistic tests the null hypothesis that a population regression coefficient B is 0.

The Sig., also known as the p-value, is used in testing this null hypothesis. It shows which of the predictors have a significant relationship with the target variable. In this paper, the significant predictor is the Google Trend data.

Figure 5.10: Sequence Chart for the Normalized UEA Application data and the predicted values from the Model

Figure 5.10 shows the fit of the model to the UEA Application data. The UEA data and the predicted fit data follow an almost identical trend. Some discrepancies can be seen in the middle weeks of 2014, where the predicted fit data increased drastically while the actual UEA Application data decreased.

5.4 Time Series Analysis

Figure 5.11: Autocorrelation Function for UEA Application data

The ACF (Autocorrelation Function) pattern is used to identify time series models. Figure 5.11 shows the ACF of the UEA data. The results show a gradual decay in the ACF.

Figure 5.12: Partial Autocorrelation Function for UEA Application data

Figure 5.12 shows the PACF for the UEA data. The results show a significant spike at the first lag, which indicates that the model might be a first order autoregressive model, or AR(1).

Figure 5.13: Model Statistics (model: indexuea-Model_1; columns: Number of Predictors; Model Fit statistics: Stationary R-squared, R-squared, MAE; Ljung-Box Q(18): Statistics, DF, Sig.; Number of Outliers)

Figure 5.13 shows the number of predictors that were in the generated model. The results show that one predictor was used. The results also show that the data are independently distributed (Ljung-Box (17), p-value = 0.808); therefore we do not have to worry about serial correlation, and the model does not exhibit lack of fit.

The Stationary R Squared determines whether the model is better than the baseline one. According to experts, a positive Stationary R Squared means that the model under consideration is better than the baseline model. In our case, the Stationary R Squared is 0.530, a positive value, which means that the model generated by SPSS is better than the baseline model.

Figure 5.14: ARIMA Model Parameters (model: indexuea-Model_1; rows: indexuea (No Transformation): Constant, AR Lag; Google Trend Data (No Transformation): Numerator Lag; columns: Estimate, SE, t, Sig.)
a. For some models, some predictor series are not considered by expert modeller due to missing values found in the estimation period.

The results show that UEA Application data can be predicted by its previous value (lag 1) and the current value of the Google Trend data (lag 0).

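The Estimates are the coefficients in the Time Series equation:

$$y_t = 22.331 + 0.622\,y_{t-1} + 0.436\,x_1 + e_t$$

where $y_t$ is the UEA Application data in week t, $y_{t-1}$ is its previous value (lag 1), $x_1$ is the current value of the Google Trend data (lag 0) and $e_t$ is the error term.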
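Taking the equation exactly as written above, a one-step-ahead prediction is simple arithmetic. The following Python sketch uses the three reported estimates; the function names and the input series are hypothetical, and the sketch does not reproduce the SPSS Expert Modeller procedure that produced the estimates.

```python
# Estimates as reported in the time series equation above.
CONSTANT = 22.331
AR_LAG_1 = 0.622           # weight on last week's UEA Application index
GOOGLE_TREND_LAG_0 = 0.436  # weight on this week's Google Trend index


def predict_next(previous_uea, current_trend):
    """Predicted UEA index for week t, with the error term set to zero."""
    return CONSTANT + AR_LAG_1 * previous_uea + GOOGLE_TREND_LAG_0 * current_trend


def one_step_forecasts(uea_index, trend_index):
    """Predictions for weeks 1..n-1, each using the observed previous week."""
    return [
        predict_next(uea_index[t - 1], trend_index[t])
        for t in range(1, len(uea_index))
    ]


# Hypothetical indexed series (illustrative only).
uea_index = [50, 55, 40, 30]
trend_index = [60, 58, 45, 35]

print(predict_next(50, 60))
print(one_step_forecasts(uea_index, trend_index))
```

Comparing such predictions with the observed values week by week is what the MAE in the model statistics summarises.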

The Standard Error is a measure of the accuracy of predictions made with a regression line. The SE for the predictor (Google Trend data) is fairly low, indicating an accurate prediction of the regression fit line.

The t value is used to compute the p-value.

Figure 5.15: Model fit sequence chart

Figure 5.15 shows the sequence chart for the fit (the value predicted by the model), the observed values (the base UEA data), and the Upper and Lower Limits (UCL and LCL). It can be seen that the model used with the Non-Indexed data and the Indexed data are the same.

5.5 Selecting the best model

Figure 5.16 shows the MAE comparisons of each of the models. By rule, we select the model with the lowest MAE. The Time Series model shows the lowest MAE among the three. It also has the highest R Square of the three, which means its predicted values are closer to the observed values (UEA Application data). Therefore we conclude that ARIMA (1,0,0) is the right