BUSS1020 Quantitative Business Analysis

Week 1 - Introduction and Collecting Data

Process of statistical analysis
1. Define the objective, and understand the data we need to collect.
2. Collect the required data type, using an appropriate process.
3. Organise the data:
   3.1. Clean out any extraneous data points, missing data, and blatant outliers.
   3.2. Prepare it in a form suitable for analysis.
   3.3. Tabulate and summarise it.
4. Visualise the data with graphs and charts.
5. Analyse the data.

Through better understanding of the data they generate and collect, businesses can make better decisions.

Branches of statistics used in business
a. Descriptive: collecting, summarising, presenting and organising data.
b. Inferential: inferring conclusions about a population based on a sample.
c. Predictive: predicting future outcomes based on a sample.

Some basic vocabulary
Variables, or attributes, are characteristics of an item or individual (e.g. age).
Data are the observed values or outcomes of one or more variables (e.g. 25 years old).
The operational definition of a variable is a universally accepted, clearly defined meaning of what that variable is (e.g. the operational definition of age would be the number of complete years a person has been alive).
A population consists of all the items or individuals about which you want to draw a conclusion (the "large group").
A census is an analysis and data collection of every item or individual in a population.
A parameter is a numerical measure that describes a relevant characteristic of a population. (Think of parametric equations - they use a variable to describe a curve.)
A sample is the portion of a population selected for analysis (the "small group").
A statistic is a numerical measure that describes a characteristic of a sample. Often a statistic estimates a parameter.
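To make the parameter versus statistic distinction concrete, here is a minimal Python sketch. The population of customer ages is invented purely for illustration (it is not course data), and the sample size of 50 is an arbitrary assumption.

```python
import random
import statistics

# Hypothetical population: ages (in complete years) of 1,000 customers.
random.seed(1)  # fixed seed so the sketch is repeatable
population = [random.randint(18, 80) for _ in range(1000)]

# Parameter: describes the whole population (only available from a census).
population_mean = statistics.mean(population)

# Statistic: describes a sample, and is used to estimate the parameter.
sample = random.sample(population, 50)  # 50 customers, without replacement
sample_mean = statistics.mean(sample)

print(f"Parameter (population mean age): {population_mean:.1f}")
print(f"Statistic (sample mean age):     {sample_mean:.1f}")
```

The two numbers will usually be close but not identical; that gap is the sample-to-sample variation that inferential statistics has to account for.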

Types of variables
a. Categorical (e.g. car-drivers, bicycle-riders, public-transport-takers).
b. Numerical:
   i. Discrete variables arise from a counting process (e.g. age, number of children, defects per hour).
   ii. Continuous variables arise from a measuring process and can be assigned any value, or a large range of possible values, within a given interval (e.g. height, stock price, time).

Measurement scales (levels of data usage and usefulness)
a. Nominal: labels used to distinguish different categories that have no order (e.g. employment classification as: teacher; construction worker; lawyer; doctor; other).
b. Ordinal: labels used to classify and rank data points (e.g. the program was: not at all helpful; somewhat helpful; mostly helpful; extremely helpful).
c. Interval: data are numerical and differences between values have a consistent meaning (e.g. temperature, calendar dates, scaled marks).
d. Ratio: data are numerical with consistent meaning given to distances between values, plus the point 0 has a true meaning (e.g. weight, revenue, Facebook likes). Because 0 means none of the quantity, ratios between values are meaningful, and many ratio-scaled variables (such as weight) cannot go below 0.

Sources of primary and secondary data
a. Organisations distributing data, such as stock prices, weather conditions, sports stats, and search engine results.
b. A designed experiment where researchers control treatments given, such as testing for fertiliser effectiveness.
c. A survey where researchers directly ask people questions, such as political polls.
d. Observational studies where researchers observe a phenomenon, such as traffic volume.
   i. These studies usually involve time-series data, which measure the level of a particular variable at several equally-spaced points in time.
e. Automated and streaming data, such as GPS data, browser history, metadata.
   i. This is often what we mean by big data. Big data has four characteristics: volume, velocity, variety, and veracity.
In BUSS1020, we focus on structured data, which is stored in easy-to-use databases or tables. There is, however, a lot of unstructured, messy data floating around that cannot be easily stored (e.g. tweets, emails, texts, podcasts, video streams). Unstructured data needs a lot of preparation and organisation before it can be properly analysed.
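The difference between the interval and ratio scales described above can be checked with a little arithmetic. The sketch below uses made-up values: two temperatures in degrees Celsius (interval scale) and two weights in kilograms (ratio scale). The kelvin conversion is included only to show that a ratio of Celsius readings depends on where the arbitrary zero sits.

```python
# Interval scale: temperature in degrees Celsius (0 is an arbitrary point).
temp_a_c, temp_b_c = 10.0, 20.0
difference_c = temp_b_c - temp_a_c          # differences are meaningful

# The same two temperatures in kelvin, a scale with a true zero.
temp_a_k, temp_b_k = temp_a_c + 273.15, temp_b_c + 273.15
difference_k = temp_b_k - temp_a_k          # same 10-degree difference

ratio_celsius = temp_b_c / temp_a_c         # 2.0, but not "twice as hot"
ratio_kelvin = temp_b_k / temp_a_k          # roughly 1.035

# Ratio scale: weight in kilograms (0 kg means no weight at all).
weight_a_kg, weight_b_kg = 10.0, 20.0
ratio_weight = weight_b_kg / weight_a_kg    # 2.0, genuinely twice as heavy

print(f"Difference: {difference_c:.1f} C = {difference_k:.1f} K")
print(f"Ratio in Celsius: {ratio_celsius:.3f}, in kelvin: {ratio_kelvin:.3f}")
print(f"Weight ratio: {ratio_weight:.1f}")
```

The difference survives the change of units, but the ratio does not, which is why ratios are only interpreted on ratio-scaled variables.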

We use sampling because most of the time, we can't get the whole population. Sampling is less time-consuming and less costly than a census. To accurately sample, we need to first generate a sample frame. A sample frame is a list of items or individuals that are in the population and can be sampled (that is, they have meaningful data that we can organise, visualise and analyse). Biased results can occur if parts of the population are excluded (for example, how did pollsters get the 2015 UK General Election, Brexit and Trump wrong?) - we need to choose a representative sample that is as broad as practicably possible.

Different types of samples we can take
a. In non-probability samples, items are chosen without regard to their probability of occurrence.
   i. A convenience sample is based on selecting items that are convenient.
   ii. A judgment sample is based on experts selecting the most appropriate items, taking convenience into account.
   iii. A self-selecting sample is where individuals choose to participate.
   iv. A quota sample is where pre-set quotas of groups are chosen, by convenience.
b. Non-probability samples may not be representative, which can lead to biased results.
c. In probability samples, items are chosen randomly, sometimes using known probabilities that closely match those in the population.
   i. A simple random sample (SRS) is where every item has an equal chance of being selected (e.g. drawing names out of a hat), with or without replacement.
   ii. A systematic sample is where you divide the sample frame of N items into n groups of k items each (k = N/n). You then randomly select one item from the first group, and select every k-th item thereafter, which gives one item from each group up to group n.
   iii. A stratified sample is where you divide the sample frame into strata according to a characteristic. You then select a simple random sample from each stratum, with the size of each sample proportional to the relative size of its stratum. Then, the selected items are combined into one sample.
      1. This can be used to ensure proportionate representation and that minorities are included.
   iv. A cluster sample is where you divide the population into several clusters, each representative of the population. You then select a simple random sample of clusters, and use the items in your selected clusters as your sample.
      1. This is often used in election exit polls, where results are time-sensitive.
d. When taking samples, you need to consider cost, efficiency, representation and avoiding bias.

Possible errors you could make in surveying a sample
a. Coverage error or selection bias: some groups are excluded from the frame and have little to no chance of being selected.
b. Non-response error or non-response bias: people who choose not to respond may be different from those who do respond.
c. Sampling error: variation from sample to sample is natural and expected.
d. Measurement error: arises from weaknesses in question design (questions that are ambiguous, unclear or leading), respondent confusion, and the interviewer's effects on the respondent.
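The four probability sampling methods above can be sketched with Python's standard random module. Everything here is an illustrative assumption: the frame of 100 customers, the "metro"/"rural" strata, the sample size of 10, and the way clusters are formed. The systematic sample uses a random start within the first group of k items, which is the usual textbook approach.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical sample frame: N = 100 customers, each tagged with a region.
frame = [{"id": i, "region": "metro" if i < 70 else "rural"}
         for i in range(100)]
N, n = len(frame), 10
k = N // n  # number of items in each systematic group

# Simple random sample: every item has an equal chance (without replacement).
srs = random.sample(frame, n)

# Systematic sample: random start within the first k items, then every k-th.
start = random.randrange(k)
systematic = frame[start::k]

# Stratified sample: an SRS from each stratum, sized proportionally.
strata = {}
for item in frame:
    strata.setdefault(item["region"], []).append(item)
stratified = []
for members in strata.values():
    share = round(n * len(members) / N)  # proportional allocation
    stratified.extend(random.sample(members, share))

# Cluster sample: split the frame into clusters, then pick whole clusters.
clusters = [frame[i::5] for i in range(5)]  # 5 hypothetical clusters of 20
chosen = random.sample(clusters, 2)
cluster_sample = [item for cluster in chosen for item in cluster]

print("SRS ids:       ", sorted(item["id"] for item in srs))
print("Systematic ids:", [item["id"] for item in systematic])
print("Stratified:    ", len(stratified), "items")
print("Cluster sample:", len(cluster_sample), "items")
```

Note that the stratified sample always contains 7 metro and 3 rural customers, mirroring the 70/30 split in the frame, whereas the simple random sample only matches that split on average.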

Week 2 - Organising and Visualising Data

Data is organised and visualised so we can reveal, communicate and gain insight from the patterns hidden within it.

a. Organising One Variable Categorical Data
A summary table indicates the number or percentage of values that fall within each category. We can see the relative frequency of each category, and compare differences between categories.

b. Visualising One Variable Categorical Data
A bar chart has one bar for each category, and each bar length represents the number or percentage of values in that category.
A pie chart is a shaded circle with one slice for each category, and the size of a slice represents the percentage of all values in that category.
A Pareto chart contains both a vertical bar chart, with categories shown in descending order of frequency, and a line graph, which represents the cumulative total. Pareto charts highlight the most important among a (typically large) set of factors.
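As a rough illustration of a summary table and the numbers behind a Pareto chart, the sketch below counts a small set of invented commuting responses, sorts the categories by descending frequency, and accumulates the percentages that the Pareto line graph would plot. The data and variable names are assumptions made for this example only.

```python
from collections import Counter

# Hypothetical responses to "How do you usually commute?"
responses = ["car", "bus", "car", "bicycle", "car", "train",
             "bus", "car", "train", "car"]

counts = Counter(responses)
total = sum(counts.values())

# Summary table in Pareto order: categories sorted by descending frequency,
# with a running (cumulative) percentage for the line graph.
cumulative = 0.0
pareto_rows = []
for category, count in counts.most_common():
    percent = 100 * count / total
    cumulative += percent
    pareto_rows.append((category, count, percent, cumulative))

print(f"{'Category':<10}{'Count':>6}{'Percent':>9}{'Cumulative':>12}")
for category, count, percent, cum in pareto_rows:
    print(f"{category:<10}{count:>6}{percent:>8.1f}%{cum:>11.1f}%")
```

The count and percent columns are what a bar or pie chart would display; the cumulative column is the extra line that turns a sorted bar chart into a Pareto chart.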

c. Organising Two Variable Categorical Data
A contingency table cross-tabulates the responses of the categorical variables in question. It can show patterns or relationships between two or more categorical variables.

d. Visualising Two Variable Categorical Data
A side-by-side bar chart draws a separate set of bars for each category of one variable, with one bar per category of the other variable. When percentages are plotted, each set adds up to 100%.

e. Organising Numerical Data
An ordered array is a sequence of data, in rank order, from the smallest to largest value. It shows the full range of values and may help identify outliers.
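A contingency table and an ordered array can both be built with a few lines of Python. The survey records and ages below are hypothetical, chosen only to show the mechanics: the table counts each combination of two categorical variables, and the ordered array is simply the sorted list of values.

```python
from collections import Counter

# Hypothetical survey records: (employment status, usual commute mode).
records = [("full-time", "car"), ("full-time", "bus"), ("part-time", "bus"),
           ("full-time", "car"), ("part-time", "car"), ("part-time", "bus"),
           ("full-time", "car"), ("part-time", "bicycle")]

# Contingency table: count each (row, column) combination.
cells = Counter(records)
rows = sorted({status for status, _ in records})
cols = sorted({mode for _, mode in records})

print(f"{'':<12}" + "".join(f"{c:>9}" for c in cols) + f"{'Total':>9}")
for r in rows:
    row_counts = [cells[(r, c)] for c in cols]  # missing cells count as 0
    print(f"{r:<12}" + "".join(f"{v:>9}" for v in row_counts)
          + f"{sum(row_counts):>9}")

# Ordered array: numerical values sorted from smallest to largest.
ages = [34, 19, 52, 27, 91, 23, 30]
ordered = sorted(ages)
print("Ordered array:", ordered)
```

In the ordered array, the largest value sits well apart from the rest, which is the kind of potential outlier that rank ordering makes easy to spot.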