BUSS1020. Quantitative Business Analysis. Lecture Notes


Week 1: Unit Introduction

Introduction

Analytics is the discovery and communication of meaningful patterns in data.
Statistics is the study of the collection, analysis, interpretation, presentation and organization of data.

DCOVA:
  Define the problem or objective and the data required (design).
  Collect the required data (in an appropriate manner).
  Organize the data: clean it, prepare it for analysis, tabulate and summarize it.
  Visualise the data.
  Analyse the data.

At each step of a statistical analysis the type of the data must be known. The data type strongly influences how the analysis proceeds, the choice of methods, etc. This affects all parts of DCOVA.

Three Different Branches Of Statistics Are Used In Business:

Statistics: methods that collect, describe and transform data into useful insights for decision makers.
Descriptive:
  o Collecting, summarizing, presenting and organizing data.
Predictive:
  o Using a model and data to make forecasts of outcomes.
Inferential:
  o Using data collected from a small group to draw conclusions about a larger group.
  o Estimation, e.g. estimate the population average amount spent using the sample average amount spent.
  o Hypothesis testing, e.g. testing the claim that the population average amount spent in one group is larger than that in another group.

Basic Vocabulary:

Variable:
  o Variables are characteristics of an item or individual. Data on one or more variables is what you analyse when you use a statistical method. (In analytics, variables are often called attributes.)
Data:
  o Data are the observed values or outcomes of one or more variables.
Operational Definition:
  o Variables should have universally accepted meanings that are clear to all associated with an analysis; the clearly defined meaning is the operational definition.
Population:
  o A population consists of all the items or individuals about which you want to draw a conclusion. The population is the large group.
Sample:
  o A sample is the portion of a population selected for analysis. The sample is the small group.
Parameter:
  o A parameter is a numerical measure that describes a relevant characteristic of a population.
Statistic:
  o A statistic is a numerical measure that describes a characteristic of a sample. Often a statistic estimates a parameter.

1. Defining Data:

For every variable you wish to examine you must provide an operational definition. The definition should identify the values of the variable, to ensure the data are acceptable for analysis.
Ensure that the defined data is clear, e.g. are yearly sales for an individual, a store or a chain?
Types Of Variables:
  o Categorical: variables collected through qualitative categories (e.g. colour).
  o Numerical: variables collected quantitatively through numerical quantities.
      Discrete: values achieved as integers from a counting process.
      Continuous: values achieved from a measuring process; they can be assigned any value within a given interval.
      Sometimes discrete variables with MANY outcomes are treated like continuous variables, e.g. prices.

2. Measurement Scales For Variables:

Nominal: classifies values into categories that do not have a distinct ranking or value, e.g. gender. (The lowest level of measurement.)
Ordinal: classifies values into distinct categories that indicate a ranking, e.g. excellent, very good, fair.
Interval: an ordered scale for numerical data where the difference between values is meaningful but there is no true zero, e.g. temperature in Celsius: zero degrees does not indicate that there is no heat.
Ratio: an ordered scale where the difference between values is meaningful and there is a true zero, e.g. $0 means there is no money. (The highest level of measurement.)
Usage Potential Of Various Levels Of Data:
  o The levels are nested: nominal within ordinal, within interval, within ratio. Each higher level supports everything the levels below it support.
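
The nesting of the four measurement scales can be sketched in code. The following is a minimal Python illustration and is not part of the original notes. The operation names attached to each scale (count, mode, median and so on) follow a common textbook convention and are an assumption here, as are the names `SCALES`, `NEW_OPERATIONS` and `allowed_operations`.

```python
# Hypothetical sketch: which summary operations are meaningful at each
# measurement scale. The operation lists are an illustrative assumption
# (a common textbook convention), not taken from the lecture notes.

SCALES = ["nominal", "ordinal", "interval", "ratio"]  # lowest to highest

# Operations that first become meaningful at each scale.
NEW_OPERATIONS = {
    "nominal": {"count", "mode"},
    "ordinal": {"rank", "median"},
    "interval": {"difference", "mean"},
    "ratio": {"ratio"},
}


def allowed_operations(scale):
    """Return every operation meaningful at `scale`.

    Scales are nested, so a scale inherits all operations of the
    scales below it.
    """
    level = SCALES.index(scale)
    operations = set()
    for name in SCALES[: level + 1]:
        operations |= NEW_OPERATIONS[name]
    return operations


if __name__ == "__main__":
    for scale in SCALES:
        print(scale, sorted(allowed_operations(scale)))
```

For example, Celsius temperature is interval data: the difference between 20 degrees and 10 degrees is meaningful, but 20 degrees is not "twice as hot" as 10 degrees. Money is ratio data, so $20 really is twice $10.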

3. Collecting Data:

Once variables have been defined, data can be collected. This step is critical: data collected in a flawed, biased or ambiguous manner will lead to inaccurate results.

Sources of Data:
Primary Sources:
  o The analyst collects the data, e.g. survey, experiment.
Secondary Sources:
  o The analyst uses data already collected, e.g. statistical resources, databases.
Types Of Data:
  o Data distributed by an organization, e.g. stock prices, sports stats.
  o Data from designed experiment outcomes, e.g. market testing, quality control.
  o Data from survey responses, e.g. political polls, internet polls.
  o Data from observational studies, e.g. time taken for service, volume of traffic.
  o Data from ongoing business activities / automated and streaming data, e.g. mobile phone data use, GPS data, electronic monitors, social media feeds.

Population v. Sample:
Population: measures used to describe the population are called parameters.
Sample: measures used to describe the sample are called statistics.

Data Formatting:
Structured: data that follows some organising principle or plan, typically a repeating pattern, e.g. a stock ticker, Excel-type tables.
Unstructured: data that follows no repeating pattern, is not storable in an Excel file and is stored in many different locations, e.g. audio files.
  o Such data needs a lot of preparation and cleaning before any analysis can be done.
  o Cleaning: identify and remove errors, flag strange data (possibly outliers), fill in missing data.
Electronic formats: data that is found on a computer or electronic device.

4. Types Of Sampling Methods

Why Sample?
  Often we can't get the whole population.
  Collecting information from a sample is less time-consuming and less costly than selecting every item in the population (a census).
  An analysis of a sample is often less cumbersome and more practical than an analysis of the entire population.

Sampling begins with a sampling frame: a list of items that are in the population and can be sampled.
  o Includes: population lists, directories, customer databases, social media users, maps etc.
  o Inaccurate or biased results can result if parts of the population are excluded.

Types Of Sampling

Non-Probability Sampling:
  o Items or individuals are selected without knowing their probabilities of selection. The advantages are speed, convenience and low cost; however, these samples cannot be used for statistical inference, which offsets the advantages.
      Convenience sample: selected based only on being easy, inexpensive and quick to sample.
      Judgement sample: perceived experts or most appropriate items are selected, by convenience.
      Self-selected: individuals choose to participate.
      Quota sample: pre-set quotas of groups are chosen, by convenience.
Probability Sampling:
  o Items or individuals are selected based on known probabilities.
  o Simple Random Sample (SRS):
      Every item in the frame has the same probability of being selected.
      Often obtained with the help of a random number generator or via software.
      Note: sampling can be with or without replacement; make sure N changes accordingly (without replacement, the number of items left in the frame falls after each draw).
  o Systematic Sample:
      You divide the frame of N items into n groups of k items each, where k = N/n.
      The first item is randomly selected from the first group, and every kth item thereafter is selected.
  o Stratified Sample:
      You divide the frame of N items into strata according to an important characteristic and conduct an SRS in each stratum, with sample sizes proportional to the size of each stratum. The samples are then combined.
      A common technique when sampling a population of voters, stratifying across racial, socio-economic or other variables that are important to include.
      More efficient than either simple random sampling or systematic sampling because you are ensured of the representation of items across the entire population.
  o Cluster Sample:
      The population is divided into several clusters, each containing several items and each representative of the population.
      An SRS of clusters is selected.
      All items in the selected clusters can be used, or items can be chosen from a cluster using another probability sampling technique.
      A common application of cluster sampling involves election polls, where certain election districts are selected and fully sampled.
      Cluster sampling is usually more cost effective than SRS, but it requires a larger sample size to produce results as precise as SRS or stratified sampling.
  o Comparing Sampling Methods:
      SRS and Systematic Sampling: simple, cheap to use, effective against many types of bias. But they may not give the best representation of the population's underlying characteristics. Why?
      Stratified Sampling: ensures representation of individuals across the entire population, possibly in the right proportions. Effective against bias. Most efficient method, but costly. Why?
      Cluster Sampling: quite cost effective. Can be less efficient (needs large samples to be able to provide significant insight).

5. Types Of Survey Errors:

  o Coverage error or selection bias:
      Exists if some groups are excluded from the frame and have (little or) no chance of being selected (Truman election).
  o Non-response error or non-response bias:
      Occurs when people choose not to respond; it is problematic when those who respond and those who do not come from different groups.
  o Sampling Error:
      Reflects the variation of data from sample to sample. It always exists unless the sample is the whole population (n = N).
  o Measurement Error:
      Occurs if the question is ambiguous or poorly worded, if there is a design error, a respondent error or if the interviewer has a particular effect.
      Prevent unethical practices: questions that lead in a certain direction, an interviewer's manner that pushes a response, respondents who willfully provide false information.
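
The probability sampling methods from section 4, and the sampling error described above, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method prescribed by the notes. The customer frame, the store labels "A" and "B" and all function names are hypothetical. The systematic sample uses k = N // n (integer division) and assumes n is no larger than N. The stratified sample rounds each stratum's share, so the total can differ slightly from n when the proportions do not divide evenly.

```python
import random
from statistics import mean


def simple_random_sample(frame, n, rng):
    """SRS without replacement: every item has the same chance of selection."""
    return rng.sample(frame, n)


def systematic_sample(frame, n, rng):
    """Pick a random start within the first group of k = N // n items,
    then take every kth item thereafter. Assumes n <= len(frame)."""
    k = len(frame) // n
    start = rng.randrange(k)
    return [frame[start + i * k] for i in range(n)]


def stratified_sample(frame, n, stratum_of, rng):
    """Run an SRS inside each stratum, sized in proportion to the stratum."""
    strata = {}
    for item in frame:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = []
    for members in strata.values():
        size = round(n * len(members) / len(frame))  # proportional share
        sample.extend(rng.sample(members, size))
    return sample


if __name__ == "__main__":
    rng = random.Random(1020)  # fixed seed so the run is repeatable

    # Hypothetical frame: 1000 customers, each (store, amount spent).
    frame = [("A" if i < 600 else "B", rng.uniform(10, 100))
             for i in range(1000)]

    # The parameter: the population average amount spent.
    print("population mean:", mean(spend for _, spend in frame))

    # Statistics: sample averages, each an estimate of the parameter.
    samples = {
        "SRS": simple_random_sample(frame, 50, rng),
        "systematic": systematic_sample(frame, 50, rng),
        "stratified": stratified_sample(frame, 50, lambda c: c[0], rng),
    }
    for name, sample in samples.items():
        print(name, "sample mean:", mean(spend for _, spend in sample))
```

Each sample mean differs a little from the population mean and from the other sample means. That sample-to-sample variation is the sampling error, and it disappears only when the sample is the whole frame.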

Week 2: Organising And Visualising Data

Data is organized and visualized so as to reveal, gain insight from and communicate the information, especially the main features and patterns, that are hidden within it.

1. Organising Categorical Data:

Summary Table:
A summary table tallies the values as frequencies or percentages for each category. This allows you to see:
  o The relative frequency of each category.
  o The differences between the categories.
E.g. Fidelity Investments was going to cut a group of customers but found they were the most loyal and profitable customers.

2. Visualising Categorical Data:

The Bar Chart:
  o A bar chart visualizes a categorical variable as a series of bars, with each bar representing the tallies for a single category.
  o The length of each bar represents either the frequency or percentage of values for a category, and each bar is separated by space, called a gap.
  o Bar charts can be horizontal or vertical.
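
A summary table and a horizontal bar chart can be sketched with the Python standard library alone. This is a minimal illustration and not part of the original notes. The survey responses, the function names `summary_table` and `text_bar_chart`, and the choice of a text-based chart instead of a plotting library are all assumptions made for the example.

```python
from collections import Counter


def summary_table(values):
    """Tally each category as (category, frequency, percentage),
    most frequent category first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(category, freq, 100 * freq / total)
            for category, freq in counts.most_common()]


def text_bar_chart(table, width=40):
    """Render one horizontal bar per category; bar length is
    proportional to the category's percentage."""
    lines = []
    for category, freq, pct in table:
        bar = "#" * round(width * pct / 100)
        lines.append(f"{category:<10} {bar} {freq} ({pct:.1f}%)")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical categorical data: how ten customers placed an order.
    responses = ["online"] * 5 + ["in-store"] * 3 + ["phone"] * 2
    table = summary_table(responses)
    print(text_bar_chart(table))
```

Sorting the categories by frequency is a design choice here; it makes the differences between categories easier to read but is not required for a bar chart.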